archived/skills/ground-truth/SKILL.md
Establish and refine ground truth labels for evaluation datasets. Use when creating, reviewing, or updating labels for any judgment/reasoning task.
npx skillsauth add nicsuzor/academicops ground-truth
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Establish rigorous, defensible ground truth labels for evaluation datasets. Ensures labels derive from authoritative sources rather than intuition, and documents reasoning for reproducibility and auditability.
Invoke when:
Ground truth derives from explicit guidelines, not intuition.
When labeling, the answer must come from the established criteria themselves, not from general judgment about what "should" be the case.
❌ "This seems like good journalism, so it shouldn't be flagged"
✅ "Guideline X permits quoting harmful language when [condition]. This article meets that condition."
Before labeling any record, load and review the authoritative criteria:
For each record:
Label structure:
    ground_truth:
      violating: true/false
      reasons:
        - Primary reason with guideline reference
        - 'OPTIONAL: Secondary observation that scorers need not require'
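
The label structure above can be checked mechanically before a record enters the dataset. The following is a minimal sketch, not part of the skill itself: the names `GroundTruth` and `validate_label` are hypothetical, and the rule that every label needs at least one non-OPTIONAL reason is an assumption inferred from the "Primary reason" line in the structure.

```python
from dataclasses import dataclass, field

OPTIONAL_PREFIX = "OPTIONAL:"


@dataclass
class GroundTruth:
    """One label: a boolean verdict plus the reasons that justify it."""
    violating: bool
    reasons: list[str] = field(default_factory=list)


def validate_label(label: GroundTruth) -> list[str]:
    """Return a list of problems; an empty list means the label is well-formed."""
    problems = []
    if not isinstance(label.violating, bool):
        problems.append("violating must be true or false")
    if any(not r.strip() for r in label.reasons):
        problems.append("reasons must not be empty strings")
    # Assumption: a label needs at least one primary (non-OPTIONAL) reason.
    required = [
        r for r in label.reasons
        if r.strip() and not r.strip().startswith(OPTIONAL_PREFIX)
    ]
    if not required:
        problems.append("at least one non-OPTIONAL reason is needed")
    return problems


if __name__ == "__main__":
    label = GroundTruth(
        violating=False,
        reasons=["Guideline X permits quoting harmful language under the stated condition."],
    )
    print(validate_label(label))
```

Returning a list of problems rather than raising on the first one lets a reviewer see everything wrong with a record in a single pass.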
Reason categories:
High disagreement signals:
When encountering genuine ambiguity, document it rather than forcing a label.
Prefix with "OPTIONAL:" for secondary observations:
Example:
    reasons:
      - Article provides critical framing and therefore DOES NOT VIOLATE quote attribution rules.
      - 'OPTIONAL: Uses "activists" - guidelines discourage this when implying negative connotations, but usage here is neutral.'
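
A scorer consuming these labels needs to separate the reasons a prediction must match from the secondary observations it may ignore. The sketch below shows one way to do that, assuming the "OPTIONAL:" prefix is the only marker used; `split_reasons` is a hypothetical helper, not something the skill defines.

```python
OPTIONAL_PREFIX = "OPTIONAL:"


def split_reasons(reasons):
    """Partition reasons into (required, optional) using the OPTIONAL: prefix."""
    required, optional = [], []
    for reason in reasons:
        text = reason.strip()
        if text.startswith(OPTIONAL_PREFIX):
            # Drop the marker so only the observation itself remains.
            optional.append(text[len(OPTIONAL_PREFIX):].strip())
        else:
            required.append(text)
    return required, optional


if __name__ == "__main__":
    required, optional = split_reasons([
        "Article provides critical framing and therefore DOES NOT VIOLATE quote attribution rules.",
        'OPTIONAL: Uses "activists" - guidelines discourage this when implying '
        "negative connotations, but usage here is neutral.",
    ])
    print("required:", required)
    print("optional:", optional)
```

Keeping the two lists separate means a scorer can require coverage of the first and treat the second as informational only.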
| Pitfall                    | Correction                               |
| -------------------------- | ---------------------------------------- |
| Labeling by intuition      | Find explicit guideline provision        |
| Assuming guidelines agree  | Check each criterion separately          |
| Over-strict interpretation | Guidelines often permit with conditions  |
| Ignoring context           | Most guidelines consider framing/purpose |
| Binary thinking            | Use OPTIONAL for nuanced observations    |
When refining labels:
For each labeling decision, provide: