archived/skills/training-set-builder/SKILL.md
Extract structured training examples from document sets to build datasets for teaching LLMs specific tasks or styles. Use when processing review documents, feedback annotations, or revision histories.
Extract structured training examples (source → feedback → revised → context) from document sets to build datasets for teaching LLMs specific tasks or styles.
[[AXIOMS.md]]
Process document collections to create training data that captures:
Each training example should contain:
{
"example_id": "unique_identifier",
"source": {
"text": "Original content before feedback/revision",
"location": "Section/page reference (if applicable)",
"metadata": {
"document": "Source document name/title",
"type": "document type (article, grant, code, etc.)",
"additional_context": "Any relevant contextual information"
}
},
"feedback": {
"comment": "The specific feedback, critique, or intervention",
"type": "Category of feedback (structural, substantive, clarity, etc.)",
"action": "What action is being recommended (revise, add, remove, etc.)"
},
"revised": {
"text": "Improved version after applying feedback (if available)",
"location": "Where the revision appears (if applicable)"
},
"context": {
"pattern_type": "What pattern should the LLM learn to recognize?",
"teaching_point": "What is this example teaching?",
"scope": "How broad is this feedback (specific, section-level, document-level)"
}
}
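A minimal sketch of how the structure above might be checked in Python before an example is stored. The field names come from the structure itself; the helper name `missing_fields` and the `REQUIRED_FIELDS` table are illustrative assumptions, not part of the skill.

```python
# Hypothetical helper: names here are illustrative, not part of the skill.
# The field names mirror the example structure shown above.
REQUIRED_FIELDS = {
    "source": ("text", "location", "metadata"),
    "feedback": ("comment", "type", "action"),
    "revised": ("text", "location"),
    "context": ("pattern_type", "teaching_point", "scope"),
}


def missing_fields(example: dict) -> list:
    """Return dotted paths of fields that are absent from a training example.

    A key whose value is None counts as present, since "revised" may
    legitimately be null when no revision is available.
    """
    missing = []
    if "example_id" not in example:
        missing.append("example_id")
    for section, keys in REQUIRED_FIELDS.items():
        block = example.get(section)
        if not isinstance(block, dict):
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in block)
    return missing
```

An empty list means every expected field is present; it says nothing about whether the content is training-worthy.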
The user will provide documents and explain their structure. Common patterns include:
Your job: Extract training-worthy patterns from whatever structure the user provides.
A high-quality training example should:
✓ Show a clear before/after or problem/solution pattern
✓ Be specific enough to teach pattern recognition
✓ Include sufficient context for learning
✓ Represent a replicable skill or judgment
✓ Capture the style/approach being taught
❌ Generic comments with no actionable content ("Good work")
❌ Administrative/procedural comments ("Due date extended")
❌ Context-free feedback that can't teach a pattern
⚠️ Ambiguous examples (include but flag with "quality": "ambiguous")
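The exclusion rules above are judgment calls, but a crude first pass can be sketched in Python. The phrase lists and the function name `triage` are assumptions made for illustration; anything that survives the filter still needs a proper quality review.

```python
from typing import Optional

# Illustrative first-pass triage. The phrase lists are assumptions,
# not part of the skill, and would need tuning per document set.
GENERIC_COMMENTS = {"good work", "nice", "looks good", "well done"}
ADMIN_MARKERS = ("due date", "deadline")


def triage(comment: str, source_text: Optional[str]) -> str:
    """Return 'exclude' or 'review' for a single feedback comment."""
    normalized = comment.strip().lower().rstrip(".!")
    if normalized in GENERIC_COMMENTS:
        return "exclude"  # generic, no actionable content
    if any(marker in normalized for marker in ADMIN_MARKERS):
        return "exclude"  # administrative or procedural
    if not source_text:
        return "exclude"  # context-free: nothing to anchor a pattern to
    return "review"  # quality still has to be judged by a reader
```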
When the user provides documents, first:
Understand the structure
Identify training opportunities
Clarify with the user
For each extractable training example:
Identify the source material
Capture the feedback/intervention
Document the revision (if available)
Extract the pattern
Provide appropriate categorization for the examples:
Type categories (adapt to the domain):
Action categories:
Scope categories:
data/training-examples/
├── {collection_name}/
│ ├── collection_summary.md # Overview of the document set
│ ├── extracted_examples.json # All extracted examples as structured data
│ └── training_examples.jsonl # One example per line (for easy loading)
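One way to create this layout from Python is sketched below. The function name `collection_paths` and the collection name `peer_review` are hypothetical; the directory and file names are taken from the tree above.

```python
import tempfile
from pathlib import Path


def collection_paths(root: Path, collection_name: str) -> dict:
    """Create data/training-examples/<collection_name>/ under root.

    Returns the paths of the three output files; the files themselves
    are not created here.
    """
    base = root / "data" / "training-examples" / collection_name
    base.mkdir(parents=True, exist_ok=True)
    return {
        "summary": base / "collection_summary.md",
        "examples": base / "extracted_examples.json",
        "jsonl": base / "training_examples.jsonl",
    }


# Usage: build the layout under a throwaway directory.
demo_root = Path(tempfile.mkdtemp())
demo_paths = collection_paths(demo_root, "peer_review")  # hypothetical name
```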
# Training Data Collection: {Name}
**Purpose**: {What is this dataset teaching?}
**Source**: {Where did these documents come from?}
**Example Count**: {Number of extracted examples}
## Document Overview
- Document 1: {description}
- Document 2: {description}
...
## Training Focus
{1-2 paragraphs describing what patterns/skills this dataset teaches}
## Extraction Notes
{Any important context about the extraction process, ambiguities, or decisions made}
{
"collection_name": "descriptive_name",
"purpose": "What this dataset teaches",
"extraction_date": "2025-01-18",
"examples": [
{
"example_id": "collection_001",
"source": {...},
"feedback": {...},
"revised": {...},
"context": {...},
"quality": "high | medium | low | ambiguous"
},
...
],
"metadata": {
"total_examples": 42,
"document_count": 5,
"extractor_notes": "Any relevant notes"
}
}
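A sketch of assembling this wrapper in Python so the counts cannot drift from the examples list. The function name `build_collection` is hypothetical, and deriving `document_count` from the distinct `source.metadata.document` values is an assumption; if some source documents yield no examples, pass the count explicitly instead.

```python
from datetime import date


def build_collection(name: str, purpose: str, examples: list, notes: str = "") -> dict:
    """Wrap extracted examples in the collection structure shown above."""
    # Assumption: count documents by the distinct names recorded on examples.
    documents = {
        ex.get("source", {}).get("metadata", {}).get("document") for ex in examples
    }
    documents.discard(None)
    return {
        "collection_name": name,
        "purpose": purpose,
        "extraction_date": date.today().isoformat(),  # YYYY-MM-DD
        "examples": examples,
        "metadata": {
            "total_examples": len(examples),
            "document_count": len(documents),
            "extractor_notes": notes,
        },
    }
```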
One JSON object per line for efficient loading:
{"example_id": "collection_001", "source": {...}, "feedback": {...}, "revised": {...}, "context": {...}}
{"example_id": "collection_002", "source": {...}, "feedback": {...}, "revised": {...}, "context": {...}}
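Writing and reading this format needs only the standard library. The sketch below uses hypothetical helper names (`dump_jsonl`, `load_jsonl`) and works on any text stream, so it can be pointed at an open file or an in-memory buffer.

```python
import io
import json


def dump_jsonl(examples: list, stream) -> None:
    """Write one compact JSON object per line."""
    for example in examples:
        # json.dumps escapes newlines inside strings, so each example
        # stays on a single physical line.
        stream.write(json.dumps(example, ensure_ascii=False) + "\n")


def load_jsonl(stream) -> list:
    """Read examples back, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


# Usage with an in-memory buffer; open(path, "w", encoding="utf-8") works the same way.
buffer = io.StringIO()
dump_jsonl([{"example_id": "collection_001"}, {"example_id": "collection_002"}], buffer)
buffer.seek(0)
loaded = load_jsonl(buffer)
```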
If feedback doesn't clearly map to source:
Flag with "quality": "ambiguous"
Note the uncertainty in context.teaching_point
If multiple interpretations possible:
If source/revised text is unclear:
If no clear revision available:
Set revised to null or empty
✓ Clear relationship between source/feedback/revision
✓ Specific enough to teach pattern recognition
✓ Representative of the target style/skill
✓ Sufficient context for learning
✓ Well-categorized and contextualized
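The two edge cases with recoverable instructions (ambiguous mapping and missing revision) might be handled as in the sketch below. Both function names are hypothetical, and appending the ambiguity note to `context.teaching_point` is an assumption about how that field is meant to be used.

```python
import copy


def flag_ambiguous(example: dict, note: str) -> dict:
    """Return a copy flagged as ambiguous, with the reason recorded."""
    flagged = copy.deepcopy(example)
    flagged["quality"] = "ambiguous"
    context = flagged.setdefault("context", {})
    existing = context.get("teaching_point") or ""
    # Assumption: the note is appended rather than replacing the teaching point.
    context["teaching_point"] = f"{existing} [Ambiguity: {note}]".strip()
    return flagged


def clear_revision(example: dict) -> dict:
    """Return a copy whose revised block is explicitly null."""
    cleared = copy.deepcopy(example)
    cleared["revised"] = {"text": None, "location": None}
    return cleared
```

Both helpers return copies, so the original extraction stays available for comparison.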
When the user provides documents:
Discovery
Extraction
Formatting
Validation
Output
Before confirming extraction complete:
Completeness
Quality
Format
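Part of this checklist can be automated. The sketch below covers only mechanical checks (unique IDs, matching counts, known quality labels); which checks belong under each heading is an assumption, and the function name `collection_problems` is hypothetical.

```python
from collections import Counter

ALLOWED_QUALITY = {"high", "medium", "low", "ambiguous"}


def collection_problems(collection: dict) -> list:
    """Return human-readable problems found in a collection dict."""
    examples = collection.get("examples", [])
    problems = []

    # Every example needs a unique example_id.
    ids = Counter(ex.get("example_id") for ex in examples)
    for example_id, count in ids.items():
        if example_id is None:
            problems.append(f"{count} example(s) without example_id")
        elif count > 1:
            problems.append(f"duplicate example_id: {example_id}")

    # The declared total should match the examples actually present.
    declared = collection.get("metadata", {}).get("total_examples")
    if declared != len(examples):
        problems.append(f"total_examples is {declared}, found {len(examples)}")

    # Quality labels should come from the documented set.
    for ex in examples:
        if ex.get("quality") not in ALLOWED_QUALITY:
            problems.append(
                f"{ex.get('example_id')}: unexpected quality {ex.get('quality')!r}"
            )
    return problems
```

An empty result only means the file is internally consistent; the quality criteria still need a human or model reading of each example.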
User provides: Review comment on a paper
This quote is repeated on p2 and 14: "high-profile investigators
receive[] accolades for their valuable work, while the people who
made that work possible remain unacknowledged" (Rahman 2020).
Extracted example:
{
"example_id": "example_001",
"source": {
"text": "high-profile investigators receive[] accolades for their valuable work, while the people who made that work possible remain unacknowledged (Rahman 2020)",
"location": "p2, p14",
"metadata": {
"document": "Power and Humility in OSI",
"type": "journal article",
"section": "multiple"
}
},
"feedback": {
"comment": "This quote is repeated on p2 and 14",
"type": "stylistic",
"action": "remove"
},
"revised": {
"text": "[Quote appears only once in revised version]",
"location": "p2"
},
"context": {
"pattern_type": "Detect duplicate content across document",
"teaching_point": "LLM should identify repeated quotes/text and flag for removal",
"scope": "specific"
}
}
User provides: Review comment on introduction
The introduction almost seems to make a broader claim about power
in tech, but doesn't quite make the critique explicit enough and
comes off a little disjoint as a result.
Extracted example:
{
"example_id": "example_002",
"source": {
"text": "[Introduction section discussing power in tech and OSI without explicit connection]",
"location": "Introduction",
"metadata": {
"document": "Power and Humility in OSI",
"type": "journal article",
"section": "Introduction"
}
},
"feedback": {
"comment": "The introduction almost seems to make a broader claim about power in tech, but doesn't quite make the critique explicit enough and comes off a little disjoint as a result.",
"type": "clarity",
"action": "clarify"
},
"revised": {
"text": null,
"location": null
},
"context": {
"pattern_type": "Identify implied arguments needing explicit articulation",
"teaching_point": "Recognize when connections between broader claims and specific context are insufficiently explicit, particularly in introductions",
"scope": "section"
}
}
Cannot extract text from document:
Ambiguous structure:
No clear training value: