aops-tools/skills/extract/SKILL.md
General extraction/ingestion skill that routes to specific workflows based on input type. Extracts structured information from documents, emails, reviews, feedback, and other sources.
npx skillsauth add nicsuzor/academicops extract
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Taxonomy note: This skill provides domain expertise (HOW) for extracting structured information from documents and sources. See [[aops-core/skills/remember/references/TAXONOMY.md]] for the skill/workflow distinction.
General-purpose extraction skill that intelligently routes to specialized workflows based on input type. Extracts structured information from various sources and stores it appropriately (public framework vs. private data).
Universal axioms apply (enforced by rbg). P#52 (Read-Then-Write Memory) is especially relevant.
MANDATORY before creating any new extracted content: search PKB for existing knowledge on the same subject.
mcp__pkb__search(query="[topic/person/document subject]")
This prevents duplicate memories and grounds extraction in accumulated knowledge. See [[remember]] skill's "search first" step as the model.
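The search-first rule can be sketched as a small guard. This is only an illustration: `search` is a hypothetical callable standing in for `mcp__pkb__search`, and the shape of its return value and the `action` labels are assumptions, not part of the skill.

```python
from typing import Callable, Dict, List


def search_then_extract(
    subject: str,
    search: Callable[[str], List[Dict]],
) -> Dict:
    """Look up existing knowledge before creating new extracted content.

    `search` is a hypothetical stand-in for the PKB search tool
    (mcp__pkb__search); its return shape is assumed for illustration.
    """
    existing = search(subject)
    if existing:
        # Existing knowledge found: extend it rather than create a duplicate.
        return {"action": "update", "subject": subject, "existing": existing}
    # Nothing found: a new entry can be created.
    return {"action": "create", "subject": subject, "existing": []}


def fake_search(query: str) -> List[Dict]:
    """In-memory fake backend used only for this example."""
    store = {"peer review": [{"id": "note-1", "title": "Peer review principles"}]}
    return store.get(query, [])


print(search_then_extract("peer review", fake_search)["action"])
print(search_then_extract("grant budget", fake_search)["action"])
```

Passing the search function in as a parameter keeps the guard testable without a live PKB connection.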
Provide a unified entry point for all extraction tasks:
When invoked, analyze the input and route to the appropriate workflow:
Signals:
Route to: workflows/training-data.md
Storage: $ACA_DATA/processed/review_training/
Signals:
Route to: Archive extraction logic (selective extraction, use /remember for storage)
Storage: Use Skill(skill="remember") for PKB storage
Signals:
Route to: Existing aops-core/skills/decision-extract/SKILL.md
Storage: Daily note with decision formatting
Signals:
Route to: workflows/document-knowledge.md (to be created)
Storage: Depends on content - PKB or framework docs
Signals:
/convert-to-md (alias preserved for backwards compatibility)
Route to: workflows/docs-to-md.md
Storage: Converted .md files replace originals in the same directory
- procedures/review-inline-comments.md
- review.txt + source.pdf + metadata.json
- workflows/revision-history.md (to be created)

See procedures/review-inline-comments.md for detailed procedure.
Quick summary:
CRITICAL: Training data often contains sensitive information (author names, unpublished work, specific critiques).
Sensitive data → $ACA_DATA/processed/review_training/{collection_name}/:
- extracted_examples.json (full text/feedback pairs)
- training_pairs.jsonl (machine-readable format)
- collection_summary.md (with identifying information)
Generalized patterns → Framework (public repo):
- aops-core/skills/hydrator/workflows/peer-review.md (update with principles)
- aops-core/skills/*/references/ (depersonalized examples)
High-quality extraction:
Generalization quality:
Apply selective extraction logic.
Key principle: Most archival documents have NO long-term value. Be highly selective.
Extract: Concrete outcomes, significant relationships, financial records
Skip: Newsletters, invitations, administrative routine, mass communications
Storage: Use Skill(skill="remember") with proper tags and canonical identifiers.
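The selective-extraction principle above amounts to a default-skip filter. The sketch below assumes each document has already been given a category label; the label strings and the `should_extract` helper are hypothetical and only mirror the Extract/Skip lists in this section.

```python
# Categories worth keeping, taken from the "Extract" list above.
KEEP = {"concrete outcome", "significant relationship", "financial record"}

# Categories named in the "Skip" list above (listed for documentation only).
SKIP = {"newsletter", "invitation", "administrative routine", "mass communication"}


def should_extract(category: str) -> bool:
    """Return True only for categories with long-term value.

    Anything not explicitly in KEEP is skipped, reflecting the principle
    that most archival documents have no long-term value.
    """
    return category.strip().lower() in KEEP


docs = [
    ("Grant awarded letter", "concrete outcome"),
    ("Weekly digest", "newsletter"),
    ("Unlabelled scan", "unknown"),
]
kept = [title for title, category in docs if should_extract(category)]
print(kept)
```

Defaulting to skip means an unrecognized category is dropped rather than stored, which matches the "be highly selective" guidance.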
Delegate to aops-core/skills/decision-extract/SKILL.md.
Key principle: Extract tasks requiring approval/choice that are blocking other work.
Storage: Daily note with formatted decision list for batch processing.
$ACA_DATA/processed/
Directory structure:
$ACA_DATA/processed/
├── review_training/
│ ├── {collection_name}/
│ │ ├── extracted_examples.json
│ │ ├── training_pairs.jsonl
│ │ ├── collection_summary.md
│ │ └── source_documents/
│ └── ...
├── email_archive/
│ └── ...
└── ...
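The layout above can be created programmatically. This is a minimal sketch, assuming the base directory is passed in explicitly (in practice it would come from `$ACA_DATA`); the helper names and the `text`/`feedback` field names are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def collection_dir(collection_name: str, base: Path) -> Path:
    """Create and return <base>/processed/review_training/<collection_name>/."""
    target = base / "processed" / "review_training" / collection_name
    # source_documents/ sits inside the collection, as in the tree above.
    (target / "source_documents").mkdir(parents=True, exist_ok=True)
    return target


def write_training_pairs(target: Path, pairs: List[Dict]) -> Path:
    """Write one JSON object per line to training_pairs.jsonl."""
    out = target / "training_pairs.jsonl"
    with out.open("w", encoding="utf-8") as fh:
        for pair in pairs:
            fh.write(json.dumps(pair, ensure_ascii=False) + "\n")
    return out


# Usage with a temporary directory standing in for $ACA_DATA.
demo_base = Path(tempfile.mkdtemp())
demo_target = collection_dir("demo_collection", demo_base)
demo_file = write_training_pairs(
    demo_target,
    [{"text": "Example passage", "feedback": "Example comment"}],  # hypothetical fields
)
print(demo_file.relative_to(demo_base))
```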
Access: This directory is:
/home/nic/brain)
When adding examples to public framework docs:
/extract vs. specialized skills
Use /extract:
Use specialized skill directly:
- /remember - When you know you want to add to knowledge base
- /decision-extract - When specifically extracting decisions
- /review-training - When processing matched review/source pairs (legacy)

Note on /convert-to-md: This trigger is now an alias for /extract. Invoking /convert-to-md routes to the workflows/docs-to-md.md workflow.
/extract → analyze input → route to:
- /remember (for archival preservation)
- /decision-extract (for pending decisions)
- training-data workflow (for LLM training data)
- document-knowledge workflow (for general extraction)
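The routing step can be pictured as a lookup table with a fallback. In this sketch the route targets come from this document, while the type keys other than `peer-review` and `archive` (which appear in the usage examples) are hypothetical labels, and mapping `peer-review` to the training-data workflow is an assumption.

```python
from typing import Optional

# Route targets are taken from this document; the keys "decisions",
# "document", and "convert" are hypothetical labels for illustration.
ROUTES = {
    "peer-review": "workflows/training-data.md",
    "archive": "/remember",
    "decisions": "/decision-extract",
    "document": "workflows/document-knowledge.md",
    "convert": "workflows/docs-to-md.md",
}

ASK_USER = "ask user to clarify extraction goal"


def route(input_type: Optional[str]) -> str:
    """Return the workflow for an input type, or ask the user when unclear."""
    if input_type is None or input_type not in ROUTES:
        # Unclear input type: do not guess a workflow.
        return ASK_USER
    return ROUTES[input_type]


print(route("archive"))
print(route(None))
```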
| Scenario | Behavior |
| ------------------------------ | -------------------------------------------------------- |
| Unclear input type | Ask user to clarify extraction goal |
| Cannot convert document format | Try alternative conversion, document failure |
| Ambiguous feedback | Flag with "quality": "ambiguous", include with caveats |
| No clear extraction value | Ask user if they want to skip or force extraction |
| Storage location unclear | Default to $ACA_DATA/processed/, confirm with user |
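Two rows of the table above, ambiguous feedback and unclear storage location, can be sketched as follows. The helper names, the `caveats` field, and the explicit `aca_data` argument are assumptions for illustration; only the `"quality": "ambiguous"` flag and the `$ACA_DATA/processed/` default come from the table.

```python
from pathlib import Path
from typing import Dict, Optional, Tuple


def flag_feedback(example: Dict, is_ambiguous: bool) -> Dict:
    """Return a copy of the example, flagged when the feedback is ambiguous."""
    flagged = dict(example)
    if is_ambiguous:
        flagged["quality"] = "ambiguous"
        # Hypothetical field holding the caveat text.
        flagged["caveats"] = "meaning unclear; included for review"
    return flagged


def resolve_storage(explicit: Optional[str], aca_data: str) -> Tuple[Path, bool]:
    """Return (storage path, needs_user_confirmation).

    With no explicit location, fall back to <aca_data>/processed and
    signal that the user should confirm the choice.
    """
    if explicit:
        return Path(explicit), False
    return Path(aca_data) / "processed", True


print(flag_feedback({"text": "Unclear margin note"}, True))
print(resolve_storage(None, "/data"))
```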
Input: DOCX file with inline comments from peer review
User: /extract /path/to/review.docx --type peer-review
Agent:
- pandoc --track-changes=all
- $ACA_DATA/processed/review_training/aoir2026/
- aops-core/skills/hydrator/workflows/peer-review.md with depersonalized principles

Input: Directory of email MSG files
User: /extract emails/2025-Q1/ --type archive
Agent:
- /remember to store in PKB

Input: Mixed documents without type specified
User: /extract documents/
Agent:
Before completing extraction:
Completeness:
Quality:
Sensitivity:
$ACA_DATA/processed/
Documentation: