skills/flyaoc-evaluating-agentic-ontology/SKILL.md
Build multi-agent systems for end-to-end ontology curation from scientific literature. Applies FlyAOC's agent architecture patterns—memorization, pipeline, single-agent, and multi-agent—to extract structured, ontology-grounded annotations from document corpora. Use when asked to: 'curate knowledge from papers into a structured ontology', 'build an agent pipeline for scientific literature extraction', 'design a multi-agent system for document annotation', 'extract Gene Ontology or controlled-vocabulary terms from text', 'reconcile evidence across multiple documents into structured annotations', 'build a retrieval-augmented scientific reasoning system'.
npx skillsauth add ndpvt-web/arxiv-claude-skills flyaoc-evaluating-agentic-ontology
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to design and implement multi-agent systems that curate structured, ontology-grounded annotations from large document corpora. Based on the FlyAOC benchmark (FlyBench), it applies the finding that multi-agent architectures with delegated paper analysis and bounded orchestrator context consistently outperform monolithic pipelines and single-agent designs for scientific knowledge extraction. The core pattern—search, read, reconcile evidence, ground to controlled vocabularies—generalizes beyond Drosophila genetics to any domain requiring structured curation from unstructured text.
FlyAOC evaluates four agent architectures for end-to-end ontology curation, where an agent receives only an entity name (e.g., a gene symbol) and must search a corpus of thousands of papers, read relevant documents, and produce structured annotations grounded in controlled vocabularies.
The critical architectural insight is that multi-agent designs—where an orchestrator delegates individual paper analysis to specialized subagents and never sees raw paper text—avoid the context overflow problem that degrades single-agent performance. Each subagent processes one document in isolation and returns pre-resolved annotations with ontology IDs. The orchestrator aggregates, deduplicates, and ranks results while keeping its context window bounded. This yields consistent performance scaling: more papers processed means better results, unlike single-agent designs where performance peaks then declines as context fills.
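The delegation pattern described above can be sketched roughly as follows. This is a minimal illustration, not FlyAOC's implementation: `curate_entity`, `search_corpus`, and `analyze_paper` are hypothetical names, and `analyze_paper` stands in for an LLM subagent that reads one paper and returns only compact annotation records.

```python
from collections import Counter
from typing import Callable


def curate_entity(
    entity: str,
    search_corpus: Callable[[str, int], list[str]],
    analyze_paper: Callable[[str, str], list[dict]],
    budget: int = 16,
) -> list[dict]:
    """Orchestrate one entity; the orchestrator never holds raw paper text."""
    evidence_counts: Counter = Counter()
    names: dict[str, str] = {}
    for doc_id in search_corpus(entity, budget):
        # The subagent reads the paper in isolation. Only small,
        # pre-resolved records (ontology ID + name) come back.
        terms_in_paper = set()
        for ann in analyze_paper(entity, doc_id):
            names[ann["go_id"]] = ann["go_name"]
            terms_in_paper.add(ann["go_id"])
        # Count each term at most once per paper.
        evidence_counts.update(terms_in_paper)
    return [
        {"go_id": go_id, "go_name": names[go_id], "papers": count}
        for go_id, count in evidence_counts.most_common()
    ]
```

Because the orchestrator's state grows with the number of distinct terms rather than with the total text read, adding papers does not inflate its context in this sketch.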
The retrieval-confirmation finding is equally important for system design: agents primarily use retrieval to confirm parametric knowledge rather than discover genuinely new information. This means the retrieval component should be designed to surface both confirmatory and contradictory evidence. For domains where terminology is specialized or historical (expression patterns, legacy synonyms), retrieval provides the largest gains over memorization. For well-represented domains (standard Gene Ontology terms), retrieval gains are marginal. Design your retrieval budget accordingly—allocate more retrieval effort to rare, domain-specific annotation types.
Define the annotation schema. Specify the exact structured output: what entity types, relationship types, and controlled vocabulary identifiers the system must produce. Use JSON schemas with required fields for term ID, term name, evidence text, and source document. Example: { "go_id": "GO:0007391", "go_name": "dorsal closure", "aspect": "biological_process", "evidence": "...", "paper_id": "PMC123456" }.
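One way to enforce such a schema in code is a small validated record type. The sketch below is an assumption-laden illustration: the `Annotation` class and `parse_annotation` helper are hypothetical names, and the checks cover only structure (ID shape, known aspect, non-empty evidence), not whether the ID exists in the ontology.

```python
import re
from dataclasses import dataclass

GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")
ASPECTS = {"biological_process", "molecular_function", "cellular_component"}


@dataclass(frozen=True)
class Annotation:
    go_id: str
    go_name: str
    aspect: str
    evidence: str
    paper_id: str

    def __post_init__(self):
        # Reject structurally invalid records before they reach the orchestrator.
        if not GO_ID_PATTERN.match(self.go_id):
            raise ValueError(f"malformed GO ID: {self.go_id!r}")
        if self.aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {self.aspect!r}")
        if not self.evidence.strip():
            raise ValueError("evidence text must not be empty")


def parse_annotation(record: dict) -> Annotation:
    """Build an Annotation from a dict, failing loudly on missing fields."""
    required = ("go_id", "go_name", "aspect", "evidence", "paper_id")
    missing = [key for key in required if key not in record]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return Annotation(**{key: record[key] for key in required})
```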
Build or connect to the controlled vocabulary index. Load the target ontology (Gene Ontology OBO files, MeSH XML, or custom vocabulary) into a searchable index. Implement two tools: search_ontology(query) -> [(term_id, term_name, similarity)] for free-text lookup and validate_term(term_id) -> bool for ID verification. This prevents hallucinated ontology IDs.
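The two tools might look like the sketch below. It is deliberately simplified: the `ONTOLOGY` dict is a hand-written stand-in for a parsed ontology file, and fuzzy matching uses the standard library's `difflib.SequenceMatcher` rather than a real search index. The term IDs are illustrative and should come from the loaded ontology in practice.

```python
from difflib import SequenceMatcher

# Toy vocabulary standing in for a parsed ontology (illustrative subset).
ONTOLOGY = {
    "GO:0007391": "dorsal closure",
    "GO:0007444": "imaginal disc development",
    "GO:0005125": "cytokine activity",
}


def search_ontology(query: str, top_k: int = 3) -> list[tuple[str, str, float]]:
    """Return (term_id, term_name, similarity) tuples, best match first."""
    q = query.lower().strip()
    scored = [
        (term_id, name, SequenceMatcher(None, q, name.lower()).ratio())
        for term_id, name in ONTOLOGY.items()
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_k]


def validate_term(term_id: str) -> bool:
    """True only if the ID exists in the loaded ontology."""
    return term_id in ONTOLOGY
```

Keeping `validate_term` as a plain membership test makes it cheap enough to call on every annotation a subagent returns.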
Set up BM25 corpus search. Index your document corpus with BM25 (using rank-bm25 or whoosh). Implement search_corpus(query, top_k) -> [doc_ids] returning ranked document identifiers. Support both entity-name queries and refined queries combining entity names with functional keywords.
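To make the ranking behavior concrete, here is a dependency-free sketch of Okapi BM25 scoring behind a `search_corpus` method. It is a hand-rolled illustration with whitespace tokenization, not the API of rank-bm25 or whoosh; the `BM25Index` class name and the `k1`/`b` defaults are assumptions, and it expects a non-empty corpus.

```python
import math
from collections import Counter


class BM25Index:
    def __init__(self, docs: dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_ids = list(docs)
        self.term_counts = [Counter(docs[d].lower().split()) for d in self.doc_ids]
        self.doc_lens = [sum(c.values()) for c in self.term_counts]
        self.avg_len = sum(self.doc_lens) / len(self.doc_lens)
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for c in self.term_counts for term in c)

    def _idf(self, term: str) -> float:
        n = len(self.doc_ids)
        return math.log(1 + (n - self.df[term] + 0.5) / (self.df[term] + 0.5))

    def search_corpus(self, query: str, top_k: int = 16) -> list[str]:
        """Return up to top_k document IDs ranked by BM25 score."""
        scored = []
        for doc_id, counts, length in zip(self.doc_ids, self.term_counts, self.doc_lens):
            score = 0.0
            for term in query.lower().split():
                tf = counts[term]
                if tf == 0:
                    continue
                norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
                score += self._idf(term) * tf * (self.k1 + 1) / norm
            if score > 0:
                scored.append((score, doc_id))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc_id for _, doc_id in scored[:top_k]]
```

A refined query is just a longer string, for example the entity name plus functional keywords such as `"dpp dorsal closure"`.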
Implement the document reader tool. Create read_paper(doc_id) -> {title, abstract, sections} that returns structured full text. Partition long documents into sections (introduction, methods, results, discussion) so subagents can focus on relevant sections.
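A rough sketch of the reader follows. It assumes each section begins with its name alone on a line, which real papers often violate, and it takes the corpus as an explicit `corpus` argument (an addition to the `read_paper(doc_id)` signature above, made so the example is self-contained). The `title`/`abstract`/`body` keys of the stored records are likewise hypothetical.

```python
import re

SECTION_NAMES = ("introduction", "methods", "results", "discussion")
# Assumption: a section heading is the section name alone on its own line.
HEADING = re.compile(
    r"^[ \t]*(%s)[ \t]*$" % "|".join(SECTION_NAMES),
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(full_text: str) -> dict[str, str]:
    """Partition body text into named sections based on heading lines."""
    sections = {}
    matches = list(HEADING.finditer(full_text))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(full_text)
        sections[match.group(1).lower()] = full_text[match.end():end].strip()
    return sections


def read_paper(doc_id: str, corpus: dict[str, dict]) -> dict:
    """Return {title, abstract, sections} for one document."""
    raw = corpus[doc_id]
    return {
        "title": raw["title"],
        "abstract": raw["abstract"],
        "sections": split_sections(raw["body"]),
    }
```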
Build the multi-agent architecture with three roles:
Implement iterative ontology grounding with retry. When a subagent extracts a concept that doesn't match any ontology term exactly, it should retry with alternative phrasings—synonyms, broader terms, or natural language descriptions. This feedback loop is the key advantage over rigid pipelines. Example: if "wing disc development" fails, try "imaginal disc development" or "wing morphogenesis".
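The retry loop can be sketched as below. In a real system the subagent would generate the alternative phrasings itself; here they are passed in as a list, and `ground_with_retry` and the `min_similarity` threshold are hypothetical choices for illustration.

```python
from typing import Callable, Optional

Hit = tuple[str, str, float]  # (term_id, term_name, similarity)


def ground_with_retry(
    phrasings: list[str],
    search_ontology: Callable[[str], list[Hit]],
    validate_term: Callable[[str], bool],
    min_similarity: float = 0.9,
) -> Optional[Hit]:
    """Try each phrasing in order; return the first confident, valid hit."""
    for phrase in phrasings:
        for hit in search_ontology(phrase):
            term_id, _, similarity = hit
            if similarity >= min_similarity and validate_term(term_id):
                return hit
    # No phrasing grounded: the caller should drop or flag the concept
    # rather than emit an unverified ID.
    return None
```

Returning `None` instead of the best weak match keeps low-confidence groundings out of the orchestrator's input.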
Set a retrieval budget and allocation strategy. Based on FlyAOC findings: allocate ~16 papers per entity for good coverage. For well-known annotation types (standard functional terms), fewer papers suffice. For rare annotation types (historical synonyms, tissue-specific expression), allocate more. Track which ground-truth supporting papers are retrieved to measure discovery effectiveness.
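Tracking discovery effectiveness can be as simple as measuring what fraction of the ground-truth supporting papers appears within each budget. The helper names and the budget values below are illustrative assumptions, not settings taken from FlyAOC.

```python
def supporting_paper_recall(retrieved: list[str], supporting: set[str]) -> float:
    """Fraction of ground-truth supporting papers present in the retrieved set."""
    if not supporting:
        return 0.0
    return len(set(retrieved) & supporting) / len(supporting)


def recall_by_budget(
    ranked: list[str],
    supporting: set[str],
    budgets: tuple[int, ...] = (4, 8, 16),
) -> dict[int, float]:
    """Recall of supporting papers when only the top-k ranked papers are read."""
    return {k: supporting_paper_recall(ranked[:k], supporting) for k in budgets}
```

Comparing these curves across annotation types is one way to decide where extra retrieval budget is likely to pay off.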
Aggregate and rank predictions. The orchestrator collects annotations from all subagents, deduplicates by ontology term ID, counts supporting evidence across papers, and produces a ranked list. Assign confidence scores based on: (a) number of independent papers supporting the annotation, (b) ontology similarity to other confirmed annotations, (c) specificity of the evidence text.
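A minimal sketch of the aggregation step is shown below. It implements only factor (a), scoring each term by the share of distinct papers that support it; the confidence formula is an assumption made for illustration, and factors (b) and (c) are left out.

```python
from collections import defaultdict


def aggregate(subagent_outputs: list[list[dict]]) -> list[dict]:
    """Deduplicate annotations by term ID and rank by independent support."""
    papers_by_term: dict[str, set] = defaultdict(set)
    names: dict[str, str] = {}
    for annotations in subagent_outputs:
        for ann in annotations:
            papers_by_term[ann["go_id"]].add(ann["paper_id"])
            names.setdefault(ann["go_id"], ann["go_name"])

    all_papers = {p for papers in papers_by_term.values() for p in papers}
    ranked = [
        {
            "go_id": go_id,
            "go_name": names[go_id],
            "papers": len(papers),
            # Assumed confidence: share of contributing papers backing this term.
            "confidence": len(papers) / len(all_papers),
        }
        for go_id, papers in papers_by_term.items()
    ]
    # Most-supported terms first; ties broken by ID for stable output.
    ranked.sort(key=lambda row: (-row["papers"], row["go_id"]))
    return ranked
```

Using a set of paper IDs per term means a subagent that reports the same term twice for one paper does not inflate the evidence count.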
Evaluate using semantic similarity scoring. Don't require exact ontology term matches. Use Wang semantic similarity over the ontology DAG: recall = sum(max(sim(gt, pred) for pred in predictions) for gt in ground_truth) / |ground_truth|. This credits predictions that are ontologically close but not identical to expert annotations.
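The recall formula above can be sketched with a compact version of Wang similarity. This is a simplified illustration: it models `is_a` edges only with a single weight of 0.8, and it takes the ontology as a child-to-parents dict (`parents`, a hypothetical representation). A full implementation would also weight other relation types such as `part_of`.

```python
IS_A_WEIGHT = 0.8  # assumed weight for is_a edges; other relations are ignored


def s_values(term: str, parents: dict[str, list[str]]) -> dict[str, float]:
    """Semantic contribution of `term` and each of its ancestors to `term`."""
    values = {term: 1.0}
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for parent in parents.get(current, ()):
            contribution = IS_A_WEIGHT * values[current]
            # Keep the best contribution over all paths to this ancestor.
            if contribution > values.get(parent, 0.0):
                values[parent] = contribution
                frontier.append(parent)
    return values


def wang_similarity(a: str, b: str, parents: dict[str, list[str]]) -> float:
    sa, sb = s_values(a, parents), s_values(b, parents)
    shared = sa.keys() & sb.keys()
    if not shared:
        return 0.0
    return sum(sa[t] + sb[t] for t in shared) / (sum(sa.values()) + sum(sb.values()))


def semantic_recall(
    ground_truth: list[str], predictions: list[str], parents: dict[str, list[str]]
) -> float:
    """Mean over ground-truth terms of the best similarity to any prediction."""
    if not ground_truth or not predictions:
        return 0.0
    best = (max(wang_similarity(gt, p, parents) for p in predictions) for gt in ground_truth)
    return sum(best) / len(ground_truth)
```

With this scoring, predicting a parent term of the expert annotation earns partial credit instead of zero.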
Iterate on architecture, not model scale. FlyAOC shows diminishing returns from scaling backbone models but significant gains from architectural improvements. Invest effort in better retrieval strategies, context management, and agent coordination rather than simply upgrading to a larger LLM.
Example 1: Gene Function Curation Agent
User: Build a system that takes a gene symbol and produces Gene Ontology
annotations by searching a corpus of papers.
Approach:
1. Define output schema:
{
  "gene": "dpp",
  "annotations": [
    {
      "go_id": "GO:0007391",
      "go_name": "dorsal closure",
      "aspect": "biological_process",
      "evidence_text": "dpp signaling is required for dorsal closure...",
      "source_paper": "PMC2345678",
      "confidence": 0.92
    }
  ]
}
2. Load Gene Ontology OBO file into a search index (goatools + whoosh).
3. Index paper corpus with BM25 (rank_bm25 library).
4. Implement orchestrator that:
   - Searches corpus for "dpp" -> retrieves top 16 papers
   - Spawns one subagent per paper with prompt:
     "Read this paper about gene dpp. Extract all Gene Ontology-relevant
     statements. For each, identify the GO aspect (BP/MF/CC), find the
     best matching GO term using search_ontology(), validate with
     validate_term(), and return structured JSON."
   - Aggregates subagent outputs, deduplicates by go_id,
     ranks by evidence count.
Output:
[
  {"go_id": "GO:0007391", "go_name": "dorsal closure",
   "aspect": "BP", "papers": 4, "confidence": 0.95},
  {"go_id": "GO:0048814", "go_name": "wing morphogenesis",
   "aspect": "BP", "papers": 3, "confidence": 0.88},
  {"go_id": "GO:0005125", "go_name": "cytokine activity",
   "aspect": "MF", "papers": 2, "confidence": 0.72}
]
Example 2: Historical Synonym Extraction
User: I need to find all historical names a gene has been called across
decades of literature, linked back to the papers where each name appeared.
Approach:
1. Define synonym schema:
{"gene": "N", "synonyms": [{"name": "Notch", "papers": [...]}]}
2. Orchestrator searches corpus for the canonical gene symbol.
3. For each retrieved paper, a subagent scans for:
   - Explicit alias statements ("also known as", "formerly called")
   - Parenthetical synonyms: "gene-X (also GeneY)"
   - Nomenclature mapping tables
   - References to the gene under different names in older citations
4. Orchestrator aggregates synonyms, normalizes casing,
   deduplicates, and records the earliest paper for each synonym.
Output:
{
  "gene": "N",
  "canonical_name": "Notch",
  "synonyms": [
    {"name": "split", "first_seen": "PMC001234", "year": 1917, "count": 12},
    {"name": "notch-1", "first_seen": "PMC045678", "year": 1985, "count": 5},
    {"name": "N(spl)", "first_seen": "PMC091011", "year": 1992, "count": 3}
  ]
}
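The alias-scanning and aggregation steps of this example could be sketched as follows. The regular expression covers only a few explicit alias phrases, so it is far narrower than what an LLM subagent would catch. The input record shape (`paper_id`, `year`, `text`) and the function names are assumptions for illustration.

```python
import re

ALIAS_PATTERN = re.compile(
    r"(?:also known as|formerly called|also called)\s+([A-Za-z0-9()\-]+)",
    re.IGNORECASE,
)


def _clean(name: str) -> str:
    # Drop a closing parenthesis that belongs to the sentence, not the name.
    while name.endswith(")") and name.count(")") > name.count("("):
        name = name[:-1]
    return name


def extract_aliases(text: str) -> list[str]:
    return [_clean(m.group(1)) for m in ALIAS_PATTERN.finditer(text)]


def aggregate_synonyms(papers: list[dict]) -> list[dict]:
    """Merge aliases across papers, keeping the earliest paper for each."""
    records: dict[str, dict] = {}
    for paper in papers:
        # Deduplicate case-insensitively within a paper: count papers, not mentions.
        aliases = {alias.lower(): alias for alias in extract_aliases(paper["text"])}
        for key, alias in aliases.items():
            rec = records.setdefault(
                key,
                {"name": alias, "first_seen": paper["paper_id"],
                 "year": paper["year"], "count": 0},
            )
            rec["count"] += 1
            if paper["year"] < rec["year"]:
                rec["first_seen"], rec["year"] = paper["paper_id"], paper["year"]
    return sorted(records.values(), key=lambda r: (-r["count"], r["name"]))
```

Case-insensitive merging follows the "normalizes casing" step above. For nomenclatures where case distinguishes genes, that step would need to be relaxed.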
Example 3: Choosing the Right Agent Architecture
User: I'm building a literature extraction system. Should I use a pipeline
or a multi-agent approach?
Decision framework (from FlyAOC findings):
| Factor | Pipeline | Multi-Agent |
|-------------------------------|-------------------|-----------------------|
| Implementation complexity | Low | Medium |
| Handles ambiguous terms | No (no retry) | Yes (iterative) |
| Scales with more documents | Linear | Linear |
| Context overflow risk | None (stages) | None (delegated) |
| Ontology grounding accuracy | Low (batch, rigid)| High (retry loops) |
| Best for well-defined schemas | Yes | Overkill |
| Best for open-ended curation | No | Yes |
Recommendation:
- Use Pipeline when: annotations map 1:1 to known patterns, ontology is
  small, and you need speed over accuracy.
- Use Multi-Agent when: documents are long, ontology is large, terms are
  ambiguous, or you need evidence reconciliation across papers.
- Avoid Single-Agent when: processing >8 documents per entity (context
  degradation observed in FlyAOC beyond this threshold).
[...] validate_term() as a hard gate: reject any annotation with an invalid ID.
[...] "conflict": true flag and the supporting evidence for each.
Paper: FlyAOC: Evaluating Agentic Ontology Curation of Drosophila Scientific Knowledge Bases (Zhang et al., 2026). Focus on Section 3 (agent architectures and tool definitions), Section 5 (architecture comparison results), and Section 6 (analysis of retrieval vs. parametric knowledge).
Code: https://github.com/xingjian-zhang/flyaoc