skills/corpusqa-10-million-token/SKILL.md
Corpus-level QA over massive document collections using memory-augmented agentic processing. Synthesize answers that require global integration, comparison, and statistical aggregation across hundreds of documents. Use when: 'analyze all these documents and answer...', 'compare metrics across this corpus', 'aggregate statistics from these reports', 'what patterns exist across all files in...', 'summarize findings across the entire dataset', 'rank entities by computed metrics from these documents'.
npx skillsauth add ndpvt-web/arxiv-claude-skills corpusqa-10-million-token
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to answer complex analytical questions over large document corpora (hundreds to thousands of files) where evidence is dispersed across many documents and cannot be resolved by retrieving a few relevant chunks. Based on the CorpusQA framework, the core technique decouples reasoning from raw text by extracting structured schemas from each document, aggregating them into queryable tables, and executing programmatic computations — then presenting synthesized answers grounded in the full corpus. When the corpus exceeds context limits, a memory-augmented iterative processing loop replaces naive RAG.
Decoupling reasoning from text. The CorpusQA framework separates analytical reasoning from the handling of unstructured text. Instead of asking an LLM to reason directly over millions of tokens of raw prose, you first extract structured key-value schemas from each document (entity attributes such as revenue, enrollment, or square footage), aggregate those schemas into a unified table (entities as rows, attributes as columns), and then execute the analytical query against that structured representation. This makes the answer programmatically verifiable: it comes from SQL-like execution over the extracted data, not from LLM generation, so it is only as accurate as the extraction step.
Why RAG collapses. Standard retrieval-augmented generation assumes answers live in a few retrievable chunks. For corpus-level analytical queries, the relevant evidence is spread across every document in the collection. A question like "What is the median tuition across all 300 universities?" requires data from all 300 documents — no retrieval system will fetch all of them. CorpusQA experiments show RAG accuracy drops to 1-2% at scale while memory-augmented approaches retain 10-22%.
Memory-augmented agentic architecture. When the corpus exceeds context limits, process documents iteratively in chunks, maintaining a fixed-size memory buffer that accumulates extracted structured data. Each chunk is read, relevant schema fields are extracted and merged into the memory buffer, and the chunk is discarded. After all documents are processed, the final answer is computed solely from the consolidated memory. This avoids the context-window ceiling and the retrieval bottleneck simultaneously.
Inventory the corpus. List all documents, count them, estimate total token size. Determine whether the corpus fits in context or requires iterative chunked processing (threshold: if total tokens exceed available context, use the memory-augmented loop).
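The inventory step can be sketched roughly as follows. The function names, the default file patterns, and the 4-characters-per-token ratio are assumptions made for illustration, not part of the CorpusQA framework; PDFs would need a text-extraction pass before they can be counted this way.

```python
from pathlib import Path

# Assumption: ~4 characters per token is a rough heuristic for English text,
# not a real tokenizer count.
CHARS_PER_TOKEN = 4


def inventory_corpus(root, patterns=("*.md", "*.txt")):
    """Return (files, estimated_tokens) for plain-text documents under root."""
    root = Path(root)
    files = sorted({p for pat in patterns for p in root.rglob(pat) if p.is_file()})
    total_chars = sum(
        len(p.read_text(encoding="utf-8", errors="ignore")) for p in files
    )
    return files, total_chars // CHARS_PER_TOKEN


def needs_iterative_processing(estimated_tokens, context_budget):
    """True when the corpus will not fit in the available context."""
    return estimated_tokens > context_budget
```

The `context_budget` argument should leave headroom for the prompt and the answer, so it will usually be smaller than the model's nominal context window.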
Analyze the query to identify required attributes. Parse the user's question to determine what entity-level attributes are needed. For "Which companies had revenue growth above 15%?", the required attributes are: company name, revenue (current period), revenue (prior period). Write these down explicitly as the extraction schema.
Extract structured schemas from each document. For each document, extract the required key-value pairs into a JSON object. Use consistent field names across all documents. If a document lacks a required field, record it as null. Apply validation: if working with critical data, use multi-pass extraction or cross-check extracted values against the source text.
{"entity": "Acme Corp", "revenue_2024": 4200000, "revenue_2023": 3500000, "sector": "Technology"}
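A minimal sketch of the "consistent field names, null for missing" rule, applied to one extracted record. `SCHEMA` and `normalize_record` are hypothetical names; the fields mirror the sample record above.

```python
import json

# Hypothetical schema for the revenue-growth example; field names follow the
# sample record above.
SCHEMA = {
    "entity": str,
    "revenue_2024": (int, float),
    "revenue_2023": (int, float),
    "sector": str,
}


def normalize_record(raw_json, schema=SCHEMA):
    """Parse one extracted record, keep only schema fields, and set anything
    missing or of the wrong type to None. Returns (record, problems)."""
    data = json.loads(raw_json)
    record, problems = {}, []
    for field, expected in schema.items():
        value = data.get(field)
        if value is None:
            problems.append(f"missing: {field}")
        elif isinstance(value, bool) or not isinstance(value, expected):
            # bool is a subclass of int in Python, so reject it explicitly.
            problems.append(f"bad type: {field}")
            value = None
        record[field] = value
    return record, problems
```

The `problems` list gives a cheap signal for which documents deserve a second extraction pass or a manual cross-check against the source text.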
Aggregate schemas into a unified table. Merge all per-document JSON objects into a single array (or DataFrame/CSV). Each row is one entity (document), each column is an attribute. This is the structured representation of the entire corpus.
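One way to hold the aggregated table is an in-memory SQLite database from Python's standard library, which lets later steps run real SQL. This is a sketch under that assumption; `build_corpus_table` is a hypothetical helper, and a DataFrame or CSV would serve equally well.

```python
import sqlite3


def build_corpus_table(records, columns):
    """Load per-document records into an in-memory SQLite table named
    corpus_table, one row per record. Missing fields become NULL."""
    conn = sqlite3.connect(":memory:")
    col_sql = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f"CREATE TABLE corpus_table ({col_sql})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO corpus_table VALUES ({placeholders})",
        [tuple(r.get(c) for c in columns) for r in records],
    )
    return conn
```

Column names are interpolated into the SQL text, so they should come from the fixed extraction schema and never from document content.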
Translate the user's question into a computable query. Convert the natural language question into a concrete computation over the aggregated table — filtering conditions, aggregation functions (SUM, AVG, COUNT, MEDIAN), sorting/ranking, ratio calculations, or multi-step derived metrics. Express this as pseudocode or SQL.
SELECT entity, growth_pct
FROM (
  SELECT entity, (revenue_2024 - revenue_2023) * 100.0 / revenue_2023 AS growth_pct
  FROM corpus_table
) AS growth
WHERE growth_pct > 15
ORDER BY growth_pct DESC
Execute the computation. Run the query against the aggregated table. For simple queries, compute inline. For complex multi-step queries, use Python/pandas or actual SQL. This produces the ground-truth answer.
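As an illustration, the revenue-growth question can be executed with the standard-library `sqlite3` module. Everything here is a sketch: the rows other than Acme Corp are invented for the example, and the query adds an assumed guard that skips rows with a missing or zero prior-year revenue.

```python
import sqlite3

# Illustrative rows only: Acme Corp is the sample record from the extraction
# step; the other two are made up for this sketch.
rows = [
    ("Acme Corp", 4200000, 3500000),
    ("Beta LLC", 1050000, 1000000),
    ("Gamma Inc", 900000, None),  # prior-year revenue missing
]

# The computed column is defined in a subquery so the outer WHERE can filter
# on it portably; 100.0 forces floating-point division.
GROWTH_QUERY = """
SELECT entity, growth_pct
FROM (
    SELECT entity,
           (revenue_2024 - revenue_2023) * 100.0 / revenue_2023 AS growth_pct
    FROM corpus_table
    WHERE revenue_2023 IS NOT NULL AND revenue_2023 != 0
) AS growth
WHERE growth_pct > 15
ORDER BY growth_pct DESC
"""


def run_growth_query(rows):
    """Build a throwaway table from (entity, rev_2024, rev_2023) tuples and
    return the entities with revenue growth above 15%."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE corpus_table (entity TEXT, revenue_2024 REAL, revenue_2023 REAL)"
    )
    conn.executemany("INSERT INTO corpus_table VALUES (?, ?, ?)", rows)
    return conn.execute(GROWTH_QUERY).fetchall()


result = run_growth_query(rows)
print(result)  # [('Acme Corp', 20.0)]
```

Beta LLC grows by 5% and is filtered out. Gamma Inc is dropped by the guard, so it should be listed among the documents with missing fields when limitations are reported.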
Format and present the answer with provenance. Return the computed result alongside the entities and source documents that contributed to it. Include the computation logic so the user can verify.
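Handle oversized corpora with iterative memory processing. If the corpus cannot be fully loaded, process documents in batches: read each batch, extract the schema fields, merge them into the memory buffer, and discard the raw text before loading the next batch. Compute the final answer from the consolidated memory only.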
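The batch loop might look like the sketch below. `load_doc` and `extract` are hypothetical caller-supplied hooks (the second would wrap the LLM extraction call), and the buffer is a plain dict that holds one small record per document. It is bounded by the schema size per document, not strictly fixed-size, which is a simplification of the paper's memory design.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def process_corpus(doc_ids, load_doc, extract, batch_size=20):
    """Iterate over the corpus in batches, keeping only extracted records.

    load_doc(doc_id) returns the document text and extract(text) returns a
    dict or None; both are hypothetical hooks supplied by the caller.
    """
    memory = {}  # doc_id -> extracted record (the memory buffer)
    failed = []  # doc_ids where extraction returned nothing
    for batch in batched(doc_ids, batch_size):
        for doc_id in batch:
            text = load_doc(doc_id)  # raw text lives only inside this loop
            record = extract(text)
            if record is None:
                failed.append(doc_id)
            else:
                memory[doc_id] = record
    return memory, failed
```

Keying the buffer by document ID keeps provenance for free: every value in the final computation can be traced back to the file it came from.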
Validate and cross-check. For numerical answers, sanity-check results (are counts reasonable given corpus size? are percentages in valid ranges?). For ranking queries, verify that the ordering criterion was applied correctly. Flag documents where extraction confidence was low.
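A few of these checks are easy to automate. The sketch below assumes the result is a list of dicts with an `entity` key; `sanity_check` and its parameters are hypothetical names. The 0-100 range test only suits bounded rates such as admission or graduation rates, since growth percentages can legitimately be negative or exceed 100.

```python
def sanity_check(rows, corpus_size, pct_fields=()):
    """Return a list of warning strings for a list-of-dicts result table."""
    warnings = []
    if len(rows) > corpus_size:
        warnings.append(f"{len(rows)} rows exceed corpus size {corpus_size}")
    for row in rows:
        for field in pct_fields:
            value = row.get(field)
            # Bounded rates must fall in 0-100; None means "not extracted".
            if value is not None and not 0 <= value <= 100:
                warnings.append(
                    f"{row.get('entity')}: {field}={value} outside 0-100"
                )
    return warnings
```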
Report limitations transparently. State how many documents were processed vs. total, extraction success rate, and any documents that were skipped or had missing fields. If the answer depends on incomplete data, say so.
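The coverage figures can be derived mechanically from the extracted records. This is a sketch with hypothetical names; it assumes `records` maps each processed document ID to its extracted dict.

```python
def coverage_report(total_docs, records, required_fields):
    """Summarize how much of the corpus backs the answer.

    records maps doc_id -> extracted dict for documents that were processed.
    """
    incomplete = sorted(
        doc_id
        for doc_id, rec in records.items()
        if any(rec.get(f) is None for f in required_fields)
    )
    processed = len(records)
    return {
        "processed": processed,
        "total": total_docs,
        "skipped": total_docs - processed,
        "complete": processed - len(incomplete),
        "incomplete_docs": incomplete,
    }
```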
Example 1: Cross-corpus statistical comparison
User: "I have 150 university annual reports in ./reports/. What is the average six-year graduation rate for universities with admission rates below 50%, and which schools exceed that average by more than 15 percentage points?"
Approach:
Glob ./reports/*.pdf or *.md to inventory all 150 files.
Extract the schema {university, admission_rate, graduation_rate_6yr} from each report.
Filter to admission_rate < 50.
Compute AVG(graduation_rate_6yr) for the filtered set.
Select universities where graduation_rate_6yr > avg + 15.
Output:
Filtered corpus: 87 universities with admission rate < 50%
Average 6-year graduation rate (filtered): 72.4%
Threshold (avg + 15pp): 87.4%
Universities exceeding threshold:
| University | Admission Rate | 6-Year Grad Rate |
|---------------------|---------------|-------------------|
| Stanford University | 4.3% | 95.2% |
| MIT | 3.9% | 94.8% |
| Yale University | 6.1% | 93.1% |
| ... (12 more) | | |
87 of 150 reports passed the admission-rate filter (63 had admission rates >= 50%).
3 reports had missing graduation rate data and were excluded.
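The computation behind this example can be sketched in a few lines of Python. The function name and its handling of missing values are assumptions; records lacking either rate are dropped before filtering, which matches the exclusion note in the sample output.

```python
from statistics import mean


def above_average_schools(records, max_admission=50.0, margin_pp=15.0):
    """Return (average_grad_rate, standouts) for selective universities.

    records are dicts with university, admission_rate, graduation_rate_6yr.
    """
    usable = [
        r for r in records
        if r.get("admission_rate") is not None
        and r.get("graduation_rate_6yr") is not None
    ]
    selective = [r for r in usable if r["admission_rate"] < max_admission]
    if not selective:
        return None, []
    avg = mean(r["graduation_rate_6yr"] for r in selective)
    standouts = sorted(
        (r for r in selective if r["graduation_rate_6yr"] > avg + margin_pp),
        key=lambda r: r["graduation_rate_6yr"],
        reverse=True,
    )
    return avg, standouts
```

The margin is expressed in percentage points, so it is added to the average and not multiplied.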
Example 2: Financial metric ranking across filings
User: "From the 10-K filings in ./filings/, rank the top 10 companies by debt-to-equity ratio and flag any with ratio above 3.0."
Approach:
Scan ./filings/ and find 200 10-K filings.
Extract the schema {company, total_debt, total_equity} from each filing.
Compute debt_to_equity = total_debt / total_equity per company.
Output:
Top 10 companies by debt-to-equity ratio:
| Rank | Company | Total Debt ($M) | Equity ($M) | D/E Ratio | Flag |
|------|---------------|-----------------|-------------|-----------|------|
| 1 | TelcoCorp | 89,200 | 12,400 | 7.19 | !! |
| 2 | RetailMax | 45,100 | 8,900 | 5.07 | !! |
| ... | | | | | |
4 companies flagged with D/E ratio > 3.0.
Extraction success: 195/200 filings (5 had non-standard formatting, skipped).
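A sketch of the ranking step, assuming each filing has already been reduced to a dict with `company`, `total_debt`, and `total_equity`. The function name is hypothetical, and skipping non-positive equity is a design assumption, because the ratio is not meaningful there.

```python
def rank_by_debt_to_equity(records, top_n=10, flag_above=3.0):
    """Return (top_rows, skipped_companies) ranked by total_debt / total_equity."""
    ranked, skipped = [], []
    for rec in records:
        debt, equity = rec.get("total_debt"), rec.get("total_equity")
        # Assumption: missing values and non-positive equity are skipped and
        # reported instead of ranked.
        if debt is None or equity is None or equity <= 0:
            skipped.append(rec.get("company"))
            continue
        ratio = debt / equity
        ranked.append(
            {"company": rec.get("company"), "de_ratio": ratio, "flag": ratio > flag_above}
        )
    ranked.sort(key=lambda row: row["de_ratio"], reverse=True)
    return ranked[:top_n], skipped
```

With the two rows shown in the table, 89,200 / 12,400 rounds to 7.19 and 45,100 / 8,900 rounds to 5.07, so both are flagged.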
Example 3: Building a verifiable QA benchmark from a document set
User: "I want to create a QA benchmark from our internal knowledge base (500 docs). Generate questions with provable answers."
Approach:
Output:
[
{
"question": "What is the average project completion rate across all teams with more than 10 members?",
"difficulty": "medium",
"sql": "SELECT AVG(completion_rate) FROM teams WHERE member_count > 10",
"answer": "78.3%",
"source_docs": ["team_alpha.md", "team_beta.md", ... ],
"num_sources": 34
}
]
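Because each benchmark item carries its own SQL, the stored answer can be re-derived and checked. The sketch below assumes the answer is a percentage string such as "78.3%" and that the extracted data lives in a SQLite connection; `verify_item` and the tolerance value are hypothetical.

```python
import sqlite3


def verify_item(conn, item, tolerance=0.05):
    """Re-run item['sql'] and compare the result with item['answer'].

    Assumes the query returns a single numeric value and the stored answer
    is a percentage string rounded to one decimal place.
    """
    computed = conn.execute(item["sql"]).fetchone()[0]
    if computed is None:
        return False
    expected = float(item["answer"].rstrip("%"))
    return abs(computed - expected) <= tolerance
```

Items that fail the check point either to a stale answer or to a schema change in the underlying table, and both are worth catching before the benchmark is published.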
| Problem | Cause | Resolution |
|---------|-------|------------|
| Extraction returns null for expected fields | Document uses non-standard formatting or terminology | Re-attempt with a more specific extraction prompt that includes the document's actual terminology. Log and exclude if still failing. |
| Aggregated counts don't match corpus size | Some documents failed extraction or were duplicates | Report N extracted / M total in output. Check for duplicate entities and merge. |
| Computed percentages exceed 100% or are negative | Unit mismatch or misidentified fields (e.g., extracting revenue in thousands vs. millions) | Normalize units during extraction. Add unit to schema definition. |
| Memory buffer grows too large during iterative processing | Too many attributes extracted per document | Reduce schema to only query-relevant fields. Summarize/compress buffer periodically. |
| Answer differs from user's expectation | Ambiguous query interpretation (e.g., "average" could mean mean or median) | Clarify the computation before executing. Show the SQL/pseudocode equivalent for user confirmation. |
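For the unit-mismatch row, one option is to record a unit alongside every extracted number and convert to base units before aggregation. The unit labels and the helper name below are assumptions for illustration.

```python
# Assumed unit labels; extend to whatever the extraction schema records.
UNIT_MULTIPLIERS = {
    "units": 1,
    "thousands": 1_000,
    "millions": 1_000_000,
    "billions": 1_000_000_000,
}


def to_base_units(value, unit):
    """Convert an extracted number to base units; None passes through."""
    if value is None:
        return None
    try:
        return value * UNIT_MULTIPLIERS[str(unit).lower()]
    except KeyError:
        # Fail loudly so an unrecognized unit is never silently aggregated.
        raise ValueError(f"unknown unit: {unit!r}") from None
```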
Paper: CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning (Lu et al., 2026). Key sections: Section 3 for the six-step data synthesis pipeline, Section 4.3 for the memory-augmented agent (MemAgent) architecture, and Tables 2-3 for the performance degradation curves showing RAG collapse versus agentic robustness.