skills/fs-researcher-test-time-scaling-long-horizon/SKILL.md
File-system-based dual-agent deep research framework that scales beyond context windows. Separates evidence gathering (Context Builder) from report writing (Report Writer) using a persistent hierarchical knowledge base on disk. Use this skill when the user says: "research this topic in depth", "write a comprehensive report on X", "do deep research about Y", "investigate and write up Z thoroughly", "build a knowledge base and then write a report", "scale up research quality with more rounds".
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add ndpvt-web/arxiv-claude-skills fs-researcher-test-time-scaling-long-horizon
This skill enables Claude to conduct long-horizon deep research tasks that exceed a single context window by implementing the dual-agent, file-system-based architecture from the FS-Researcher paper. Instead of cramming search results and report drafting into one overloaded context, the skill separates the work into two distinct phases: a Context Builder that browses, distills, and archives information into a hierarchical knowledge base on disk, and a Report Writer that composes the final report section by section using only that knowledge base as its source of facts. The file system acts as durable external memory, enabling iterative refinement across multiple sessions without context overflow.
FS-Researcher solves the fundamental bottleneck of deep research with LLM agents: long trajectories of web browsing and evidence collection consume so many tokens that little budget remains for thoughtful report composition. Prior approaches stuff everything into one context, forcing a tradeoff between breadth of evidence and quality of writing. FS-Researcher eliminates this tradeoff by externalizing state to the file system.
The Context Builder agent acts as a digital librarian. It decomposes the research topic into subtopics, searches the web, reads pages, and distills findings into structured Markdown notes organized in a tree of folders reflecting semantic relationships. Each note includes inline citations as relative file paths pointing to archived raw source pages. The agent maintains control files (todos with [PENDING]/[IN-PROGRESS]/[COMPLETE] status, checklists for acceptance criteria, and logs of session decisions). At the end of each session, it performs a checklist-based review, identifying gaps for the next iteration. Running more Context Builder rounds directly improves final report quality -- this is the test-time scaling mechanism.
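The status markers on the todo list can be read and updated mechanically between sessions. The following is a minimal sketch, assuming each task sits on its own line in the form `- [STATUS] description`; that line format and the helper names `parse_todos` and `set_status` are illustrative assumptions, since the source only specifies the three status markers.

```python
import re
from collections import defaultdict

# Assumed line format (not specified by the source): "- [STATUS] task description"
TODO_LINE = re.compile(r"^\s*[-*]\s*\[(PENDING|IN-PROGRESS|COMPLETE)\]\s*(.+?)\s*$")


def parse_todos(text: str) -> dict[str, list[str]]:
    """Group task descriptions by their status marker; other lines are ignored."""
    tasks: dict[str, list[str]] = defaultdict(list)
    for line in text.splitlines():
        match = TODO_LINE.match(line)
        if match:
            status, description = match.groups()
            tasks[status].append(description)
    return dict(tasks)


def set_status(text: str, description: str, new_status: str) -> str:
    """Return the todos text with the matching task's status marker replaced."""
    updated = []
    for line in text.splitlines():
        match = TODO_LINE.match(line)
        if match and match.group(2) == description:
            line = line.replace(f"[{match.group(1)}]", f"[{new_status}]", 1)
        updated.append(line)
    return "\n".join(updated) + "\n"
```

A session could call `parse_todos` on the contents of `todos.md` at startup to pick its next pending task, then write back the result of `set_status` before it ends.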
The Report Writer agent then works exclusively from the knowledge base -- no web access. It first creates an outline, then writes exactly one section per session, performing section-level reviews against quality checklists before marking each complete. After all sections are written, it conducts an overall report-level review and revises as needed. This section-by-section approach prevents shallow "fact-listing" and enables analytical depth through local planning and self-correction.
Initialize the workspace. Create a workspace directory with this structure:
workspace/
├── index.md # Topic decomposition and KB table of contents
├── todos.md # Task tracker with [PENDING]/[IN-PROGRESS]/[COMPLETE]
├── checklist.md # Acceptance criteria for research quality
├── log.md # Session-by-session decisions and review findings
├── knowledge_base/ # Hierarchical distilled notes (Markdown)
└── sources/ # Archived raw webpage content
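One way to create this layout is sketched below. The function name `init_workspace` and the placeholder header written into each control file are assumptions made for illustration; only the file and directory names come from the structure above.

```python
from pathlib import Path

# Control files from the layout above; the one-line headers are placeholders.
CONTROL_FILES = {
    "index.md": "# Index\n",
    "todos.md": "# Todos\n",
    "checklist.md": "# Checklist\n",
    "log.md": "# Log\n",
}
DIRECTORIES = ("knowledge_base", "sources")


def init_workspace(root: str) -> Path:
    """Create the workspace skeleton without touching files that already exist."""
    workspace = Path(root)
    for name in DIRECTORIES:
        (workspace / name).mkdir(parents=True, exist_ok=True)
    for name, header in CONTROL_FILES.items():
        path = workspace / name
        if not path.exists():  # keep state written by earlier sessions
            path.write_text(header, encoding="utf-8")
    return workspace
```

Because existing files are left alone, calling the function again at the start of a later session is safe.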
Decompose the research topic into subtopics. Write an index.md that breaks the user's question into 5-15 investigable subtopics arranged hierarchically. Create corresponding folders in knowledge_base/ with descriptive names (e.g., knowledge_base/scaling_laws/compute_optimal/).
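As a rough sketch of this step, the decomposition can be held as a mapping from top-level subtopics to child subtopics, then turned into folders and a simple index. The two-level mapping, the function name `write_decomposition`, and the bullet layout of the generated `index.md` are all assumptions; a real index would also carry descriptions of each subtopic.

```python
from pathlib import Path

# Hypothetical representation: top-level subtopic -> list of child subtopics.
Subtopics = dict[str, list[str]]


def write_decomposition(workspace: str, topic: str, subtopics: Subtopics) -> str:
    """Create knowledge_base/ folders for each subtopic and write index.md."""
    root = Path(workspace)
    kb = root / "knowledge_base"
    lines = [f"# {topic}", ""]
    for parent, children in subtopics.items():
        (kb / parent).mkdir(parents=True, exist_ok=True)
        lines.append(f"- knowledge_base/{parent}/")
        for child in children:
            (kb / parent / child).mkdir(exist_ok=True)
            lines.append(f"  - knowledge_base/{parent}/{child}/")
    index_text = "\n".join(lines) + "\n"
    (root / "index.md").write_text(index_text, encoding="utf-8")
    return index_text
```

For instance, `write_decomposition("workspace", "Scaling laws", {"scaling_laws": ["compute_optimal"]})` would create `knowledge_base/scaling_laws/compute_optimal/` and list it in the index.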
Run Context Builder rounds. For each round, follow the inspect-plan-execute cycle:
- Inspect: read index.md, todos.md, and existing notes to understand current coverage gaps.
- Plan and execute: search the web, read pages, write distilled notes into knowledge_base/, and archive raw pages in sources/. Each note must include specific facts (not vague summaries) and cite sources via relative file paths.
- Review: check the round's output against the checklist.md criteria. Log gaps in log.md and update todos.md.
Scale by running additional Context Builder rounds. Each round deepens coverage. Run at least 3 rounds for adequate breadth; 5+ rounds for comprehensive research. Each round should target gaps identified in the previous review.
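Part of each round's review can be automated by checking that the relative-path citations in the notes actually resolve. The sketch below assumes citations are written as Markdown inline links such as `[source](../sources/page.md)`; the function name `audit_citations` and its two-list return value are illustrative choices, not part of the FS-Researcher design.

```python
import re
from pathlib import Path

# Markdown inline links such as [source](../sources/page.md)
LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def audit_citations(workspace: str) -> tuple[list[tuple[str, str]], list[str]]:
    """Return (broken links, notes that cite nothing under sources/)."""
    root = Path(workspace).resolve()
    sources = root / "sources"
    broken, uncited = [], []
    for note in sorted((root / "knowledge_base").rglob("*.md")):
        name = note.relative_to(root).as_posix()
        cites_source = False
        for target in LINK.findall(note.read_text(encoding="utf-8")):
            if "://" in target:
                continue  # external URLs are out of scope for this check
            resolved = (note.parent / target).resolve()
            if not resolved.is_file():
                broken.append((name, target))
            elif sources in resolved.parents:
                cites_source = True
        if not cites_source:
            uncited.append(name)
    return broken, uncited
```

Both lists are natural candidates for the gap entries logged in `log.md` at the end of a round.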
Create the report outline. The Report Writer reads index.md and scans the knowledge base to draft a section-by-section outline. Write this as outline.md in the workspace. Each section heading should map to specific knowledge base folders.
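The mapping from section headings to knowledge base folders can be written down and sanity-checked in one pass. This is a sketch under assumptions: the `sections` mapping, the function name `write_outline`, and the numbered-list layout of `outline.md` are hypothetical, and in practice the Report Writer would draft the headings itself after scanning the knowledge base.

```python
from pathlib import Path


def write_outline(workspace: str, sections: dict[str, list[str]]) -> list[str]:
    """Write outline.md and return mapped folders missing from knowledge_base/."""
    root = Path(workspace)
    missing = []
    lines = ["# Outline", ""]
    for number, (heading, folders) in enumerate(sections.items(), start=1):
        lines.append(f"{number}. {heading}")
        for folder in folders:
            lines.append(f"   - knowledge_base/{folder}/")
            if not (root / "knowledge_base" / folder).is_dir():
                missing.append(folder)  # section has no evidence folder yet
    (root / "outline.md").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return missing
```

A non-empty return value suggests that a section was planned without supporting notes, which may call for another Context Builder round before writing starts.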
Write the report section by section. For each section:
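- Write exactly one section per session from the relevant knowledge base notes, review it against the quality checklist, and mark it complete in todos.md.
Conduct a full report review. After all sections are written, read the complete report and check for: logical flow between sections, redundancy, missing cross-references, citation consistency, and overall coherence. Revise as needed.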
Produce the final deliverable. Assemble all sections into a single report file with a proper introduction, table of contents, and bibliography derived from the sources directory.
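Assembly is mostly concatenation, so it can be scripted. The sketch below assumes each finished section lives in its own file under a `sections/` directory, named so that sorting gives reading order, and that each file starts with a `## Heading` line; the `sections/` directory, the function name `assemble_report`, and the bibliography format (one line per archived file) are assumptions not stated in the source.

```python
from pathlib import Path


def assemble_report(workspace: str, title: str, introduction: str) -> str:
    """Join section files into report.md with a contents list and source list."""
    root = Path(workspace)
    # Assumption: one file per section; sorted file names give reading order.
    texts = [p.read_text(encoding="utf-8").strip()
             for p in sorted((root / "sections").glob("*.md"))]
    sections = [t for t in texts if t]  # skip empty section files
    # Assumption: the first line of each section file is its "## Heading".
    headings = [s.splitlines()[0].lstrip("# ").strip() for s in sections]
    toc = [f"{i}. {h}" for i, h in enumerate(headings, start=1)]
    # Bibliography derived from the sources directory, one entry per archived file.
    bibliography = [f"- sources/{p.name}"
                    for p in sorted((root / "sources").glob("*.md"))]
    parts = [f"# {title}", introduction.strip(), "## Contents", "\n".join(toc),
             *sections, "## Sources", "\n".join(bibliography)]
    report = "\n\n".join(parts) + "\n"
    (root / "report.md").write_text(report, encoding="utf-8")
    return report
```

Listing every archived file is the simplest reading of "derived from the sources directory"; restricting the list to files that the report actually cites would be a reasonable refinement.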
Example 1: Technical Deep Research
User: "Research the current state of protein structure prediction methods
and write a comprehensive report."
Approach:
1. Create workspace at ./protein_research/
2. Decompose into subtopics in index.md:
- AlphaFold2 and AlphaFold3 architecture
- Competing methods (ESMFold, RoseTTAFold, OpenFold)
- Benchmarks and accuracy metrics (CASP, CAMEO)
- Limitations and failure modes
- Downstream applications (drug design, enzyme engineering)
- Open challenges (dynamics, complexes, disordered regions)
3. Context Builder Round 1: Search for each subtopic, archive 15-20 source
pages, write initial notes. Discover gap: limited coverage of industrial
applications.
4. Context Builder Round 2: Focus on industrial applications, recent 2025-2026
papers, and comparative benchmarks. Archive 10 more sources.
5. Context Builder Round 3: Checklist review reveals weak coverage of
limitations. Search specifically for failure cases and critical analyses.
6. Report Writer: Create outline with 8 sections. Write each section drawing
from the relevant KB subfolder. Section on limitations cites 6 sources
from knowledge_base/limitations/.
Output: A 4000-word report with 30+ cited sources, organized as:
workspace/
├── report.md # Final assembled report
├── outline.md # Section outline
├── index.md # Topic map
├── knowledge_base/
│ ├── alphafold/
│ │ ├── architecture.md
│ │ └── alphafold3_changes.md
│ ├── competing_methods/
│ │ ├── esmfold.md
│ │ └── rosettafold.md
│ ├── benchmarks/
│ │ └── casp_results.md
│ ├── limitations/
│ │ ├── disordered_regions.md
│ │ └── dynamics.md
│ └── applications/
│ ├── drug_design.md
│ └── enzyme_engineering.md
└── sources/
├── alphafold3_nature_2024.md
├── esmfold_science_2023.md
└── ... (30+ archived pages)
Example 2: Business Consulting Research
User: "Investigate the market opportunity for AI-powered legal document
review tools. I need a thorough analysis."
Approach:
1. Create workspace at ./legal_ai_research/
2. Decompose into: market size, key players, technology landscape,
regulatory environment, buyer personas, competitive dynamics,
pricing models, adoption barriers.
3. Context Builder Round 1: Broad search across all subtopics.
Archive market reports, vendor pages, regulatory documents.
4. Context Builder Round 2: Deep dive on competitive landscape.
Search for each identified vendor, pricing, and customer reviews.
5. Context Builder Round 3: Fill gaps on regulatory requirements
(GDPR, attorney-client privilege implications, bar association
guidance on AI tools).
6. Report Writer: Outline follows consulting format:
- Executive Summary
- Market Overview & Sizing
- Technology Landscape
- Competitive Analysis
- Regulatory Considerations
- Go-to-Market Recommendations
Write each section from KB. The competitive analysis section
cross-references 4 vendor notes and 2 market reports.
Output: A structured consulting-style report with data-backed claims
and traceable sources.
Example 3: Scaling Up Quality on a Specific Question
User: "I need the highest quality analysis possible on quantum error
correction approaches. Spend extra time on research."
Approach:
1. Initialize workspace. Decompose into 12 subtopics covering
surface codes, color codes, concatenated codes, LDPC codes,
hardware implementations, threshold theorems, etc.
2. Run 7 Context Builder rounds (more than default) to maximize depth:
- Rounds 1-2: Broad coverage of all subtopics
- Rounds 3-4: Fill checklist gaps, seek primary sources
- Rounds 5-6: Seek conflicting viewpoints, recent preprints
- Round 7: Final gap analysis and supplementary searches
3. Knowledge base grows to 50+ notes across 12 subfolders.
4. Report Writer produces a 12-section report with extensive
cross-referencing between sections.
Key insight: Each additional Context Builder round measurably
improves comprehensiveness and analytical insight, with diminishing
returns after round 5. For maximum quality, allocate 5-7 rounds.
Do:
- Use descriptive folder names: knowledge_base/transformer_architectures/attention_mechanisms/ is navigable; knowledge_base/topic_3/subtopic_2/ is not.
- Cite sources with relative links (e.g., [source](../sources/arxiv_2024_attention.md)). The Report Writer depends on traceability.
Avoid:
- Pasting raw page content into notes. Keep raw pages in sources/ for reference only.
Use index.md as the navigation map. The Report Writer should read the index first, then selectively read only the notes relevant to the current section.
Update log.md before ending each session. The next session picks up by reading the log, not by re-reading all prior context.
Paper: FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents (Zhu et al., 2026). Look for: Section 2.2-2.3 on the dual-agent architecture and workspace design, Table 1 for benchmark results showing +3.02 RACE improvement, and Table 3 for ablation results quantifying the contribution of each component (dual-agent split: -10.35 RACE when removed; persistent workspace: -4.07; section-by-section writing: -5.13).