skills/deepread-document-structure-aware-reasoning/SKILL.md
Structure-aware document reasoning that converts PDFs/long documents into hierarchically indexed paragraphs with coordinate metadata, then uses a dual-tool "locate then read" strategy (Retrieve + ReadSection) to answer complex questions requiring evidence scattered across distant document sections. Trigger phrases: - "analyze this PDF and answer questions about it" - "find evidence across sections of this document" - "search this long document for specific information" - "extract structured answers from this report" - "answer questions about this paper/filing/manual" - "build a document QA pipeline"
npx skillsauth add ndpvt-web/arxiv-claude-skills deepread-document-structure-aware-reasoningInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
This skill enables Claude to process long documents (PDFs, reports, filings, manuals, papers) by preserving their hierarchical structure — headings, sections, paragraph boundaries — rather than treating them as flat text. Using the DeepRead paradigm, Claude converts documents into structured Markdown with coordinate-style metadata (doc_id, sec_id, para_idx), then reasons over them with a two-phase "locate then read" strategy: first retrieving relevant paragraphs by semantic search with scanning context, then reading contiguous sections in order to synthesize accurate answers. This approach dramatically outperforms flat-chunking RAG when answers require integrating evidence from multiple distant document regions.
The problem with flat chunking: Standard RAG systems split documents into fixed-size overlapping chunks (e.g., 800 tokens with 400-token overlap), discarding the document's native structure. This means a retrieval hit in one chunk provides no information about where in the document it sits, what section it belongs to, or what adjacent content exists. For questions requiring multi-hop reasoning across distant sections, flat chunking forces the model to guess context.
DeepRead's structural indexing: Instead of arbitrary chunks, DeepRead indexes at paragraph granularity. Each paragraph receives a coordinate-style metadata key: {doc_id: d, sec_id: i, para_idx: j} where d identifies the document, i identifies the section (mapped from the heading hierarchy), and j is the paragraph's sequential position within that section. This metadata is cheap to store but enables two powerful operations: (1) the Retrieve tool can return not just matching paragraphs but their exact structural coordinates plus a scanning window of surrounding paragraphs, and (2) the ReadSection tool can fetch any contiguous range of paragraphs within a section by specifying [j_start, j_end], preserving reading order.
The "locate then read" loop: DeepRead wraps these tools in a ReAct-style multi-turn agent loop (up to 50 rounds). The agent receives the document's table of contents in its system prompt, uses Retrieve to find candidate locations, examines structural coordinates to understand where evidence lives, then uses ReadSection to read broader context around hits. This mimics how a human expert skims a table of contents, jumps to relevant sections, reads surrounding paragraphs for context, and iterates until the answer is complete.
Convert the document to structured Markdown. Parse the PDF (or other format) into Markdown that preserves heading hierarchy (#, ##, ###) and paragraph boundaries. Use an OCR/parsing tool (e.g., marker, pymupdf4llm, or an LLM-based parser) that retains structural elements rather than flattening to plain text.
Build the section-paragraph coordinate index. Walk the Markdown AST to identify sections (by heading level) and paragraphs within each section. Assign each paragraph a coordinate triple (doc_id, sec_id, para_idx). Store the mapping from sec_id to heading text for the table of contents.
Generate the table of contents (TOC). Extract the heading hierarchy into a compact TOC string that lists section IDs alongside their titles and paragraph counts. This TOC will be injected into the system prompt so the agent knows the document's structure without reading the full text.
Create paragraph-level embeddings. Embed each paragraph using a dense retriever (e.g., sentence-transformers, OpenAI embeddings, or a dedicated model like Qwen3-embedding). Store embeddings alongside the coordinate metadata in a vector index (FAISS, ChromaDB, or similar).
Implement the Retrieve tool. Given a query string, perform semantic search to find the top-K most relevant paragraphs. For each hit at coordinate (d, i, j), expand with a scanning window W = (w_up, w_down) to include paragraphs [max(1, j - w_up), min(n_section, j + w_down)]. Deduplicate overlapping windows. Return results sorted by (doc_id, sec_id, para_idx) with coordinates visible in the output.
Implement the ReadSection tool. Given a doc_id, sec_id, and paragraph range [j_start, j_end], return the contiguous paragraphs in reading order, clipped to valid boundaries. This tool takes no query — it reads exactly what the agent asks for.
Compose the agent system prompt. Include: (a) the task description, (b) the full TOC with section IDs and paragraph counts, (c) tool descriptions for Retrieve and ReadSection with parameter specs, (d) instructions to use "locate then read" — first Retrieve to find candidates, then ReadSection to expand context.
Run the multi-turn reasoning loop. Execute a ReAct loop where the agent alternates between reasoning (thinking about what it knows and what it still needs) and tool calls. Cap at a maximum number of rounds (e.g., 15-50 depending on document complexity). The agent terminates by emitting a FINAL action with its answer.
Post-process and validate the answer. Extract the final answer, verify it references specific document locations (section and paragraph coordinates), and format it with citations pointing back to source coordinates.
Example 1: Financial Filing Analysis
User: "I have a 10-K filing (200 pages). Does the company's risk factor
discussion about supply chain match what they report in the MD&A section
about actual supply chain disruptions?"
Approach:
1. Convert the 10-K PDF to structured Markdown preserving Item numbers
(Item 1A: Risk Factors, Item 7: MD&A, etc.) as sections.
2. Index paragraphs with coordinates like:
{doc_id: "10K_2025", sec_id: "item_1a", para_idx: 3}
{doc_id: "10K_2025", sec_id: "item_7", para_idx: 12}
3. Retrieve("supply chain risk") → hits in Item 1A paragraphs 3, 7, 15
with scanning window showing surrounding context.
4. ReadSection(doc_id="10K_2025", sec_id="item_1a", j_start=1, j_end=20)
to read the full supply chain risk subsection.
5. Retrieve("supply chain disruption impact revenue") → hits in Item 7
paragraphs 12, 14.
6. ReadSection(doc_id="10K_2025", sec_id="item_7", j_start=10, j_end=18)
to read surrounding MD&A discussion.
7. Synthesize: compare the stated risks against reported outcomes.
Output:
"The Risk Factors section (Item 1A, paras 3-8) identifies three supply chain
risks: semiconductor shortages, single-source dependencies, and logistics
delays. The MD&A section (Item 7, paras 12-16) reports that semiconductor
shortages reduced Q3 revenue by $45M, confirming Risk Factor #1. However,
the single-source dependency risk (Item 1A, para 7) is not addressed in
MD&A, suggesting either no materialization or incomplete disclosure."
Example 2: Technical Manual Troubleshooting
User: "Our deployment keeps failing at the TLS handshake step. Search the
800-page infrastructure manual for relevant configuration steps."
Approach:
1. Parse manual into structured Markdown. TOC reveals:
sec_id: "ch4_networking" (Chapter 4: Networking)
sec_id: "ch4_3_tls" (Section 4.3: TLS Configuration)
sec_id: "ch9_troubleshoot" (Chapter 9: Troubleshooting)
sec_id: "appendix_b" (Appendix B: Error Codes)
2. Retrieve("TLS handshake failure configuration") → hits in:
- ch4_3_tls, para 5 (certificate chain setup)
- ch9_troubleshoot, para 22 (handshake timeout errors)
- appendix_b, para 8 (error code TLS_HANDSHAKE_FAILED)
3. ReadSection("manual", "ch4_3_tls", 1, 15) → full TLS config section.
4. ReadSection("manual", "ch9_troubleshoot", 20, 28) → troubleshooting
steps around the hit.
5. Cross-reference: config step in para 5 requires setting
`tls_min_version=1.2`, and troubleshooting para 23 notes that
omitting this causes handshake failures with newer clients.
Output:
"Based on Section 4.3 (paras 5-7) and Chapter 9 (para 23):
1. Verify `tls_min_version` is set to 1.2+ in your config (Section 4.3, para 5)
2. Ensure the full certificate chain is provided, not just the leaf cert
(Section 4.3, para 7)
3. If using mutual TLS, the client CA bundle path must be absolute
(Chapter 9, para 23 — this is the most common cause of handshake failures)"
Example 3: Building a Document QA Pipeline in Code
User: "Build me a Python pipeline that indexes a PDF using the DeepRead
approach and answers questions about it."
Approach:
1. Write a document parser that converts PDF → structured Markdown
(using pymupdf4llm or marker).
2. Implement coordinate indexing — walk the Markdown to extract sections
and assign (sec_id, para_idx) to each paragraph.
3. Build a vector index over paragraphs with metadata.
4. Implement Retrieve and ReadSection as callable tool functions.
5. Wire into an LLM agent loop with the TOC in the system prompt.
Output (key code structure):
# document_parser.py
def pdf_to_structured_markdown(pdf_path: str) -> str: ...
def extract_sections(markdown: str) -> list[Section]: ...
def build_paragraph_index(sections: list[Section]) -> ParagraphIndex: ...
# tools.py
def retrieve(query: str, index: ParagraphIndex, top_k=5,
window=(2, 2)) -> list[ParagraphHit]: ...
def read_section(doc_id: str, sec_id: str, j_start: int,
j_end: int, index: ParagraphIndex) -> str: ...
# agent.py
def build_system_prompt(toc: str, tool_descriptions: str) -> str: ...
def run_deepread_agent(question: str, index: ParagraphIndex,
max_rounds: int = 20) -> str: ...
(doc_id, sec_id, para_idx) rather than relevance score. Reading in document order helps the agent follow the author's logic.W = (2, 2) and increase only if recall is low.| Problem | Cause | Fix |
|---|---|---|
| Paragraphs have no section coordinates | PDF parser failed to detect headings (scanned PDF, non-standard formatting) | Fall back to page-level sectioning: treat each page as a "section" with paragraphs numbered sequentially |
| ReadSection returns empty | j_start exceeds actual paragraph count in section | Clip to valid range [1, n_section]; return the closest valid range with a note that the requested range was adjusted |
| Retrieve returns irrelevant hits | Embedding model struggles with domain-specific terminology | Add a reranking stage (cross-encoder) or prepend section titles to paragraph text before embedding to boost topical signal |
| Agent loops without converging | Question requires reasoning the LLM cannot perform, or evidence genuinely isn't in the document | Set a hard round limit (20-50); if the agent exhausts rounds, return the best partial answer with a confidence disclaimer |
| TOC is too large for system prompt | Document has hundreds of fine-grained subsections | Collapse the TOC to the top 2-3 heading levels; let Retrieve discover deeper subsections on demand |
Paper: DeepRead: Document Structure-Aware Reasoning to Enhance Agentic Search — Li et al., 2026. Focus on Section 3 (method), especially the coordinate metadata schema Gamma_{d,i,j}, the Retrieve/ReadSection tool definitions, and Algorithm 1 for the full agent loop.
development
Audit LLM-based automatic short answer grading (ASAG) systems for adversarial vulnerabilities using token-level and prompt-level attack strategies from the GradingAttack framework. Triggers: 'test grading robustness', 'adversarial attack on grading', 'audit LLM grader', 'red-team answer grading', 'ASAG vulnerability assessment', 'grading fairness attack'
development
Build structured information-seeking agents that decompose complex queries into multi-turn search-and-browse workflows, aggregate results from multiple web sources, and return answers in typed structured formats (items, sets, lists, tables). Applies the GISA benchmark's ReAct-based agent architecture and evaluation methodology. Trigger phrases: "build an information-seeking agent", "search agent pipeline", "multi-turn web research agent", "structured web search workflow", "aggregate information from multiple sources", "web research with structured output"
data-ai
Optimize LLM prompts using GFlowPO's iterative generate-evaluate-refine loop with diversity-preserving exploration and dynamic memory. Use when: 'optimize this prompt', 'find a better prompt for this task', 'prompt engineering with examples', 'auto-tune my system prompt', 'improve prompt accuracy', 'generate prompt variations'.
development
Constrain LLM generation with executable Pydantic schemas and multi-agent pipelines to produce structurally valid, domain-rich artifacts. Uses ontology-as-grammar to eliminate hallucinated structures while preserving creative output. Trigger phrases: "generate a valid game design", "schema-constrained generation", "build a multi-agent pipeline with Pydantic validation", "ontology-driven content generation", "structured creative generation with DSPy", "generate artifacts that pass domain validation".