brycewang-stanford/large-document-reader

Name: large-document-reader
Author: brycewang-stanford

skills/43-wentorai-research-plugins/skills/tools/document/large-document-reader/SKILL.md

npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research large-document-reader

Large Document Reader

Split long documents (books, reports, theses, legal filings, technical manuals) into structured chapters or sections for systematic, chapter-by-chapter reading and analysis within LLM context windows.

Overview

Large Language Models have finite context windows, and even models with 100K+ token limits can lose accuracy on information buried in the middle of very long inputs. Academic researchers frequently work with documents that exceed practical context limits: doctoral theses (200+ pages), government reports, book-length monographs, legal case compilations, and multi-volume technical standards.

This skill provides a systematic approach to splitting large documents into semantically meaningful chapters or sections, maintaining cross-references between parts, and reading each section with full comprehension. Rather than naive fixed-size chunking that breaks mid-sentence or mid-argument, this approach respects document structure -- headings, chapter breaks, section markers, and logical boundaries.

The result is a structured reading experience where each chapter is analyzed in full context, summaries are maintained across sessions, and the reader can navigate directly to any section of interest. This is especially valuable for literature reviews, systematic reviews, and comprehensive document analysis tasks.

Document Splitting Strategy

Hierarchy of Split Points

Documents should be split at the highest-level structural boundary that keeps each chunk within the target size:

| Priority | Boundary Type | Markers |
|----------|--------------|---------|
| 1 | Part/Volume | PART I, Volume 2, page breaks with Roman numerals |
| 2 | Chapter | Chapter 1, CHAPTER, numbered headings level 1 |
| 3 | Section | 1.1, Section, headings level 2 |
| 4 | Subsection | 1.1.1, headings level 3 |
| 5 | Paragraph break | Double newline, indentation change |
| 6 | Sentence boundary | Period + space + capital letter |
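
As a rough illustration of the priority order in the table, the sketch below tries each boundary level in turn and stops at the first one whose chunks all fit the target size. The `BOUNDARY_LEVELS` patterns, the function names, and the whitespace word count used as a stand-in for tokens are assumptions made for this example, not part of the skill.

```python
import re

# Hypothetical patterns, one per priority level (highest level first).
BOUNDARY_LEVELS = [
    ("part", re.compile(r"^PART\s+[IVX]+", re.MULTILINE)),
    ("chapter", re.compile(r"^Chapter\s+\d+", re.MULTILINE)),
    ("section", re.compile(r"^\d+\.\d+\s+\S", re.MULTILINE)),
    ("paragraph", re.compile(r"\n\s*\n")),
]


def split_at(pattern, text):
    """Cut text at the start of every match, keeping each marker with its chunk."""
    starts = sorted({0, *(m.start() for m in pattern.finditer(text))})
    bounds = starts + [len(text)]
    chunks = [text[a:b] for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c.strip()]


def choose_split_level(text, max_words):
    """Return (level, chunks) for the highest level whose chunks all fit."""
    if len(text.split()) <= max_words:
        return "whole", [text]
    for level, pattern in BOUNDARY_LEVELS:
        chunks = split_at(pattern, text)
        if all(len(c.split()) <= max_words for c in chunks):
            return level, chunks
    # Nothing fits: return the finest level and let the caller split further.
    level, pattern = BOUNDARY_LEVELS[-1]
    return level, split_at(pattern, text)
```

Trying coarse boundaries first keeps related material together; finer levels are used only when a coarser split leaves a chunk over the limit.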

Splitting Algorithm

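from dataclasses import dataclass


@dataclass
class Section:
    """One chunk of the document, with its position and overlap context."""
    title: str
    text: str
    index: int = 0
    context_prefix: str = ""


# detect_sections, split_by_paragraphs, merge_small_sections, count_tokens, and
# get_last_n_tokens are helper functions that are not defined in this document.

def split_document(text, max_tokens=8000, overlap_tokens=200):
    """Split document respecting structural boundaries."""
    # Step 1: Detect document structure
    chapters = detect_chapters(text)

    # detect_chapters always returns at least one section, so a single
    # section means no chapter markers were found.
    if len(chapters) <= 1:
        # Fallback: split by sections
        chapters = detect_sections(text)

    if len(chapters) <= 1:
        # Fallback: split by paragraphs with size limit
        # (split_by_paragraphs returns plain strings, so wrap them)
        chapters = [
            Section(title=f"Part {i+1}", text=part)
            for i, part in enumerate(split_by_paragraphs(text, max_tokens))
        ]

    # Step 2: Merge small adjacent sections
    merged = merge_small_sections(chapters, min_tokens=500)

    # Step 3: Split oversized sections
    final = []
    for chapter in merged:
        if count_tokens(chapter.text) > max_tokens:
            sub_parts = split_by_paragraphs(chapter.text, max_tokens)
            for i, part in enumerate(sub_parts):
                final.append(Section(
                    title=f"{chapter.title} (Part {i+1})",
                    text=part,
                    index=len(final)
                ))
        else:
            chapter.index = len(final)
            final.append(chapter)

    # Step 4: Add overlap for continuity
    for i in range(1, len(final)):
        final[i].context_prefix = get_last_n_tokens(
            final[i-1].text, overlap_tokens
        )

    return final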
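
The algorithm calls several helpers that the skill does not define. A minimal sketch of what they might look like follows; the whitespace-based token count and the merge rule (fold a short section into the one before it) are assumptions chosen for simplicity, and a tokenizer matched to the target model would give different counts.

```python
import re


def count_tokens(text):
    """Rough token estimate: one token per whitespace-separated word."""
    return len(text.split())


def split_by_paragraphs(text, max_tokens):
    """Greedily pack whole paragraphs into chunks of at most max_tokens.

    A single paragraph longer than max_tokens is kept whole; splitting it
    at sentence boundaries is left to the caller.
    """
    chunks, current, current_size = [], [], 0
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        size = count_tokens(para)
        # Start a new chunk when adding this paragraph would overflow.
        if current and current_size + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_size = [], 0
        current.append(para)
        current_size += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def get_last_n_tokens(text, n):
    """Return the last n whitespace-separated words of text as one string."""
    words = text.split()
    return " ".join(words[-n:]) if n > 0 else ""


def merge_small_sections(sections, min_tokens=500):
    """Fold any section shorter than min_tokens into the section before it."""
    merged = []
    for section in sections:
        if merged and count_tokens(section.text) < min_tokens:
            merged[-1].text += "\n\n" + section.text
        else:
            merged.append(section)
    return merged
```

Packing whole paragraphs keeps every cut on a paragraph break, which matches the priority order of split points described earlier.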

Structure Detection Patterns

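import re

CHAPTER_PATTERNS = [
    r'^#{1,2}\s+.+',                          # Markdown H1/H2
    r'^Chapter\s+\d+',                         # "Chapter 1"
    r'^\d+\.\s+[A-Z]',                        # "1. Introduction"
    r'^PART\s+[IVX]+',                         # "PART III"
    r'^\\(chapter|section)\{',                 # LaTeX commands
    r'^\f',                                    # Form feed (page break)
]

# Compile once; MULTILINE lets ^ match at the start of every line.
CHAPTER_RE = re.compile('|'.join(CHAPTER_PATTERNS), re.MULTILINE)

def detect_chapters(text):
    """Return Section objects (title, text), one per detected boundary."""
    sections = []
    current_title = "Preamble"
    current_start = 0

    for match in CHAPTER_RE.finditer(text):
        body = text[current_start:match.start()].strip()
        if body:  # skip empty spans, e.g. a heading at the very start
            sections.append(Section(title=current_title, text=body))
        # Use the whole heading line as the title. The bare match can stop
        # early: "1. Introduction" would otherwise yield the title "1. I".
        line_end = text.find('\n', match.start())
        if line_end == -1:
            line_end = len(text)
        current_title = text[match.start():line_end].strip() or "Untitled"
        current_start = match.start()

    sections.append(Section(title=current_title, text=text[current_start:].strip()))
    return sections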
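
The splitting algorithm falls back to a `detect_sections` helper that is not shown. One possible sketch is below, assuming level-2 markers of the form "1.1 Title", "Section 3", or a LaTeX `\subsection{`; the `SECTION_RE` pattern is hypothetical, and `Section` is redeclared only so the block runs on its own.

```python
import re
from dataclasses import dataclass


@dataclass
class Section:
    # Redeclared so this block is self-contained; same fields as used above.
    title: str
    text: str
    index: int = 0
    context_prefix: str = ""


# Hypothetical level-2 markers; each alternative consumes the rest of the line.
SECTION_RE = re.compile(
    r"^(?:\d+\.\d+[ \t]+\S.*|Section[ \t]+\d+.*|\\subsection\{.*)$",
    re.MULTILINE,
)


def detect_sections(text):
    """Split text at level-2 headings; return [] when none are found."""
    matches = list(SECTION_RE.finditer(text))
    if not matches:
        return []
    sections = []
    preamble = text[: matches[0].start()].strip()
    if preamble:
        sections.append(Section(title="Preamble", text=preamble))
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        sections.append(Section(
            title=current.group().strip(),
            text=text[current.start():end].strip(),
        ))
    return sections
```

Returning an empty list when nothing matches gives the caller a clear signal to move on to the paragraph-level fallback.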

Structured Reading Workflow

Phase 1: Survey

Read the table of contents, introduction, and conclusion first to build a mental model of the document's argument structure:

1. Extract and display Table of Contents
2. Read Introduction (typically Chapter 1)
3. Read Conclusion (typically last chapter)
4. Generate a document map: chapter titles + estimated page counts
5. Identify key themes and arguments
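
Step 4 can be sketched as below. The `WORDS_PER_PAGE` figure is an assumed average rather than a value from the skill, and `build_document_map` is a hypothetical name; it takes plain (title, text) pairs so the example stays self-contained.

```python
WORDS_PER_PAGE = 300  # assumed average; adjust for the document's layout


def build_document_map(sections):
    """Return one line per (title, text) pair with an estimated page count."""
    lines = []
    for number, (title, text) in enumerate(sections):
        words = len(text.split())
        pages = max(1, round(words / WORDS_PER_PAGE))
        lines.append(f"{number:>3}  {title}  [est. pages: {pages}, words: {words}]")
    return lines
```

The page estimate is only a navigation aid; when the source format exposes real page numbers, those are preferable.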

Phase 2: Sequential Deep Reading

Process each chapter with a standardized analysis template:

For each chapter:
  - Chapter title and position in document
  - Key arguments or findings (3-5 bullet points)
  - Methodology described (if applicable)
  - Data or evidence presented
  - Connections to previous chapters
  - Open questions or points for follow-up
  - Notable quotes or passages (with page/section references)
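
One way to keep the template consistent across chapters is to store each chapter's notes in a small record. The sketch below is illustrative only: the `ChapterNotes` field names and the `render_notes` output format are assumptions that mirror the template items above.

```python
from dataclasses import dataclass, field


@dataclass
class ChapterNotes:
    """One filled-in analysis template; field names are illustrative."""
    title: str
    position: int  # zero-based index of the section in the document
    key_findings: list = field(default_factory=list)
    methodology: str = ""
    evidence: list = field(default_factory=list)
    connections: list = field(default_factory=list)
    open_questions: list = field(default_factory=list)
    quotes: list = field(default_factory=list)  # (quote, reference) pairs


def render_notes(notes, total):
    """Format the notes as plain text, omitting fields left empty."""
    out = [f"{notes.title} (section {notes.position + 1} of {total})"]
    blocks = [
        ("Key arguments or findings", notes.key_findings),
        ("Methodology", [notes.methodology] if notes.methodology else []),
        ("Data or evidence", notes.evidence),
        ("Connections to previous chapters", notes.connections),
        ("Open questions", notes.open_questions),
        ("Notable quotes", [f'"{q}" ({ref})' for q, ref in notes.quotes]),
    ]
    for heading, items in blocks:
        if items:
            out.append(f"{heading}:")
            out.extend(f"  - {item}" for item in items)
    return "\n".join(out)
```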

Phase 3: Synthesis

After all chapters are read, generate cross-cutting analyses:

- Thematic summary across all chapters
- Argument progression map
- Methodology comparison (if multiple studies)
- Contradiction or tension identification
- Gap analysis relative to research questions
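
As a small example of the first item, themes noted per chapter can be inverted into a theme-to-chapters index, which makes recurring themes easy to spot. The function names and the input shape (a dict of chapter index to theme list) are assumptions for illustration.

```python
from collections import defaultdict


def themes_by_chapter(chapter_themes):
    """Map each theme to the sorted list of chapter indices that mention it."""
    index = defaultdict(list)
    for chapter, themes in chapter_themes.items():
        for theme in set(themes):  # count a theme once per chapter
            index[theme].append(chapter)
    return {theme: sorted(chapters) for theme, chapters in index.items()}


def cross_cutting(chapter_themes, min_chapters=2):
    """Keep only themes that recur in at least min_chapters chapters."""
    return {
        theme: chapters
        for theme, chapters in themes_by_chapter(chapter_themes).items()
        if len(chapters) >= min_chapters
    }
```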

Cross-Session Persistence

For documents that take multiple sessions to read, maintain a reading state file:

{
  "document": "thesis_smith_2024.pdf",
  "total_sections": 24,
  "completed": [0, 1, 2, 3, 4, 5],
  "current": 6,
  "summaries": {
    "0": "Preamble: Defines scope of study on...",
    "1": "Chapter 1: Introduction to the problem of...",
    "2": "Chapter 2: Literature review covering..."
  },
  "themes": ["data governance", "algorithmic fairness", "institutional trust"],
  "open_questions": [
    "How does the author reconcile findings in Ch3 with Ch5?"
  ]
}
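
A minimal sketch of reading and updating such a state file follows, using only the standard library. The function names are hypothetical; the keys match the example above. Writing to a temporary file and renaming it is a design choice made here so an interrupted session is less likely to leave a half-written file.

```python
import json
from pathlib import Path


def load_state(path, document, total_sections):
    """Load the reading state, or start a fresh one if the file is missing."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {
        "document": document,
        "total_sections": total_sections,
        "completed": [],
        "current": 0,
        "summaries": {},
        "themes": [],
        "open_questions": [],
    }


def mark_completed(state, index, summary):
    """Record a finished section and advance to the next unread one."""
    if index not in state["completed"]:
        state["completed"].append(index)
        state["completed"].sort()
    state["summaries"][str(index)] = summary  # JSON object keys are strings
    remaining = [
        i for i in range(state["total_sections"]) if i not in state["completed"]
    ]
    state["current"] = remaining[0] if remaining else None
    return state


def save_state(path, state):
    """Write the state to a temporary file, then rename it into place."""
    path = Path(path)
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state, indent=2, ensure_ascii=False), encoding="utf-8")
    tmp.replace(path)
```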

Format-Specific Handling

| Format | Tool | Notes |
|--------|------|-------|
| PDF | pdfplumber, PyMuPDF | Extract text with layout awareness |
| EPUB | ebooklib | Chapters are HTML files in the spine |
| DOCX | python-docx | Headings define structure |
| LaTeX | Regex on \chapter, \section | Native structure markers |
| HTML | BeautifulSoup | Split on <h1>, <h2> tags |
| Plain text | Heuristic detection | Use blank lines, indentation, page breaks |
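
To illustrate the HTML row, here is a sketch that splits on `<h1>` and `<h2>` tags. The table suggests BeautifulSoup; this example swaps in the standard library's `html.parser` purely so it runs without extra dependencies. The class and function names are hypothetical, and the sketch does not filter out `<script>` or `<style>` content.

```python
from html.parser import HTMLParser


class HeadingSplitter(HTMLParser):
    """Collect [heading, body_parts] pairs, starting a new pair at each h1/h2."""

    def __init__(self):
        super().__init__()
        self.sections = [["Preamble", []]]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self._in_heading = True
            self.sections.append(["", []])

    def handle_endtag(self, tag):
        if tag in ("h1", "h2"):
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.sections[-1][0] += data  # heading text becomes the title
        elif data.strip():
            self.sections[-1][1].append(data.strip())


def split_html(html):
    """Return (title, text) pairs, dropping the preamble when it is empty."""
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    return [
        (title.strip(), " ".join(body))
        for title, body in parser.sections
        if body or title != "Preamble"
    ]
```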

Best Practices

- Preserve cross-references: When a chapter says "as discussed in Section 3.2," maintain a reference index so the reader can retrieve that section.
- Maintain running context: Each chunk should carry context from the preceding material, either the overlap window or a brief summary, to maintain narrative continuity.
- Respect tables and figures: Never split in the middle of a table, code block, or figure caption. Keep these as atomic units.
- Index creation: Build a searchable index of key terms, names, and concepts with section references for rapid lookup.
- Citation extraction: Pull out all references cited in each chapter to build a cumulative bibliography.
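
The cross-reference practice can be sketched with a regular expression that collects in-text pointers such as "Section 3.2". The `XREF_RE` pattern covers only a few common English forms and is an assumption, as is the `build_reference_index` name; other citation styles would need their own patterns.

```python
import re
from collections import defaultdict

# Matches "Section 3.2", "Chapter 4", "Sec. 2.1.3"; extend for other styles.
XREF_RE = re.compile(r"\b(?:Section|Sec\.|Chapter|Ch\.)\s+(\d+(?:\.\d+)*)")


def build_reference_index(sections):
    """Map each referenced number to the titles of the sections that cite it."""
    index = defaultdict(list)
    for title, text in sections:
        # A set avoids listing a section twice for repeated mentions.
        for target in sorted(set(XREF_RE.findall(text))):
            index[target].append(title)
    return dict(index)
```

With this index, a reader who reaches a pointer to "3.2" can look up which sections cite it and load the target section alongside the current one.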

References

pdfplumber: https://github.com/jsvine/pdfplumber
python-docx: https://python-docx.readthedocs.io
ebooklib: https://github.com/aerkalov/ebooklib
PyMuPDF (fitz): https://pymupdf.readthedocs.io

Split and read long documents chapter-by-chapter for structured analysis

1,661 stars

documentation

Updated Jun 4, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research large-document-reader

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Jun 4, 2026, 6:11 AM (143.5s, 1 file scanned)

SKILL.md

name: large-document-reader
description: Split and read long documents chapter-by-chapter for structured analysis
emoji: 📖
category: tools
subcategory: document
keywords: ["document reading", "chunking", "long document", "chapter splitting", "structured reading"]
source: wentor-research-plugins

Related Skills

brycewang-stanford/literature-review-tools

tools · Verified · Trusted · Community

Recommend AND run open-source AI tools, agents, Claude Code / Codex skills, and MCP servers for any stage of a literature review: searching, reading, extracting, synthesizing, screening, citation-checking, and paper writing. Use when the user asks "what tool should I use to..." OR "install/run/use <tool> to ..." for research/lit-review work: automating a survey or related-work section, PDF→Markdown extraction for LLMs (MinerU/marker/docling), PRISMA / systematic review (ASReview), citation-backed Q&A over PDFs (PaperQA2), wiring papers into Claude/Cursor via MCP (arxiv/paper-search/zotero servers), or chatting with a Zotero library. Ships a launcher (scripts/litrun.py) that installs each tool in an isolated venv and runs it. Curated catalog of 70+ vetted projects. Supports Chinese and English (for literature-review tool selection and one-click install/run).

3,109 · SKILL.md · Updated Jul 28, 2026

brycewang-stanford/auto-empirical-research-skills

development · Verified · Trusted · Community

Route empirical-research requests through the Auto-Empirical Research Skills catalog when this whole repository is installed as one skill in Codex, CodeBuddy, Claude Code, or another IDE. Use to choose and load the right vendored AERS skill for causal inference, econometrics, replication, data acquisition, manuscript writing, peer review and referee responses, citation checking, de-AIGC editing, or full empirical-paper workflows without reading the entire repository at once.

3,109 · SKILL.md · Updated Jun 27, 2026

brycewang-stanford/aer-preregistration

documentation · Verified · Trusted · Community

Use when the project collects primary data or runs a field, lab, or survey experiment, before the intervention begins: write the pre-analysis plan, size the sample from a power calculation, and register with the AEA RCT Registry. Apply after the design is chosen in aer-identification and before any outcome data are seen.

3,021 · SKILL.md · Updated Jul 23, 2026

brycewang-stanford/economist-data-skill

tools · Verified · Trusted · Community

Guide economists to authoritative data sources with explicit, confirmed data specifications before retrieval; interfaces with Playwright MCP to navigate portals and extract real data, not articles about data.

3,021 · SKILL.md · Updated Jul 23, 2026

Install manually

# Clone the repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research.git

# Copy into Claude Code skills folder (global)
cp -r Awesome-Agent-Skills-for-Empirical-Research/skills/43-wentorai-research-plugins/skills/tools/document/large-document-reader ~/.claude/skills/

Repository

brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT