Citation Verification with Gemini File Search

Scan pandoc-flavored markdown drafts for citations, upload source PDFs to a Gemini File Search store, and verify each citation is grounded in its source. Produces a structured REVIEW-CITES.md report.

Prerequisites

GOOGLE_API_KEY env var set (Google AI Studio; on this machine: export GOOGLE_API_KEY="$(cat "$GEMINI_API_KEY_FILE")")
Bun runtime
rclone with a google-drive: remote configured (used to bypass Google Drive FUSE deadlocks)
python3 with pymupdf4llm installed (used for PDF text extraction in passage grounding)
readwise CLI installed and authenticated (for Readwise article export in source materialization)
One or more .bib files with file fields mapping bibkeys to PDF paths (e.g., Paperpile's paperpile.bib)

Source Materialization

Before running cite-check, materialize all sources locally:

cd ${CLAUDE_SKILL_DIR}
bun materialize-sources.ts \
  --bib ~/Google\ Drive/My\ Drive/resources/Paperpile/paperpile.bib \
  --bib ./references/sources.bib \
  --refs ./references \
  --drafts ./drafts \
  --debug

This populates references/ with local copies of all cited sources:

Paperpile PDFs → batch rclone copy from Google Drive → references/<bibkey>.pdf
Readwise articles (reports, news, speeches without PDFs) → search by title, export markdown → references/<bibkey>.md
Gaps → printed at the end for manual action (Obsidian web clipper or manual sourcing)

After materialization, cite-check reads every source from the local references/ directory; only the Gemini uploads and queries still need network access.

Usage

cd ${CLAUDE_SKILL_DIR}
bun install  # first time only

# Single bib file
bun cite-check.ts --bib ~/Google\ Drive/My\ Drive/resources/Paperpile/paperpile.bib --drafts <path-to-drafts>

# Multiple bib files (Paperpile + project-local; first bib wins on duplicate keys)
bun cite-check.ts \
  --bib ~/Google\ Drive/My\ Drive/resources/Paperpile/paperpile.bib \
  --bib ./references/sources.bib \
  --drafts <path-to-drafts>

CLI Flags

| Flag | Required | Default | Description |
|------|----------|---------|-------------|
| --bib <path> | Yes* | -- | Path to .bib file (repeatable; first wins on duplicate keys) |
| --store <id> | No | auto-create | Use existing File Search store ID |
| --drafts <dir> | No | ./drafts | Directory with markdown draft files |
| --out <path> | No | <drafts>/REVIEW-CITES.md | Output report path |
| --limit <n> | No | all | Check only first N citations (smoke test) |
| --dry-run | No | false | Print prompts without querying |
| --sequential | No | false | Run queries one at a time instead of via the Batch API (default: batch) |
| --retry-model <model> | No | gemini-3.1-pro-preview | Retry UNSUPPORTED results with a stronger model |
| --audit | No | false | Audit source availability without querying (checks Paperpile PDFs) |
| --debug | No | false | Verbose logging |

*Either --bib or --store is required.

Ask Mode: Targeted Source Queries

Ask a specific question about a single source:

# Does Bebchuk2019 support a specific claim?
bun cite-check.ts ask @Bebchuk2019-uq "do expense ratios fall since 2010?" --bib paperpile.bib

# What does a source say about a topic?
bun cite-check.ts ask @Brav2022-ht "what are retail turnout rates?" --bib paperpile.bib --bib sources.bib

The ask mode uploads the single source PDF via the legacy Files API (with manifest caching, 48h TTL), queries Gemini with inline file references, and prints the answer with supporting passages to stdout. No File Search store is created and no report is generated.

Cross-Directory File Resolution

When multiple --bib files are provided, file paths are resolved across all bib directories. This handles the common case where a project-local sources.bib has file = {All Papers/...} paths that are relative to the Paperpile folder rather than the project's references/ directory. The tool tries each bib directory as a fallback when the primary path doesn't exist on disk.
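
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: `resolveSourcePath` and its parameters are hypothetical names, and the `exists` argument is injectable only so the logic can be exercised without touching the disk.

```typescript
import { existsSync } from "node:fs";
import { dirname, isAbsolute, resolve } from "node:path";

// Hypothetical helper: resolve a bib `file` field against every bib directory.
// `ownerBib` is the bib file that declared the entry; `allBibs` is every
// --bib path in CLI order.
function resolveSourcePath(
  fileField: string,
  ownerBib: string,
  allBibs: string[],
  exists: (p: string) => boolean = existsSync,
): string | null {
  if (isAbsolute(fileField)) return exists(fileField) ? fileField : null;
  // Try the declaring bib's directory first, then the other bib directories.
  const dirs = [dirname(ownerBib), ...allBibs.map((b) => dirname(b))];
  const tried = new Set<string>();
  for (const dir of dirs) {
    const candidate = resolve(dir, fileField);
    if (tried.has(candidate)) continue;
    tried.add(candidate);
    if (exists(candidate)) return candidate;
  }
  return null; // not on disk under any bib directory
}
```

With this shape, an entry in ./references/sources.bib whose file field is All Papers/... resolves under the Paperpile folder as long as paperpile.bib is also passed via --bib.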

How It Works

1. Extract citations from markdown using pandoc [@bibkey] syntax
2. Parse bib file to map bibkeys to PDF file paths via file fields
3. Create or reuse a File Search store: PDFs for cited bibkeys are imported into a persistent Gemini File Search store with bibkey metadata. Google Drive FUSE paths are copied locally via rclone to avoid EDEADLK deadlocks. Stores persist across runs (no 48h TTL); if cited sources have not changed, the existing store is reused without re-uploading.
4. Query Gemini with structured prompts for each citation, using the fileSearch tool with metadata filtering to scope each query to the relevant source documents
5. Classify each citation as SUPPORTED / PARTIAL / UNSUPPORTED / NOT_IN_STORE / ERROR
6. Verify grounding: for SUPPORTED/PARTIAL results, extract source PDF text via pymupdf4llm and run token-level LCS alignment to confirm the passage Gemini quoted actually exists in the source. Ungrounded passages are flagged [UNGROUNDED] in the report.
7. Write report to REVIEW-CITES.md

Bib File Format

The --bib flag expects a .bib file where entries have a file field with a path relative to the bib file's directory. Paperpile's exported paperpile.bib follows this convention:

@article{Hu2024-bm,
  author = {Edwin Hu and ...},
  title = {{Custom proxy voting advice}},
  file = {All Papers/H/Hu et al. 2024 - Custom proxy voting advice.pdf},
  year = {2024}
}

All bib entries are parsed, but only entries that have a file field (~95% of Paperpile entries) and whose bibkeys are actually cited in the drafts are imported into the File Search store.
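
To make the bibkey-to-file mapping concrete, here is a small regex-based sketch. It is an assumption about one way to do it, not the tool's parser: `parseBibFiles` and `mergeFirstWins` are hypothetical names, and a regex like this will not handle every BibTeX edge case (nested braces inside the file field, for instance).

```typescript
// Map bibkey -> `file` field for entries that have one.
// A regex sketch, not a full BibTeX parser.
function parseBibFiles(bibText: string): Map<string, string> {
  const out = new Map<string, string>();
  // Each entry starts with a header such as "@article{Hu2024-bm,".
  const headers = [...bibText.matchAll(/@(\w+)\s*\{\s*([^,\s]+)\s*,/g)];
  headers.forEach((m, i) => {
    const start = m.index! + m[0].length;
    const end = i + 1 < headers.length ? headers[i + 1].index! : bibText.length;
    const body = bibText.slice(start, end);
    const file = body.match(/\bfile\s*=\s*\{([^}]*)\}/i);
    // Entries without a file field are skipped.
    if (file && !out.has(m[2])) out.set(m[2], file[1].trim());
  });
  return out;
}

// Merge several parsed bib files; the first file to define a key wins.
function mergeFirstWins(maps: Map<string, string>[]): Map<string, string> {
  const merged = new Map<string, string>();
  for (const map of maps) {
    for (const [key, path] of map) {
      if (!merged.has(key)) merged.set(key, path);
    }
  }
  return merged;
}
```

Passing the parsed maps to mergeFirstWins in --bib order reproduces the "first bib wins on duplicate keys" rule.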

Citation Features

Bracketed [@key] and in-text @key citations
Locators: [@key, p. 42]
Compound cites: [@a; @b] (queried together)
Footnote indirection: citations in [^id]: footnote bodies
Bluebook signals: see, cf., see also, etc. (softens verification)
Parenthetical extraction: [@key] (holding that X)
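
A rough sketch of how bracketed citations with locators, compound keys, and signals might be pulled out of a draft. It covers only the bracketed forms, and it rests on assumptions: the `Cite` shape, the `extractBracketedCites` name, the signal list, and the idea that a signal appears as a prefix inside the brackets (as in [see @key, p. 42]) are all illustrative rather than taken from cite-extract.ts.

```typescript
interface CiteItem { key: string; locator?: string }
interface Cite { line: number; signal?: string; items: CiteItem[] }

// Assumed signal vocabulary; the real list may be longer.
const SIGNALS = ["see also", "see", "cf."];

function extractBracketedCites(markdown: string): Cite[] {
  const cites: Cite[] = [];
  markdown.split("\n").forEach((text, idx) => {
    // A bracket group containing at least one "@"; footnote markers
    // such as [^1] contain no "@" and are skipped.
    for (const m of text.matchAll(/\[([^\[\]]*@[^\[\]]*)\]/g)) {
      const items: CiteItem[] = [];
      let signal: string | undefined;
      // Compound cites are separated by semicolons: [@a; @b]
      for (const part of m[1].split(";")) {
        const km = part
          .trim()
          .match(/^(.*?)@([A-Za-z0-9_](?:[\w:.-]*[A-Za-z0-9_])?)(?:,\s*(.*))?$/);
        if (!km) continue;
        const prefix = km[1].trim().toLowerCase();
        if (signal === undefined && SIGNALS.includes(prefix)) signal = prefix;
        items.push({ key: km[2], locator: km[3]?.trim() });
      }
      if (items.length > 0) cites.push({ line: idx + 1, signal, items });
    }
  });
  return cites;
}
```

Because footnote bodies are ordinary lines, a citation inside a [^id]: definition is picked up the same way, although this sketch reports the footnote's line rather than the line of the marker that references it.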

Output

REVIEW-CITES.md with:

Summary counts (supported/partial/unsupported/not in store/error/ungrounded)
Details table: status, file:line, bibkey, claim, response
[UNGROUNDED] flag on any SUPPORTED/PARTIAL result whose passage failed grounding verification

Batch Mode (Default)

By default, all citation queries are submitted as a single Gemini Batch API job using the File Search tool with metadata filtering. Each query is scoped to the relevant source documents via bibkey metadata, so there is no cross-contamination between queries.

# Default (batch)
bun cite-check.ts --bib paperpile.bib --drafts ./drafts

# Sequential (one query at a time, useful for debugging)
bun cite-check.ts --bib paperpile.bib --drafts ./drafts --sequential

The --sequential flag runs each query as an individual generateContent call instead of a batch job. This is useful for debugging or when batch jobs hit rate limits.

Audit Mode

Run --audit before checking citations to see which sources are available and which need to be added:

bun cite-check.ts --bib paperpile.bib --bib sources.bib --drafts ./drafts --audit

The audit checks each cited bibkey for PDF availability on disk (via bib file field with cross-directory resolution). Missing sources should be added to Paperpile.

Exit code is 1 if any sources are missing, 0 if all sources are available. No Gemini store is created and no queries are sent.

Passage Grounding

After Gemini returns a SUPPORTED/PARTIAL result with a supporting_passage, the tool verifies the passage actually exists in the source PDF text using token-level LCS alignment (ported from langextract's WordAligner). Two gates reject bad matches:

Coverage gate (default 0.75): at least 75% of passage tokens must appear in the matched source span
Density gate (default 0.33): matched tokens must be at least 33% of the source span length (rejects scattered matches)

Signal cites (see, cf., etc.) use relaxed thresholds (0.5 coverage / 0.2 density) since they only need conceptual alignment.
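
The two gates can be illustrated with a classic token-level LCS. This is a simplified sketch under stated assumptions, not a port of grounding.ts or of langextract's WordAligner: the `tokenize` and `groundPassage` names are hypothetical, tokenization is ASCII-only for brevity, and the matched span is taken as the range between the first and last matched source token.

```typescript
// ASCII-only tokenizer, kept deliberately simple for the sketch.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

interface Grounding { coverage: number; density: number; grounded: boolean }

function groundPassage(
  passage: string,
  source: string,
  minCoverage = 0.75,
  minDensity = 0.33,
): Grounding {
  const p = tokenize(passage);
  const s = tokenize(source);
  const none: Grounding = { coverage: 0, density: 0, grounded: false };
  if (p.length === 0 || s.length === 0) return none;

  // dp[i][j] = LCS length of p[i..] and s[j..]
  const dp = Array.from({ length: p.length + 1 }, () =>
    new Array<number>(s.length + 1).fill(0),
  );
  for (let i = p.length - 1; i >= 0; i--) {
    for (let j = s.length - 1; j >= 0; j--) {
      dp[i][j] = p[i] === s[j]
        ? dp[i + 1][j + 1] + 1
        : Math.max(dp[i + 1][j], dp[i][j + 1]);
    }
  }

  // Walk the table to recover which source positions were matched.
  let i = 0, j = 0, matched = 0, first = -1, last = -1;
  while (i < p.length && j < s.length) {
    if (p[i] === s[j]) {
      if (first < 0) first = j;
      last = j;
      matched++; i++; j++;
    } else if (dp[i + 1][j] >= dp[i][j + 1]) {
      i++;
    } else {
      j++;
    }
  }
  if (matched === 0) return none;

  const coverage = matched / p.length;           // share of passage tokens found
  const density = matched / (last - first + 1);  // how tightly they cluster
  return { coverage, density, grounded: coverage >= minCoverage && density >= minDensity };
}
```

Calling groundPassage(passage, source, 0.5, 0.2) applies the relaxed thresholds used for signal cites. One limitation of this sketch: the walk accepts the earliest possible match for each token, so a common word matched far from the real passage can widen the span and lower density; a production aligner would likely search for the tightest span instead.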

Grounding requires extracting text from the source PDF. This uses pymupdf4llm (via extract-pdf-text.py) which preserves document structure, footnotes, and tables as clean markdown. Extracted text is cached in <drafts>/.cite-check-text/.

The Gemini File Search API behind the primary grounding signal — store creation, metadata filtering, and the groundingMetadata shape grounding.ts parses — is documented in skills/gemini-batch/references/file-search.md.

Google Drive FUSE Bypass

PDF files stored on Google Drive Desktop's FUSE mount (~/Google Drive/My Drive/) are subject to EDEADLK deadlocks when accessed concurrently or when not locally cached. The tool detects Google Drive paths — including through symlinks (e.g., references/All Papers → ~/Google Drive/.../Paperpile/All Papers) — and uses rclone copyto to fetch them to a local cache (~/.cache/cite-check-pdfs/) before upload or text extraction. Requires rclone with a google-drive: remote configured.
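
A sketch of the detection step and the rclone command it might produce. Several pieces are assumptions made for illustration: the function names, the "/Google Drive/My Drive/" marker, the cache file naming, and the mapping from a local path to a google-drive: remote path (this assumes the remote's root corresponds to My Drive). The code only builds the argument list; it does not run rclone.

```typescript
import { realpathSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

// Assumed marker for the Google Drive Desktop mount.
const DRIVE_MARKER = "/Google Drive/My Drive/";

// Resolve symlinks first so that references/All Papers -> ~/Google Drive/...
// is still recognized as a Drive path. Returns the path relative to My Drive,
// or null for ordinary local files.
function driveRelativePath(
  path: string,
  realpath: (p: string) => string = (p) => realpathSync(p),
): string | null {
  let real: string;
  try {
    real = realpath(path);
  } catch {
    real = path; // fall back to the literal path if it cannot be resolved
  }
  const at = real.indexOf(DRIVE_MARKER);
  return at < 0 ? null : real.slice(at + DRIVE_MARKER.length);
}

// Build an `rclone copyto` argv that fetches the file into the local cache.
function rcloneCopyArgs(
  path: string,
  cacheDir: string = join(homedir(), ".cache", "cite-check-pdfs"),
  realpath: (p: string) => string = (p) => realpathSync(p),
): string[] | null {
  const rel = driveRelativePath(path, realpath);
  if (rel === null) return null; // local file: read it directly
  // Flatten the relative path so same-named PDFs do not collide in the cache.
  const cached = join(cacheDir, rel.split("/").join("__"));
  return ["rclone", "copyto", `google-drive:${rel}`, cached];
}
```

The returned array can be handed to spawnSync from node:child_process; the last element is the local path to read from afterwards.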

Architecture

cite-extract.ts          -- Pure citation extraction (no I/O)
gemini.ts                -- Gemini API wrapper (File Search store CRUD, query, legacy upload for ask mode, rclone FUSE bypass)
grounding.ts             -- Post-hoc passage grounding (tokenizer, LCS aligner)
extract-pdf-text.py      -- PDF text extraction via pymupdf4llm
materialize-sources.ts   -- Copy Paperpile PDFs + Readwise articles to references/
cite-check.ts            -- CLI orchestrator (extract -> import -> query -> ground -> report)

