skills/cite-check/SKILL.md
Verify academic citations against source PDFs using Gemini File Search API. Use when 'check citations', 'verify cites', 'cite-check', 'run citation review', 'are my citations grounded', 'does source X support claim Y', 'what does source X say about Y', or validating that pandoc citations in markdown drafts are supported by their source documents.
Scan pandoc-flavored markdown drafts for citations, upload source PDFs to a Gemini File Search store, and verify each citation is grounded in its source. Produces a structured REVIEW-CITES.md report.
- GOOGLE_API_KEY env var set (Google AI Studio; on this machine: export GOOGLE_API_KEY="$(cat $GEMINI_API_KEY_FILE)")
- rclone with a google-drive: remote configured (used to bypass Google Drive FUSE deadlocks)
- python3 with pymupdf4llm installed (used for PDF text extraction in passage grounding)
- readwise CLI installed and authenticated (for Readwise article export in source materialization)
- .bib files with file fields mapping bibkeys to PDF paths (e.g., Paperpile's paperpile.bib)

Before running cite-check, materialize all sources locally:
cd ${CLAUDE_SKILL_DIR}
bun materialize-sources.ts \
--bib ~/Google\ Drive/My\ Drive/resources/Paperpile/paperpile.bib \
--bib ./references/sources.bib \
--refs ./references \
--drafts ./drafts \
--debug
This populates references/ with local copies of all cited sources:
- PDFs: rclone copy from Google Drive → references/<bibkey>.pdf
- Readwise articles: exported to references/<bibkey>.md

After materialization, cite-check operates purely locally.
cd ${CLAUDE_SKILL_DIR}
bun install # first time only
# Single bib file
bun cite-check.ts --bib ~/Google\ Drive/My\ Drive/resources/Paperpile/paperpile.bib --drafts <path-to-drafts>
# Multiple bib files (Paperpile + project-local; first bib wins on duplicate keys)
bun cite-check.ts \
--bib ~/Google\ Drive/My\ Drive/resources/Paperpile/paperpile.bib \
--bib ./references/sources.bib \
--drafts <path-to-drafts>
| Flag | Required | Default | Description |
|------|----------|---------|-------------|
| --bib <path> | Yes* | -- | Path to .bib file (repeatable; first wins on duplicate keys) |
| --store <id> | No | auto-create | Use existing File Search store ID |
| --drafts <dir> | No | ./drafts | Directory with markdown draft files |
| --out <path> | No | <drafts>/REVIEW-CITES.md | Output report path |
| --limit <n> | No | all | Check only first N citations (smoke test) |
| --dry-run | No | false | Print prompts without querying |
| --sequential | No | false | Run queries one-at-a-time instead of Batch API (default: batch) |
| --retry-model <model> | No | gemini-3.1-pro-preview | Retry UNSUPPORTED results with a stronger model |
| --audit | No | false | Audit source availability without querying (checks Paperpile PDFs) |
| --debug | No | false | Verbose logging |
*Either --bib or --store is required.
Ask a specific question about a single source:
# Does Bebchuk2019 support a specific claim?
bun cite-check.ts ask @Bebchuk2019-uq "do expense ratios fall since 2010?" --bib paperpile.bib
# What does a source say about a topic?
bun cite-check.ts ask @Brav2022-ht "what are retail turnout rates?" --bib paperpile.bib --bib sources.bib
The ask mode uploads the single source PDF via the legacy Files API (with manifest caching, 48h TTL), queries Gemini with inline file references, and prints the answer with supporting passages to stdout. No File Search store is created and no report is generated.
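The manifest caching with a 48h TTL can be pictured roughly as follows. This is a minimal sketch, not the tool's actual code: the `ManifestEntry` shape, the `reusableUpload` name, and the example URI are hypothetical, and only the 48-hour window comes from the description above.

```typescript
// Hypothetical manifest entry; the real manifest format is not documented here.
interface ManifestEntry {
  fileUri: string;    // identifier returned by a previous upload (assumed field)
  uploadedAt: number; // upload time in epoch milliseconds (assumed field)
}

const TTL_MS = 48 * 60 * 60 * 1000; // 48h TTL, as stated for ask mode

// Return a cached upload if it is still inside the TTL window, else null.
function reusableUpload(
  manifest: Record<string, ManifestEntry>,
  bibkey: string,
  now: number,
): string | null {
  const entry = manifest[bibkey];
  if (!entry) return null; // never uploaded
  return now - entry.uploadedAt < TTL_MS ? entry.fileUri : null;
}

// Example with a made-up URI: uploaded at t = 0, checked 47 hours later.
const manifest: Record<string, ManifestEntry> = {
  "Bebchuk2019-uq": { fileUri: "files/abc", uploadedAt: 0 },
};
console.log(reusableUpload(manifest, "Bebchuk2019-uq", 47 * 60 * 60 * 1000));
```

A null result would mean the PDF has to be uploaded again before the question is sent.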
When multiple --bib files are provided, file paths are resolved across all bib directories. This handles the common case where a project-local sources.bib has file = {All Papers/...} paths that are relative to the Paperpile folder rather than the project's references/ directory. The tool tries each bib directory as a fallback when the primary path doesn't exist on disk.
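The cross-directory fallback might look something like the sketch below. The function name `resolveBibFile` and the example paths are hypothetical; the existence check is injected so the example does not depend on any particular filesystem API.

```typescript
// Hypothetical helper: try a bib `file` field against each bib directory in turn.
function resolveBibFile(
  fileField: string,
  bibDirs: string[], // directory of each --bib file, primary first
  exists: (path: string) => boolean,
): string | null {
  for (const dir of bibDirs) {
    const candidate = `${dir.replace(/\/+$/, "")}/${fileField}`;
    if (exists(candidate)) return candidate; // first directory holding the file wins
  }
  return null; // not found under any bib directory
}

// Example: sources.bib lives in /project/references, but the PDF sits under
// the Paperpile folder (both paths are made up for illustration).
const onDisk = new Set(["/gdrive/Paperpile/All Papers/H/Hu 2024.pdf"]);
const resolved = resolveBibFile(
  "All Papers/H/Hu 2024.pdf",
  ["/project/references", "/gdrive/Paperpile"],
  (p) => onDisk.has(p),
);
console.log(resolved);
```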
1. Extract citations from the markdown drafts ([@bibkey] syntax)
2. Resolve each cited bibkey to a source PDF via bib file fields
3. Import the PDFs into a Gemini File Search store, fetching Google Drive files through rclone to avoid EDEADLK deadlocks. Stores persist across runs (no 48h TTL); if cited sources have not changed, the existing store is reused without re-uploading.
4. Query Gemini using the fileSearch tool with metadata filtering to scope each query to the relevant source documents
5. Ground each result: extract source text with pymupdf4llm and run token-level LCS alignment to confirm the passage Gemini quoted actually exists in the source. Ungrounded passages are flagged [UNGROUNDED] in the report.

The --bib flag expects a .bib file where entries have a file field with a path relative to the bib file's directory. Paperpile's exported paperpile.bib follows this convention:
@article{Hu2024-bm,
author = {Edwin Hu and ...},
title = {{Custom proxy voting advice}},
file = {All Papers/H/Hu et al. 2024 - Custom proxy voting advice.pdf},
year = {2024}
}
All bib entries are parsed, and entries with a file field (~95% of Paperpile entries) are eligible for import. Only sources whose bibkeys are actually cited in the drafts are imported into the File Search store.
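The bib handling described above (read file fields, let the first bib win on duplicate keys, keep only cited keys) can be sketched as follows. This is an illustrative approximation, not a full BibTeX parser: `parseBibFiles` and `sourcesToImport` are hypothetical names, and the regex assumes one file = {...} field per entry with no nested braces.

```typescript
// Map each bibkey to its raw `file` field. Entries without a file field are skipped.
function parseBibFiles(bibText: string): Map<string, string> {
  const files = new Map<string, string>();
  // Entry headers look like "@article{Hu2024-bm,"
  const headers = [...bibText.matchAll(/@(\w+)\s*\{\s*([^,\s]+)\s*,/g)];
  headers.forEach((m, i) => {
    const start = m.index ?? 0;
    const end = i + 1 < headers.length ? headers[i + 1].index ?? bibText.length : bibText.length;
    const file = bibText.slice(start, end).match(/\bfile\s*=\s*\{([^{}]*)\}/);
    if (file) files.set(m[2], file[1]);
  });
  return files;
}

// Merge several bibs in --bib order, keeping only keys cited in the drafts.
function sourcesToImport(bibTexts: string[], citedKeys: Set<string>): Map<string, string> {
  const selected = new Map<string, string>();
  for (const text of bibTexts) {
    for (const [key, file] of parseBibFiles(text)) {
      // An earlier bib wins on duplicate keys; uncited keys are ignored.
      if (citedKeys.has(key) && !selected.has(key)) selected.set(key, file);
    }
  }
  return selected;
}

const paperpile = `@article{Hu2024-bm,
  title = {{Custom proxy voting advice}},
  file = {All Papers/H/Hu et al. 2024 - Custom proxy voting advice.pdf},
  year = {2024}
}`;
// Hypothetical project-local bib that redefines the same key and adds an uncited one.
const local = `@article{Hu2024-bm, file = {local-copy.pdf} }
@article{Uncited2020, file = {uncited.pdf} }`;

console.log(sourcesToImport([paperpile, local], new Set(["Hu2024-bm"])));
```

The sketch keeps only the raw file field; resolving it against the right bib directory is a separate step.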
- [@key] and in-text @key citations
- Page locators: [@key, p. 42]
- Multiple sources: [@a; @b] (queried together)
- [^id]: footnote bodies
- Signals: see, cf., see also, etc. (softens verification)
- Parentheticals: [@key] (holding that X)

The report is written to REVIEW-CITES.md with:

- [UNGROUNDED] flag on any SUPPORTED/PARTIAL result whose passage failed grounding verification

By default, all citation queries are submitted as a single Gemini Batch API job using the File Search tool with metadata filtering. Each query is scoped to the relevant source documents via bibkey metadata, so there is no cross-contamination between queries.
# Default (batch)
bun cite-check.ts --bib paperpile.bib --drafts ./drafts
# Sequential (one query at a time, useful for debugging)
bun cite-check.ts --bib paperpile.bib --drafts ./drafts --sequential
The --sequential flag runs each query as an individual generateContent call instead of a batch job. This is useful for debugging or when batch jobs hit rate limits.
Run --audit before checking citations to see which sources are available and which need to be added:
bun cite-check.ts --bib paperpile.bib --bib sources.bib --drafts ./drafts --audit
The audit checks each cited bibkey for PDF availability on disk (via bib file field with cross-directory resolution). Missing sources should be added to Paperpile.
Exit code is 1 if any sources are missing, 0 if all sources are available. No Gemini store is created and no queries are sent.
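The exit-code rule is simple enough to sketch directly. The `auditSources` name and its input shape are hypothetical; only the 0/1 convention comes from the description above.

```typescript
// Given the cited bibkeys and whatever path (if any) each resolved to on disk,
// report the missing ones and the exit code an audit run would return.
function auditSources(
  citedKeys: string[],
  resolvedPaths: Map<string, string | null>, // bibkey -> PDF path, or null if not found
): { missing: string[]; exitCode: 0 | 1 } {
  const missing = citedKeys.filter((key) => !resolvedPaths.get(key));
  return { missing, exitCode: missing.length > 0 ? 1 : 0 };
}

// Example: one source is on disk, one is not (paths are made up).
const audit = auditSources(
  ["Hu2024-bm", "Brav2022-ht"],
  new Map<string, string | null>([
    ["Hu2024-bm", "/refs/Hu2024-bm.pdf"],
    ["Brav2022-ht", null],
  ]),
);
console.log(audit);
```

A non-zero exit code makes the audit usable as a gate in a script before a full run.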
After Gemini returns a SUPPORTED/PARTIAL result with a supporting_passage, the tool verifies the passage actually exists in the source PDF text using token-level LCS alignment (ported from langextract's WordAligner). Two gates reject bad matches: a coverage gate and a density gate.
Signal cites (see, cf., etc.) use relaxed thresholds (0.5 coverage / 0.2 density) since they only need conceptual alignment.
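A rough sketch of token-level LCS grounding with two gates follows. It is not the ported WordAligner. The definitions used here are assumptions: coverage is taken as the share of passage tokens found in the LCS, and density as LCS length divided by the width of the source span the matches occupy. Only the relaxed signal thresholds (0.5 / 0.2) come from the text; the default thresholds are left to the caller.

```typescript
// Lowercase and split into alphanumeric tokens (a simplification of real tokenization).
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

interface Gates { minCoverage: number; minDensity: number }
const SIGNAL_GATES: Gates = { minCoverage: 0.5, minDensity: 0.2 }; // relaxed values from the text

function groundPassage(passage: string, source: string, gates: Gates) {
  const p = tokenize(passage);
  const s = tokenize(source);
  // dp[i][j] = LCS length of p[i..] and s[j..]
  const dp = Array.from({ length: p.length + 1 }, () => new Array<number>(s.length + 1).fill(0));
  for (let i = p.length - 1; i >= 0; i--) {
    for (let j = s.length - 1; j >= 0; j--) {
      dp[i][j] = p[i] === s[j] ? dp[i + 1][j + 1] + 1 : Math.max(dp[i + 1][j], dp[i][j + 1]);
    }
  }
  // Walk the table to recover which source token positions were matched.
  const matched: number[] = [];
  for (let i = 0, j = 0; i < p.length && j < s.length; ) {
    if (p[i] === s[j]) { matched.push(j); i++; j++; }
    else if (dp[i + 1][j] >= dp[i][j + 1]) i++;
    else j++;
  }
  const lcs = matched.length;
  const coverage = p.length ? lcs / p.length : 0; // assumed definition
  const span = lcs ? matched[lcs - 1] - matched[0] + 1 : 0;
  const density = span ? lcs / span : 0; // assumed definition
  return { coverage, density, grounded: coverage >= gates.minCoverage && density >= gates.minDensity };
}

const result = groundPassage(
  "expense ratios have fallen since 2010",
  "Since 2010, average expense ratios have fallen sharply across index funds.",
  SIGNAL_GATES,
);
console.log(result);
```

The full table costs memory proportional to passage length times source length, so a real implementation would likely need something more economical for long PDFs.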
Grounding requires extracting text from the source PDF. This uses pymupdf4llm (via extract-pdf-text.py) which preserves document structure, footnotes, and tables as clean markdown. Extracted text is cached in <drafts>/.cite-check-text/.
PDF files stored on Google Drive Desktop's FUSE mount (~/Google Drive/My Drive/) are subject to EDEADLK deadlocks when accessed concurrently or when not locally cached. The tool detects Google Drive paths — including through symlinks (e.g., references/All Papers → ~/Google Drive/.../Paperpile/All Papers) — and uses rclone copyto to fetch them to a local cache (~/.cache/cite-check-pdfs/) before upload or text extraction. Requires rclone with a google-drive: remote configured.
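The detection and fetch step could be approximated as below. This sketch only builds the argument list for rclone copyto and leaves process spawning to the caller. It assumes that "My Drive/" corresponds to the root of the google-drive: remote and that the input path has already had symlinks resolved (for example with Node's fs.realpathSync); the function name and the basename-only cache layout are hypothetical.

```typescript
// Assumption: everything under this marker lives at the root of the `google-drive:` remote.
const DRIVE_MARKER = "/Google Drive/My Drive/";

// Return rclone arguments for fetching a Drive-backed file into the cache,
// or null when the path is not on the Google Drive mount.
function rcloneCopytoArgs(realPath: string, cacheDir: string): string[] | null {
  const at = realPath.indexOf(DRIVE_MARKER);
  if (at === -1) return null; // ordinary local file; read it directly
  const remoteRel = realPath.slice(at + DRIVE_MARKER.length);
  const base = remoteRel.split("/").pop() ?? remoteRel;
  // Simplification: cache by file name only, so identical names would collide.
  return ["copyto", `google-drive:${remoteRel}`, `${cacheDir}/${base}`];
}

// Example with made-up paths.
const args = rcloneCopytoArgs(
  "/Users/me/Google Drive/My Drive/resources/Paperpile/All Papers/H/Hu.pdf",
  "/Users/me/.cache/cite-check-pdfs",
);
console.log(args);
```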
cite-extract.ts -- Pure citation extraction (no I/O)
gemini.ts -- Gemini API wrapper (File Search store CRUD, query, legacy upload for ask mode, rclone FUSE bypass)
grounding.ts -- Post-hoc passage grounding (tokenizer, LCS aligner)
extract-pdf-text.py -- PDF text extraction via pymupdf4llm
materialize-sources.ts -- Copy Paperpile PDFs + Readwise articles to references/
cite-check.ts -- CLI orchestrator (extract -> import -> query -> ground -> report)
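As an illustration of what pure citation extraction can look like, the sketch below pulls bracketed pandoc citations out of a markdown string, including page locators, multi-source brackets, and leading signals. It is not the contents of cite-extract.ts: the `Cite` shape and function name are hypothetical, and in-text @key citations and trailing parentheticals are not handled.

```typescript
interface Cite {
  keys: string[];              // all bibkeys in one bracket (queried together)
  locators: (string | null)[]; // e.g. "p. 42", aligned with keys
  signal: string | null;       // "see", "cf.", "see also" if present
}

const SIGNALS = ["see also", "see", "cf."]; // longest first so "see also" wins over "see"

function extractBracketCites(markdown: string): Cite[] {
  const cites: Cite[] = [];
  // Brackets containing at least one "@", with no nested brackets.
  for (const m of markdown.matchAll(/\[([^\[\]]*@[^\[\]]+)\]/g)) {
    const keys: string[] = [];
    const locators: (string | null)[] = [];
    let signal: string | null = null;
    for (const part of m[1].split(";").map((p) => p.trim())) {
      // optional prefix, "@key", optional ", locator"
      const km = part.match(/^(.*?)@([A-Za-z0-9_][\w:.\-\/]*)(?:\s*,\s*(.+))?$/);
      if (!km) continue;
      const prefix = km[1].trim().toLowerCase();
      if (signal === null) signal = SIGNALS.find((s) => prefix.startsWith(s)) ?? null;
      keys.push(km[2]);
      locators.push(km[3] ?? null);
    }
    if (keys.length > 0) cites.push({ keys, locators, signal });
  }
  return cites;
}

const found = extractBracketCites(
  "Fees fell [see @Bebchuk2019-uq, p. 42] and turnout is low [@a; @b].[^1]",
);
console.log(JSON.stringify(found, null, 2));
```

Because the function takes a string and returns plain data, it can be tested without touching the filesystem or the Gemini API.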