brycewang-stanford/academic-paper-verify

Name: academic-paper-verify
Author: brycewang-stanford

skills/27-dariia-m-my_claude_skills/paper_verification/SKILL.md

npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research academic-paper-verify

Academic Paper Verification

A systematic skill for verifying the integrity and replicability of an academic research paper. This covers everything from individual coefficient checks to full end-to-end replication.

Overview

Verification proceeds in six phases. Each phase produces structured output. Do not skip phases - earlier phases feed into later ones.

Phase 1: Discovery       -> inventory of all project files, scripts, outputs, paper
Phase 2: Table Audit     -> cross-check every number in every table
Phase 3: Inline Claims   -> verify quantitative claims in paper body text
Phase 4: Code Review     -> audit R scripts for correctness, modeling decisions, data pipeline
Phase 5: Manifest Build  -> create verification_manifest.json linking claims to code
Phase 6: Replication     -> write and run tests/verify_replication.R, fix failures

Before You Start

Identify the project root directory. Look for .Rproj files, README, or ask the user.
Read references/phase-details.md for the full procedure for each phase.
Read references/common-pitfalls.md for known failure modes to watch for.

Phase 1: Discovery

Scan the entire project and build an inventory. You need to know what you're working with before you can verify anything.

Find and catalog:

All .R and .Rmd scripts (note execution order if a master script exists)
All output files: .csv, .rds, .tex, .txt, .log in results/, output/, tables/, etc.
The LaTeX paper file(s): .tex in the root or paper/ or draft/ directory
Any data files: .csv, .dta, .rds, .xlsx in data/ or similar
Any configuration or parameter files

Produce: A file inventory printed to the console, organized by type, with notes on what each script appears to do (based on filename and a quick scan of its first ~30 lines).

Key questions to answer in this phase:

Is there a master script that runs everything in order?
Where do intermediate outputs land?
Which scripts produce which tables/figures?
Are there any scripts that appear unused or orphaned?
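
The inventory step above could be sketched as follows. This is a minimal illustration in Python using only the standard library, not part of the skill itself: the `CATEGORIES` mapping and the function name `build_inventory` are hypothetical, and the extension list is taken from the file types named in this phase. Extensions that can be either data or output (`.csv`, `.rds`) or either paper or generated table (`.tex`) are deliberately left in mixed groups, because telling them apart requires looking at the directory and the scripts.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical mapping from file extension to inventory group.
# Ambiguous extensions are kept in mixed groups for manual sorting.
CATEGORIES = {
    ".r": "script", ".rmd": "script", ".do": "script", ".py": "script",
    ".dta": "data", ".xlsx": "data",
    ".txt": "output", ".log": "output",
    ".csv": "data_or_output", ".rds": "data_or_output",
    ".tex": "paper_or_output",
    ".rproj": "project",
}


def build_inventory(root):
    """Group project files by type, returning {group: [relative paths]}."""
    root = Path(root)
    inventory = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Skip hidden files and directories such as .git or .Rproj.user
        if any(part.startswith(".") for part in rel.parts):
            continue
        group = CATEGORIES.get(path.suffix.lower())
        if group is not None:
            inventory[group].append(rel.as_posix())
    return dict(inventory)


def print_inventory(inventory):
    """Print the inventory to the console, one group at a time."""
    for group in sorted(inventory):
        print(f"{group} ({len(inventory[group])} files)")
        for name in inventory[group]:
            print(f"  {name}")
```

Calling `print_inventory(build_inventory("path/to/project"))` gives a starting list; the notes on what each script does still have to be written by reading the scripts.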

Phase 2: Table Audit

This is the most critical phase. Read references/phase-details.md Section 2 for the full procedure.

For every table in the paper:

1. Locate the table in the LaTeX source. Extract every number: coefficients, standard errors, t-statistics, p-values, confidence intervals, sample sizes (N), R-squared, F-statistics, means, medians, percentages - everything.
2. Locate the corresponding R output file that produced this table. This might be a .tex file generated by stargazer, modelsummary, xtable, kableExtra, huxtable, or similar. It could also be a .csv, .rds, or text log.
3. Cross-check every single number. Compare to the R output with appropriate tolerance:
   - Coefficients and standard errors: match to the number of decimal places shown
   - Sample sizes: must match exactly
   - R-squared and similar: match to displayed precision
   - Percentages: verify the arithmetic (numerator/denominator)
4. Check for rounding consistency - if a coefficient is 0.0347 in the R output and 0.035 in the paper, that is acceptable rounding. If it is 0.038, that is a discrepancy.
5. Verify that column headers, variable names, and panel labels in the paper match the specification in the code.
6. Check that the number of observations (N) is consistent across all tables that use the same sample. If Table 1 reports N=4,521 and Table 3 uses the same sample but reports N=4,519, that needs explanation.

Produce: A table-by-table verification report. For each table:

Table number and title
Source R script and output file
Number of values checked
List of any discrepancies with exact locations (paper line number, output file line number)
PASS/FAIL status
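
The rounding rule in this phase (0.0347 printed as 0.035 is fine, 0.038 is not) can be made mechanical. The sketch below is one possible way to do it in Python, under the assumption that both values are available as the strings printed in the paper and in the R output; the function names are hypothetical. It uses `decimal` rather than floats so that the number of decimal places shown in the paper is preserved, and it accepts both half-up and half-even rounding because tie-breaking conventions differ between tools.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def _parse(text):
    """Parse a printed number, dropping thousands separators such as 4,521."""
    return Decimal(text.replace(",", "").strip())


def decimals_shown(text):
    """Number of decimal places in a printed number, e.g. '0.035' gives 3."""
    return max(0, -_parse(text).as_tuple().exponent)


def is_acceptable_rounding(paper_value, output_value):
    """True if output_value rounds to paper_value at the paper's precision."""
    paper = _parse(paper_value)
    output = _parse(output_value)
    quantum = Decimal(1).scaleb(-decimals_shown(paper_value))
    # Assumption of this sketch: either common tie-breaking mode is acceptable
    return any(
        output.quantize(quantum, rounding=mode) == paper
        for mode in (ROUND_HALF_UP, ROUND_HALF_EVEN)
    )


print(is_acceptable_rounding("0.035", "0.0347"))  # True: acceptable rounding
print(is_acceptable_rounding("0.038", "0.0347"))  # False: discrepancy
print(is_acceptable_rounding("4,519", "4521"))    # False: N must match exactly
```

Because sample sizes are printed with zero decimal places, the same function enforces the exact-match rule for N without a special case.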

Phase 3: Inline Claims Audit

Read the paper body text (not just tables) and find every quantitative claim. These include:

"We find a 3.2 percentage point increase..."
"The effect is significant at the 5% level..."
"Our sample includes 12,450 observations..."
"Column 3 of Table 2 shows that..."
"The coefficient on X is negative and significant..."
Footnotes with numbers or statistical claims
Abstract claims about magnitudes and significance

For each claim, trace it back to a specific table cell, figure, or R output. Flag any claim that cannot be traced or that contradicts the evidence.

Produce: A claims checklist with claim text, source location in paper, evidence source, and VERIFIED/UNVERIFIED/DISCREPANCY status.
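
A first pass over the body text can be automated to seed the checklist. The following is a rough sketch, not a complete claim detector: the regular expressions and the function name `find_candidate_claims` are assumptions made for illustration, and the keyword list only covers the kinds of phrasing quoted above. Every candidate still needs to be read and traced by hand, and claims phrased without numbers or these keywords will be missed.

```python
import re

# A number such as 3.2, 12,450, or 5; not preceded by a letter, digit, or dot
NUMBER = re.compile(r"(?<![\w.])\d+(?:,\d{3})*(?:\.\d+)?")
# Qualitative wording that usually signals a statistical claim
KEYWORDS = re.compile(
    r"\b(?:insignifican\w*|significan\w*|positive|negative)\b", re.IGNORECASE
)
# A LaTeX comment starts at an unescaped percent sign
COMMENT = re.compile(r"(?<!\\)%.*")


def find_candidate_claims(tex_source):
    """Return one checklist entry per source line that may contain a claim."""
    candidates = []
    for line_no, line in enumerate(tex_source.splitlines(), start=1):
        text = COMMENT.sub("", line).strip()
        numbers = NUMBER.findall(text)
        keywords = [word.lower() for word in KEYWORDS.findall(text)]
        if numbers or keywords:
            candidates.append({
                "line": line_no,
                "text": text,
                "numbers": numbers,
                "keywords": keywords,
                "status": "UNVERIFIED",  # stays this way until traced to evidence
            })
    return candidates
```

Each entry starts as UNVERIFIED so that nothing is counted as checked until it has been traced to a table cell, figure, or R output.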

Phase 4: Code Review

Read every R script in the project, in execution order. This is not just a syntax check - you are auditing the analytical pipeline. Read references/phase-details.md Section 4 and references/common-pitfalls.md for what to look for.

Data Pipeline Verification:

At every merge, join, filter, subset, or mutate step, check: (a) How many observations before vs. after the transformation? (b) Do all column names needed downstream still exist? (c) Are key summary statistics (mean, min, max, N) reasonable after the step?
Flag any joins that could silently drop or duplicate observations
Flag any filters that might be too aggressive or too permissive
Check for proper handling of missing values (NA) - are they dropped, imputed, or ignored?
Verify that panel/time-series data is properly balanced or that imbalance is handled
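
To make the before/after observation check concrete, the sketch below predicts what a join will do to the row count from the key columns alone. It is a language-neutral illustration written in Python with the standard library; in an R project the same counts would come from the data frames themselves. The function name `join_diagnostics` and the example keys are hypothetical.

```python
from collections import Counter


def join_diagnostics(left_keys, right_keys):
    """Summarize how joining left to right on a key would change row counts."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    return {
        "left_rows": sum(left.values()),
        # Each left row is repeated once per matching right row
        "inner_join_rows": sum(n * right[k] for k, n in left.items() if k in right),
        # Left rows with no partner are dropped by an inner join
        "left_rows_unmatched": sum(n for k, n in left.items() if k not in right),
        # Right-side keys that occur more than once duplicate left rows
        "duplicated_right_keys": sorted(
            k for k, n in right.items() if n > 1 and k in left
        ),
    }


# Hypothetical firm identifiers on each side of a merge
report = join_diagnostics(["A", "B", "C", "D"], ["A", "B", "B", "E"])
print(report)
```

In this example the inner join yields 3 rows from 4: two left rows (C and D) are dropped, and key B is duplicated because it appears twice on the right. Both effects are silent unless the counts are compared.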

Modeling Decisions:

Are the regression specifications consistent with what the paper describes? (e.g., if the paper says "we control for year fixed effects", is that in the code?)
Are standard errors clustered as described? (robust, clustered at the right level, etc.)
Are instrumental variables correctly specified? (first stage, exclusion restriction checks)
Is the sample restriction for each regression clearly defined and consistent with the paper?
Are interaction terms, polynomials, or transformations correctly implemented?
Do subsample analyses actually use the right subsamples?

Robustness and Red Flags:

Are there hardcoded values that should be computed? (e.g., filter(year > 2005) when the paper says "post-treatment period" without defining the cutoff)
Are there commented-out lines that suggest alternative specifications were tried?
Is there any evidence of p-hacking patterns (many specifications tried, only one reported)?
Are random seeds set for any stochastic procedures?
Are there warnings or errors being suppressed?

Produce: A script-by-script review with:

Script name and purpose
Data pipeline issues (with line numbers)
Modeling decision flags (with line numbers)
Red flags (with line numbers)
Overall assessment: CLEAN / MINOR ISSUES / MAJOR ISSUES

Phase 5: Build Verification Manifest

Create verification_manifest.json that maps every quantitative claim in the paper to the code that produces it.

Structure:

{
  "paper_file": "paper/main.tex",
  "generated_at": "2026-02-08T12:00:00Z",
  "claims": [
    {
      "id": "T1_R2_C3",
      "type": "coefficient",
      "paper_location": {"file": "paper/main.tex", "line": 234, "context": "Table 1, Row 2, Col 3"},
      "paper_value": "0.035",
      "source_script": "code/02_main_regression.R",
      "source_line": 87,
      "output_file": "results/table1.tex",
      "output_location": {"line": 15, "context": "second coefficient in column 3"},
      "expected_value": "0.0347",
      "tolerance": 0.001,
      "status": "PASS",
      "notes": "Acceptable rounding from 0.0347 to 0.035"
    },
    {
      "id": "BODY_P12_S3",
      "type": "inline_claim",
      "paper_location": {"file": "paper/main.tex", "line": 412, "context": "paragraph 12, sentence 3"},
      "paper_value": "3.2 percentage points",
      "source_script": "code/02_main_regression.R",
      "source_line": 87,
      "output_file": "results/table1.tex",
      "output_location": {"line": 15},
      "expected_value": "0.0323",
      "tolerance": 0.001,
      "status": "PASS",
      "notes": "Coefficient 0.0323 reported as 3.2pp"
    }
  ],
  "summary": {
    "total_claims": 142,
    "passed": 139,
    "failed": 2,
    "unverified": 1
  }
}

Every coefficient, standard error, sample size, p-value, summary statistic, and verbal claim should appear in this manifest. Be exhaustive.
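
Once the manifest is complete, its internal consistency can be checked before it is used for replication. The sketch below is one way to do this in Python, assuming the field names from the structure above. The set of status values is an assumption: only "PASS" appears in the example, so "FAIL" and "UNVERIFIED" are inferred from the summary keys. The function name `validate_manifest` is hypothetical.

```python
# Fields that appear on every claim in the example structure
REQUIRED_FIELDS = {
    "id", "type", "paper_location", "paper_value", "source_script",
    "output_file", "expected_value", "tolerance", "status",
}
# Assumed mapping from claim status to the matching summary counter
STATUS_TO_SUMMARY = {"PASS": "passed", "FAIL": "failed", "UNVERIFIED": "unverified"}


def validate_manifest(manifest):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    claims = manifest.get("claims", [])
    counts = {key: 0 for key in STATUS_TO_SUMMARY.values()}
    seen_ids = set()
    for index, claim in enumerate(claims):
        label = claim.get("id", f"claims[{index}]")
        missing = sorted(REQUIRED_FIELDS - claim.keys())
        if missing:
            problems.append(f"{label}: missing fields {missing}")
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
        summary_key = STATUS_TO_SUMMARY.get(claim.get("status"))
        if summary_key is None:
            problems.append(f"{label}: unknown status {claim.get('status')!r}")
        else:
            counts[summary_key] += 1
    summary = manifest.get("summary", {})
    if summary.get("total_claims") != len(claims):
        problems.append(
            f"summary.total_claims is {summary.get('total_claims')}, "
            f"but the manifest lists {len(claims)} claims"
        )
    for key, count in counts.items():
        if summary.get(key) != count:
            problems.append(
                f"summary.{key} is {summary.get(key)}, but {count} claims have that status"
            )
    return problems
```

Load the file with `json.load` and pass the resulting dictionary in. Note that the abbreviated example above would fail this check, since it shows 2 claims against a summary of 142; a real manifest lists every claim.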

Phase 6: Replication Test Suite

Write tests/verify_replication.R that programmatically reruns the analysis and checks results against the manifest.

Read references/replication-script-template.md for the template and structure.

The test script must:

Source or rerun each analysis script in the correct order
Extract the relevant outputs (coefficients, SEs, N, R-squared, etc.)
Compare against the values in verification_manifest.json
Use appropriate tolerance for floating-point comparisons
Report PASS/FAIL for each claim with clear diagnostics on failure
Handle dependencies gracefully (if a data file is missing, report it, do not crash)

After writing the test script:

Run it
For any failures, diagnose the root cause
If the failure is due to a code bug (not a paper-code mismatch), fix the upstream script and document what you fixed
Rerun until all tests pass or all remaining failures are genuine paper-code discrepancies
Produce a final summary

Produce:

tests/verify_replication.R - the test script
tests/replication_results.json - structured test results
tests/replication_summary.md - human-readable summary of what passed, what failed, what was fixed, and what remains unresolved
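
The comparison logic the test script needs is small. The sketch below shows it in Python for illustration only; the skill asks for the script itself to be written in R, following references/replication-script-template.md, which is not reproduced here. The function names and the shape of `replicated` (a mapping from claim id to the rerun value) are assumptions. It covers three of the requirements above: tolerance-based comparison, a diagnostic on every result, and reporting a missing file instead of crashing.

```python
import json
from collections import Counter
from pathlib import Path


def load_manifest(path):
    """Return the parsed manifest, or None (with a message) if the file is missing."""
    path = Path(path)
    if not path.exists():
        print(f"MISSING: {path} not found; no claims were checked")
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def check_claim(claim, replicated):
    """Compare one manifest claim against the value obtained by rerunning the code."""
    claim_id = claim["id"]
    if claim_id not in replicated:
        return {"id": claim_id, "status": "UNVERIFIED",
                "detail": "no replicated value available"}
    try:
        expected = float(claim["expected_value"])
        actual = float(replicated[claim_id])
    except (TypeError, ValueError):
        return {"id": claim_id, "status": "UNVERIFIED",
                "detail": "expected or replicated value is not numeric"}
    tolerance = float(claim.get("tolerance", 0.0))
    diff = abs(actual - expected)
    # Small slack so that a difference equal to the tolerance is not failed
    # by floating-point representation error
    status = "PASS" if diff <= tolerance + 1e-12 else "FAIL"
    detail = f"expected {expected}, got {actual}, |diff| = {diff:.6g}, tolerance {tolerance}"
    return {"id": claim_id, "status": status, "detail": detail}


def run_checks(manifest, replicated):
    """Check every claim and return (results, counts by status)."""
    results = [check_claim(claim, replicated) for claim in manifest["claims"]]
    return results, Counter(result["status"] for result in results)
```

A claim with no rerun value is reported as UNVERIFIED rather than FAIL, which keeps genuine paper-code discrepancies separate from checks that could not be run.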

Output Format

At the end of the full verification, produce a consolidated report. Use this structure:

# Paper Verification Report

## Executive Summary
- Total quantitative claims checked: X
- Passed: Y
- Failed: Z
- Unverified: W
- Code issues found: N (M major, K minor)

## Table-by-Table Results
[from Phase 2]

## Inline Claims Results
[from Phase 3]

## Code Review Findings
[from Phase 4]

## Replication Test Results
[from Phase 6]

## Recommendations
[prioritized list of issues to address]

Important Notes

Never silently skip a number. If you cannot verify a value, mark it UNVERIFIED with an explanation.
When in doubt, flag it. False positives are better than missed discrepancies.
Pay special attention to N (sample sizes) - sample sizes are a frequent source of inconsistencies across tables and text, and a mismatch often points to an undocumented filter or merge.
If the project uses R packages that produce formatted output (stargazer, modelsummary, etc.), check the raw model objects too, not just the formatted output.
If you encounter Stata .do files or Python scripts mixed in, verify those too using the same principles.
The user may want you to run this on a subset (e.g., "just check Table 3"). Adapt accordingly but note what was not checked.
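
The note on sample sizes above lends itself to a simple cross-table check. The sketch below assumes the reported N values have already been collected as (location, sample label, N) triples; deciding which tables share a sample is a judgment made while reading the paper, so the labels are supplied by the reviewer. The function name and the example values for the second sample are hypothetical.

```python
from collections import defaultdict


def inconsistent_sample_sizes(reported):
    """Return {sample label: {location: N}} for samples reported with differing N."""
    by_sample = defaultdict(dict)
    for location, sample, n in reported:
        by_sample[sample][location] = n
    return {
        sample: locations
        for sample, locations in by_sample.items()
        if len(set(locations.values())) > 1
    }


reported = [
    ("Table 1", "full sample", 4521),
    ("Table 3", "full sample", 4519),
    ("Table 2", "subsample", 2210),  # hypothetical second sample
]
print(inconsistent_sample_sizes(reported))
```

Here the full sample is flagged because Table 1 and Table 3 disagree by two observations, which is the kind of difference that needs an explanation in the report.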

Thoroughly verify all code, tables, figures, modeling decisions, and quantitative claims in an academic paper against its source R scripts and output files. Use this skill whenever you need to audit, replicate, or verify an academic research paper - including cross-checking LaTeX tables against R output, validating econometric modeling choices, ensuring sample sizes are consistent, building a verification manifest, and running automated replication tests. Trigger this skill for any mention of: paper verification, replication check, table audit, code-paper consistency, reproducing results, verifying estimates, checking coefficients, or any variant of "does the paper match the code."

538 stars

development

Updated Apr 30, 2026

npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research academic-paper-verify

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanner           Description                                        Status    Percent
Trivy             Container and dependency vulnerability scanner     Clean     95%
Semgrep           Static code analysis for vulnerabilities           Clean     95%
mcp-scan (Snyk)   Model Context Protocol security validation         Clean     95%
Snyk (dep)        Open source security scanning                      Skipped   50%
Socket.dev        Supply chain security analysis                     Skipped   50%
VirusTotal        Multi-engine malware detection                     Skipped   50%
CrowdStrike       Advanced threat intelligence                       Skipped   50%
OSV-Scanner       Open Source Vulnerability database check           Skipped   50%
OWASP Dep-Check                                                      Skipped   50%

Last scanned: Apr 30, 2026, 5:08 AM (12.5s, 4 files scanned)

Install manually

# Clone the repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r Awesome-Agent-Skills-for-Empirical-Research/skills/27-dariia-m-my_claude_skills/paper_verification ~/.claude/skills/

Repository

brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research

538 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT