.claude/skills/stata/SKILL.md
Context and tools for working with Stata — writing .do files, running them in batch mode, and efficiently consulting the bundled PDF documentation.
Use this skill whenever writing, editing, running, or debugging Stata .do files,
or when you need to look up Stata syntax, commands, or options.
Stata is a statistical software package for data management, analysis, and graphics. It is widely used in economics, political science, epidemiology, and other social sciences.
Key concepts:
- .do files: Scripts (plain text) containing Stata commands, executed sequentially.
- .dta files: Stata's binary data format.
- .log files: Text output captured during a Stata session or batch run.
- .ado files: Stata programs (user-written or official) that extend functionality.
- Macros: local (temporary) and global (persistent within session) named values.
- Factor variables: i.varname notation for categorical regressors.
- Postestimation commands: run after a model is fit (predict, margins, estat).

Finding the installation: Stata is typically installed under C:\Program Files\ on Windows, in a Stata* or StataNow* directory.
To auto-detect the path:
STATA_DIR=$(ls -d "/c/Program Files"/Stata* "/c/Program Files"/StataNow* 2>/dev/null | sort -V | tail -1)
echo "$STATA_DIR"
From bash (Claude Code terminal):
stata -b do path/to/script.do # batch mode, creates .log file
stata path/to/script.do # interactive window
IMPORTANT: ALWAYS use the stata wrapper command (at ~/bin/stata), NEVER call
StataSE-64.exe directly. The wrapper auto-moves batch-mode logs from the current
directory to quality_reports/stata_logs/. Calling the exe directly leaves stray
.log files in the project root.
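The wrapper script itself is not reproduced in this document. As a rough illustration of the log handling described above, the Python sketch below moves the `<script>.log` file that a batch run leaves in the working directory into `quality_reports/stata_logs/`. The function name `relocate_batch_log` and its parameters are hypothetical and are not taken from the actual `~/bin/stata` wrapper.

```python
import shutil
from pathlib import Path
from typing import Optional


def relocate_batch_log(do_file: str, workdir: str,
                       dest: str = "quality_reports/stata_logs") -> Optional[Path]:
    """Move <script>.log from workdir into workdir/dest.

    Hypothetical helper: it assumes the batch log is named after the .do
    file and sits in the directory Stata was launched from (workdir).
    Returns the new path, or None if no log was found.
    """
    log_name = Path(do_file).stem + ".log"
    src = Path(workdir) / log_name
    if not src.exists():
        return None  # nothing to move, e.g. Stata never started
    target_dir = Path(workdir) / dest
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / log_name
    shutil.move(str(src), str(target))
    return target
```

Calling `relocate_batch_log("path/to/script.do", ".")` after a batch run would leave the project root free of stray logs, which is the outcome the wrapper is meant to guarantee.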
From PowerShell (user terminal):
stata -b do path\to\script.do
The stata alias must point to StataSE-64.exe (or StataMP-64.exe depending on
edition). See SETUP.md for how to configure this.
Checking for errors after batch run:
grep "^r(" script.log # Stata error codes start with r(
If the log contains r( lines, the script hit an error at that point.
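When a log is checked from a script rather than by eye, the same test can be written in Python. The sketch below is one possible version, not part of the skill: `find_stata_errors` is a hypothetical helper, and the sample log text is invented for illustration. It reports the line number and return code of every line that starts with `r(`.

```python
import re
from typing import List, Tuple

# Stata prints return codes such as "r(601);" at the start of a log line.
ERROR_LINE = re.compile(r"^r\((\d+)\);?")


def find_stata_errors(log_text: str) -> List[Tuple[int, int]]:
    """Return (line_number, return_code) pairs for each r(...) line in a log."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        match = ERROR_LINE.match(line)
        if match:
            hits.append((lineno, int(match.group(1))))
    return hits


# Invented sample: a failed "use" followed by return code 601 (file not found).
sample_log = ". use missing.dta\nfile missing.dta not found\nr(601);\n"
print(find_stata_errors(sample_log))  # [(3, 601)]
```

The line number points at where the script stopped, so the commands just above it in the log are the ones to inspect.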
.do Files: Essentials
version 17
clear all
set more off
* --- 0. Paths ---
do config_local.do // sets $root
* --- 1. Load data ---
use "$root/data/processed/tweets_processed.dta", clear
* --- 2. Analysis ---
reg vote_share engagement_score, robust
* --- 3. Output ---
esttab using "$root/output/tables/table1.tex", booktabs label replace
graph export "$root/output/figures/fig1.png", replace width(2400)
- version 17 at top for reproducibility
- $root global macro (set in config_local.do)
- clear all / set more off at the start
- replace on all output commands (allows re-running)
- width(2400) for ~300 DPI figures at 8 inches
- graphregion(color(white)) bgcolor(white) for white backgrounds
- /// for line continuation in long commands
- encode assigns codes alphabetically; verify ordering before interpreting
- sort is not stable; add enough keys to uniquely identify rows
- vce(cluster var) not just robust when observations are grouped
- destring for numeric-as-string; encode for true categoricals. Never confuse them.

Stata bundles its full documentation as PDFs inside the installation directory:
<STATA_INSTALL_DIR>/docs/
To find the docs directory automatically:
STATA_DOCS=$(ls -d "/c/Program Files"/Stata*/docs "/c/Program Files"/StataNow*/docs 2>/dev/null | sort -V | tail -1)
echo "$STATA_DOCS"
| File | Manual | Pages | Key contents |
|------|--------|-------|--------------|
| r.pdf | Base Reference | 3,502 | regress, logit, probit, test, predict, margins — the most-used manual |
| u.pdf | User's Guide | 403 | Stata basics, syntax, data types, programming intro |
| d.pdf | Data Management | 1,000 | import, merge, reshape, append, encode, destring |
| ts.pdf | Time Series | 1,026 | arima, var, vec, irf, tsset |
| xt.pdf | Panel Data | 699 | xtreg, xtlogit, xtpoisson, xtset |
| me.pdf | Mixed Effects | 572 | mixed, melogit, mepoisson |
| st.pdf | Survival Analysis | 645 | stset, stcox, streg, sts |
| mv.pdf | Multivariate | 750 | factor, pca, cluster, manova |
| sem.pdf | SEM | 680 | sem, gsem, path diagrams |
| g.pdf | Graphics | 799 | twoway, graph bar, scheme, options |
| p.pdf | Programming | 667 | program, macro, mata interface |
| m.pdf | Mata | 1,214 | Stata's matrix programming language |
| bayes.pdf | Bayesian | 911 | bayesmh, bayesian estimation |
| causal.pdf | Causal Inference | 746 | teffects, didregress, stteffects |
| lasso.pdf | Lasso | 394 | lasso, elasticnet, cross-validation |
| mi.pdf | Multiple Imputation | 400 | mi impute, mi estimate |
| svy.pdf | Survey | 236 | svyset, svy: prefix |
| tables.pdf | Tables | 361 | collect, table, dtable, etable |
| pss.pdf | Power/Sample Size | 869 | power, sample size calculations |
| meta.pdf | Meta-Analysis | 439 | meta set, meta forestplot |
| fn.pdf | Functions | 193 | Built-in functions reference |
| i.pdf | Glossary/Index | 328 | Combined subject index |
| stoc.pdf | Subject TOC | 59 | Combined table of contents across all manuals |
| adapt.pdf | Adaptive Designs | 252 | Group sequential trials |
| bma.pdf | Bayesian Model Averaging | 241 | bmaregress, model selection |
| cm.pdf | Choice Models | 329 | cmclogit, conditional logit, mixed logit |
| dsge.pdf | DSGE Models | 179 | Dynamic stochastic general equilibrium |
| erm.pdf | Extended Regression | 307 | Extended regression models (endogeneity, selection, treatment) |
| fmm.pdf | Finite Mixture Models | 149 | fmm prefix, latent class |
| gsm.pdf | Getting Started (Mac) | 158 | Mac-specific setup guide |
| gsu.pdf | Getting Started (Unix) | 165 | Unix-specific setup guide |
| gsw.pdf | Getting Started (Windows) | 161 | Windows-specific setup guide |
| h2oml.pdf | H2O Machine Learning | 379 | h2oml, random forest, gradient boosting |
| ig.pdf | Installation Guide | 21 | License, installation |
| irt.pdf | Item Response Theory | 251 | irt, Rasch, 2PL, 3PL models |
| rpt.pdf | Reporting | 222 | putdocx, putpdf, collect, automated reports |
| sp.pdf | Spatial | 232 | spregress, spatial autoregressive models |
Start with stoc.pdf (59 pages) to find which manual covers a topic.
Problem: The manuals listed above total roughly 20,000 pages of PDFs, and reading even one in full wastes tokens.
Solution: Use targeted extraction and never read a full manual.
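To make the cost concrete, the sketch below turns page counts into rough token estimates. It relies on one figure quoted in this document (r.pdf: 3,502 pages, about 2.1M tokens) and assumes, as a simplification, that token density is similar across manuals. `estimate_tokens` is a hypothetical helper name.

```python
# Figures quoted in this document: r.pdf has 3,502 pages and costs roughly
# 2.1M tokens to read in full, which works out to about 600 tokens per page.
TOKENS_PER_PAGE = 2_100_000 / 3_502


def estimate_tokens(pages: int) -> int:
    """Rough token cost of extracting `pages` pages of a Stata manual."""
    return round(pages * TOKENS_PER_PAGE)


print(estimate_tokens(59))  # stoc.pdf read in full
print(estimate_tokens(21))  # a 21-page targeted extract
```

Under this assumption a targeted extract of a few dozen pages costs a small fraction of a whole manual, which is the reason for extracting by page range.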
Prerequisites: pdftotext (bundled with poppler/mingw) and pip install pdfplumber.
See SETUP.md for installation.
pdftotext (fast, plain text, best for prose)
# Auto-detect docs path
STATA_DOCS=$(ls -d "/c/Program Files"/Stata*/docs "/c/Program Files"/StataNow*/docs 2>/dev/null | sort -V | tail -1)
# Extract specific pages (e.g., pages 1200-1220 for regress)
pdftotext -f 1200 -l 1220 "$STATA_DOCS/r.pdf" -
# Search the subject TOC for a command
pdftotext "$STATA_DOCS/stoc.pdf" - | grep -i "regress"
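If the extraction is driven from Python instead of the shell, the same `pdftotext` call can be assembled as an argument list. This is a minimal sketch: `pdftotext_cmd` and `extract_pages` are hypothetical names, and `extract_pages` assumes `pdftotext` is on the PATH.

```python
import subprocess
from typing import List


def pdftotext_cmd(pdf_path: str, first: int, last: int) -> List[str]:
    """Build a pdftotext call for pages first..last (1-based, inclusive)."""
    if first < 1 or last < first:
        raise ValueError("need 1 <= first <= last")
    # "-" as the output file sends the extracted text to stdout.
    return ["pdftotext", "-f", str(first), "-l", str(last), pdf_path, "-"]


def extract_pages(pdf_path: str, first: int, last: int) -> str:
    """Run pdftotext and return the text (requires pdftotext on PATH)."""
    result = subprocess.run(pdftotext_cmd(pdf_path, first, last),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with paths such as `C:\Program Files`, which contain a space.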
pdfplumber (Python, best for tables and structured content)
import glob
import os

import pdfplumber


def find_stata_docs():
    """Auto-detect the Stata docs directory (returns None if not found)."""
    for pattern in [r"C:\Program Files\StataNow*\docs",
                    r"C:\Program Files\Stata*\docs"]:
        matches = glob.glob(pattern)
        if matches:
            return sorted(matches)[-1]
    return None


def stata_doc_lookup(manual: str, start_page: int, end_page: int) -> str:
    """Extract text from a Stata manual. Pages are 0-indexed; end_page is exclusive."""
    docs = find_stata_docs()
    if docs is None:
        # Fail with a clear message instead of a TypeError from os.path.join
        raise FileNotFoundError("No Stata docs directory found under C:\\Program Files")
    path = os.path.join(docs, manual)
    with pdfplumber.open(path) as pdf:
        text = []
        for i in range(start_page, min(end_page, len(pdf.pages))):
            page_text = pdf.pages[i].extract_text()
            if page_text:  # skip pages with no extractable text
                text.append(page_text)
    return "\n".join(text)


# Example: read TOC of r.pdf to find page numbers
print(stata_doc_lookup("r.pdf", 2, 8))
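The two tools count pages differently: `pdftotext -f/-l` takes a 1-based inclusive range, while `stata_doc_lookup` uses 0-based indices and stops before `end_page`. A small conversion helper, sketched here under the hypothetical name `to_lookup_range`, keeps the two in step. It only converts indexing conventions; page numbers printed inside a manual may still be offset from PDF positions.

```python
from typing import Tuple


def to_lookup_range(first_page: int, last_page: int) -> Tuple[int, int]:
    """Convert a 1-based inclusive page range to 0-based (start, end-exclusive)."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("need 1 <= first_page <= last_page")
    return first_page - 1, last_page


start, end = to_lookup_range(1200, 1220)
print(start, end)  # 1199 1220
```

With these values, `stata_doc_lookup("r.pdf", start, end)` would cover the same 21 pages as `pdftotext -f 1200 -l 1220`.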
pdftotext + grep (search without reading)
STATA_DOCS=$(ls -d "/c/Program Files"/Stata*/docs "/c/Program Files"/StataNow*/docs 2>/dev/null | sort -V | tail -1)
# List lines that mention "margins" in the base reference (grep -n prints text line numbers, not PDF page numbers)
pdftotext "$STATA_DOCS/r.pdf" - | grep -n "margins"
# Find a command across ALL manuals
for f in "$STATA_DOCS"/*.pdf; do
  if pdftotext "$f" - 2>/dev/null | grep -q "didregress"; then
    # Quote "$f": the path contains a space ("Program Files")
    echo "Found in: $(basename "$f")"
  fi
done
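The searches above report matching lines or files, not pages. Because `pdftotext` separates pages with a form feed character, matches can be mapped back to page numbers with a few lines of Python. The following is a sketch only; `pages_mentioning` is a hypothetical helper and the sample string stands in for real `pdftotext` output.

```python
from typing import List


def pages_mentioning(pdftotext_output: str, term: str) -> List[int]:
    """Return 1-based page numbers whose text contains term (case-insensitive)."""
    # pdftotext ends each page with a form feed character ("\f").
    pages = pdftotext_output.split("\f")
    needle = term.lower()
    return [number for number, page in enumerate(pages, start=1)
            if needle in page.lower()]


# Invented three-page sample in place of real pdftotext output.
sample = "regress intro\fmargins after regress\fappendix"
print(pages_mentioning(sample, "regress"))  # [1, 2]
```

The returned numbers can be passed directly to `pdftotext -f` and `-l` to pull only the relevant pages.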
- stoc.pdf: search the subject TOC to identify which manual covers the topic
- pdftotext -f START -l END manual.pdf - (extract only the pages needed)
- Full r.pdf: ~2.1M tokens (NEVER do this)
- stoc.pdf (59 pages): ~35K tokens (acceptable for initial lookup)

| Task | Where to look |
|------|--------------|
| Regression syntax | r.pdf, search TOC for "regress" |
| Merge datasets | d.pdf, search for "merge" |
| Panel data models | xt.pdf, search for "xtreg" |
| Export LaTeX tables | tables.pdf (collect, etable); esttab is community-contributed (estout package), so use help esttab rather than the bundled PDFs |
| Graph options | g.pdf TOC |
| String functions | fn.pdf or d.pdf search "string functions" |
| Date/time handling | d.pdf or u.pdf chapter on dates |
| Causal inference | causal.pdf TOC |
| Survey weights | svy.pdf TOC |