skills/14-luischanci-claude-code-research-starter/dot-claude/skills/stata/SKILL.md
Comprehensive Stata reference for writing correct .do files, data management, econometrics, causal inference, graphics, Mata programming, and 17+ community packages (reghdfe, estout, did, rdrobust, etc.). Covers syntax, options, gotchas, and idiomatic patterns. Use this skill whenever the user asks you to write, debug, or explain Stata code.
You have access to comprehensive Stata reference files. Do not load all files. Read only the 1-3 files relevant to the user's current task using the routing table below.
These are Stata-specific pitfalls that lead to silent bugs. Internalize these before writing any code.
Stata's missing values (. and the extended codes .a-.z) compare as greater than every number, so a condition like x > k is true when x is missing.
* WRONG — includes observations where income is missing!
gen high_income = (income > 50000)
* RIGHT
gen high_income = (income > 50000) if !missing(income)
* WRONG — missing ages appear in this list
list if age > 60
* RIGHT
list if age > 60 & !missing(age)
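The size of this problem can be checked directly. The following is a minimal sketch using Stata's built-in auto dataset (an assumption made for illustration; the variable rep78 there contains some missing values), comparing a naive count with a guarded one.

```stata
sysuse auto, clear            // built-in example data; rep78 has missing values

count if rep78 > 3            // naive: also counts rows where rep78 is missing
local naive = r(N)

count if rep78 > 3 & !missing(rep78)
local safe = r(N)

count if missing(rep78)
local n_missing = r(N)

* The naive count exceeds the guarded count by exactly the number of missings
assert `naive' == `safe' + `n_missing'
```

The same guard applies to any >, >=, or != comparison on a variable that may be missing.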
= vs ==
= is assignment; == is comparison. Mixing them up is a syntax error or silent bug.
* WRONG — syntax error
gen employed = 1 if status = 1
* RIGHT
gen employed = 1 if status == 1
Locals use `name' (backtick + single-quote). Globals use $name or ${name}.
Forgetting the closing quote is the #1 macro bug.
local controls "age education income"
regress wage `controls' // correct
regress wage `controls // WRONG — missing closing quote
regress wage 'controls' // WRONG — wrong quote characters
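Globals and nested locals follow the same expansion rules. The sketch below uses hypothetical macro names (outdir, spec, controls2) purely for illustration; it shows when the ${name} form is needed and how an inner local is expanded before the outer one.

```stata
* All macro names here are illustrative, not part of any real project
global outdir "results"
local spec 2
local controls2 "mpg weight length"

display "$outdir/table1.tex"    // expansion stops at the slash
display "${outdir}_v2"          // braces needed: $outdir_v2 would look up a global named outdir_v2
display "`controls`spec''"      // inner `spec' expands first, giving `controls2'

assert "`controls`spec''" == "mpg weight length"
```

Referencing an undefined macro is not an error: it expands to an empty string, which is why a misspelled macro name tends to fail silently.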
by Requires Prior Sort (Use bysort)
* WRONG: error if data not sorted by id
by id: gen first = (_n == 1)
* RIGHT — bysort sorts automatically
bysort id: gen first = (_n == 1)
* Also RIGHT — explicit sort
sort id
by id: gen first = (_n == 1)
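When the order within each group matters, bysort also accepts a secondary sort key in parentheses. A minimal sketch, assuming the built-in auto dataset and an illustrative variable name (cheapest):

```stata
sysuse auto, clear

* Sort by foreign, then by price within foreign, but group only on foreign
bysort foreign (price): gen byte cheapest = (_n == 1)

list make price foreign if cheapest

* Exactly one flagged observation per group (foreign has two levels)
count if cheapest
assert r(N) == 2
```

Without the parenthesized key, which observation ends up as _n == 1 within a group depends on the existing row order.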
Factor Variable Notation (i. and c.)
Use i. for categorical variables and c. for continuous ones. Omitting i. treats the category codes as a continuous regressor.
* WRONG: treats race as continuous (forces the effect to be linear in the numeric codes)
regress wage race education
* RIGHT — creates dummies automatically
regress wage i.race education
* Interactions
regress wage i.race##c.education // full interaction
regress wage i.race#c.education // interaction only (no main effects)
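Two related pieces of factor-variable syntax are often useful alongside the above: changing the base category with ib#. and asking margins for group-specific slopes after an interaction. This is a sketch on the built-in auto dataset, chosen here only so the commands run as written.

```stata
sysuse auto, clear

* Default base level is the lowest category (rep78 == 1)
regress price i.rep78 mpg

* ib3. makes category 3 the omitted base level instead
regress price ib3.rep78 mpg

* Full interaction, then the slope of mpg separately for each level of foreign
regress price i.foreign##c.mpg
margins foreign, dydx(mpg)
```

Because the model was specified with i. and c., margins knows which variables are categorical and computes the slopes accordingly.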
generate vs replace
generate creates new variables; replace modifies existing ones. Using generate on an existing variable name is an error.
gen x = 1
gen x = 2 // ERROR: x already defined
replace x = 2 // correct
* String comparison is case-sensitive: this misses "Male", "MALE", etc.
keep if gender == "male"
* Safer
keep if lower(gender) == "male"
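Stray whitespace causes the same kind of silent mismatch as case differences. A minimal sketch with made-up data, combining lower() with strtrim() (the variable name gender_clean is illustrative):

```stata
clear
input str8 gender
"male"
"Male"
" MALE "
"female"
end

* Normalize case and strip leading/trailing blanks before comparing
gen gender_clean = lower(strtrim(gender))

count if gender_clean == "male"
assert r(N) == 3
```

Tabulating the raw variable first (tab gender) is a quick way to see which variants actually occur before deciding how to normalize.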
merge: Always Check _merge
merge 1:1 id using other.dta
tab _merge // always inspect
assert _merge == 3 // or handle mismatches
drop _merge
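When mismatches are expected, they need to be handled rather than asserted away. The following sketch builds two small made-up datasets (all names and values are hypothetical) so the unmatched cases are visible, then keeps only matched observations.

```stata
* Hypothetical "using" data: ids 1, 2, 4
clear
input id x
1 10
2 20
4 40
end
tempfile extra
save `extra'

* Hypothetical master data: ids 1, 2, 3
clear
input id y
1 5
2 6
3 7
end

merge 1:1 id using `extra'
tab _merge                  // 1 = master only, 2 = using only, 3 = matched

keep if _merge == 3         // explicit decision: drop unmatched rows
drop _merge
assert _N == 2
```

The same decision can be written inline with the keep(match) and nogenerate options of merge; inspecting _merge first makes the number of dropped rows visible.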
preserve / restore for Temporary Changes
preserve
collapse (mean) income, by(state)
* ... do something with collapsed data ...
restore // original data is back
- fweight: frequency weights (replication)
- aweight: analytic/regression weights (inverse variance)
- pweight: probability/sampling weights (survey data, implies robust SE)
- iweight: importance weights (rarely used)
capture Swallows Errors
capture some_command
if _rc != 0 {
di as error "Failed with code: " _rc
exit _rc
}
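One detail worth sketching: _rc is overwritten by the next captured command, so it is usually safer to copy it into a local immediately. The example below assumes the built-in auto dataset and a deliberately nonexistent variable name (nonexistent_var).

```stata
sysuse auto, clear

* confirm fails quietly under capture; save the return code right away
capture confirm variable nonexistent_var
local rc_confirm = _rc
if `rc_confirm' != 0 {
    display as text "variable not found (rc = `rc_confirm')"
}

* capture noisily still shows the output but does not stop the do-file on error
capture noisily regress price mpg
local rc_reg = _rc

assert `rc_confirm' != 0
assert `rc_reg' == 0
```

Plain capture hides both the output and the error message, so an unchecked capture can make a failed step look like it succeeded.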
Line Continuation Uses ///
regress y x1 x2 x3 ///
x4 x5 x6, ///
vce(robust)
r() vs e() vs s()
- r(): r-class commands (summarize, tabulate, etc.)
- e(): e-class commands (estimation: regress, logit, etc.)
- s(): s-class commands (parsing)
A new estimation command overwrites previous e() results. Store them first:
regress y x1 x2
estimates store model1
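To make the overwrite behaviour concrete, here is a minimal sketch on the built-in auto dataset (the stored names m1 and m2 are illustrative). It reads an r() result, stores two models, and restores the first one after it has been overwritten.

```stata
sysuse auto, clear

* r-class: results disappear when the next r-class command runs, so copy them
summarize price
local mean_price = r(mean)

* e-class: each estimation command replaces e()
regress price mpg
estimates store m1

regress price mpg weight
estimates store m2
display e(df_m)              // model degrees of freedom for the latest model

* Bring the first model back into e() and use its coefficients
estimates restore m1
display _b[mpg]
assert e(df_m) == 1

estimates table m1 m2        // side-by-side comparison of stored results
```

return list and ereturn list show everything currently held in r() and e(), which helps when it is unclear what a command leaves behind.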
Read only the files relevant to the user's task. Paths are relative to this SKILL.md file.
| File | Topics & Key Commands |
|------|----------------------|
| references/basics-getting-started.md | use, save, describe, browse, sysuse, basic workflow |
| references/data-import-export.md | import delimited, import excel, ODBC, export, web data |
| references/data-management.md | generate, replace, merge, append, reshape, collapse, recode, egen, encode/decode |
| references/variables-operators.md | Variable types, byte/int/long/float/double, operators, missing values (.<.a), if/in qualifiers |
| references/string-functions.md | substr(), regexm(), strtrim(), split, ustrlen(), regex, Unicode |
| references/date-time-functions.md | date(), clock(), %td/%tc formats, mdy(), dofm(), business calendars |
| references/mathematical-functions.md | round(), log(), exp(), abs(), mod(), cond(), distributions, random numbers |
| File | Topics & Key Commands |
|------|----------------------|
| references/descriptive-statistics.md | summarize, tabulate, correlate, tabstat, codebook, weighted stats |
| references/linear-regression.md | regress, vce(robust), vce(cluster), test, lincom, margins, predict, ivregress |
| references/panel-data.md | xtset, xtreg fe/re, Hausman test, xtabond, dynamic panels |
| references/time-series.md | tsset, ARIMA, VAR, dfuller, pperron, irf, forecasting |
| references/limited-dependent-variables.md | logit, probit, tobit, poisson, nbreg, mlogit, ologit, margins for nonlinear |
| references/bootstrap-simulation.md | bootstrap, simulate, permute, Monte Carlo |
| references/survey-data-analysis.md | svyset, svy:, subpop(), complex survey design, replicate weights |
| references/missing-data-handling.md | mi impute, mi estimate, FIML, misstable, diagnostics |
| references/maximum-likelihood.md | ml model, custom likelihood functions, ml init, gradient-based optimization |
| references/gmm-estimation.md | gmm, moment conditions, estat overid, J-test |
| File | Topics & Key Commands |
|------|----------------------|
| references/treatment-effects.md | teffects ra/ipw/ipwra/aipw, stteffects, ATE/ATT/ATET |
| references/difference-in-differences.md | DiD, parallel trends, event studies, staggered adoption |
| references/regression-discontinuity.md | Sharp/fuzzy RD, bandwidth selection, rdplot |
| references/matching-methods.md | PSM, nearest neighbor, kernel matching, teffects nnmatch |
| references/sample-selection.md | heckman, heckprobit, treatment models, exclusion restrictions |
| File | Topics & Key Commands |
|------|----------------------|
| references/survival-analysis.md | stset, stcox, streg, Kaplan-Meier, parametric models |
| references/sem-factor-analysis.md | sem, gsem, CFA, path analysis, alpha, reliability |
| references/nonparametric-methods.md | kdensity, rank tests, qreg, npregress |
| references/spatial-analysis.md | spmatrix, spregress, spatial weights, Moran's I |
| references/machine-learning.md | lasso, elasticnet, cvlasso, cross-validation |
| File | Topics & Key Commands |
|------|----------------------|
| references/graphics.md | twoway, scatter, line, bar, histogram, graph combine, graph export, schemes |
| File | Topics & Key Commands |
|------|----------------------|
| references/programming-basics.md | local, global, foreach, forvalues, program define, syntax, return |
| references/advanced-programming.md | syntax, mata, classes, _prefix, dialog boxes, tempfile/tempvar |
| references/mata-introduction.md | Mata basics, when to use Mata vs ado, data types |
| references/mata-programming.md | Mata functions, flow control, structures, pointers |
| references/mata-matrix-operations.md | Matrix creation, decompositions, solvers, st_matrix() |
| references/mata-data-access.md | st_data(), st_view(), st_store(), performance tips |
| File | Topics & Key Commands |
|------|----------------------|
| references/tables-reporting.md | putexcel, putdocx, putpdf, LaTeX integration, collect |
| references/workflow-best-practices.md | Project structure, master do-files, version control, debugging, common mistakes |
| references/external-tools-integration.md | Python via python:, R via rsource, shell commands, Git |
| File | What It Does |
|------|-------------|
| packages/reghdfe.md | High-dimensional fixed effects OLS (absorbs multiple FE sets efficiently) |
| packages/estout.md | esttab/estout: publication-quality regression tables |
| packages/outreg2.md | Alternative regression table exporter (Word, Excel, TeX) |
| packages/asdoc.md | One-command Word document creation for any Stata output |
| packages/tabout.md | Cross-tabulations and summary tables to file |
| packages/coefplot.md | Coefficient plots from stored estimates |
| packages/graph-schemes.md | grstyle, schemepack, plotplain — better graph themes |
| packages/did.md | Modern DiD: csdid, did_multiplegt, did_imputation (Callaway-Sant'Anna, de Chaisemartin-D'Haultfoeuille, Borusyak-Jaravel-Spiess) |
| packages/event-study.md | eventstudyinteract, eventdd — event study estimators |
| packages/rdrobust.md | Robust RD estimation with optimal bandwidth (rdrobust, rdplot, rdbwselect) |
| packages/psmatch2.md | Propensity score matching (nearest neighbor, kernel, radius) |
| packages/synth.md | Synthetic control method (synth, synth_runner) |
| packages/ivreg2.md | Enhanced IV/2SLS: ivreg2, xtivreg2 with additional diagnostics |
| packages/xtabond2.md | Dynamic panel GMM (Arellano-Bond/Blundell-Bond) |
| packages/binsreg.md | Binned scatter plots with CI (binsreg, binstest) |
| packages/nprobust.md | Nonparametric kernel estimation and inference |
| packages/diagnostics.md | bacondecomp, xttest3, collinearity, heteroskedasticity tests |
| packages/winsor.md | Winsorizing and trimming: winsor2, winsor |
| packages/data-manipulation.md | gtools (fast collapse/egen), rangestat, egenmore |
| packages/package-management.md | ssc install, net install, ado update, finding packages |
* Estimate models
eststo clear
eststo: regress y x1 x2, vce(robust)
eststo: regress y x1 x2 x3, vce(robust)
eststo: regress y x1 x2 x3 x4, vce(cluster id)
* Export table
esttab using "results.tex", replace ///
se star(* 0.10 ** 0.05 *** 0.01) ///
label booktabs ///
title("Main Results") ///
mtitles("(1)" "(2)" "(3)")
xtset panelid timevar // declare panel structure
xtdescribe // check balance
xtsum outcome // within/between variation
* Fixed effects
xtreg y x1 x2, fe vce(cluster panelid)
* Or with reghdfe (preferred for multiple FE)
reghdfe y x1 x2, absorb(panelid timevar) vce(cluster panelid)
* Classic 2x2 DiD
gen post = (year >= treatment_year)
gen treat_post = treated * post
regress y treated post treat_post, vce(cluster id)
* Modern staggered DiD (Callaway & Sant'Anna)
csdid y x1 x2, ivar(id) time(year) gvar(first_treat) agg(event)
csdid_plot
* Publication-quality scatter with fit line
twoway (scatter y x, mcolor(navy%50) msize(small)) ///
(lfit y x, lcolor(cranberry) lwidth(medthick)), ///
title("Title Here") ///
xtitle("X Label") ytitle("Y Label") ///
legend(off) scheme(s2color)
graph export "figure1.pdf", replace as(pdf)
graph export "figure1.png", replace as(png) width(2400)
* Load and inspect
import delimited "raw_data.csv", clear varnames(1)
describe
codebook, compact
* Clean
rename *, lower // lowercase all varnames
destring income, replace force // convert string to numeric; force sets non-numeric values to missing
replace income = . if income < 0
* Label
label variable income "Annual household income (USD)"
label define yesno 0 "No" 1 "Yes"
label values employed yesno
* Save
compress
save "clean_data.dta", replace
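Because destring with force converts unparseable values to missing without complaint, it can help to look at those values first. A sketch under made-up data (the variable names income_str and bad are hypothetical):

```stata
clear
input str10 income_str
"52000"
"48,500"
"N/A"
end

* Flag non-empty strings that real() cannot parse as a number
gen byte bad = missing(real(income_str)) & !missing(income_str)
list income_str if bad

* Strip thousands separators; force turns what is still non-numeric into missing
destring income_str, generate(income) ignore(",") force

assert income[1] == 52000
assert income[2] == 48500
assert missing(income[3])
```

Listing the flagged rows before forcing shows whether the problem is formatting (commas, currency symbols) or genuinely non-numeric entries.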
mi set mlong
mi register imputed income education
mi impute chained (regress) income (ologit) education = age i.gender, add(20) rseed(12345)
mi estimate: regress wage income education age i.gender