Stata Skill

You have access to comprehensive Stata reference files. Do not load all files. Read only the 1-3 files relevant to the user's current task using the routing table below.

Critical Gotchas

These are Stata-specific pitfalls that lead to silent bugs. Internalize these before writing any code.

Missing Values Sort to +Infinity

Stata's missing values, . and .a through .z, compare as greater than every number, so any > or >= test is true for them.

* WRONG — includes observations where income is missing!
gen high_income = (income > 50000)

* RIGHT
gen high_income = (income > 50000) if !missing(income)

* WRONG — missing ages appear in this list
list if age > 60

* RIGHT
list if age > 60 & !missing(age)
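The size of the problem can be checked directly. The sketch below uses the auto dataset that ships with Stata, where rep78 has some missing values; the dataset choice and the local names (n_naive, n_safe, n_missing) are illustrative assumptions, not part of the examples above.

```stata
* Illustrative check on the shipped auto dataset
sysuse auto, clear

count if rep78 > 4                      // naive: also counts missing rep78
local n_naive = r(N)

count if rep78 > 4 & !missing(rep78)    // safe: nonmissing values only
local n_safe = r(N)

count if missing(rep78)
local n_missing = r(N)

* The naive count exceeds the safe count by exactly the number of missings
assert `n_naive' == `n_safe' + `n_missing'
```

If the final assert passes, every missing value was being picked up by the naive comparison.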

`=` vs `==`

= is assignment; == is comparison. Mixing them up is a syntax error or silent bug.

* WRONG — syntax error
gen employed = 1 if status = 1

* RIGHT
gen employed = 1 if status == 1

Local Macro Syntax

Locals are written `name' (opening backtick, closing single quote). Globals are written $name or ${name}. Forgetting the closing quote is one of the most common macro bugs.

local controls "age education income"
regress wage `controls'        // correct
regress wage `controls         // WRONG — missing closing quote
regress wage 'controls'        // WRONG — wrong quote characters
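The global forms can be sketched the same way. This example assumes the shipped auto dataset; the macro names outcome and stub are hypothetical and chosen only to show when the braces matter.

```stata
sysuse auto, clear

local controls "weight length"
global outcome "price"
regress $outcome `controls'      // global and local expanded in one command

* Braces mark where the global's name ends when text follows it directly
global stub "gear"
summarize ${stub}_ratio          // expands to gear_ratio
* Without the braces, the name would be read as stub_ratio, an undefined global
```

An undefined macro expands to nothing rather than raising an error, which is why a mistyped macro name can go unnoticed.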

`by` Requires Prior Sort (Use `bysort`)

* WRONG — error if data not sorted by id
by id: gen first = (_n == 1)

* RIGHT — bysort sorts automatically
bysort id: gen first = (_n == 1)

* Also RIGHT — explicit sort
sort id
by id: gen first = (_n == 1)
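Within each group, _n follows whatever order the rows happen to be in unless a within-group sort is requested. A minimal sketch on the shipped auto dataset, using bysort's parenthesized sort-only variables; the variable name cheapest is illustrative.

```stata
sysuse auto, clear

* Group by foreign; sort by price within each group without grouping on price
bysort foreign (price): gen cheapest = (_n == 1)

list make foreign price if cheapest
```

Variables inside the parentheses affect the sort order only, so the groups are still defined by foreign alone.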

Factor Variable Notation (`i.` and `c.`)

Use i. for categorical, c. for continuous. Omitting i. treats categories as continuous.

* WRONG — treats race as continuous (e.g., race=3 has 3x effect of race=1)
regress wage race education

* RIGHT — creates dummies automatically
regress wage i.race education

* Interactions
regress wage i.race##c.education    // full interaction
regress wage i.race#c.education     // interaction only (no main effects)
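The ## operator is shorthand for the main effects plus the interaction. The sketch below, on the shipped auto dataset (the dataset and the local name r2_short are assumptions for illustration), checks that the two spellings fit the same model.

```stata
sysuse auto, clear

* Short form: main effects and interaction in one term
regress price i.foreign##c.weight
local r2_short = e(r2)

* Long form: the same model spelled out
regress price i.foreign c.weight i.foreign#c.weight

* Both fits should agree up to floating-point precision
assert reldif(`r2_short', e(r2)) < 1e-12
```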

`generate` vs `replace`

generate creates new variables; replace modifies existing ones. Using generate on an existing variable name is an error.

gen x = 1
gen x = 2          // ERROR: x already defined
replace x = 2      // correct

String Comparison Is Case-Sensitive

* May miss "Male", "MALE", etc.
keep if gender == "male"

* Safer
keep if lower(gender) == "male"

`merge`: Always Check `_merge`

merge 1:1 id using other.dta
tab _merge                      // always inspect
assert _merge == 3              // or handle mismatches
drop _merge
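When only some records are expected to match, the assert() option of merge can state which outcomes are acceptable. A self-contained sketch, assuming the shipped auto dataset split into two hypothetical pieces (the id variable and the prices tempfile are created only for this illustration):

```stata
sysuse auto, clear
gen long id = _n

* Build a hypothetical "using" file that covers only the first 50 ids
tempfile prices
preserve
keep id price
keep if id <= 50
save `prices'
restore
drop price

* Fail loudly if any record exists only in the using file
merge 1:1 id using `prices', assert(master match)
tab _merge

keep if _merge == 3      // keep matched records only
drop _merge
```

Here unmatched master records are allowed and then dropped deliberately, rather than disappearing unnoticed.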

`preserve` / `restore` for Temporary Changes

preserve
collapse (mean) income, by(state)
* ... do something with collapsed data ...
restore   // original data is back

Weights Are Not Interchangeable

fweight — frequency weights (replication)
aweight — analytic/regression weights (inverse variance)
pweight — probability/sampling weights (survey data, implies robust SE)
iweight — importance weights (rarely used)
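The differences show up in the reported sample size and standard errors. In the sketch below, on the shipped auto dataset, the weight variable w is hypothetical and built only so that the commands run; it carries no substantive meaning.

```stata
sysuse auto, clear

* Hypothetical integer weight, for illustration only
gen int w = rep78
replace w = 1 if missing(w)

regress price weight [fweight = w]    // each observation counted w times
local n_f = e(N)

regress price weight [aweight = w]    // N stays at the number of rows
local n_a = e(N)

regress price weight [pweight = w]    // robust standard errors are reported
display "fweight N = `n_f', aweight N = `n_a', pweight vce = `e(vce)'"
```

The three regressions share the same point estimates; what changes is the observation count and the variance estimator, which is why the weight type has to match how the weights were constructed.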

`capture` Swallows Errors

capture some_command
if _rc != 0 {
    di as error "Failed with code: " _rc
    exit _rc
}
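A concrete version of the same pattern, assuming the shipped auto dataset and a deliberately nonexistent variable name (no_such_var is hypothetical):

```stata
sysuse auto, clear

capture confirm variable no_such_var
local rc = _rc      // copy right away; a later capture resets _rc

if `rc' != 0 {
    display as text "no_such_var not found (return code `rc')"
}
```

Copying _rc into a local keeps the code available even if another capture runs before it is inspected.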

Line Continuation Uses `///`

regress y x1 x2 x3 ///
    x4 x5 x6, ///
    vce(robust)

Stored Results: `r()` vs `e()` vs `s()`

r() — r-class commands (summarize, tabulate, etc.)
e() — e-class commands (estimation: regress, logit, etc.)
s() — s-class commands (parsing)

A new estimation command overwrites previous e() results. Store them first:

regress y x1 x2
estimates store model1
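The same caution applies to r(): the next r-class command replaces it. A short sketch on the shipped auto dataset, with illustrative names (mean_price, m1, m2), that copies an r() result and switches between stored estimates:

```stata
sysuse auto, clear

summarize price
local mean_price = r(mean)      // copy now: the next r-class command replaces r()

regress price weight
estimates store m1
regress price weight length     // replaces e() with the second model
estimates store m2

estimates restore m1            // make m1 the active e() results again
display "N = " e(N) ", slope on weight = " _b[weight]

estimates table m1 m2, se       // side-by-side comparison of both models
```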

Routing Table

Read only the files relevant to the user's task. Paths are relative to this SKILL.md file.

Data Operations

| File | Topics & Key Commands |
|------|----------------------|
| references/basics-getting-started.md | use, save, describe, browse, sysuse, basic workflow |
| references/data-import-export.md | import delimited, import excel, ODBC, export, web data |
| references/data-management.md | generate, replace, merge, append, reshape, collapse, recode, egen, encode/decode |
| references/variables-operators.md | Variable types, byte/int/long/float/double, operators, missing values (.<.a), if/in qualifiers |
| references/string-functions.md | substr(), regexm(), strtrim(), split, ustrlen(), regex, Unicode |
| references/date-time-functions.md | date(), clock(), %td/%tc formats, mdy(), dofm(), business calendars |
| references/mathematical-functions.md | round(), log(), exp(), abs(), mod(), cond(), distributions, random numbers |

Statistics & Econometrics

| File | Topics & Key Commands |
|------|----------------------|
| references/descriptive-statistics.md | summarize, tabulate, correlate, tabstat, codebook, weighted stats |
| references/linear-regression.md | regress, vce(robust), vce(cluster), test, lincom, margins, predict, ivregress |
| references/panel-data.md | xtset, xtreg fe/re, Hausman test, xtabond, dynamic panels |
| references/time-series.md | tsset, ARIMA, VAR, dfuller, pperron, irf, forecasting |
| references/limited-dependent-variables.md | logit, probit, tobit, poisson, nbreg, mlogit, ologit, margins for nonlinear |
| references/bootstrap-simulation.md | bootstrap, simulate, permute, Monte Carlo |
| references/survey-data-analysis.md | svyset, svy:, subpop(), complex survey design, replicate weights |
| references/missing-data-handling.md | mi impute, mi estimate, FIML, misstable, diagnostics |
| references/maximum-likelihood.md | ml model, custom likelihood functions, ml init, gradient-based optimization |
| references/gmm-estimation.md | gmm, moment conditions, estat overid, J-test |

Causal Inference

| File | Topics & Key Commands |
|------|----------------------|
| references/treatment-effects.md | teffects ra/ipw/ipwra/aipw, stteffects, ATE/ATT/ATET |
| references/difference-in-differences.md | DiD, parallel trends, event studies, staggered adoption |
| references/regression-discontinuity.md | Sharp/fuzzy RD, bandwidth selection, rdplot |
| references/matching-methods.md | PSM, nearest neighbor, kernel matching, teffects nnmatch |
| references/sample-selection.md | heckman, heckprobit, treatment models, exclusion restrictions |

Advanced Methods

| File | Topics & Key Commands |
|------|----------------------|
| references/survival-analysis.md | stset, stcox, streg, Kaplan-Meier, parametric models |
| references/sem-factor-analysis.md | sem, gsem, CFA, path analysis, alpha, reliability |
| references/nonparametric-methods.md | kdensity, rank tests, qreg, npregress |
| references/spatial-analysis.md | spmatrix, spregress, spatial weights, Moran's I |
| references/machine-learning.md | lasso, elasticnet, cvlasso, cross-validation |

Graphics

| File | Topics & Key Commands |
|------|----------------------|
| references/graphics.md | twoway, scatter, line, bar, histogram, graph combine, graph export, schemes |

Programming

| File | Topics & Key Commands |
|------|----------------------|
| references/programming-basics.md | local, global, foreach, forvalues, program define, syntax, return |
| references/advanced-programming.md | syntax, mata, classes, _prefix, dialog boxes, tempfile/tempvar |
| references/mata-introduction.md | Mata basics, when to use Mata vs ado, data types |
| references/mata-programming.md | Mata functions, flow control, structures, pointers |
| references/mata-matrix-operations.md | Matrix creation, decompositions, solvers, st_matrix() |
| references/mata-data-access.md | st_data(), st_view(), st_store(), performance tips |

Output & Workflow

| File | Topics & Key Commands |
|------|----------------------|
| references/tables-reporting.md | putexcel, putdocx, putpdf, LaTeX integration, collect |
| references/workflow-best-practices.md | Project structure, master do-files, version control, debugging, common mistakes |
| references/external-tools-integration.md | Python via python:, R via rsource, shell commands, Git |

Community Packages

| File | What It Does |
|------|-------------|
| packages/reghdfe.md | High-dimensional fixed effects OLS (absorbs multiple FE sets efficiently) |
| packages/estout.md | esttab/estout: publication-quality regression tables |
| packages/outreg2.md | Alternative regression table exporter (Word, Excel, TeX) |
| packages/asdoc.md | One-command Word document creation for any Stata output |
| packages/tabout.md | Cross-tabulations and summary tables to file |
| packages/coefplot.md | Coefficient plots from stored estimates |
| packages/graph-schemes.md | grstyle, schemepack, plotplain: better graph themes |
| packages/did.md | Modern DiD: csdid, did_multiplegt, did_imputation (Callaway-Sant'Anna, de Chaisemartin-D'Haultfoeuille, Borusyak-Jaravel-Spiess) |
| packages/event-study.md | eventstudyinteract, eventdd: event study estimators |
| packages/rdrobust.md | Robust RD estimation with optimal bandwidth (rdrobust, rdplot, rdbwselect) |
| packages/psmatch2.md | Propensity score matching (nearest neighbor, kernel, radius) |
| packages/synth.md | Synthetic control method (synth, synth_runner) |
| packages/ivreg2.md | Enhanced IV/2SLS: ivreg2, xtivreg2 with additional diagnostics |
| packages/xtabond2.md | Dynamic panel GMM (Arellano-Bond/Blundell-Bond) |
| packages/binsreg.md | Binned scatter plots with CI (binsreg, binstest) |
| packages/nprobust.md | Nonparametric kernel estimation and inference |
| packages/diagnostics.md | bacondecomp, xttest3, collinearity, heteroskedasticity tests |
| packages/winsor.md | Winsorizing and trimming: winsor2, winsor |
| packages/data-manipulation.md | gtools (fast collapse/egen), rangestat, egenmore |
| packages/package-management.md | ssc install, net install, ado update, finding packages |

Common Patterns

Regression Table Workflow

* Estimate models
eststo clear
eststo: regress y x1 x2, vce(robust)
eststo: regress y x1 x2 x3, vce(robust)
eststo: regress y x1 x2 x3 x4, vce(cluster id)

* Export table
esttab using "results.tex", replace ///
    se star(* 0.10 ** 0.05 *** 0.01) ///
    label booktabs ///
    title("Main Results") ///
    mtitles("(1)" "(2)" "(3)")

Panel Data Setup

xtset panelid timevar          // declare panel structure
xtdescribe                      // check balance
xtsum outcome                   // within/between variation

* Fixed effects
xtreg y x1 x2, fe vce(cluster panelid)
* Or with reghdfe (preferred for multiple FE)
reghdfe y x1 x2, absorb(panelid timevar) vce(cluster panelid)

Difference-in-Differences

* Classic 2x2 DiD
gen post = (year >= treatment_year) if !missing(year)   // a missing year would otherwise count as post
gen treat_post = treated * post
regress y treated post treat_post, vce(cluster id)

* Modern staggered DiD (Callaway & Sant'Anna)
csdid y x1 x2, ivar(id) time(year) gvar(first_treat) agg(event)
csdid_plot

Graph Export

* Publication-quality scatter with fit line
twoway (scatter y x, mcolor(navy%50) msize(small)) ///
       (lfit y x, lcolor(cranberry) lwidth(medthick)), ///
    title("Title Here") ///
    xtitle("X Label") ytitle("Y Label") ///
    legend(off) scheme(s2color)
graph export "figure1.pdf", replace as(pdf)
graph export "figure1.png", replace as(png) width(2400)

Data Cleaning Pipeline

* Load and inspect
import delimited "raw_data.csv", clear varnames(1)
describe
codebook, compact

* Clean
rename *, lower                 // lowercase all varnames
destring income, replace force  // convert string to numeric; force silently turns non-numeric text into missing
replace income = . if income < 0

* Label
label variable income "Annual household income (USD)"
label define yesno 0 "No" 1 "Yes"
label values employed yesno

* Save
compress
save "clean_data.dta", replace

Multiple Imputation

mi set mlong
mi register imputed income education
mi impute chained (regress) income (ologit) education = age i.gender, add(20) rseed(12345)
mi estimate: regress wage income education age i.gender
