Stata workflows for publication-ready sociology and social science research
Complete Stata workflow for sociology and social science research, from survey data preparation through publication-ready regression tables and visualizations. This skill covers the analytical techniques most commonly used in top sociology journals.
Stata is one of the most widely used statistical packages in sociology, political science, demography, and other social science disciplines. Its command-line interface, reproducible do-file workflow, and comprehensive support for survey data, multilevel models, and categorical data analysis make it a common choice for researchers working with complex social datasets.
This skill provides ready-to-use Stata code for the most common analytical tasks in social science research: descriptive statistics for diverse variable types, regression modeling with proper controls and robustness checks, interaction effects with meaningful visualizations, and automated production of APA/ASA-formatted tables suitable for direct inclusion in journal manuscripts.
The examples draw on typical social science data structures: individual-level survey data with sampling weights, nested data (individuals within organizations or regions), longitudinal panels, and event-history data. All code follows the conventions expected by reviewers at journals such as the American Sociological Review, American Journal of Sociology, and Social Forces.
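The table and coefficient-plot commands used in this guide (`estpost`, `esttab`, `coefplot`) are community-contributed rather than built into Stata. A minimal setup sketch, assuming internet access and that the SSC packages `estout` and `coefplot` are the ones intended:

```stata
* Install community-contributed packages only when they are missing.
* estout provides estpost and esttab; coefplot provides coefplot.
foreach pkg in estout coefplot {
    capture which `pkg'
    if _rc {
        ssc install `pkg'
    }
}
```

Running this once at the top of a project keeps the later table and figure commands from failing with "command not found" on a fresh installation.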
* Social science surveys typically require survey weights
svyset psu [pweight=finalweight], strata(stratum)
* Weighted means and proportions
svy: mean income education_years age
svy: proportion race gender marital_status
* Weighted cross-tabulation
svy: tabulate education_cat income_quintile, row se
* Descriptive statistics table for paper
estpost summarize age education_years income ///
children household_size, detail
esttab using "tables/descriptives.tex", ///
cells("mean(fmt(2)) sd(fmt(2)) min max count") ///
label title("Descriptive Statistics") replace
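* Weighted comparison of group means (survey analogue of a two-sample t-test)
svy: mean income, over(gender)
* Stata 16+ names over() results c.var@level.group; gender is coded 1 = Male, 2 = Female
lincom _b[c.income@1.gender] - _b[c.income@2.gender]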
* ANOVA-style comparison: joint Wald tests for each factor
svy: regress income i.race i.education_cat
testparm i.race
testparm i.education_cat
* Effect sizes (Cohen's d); esize does not accept weights, so this estimate is unweighted
esize twosample income, by(gender)
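The weighted difference in group means can also be read off a survey-weighted regression on the group indicator. The sketch below illustrates that equivalence under stated assumptions: it uses Stata's `nhanes2` example dataset (downloaded by `webuse`, so internet access is needed, and assumed to carry its own `svyset` design) with `bpsystol` and `sex` standing in for `income` and `gender`.

```stata
* Assumption: nhanes2 is Stata's example survey dataset and is already svyset.
webuse nhanes2, clear

* Difference in weighted means, Male (sex == 1) minus Female (sex == 2)
svy: mean bpsystol, over(sex)
lincom _b[c.bpsystol@1.sex] - _b[c.bpsystol@2.sex]
scalar diff_means = r(estimate)

* Same contrast from a survey-weighted regression with Female as the base level
svy: regress bpsystol ib2.sex
scalar diff_reg = _b[1.sex]

display "difference from means: " diff_means "   from regression: " diff_reg
```

The regression form is convenient when the comparison later needs covariates, since controls can be added without changing the command.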
* Model building strategy (nested models for sociology papers)
* Model 1: Bivariate
reg income i.gender [pweight=finalweight], robust
estimates store m1
* Model 2: Add demographics
reg income i.gender age age_sq i.race i.marital [pweight=finalweight], robust
estimates store m2
* Model 3: Add human capital
reg income i.gender age age_sq i.race i.marital ///
education_years experience experience_sq [pweight=finalweight], robust
estimates store m3
* Model 4: Add job characteristics
reg income i.gender age age_sq i.race i.marital ///
education_years experience experience_sq ///
i.occupation i.industry hours_worked [pweight=finalweight], robust
estimates store m4
* Publication-ready table
esttab m1 m2 m3 m4 using "tables/regression_income.tex", ///
b(3) se(3) star(* 0.05 ** 0.01 *** 0.001) ///
label title("OLS Regression of Income") ///
mtitles("Bivariate" "Demographics" "Human Capital" "Full Model") ///
stats(N r2_a, labels("Observations" "Adjusted R-squared") fmt(0 3)) ///
addnotes("Robust standard errors in parentheses." ///
"All models are weighted with sampling weights (pweights).") ///
replace
* Binary outcome: employment status
logit employed i.gender age age_sq i.race i.education_cat ///
children i.marital [pweight=finalweight], robust
estimates store logit1
* Report odds ratios (replays the model above; the or option changes only the display)
logit, or
* Average marginal effects (preferred in sociology)
margins, dydx(*) post
estimates store ame
* Predicted probabilities by group
logit employed i.gender##i.race age education_years [pweight=finalweight], robust
margins gender#race, atmeans
marginsplot, title("Predicted Probability of Employment")
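For a continuous predictor in a logit model, the average marginal effect that `margins, dydx()` reports is the sample average of b * p * (1 - p), where p is each observation's predicted probability. The sketch below checks this by hand. It is an illustration only: it uses the `auto` dataset shipped with Stata as a stand-in for survey data, is unweighted, and the variable names `p_hat` and `me_mpg` are hypothetical.

```stata
sysuse auto, clear
logit foreign mpg weight

* Manual AME of mpg: average of b * p * (1 - p) over the estimation sample
predict double p_hat, pr
generate double me_mpg = _b[mpg] * p_hat * (1 - p_hat)
summarize me_mpg, meanonly
scalar ame_manual = r(mean)

* Same quantity from margins
margins, dydx(mpg)
matrix ame = r(b)
display "manual AME = " ame_manual "   margins AME = " ame[1,1]
```

Because the effect depends on p, it differs across observations, which is why an averaged effect is usually easier to report than a single coefficient or odds ratio.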
* Gender x education interaction on income
reg income c.education_years##i.gender age i.race [pweight=finalweight], robust
* Visualize interaction
margins gender, at(education_years=(8(2)20))
marginsplot, ///
title("Returns to Education by Gender") ///
ytitle("Predicted Income ($)") ///
xtitle("Years of Education") ///
legend(order(1 "Male" 2 "Female")) ///
scheme(s2mono)
graph export "figures/education_gender_interaction.pdf", replace
* Test whether the effect of X on Y varies by moderator Z
reg outcome c.x_var##c.moderator controls [pweight=finalweight], robust
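* Simple slopes at meaningful values of the moderator
* at() takes literal values, not percentiles: replace 10 25 50 75 90 with the
* moderator's own percentiles (e.g., from summarize moderator, detail)
margins, dydx(x_var) at(moderator=(10 25 50 75 90))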
marginsplot, recast(line) recastci(rarea) ///
title("Effect of X on Y at Different Levels of Moderator")
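Because `at()` expects actual values of the moderator, the percentile values have to be looked up first. A minimal sketch of one way to do this, assuming the `auto` dataset shipped with Stata as a stand-in, with `mpg` playing the role of `x_var` and `weight` the role of `moderator`:

```stata
sysuse auto, clear
regress price c.mpg##c.weight, vce(robust)

* Look up the 10th, 25th, 50th, 75th, and 90th percentiles of the moderator
_pctile weight, percentiles(10 25 50 75 90)
local pvals `r(r1)' `r(r2)' `r(r3)' `r(r4)' `r(r5)'
display "moderator percentiles: `pvals'"

* Marginal effect of mpg at each of those moderator values
margins, dydx(mpg) at(weight=(`pvals'))
matrix slopes = r(b)
```

A `marginsplot` call can follow unchanged; the x-axis then shows the moderator values at which each slope was evaluated.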
* Students nested within schools
mixed test_score gender ses || school_id:, ///
variance mle
* Random slopes
mixed test_score gender c.ses || school_id: ses, ///
covariance(unstructured) mle
* Calculate ICC (after a random-slope model this is conditional on ses = 0)
estat icc
* Store and compare models
estimates store mlm1
mixed test_score gender c.ses school_quality || school_id: ses, ///
covariance(unstructured) mle
estimates store mlm2
lrtest mlm1 mlm2
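In a random-intercept model the ICC is the between-group variance divided by the total variance. The sketch below reproduces `estat icc` by hand from the stored estimates. It is an illustration under assumptions: it uses Stata's `pig` example dataset (downloaded by `webuse`, so internet access is needed) rather than school data, and relies on `mixed` storing the log standard deviations under the equation names `lns1_1_1` and `lnsig_e`.

```stata
webuse pig, clear
mixed weight week || id:, mle

* ICC as reported by Stata
estat icc
scalar icc_estat = r(icc2)

* ICC by hand: mixed stores log standard deviations, so variance = exp(2 * log sd)
scalar var_u = exp(2 * _b[lns1_1_1:_cons])   // between-group (random intercept) variance
scalar var_e = exp(2 * _b[lnsig_e:_cons])    // within-group (residual) variance
scalar icc_manual = var_u / (var_u + var_e)

display "estat icc = " icc_estat "   manual = " icc_manual
```

The same two variance components are what a results table typically reports alongside the fixed effects.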
* Set publication-ready scheme
set scheme s2mono
* Coefficient plot
coefplot m2 m3 m4, ///
drop(_cons) xline(0) ///
title("Regression Coefficients Across Models") ///
legend(order(2 "Demographics" 4 "Human Capital" 6 "Full")) ///
graphregion(color(white))
graph export "figures/coefplot.pdf", replace
* Distribution comparison
twoway (kdensity income if gender==1, lcolor(navy)) ///
(kdensity income if gender==2, lcolor(cranberry)), ///
title("Income Distribution by Gender") ///
legend(order(1 "Male" 2 "Female")) ///
xtitle("Annual Income ($)") ytitle("Density") ///
graphregion(color(white))
graph export "figures/income_density.pdf", replace
* Master do-file structure for replication
* master.do
* ==========================================
* Project: [Title]
* Author: [Name]
* Date: [Date]
* Description: Master script for replication
* ==========================================
version 17
clear all
set more off
set maxvar 10000  // Stata/SE and Stata/MP only
global root "~/research/project_name"
global raw "$root/data/raw"
global processed "$root/data/processed"
global tables "$root/tables"
global figures "$root/figures"
global logs "$root/logs"
log using "$logs/master_log.smcl", replace
do "$root/code/01_data_cleaning.do"
do "$root/code/02_descriptives.do"
do "$root/code/03_main_analysis.do"
do "$root/code/04_robustness.do"
do "$root/code/05_tables_figures.do"
log close