Stata workflows for publication-ready sociology and social science research
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research stata-analyst-guide
Complete Stata workflow for sociology and social science research, from survey data preparation through publication-ready regression tables and visualizations. This skill covers the analytical techniques most commonly used in top sociology journals.
Stata is among the most widely used statistical packages in sociology, political science, demography, and neighboring social science disciplines. Its command-line interface, reproducible do-file workflow, and built-in support for survey data, multilevel models, and categorical data analysis make it a common choice for researchers working with complex social datasets.
This skill provides ready-to-use Stata code for the most common analytical tasks in social science research: descriptive statistics for diverse variable types, regression modeling with proper controls and robustness checks, interaction effects with meaningful visualizations, and automated production of APA/ASA-formatted tables suitable for direct inclusion in journal manuscripts.
The examples draw on typical social science data structures: individual-level survey data with sampling weights, nested data (individuals within organizations or regions), longitudinal panels, and event-history data. All code follows the conventions expected by reviewers at journals such as the American Sociological Review, American Journal of Sociology, and Social Forces.
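Several commands used in the code below (estpost, esttab, coefplot) are community-contributed rather than built into Stata. A minimal setup sketch, assuming an internet connection and that the packages are available on SSC under the names estout and coefplot:

```stata
* Install community-contributed packages only if they are missing.
* estpost and esttab ship with the estout package; coefplot is its own package.
foreach pkg in estout coefplot {
    capture which `pkg'
    if _rc ssc install `pkg'
}
```

Running this once per machine is enough; the check with which avoids reinstalling on every run of a do-file.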
* Social science surveys typically require survey weights
svyset psu [pweight=finalweight], strata(stratum)
* Weighted means and proportions
svy: mean income education_years age
svy: proportion race gender marital_status
* Weighted cross-tabulation
svy: tabulate education_cat income_quintile, row se
* Descriptive statistics table for paper (unweighted)
* estpost and esttab come from the community-contributed estout package (ssc install estout)
estpost summarize age education_years income ///
    children household_size, detail
esttab using "tables/descriptives.tex", ///
    cells("mean(fmt(2)) sd(fmt(2)) min max count") ///
    label title("Descriptive Statistics") replace
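* Group mean comparison with survey weights (design-based t-test)
svy: mean income, over(gender)
* Stata 16+ names the group means c.income@#.gender; assumes gender is coded 1 = Male, 2 = Female
lincom _b[c.income@1.gender] - _b[c.income@2.gender]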
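* ANOVA-style joint tests: adjusted Wald tests of each factor after survey regression
svy: regress income i.race i.education_cat
testparm i.race
testparm i.education_cat
* Effect sizes (Cohen's d); esize does not accept sampling weights, so this is unweighted
esize twosample income, by(gender)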
* Model building strategy (nested models for sociology papers)
* Model 1: Bivariate
reg income i.gender [pweight=finalweight], robust
estimates store m1
* Model 2: Add demographics
reg income i.gender age age_sq i.race i.marital [pweight=finalweight], robust
estimates store m2
* Model 3: Add human capital
reg income i.gender age age_sq i.race i.marital ///
    education_years experience experience_sq [pweight=finalweight], robust
estimates store m3
* Model 4: Add job characteristics
reg income i.gender age age_sq i.race i.marital ///
    education_years experience experience_sq ///
    i.occupation i.industry hours_worked [pweight=finalweight], robust
estimates store m4
* Publication-ready table
esttab m1 m2 m3 m4 using "tables/regression_income.tex", ///
    b(3) se(3) star(* 0.05 ** 0.01 *** 0.001) ///
    label title("OLS Regression of Income") ///
    mtitles("Bivariate" "Demographics" "Human Capital" "Full Model") ///
    stats(N r2_a, labels("Observations" "Adjusted R-squared") fmt(0 3)) ///
    addnotes("Standard errors in parentheses." ///
             "All models use survey weights.") ///
    replace
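Nested models are only comparable when they are estimated on the same cases, and adding covariates with missing values can silently shrink the sample from one model to the next. The sketch below shows one way to hold the sample fixed. It uses the nlsw88 dataset that ships with Stata, so its variables (wage, collgrad, ttl_exp, and so on) are stand-ins for the income example and not the guide's data, and it uses the built-in estimates table instead of esttab so it runs without extra packages.

```stata
* Illustrative data only: nlsw88 ships with Stata
sysuse nlsw88, clear

* Fit the fullest model first and flag the cases it uses
regress wage i.collgrad age i.race i.married ttl_exp tenure hours
generate byte insample = e(sample)

* Model 1: bivariate
regress wage i.collgrad if insample, vce(robust)
estimates store s1

* Model 2: add demographics
regress wage i.collgrad age i.race i.married if insample, vce(robust)
estimates store s2

* Model 3: add human capital and hours worked
regress wage i.collgrad age i.race i.married ttl_exp tenure hours ///
    if insample, vce(robust)
estimates store s3

* Side-by-side comparison; N should be identical across columns
estimates table s1 s2 s3, b(%9.3f) se(%9.3f) stats(N r2)
```

The same pattern carries over to weighted models: add the weight expression to each regress call and keep the if insample restriction.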
* Binary outcome: employment status
logit employed i.gender age age_sq i.race i.education_cat ///
    children i.marital [pweight=finalweight], robust
estimates store logit1
* Report odds ratios
logit employed i.gender age age_sq i.race i.education_cat ///
    children i.marital [pweight=finalweight], robust or
estimates store logit_or
* Average marginal effects (preferred in sociology)
margins, dydx(*) post
estimates store ame
* Predicted probabilities by group
logit employed i.gender##i.race age education_years [pweight=finalweight], robust
margins gender#race, atmeans
marginsplot, title("Predicted Probability of Employment")
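The odds ratios above are obtained by estimating the same logit model a second time with the or option. A sketch of an alternative, assuming the bundled nlsw88 data with union membership as a stand-in binary outcome: replay the fitted model with or, then compute average marginal effects and group-level predicted probabilities from that single fit.

```stata
* Illustrative data only: nlsw88 ships with Stata
sysuse nlsw88, clear

logit union i.collgrad age i.race i.south, vce(robust)

* Replay the same estimates as odds ratios; nothing is re-estimated
logit, or

* Average marginal effects on the probability scale
margins, dydx(*)

* Predicted probabilities by group, averaged over the observed covariates
margins collgrad
```

Without the post option, margins leaves the logit results in memory, so further margins calls can follow directly.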
* Gender x education interaction on income
reg income c.education_years##i.gender age i.race [pweight=finalweight], robust
* Visualize interaction
margins gender, at(education_years=(8(2)20))
marginsplot, ///
    title("Returns to Education by Gender") ///
    ytitle("Predicted Income ($)") ///
    xtitle("Years of Education") ///
    legend(order(1 "Male" 2 "Female")) ///
    scheme(s2mono)
graph export "figures/education_gender_interaction.pdf", replace
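* Test whether the effect of X on Y varies by moderator Z
* (outcome, x_var, moderator, and controls are placeholders for your own variables)
reg outcome c.x_var##c.moderator controls [pweight=finalweight], robust
* Simple slopes at chosen moderator values. The at() numbers are values of the
* moderator itself, not percentiles; substitute values meaningful for your scale
margins, dydx(x_var) at(moderator=(10 25 50 75 90))
marginsplot, recast(line) recastci(rarea) ///
    title("Effect of X on Y at Different Levels of Moderator")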
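When the moderator has no natural reference values, margins can evaluate the slope at sample percentiles directly through the at((p#) varname) form. A sketch on the bundled nlsw88 data, with tenure as the focal predictor and ttl_exp as a stand-in moderator (both are illustrative choices, not the guide's variables):

```stata
* Illustrative data only: nlsw88 ships with Stata
sysuse nlsw88, clear

regress wage c.tenure##c.ttl_exp age, vce(robust)

* Slope of tenure with ttl_exp set to its 10th, 25th, 50th, 75th, and 90th percentiles
margins, dydx(tenure) at((p10) ttl_exp) at((p25) ttl_exp) ///
    at((p50) ttl_exp) at((p75) ttl_exp) at((p90) ttl_exp)
```

The percentile values margins actually used are listed in the header of its output, which makes them easy to report alongside the slopes.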
* Students nested within schools: random intercept
mixed test_score i.gender c.ses || school_id:, ///
    variance mle
* Random slope for ses
mixed test_score i.gender c.ses || school_id: ses, ///
    covariance(unstructured) mle
* Calculate ICC (after a random-slope model this is conditional on ses = 0)
estat icc
* Store and compare models
estimates store mlm1
mixed test_score i.gender c.ses school_quality || school_id: ses, ///
    covariance(unstructured) mle
estimates store mlm2
* Likelihood-ratio test; requires both models to be fit by ML on the same sample
lrtest mlm1 mlm2
* Set publication-ready scheme
set scheme s2mono
* Coefficient plot (coefplot is community-contributed: ssc install coefplot)
coefplot m2 m3 m4, ///
    drop(_cons) xline(0) ///
    title("Regression Coefficients Across Models") ///
    legend(order(2 "Demographics" 4 "Human Capital" 6 "Full")) ///
    graphregion(color(white))
graph export "figures/coefplot.pdf", replace
* Distribution comparison
twoway (kdensity income if gender==1, lcolor(navy)) ///
       (kdensity income if gender==2, lcolor(cranberry)), ///
    title("Income Distribution by Gender") ///
    legend(order(1 "Male" 2 "Female")) ///
    xtitle("Annual Income ($)") ytitle("Density") ///
    graphregion(color(white))
graph export "figures/income_density.pdf", replace
* Master do-file structure for replication
* master.do
* ==========================================
* Project: [Title]
* Author: [Name]
* Date: [Date]
* Description: Master script for replication
* ==========================================
version 17
clear all
set more off
set maxvar 10000  // Stata/SE and Stata/MP only; remove this line in Stata/BE
global root "~/research/project_name"
global raw "$root/data/raw"
global processed "$root/data/processed"
global tables "$root/tables"
global figures "$root/figures"
global logs "$root/logs"
log using "$logs/master_log.smcl", replace
do "$root/code/01_data_cleaning.do"
do "$root/code/02_descriptives.do"
do "$root/code/03_main_analysis.do"
do "$root/code/04_robustness.do"
do "$root/code/05_tables_figures.do"
log close
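The master script above assumes that no log is already open and that the output folders exist; if either assumption fails, log using stops with an error. A sketch of a preamble that makes the script safe to rerun, assuming for illustration that the project root is the current working directory (the folder names follow the globals above; the log name master is hypothetical):

```stata
* Close any log left open by an earlier, interrupted run
capture log close _all

* Assumption: the project root is the current working directory
global root "`c(pwd)'"
global logs "$root/logs"

* Create output folders if they do not exist yet; capture ignores "already exists"
foreach dir in tables figures logs {
    capture mkdir "$root/`dir'"
}

* Open a named log so it can be closed without touching other logs
log using "$logs/master_log.smcl", replace name(master)
* ... the numbered do-files would run here ...
log close master
```

Note that mkdir creates only one level at a time, so the root folder itself must already exist.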