skills/43-wentorai-research-plugins/skills/analysis/wrangling/stata-data-cleaning/SKILL.md
Clean, transform, and validate messy research data using Stata
npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research stata-data-cleaningInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Clean, transform, and validate messy research datasets in Stata. This skill covers the complete data preparation pipeline from raw survey or administrative data to analysis-ready datasets, with emphasis on documentation, reproducibility, and handling the common data quality issues encountered in social science, economics, and health research.
Data cleaning typically consumes 60-80% of research time in empirical studies, yet it is often under-documented and poorly reproducible. Stata provides a powerful set of commands for data manipulation, but knowing which commands to use and in what order requires experience with common data quality issues: inconsistent coding, duplicate observations, string formatting problems, implausible values, and complex missing data patterns.
This skill provides a systematic, step-by-step data cleaning workflow in Stata. Each step produces a log of changes made, enabling full reproducibility and audit trails. The workflow is organized around the principle that raw data should never be modified in place -- instead, cleaning scripts transform raw data into processed datasets while preserving the original.
The approach follows best practices from the World Bank's DIME Analytics team and the J-PAL research transparency guidelines, making it suitable for projects that require rigorous data documentation for peer review, replication packages, or regulatory compliance.
* ============================================
* Data Cleaning Script: [Project Name]
* Author: [Name]
* Date: [Date]
* Input: raw/survey_data_raw.dta
* Output: processed/survey_data_clean.dta
* ============================================
clear all
set more off
log using "logs/cleaning_log.smcl", replace
* Load raw data
use "raw/survey_data_raw.dta", clear
* Basic inspection
describe
summarize
codebook, compact
* Check dimensions
display "Observations: " _N
display "Variables: " c(k)
* Check for duplicates on ID variable
duplicates report respondent_id
duplicates list respondent_id if duplicates(respondent_id) > 0
* Generate a data quality summary
foreach var of varlist _all {
quietly {
count if missing(`var')
local nmiss = r(N)
local pctmiss = (`nmiss' / _N) * 100
}
if `pctmiss' > 0 {
display "`var': `nmiss' missing (`pctmiss'%)"
}
}
* Check value ranges for numeric variables
foreach var of varlist age income years_education {
summarize `var', detail
* Flag implausible values
count if `var' < 0 & !missing(`var')
count if `var' > 150 & !missing(`var')
}
* Trim whitespace
replace name = strtrim(name)
replace name = stritrim(name) // Remove internal multiple spaces
* Standardize case
replace city = proper(city) // Title case
replace country = upper(country) // Upper case
replace email = lower(email) // Lower case
* Remove special characters
replace phone = ustrregexra(phone, "[^0-9]", "")
* Fix encoding issues
replace name = ustrfix(name)
* Standardize common variations
replace department = "Computer Science" if ///
inlist(department, "CS", "Comp Sci", "Comp. Sci.", "CompSci")
replace gender = "Female" if inlist(gender, "F", "f", "female", "FEMALE")
replace gender = "Male" if inlist(gender, "M", "m", "male", "MALE")
* Split full name into first and last
gen first_name = word(full_name, 1)
gen last_name = word(full_name, -1)
* Extract year from date string "March 15, 2024"
gen year = real(word(date_string, -1))
* Parse numeric values from strings like "$1,234.56"
gen income_clean = real(subinstr(subinstr(income_str, "$", "", .), ",", "", .))
* Install missing data analysis tools
ssc install mdesc
ssc install misstable
* Summary of missing data
mdesc
* Missing data patterns
misstable summarize
misstable patterns
* Create missing indicator variables
foreach var of varlist income education occupation {
gen mi_`var' = missing(`var')
}
* Test whether missing is random (Little's MCAR test approximation)
* Compare means of observed variables by missing status
foreach var of varlist income education {
ttest age, by(mi_`var')
ttest gender_numeric, by(mi_`var')
}
* Common survey codes for missing
* -99 = refused, -88 = don't know, -77 = not applicable
foreach var of varlist income satisfaction trust_score {
replace `var' = .r if `var' == -99 // .r = refused
replace `var' = .d if `var' == -88 // .d = don't know
replace `var' = .n if `var' == -77 // .n = not applicable
}
* Extended missing values preserve the reason for missingness
* while still being treated as missing in analyses
* Create age groups
recode age (18/29 = 1 "18-29") (30/44 = 2 "30-44") ///
(45/59 = 3 "45-59") (60/max = 4 "60+"), gen(age_group)
* Create binary indicator
gen high_income = (income > 75000) if !missing(income)
* Create composite scale (e.g., Likert items)
alpha item1 item2 item3 item4 item5, gen(scale_score) item
* Cronbach's alpha is reported; scale_score is the mean
* Standardize continuous variables
foreach var of varlist income education_years age {
egen z_`var' = std(`var')
}
* Winsorize extreme values
winsor2 income, cuts(1 99) replace
* Parse date strings
gen interview_date = date(date_string, "MDY")
format interview_date %td
* Extract components
gen interview_year = year(interview_date)
gen interview_month = month(interview_date)
gen interview_dow = dow(interview_date) // 0=Sunday
* Calculate durations
gen days_since_treatment = interview_date - treatment_date
gen months_since = (interview_date - treatment_date) / 30.44
* These assertions halt execution if violated
assert _N == 5000 // Expected sample size
assert !missing(respondent_id) // No missing IDs
assert age >= 18 & age <= 120 if !missing(age) // Plausible age range
assert inlist(gender, "Male", "Female", "Other", "") | missing(gender)
* Cross-variable consistency checks
assert education_years >= 0 if !missing(education_years)
assert income >= 0 if !missing(income)
assert end_date >= start_date if !missing(end_date) & !missing(start_date)
* Identify duplicates
duplicates tag respondent_id, gen(dup_flag)
list respondent_id survey_date if dup_flag > 0, sepby(respondent_id)
* Keep most recent observation per respondent
bysort respondent_id (survey_date): keep if _n == _N
* Or keep first observation
bysort respondent_id (survey_date): keep if _n == 1
* Label all variables
label variable age "Age at time of interview (years)"
label variable income "Annual household income (USD)"
label variable education_years "Total years of formal education"
* Save cleaned dataset
compress // Reduce file size
save "processed/survey_data_clean.dta", replace
* Export codebook
codebook, compact
describe, short
* Close log
log close
log using to capture all output for audit trails.development
Conduct rigorous thematic analysis (TA) of qualitative data following Braun and Clarke's (2006) six-phase framework. Use whenever the user mentions 'thematic analysis', 'TA', 'Braun and Clarke', 'qualitative coding', 'identifying themes', or asks for help analysing interviews, focus groups, open-ended survey responses, or transcripts to identify patterns. Also trigger for questions about inductive vs theoretical coding, semantic vs latent themes, essentialist vs constructionist epistemology, building a thematic map, or writing up a qualitative findings section. Covers all six phases, the four upfront analytic decisions, the 15-point quality checklist, and the five common pitfalls. Produces a Word document write-up and an annotated thematic map. Does NOT cover IPA, grounded theory, discourse analysis, conversation analysis, or narrative analysis — use a different method for those.
development
Guide users through writing a systematic literature review (SLR) following the PRISMA 2020 framework. Use this skill whenever the user mentions 'systematic review', 'systematic literature review', 'SLR', 'PRISMA', 'PRISMA 2020', 'PRISMA flow diagram', 'PRISMA checklist', or asks for help writing, structuring, or auditing a literature review that follows reporting guidelines. Also trigger when the user asks about inclusion/exclusion criteria for a review, search strategies for databases like Scopus/WoS/PubMed, study selection processes, risk of bias assessment, or narrative synthesis for a review paper. This skill covers the full PRISMA 2020 checklist (27 items), produces a Word document manuscript in strict journal article format, generates an annotated PRISMA flow diagram, and enforces APA 7th Edition referencing throughout. It does NOT cover meta-analysis or statistical pooling. By Chuah Kee Man.
testing
Performs placebo-in-time sensitivity analysis with hierarchical null model and optional Bayesian assurance. Use when checking model robustness, verifying lack of pre-intervention effects, or estimating study power.
data-ai
Fit, summarize, plot, and interpret a chosen CausalPy experiment. Use after the causal method has been selected, including when configuring PyMC/sklearn models and scale-aware custom priors.