svy Skill

svy: design-based analysis of complex survey data in Python. Covers survey design specification (strata, PSU, weights, FPC), variance estimation (Taylor linearization, BRR, jackknife, bootstrap), descriptive estimation (means, totals, proportions, ratios, medians), survey-weighted GLM regression (gaussian, binomial, Poisson), domain/subpopulation analysis, calibration, and survey data I/O (SAS, SPSS, Stata). Uses Polars DataFrames natively. Use when analyzing data from complex sample surveys (NHANES, CPS, ACS PUMS, MEPS, ECLS-K, BRFSS, DHS). For non-survey regression, use statsmodels; for fixed effects, use pyfixest; for panel/IV models, use linearmodels.

Comprehensive skill for complex survey data analysis with svy. Use the decision trees below to find the right guidance, then load the detailed references.

What is svy?

svy is a Python package for design-based analysis of complex survey data:

Survey-aware estimation: Means, totals, proportions, ratios, medians with proper design-based standard errors
GLM regression: Survey-weighted linear, logistic, and Poisson regression with design-adjusted inference
Flexible variance estimation: Taylor linearization (default), bootstrap, BRR (including Fay's modification), and jackknife (JK1, JKn) replicate methods
Domain estimation: Correct subpopulation analysis without pre-filtering (preserves design structure)
Native Polars: Built on Polars DataFrames, not pandas
Survey data I/O: Read SAS (.sas7bdat), SPSS (.sav), Stata (.dta), and CSV with metadata
Calibration: Post-stratification, raking, and GREG calibration for weight adjustment
Validated: Results numerically equivalent to R's survey package across all methods
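
To make "design-based standard errors" concrete, here is a minimal sketch of Taylor linearization for a weighted mean, written in plain Python. It is not svy's implementation; the function name `taylor_mean` and the toy data are invented for illustration, and the sketch assumes PSUs sampled with replacement, no FPC, and at least two PSUs per stratum.

```python
from collections import defaultdict
from math import sqrt


def taylor_mean(y, w, stratum, psu):
    """Weighted mean and linearized SE (with-replacement PSUs, no FPC)."""
    w_sum = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / w_sum

    # Linearized score of the ratio estimator, totalled within each PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, h, j in zip(y, w, stratum, psu):
        psu_totals[h][j] += wi * (yi - mean) / w_sum

    # Between-PSU variance of the score totals, summed over strata.
    variance = 0.0
    for h, totals in psu_totals.items():
        n_h = len(totals)
        if n_h < 2:
            raise ValueError(f"stratum {h!r} has a single PSU")
        z_bar = sum(totals.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in totals.values())
    return mean, sqrt(variance)


# Toy data (made up): 2 strata, 2 PSUs each, 2 respondents per PSU.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
w = [1.0] * 8
stratum = ["a", "a", "a", "a", "b", "b", "b", "b"]
psu = [1, 1, 2, 2, 1, 1, 2, 2]

mean, se = taylor_mean(y, w, stratum, psu)
print(mean, se)
```

The point estimate is the ordinary weighted mean; only the standard error depends on the strata and PSU columns, which is why dropping them silently changes the inference.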

Version Notes

This skill targets svy 0.13.0 (released 2026-03-25). svy supersedes samplics (archived 2026-03-10), an earlier library by the same author (Mamadou S. Diallo, Ph.D.). Key differences from samplics:

Unified Sample object replaces separate TaylorEstimator / ReplicateEstimator classes
Polars-native (samplics used numpy arrays)
Expanded GLM support and data I/O module (svy.io)
The API is substantially different from samplics — do not assume samplics patterns carry over

How to Use This Skill

Reference File Structure

| File | Purpose | When to Read |
|------|---------|--------------|
| estimation.md | Means, totals, proportions, ratios, medians, domain estimation, cross-tabs, hypothesis tests | Descriptive survey statistics |
| regression.md | Survey-weighted OLS, logistic, Poisson regression; extracting results; diagnostics | Survey regression models |
| design-weights.md | Design specification, replicate weights, weight manipulation, variance setup, survey data I/O, federal survey patterns | Setting up the survey design object |

Reading Order

New to svy? Start with design-weights.md then estimation.md
Need survey-weighted regression? Read design-weights.md then regression.md
Have replicate weights already? Read design-weights.md (replicate design section) then estimation.md or regression.md
Setting up a federal survey (NHANES, CPS, etc.)? Read design-weights.md (federal survey patterns table)
Coming from samplics? Read design-weights.md for the new API; the Sample object replaces TaylorEstimator/ReplicateEstimator

Related Skills

| Skill | Relationship |
|-------|-------------|
| data-scientist | Provides methodology guidance (especially survey-analysis.md); svy provides implementation. Load data-scientist for "when and why" to use survey methods |
| statsmodels | Complement for non-survey regression (OLS, GLM, time series, diagnostics). WLS in statsmodels is NOT survey-weighted regression: it does not account for stratification or clustering |
| pyfixest | Complement for fixed effects models and DiD. pyfixest does not handle complex survey designs; use svy for survey-weighted estimation, pyfixest for FE/DiD |
| linearmodels | Complement for panel models (RE, FD, Fama-MacBeth) and IV/GMM. Does not handle survey designs |
| polars | svy uses Polars DataFrames natively. Load polars skill for data preparation before passing to svy |

Quick Decision Trees

"I need to analyze survey data"

What task?
├─ Descriptive statistics (mean, total, proportion)
│   └─ ./references/estimation.md
├─ Regression model
│   ├─ Linear (continuous outcome) → ./references/regression.md
│   ├─ Logistic (binary outcome) → ./references/regression.md
│   └─ Poisson (count outcome) → ./references/regression.md
├─ Set up the survey design object
│   └─ ./references/design-weights.md
├─ Read survey data from SAS/SPSS/Stata
│   └─ ./references/design-weights.md
├─ Subpopulation / domain analysis
│   └─ ./references/estimation.md
└─ Cross-tabulation
    └─ ./references/estimation.md

"I need survey-weighted regression"

What model?
├─ Linear regression (continuous Y)
│   └─ family="gaussian" → ./references/regression.md
├─ Logistic regression (binary Y)
│   └─ family="binomial" → ./references/regression.md
├─ Poisson regression (count Y)
│   └─ family="poisson" → ./references/regression.md
├─ Ordinal logistic / Cox survival / IV
│   └─ Not in svy — use rpy2 + R survey package (see rpy2 bridge below)
└─ Fixed effects + survey weights
    └─ Not directly supported — see Boundaries below

"I need to set up variance estimation"

What do you have?
├─ Design variables (strata, PSU, weights)
│   └─ Taylor linearization → ./references/design-weights.md
├─ Pre-computed replicate weights
│   ├─ BRR weights → ./references/design-weights.md
│   ├─ Jackknife weights → ./references/design-weights.md
│   └─ Bootstrap weights → ./references/design-weights.md
├─ Need to create replicate weights from design
│   └─ ./references/design-weights.md
└─ Not sure what I have
    └─ Read survey documentation first → ./references/design-weights.md (federal survey table)
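
When a file ships pre-computed replicate weights, the variance comes from how much the estimate moves across replicates. The sketch below shows the standard BRR (with Fay's factor) and JK1 formulas in plain Python. It is an illustration and not svy's code: the function name `replicate_variance` and the numbers are hypothetical, and deviations are taken around the full-sample estimate, which is one common convention.

```python
def replicate_variance(theta_full, theta_reps, method="brr", fay_rho=0.0):
    """Variance of an estimate from its replicate estimates.

    theta_full: estimate computed with the full-sample weight
    theta_reps: the same estimate recomputed with each replicate weight
    fay_rho:    Fay coefficient for BRR (0.0 gives ordinary BRR)
    """
    r = len(theta_reps)
    if r == 0:
        raise ValueError("at least one replicate estimate is required")
    sum_sq = sum((t - theta_full) ** 2 for t in theta_reps)

    if method == "brr":
        if not 0.0 <= fay_rho < 1.0:
            raise ValueError("fay_rho must lie in [0, 1)")
        return sum_sq / (r * (1.0 - fay_rho) ** 2)
    if method == "jk1":
        return (r - 1) / r * sum_sq
    raise ValueError(f"unsupported method: {method!r}")


# Hypothetical full-sample estimate and four replicate estimates.
theta = 10.0
reps = [9.0, 11.0, 10.0, 10.0]

v_brr = replicate_variance(theta, reps, method="brr")
v_fay = replicate_variance(theta, reps, method="brr", fay_rho=0.5)
v_jk1 = replicate_variance(theta, reps, method="jk1")
print(v_brr, v_fay, v_jk1)
```

The method and the Fay coefficient are properties of how the data producer built the weights, so they should be taken from the survey documentation and never guessed.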

"I need descriptive statistics from a survey"

What statistic?
├─ Population mean → ./references/estimation.md
├─ Population total → ./references/estimation.md
├─ Proportion → ./references/estimation.md
├─ Ratio (Y/X) → ./references/estimation.md
├─ Median / quantile → ./references/estimation.md
├─ Cross-tabulation → ./references/estimation.md
├─ By subgroup (domain estimation) → ./references/estimation.md
└─ Hypothesis test (t-test) → ./references/estimation.md
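
The "by subgroup" branch matters because filtering rows before estimation can remove whole PSUs from the design. The following plain-Python sketch (not svy's implementation; `domain_mean` and the toy data are invented for illustration) keeps every sampled PSU by giving out-of-domain rows a score of zero. It assumes with-replacement PSUs and no FPC.

```python
from collections import defaultdict
from math import sqrt


def domain_mean(y, w, stratum, psu, in_domain):
    """Domain mean with a linearized SE that keeps every sampled PSU."""
    w_dom = sum(wi for wi, d in zip(w, in_domain) if d)
    mean = sum(wi * yi for yi, wi, d in zip(y, w, in_domain) if d) / w_dom

    psu_totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, h, j, d in zip(y, w, stratum, psu, in_domain):
        # Rows outside the domain score zero but still register their PSU.
        psu_totals[h][j] += (wi * (yi - mean) / w_dom) if d else 0.0

    variance = 0.0
    for h, totals in psu_totals.items():
        n_h = len(totals)
        if n_h < 2:
            raise ValueError(f"stratum {h!r} has a single PSU")
        z_bar = sum(totals.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in totals.values())
    return mean, sqrt(variance)


# Toy data (made up): domain members appear only in PSU 1 of each stratum.
y = [2.0, 4.0, 9.0, 9.0, 6.0, 8.0, 9.0, 9.0]
w = [1.0] * 8
stratum = ["a", "a", "a", "a", "b", "b", "b", "b"]
psu = [1, 1, 2, 2, 1, 1, 2, 2]
in_domain = [True, True, False, False, True, True, False, False]

dom_mean, dom_se = domain_mean(y, w, stratum, psu, in_domain)

# What a pre-filtered file would look like: count PSUs left per stratum.
kept = defaultdict(set)
for h, j, d in zip(stratum, psu, in_domain):
    if d:
        kept[h].add(j)
psus_after_filter = {h: len(js) for h, js in kept.items()}
print(dom_mean, dom_se, psus_after_filter)
```

In this toy case a pre-filtered file would leave one PSU per stratum, so the between-PSU variance could not be computed at all; the domain approach still returns a standard error.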

Boundaries

svy covers:

Design-based estimation (descriptive and regression) for complex surveys
Taylor and replicate-weight variance estimation
Domain/subpopulation analysis
Calibration and weight adjustment
Survey data I/O
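
Of the items above, calibration is the easiest to show in a few lines. This is a minimal post-stratification sketch in plain Python, assuming known population totals per cell. The function name `poststratify`, the cell labels, and the totals are hypothetical, and the code does not reflect svy's calibration API.

```python
from collections import defaultdict


def poststratify(weights, cells, population_totals):
    """Scale weights so each cell's weighted count matches its known total."""
    weighted_counts = defaultdict(float)
    for w, c in zip(weights, cells):
        weighted_counts[c] += w

    missing = set(weighted_counts) - set(population_totals)
    if missing:
        raise KeyError(f"no population total for cells: {sorted(missing)}")

    return [w * population_totals[c] / weighted_counts[c]
            for w, c in zip(weights, cells)]


# Made-up sample: three respondents in cell "f", one in cell "m".
base_weights = [10.0, 10.0, 10.0, 10.0]
cells = ["f", "f", "f", "m"]
totals = {"f": 60.0, "m": 40.0}

adjusted = poststratify(base_weights, cells, totals)
print(adjusted)
```

Raking repeats this adjustment over several margins in turn until the weights stabilize; GREG uses a regression-based adjustment instead.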

svy does NOT cover (use other tools):

Fixed effects models — use pyfixest (survey weights + FE is methodologically complex; consult data-scientist skill)
Panel data models (RE, FD, between) — use linearmodels
Difference-in-differences — use pyfixest
Causal inference methods (IV, RD, synthetic control) — use pyfixest/linearmodels/statsmodels
Time series analysis — use statsmodels
Machine learning — use scikit-learn
Ordinal logistic, Cox proportional hazards, negative binomial — use rpy2 + R survey package
Survey sampling design and sample size calculation — use data-scientist skill for methodology

The rpy2 Bridge

For models svy does not support (ordinal logistic, survival models, negative binomial GLM, cumulative link models), fall back to R's survey package via rpy2.

Decision rule: If the model family is not "gaussian", "binomial", or "poisson", use rpy2.

The R survey package (survey::svyglm, survey::svyolr, survey::svycoxph) covers the full range of survey-weighted models. Set up the survey design in R using the same design variables you would pass to svy.Design. See R survey package documentation at r-survey.r-forge.r-project.org for API details.
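
The decision rule above can be written as a small routing helper. This is only a sketch: `choose_backend`, `R_FUNCTIONS`, and the model-type keys are hypothetical names, and the R function names are the ones listed in this section.

```python
# Families this skill documents as supported by svy's GLM.
SVY_FAMILIES = frozenset({"gaussian", "binomial", "poisson"})

# R survey functions named in this section, keyed by hypothetical model labels.
R_FUNCTIONS = {
    "ordinal_logistic": "survey::svyolr",
    "cox": "survey::svycoxph",
}


def choose_backend(family):
    """Return "svy" for supported GLM families, otherwise "rpy2"."""
    return "svy" if family.strip().lower() in SVY_FAMILIES else "rpy2"


print(choose_backend("binomial"))
print(choose_backend("negative_binomial"))
print(R_FUNCTIONS["ordinal_logistic"])
```

Keeping the supported families in one place makes the fallback explicit in a script and avoids relying on an error from the fitting call.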

Legacy: samplics

samplics (2020-2026) is archived. svy supersedes it with a cleaner API, Polars integration, and expanded methods. If working with legacy code that uses samplics:

The API is substantially different — TaylorEstimator/ReplicateEstimator classes are replaced by svy.Sample
samplics used numpy arrays; svy uses Polars DataFrames
Consult samplics documentation at samplics-org.github.io/samplics/ for legacy reference
Migration requires rewriting, not find-and-replace

File-First Execution in Research Workflows

Important: In data research pipelines (see CLAUDE.md), svy analyses are executed through script files, not interactively. This ensures auditability and reproducibility.

The pattern:

Write the estimation or regression code to scripts/stage8_analysis/{step}_{task-name}.py
Execute it via Bash using the wrapper script that captures output automatically
Validation results are embedded in the script automatically as comments
If the run fails, create a versioned copy of the script and apply the fix there

Closely read agent_reference/SCRIPT_EXECUTION_REFERENCE.md for the mandatory file-first execution protocol. All survey analysis scripts must follow the Inline Audit Trail (IAT) standard — document design specification choices (why these strata/PSU/weights, what variance method, domain definitions) with # INTENT:, # REASONING:, and # ASSUMES: comments.
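
As an illustration of the three comment tags, the sketch below shows a script header that carries them and a small check that reports any that are missing. The helper `missing_iat_tags` and the wording of the example comments are hypothetical; the authoritative rules are in agent_reference/SCRIPT_EXECUTION_REFERENCE.md.

```python
REQUIRED_TAGS = ("# INTENT:", "# REASONING:", "# ASSUMES:")


def missing_iat_tags(script_text):
    """Return the IAT tags that never start a line of the script."""
    lines = [line.strip() for line in script_text.splitlines()]
    return [tag for tag in REQUIRED_TAGS
            if not any(line.startswith(tag) for line in lines)]


# Example header only; the script body is not executed here.
EXAMPLE_SCRIPT = """\
# INTENT: Estimate mean BMI for the NHANES examination sample.
# REASONING: Taylor linearization, since strata and PSU variables are available.
# ASSUMES: sdmvstra, sdmvpsu, and wtmec2yr are the right design variables.
import svy
"""

print(missing_iat_tags(EXAMPLE_SCRIPT))
print(missing_iat_tags("# INTENT: quick check\nprint(1)\n"))
```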

Quick Reference

Essential Import

import svy

Core Workflow

# 1. Load data
data = svy.io.read_stata("nhanes.dta")

# 2. Specify design
design = svy.Design(stratum="sdmvstra", psu="sdmvpsu", wgt="wtmec2yr")

# 3. Create sample object
sample = svy.Sample(data=data, design=design)

# 4. Estimate
mean_bmi = sample.estimation.mean("bmxbmi")
model = sample.glm.fit(y="bmxbmi", x=["ridageyr", svy.Cat("riagendr")], family="gaussian")

Core Operations

| Operation | Code |
|-----------|------|
| Design (Taylor) | svy.Design(stratum="s", psu="p", wgt="w") |
| Sample object | svy.Sample(data=df, design=design) |
| Mean | sample.estimation.mean("var") |
| Total | sample.estimation.total("var") |
| Proportion | sample.estimation.prop("var") |
| Ratio | sample.estimation.ratio(y="num", x="denom") |
| Median | sample.estimation.median("var") |
| Domain estimation | sample.estimation.mean("var", by="group") |
| Linear regression | sample.glm.fit(y="y", x=[...], family="gaussian") |
| Logistic regression | sample.glm.fit(y="y", x=[...], family="binomial") |
| Poisson regression | sample.glm.fit(y="y", x=[...], family="poisson") |
| Categorical predictor | svy.Cat("varname") |
| Read Stata | svy.io.read_stata("file.dta") |
| Read SAS | svy.io.read_sas("file.sas7bdat") |
| Read SPSS | svy.io.read_spss("file.sav") |

Topic Index

| Topic | Reference File |
|-------|---------------|
| Survey design setup | ./references/design-weights.md |
| Taylor linearization | ./references/design-weights.md |
| Replicate weights (BRR, jackknife, bootstrap) | ./references/design-weights.md |
| Fay's BRR modification | ./references/design-weights.md |
| Weight types and handling | ./references/design-weights.md |
| Federal survey design patterns | ./references/design-weights.md |
| Singleton PSU handling | ./references/design-weights.md |
| Calibration and post-stratification | ./references/design-weights.md |
| Reading SAS/SPSS/Stata files | ./references/design-weights.md |
| Population means | ./references/estimation.md |
| Population totals | ./references/estimation.md |
| Proportions | ./references/estimation.md |
| Ratios | ./references/estimation.md |
| Medians and quantiles | ./references/estimation.md |
| Domain / subpopulation estimation | ./references/estimation.md |
| Cross-tabulations | ./references/estimation.md |
| Survey-weighted t-tests | ./references/estimation.md |
| Design effects (DEFF) | ./references/estimation.md |
| Survey-weighted OLS | ./references/regression.md |
| Survey-weighted logistic regression | ./references/regression.md |
| Survey-weighted Poisson regression | ./references/regression.md |
| Extracting regression results | ./references/regression.md |
| Survey regression vs. WLS vs. cluster-robust | ./references/regression.md |
| Categorical predictors (svy.Cat) | ./references/regression.md |
| Model diagnostics in survey context | ./references/regression.md |
| rpy2 bridge to R survey package | ./references/regression.md |
| samplics migration | ./references/design-weights.md |
| Polars DataFrame integration | ./references/design-weights.md |

Citation

When this library is used as a primary analytical tool, include in the report's Software & Tools references:

Diallo, M.S. svy: Python package for complex survey sampling and analysis [Computer software]. (Formerly samplics.)

Cite when: svy is used for survey-weighted estimation with complex survey designs (strata, PSU, replicate weights).
Do not cite when: svy is only imported and no survey estimation is performed.

For method-specific citations (e.g., variance estimation techniques), consult the reference files in this skill and agent_reference/CITATION_REFERENCE.md.

