skills/sdv-synthetic-data/SKILL.md
Generate synthetic data with SDV (Synthetic Data Vault). Learn patterns from real data with machine learning and produce privacy-preserving synthetic data. Use cases: (1) single-table synthetic data generation, (2) multi-table (relational DB) synthetic data generation, (3) time-series synthetic data generation, (4) synthetic data quality evaluation, and (5) metadata and constraint setup.
Use SDV (Synthetic Data Vault) to generate high-quality synthetic data that learns patterns from real data.
Always read this section first and strictly follow the rules below.
- Do not read data files with the Read tool
  - .csv, .xlsx, .xls, .json, .pkl, .html, .txt
- Do not generate data preview code
  - print(data), print(df), data.head(), data.tail(), data.sample(), etc.
- Do not log actual data values
The scripts automatically detect data structure, so do not inspect data file contents or schema in advance. Follow the workflow, ask the user only for required confirmations (number of rows, seed value, etc.), and run the scripts directly.
| Format | Extension | Read Method |
|------|--------|-------------|
| CSV | .csv | pd.read_csv() |
| Excel | .xlsx, .xls | pd.read_excel() |
| JSON | .json | pd.read_json() |
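The mapping in the table above can be sketched as a small dispatch helper. This is a minimal illustration, not the scripts' actual loader: `READERS`, `reader_name_for`, and `load_table` are hypothetical names, and pandas is imported lazily so the extension lookup works on its own.

```python
from pathlib import Path

# Extension -> pandas reader name, taken from the table above.
READERS = {
    ".csv": "read_csv",
    ".xlsx": "read_excel",
    ".xls": "read_excel",
    ".json": "read_json",
}


def reader_name_for(path):
    """Return the pandas reader function name for a supported input file."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported input format: {suffix or '(no extension)'}")
    return READERS[suffix]


def load_table(path, **kwargs):
    """Load a table with the reader that matches its extension."""
    import pandas as pd  # imported lazily so the lookup works without pandas

    return getattr(pd, reader_name_for(path))(path, **kwargs)
```

Extra keyword arguments are passed through to the reader, so an Excel sheet could be selected with `load_table("book.xlsx", sheet_name="Sheet1")`.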
| File Type | Extension | Description |
|-------------|--------|------|
| Synthetic data | .csv, .xlsx, .json | Generated synthetic data |
| Synthesizer | .pkl | Trained model |
| Metadata | .json | Data structure definition |
| Quality report | .html, .txt | Evaluation report |
This skill should check the Python environment of the target project and run scripts in that environment.
Guideline (priority order):
1. If uv.lock or pyproject.toml exists, prioritize the uv environment and run with uv run python scripts/...
2. If .venv/ exists, use that virtual environment's Python (example: .venv/bin/python scripts/...)

Always choose execution commands to match the target project's environment (uv/venv/poetry/pipenv, etc.).
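The priority order could be checked programmatically along these lines. This is a sketch under stated assumptions: `python_command` is a hypothetical helper, the `.venv/bin/python` layout is the POSIX one from the example above, and the final fallback to the current interpreter is an assumption rather than something the skill prescribes.

```python
import sys
from pathlib import Path


def python_command(project_dir):
    """Return the command prefix used to run a script in the project's environment."""
    root = Path(project_dir)
    # 1. A uv-managed project: let uv resolve the environment.
    if (root / "uv.lock").exists() or (root / "pyproject.toml").exists():
        return ["uv", "run", "python"]
    # 2. A plain virtual environment in .venv/ (POSIX layout).
    venv_python = root / ".venv" / "bin" / "python"
    if venv_python.exists():
        return [str(venv_python)]
    # 3. Assumption: fall back to the interpreter running this helper.
    return [sys.executable]
```

The returned list can be extended with a script path and its options and then passed to `subprocess.run`.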
Follow these steps when generating synthetic data. All scripts accept options as runtime arguments, so complete all confirmations in Steps 1 to 4 before executing Step 5.
1. Confirm number of rows: Ask the user how many rows to generate (num_rows). Use the AskUserQuestion tool with the following prompt:
2. Confirm seed value: Confirm the seed value (seed) for reproducibility. Use the AskUserQuestion tool with the following prompt:
   - sample supports the seed argument (see script implementation)
3. Select synthesizer: Confirm the synthesizer type if needed
4. Confirm model and metadata saving: Ask whether trained artifacts should be saved for reuse. Use the AskUserQuestion tool with the following prompt:
   - See the script options (--save-model, --save-metadata) for save/load details
5. Generate data: Execute scripts by passing all confirmed options from Steps 1 to 4 (row count, seed value, synthesizer, save settings) as arguments
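One way to make sure every confirmation is complete before the generation step is to collect the answers in a single object. The sketch below is hypothetical: the `Confirmations` class is not part of the skill, it treats the synthesizer as a required answer for simplicity, and it assumes that `--save-model` and `--save-metadata` each take an output path and that `--synthesizer` takes the class name. Check the scripts for the real value formats.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Confirmations:
    """Answers collected in Steps 1 to 4; None means "not confirmed yet"."""

    rows: Optional[int] = None
    seed: Optional[int] = None
    synthesizer: Optional[str] = None
    save_model: Optional[str] = None      # assumed: output path, "" to skip saving
    save_metadata: Optional[str] = None   # assumed: output path, "" to skip saving

    def missing(self):
        """Names of the confirmations that still have to be asked."""
        return [name for name, value in vars(self).items() if value is None]

    def to_args(self):
        """Build the option list for Step 5; refuses to run with open questions."""
        if self.missing():
            raise ValueError(f"Unconfirmed options: {', '.join(self.missing())}")
        args = ["--rows", str(self.rows), "--seed", str(self.seed),
                "--synthesizer", self.synthesizer]
        if self.save_model:
            args += ["--save-model", self.save_model]
        if self.save_metadata:
            args += ["--save-metadata", self.save_metadata]
        return args
```

Calling `to_args()` while `missing()` is non-empty raises an error, which mirrors the rule that Step 5 only runs after Steps 1 to 4 are settled.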
Use scripts/generate_single_table.py to generate synthetic data for a single table.
Note: Scripts exist under .codex/skills/sdv-synthetic-data/scripts/. Do not create new scripts; execute existing scripts directly (example: uv run python .codex/skills/sdv-synthetic-data/scripts/generate_single_table.py ...).
| Synthesizer | Characteristics | Recommended Use |
|---|---|---|
| GaussianCopulaSynthesizer | Fast, transparent, customizable | Default choice |
| CTGANSynthesizer | Uses GAN, high fidelity | More complex patterns |
| TVAESynthesizer | Uses VAE, high fidelity | Complex patterns |
| CopulaGANSynthesizer | GaussianCopula + CTGAN | Hybrid |
For exact selection behavior, refer to the --synthesizer option in generate_single_table.py.
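For orientation, the synthesizers in the table can be instantiated through SDV's `sdv.single_table` module roughly as follows. This is a minimal sketch, not the code of generate_single_table.py: `make_synthesizer` and `fit_and_sample` are hypothetical helpers, the table name `"table"` is an arbitrary choice, and SDV is imported lazily inside the functions.

```python
SYNTHESIZERS = (
    "GaussianCopulaSynthesizer",
    "CTGANSynthesizer",
    "TVAESynthesizer",
    "CopulaGANSynthesizer",
)


def make_synthesizer(name, metadata, epochs=None):
    """Instantiate one of the single-table synthesizers listed above."""
    if name not in SYNTHESIZERS:
        raise ValueError(f"Unknown synthesizer: {name}")
    import sdv.single_table as single_table  # lazy import: SDV is only needed here

    kwargs = {}
    # epochs applies to the neural models; GaussianCopulaSynthesizer does not take it.
    if epochs is not None and name != "GaussianCopulaSynthesizer":
        kwargs["epochs"] = epochs
    return getattr(single_table, name)(metadata, **kwargs)


def fit_and_sample(real_data, name="GaussianCopulaSynthesizer", num_rows=100):
    """Detect metadata, train the chosen synthesizer, and sample new rows."""
    from sdv.metadata import Metadata

    metadata = Metadata.detect_from_dataframe(data=real_data, table_name="table")
    synthesizer = make_synthesizer(name, metadata)
    synthesizer.fit(real_data)
    return synthesizer.sample(num_rows=num_rows)
```

Starting with GaussianCopulaSynthesizer keeps training fast; the GAN and VAE models usually need more time and an explicit epoch budget.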
generate_single_table.py and generate_multi_table.py use Metadata.detect_from_dataframe(...).
If manual adjustment is needed, save metadata with --save-metadata, edit it, and load it later.
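The edit step can be done on the saved JSON directly. The sketch below assumes the file follows SDV's metadata layout, with a top-level `tables` key and a `columns` mapping per table; `set_column_sdtype` is a hypothetical helper, and the layout should be verified against a file the scripts actually produce before relying on it.

```python
import json
from pathlib import Path


def set_column_sdtype(metadata_path, table, column, sdtype, **extra):
    """Rewrite one column definition in a saved metadata JSON file."""
    path = Path(metadata_path)
    metadata = json.loads(path.read_text(encoding="utf-8"))
    # Assumed layout: {"tables": {<table>: {"columns": {<column>: {...}}}}}
    columns = metadata["tables"][table]["columns"]
    if column not in columns:
        raise KeyError(f"Column {column!r} not found in table {table!r}")
    columns[column] = {"sdtype": sdtype, **extra}
    path.write_text(json.dumps(metadata, indent=4), encoding="utf-8")
    return columns[column]
```

For example, `set_column_sdtype("metadata.json", "table", "joined", "datetime", datetime_format="%Y-%m-%d")` would turn a misdetected column into a datetime column with an explicit format.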
- numerical: Numeric
- datetime: Datetime (datetime_format required)
- categorical: Categorical
- boolean: Boolean
- id: Identifier (regex_format can define patterns)
- email, phone_number, ssn, etc.: PII auto-anonymization

Add constraints when a business rule must hold in 100% of the generated rows.
Constraint setup requires metadata edits or additional implementation. Extend scripts when needed.
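Until a constraint is wired into the scripts, a rule can at least be verified after generation. The helper below is a hypothetical check, not SDV's constraint mechanism: it only measures how often a rule is violated and does not enforce anything. It returns a single number and logs no data values.

```python
def violation_rate(rows, rule):
    """Fraction of rows (dicts) for which rule(row) is false."""
    rows = list(rows)
    if not rows:
        return 0.0
    failed = sum(1 for row in rows if not rule(row))
    return failed / len(rows)
```

With a DataFrame, the rows can be supplied as `df.to_dict("records")`; a result of `0.0` means the rule held for every generated row.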
Use generate_multi_table.py. Specify tables and relationships in the config file, and control generation volume with --scale.
generate_multi_table.py auto-detects each table with metadata.detect_from_dataframe(..., table_name=...) and applies primary_key and relationships from the config file. If needed, save metadata with --save-metadata and tune it later.
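To illustrate what scaling means, here is a minimal sketch built on SDV's `HMASynthesizer`. It is not the script's implementation: it assumes HMASynthesizer is an acceptable model, it relies on `Metadata.detect_from_dataframes` to detect the whole dataset at once instead of applying keys and relationships from a config file, and `scaled_row_counts` is a hypothetical helper that only approximates the resulting table sizes.

```python
def scaled_row_counts(row_counts, scale):
    """Approximate rows per table for a given scale (1.0 keeps the original size)."""
    if scale <= 0:
        raise ValueError("scale must be greater than 0")
    return {table: max(1, round(count * scale)) for table, count in row_counts.items()}


def generate_multi_table(tables, scale=1.0):
    """Train HMASynthesizer on a dict of DataFrames and sample a scaled dataset."""
    from sdv.metadata import Metadata
    from sdv.multi_table import HMASynthesizer

    # Detects columns across all tables; review detected keys and relationships.
    metadata = Metadata.detect_from_dataframes(data=tables)
    synthesizer = HMASynthesizer(metadata)
    synthesizer.fit(tables)
    return synthesizer.sample(scale=scale)
```

Child table sizes depend on the learned parent-child structure, so the counts from `scaled_row_counts` are a planning estimate rather than a guarantee.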
Time-series generation is currently unsupported by existing scripts. Add a new script if needed.
Use evaluate_quality.py for quality evaluation. Diagnostic reports are reviewed at execution time and can be explicitly output with --diagnostic or --diagnostic-output.
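The underlying SDV calls look roughly like the sketch below. It is an illustration rather than the content of evaluate_quality.py: `evaluate` and `format_scores` are hypothetical helpers, and how the script renders its .html and .txt output is not shown here. Only aggregate scores are returned, so no data values are exposed.

```python
def evaluate(real_data, synthetic_data, metadata, report_path=None):
    """Run SDV's diagnostic and quality reports and return their overall scores."""
    from sdv.evaluation.single_table import evaluate_quality, run_diagnostic

    diagnostic = run_diagnostic(real_data, synthetic_data, metadata, verbose=False)
    quality = evaluate_quality(real_data, synthetic_data, metadata, verbose=False)
    if report_path is not None:
        quality.save(filepath=report_path)  # pickled report object, not HTML
    return {"diagnostic": diagnostic.get_score(), "quality": quality.get_score()}


def format_scores(scores):
    """Render overall scores as plain text lines (no data values involved)."""
    return [f"{name}: {score:.1%}" for name, score in sorted(scores.items())]
```

The diagnostic score checks basic validity and structure, while the quality score summarizes statistical similarity, so a low diagnostic score is worth investigating before the quality score.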
Use each script's --save-model / --save-metadata options for save/load workflows.
See reusable scripts in scripts/:
- generate_single_table.py: Single-table synthetic data generation (--rows, --seed, --synthesizer, --epochs, --save-model, --save-metadata)
- generate_multi_table.py: Multi-table synthetic data generation (--config, --output-dir, --output-format, --scale, --seed, --save-model, --save-metadata)
- evaluate_quality.py: Quality report generation (--output, --diagnostic, --diagnostic-output)
- sample_rows.py: Sample rows from input data (--rows or --fraction, --replace, --seed, --sheet)
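As an illustration of the "--rows or --fraction" choice, a CLI of this shape could be parsed with argparse as sketched below. The parser is hypothetical and is not taken from sample_rows.py: it assumes the two size options are mutually exclusive and that one is required, and it leaves out the input path and --sheet because their exact form is not documented here.

```python
import argparse


def build_parser():
    """Parser for a sample_rows-style CLI: exactly one of --rows / --fraction."""
    parser = argparse.ArgumentParser(description="Sample rows from input data")
    size = parser.add_mutually_exclusive_group(required=True)
    size.add_argument("--rows", type=int, help="absolute number of rows to sample")
    size.add_argument("--fraction", type=float, help="fraction of rows to sample")
    parser.add_argument("--replace", action="store_true", help="sample with replacement")
    parser.add_argument("--seed", type=int, default=None, help="random seed")
    return parser


def sample_kwargs(args):
    """Translate parsed options into keyword arguments for DataFrame.sample."""
    return {
        "n": args.rows,
        "frac": args.fraction,
        "replace": args.replace,
        "random_state": args.seed,
    }
```

The resulting dictionary can be unpacked into pandas, as in `df.sample(**sample_kwargs(args))`, which accepts either `n` or `frac` but not both.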