skills/ds-plan/SKILL.md
Phase 2 of the /ds workflow — data profiling and task breakdown. Invoked by the ds-brainstorm chain; not user-invocable.
npx skillsauth add edwinhu/workflows ds-planInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Announce: "Using ds-plan (Phase 2) to profile data and create task breakdown."
| Level | Remaining Context | Action | |-------|------------------|--------| | Normal | >35% | Proceed normally | | Warning | 25-35% | Complete current profiling task, then trigger ds-handoff | | Critical | ≤25% | Immediately trigger ds-handoff — do not start new profiling |
Profile the data and create an analysis plan based on the spec.
Requires .planning/SPEC.md from /ds first.
Load shared enforcement first.
Auto-load all constraints matching applies-to: ds-plan:
!uv run python3 ${CLAUDE_SKILL_DIR}/../../scripts/load-constraints.py ds-plan
You MUST have these constraints loaded before proceeding. No claiming you "remember" them. The ds-external-skill-discovery constraint governs Step 5b (External Skill Discovery Gate); ds-data-pull-profile governs Step 5c (Data Pull Profiling Gate).
SPEC MUST EXIST BEFORE PLANNING. This is not negotiable.
Before exploring data or creating tasks, you MUST have:
.planning/SPEC.md with objectives and constraintsIf .planning/SPEC.md doesn't exist, run /ds first.
</EXTREMELY-IMPORTANT>
.head() samples the clean front of the file — nulls, type drift, and grain problems live in the tail and in rare groups. A plan built from a head-sample is built on assumptions; it crashes 3 tasks into implementation and the user redoes hours of work. Delivering that plan fast is not helpful — it is counterproductive, and it reads as incompetent.After writing .planning/PLAN.md and initializing .planning/LEARNINGS.md, IMMEDIATELY discover and load ds-implement:
Read ${CLAUDE_SKILL_DIR}/../../skills/ds-implement/SKILL.md and follow its instructions.
DO NOT:
The workflow phases are SEQUENTIAL. Complete plan → immediately start implement.
| DO | DON'T | |-------|----------| | Read .planning/SPEC.md | Skip brainstorm phase | | Profile data (shape, types, stats) | Skip to analysis | | Identify data quality issues | Ignore missing/duplicate data | | Create ordered task list | Write final analysis code | | Write .planning/PLAN.md | Make completion claims |
Brainstorm answers: WHAT and WHY Plan answers: HOW and DATA QUALITY
This flowchart IS the specification. If prose elsewhere and this diagram disagree, the diagram wins. The two sub-gates (5b External Skill Discovery, 5c Data Pull Profiling) and the exit Plan Review are mandatory when their triggers fire — they are not optional steps a fast path can skip.
1. Verify SPEC.md exists ──(missing)──▶ STOP, run /ds first
│
▼
2. Profile data ──(2+ sources)──▶ parallel read-only profiler per source
│
▼
3. Identify DQ issues (nulls, dups, row counts)
│
▼
4. ETL strategy ──(heavy ETL trigger)──▶ server-side / chunked plan
│
▼
5b. External Skill Discovery ──(SPEC names wrds/gemini-batch/etc.)──▶ Glob refs/examples, ADOPT/PATCH
│
▼
5c. Data Pull Profiling gate ──(source ≥50M rows / ≥500MB / "large")──▶ read-only size profile → decision table
│
▼
6. Task breakdown (each task carries implements: [REQ-ID])
│
▼
7. Write .planning/PLAN.md
│
▼
Exit gate ──▶ dispatch ds-plan-reviewer ──(ISSUES)──▶ fix PLAN.md, re-dispatch (max 5)
└──(APPROVED)──▶ ds-implement
cat .planning/SPEC.md # verify-spec: read SPEC file to confirm it exists
If missing, stop and run /ds first.
For multiple data sources: Profile in parallel using background Task agents.
MANDATORY profiling steps:
import pandas as pd
# Basic structure
df.shape # (rows, columns)
df.dtypes # Column types
df.head(10) # Sample data
df.tail(5) # End of data
# Summary statistics
df.describe() # Numeric summaries
df.describe(include='object') # Categorical summaries
df.info() # Memory, non-null counts
# Data quality checks
df.isnull().sum() # Missing values per column
df.duplicated().sum() # Exact-duplicate rows (byte-identical)
df[col].value_counts() # Distribution of categories
# Grain / candidate-key identification (REQUIRED — do not skip)
# Profiling MUST output the row grain, not just a dup count. An all-columns
# df.duplicated() is unreliable in BOTH directions: it misses near-duplicates
# (amended/restated records that changed one field), AND it reports zero dupes
# after a join fan-out — fanned rows differ in the joined columns, so only a
# KEYED check (subset=grain) reveals them. Reporting "no duplicates" from the
# all-columns check is a false clean signal, not a verification.
# Identify the key empirically AND check it against the declared grain.
from itertools import combinations
cand = [c for c in df.columns if df[c].notna().any()]
for k in (1, 2, 3): # smallest unique column-set = de-facto PK
hit = next((c for c in combinations(cand, k)
if not df.duplicated(subset=list(c)).any()), None)
if hit:
print("candidate key:", hit); break
# Declared grain: look it up in the dataset's reference skill (e.g. wrds
# insider-form4.md → row PK (dcn, seqnum); event key (personid, trandate, ...)).
# Record BOTH the row PK and the coarser business/event key in PLAN.md.
df.duplicated(subset=DECLARED_PK).sum() # MUST be 0, else extraction fanned out
df.groupby(BUSINESS_KEY).size().gt(1).sum() # business-key collisions = restatement/amendment signal
# For time series
df[date_col].min(), df[date_col].max() # Date range
df.groupby(date_col).size() # Records per period
Use run_in_background: true for parallel execution.
When profiling 2+ data sources, launch agents in parallel: </EXTREMELY-IMPORTANT>
# PARALLEL + BACKGROUND: All Task calls in ONE message
Task(
subagent_type="general-purpose",
description="Profile dataset 1",
run_in_background=true,
# STRUCTURAL read-only enforcement — not advisory prose. Profiling is a
# read-only verification step; Write/Edit/NotebookEdit are withheld at the
# tool layer so a profiler CANNOT mutate pipeline files even if the prompt
# is ignored (P17 — agent tool restrictions are structural, never prose).
allowed_tools=["Read", "Glob", "Grep", "Bash"],
prompt="""
Profile this dataset and return a data quality report.
Dataset: /path/to/dataset1.csv
Required checks:
1. Shape: rows x columns
2. Data types: df.dtypes
3. Missing values: df.isnull().sum()
4. Exact-duplicate rows: df.duplicated().sum() — lower bound only; the keyed checks in 5-6 are authoritative (all-columns dedup reports zero on join fan-out)
5. GRAIN / candidate key: find the smallest column-set that is unique (the de-facto
primary key). If this dataset comes from a known source (WRDS, etc.), look up its
DECLARED grain in that source's reference skill and verify df.duplicated(subset=PK)==0.
6. Business/event-key collisions: pick the coarser real-world key and report rows that
share it but are NOT byte-identical — these are restatements/amendments/corrections
that df.duplicated() misses (e.g. Form 4 4/A re-filings).
7. Summary statistics: df.describe()
8. Unique value counts for categorical columns
9. Date range if time series
10. Memory usage: df.info()
Output format:
- Markdown table with column summary
- The row primary key and the business/event key you identified
- List of data quality issues found (call out any key-uniqueness or amendment/restatement findings)
- Recommendations for cleaning
Read-only profiling: you have Read/Glob/Grep/Bash only (enforced via allowed_tools).
""")
Task(
subagent_type="general-purpose",
description="Profile dataset 2",
run_in_background=true,
prompt="""
[Same template for dataset 2]
""")
Task(
subagent_type="general-purpose",
description="Profile dataset 3",
run_in_background=true,
prompt="""
[Same template for dataset 3]
""")
After launching agents:
# If a profiling agent runs a heavy script, monitor its progress
Monitor(
description="Profile large dataset progress",
timeout_ms=600000, persistent=false,
command="tail -f /tmp/profile_dataset1.log 2>/dev/null | grep --line-buffered -E '(rows|shape|complete|error)'"
)
Note: Background agents already notify on completion. Use Monitor only when you need streaming progress from a specific long-running script within the agent's work.
Benefits:
CRITICAL: Document ALL issues before proceeding:
| Check | What to Look For | |-------|------------------| | Missing values | Null counts, patterns of missingness | | Duplicates | Exact duplicates, key-based duplicates | | Outliers | Extreme values, impossible values | | Type issues | Strings in numeric columns, date parsing | | Cardinality | Unexpected unique values | | Distribution | Skewness, unexpected patterns |
Triggers when profiling reveals ANY of:
If triggered, assess these dimensions before creating the task breakdown.
If WRDS data is involved, also load the PostgreSQL vs SAS decision guide: Read ${CLAUDE_SKILL_DIR}/../../skills/wrds/references/postgres-vs-sas.md — use the decision flowchart to assign each ETL task to the right engine.
The anti-pattern: Pull entire tables into memory, then filter in pandas/R/SAS.
AskUserQuestion(questions=[{
"question": "Where should filtering happen for this data?",
"header": "Filtering",
"options": [
{"label": "Database-level (Recommended)", "description": "SQL WHERE clauses filter at source. Only matching rows transfer. Required for >1M row tables."},
{"label": "Application-level", "description": "Pull full dataset, filter in code. Only acceptable for small tables (<100K rows) or when database access is read-once."},
{"label": "Hybrid", "description": "Coarse filter at database (date range, key columns), fine filter in code (complex logic, cross-table conditions)."}
],
"multiSelect": false
}])
Document in PLAN.md: For each data source, specify WHERE the filtering happens and WHY.
The anti-pattern: Process years/groups sequentially when they're embarrassingly parallel.
Identify parallelizable dimensions from profiling:
AskUserQuestion(questions=[{
"question": "How should parallelizable tasks be executed?",
"header": "Parallelism",
"options": [
{"label": "Background Task agents (Recommended)", "description": "Spawn parallel Task agents for independent groups/years. Best for in-session work with Claude."},
{"label": "SGE array jobs", "description": "Submit as array jobs to grid scheduler. Best for WRDS/HPC cluster workloads."},
{"label": "Sequential", "description": "Process one at a time. Only when tasks have dependencies or parallelism isn't worth the overhead."}
],
"multiSelect": false
}])
Document in PLAN.md: For each task, note if it can be parallelized, on what dimension, and the chosen execution method.
When splitting by time ranges: Profile row counts per period BEFORE choosing splits. Data volume often grows exponentially — equal-width year ranges produce wildly unequal workloads. Query SELECT year, COUNT(*) FROM table GROUP BY year first, then split so each chunk has roughly equal row counts.
Shared-source contention check: If parallel workers all read the same large file (NFS, shared disk), they may contend on I/O and run SLOWER than sequential. Pattern: add a single-reader pre-split step that reads the source once, writes partitions to intermediate storage, then parallel workers each read their own partition. See etl-enforcement.md § Parallelism for the full checklist.
The anti-pattern: Re-read and re-process the same large source file in every task.
If multiple tasks read from the same large source:
AskUserQuestion(questions=[{
"question": "What format should be used for intermediate results?",
"header": "Cache format",
"options": [
{"label": "Parquet (Recommended)", "description": "Columnar, compressed, preserves dtypes. Best for tabular data. ~10x smaller than CSV."},
{"label": "CSV", "description": "Universal, human-readable. Use when downstream tools require CSV or data is small."},
{"label": "SQLite", "description": "Queryable intermediate storage. Best when downstream tasks need filtered reads from the same intermediate."},
{"label": "No caching needed", "description": "Each task reads from a different source, or sources are small enough to re-read."}
],
"multiSelect": false
}])
Document in PLAN.md: Data flow diagram showing which tasks produce intermediates, which consume them, and the storage format.
The anti-pattern: Submit the full batch (21K documents, 50M rows, $500 API call) without testing at small scale first. One bad schema, wrong prompt, or misconfigured parameter = entire batch wasted.
This is TDD for ETL: fail at 10 items in minutes, not at 21,000 items in hours.
Triggers when ANY task involves:
For each expensive task, ask the user how to scale up:
AskUserQuestion(questions=[{
"question": "How should we scale up testing for this batch/ETL operation?",
"header": "Scale-up",
"options": [
{"label": "Full scale-up (Recommended)", "description": "4 stages: 10 → 100 → 1,000 → full. Maximum safety for large batches (>5,000 items)."},
{"label": "Standard scale-up", "description": "3 stages: 10 → 100 → full. Good for medium batches (500-5,000 items)."},
{"label": "Minimal scale-up", "description": "2 stages: 10 → full. Quick validation for small batches (<500 items) or low-cost operations."},
{"label": "Custom stages", "description": "Define custom batch sizes and gate criteria for this specific pipeline."}
],
"multiSelect": false
}])
Then define the plan:
## ETL Strategy
<!-- Include this section when data > 1M rows or multiple sources -->
### Filter Strategy
| Source | Rows | Filter Location | Filter Columns | Justification |
|--------|------|-----------------|----------------|---------------|
| source1 | 5M | Database (SQL WHERE) | date, type | Too large for full pull |
| source2 | 50K | Application (pandas) | — | Small enough for full load |
### Parallelism Plan
| Task | Parallelizable? | Dimension | Method | Contention Risk |
|------|----------------|-----------|--------|-----------------|
| Task 1 | Yes | By year (2003-2023) | Background Task agents / SGE array | See Split Plan |
| Task 2 | No | — | Sequential (depends on Task 1 output) | N/A |
### Split Plan (if parallel tasks read same source)
<!-- Profile row counts first: SELECT year, COUNT(*) FROM table GROUP BY year -->
<!-- Then balance by row count, not year count -->
| Range | Rows | Size (est.) | Rationale |
|-------|------|-------------|-----------|
| 2003-2010 | 34M | 5.4GB | 8 years, low volume (~4M/yr) |
| 2011-2016 | 27M | 4.3GB | 6 years, moderate growth |
| 2017-2018 | 30M | 4.7GB | 2 years, volume explosion |
| 2019 | 22M | 3.6GB | 1 year, high volume |
| ... | ... | ... | 1 year each for high-volume years |
**Contention mitigation:** If source is large file on shared storage (NFS, network drive), add pre-split step that reads once via alternative path (database, API) and writes partitions to fast intermediate storage.
### Data Flow
source1.csv → [Task 1: Clean] → clean_source1.parquet → [Task 2: Merge]
source2.csv → [Task 1: Clean] → clean_source2.parquet ↗
→ [Task 3: Analyze] → results
<EXTREMELY-IMPORTANT>
### ETL Facts (incident-derived)
Checkpoint type: decision (user chooses approach — cannot auto-advance)
Before creating tasks, determine the implementation language for ETL and analysis:
AskUserQuestion(questions=[{
"question": "What language will be used for data processing / ETL?",
"header": "Language",
"options": [
{"label": "Python (Recommended)", "description": "pandas/polars in notebooks or scripts. Default for most analysis."},
{"label": "SAS", "description": "SAS on WRDS grid (qsas/qsub). For large-scale WRDS ETL with hash merges and SGE parallelism."},
{"label": "R", "description": "R scripts or notebooks. For statistical modeling."},
{"label": "Mixed", "description": "SAS for ETL, Python/R for analysis. Common for WRDS pipelines."}
],
"multiSelect": false
}])
If SAS or Mixed is selected:
Implementation Language: SAS (or Mixed: SAS ETL + Python analysis) in PLAN.md header${CLAUDE_SKILL_DIR}/../../skills/wrds/references/sas-etl.md and follow its instructions.${CLAUDE_SKILL_DIR}/../../skills/wrds/references/postgres-vs-sas.md. Use this to assign each ETL task to PostgreSQL or SAS based on the decision flowchart. Document the choice and rationale per task in PLAN.md.## SAS Performance Constraints section to PLAN.md (see template below)If any task will touch an external plugin skill (WRDS, gemini-batch, lseg-data, nlm, readwise, pdf, docx, pptx, xlsx, bluebook, etc.), you MUST complete the discovery checklist for each such skill before drafting tasks. Loading only rule references (e.g. sas-etl.md, postgres-vs-sas.md) is necessary but NOT sufficient. Rule refs teach syntax; domain refs teach the recipe; examples/ contains battle-tested implementations.
Skipping this is NOT HELPFUL — you will draft greenfield code that duplicates (worse than) a tested pipeline already sitting in skills/<skill>/examples/. Days of reinvention avoidable in 5 minutes.
</EXTREMELY-IMPORTANT>
List every external plugin skill any task will touch. Include any skill whose references/ or examples/ directory might contain relevant material — not just the skill you planned to load rule refs from.
1. IDENTIFY references
Glob skills/X/references/*.md # enumerate every reference
2. LOAD domain-specific references
For each task, map the data/task domain to reference filenames by name:
WRDS holdings/ownership → tfn-ownership.md
WRDS voting → iss-voting.md
WRDS TAQ microstructure → taq.md
WRDS Compustat → compustat.md
WRDS insider → insider-form4.md
WRDS EDGAR filings → edgar.md
WRDS ExecuComp → execucomp.md
WRDS ISS compensation → iss-compensation.md
WRDS ISS directors → iss-directors.md
WRDS SDC M&A / issuances → sdc-ma.md / sdc-issuances.md
WRDS PitchBook → pitchbook.md
WRDS LPC Dealscan → lpc-dealscan.md
WRDS FJC courts → fjc.md
WRDS FISD bonds → fisd-bonds.md
WRDS Form D / Reg D → formd.md
WRDS fund formation → fund-formation.md
(other skills: map by filename / domain match)
Read every matched domain reference in full. Rule refs alone are not enough.
3. IDENTIFY examples
Glob skills/X/examples/** # enumerate every prior pipeline
4. READ matching example READMEs
For every example whose directory or filename matches the task domain
(e.g. "ownership", "voting", "insider", "batch", "dealscan"), Read its
README.md (or the top-level file if no README) in full.
5. DECIDE ADOPT / PATCH / GREENFIELD per task
- ADOPT: the example matches exactly — reuse verbatim.
- PATCH: the example is close — reuse with a documented delta.
- GREENFIELD: no example or domain ref applies. Must justify why.
Every task that touches an external skill MUST have an entry in the External Skill Discovery section of PLAN.md (template below). The plan reviewer and the check script (ds-external-skill-discovery.py) both check for this section; a missing or stub section blocks progress to ds-implement.
1. IDENTIFY: list of external skills in play (Step 5b.1)
2. RUN: Glob + Read steps per skill (Step 5b.2)
3. READ: every matched domain reference and example README
4. VERIFY: External Skill Discovery section of PLAN.md documents
ADOPT/PATCH/GREENFIELD decision per relevant task
5. CLAIM: only then proceed to Task Breakdown
If no external skills are in play, note this explicitly in PLAN.md's External Skill Discovery section ("No external skills referenced — greenfield Python analysis only"). Do not skip the section.
sas-etl.md, postgres-vs-sas.md) teach syntax; domain refs teach the recipe; examples/ holds battle-tested implementations. Planning from rule refs alone produces greenfield code that duplicates a tuned pipeline already on disk — days of reinvention avoidable in 5 minutes, which is anti-helpful however diligent it feels.Agents systematically underestimate data size and overlook aggregate-vs-raw trade-offs. Shipping a plan with an ungated pull-raw decision for a ≥50M row or ≥500 MB source means the user discovers the waste at implementation time — after the pull has kicked off, after downstream code assumes raw rows, after hours of rework become days.
Skipping this profile is NOT HELPFUL — your plan's row-count estimate is a guess, and a guess at 150M-row scale costs days of rework when wrong. The 10-minute profile makes the pull-raw vs aggregate-at-source vs server-side-pipeline decision data-driven instead of guessed. </EXTREMELY-IMPORTANT>
Full rule: references/constraints/ds-data-pull-profile.md (loaded above).
Fire this gate when any of the following is true of any data source in SPEC.md or the draft PLAN.md:
Fire liberally. A 3-minute profile on a source that turned out to be 40M rows costs nothing; a missed profile on a source that turned out to be 150M costs days.
If no source triggers, note this in PLAN.md's Data Pull Profile section with one line ("No source exceeded 50M rows or 500 MB thresholds — profiling gate not triggered") and proceed. Do not skip the section header.
For every triggered source, dispatch a profiling subagent. The subagent is read-only (Read, Grep, Glob, Bash for SQL/metadata queries; no Write to pipeline files).
Task(
subagent_type="general-purpose",
description="Profile data pull size vs aggregate trade-off",
prompt="""
Profile the following data source(s) for raw-vs-aggregate ship size trade-offs.
This is a READ-ONLY profiling pass. Do NOT pull the full table.
Sources to profile (from draft PLAN.md):
- <source 1>: <planned WHERE filter>
- <source 2>: <planned WHERE filter>
- ...
For EACH source, perform:
1. COUNT(*) with the planned WHERE filter. No full table pull.
Example: SELECT COUNT(*) FROM risk.voteanalysis_npx v
JOIN risk.vavoteresults r USING (itemonagendaid)
WHERE r.meetingtype IN (...) AND v.meetingdate BETWEEN ...
2. Fetch ~100K-row sample (stratified by year/partition key if possible),
write to scratch/ as parquet with the project's codec (zstd or snappy —
check the existing pipeline for which). Measure bytes-per-row from file
size. Delete the sample after measurement.
3. For EACH candidate aggregation level in the draft PLAN.md, run:
SELECT <agg_keys>, COUNT(*), SUM(<metric>) FROM <source>
WHERE <filter> GROUP BY <agg_keys>
Record the aggregate row count.
4. Information-preservation check: for each aggregation level, list which
columns survive and which are lost. Flag any aggregation that drops
columns needed by downstream tasks (e.g., fundid/permno/wficn for block
classification, ticker for cross-section panels).
5. Compute ratio = raw_rows / aggregate_rows per aggregation level.
6. Write docs/investigations/YYYY-MM-DD_pull_profile.md with:
- Machine-readable decision table (schema below)
- Bytes/row calibration notes (codec, sample size, stratification)
- Per-aggregation information-preservation notes
- Final recommendation per source:
pull-raw / SQL GROUP BY / server-side pipeline (SAS-on-WRDS, BigQuery, etc.) / hybrid
Decision table schema (required):
| Source | Raw rows | Raw MB | Aggregate level | Aggregate rows | Aggregate MB | Ratio | Recommendation |
Ratio rule of thumb:
< 10x -> pull-raw is usually fine
10-100x -> server-side aggregation wins on transfer; prefer SQL GROUP BY
> 100x -> pull-raw is malpractice UNLESS downstream needs raw rows
(information-preservation check must justify it)
Do NOT write to pipeline files. Only docs/investigations/ and scratch/.
Return the path to the investigation file when done.
"""
)
Parallelize across sources. If 3 sources trigger the gate, launch 3 profiling subagents in a single message with run_in_background=true — same pattern as Step 2 parallel profiling.
After the profiling subagent(s) complete, read the investigation file(s) and record the decision in PLAN.md under a ## Data Pull Profile section (template below). The section must include:
pull-raw — information-preservation trumps ratio)See: docs/investigations/YYYY-MM-DD_pull_profile.mdThe check script ds-data-pull-profile.py enforces this section — a missing or stub Data Pull Profile section when triggers fired blocks progress to ds-implement.
1. IDENTIFY: list of triggered sources (Step 5c.1)
2. RUN: read-only profiling subagent per source (Step 5c.2)
3. READ: every investigation file produced
4. VERIFY: PLAN.md ## Data Pull Profile section contains the decision
table with required columns and per-source justification
5. CLAIM: only then proceed to Task Breakdown
fundid/wficn. Recommending a strategy from ratio without checking preserved columns is a shortcut dressed as analysis — profile BOTH and record both in the decision table.Break analysis into ordered tasks:
Write to .planning/PLAN.md:
---
phase: ds-plan
status: completed
implements: [all requirement IDs from SPEC.md]
requires: [.planning/SPEC.md]
provides: [.planning/PLAN.md, .planning/LEARNINGS.md]
affects: [.planning/]
tags: [planning, data-profiling]
---
# Analysis Plan: [Analysis Name]
> **For Claude:** REQUIRED SUB-SKILL: Discover and load ds-implement for output-first verification:
>Read `${CLAUDE_SKILL_DIR}/../../skills/ds-implement/SKILL.md` and follow its instructions.
>
> **Delegation:** Main chat orchestrates, Task agents implement. Discover and load ds-delegate:
>Read `${CLAUDE_SKILL_DIR}/../../skills/ds-delegate/SKILL.md` and follow its instructions.
## Spec Reference
See: .planning/SPEC.md
## Data Profile
### Source 1: [name]
- Location: [path/connection]
- Shape: [rows] x [columns]
- Date range: [start] to [end]
- Key columns: [list]
#### Column Summary
| Column | Type | Non-null | Unique | Notes |
|--------|------|----------|--------|-------|
| col1 | int64 | 100% | 50 | Primary key |
| col2 | object | 95% | 10 | Category |
#### Data Quality Issues
- [ ] Missing: col2 has 5% nulls - [strategy: drop/impute/flag]
- [ ] Duplicates: 100 duplicate rows on [key] - [strategy]
- [ ] Outliers: col3 has values > 1000 - [strategy]
### Source 2: [name]
[Same structure]
## External Skill Discovery
<!-- Required. If no external skills are referenced, state so explicitly. -->
<!-- For each external skill in play, record Glob results, loaded refs, example READMEs read, and ADOPT/PATCH/GREENFIELD decision per task. -->
### Skills in play
- [skill-name] — tasks: [list task IDs]
### Per-skill discovery
#### skills/[skill-name]
- **References globbed:** skills/[skill-name]/references/*.md
- **Domain refs loaded:** [e.g. tfn-ownership.md, sas-etl.md]
- **Examples globbed:** skills/[skill-name]/examples/**
- **READMEs read:** [e.g. examples/voting_ownership_pipeline/README.md]
- **Decisions:**
| Task | Decision | Example Path | Delta (for PATCH) / Justification (for GREENFIELD) |
|------|----------|--------------|-----------------------------------------------------|
| Task 2 | ADOPT | skills/wrds/examples/voting_ownership_pipeline/build_inst_own.sas | — |
| Task 3 | PATCH | skills/wrds/examples/voting_ownership_pipeline/merge_panel.py | New date window 2020Q1-2024Q4; add new classification column |
| Task 4 | GREENFIELD | — | No example covers this specific aggregation; domain ref tfn-ownership.md §4 gives the SAS pattern |
## Data Pull Profile
<!-- Required when any source >= 50M rows, >= 500 MB ship size, or SPEC uses large-source keywords. -->
<!-- If no source triggered, state so explicitly in one line and omit the decision table. -->
<!-- Otherwise include the decision table AND per-source justification. -->
See: docs/investigations/YYYY-MM-DD_pull_profile.md
### Decision Table
| Source | Raw rows | Raw MB | Aggregate level | Aggregate rows | Aggregate MB | Ratio | Recommendation |
|--------|---------:|-------:|-----------------|---------------:|-------------:|------:|----------------|
| source1 | 144M | 720 | (meeting_id, item_id, vote) | 1.62M | 50 | 89x | pull-raw |
| source2 | 245M | 2500 | (permno, rqdate) | 450K | 18 | 540x | server-side pipeline (SAS-on-WRDS) |
### Per-Source Justification
- **source1:** Ratio 89x favors aggregate, BUT aggregate drops `fundid` required by Task 5 block classification. pull-raw despite ratio.
- **source2:** Ratio 540x; aggregate preserves (permno, rqdate) — everything downstream needs. SAS-on-WRDS pipeline ships 18 MB result instead of 2.5 GB raw.
## Task Breakdown — MANDATORY EXECUTABLE TABLE
> **This table is the machine-executable spec.** `ds-implement` reads it directly: it topologically sorts `Deps` (the data-flow DAG — which intermediates a task consumes) into levels, runs each level's tasks output-first (produce the `Outputs`, then run the `Verify` assertion), and gates each task on its `Verify` exit code. **A plan without a complete table is not executable — `ds-plan-executable-guard.py` blocks `PLAN_REVIEWED.md` until every row is filled.** (ds is output-first, not TDD: the `Verify` command is the per-task mechanical gate; `Expected Output` is the human-readable claim that `ds-validate-coverage` reviews per requirement.)
>
> **Every task MUST be one table row** (no prose `### Task N` headers carrying the work). Every column is REQUIRED:
>
> | Column | Rule |
> |--------|------|
> | **Task** | `N. <name>` — N a unique integer, referenced by `Deps`. Add a `[engineer]`/`[analyst]` tag if the role matters (pipeline/ETL vs analysis). |
> | **Deps** | the data-flow DAG: `---` (reads only raw sources) or `after N` / `after N,M` (consumes task N's `Outputs`). Must reference real task numbers; no cycles |
> | **Outputs** | the artifact(s) this task produces (intermediate parquet / result table / figure / model file), repo- or DATA_DIR-relative. Drives the DAG (downstream `Deps` consume these) |
> | **Expected Output** | the verifiable claim that proves completion (`~1.2M rows, 0 nulls in id`; `accuracy ≥ 0.8`; `12 cols incl {a,b,c}`). Specific numbers, not "looks right" |
> | **Verify** | the deterministic command whose exit-0 IS the per-task gate — an assertion of Expected Output (`uv run python -c "import pandas as pd; df=pd.read_parquet('out.parquet'); assert len(df)>1_000_000 and df.id.notna().all()"`). For inherently-visual outputs, assert the mechanical floor (file exists, expected shape) and let `ds-validate`/look-at judge the rest. NEVER empty |
> | **Implements** | SPEC.md `CATEGORY-NN` requirement ID(s). Must trace to a real ID; every v1 requirement appears in ≥1 task's Implements (coverage invariant — ds-plan-reviewer rejects a dropped v1 ID) |
| Task | Deps | Outputs | Expected Output | Verify | Implements |
|------|------|---------|-----------------|--------|------------|
| 1. clean source [engineer] | `---` | `clean_source1.parquet` | ~1.2M rows, 0 nulls in `id`, log of rows dropped | `uv run python -c "import pandas as pd; df=pd.read_parquet('data/clean_source1.parquet'); assert len(df)>1_000_000 and df.id.notna().all()"` | `DATA-01` |
| 2. merge panel [engineer] | `after 1` | `panel.parquet` | 1 row per firm-year, 12 cols incl {gvkey,year,roa} | `uv run python -c "import pandas as pd; df=pd.read_parquet('data/panel.parquet'); assert {'gvkey','year','roa'}<=set(df.columns) and not df.duplicated(['gvkey','year']).any()"` | `DATA-02` |
| 3. regression [analyst] | `after 2` | `results/model.json` | coef on X significant, R² ≥ 0.3 | `uv run python -c "import json; r=json.load(open('results/model.json')); assert r['r2']>=0.3 and r['p_X']<0.05"` | `STAT-01` |
> Coverage invariant holds: every `v1` SPEC requirement ID appears in at least one row's Implements; ds-plan-reviewer rejects a plan that drops one.
## ETL Strategy
<!-- Include when any source > 1M rows or multiple sources require joins -->
### Filter Strategy
| Source | Rows | Filter Location | Filter Columns | Justification |
|--------|------|-----------------|----------------|---------------|
### Parallelism Plan
| Task | Parallelizable? | Dimension | Method |
|------|----------------|-----------|--------|
### Data Flow
[source] → [task] → [intermediate] → [task] → [output]
### ETL Strategy Flowchart (Required in PLAN.md)
Every PLAN.md with data processing MUST include an ASCII flowchart showing data sources, transformations, and outputs with annotations (FILTER/PARALLEL/CACHE):
Example: source.csv ──→ [Task 1: Clean] ──→ clean.parquet ──→ [Task 2: Analyze] ──→ results.csv FILTER: SQL WHERE CACHE: parquet PARALLEL: disabled (rows: 5M → 3M) (rows: 3M) (join key unique)
This flowchart IS the specification. If PLAN.md narrative and flowchart disagree, the flowchart wins.
### Scale-Up Testing Plan
<!-- Include when any task involves batch APIs, irreversible operations, or >500 items through external services -->
| Task | Total Items | Stage 1 (test) | Stage 2 | Stage 3 | Gate Criteria |
|------|-------------|-----------------|---------|---------|---------------|
| Batch extraction | 21,000 | 10 | 100 | 1,000 | ≥95% success, schema valid, non-empty responses |
| DB write | 5M rows | 100 | 1,000 | — | No constraint violations, row counts match |
## Implementation Language
[Python / SAS / R / Mixed]
<!-- If SAS or Mixed, include this section: -->
## SAS Performance Constraints
> **For Claude:** REQUIRED: Load SAS ETL enforcement before writing ANY SAS code:
>Read `${CLAUDE_SKILL_DIR}/../../skills/wrds/references/sas-etl.md` and follow its instructions.
> Validate ALL SAS code against the SAS Code Validation Checklist in the WRDS skill.
### Per-Task SAS Annotations
| Task | Merge Strategy | WHERE Pattern | Parallelism |
|------|---------------|---------------|-------------|
| Task 1 | Hash (lookup < 500K rows) | BETWEEN date literals | SGE array by year |
| Task 2 | Sort-merge (both tables > 50M) | No date filter | Sequential (single output) |
## Reproducibility Requirements
- Random seed: [value if needed]
- Package versions: [key packages]
- Data snapshot: [date/version]
The failure modes and their consequences are stated once, at the point of use — see Profiling Facts, ETL Facts, Step 5b Facts, and Step 5c Facts. If you are about to skip profiling, pull unfiltered tables, submit an untested batch, greenfield past an existing example, or finalize a ≥50M-row pull without a Data Pull Profile — those sections explain why that is counterproductive, and what to do instead.
Complete the plan when:
.planning/SPEC.mddocs/investigations/.planning/PLAN.md.planning/LEARNINGS.mdAfter writing .planning/PLAN.md, create .planning/LEARNINGS.md:
---
phase: ds-implement
status: in_progress
implements: []
requires: [.planning/PLAN.md]
provides: [analysis outputs]
affects: []
deviations: {r1: 0, r2: 0, r3: 0, r4: 0}
tags: [implementation, data-quality]
---
# Analysis Learnings: [Analysis Name]
## Data Quality Pipeline
[To be populated during implementation]
## Key Findings
[To be populated during implementation]
This file is populated by ds-implement as tasks complete. Initializing it here ensures the file exists before implementation begins.
Checkpoint type: human-verify (PLAN.md content is machine-verifiable)
<EXTREMELY-IMPORTANT> Before proceeding to ds-implement, execute this gate:.planning/PLAN.mdRead(".planning/PLAN.md"), uv run python3 ${CLAUDE_SKILL_DIR}/../../references/constraints/ds-external-skill-discovery.py ., and uv run python3 ${CLAUDE_SKILL_DIR}/../../references/constraints/ds-data-pull-profile.py .docs/investigations/YYYY-MM-DD_pull_profile.mdds-external-skill-discovery.py exits 0 (PASS)ds-data-pull-profile.py exits 0 (PASS)Skipping this gate is NOT HELPFUL — an incomplete plan wastes the user's time when implementation hits missing sections. The 30 seconds this gate takes saves hours. </EXTREMELY-IMPORTANT>
After passing the exit gate, dispatch the plan reviewer before proceeding:
Phase 2: ds-plan -> PLAN.md written -> exit gate passed
-> Dispatch ds-plan-reviewer subagent
-> If APPROVED -> proceed to ds-implement
-> If ISSUES_FOUND -> fix PLAN.md -> re-dispatch reviewer (max 5 iterations)
Step 1: Discover and load the plan reviewer skill:
Read ${CLAUDE_SKILL_DIR}/../../skills/ds-plan-reviewer/SKILL.md and follow its instructions.
Step 2: Only after reviewer returns APPROVED, discover and load the next phase:
Read ${CLAUDE_SKILL_DIR}/../../skills/ds-implement/SKILL.md and follow its instructions.
CRITICAL: Do not skip plan review. An unreviewed plan means subagents struggling with incomplete task definitions and missing verification steps.
tools
Use when "query Dewey Data", "deweydata.io", "SafeGraph places/patterns/spend", "Advan foot traffic", "POI / points of interest", "mobility data", "dataplor", "Veraset", "PassBy", "crypto/Bitcoin ATM locations", or any pull from the Dewey Data academic marketplace (UVA/NYU Platform Subscription) via the deweypy/deweydatapy client, DuckDB, or the Dewey MCP server.
development
Use when submitting jobs to UVA HPC (Rivanna/Afton), writing Slurm scripts (sbatch/srun/squeue), converting SGE to Slurm, running compute on any Slurm-managed cluster, or building WRDS data pipelines with polars on HPC. Triggers: 'submit to HPC', 'sbatch', 'squeue', 'slurm job', 'run on Rivanna', 'run on Afton', 'HPC array job', 'convert SGE to Slurm', 'polars on HPC', 'WRDS from HPC'.
testing
Internal skill for literature review and source materialization. Called after brainstorm, before setup. NOT user-facing.
development
This skill should be used when the user asks to "add paper", "paperpile add", "fetch PDF for", "find and add", "search paperpile", "find in paperpile", "paperpile search", "label paper", "trash paper", "download paper", "paperpile index", "edit paper metadata", "update paper title", "fix paper author", "paperpile edit", "find PDF online", "search google for PDF", "resolve PDF", "fetch PDF for citation", "get full-text for DOI", "resolve cite to PDF", or any request to manage their Paperpile library or resolve a citation to a local PDF.