dev-data/SKILL.md
Data engineering and analysis guide for orchestrated sub-agents. Data pipelines, ETL/ELT design, data quality validation, SQL optimization, and analysis patterns. Injected when role=data.
npx skillsauth add lidge-jun/cli-jaw-skills dev-dataInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Production-grade data engineering patterns for building reliable data systems.
Five rules that apply to every data task:
| Principle | What It Means | |-----------|---------------| | Pipeline thinking | Every pipeline is Extract → Transform → Load. Keep each stage as an independent, testable function. | | Schema-first | Define expected columns, types, and constraints BEFORE writing transformation logic. | | Defensive parsing | External data will have nulls, wrong types, extra columns, missing columns, and encoding issues. Assume all of these. | | Idempotent operations | Running the same pipeline twice on the same input must produce the same output. Use upsert patterns, not blind inserts. | | Fail fast, fail loud | Raise errors on invalid data immediately — silent failures produce wrong downstream results that are harder to debug. |
| Format | Best For | Watch Out For | |--------|----------|---------------| | CSV | Simple tabular data, human-readable | Encoding (UTF-8 BOM), delimiter ambiguity, multiline values, inconsistent quoting | | JSON | Nested structures, API responses | Large files (stream, don't load all at once), deeply nested objects, encoding | | Parquet | Large analytical datasets, columnar queries | Requires library support, not human-readable, schema evolution | | Excel | Business user handoffs | Multiple sheets, merged cells, formulas vs. values, date formatting | | Database | Production system access | Connection pooling, query timeouts, use read replicas for analytics |
For large or frequently updated data sources:
updated_at, id) to track the last processed record.loaded_rows should equal source_rows_since_watermark.Before any transformation, validate incoming data:
✅ Check: Expected columns exist
✅ Check: Data types match (string, number, date, boolean)
✅ Check: Required fields are not null
✅ Check: Values are within expected ranges
✅ Check: No unexpected duplicate keys
❌ Fail: If any check fails, write to error log with row details. Don't silently drop.
Rules:
When using dbt for transformations, follow the staging → intermediate → mart layer architecture:
Rules:
schema.yml with tests (not_null, unique, relationships, custom SQL).dbt source freshness to monitor upstream data staleness| Scenario | Pattern | |----------|---------| | Invalid records | Write to dead-letter table/file for manual review. Preserve every record for debugging. | | Source unavailable | Retry with exponential backoff (1s, 2s, 4s). Alert after 3 failures. | | Schema mismatch | Halt pipeline. Log expected vs. actual schema. Don't attempt partial loads. | | Duplicate records | Use upsert (INSERT ON CONFLICT UPDATE) or deduplicate with window functions. |
When pipelines have multiple steps with dependencies:
Run these after every pipeline step, not just at the end:
| Check | What It Validates | Example |
|-------|-------------------|---------|
| Not null | Required fields have values | WHERE order_id IS NULL → 0 rows |
| Unique | No duplicates on key columns | COUNT(*) = COUNT(DISTINCT id) |
| Range | Numeric values within bounds | amount BETWEEN 0 AND 1,000,000 |
| Categorical | Values in allowed set | status IN ('pending', 'active', 'closed') |
| Freshness | Data is recent enough | MAX(updated_at) > NOW() - INTERVAL '24 hours' |
| Row count | No unexpected data loss or explosion | Within ±10% of previous run |
| Referential | Foreign keys point to existing records | customer_id EXISTS IN customers |
Use a layered quality strategy — different tools at different pipeline stages:
| Stage | Tool | Purpose | |-------|------|---------| | Ingest | Great Expectations | Validate raw data against expectations before staging | | Transform | dbt tests | Assert model-level quality (not_null, unique, relationships, custom SQL) | | Production | Soda / Monte Carlo | Real-time monitoring, anomaly detection, SLA enforcement |
Validate data dimensions: completeness, uniqueness, range, format, referential integrity, freshness.
Rule: Run validation on every pipeline step — skipping "because the data looks fine" leads to silent downstream corruption.
For datasets shared between teams, define a contract:
A data contract must include:
Changes to a contracted schema require versioning and consumer notification.
Before any deep analysis, provide:
| Metric | What to Report | |--------|----------------| | Row count | Total records in dataset | | Column inventory | Name, type, null count per column | | Numeric summary | min, max, mean, median, std dev | | Categorical summary | Unique values, top 5 most frequent | | Time range | Earliest and latest timestamp | | Data quality | Null percentage, duplicate percentage |
| Format | When to Use | |--------|-------------| | Markdown tables | Inline reports, ≤50 rows, quick summaries | | JSON | Programmatic consumption, API responses | | CSV export | Handoff to spreadsheet users, large datasets | | HTML + charts | Dashboards, visual reports (Chart.js, Mermaid diagrams) |
When analysis involves statistics:
| Condition | Choose | |-----------|--------| | Real-time insight required (sub-minute latency) | Streaming (Kafka + Flink, Spark Structured Streaming, or Kafka Streams depending on complexity) | | Exactly-once semantics needed | Kafka transactional producers + Flink/Spark | | Latency >1 min acceptable, volume >1TB/day | Distributed batch (Spark, Databricks) | | Latency >1 min acceptable, volume <1TB/day | Single-node batch (SQL, Python, dbt) |
Default to batch. Streaming adds significant complexity in error handling, state management, and debugging. Only use streaming when latency requirements genuinely demand it.
| Latency Requirement | Framework | Complexity | |---------------------|-----------|------------| | Sub-100ms, complex stateful | Apache Flink | High (dedicated cluster) | | Sub-second, existing Spark infra | Spark Structured Streaming | Medium | | Sub-second, Kafka-centric | Kafka Streams (embedded library) | Low-Medium | | Minutes acceptable | Batch with frequent scheduling | Low |
Kafka essentials for data engineers:
See references/streaming.md for Kafka configuration, CDC patterns, and windowing.
| Need | Choose | |------|--------| | SQL analytics, BI dashboards, structured queries | Data warehouse (Snowflake, BigQuery, PostgreSQL) | | ML training, unstructured data, large-scale storage | Data lake (S3/GCS + Parquet or Delta format) | | Both SQL and ML needs | Lakehouse (Delta Lake, Apache Iceberg) | | Real-time key-value lookups, caching | Redis, DynamoDB | | Graph relationships | Neo4j, Neptune |
| Category | Options | |----------|---------| | Orchestration | Airflow, Prefect, Dagster | | Transformation | dbt, Spark, plain SQL | | Streaming | Kafka, Kinesis, Pub/Sub | | Quality | Great Expectations, dbt tests, Soda, custom validators | | Monitoring | Prometheus, Grafana, Datadog, Monte Carlo | | Local analysis | DuckDB (in-process SQL), Polars (fast DataFrame), pandas (exploration/ML) |
| Factor | pandas | Polars | DuckDB |
|--------|--------|--------|--------|
| Best for | <1GB, exploration, ML prep | >1GB, batch ETL, performance | SQL analytics, ad-hoc queries |
| Execution | Single-threaded, eager | Multi-threaded Rust, lazy eval | Vectorized, auto disk spill |
| Speed (groupby/join) | Baseline | 5-10x faster | Matches Polars on SQL-native |
| Memory | Full load into RAM | Streaming, lazy chains | Spill-to-disk for out-of-core |
| API style | DataFrame (imperative) | DataFrame (expression-based) | SQL-first |
| ML interop | Excellent (scikit-learn, etc.) | Good (.to_pandas()) | Good (.fetchdf()) |
| File format | CSV, JSON, Excel | CSV, Parquet, Arrow-native | CSV, Parquet, JSON, S3 direct |
Decision rule:
| Data size / workflow | Recommended tool | |----------------------|------------------| | Small (<100MB), interactive exploration | pandas | | Medium (100MB-10GB), batch transforms | Polars | | SQL-first analytics, any size | DuckDB | | Blended workflow | Polars transforms, DuckDB aggregations (zero-copy via Arrow) |
See references/tools.md for full patterns and code examples.
| Level | Examples | Handling | |-------|----------|---------| | Public | Aggregated metrics, public reports | No restrictions | | Internal | Business KPIs, operational data | Access controls, no external sharing | | Confidential | Customer data, financial records | Encryption at rest, column-level masking | | Restricted | SSN, payment data, health records | Tokenization, row-level security, audit logging |
Before building any pipeline that touches PII:
| Requirement | Engineering Pattern | |-------------|---------------------| | Right to erasure | Soft delete → batch purge → propagate to downstream stores including data lake | | Data minimization | Collect only necessary fields; TTL on non-essential data | | Consent tracking | Consent event store with versioned preferences; consent-aware pipeline branches | | Data portability | Standardized export endpoint (JSON/CSV) per user request |
See references/governance.md for detailed implementation patterns, row-level security, and retention policies.
Data engineering does not exist in isolation. Cross-reference these skills when your pipeline connects to other systems:
| Companion | When to Consult | Key Sections |
|-----------|-----------------|--------------|
| dev-backend | Exposing data via API, response envelope shape, pagination | §5 API Response Contract, §2 Layered Architecture |
| dev-security | PII handling, data classification, access controls, audit logging | §1 Input Validation, §4 Secrets, §8 Pre-Flight |
| dev-testing | Pipeline validation, contract tests for data APIs, CI gates | §2 Backend & API Testing, §3 Contract Testing |
| dev-frontend | Downstream reporting/dashboard consumers, data format expectations | §8 Backend Contract & Security Alignment |
Integration patterns:
dev-backend §5)dev-security guidance before this skill's §7 rulesdevelopment
Goal execution guidelines with PABCD integration, verification tiers, documentation workflow, and AI-driven planning
tools
A CLI tool for making authenticated requests to the X (Twitter) API. Use this skill when you need to post tweets, reply, quote, search, read posts, manage followers, send DMs, upload media, or interact with any X API v2 endpoint.
development
Use this skill any time a spreadsheet file is the primary input or output (.xlsx, .xlsm, .csv, .tsv). This includes: creating, reading, editing, analyzing, or formatting spreadsheets; cleaning messy tabular data; converting between formats; and data visualization with charts. Also use for pandas-based data analysis when the deliverable is a spreadsheet. Do NOT trigger when the primary deliverable is a Word document, HTML report, standalone Python script, database pipeline, or Google Sheets API integration.
tools
Use this skill when the user wants to build a financial model, 3-statement model, DCF valuation, cap table, scenario analysis, or financial projections in Excel. Trigger on: 'financial model', '3-statement model', 'DCF', 'cap table', 'pro forma', 'projections', 'sensitivity analysis', 'waterfall', 'debt schedule', 'break-even', 'discounted cash flow', 'capitalization table', 'fundraising model', 'WACC calculation', 'scenario analysis model'. Input is a text prompt with assumptions. Output is a single .xlsx file with formula-driven, interconnected statement sheets.