Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

mohitagw15856/data-pipeline-spec

Name: data-pipeline-spec
Author: mohitagw15856

plugins/pm-data/skills/data-pipeline-spec/SKILL.md

npx skillsauth add mohitagw15856/pm-claude-skills data-pipeline-spec

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Data Pipeline Spec Skill

This skill produces a complete data pipeline specification covering sources, transformations, destinations, scheduling, SLAs, error handling, data quality checks, and monitoring requirements. Output is ready for engineering handoff or architecture review.

Required Inputs

Ask the user for these if not provided:

Pipeline purpose — what business question or workflow does this pipeline serve?
Source systems — where does data come from? (databases, APIs, files, event streams)
Destination — where does data land? (data warehouse, data lake, downstream DB, reporting tool)
Transformation type — ETL (transform before loading) or ELT (load raw, transform in warehouse)?
Frequency / SLA — how often must data be fresh? (real-time / hourly / daily / weekly)
Volume estimate — approximate rows/events per run
Data quality requirements — completeness, deduplication, freshness, schema enforcement
Team or stack — any specific tools in use? (Airflow, dbt, Fivetran, Spark, Kafka, etc.)

Output Structure

Data Pipeline Spec: [Pipeline Name]

Purpose: [One sentence — what decision or workflow does this pipeline enable?] Type: [ETL / ELT / Streaming / Batch] Owner: [Team or individual] Version: [1.0] Date: [Date] Status: [Draft / Under Review / Approved]

1. Overview

[2–3 sentences describing the pipeline end-to-end: what data moves, from where to where, at what cadence, and why.]

Architecture diagram (text):

[Source A] ──┐
[Source B] ──┤──► [Ingestion Layer] ──► [Transform Layer] ──► [Destination] ──► [Consumers]
[Source C] ──┘

2. Sources

| Source | System | Connection type | Data format | Update pattern | Volume | |---|---|---|---|---|---| | [Source 1] | [PostgreSQL / Salesforce / S3 / Kafka] | [JDBC / REST API / SDK / Webhook] | [JSON / CSV / Parquet / CDC] | [Append / Full refresh / Incremental] | [X rows/day] | | [Source 2] | [...] | [...] | [...] | [...] | [...] |

Incremental key (if applicable): [The column used to identify new or changed records — e.g. updated_at, event_id]

Authentication: [API key / OAuth / IAM role / connection string — note where credentials are stored]

3. Ingestion Layer

Tool: [Fivetran / Airbyte / Kafka Connect / custom script / dbt source]

Ingestion method:

[ ] Full extract (full table refresh each run)
[ ] Incremental extract (only new/changed rows since last run)
[ ] CDC (change data capture from database transaction log)
[ ] Event streaming (continuous ingestion from Kafka/Kinesis)

Raw landing zone: [Where raw data lands before transformation — e.g. raw.salesforce_opportunities in Snowflake, S3 bucket s3://data-raw/crm/]

Schema handling: [Strict schema enforcement / Schema evolution allowed / Union schema]

4. Transformation Logic

List each transformation in execution order. For ELT pipelines, this is the dbt model or SQL layer.

| Step | Name | Description | Input | Output | Tool | |---|---|---|---|---|---| | 1 | [Deduplicate events] | [Remove duplicate event rows based on event_id] | raw.events | staging.events_deduped | [dbt / SQL / Spark] | | 2 | [Join user profile] | [Enrich events with user attributes from CRM] | staging.events_deduped, raw.users | staging.events_enriched | [...] | | 3 | [Aggregate to daily] | [Roll up to user×day grain] | staging.events_enriched | mart.user_daily_activity | [...] |

Business logic rules:

[e.g. Revenue is recognised on payment_confirmed_at, not payment_initiated_at]
[e.g. Users in the [email protected] domain are excluded from all metrics]
[e.g. Currency conversion uses the ECB rate from the first business day of each month]

Slowly Changing Dimensions (SCD) — if applicable:

[e.g. users.plan_tier is SCD Type 2 — keep history of plan changes with valid_from / valid_to]

5. Destination

| Destination | System | Schema / Table | Write mode | Consumers | |---|---|---|---|---| | [Primary] | [Snowflake / BigQuery / Redshift / PostgreSQL] | [analytics.mart_user_activity] | [Append / Upsert / Full replace] | [Looker / Metabase / downstream pipeline] | | [Secondary] | [...] | [...] | [...] | [...] |

Partitioning / Clustering: [e.g. Partitioned by event_date, clustered by user_id — reduces query cost for time-range scans]

Retention policy: [e.g. Raw data retained for 90 days; mart tables retained indefinitely]

6. Scheduling & SLAs

| SLA | Target | Breach action | |---|---|---| | Data freshness | [Data must be ≤ X hours old by HH:MM UTC] | [Page on-call / alert Slack channel] | | Pipeline completion | [Must complete within X minutes of trigger] | [Alert and auto-retry] | | Availability | [Pipeline must run successfully X% of days per month] | [Incident review] |

Schedule: [Cron expression and human description — e.g. 0 6 * * * — daily at 06:00 UTC]

Trigger type:

[ ] Time-based (cron)
[ ] Event-based (triggered by upstream pipeline success / file arrival / Kafka lag)
[ ] Manual (ad hoc runs only)

Backfill strategy: [How to reprocess historical data if the pipeline fails or logic changes — e.g. parameterised date range, full drop-and-reload]

7. Data Quality Rules

| Check | Table | Rule | Failure action | |---|---|---|---| | Completeness | staging.events | event_id IS NOT NULL — 100% of rows | Block load / Alert | | Uniqueness | mart.user_daily_activity | (user_id, date) must be unique | Block load | | Freshness | mart.user_daily_activity | max(event_date) >= CURRENT_DATE - 1 | Alert | | Volume | staging.events | Row count within ±20% of 7-day average | Alert | | Referential integrity | staging.events | All user_id values exist in users table | Alert |

DQ tool: [dbt tests / Great Expectations / Monte Carlo / custom SQL assertions]

8. Error Handling & Recovery

Retry policy: [e.g. 3 retries with exponential back-off: 5 min, 20 min, 60 min]

Failure modes and responses:

| Failure | Detection | Response | Owner | |---|---|---|---| | Source unavailable | HTTP 5xx / connection timeout | Retry 3×, then alert and skip run | Data engineering | | Schema change in source | Column missing or type mismatch | Block load, alert schema owner | Data owner + engineering | | DQ check fails | dbt test failure / assertion error | Block load for P1 checks; alert for P2 | Data engineering | | Partial load | Row count < expected threshold | Alert; do not publish to consumers until resolved | Data engineering |

Dead-letter queue: [Where failed records are routed for manual inspection — e.g. raw.dlq_events]

9. Monitoring & Observability

Metrics to track:

Pipeline run duration (p50, p95)
Rows processed per run
DQ check pass rate
Source freshness lag
Error rate per source

Alerting:

[Slack channel: #data-alerts]
[PagerDuty: data-on-call escalation for P1 SLA breaches]
[Dashboard: [link to monitoring dashboard]]

Logging: [What gets logged and where — e.g. Airflow task logs to CloudWatch, structured JSON to data lake]

10. Dependencies & Sequencing

Upstream dependencies: [Which pipelines or data sources must succeed before this pipeline runs?]

Downstream dependents: [Which dashboards, pipelines, or models depend on this pipeline's output?]

[upstream pipeline A] ──► THIS PIPELINE ──► [downstream dashboard B]
                                          └──► [downstream pipeline C]

Coordination mechanism: [Airflow DAG dependency / dbt ref() / event trigger / manual gate]

11. Security & Compliance

PII fields: [List columns containing PII — e.g. email, ip_address, name]
Masking / Pseudonymisation: [e.g. email hashed with SHA-256 before landing in mart layer]
Access control: [Who can query the destination tables? — e.g. Role-based access in Snowflake]
Data residency: [Which regions is data permitted to transit and rest in?]
Audit trail: [Is pipeline execution auditable for compliance purposes? Where are logs retained?]

Quality Checks

[ ] Every source has an incremental key or full-refresh justification
[ ] Business logic rules are documented, not just the SQL
[ ] SLAs are agreed with consumers, not set unilaterally by engineering
[ ] DQ checks cover completeness, uniqueness, freshness, and volume
[ ] Failure modes include a documented recovery owner
[ ] PII fields are identified and a treatment plan is specified

Anti-Patterns

[ ] Do not spec a pipeline without defining SLAs — "as fast as possible" is not an acceptable freshness target
[ ] Do not omit error handling and dead-letter queue strategy — every pipeline must specify what happens to failed records
[ ] Do not design idempotent loads without documenting the deduplication key — assume reruns will happen
[ ] Do not leave data quality rules implicit — schema validation, null checks, and referential integrity must be explicit
[ ] Do not ignore schema evolution — specify how upstream schema changes are detected and handled

Example Trigger Phrases

"Design a data pipeline for our Salesforce to Snowflake sync"
"Write a pipeline spec for ingesting Stripe events into our data warehouse"
"Build an ETL spec for our user activity data"
"Document our dbt pipeline from raw events to the analytics mart"
"Spec out the pipeline that feeds the executive dashboard"

mohitagw15856/data-pipeline-spec

plugins/pm-data/skills/data-pipeline-spec/SKILL.md

Design an ETL/ELT data pipeline specification. Use when asked to design a data pipeline, spec an ETL or ELT process, document a data ingestion workflow, or plan a data integration. Produces a complete pipeline spec with sources, transforms, destinations, SLAs, error handling, and data quality rules.

946 stars

testing

Updated Jun 9, 2026

$ install --global

skillsauth

npx skillsauth add mohitagw15856/pm-claude-skills data-pipeline-spec

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Jun 9, 2026, 7:19 AM168.5s1 file scanned

SKILL.md

name:: data-pipeline-spec
description:: Design an ETL/ELT data pipeline specification. Use when asked to design a data pipeline, spec an ETL or ELT process, document a data ingestion workflow, or plan a data integration. Produces a complete pipeline spec with sources, transforms, destinations, SLAs, error handling, and data quality rules.

Data Pipeline Spec Skill

Required Inputs

Ask the user for these if not provided:

Pipeline purpose — what business question or workflow does this pipeline serve?
Source systems — where does data come from? (databases, APIs, files, event streams)
Destination — where does data land? (data warehouse, data lake, downstream DB, reporting tool)
Transformation type — ETL (transform before loading) or ELT (load raw, transform in warehouse)?
Frequency / SLA — how often must data be fresh? (real-time / hourly / daily / weekly)
Volume estimate — approximate rows/events per run
Data quality requirements — completeness, deduplication, freshness, schema enforcement
Team or stack — any specific tools in use? (Airflow, dbt, Fivetran, Spark, Kafka, etc.)

Output Structure

Data Pipeline Spec: [Pipeline Name]

1. Overview

[2–3 sentences describing the pipeline end-to-end: what data moves, from where to where, at what cadence, and why.]

Architecture diagram (text):

[Source A] ──┐
[Source B] ──┤──► [Ingestion Layer] ──► [Transform Layer] ──► [Destination] ──► [Consumers]
[Source C] ──┘

2. Sources

Incremental key (if applicable): [The column used to identify new or changed records — e.g. updated_at, event_id]

Authentication: [API key / OAuth / IAM role / connection string — note where credentials are stored]

3. Ingestion Layer

Tool: [Fivetran / Airbyte / Kafka Connect / custom script / dbt source]

Ingestion method:

[ ] Full extract (full table refresh each run)
[ ] Incremental extract (only new/changed rows since last run)
[ ] CDC (change data capture from database transaction log)
[ ] Event streaming (continuous ingestion from Kafka/Kinesis)

Raw landing zone: [Where raw data lands before transformation — e.g. raw.salesforce_opportunities in Snowflake, S3 bucket s3://data-raw/crm/]

Schema handling: [Strict schema enforcement / Schema evolution allowed / Union schema]

4. Transformation Logic

List each transformation in execution order. For ELT pipelines, this is the dbt model or SQL layer.

Business logic rules:

[e.g. Revenue is recognised on payment_confirmed_at, not payment_initiated_at]
[e.g. Users in the [email protected] domain are excluded from all metrics]
[e.g. Currency conversion uses the ECB rate from the first business day of each month]

Slowly Changing Dimensions (SCD) — if applicable:

[e.g. users.plan_tier is SCD Type 2 — keep history of plan changes with valid_from / valid_to]

5. Destination

Partitioning / Clustering: [e.g. Partitioned by event_date, clustered by user_id — reduces query cost for time-range scans]

Retention policy: [e.g. Raw data retained for 90 days; mart tables retained indefinitely]

6. Scheduling & SLAs

Schedule: [Cron expression and human description — e.g. 0 6 * * * — daily at 06:00 UTC]

Trigger type:

[ ] Time-based (cron)
[ ] Event-based (triggered by upstream pipeline success / file arrival / Kafka lag)
[ ] Manual (ad hoc runs only)

Backfill strategy: [How to reprocess historical data if the pipeline fails or logic changes — e.g. parameterised date range, full drop-and-reload]

7. Data Quality Rules

DQ tool: [dbt tests / Great Expectations / Monte Carlo / custom SQL assertions]

8. Error Handling & Recovery

Retry policy: [e.g. 3 retries with exponential back-off: 5 min, 20 min, 60 min]

Failure modes and responses:

Dead-letter queue: [Where failed records are routed for manual inspection — e.g. raw.dlq_events]

9. Monitoring & Observability

Metrics to track:

Pipeline run duration (p50, p95)
Rows processed per run
DQ check pass rate
Source freshness lag
Error rate per source

Alerting:

[Slack channel: #data-alerts]
[PagerDuty: data-on-call escalation for P1 SLA breaches]
[Dashboard: [link to monitoring dashboard]]

Logging: [What gets logged and where — e.g. Airflow task logs to CloudWatch, structured JSON to data lake]

10. Dependencies & Sequencing

Upstream dependencies: [Which pipelines or data sources must succeed before this pipeline runs?]

Downstream dependents: [Which dashboards, pipelines, or models depend on this pipeline's output?]

[upstream pipeline A] ──► THIS PIPELINE ──► [downstream dashboard B]
                                          └──► [downstream pipeline C]

Coordination mechanism: [Airflow DAG dependency / dbt ref() / event trigger / manual gate]

11. Security & Compliance

PII fields: [List columns containing PII — e.g. email, ip_address, name]
Masking / Pseudonymisation: [e.g. email hashed with SHA-256 before landing in mart layer]
Access control: [Who can query the destination tables? — e.g. Role-based access in Snowflake]
Data residency: [Which regions is data permitted to transit and rest in?]
Audit trail: [Is pipeline execution auditable for compliance purposes? Where are logs retained?]

Quality Checks

[ ] Every source has an incremental key or full-refresh justification
[ ] Business logic rules are documented, not just the SQL
[ ] SLAs are agreed with consumers, not set unilaterally by engineering
[ ] DQ checks cover completeness, uniqueness, freshness, and volume
[ ] Failure modes include a documented recovery owner
[ ] PII fields are identified and a treatment plan is specified

Anti-Patterns

[ ] Do not spec a pipeline without defining SLAs — "as fast as possible" is not an acceptable freshness target
[ ] Do not omit error handling and dead-letter queue strategy — every pipeline must specify what happens to failed records
[ ] Do not design idempotent loads without documenting the deduplication key — assume reruns will happen
[ ] Do not leave data quality rules implicit — schema validation, null checks, and referential integrity must be explicit
[ ] Do not ignore schema evolution — specify how upstream schema changes are detected and handled

Example Trigger Phrases

"Design a data pipeline for our Salesforce to Snowflake sync"
"Write a pipeline spec for ingesting Stripe events into our data warehouse"
"Build an ETL spec for our user activity data"
"Document our dbt pipeline from raw events to the analytics mart"
"Spec out the pipeline that feeds the executive dashboard"

Related Skills

mohitagw15856/win-loss-analysis

business

VerifiedTrustedCommunity

Analyze why deals are won and lost and turn it into an action plan. Use when asked to run a win/loss analysis, review closed-won and closed-lost deals, understand why the team is losing to a competitor, or summarize sales feedback into patterns. Produces a structured win/loss report with themes, win/loss rates by segment and competitor, representative quotes, and prioritized actions for product, marketing, and sales.

1,117SKILL.mdUpdated Jul 2, 2026

mohitagw15856/win-loss-analysis

mohitagw15856/which-skill

development

VerifiedTrustedCommunity

Route a fuzzy request to the right skill in this library. Use when the user is unsure which skill fits, asks 'which skill should I use for X', describes a task without naming a skill, or when a request could plausibly match several skills. Produces a best-fit recommendation with the inputs to gather, a runner-up with the tie-breaker, and a workflow recipe when the job spans multiple skills.

1,117SKILL.mdUpdated Jul 2, 2026

mohitagw15856/which-skill

mohitagw15856/vuln-triage

testing

VerifiedTrustedCommunity

Triage a vulnerability or scanner finding — assess real severity, exploitability, and how urgently to fix. Use when asked to triage a CVE, prioritize scanner/pentest findings, assess a vuln's risk, or decide what to patch first. Produces a triage verdict: CVSS-informed severity adjusted for your context, exploitability, real risk, a fix/mitigation, and an SLA — so you fix what matters, not just what's red.

1,117SKILL.mdUpdated Jul 2, 2026

mohitagw15856/vuln-triage

mohitagw15856/voice-of-customer-program

development

VerifiedTrustedCommunity

Stand up a Voice of Customer (VoC) program that turns feedback into action. Use when asked to build a VoC program, design a customer feedback loop, consolidate feedback sources, or set up a closed-loop feedback process. Produces a VoC program design — objectives, feedback sources and channels, a taxonomy, collection and analysis cadence, closed-loop routing, ownership, and success metrics.

1,117SKILL.mdUpdated Jul 2, 2026

mohitagw15856/voice-of-customer-program

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/mohitagw15856/pm-claude-skills.git

# Copy into Claude Code skills folder (global)
cp -r pm-claude-skills/plugins/pm-data/skills/data-pipeline-spec ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

mohitagw15856/pm-claude-skills

946 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT