skills/by-role/data-engineer/pipeline-design-doc/SKILL.md
Write a data pipeline design document. Use when the user says "pipeline design doc", "document this pipeline", "pipeline architecture doc", "data flow document", "how does this pipeline work", "design doc for ETL", "pipeline spec", or needs to capture the architecture, data flow, and operational details of a data pipeline - even if they don't explicitly say "design doc".
Based on Fundamentals of Data Engineering (Reis & Housley) and Designing Data-Intensive Applications (Kleppmann). A pipeline design doc is the architectural contract for a data flow: what data moves, from where to where, how it transforms, and what guarantees it provides. Kleppmann's standard: document the data system's reliability, scalability, and maintainability properties - not just the happy path.
The test: can a new engineer understand what this pipeline does, debug it when it breaks, and extend it safely - without asking anyone?
Pipeline: [name]
Owner: [team or person]
Purpose: [1-2 sentences on why this pipeline exists and what decision it enables]
Source systems: [list of upstream data sources]
Destination: [where data lands - warehouse, lake, API, etc.]
Schedule: [batch frequency or streaming latency SLA]
Data volume: [approx rows/day or GB/day]
SLA: [when downstream consumers need data available by]
Draw or describe each stage linearly:
[Source] → [Ingest] → [Raw/Landing Zone] → [Transform] → [Serving Layer] → [Consumer]
For each stage, document:
For each source, specify the contract:
Source: [system name]
Schema: [fields and types]
Delivery method: [push/pull, API, file drop, CDC, etc.]
Delivery frequency: [real-time / hourly / daily]
SLA: [when source guarantees data is available]
Schema change policy: [how the source team notifies of changes]
Known quirks: [nulls in required fields, late-arriving data, duplicates, etc.]
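The source contract above can also be checked in code at ingest time. The following is a minimal sketch, assuming records arrive as Python dicts; `SourceContract`, `validate_record`, and the `orders_db` fields are hypothetical names chosen for illustration, not part of any specific tool.

```python
from dataclasses import dataclass, field


@dataclass
class SourceContract:
    """Hypothetical in-code form of the source contract template."""
    source: str
    schema: dict                                  # field name -> expected Python type
    required: set = field(default_factory=set)    # fields that must not be null


def validate_record(contract, record):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for name, expected_type in contract.schema.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        if value is None:
            # Nulls are tolerated unless the contract marks the field required.
            if name in contract.required:
                problems.append(f"null in required field: {name}")
            continue
        if not isinstance(value, expected_type):
            problems.append(
                f"wrong type for {name}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    # Fields the contract does not know about usually signal a schema change.
    for name in record:
        if name not in contract.schema:
            problems.append(f"unexpected field: {name}")
    return problems


# Assumed example contract; the field names are illustrative only.
orders_contract = SourceContract(
    source="orders_db",
    schema={"order_id": str, "user_id": str, "amount": float},
    required={"order_id", "amount"},
)
```

Returning a list of violations rather than raising on the first one makes it easier to log every known quirk in a single pass.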
For the output, specify the guarantee:
Output table/topic: [name]
Schema: [fields and types]
Freshness SLA: [data is no older than X]
Completeness guarantee: [how to detect missing data]
Deduplication: [how duplicates are handled]
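The three output guarantees above map onto small, testable checks. This is a sketch under stated assumptions: the output is partitioned by date, each row carries an `event_id`, and the helper names (`is_fresh`, `missing_partitions`, `deduplicate`) are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone


def is_fresh(latest_event_time, now, max_age):
    """Freshness SLA: the newest row is no older than max_age."""
    return now - latest_event_time <= max_age


def missing_partitions(expected_dates, loaded_dates):
    """Completeness: expected date partitions that never landed."""
    return sorted(set(expected_dates) - set(loaded_dates))


def deduplicate(rows, key="event_id"):
    """Deduplication: keep the last row seen for each key."""
    latest = {}
    for row in rows:
        latest[row[key]] = row
    return list(latest.values())


now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
fresh = is_fresh(datetime(2024, 1, 2, 5, 30, tzinfo=timezone.utc), now, timedelta(hours=1))
gaps = missing_partitions(
    [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3)],
    [date(2024, 1, 1), date(2024, 1, 3)],
)
```

In a real pipeline these checks would more likely run as warehouse queries; the Python form only shows the logic that the design doc should state explicitly.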
For each non-trivial transformation, explain what it does, why it exists, which assumptions it relies on, and how it handles edge cases.
Example:
Transform: revenue_attribution
What: assigns each order to the last marketing touch within 30 days
Why: finance requires last-touch model per FY23 attribution policy
Assumptions: user_id is consistent across sessions (not always true for guests)
Edge cases: guest checkouts with no touch - attributed to "direct"
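The revenue_attribution example can be made concrete with a short sketch. It assumes orders and touches are dicts with `user_id`, `ordered_at`, `touched_at`, and `channel` keys (all hypothetical field names), and it assumes that any order without an eligible touch, not only a guest checkout, falls back to "direct".

```python
from datetime import datetime, timedelta

ATTRIBUTION_WINDOW = timedelta(days=30)


def attribute_order(order, touches):
    """Return the channel of the last touch in the 30 days before the order."""
    user_id = order.get("user_id")
    if user_id is None:
        # Guest checkout: no user_id to match touches against.
        return "direct"
    eligible = [
        t for t in touches
        if t["user_id"] == user_id
        and timedelta(0) <= order["ordered_at"] - t["touched_at"] <= ATTRIBUTION_WINDOW
    ]
    if not eligible:
        # Assumption: no touch inside the window also counts as "direct".
        return "direct"
    return max(eligible, key=lambda t: t["touched_at"])["channel"]


touches = [
    {"user_id": "u1", "touched_at": datetime(2024, 1, 1), "channel": "organic"},
    {"user_id": "u1", "touched_at": datetime(2024, 2, 20), "channel": "email"},
    {"user_id": "u1", "touched_at": datetime(2024, 2, 25), "channel": "ads"},
    {"user_id": "u2", "touched_at": datetime(2024, 1, 1), "channel": "organic"},
]
order = {"user_id": "u1", "ordered_at": datetime(2024, 3, 1)}
channel = attribute_order(order, touches)
```

Writing the rule out this way surfaces the decisions the doc has to record, such as whether the window boundary is inclusive and what happens to touches dated after the order.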
Failure mode: [source is late / schema change / job fails mid-run]
Impact: [what downstream consumers experience]
Detection: [how we know it's happening - alert name or monitoring link]
Recovery:
1. [first action]
2. [verify: command or query]
3. [reprocessing steps if data needs backfill]
Escalate if: [condition that exceeds this runbook]
1. Happy path only
   Bad: "The pipeline reads from Postgres and loads to BigQuery daily."
   Good: Document what happens when Postgres is slow, when a schema column is dropped, when the daily job misses its window.
2. Undocumented assumptions
   Bad: "Joins on user_id."
   Good: "Joins on user_id. Assumes user_id is stable post-registration. Guest users (user_id = null) are excluded from this join and handled in the guest_orders pipeline."
3. No data contract
   Bad: "Takes whatever the source sends."
   Good: Define the expected schema, the schema change notification process, and how the pipeline handles unexpected fields.
4. No idempotency statement
   Bad: Doc doesn't say whether reruns are safe.
   Good: "This pipeline is idempotent. Reruns use MERGE (upsert) on event_id. Safe to rerun for any date range."
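The idempotency statement in item 4 is easy to demonstrate. The sketch below uses SQLite's `INSERT ... ON CONFLICT DO UPDATE` as a stand-in for a warehouse MERGE, since it runs with only the Python standard library; the `events` table and `load_events` function are hypothetical.

```python
import sqlite3


def load_events(conn, events):
    """Upsert events keyed on event_id so that reruns do not create duplicates."""
    conn.executemany(
        """
        INSERT INTO events (event_id, amount) VALUES (?, ?)
        ON CONFLICT(event_id) DO UPDATE SET amount = excluded.amount
        """,
        events,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, amount REAL)")

batch = [("e1", 10.0), ("e2", 20.0)]
load_events(conn, batch)
load_events(conn, batch)  # simulated rerun of the same batch

rows = conn.execute("SELECT event_id, amount FROM events ORDER BY event_id").fetchall()
```

After the rerun the table still holds one row per `event_id`, which is the property the design doc should state and the recovery steps should rely on.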