kienbui1995/skills/cloud/gcp/dataflow-pipeline

Name: skills/cloud/gcp/dataflow-pipeline
Author: kienbui1995

skills/cloud/gcp/dataflow-pipeline/SKILL.md

npx skillsauth add kienbui1995/magic-powers skills/cloud/gcp/dataflow-pipeline

Dataflow Pipeline

When to Use

Building ETL pipelines (batch or streaming) on GCP
Choosing between Dataflow and Dataproc for a workload
Designing windowing or late-data handling for streaming
Preparing for the GCP Professional Data Engineer exam (highest-weight domain)

Core Jobs

1. Dataflow vs Dataproc Decision

| Factor | Choose Dataflow | Choose Dataproc |
|--------|----------------|-----------------|
| Runtime | Apache Beam pipelines | Spark/Hadoop ecosystem |
| Management | Fully managed, serverless | Cluster to manage (or autoscaling) |
| Streaming | Native (Pub/Sub → BQ) | Spark Streaming (more complex) |
| Existing code | Greenfield | Migrating existing Spark jobs |
| Cost model | Per vCPU/memory/hour | Cluster uptime |

2. Pipeline Design (Apache Beam)

Core abstractions:

PCollection — distributed dataset (bounded for batch, unbounded for streaming)
PTransform — operation on a PCollection (Map, Filter, GroupByKey, Combine)
ParDo — element-wise transformation (like map/flatMap)
Pipeline — DAG of PTransforms applied to PCollections
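
To make these abstractions concrete, here is a minimal plain-Python sketch that mimics them with in-memory lists. It does not use the Apache Beam SDK; the helper names `par_do`, `group_by_key`, and `combine_per_key` are hypothetical stand-ins chosen for illustration, and a real pipeline would run these steps distributed across workers.

```python
from collections import defaultdict
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
U = TypeVar("U")
K = TypeVar("K")
V = TypeVar("V")


def par_do(elements: Iterable[T], fn: Callable[[T], Iterable[U]]) -> list[U]:
    """Element-wise transform: fn may emit zero, one, or many outputs (flatMap-like)."""
    return [out for element in elements for out in fn(element)]


def group_by_key(pairs: Iterable[tuple[K, V]]) -> dict[K, list[V]]:
    """Collect all values sharing a key (on a cluster, this step needs a shuffle)."""
    grouped: dict[K, list[V]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def combine_per_key(
    pairs: Iterable[tuple[K, V]], combine_fn: Callable[[list[V]], V]
) -> dict[K, V]:
    """Reduce the values of each key to a single result."""
    return {key: combine_fn(values) for key, values in group_by_key(pairs).items()}


# A tiny "pipeline": a DAG of transforms applied to a bounded collection.
lines = ["to be or not", "to be"]
words = par_do(lines, str.split)             # one line -> many words
pairs = par_do(words, lambda w: [(w, 1)])    # one word -> one (word, 1) pair
counts = combine_per_key(pairs, sum)

print(counts)
```

Each intermediate list plays the role of a PCollection, and each helper call plays the role of a PTransform applied to it.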

3. Windowing (Streaming)

Fixed windows — equal non-overlapping intervals (e.g., 1-minute buckets)
Sliding windows — overlapping intervals (e.g., 10-min window every 1 min)
Session windows — gap-based; group events within a user session
Global window — default; all elements in one window (use with triggers for streaming)
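
The window types above differ mainly in how a timestamp maps to one or more intervals. The following plain-Python sketch shows that arithmetic under simplifying assumptions: integer timestamps in seconds, half-open `[start, end)` intervals, and hypothetical function names. It is not the Beam windowing API.

```python
def fixed_window(ts: int, size: int) -> tuple[int, int]:
    """Return the single fixed window [start, end) that contains ts."""
    start = ts - (ts % size)
    return (start, start + size)


def sliding_windows(ts: int, size: int, period: int) -> list[tuple[int, int]]:
    """Return every window of length `size`, starting on a multiple of `period`, that contains ts."""
    windows = []
    start = ts - (ts % period)      # latest window start that is <= ts
    while start > ts - size:        # earlier starts still cover ts
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def session_windows(timestamps: list[int], gap: int) -> list[tuple[int, int]]:
    """Merge events separated by less than `gap` into one session window."""
    sessions: list[list[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] < gap:
            sessions[-1][1] = ts            # extend the current session
        else:
            sessions.append([ts, ts])       # start a new session
    # A session stays open until `gap` seconds after its last event.
    return [(first, last + gap) for first, last in sessions]


print(fixed_window(125, size=60))
print(len(sliding_windows(125, size=600, period=60)))
print(session_windows([0, 30, 50, 400, 410], gap=60))
```

With a 10-minute window sliding every minute, each element lands in ten windows, which is why sliding windows multiply the data volume downstream.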

4. Watermarks and Late Data

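Watermark: Dataflow's estimate of the event time up to which all data is expected to have arrived
Late elements arrive after the watermark has passed the end of their window
Accept them for a grace period with .withAllowedLateness(Duration.standardMinutes(10))
Elements that arrive after the allowed lateness are dropped by default; capturing them in a dead-letter or side output requires explicit handling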
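
The relationship between the watermark, the window end, and the allowed lateness can be sketched as a three-way classification. This is a simplified plain-Python model, not Beam's implementation: the `Window` class, the `classify` and `route` functions, and the exact boundary conditions are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    start: int
    end: int  # exclusive, in seconds of event time


def classify(window: Window, watermark: int, allowed_lateness: int) -> str:
    """Classify an element of `window` by the watermark position when it arrives."""
    if watermark < window.end:
        return "on_time"                 # window has not closed yet
    if watermark < window.end + allowed_lateness:
        return "late_but_accepted"       # inside the allowed-lateness grace period
    return "too_late"                    # past the grace period


def route(arrivals, allowed_lateness: int):
    """Split arrivals into accepted payloads and a dead-letter list (hypothetical routing)."""
    accepted, dead_letter = [], []
    for window, watermark, payload in arrivals:
        if classify(window, watermark, allowed_lateness) == "too_late":
            dead_letter.append(payload)
        else:
            accepted.append(payload)
    return accepted, dead_letter


w = Window(start=0, end=60)
arrivals = [(w, 50, "a"), (w, 65, "b"), (w, 700, "c")]
accepted, dead_letter = route(arrivals, allowed_lateness=600)
print(accepted, dead_letter)
```

Here the 600-second allowance matches the ten minutes in the `.withAllowedLateness(...)` example: element "b" is late but still counted, while "c" arrives after the grace period.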

5. Triggers

Default (event time): fire when the watermark passes the end of the window
AfterProcessingTime: fire after a processing-time delay (for low latency)
AfterCount: fire after N elements have accumulated (AfterPane.elementCountAtLeast(N) in the Java SDK)
Composite triggers: combine with Repeatedly.forever(...), .orFinally(...), AfterFirst.of(...), AfterAll.of(...)
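
As a rough illustration of what a repeated element-count trigger does, the sketch below emits a pane every `count` elements and flushes whatever remains at the end, which stands in for the watermark passing the end of the window. It is a plain-Python model with a hypothetical function name, it discards elements once they have fired, and it ignores processing time entirely.

```python
def panes_after_count(elements: list, count: int) -> list[list]:
    """Emit a pane whenever `count` elements have accumulated, then a final pane for the rest."""
    panes = []
    pane = []
    for element in elements:
        pane.append(element)
        if len(pane) == count:
            panes.append(pane)   # early firing: enough elements accumulated
            pane = []            # start a fresh pane (fired elements are not repeated)
    if pane:
        panes.append(pane)       # final firing when the window closes
    return panes


print(panes_after_count([1, 2, 3, 4, 5, 6, 7], count=3))
```

The trade-off is latency against output volume: a smaller count produces results sooner but emits more panes per window.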

6. Templates

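Classic Templates: job graph serialized and staged as a file in Cloud Storage; runtime parameters are limited to those declared through ValueProvider
Flex Templates: pipeline packaged as a Docker image, with the job graph built at launch time; support arbitrary runtime parameters; preferred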

Key Concepts

Fusion: Dataflow optimization that fuses adjacent transforms into a single stage, so intermediate PCollections are not materialized between them
Worker autoscaling — Dataflow scales workers based on backlog
Shuffle service — offloads GroupByKey shuffle to Dataflow backend (reduces worker cost)
Streaming Engine — offloads windowing/state to backend; reduces memory on workers

Checklist

[ ] Use Flex Templates (not Classic) for new pipelines?
[ ] Late data handled with .withAllowedLateness()?
[ ] Side outputs used for dead-letter / error records?
[ ] GroupByKey minimized (use Combine where possible)?
[ ] Dataflow Shuffle service enabled for batch jobs?
[ ] Streaming Engine enabled for streaming jobs?
[ ] Pipeline tested locally with DirectRunner before deploying?
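
The checklist item about preferring Combine over GroupByKey comes down to how many records cross the shuffle. The plain-Python sketch below models two workers as two partitions and counts shuffled records for each approach; the function names and the two-partition setup are assumptions for illustration, not Dataflow internals.

```python
from collections import defaultdict


def sum_with_group_by_key(partitions):
    """Shuffle every (key, value) record, then sum per key."""
    shuffled = [pair for partition in partitions for pair in partition]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def sum_with_combine(partitions):
    """Pre-sum on each partition, so only one record per key per partition is shuffled."""
    partials = []
    for partition in partitions:
        local = defaultdict(int)
        for key, value in partition:
            local[key] += value          # partial combine before the shuffle
        partials.extend(local.items())
    totals = defaultdict(int)
    for key, value in partials:
        totals[key] += value             # merge partial sums after the shuffle
    return dict(totals), len(partials)


partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]
gbk_totals, gbk_shuffled = sum_with_group_by_key(partitions)
combine_totals, combine_shuffled = sum_with_combine(partitions)
print(gbk_shuffled, combine_shuffled)
```

Both approaches give the same totals, but the combining version shuffles fewer records, and the gap widens as the number of values per key grows.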

Output Format

🔴 Critical — unbounded PCollection without windowing in streaming pipeline
🟡 Warning — Classic Template used (prefer Flex), no late data handling
🟢 Suggestion — Shuffle service / Streaming Engine not enabled
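
One way to produce findings in these three tiers is to check a description of the pipeline against the rules above. The sketch below is hypothetical: the `PipelineConfig` fields and the `review` function are invented for illustration and do not correspond to any Dataflow or skill API.

```python
from dataclasses import dataclass


@dataclass
class PipelineConfig:
    streaming: bool
    has_windowing: bool
    template_type: str          # "flex" or "classic"
    allowed_lateness_set: bool
    shuffle_service: bool
    streaming_engine: bool


def review(cfg: PipelineConfig) -> list[str]:
    """Return findings ordered from most to least severe."""
    findings = []
    if cfg.streaming and not cfg.has_windowing:
        findings.append("🔴 Critical: unbounded PCollection without windowing in streaming pipeline")
    if cfg.template_type == "classic":
        findings.append("🟡 Warning: Classic Template used (prefer Flex)")
    if cfg.streaming and not cfg.allowed_lateness_set:
        findings.append("🟡 Warning: no late data handling")
    if not cfg.streaming and not cfg.shuffle_service:
        findings.append("🟢 Suggestion: Shuffle service not enabled")
    if cfg.streaming and not cfg.streaming_engine:
        findings.append("🟢 Suggestion: Streaming Engine not enabled")
    return findings


cfg = PipelineConfig(
    streaming=True,
    has_windowing=False,
    template_type="classic",
    allowed_lateness_set=False,
    shuffle_service=False,
    streaming_engine=False,
)
findings = review(cfg)
for line in findings:
    print(line)
```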

Exam Tips

Watermark = when Dataflow thinks all data with that timestamp has arrived
Late data arrives AFTER the watermark → use .withAllowedLateness() to capture it
Dataflow → BigQuery streaming inserts = standard pattern for real-time analytics
Dataproc = use for existing Spark/Hadoop code migration, not greenfield
Flex Templates > Classic Templates for all new pipelines (runtime params, easier updates)
DirectRunner = local testing; DataflowRunner = GCP execution

development

Updated Apr 23, 2026

npx skillsauth add kienbui1995/magic-powers skills/cloud/gcp/dataflow-pipeline

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status |
|---------|-------------|--------|
| Trivy | Container and dependency vulnerability scanner | Clean (95%) |
| Semgrep | Static code analysis for vulnerabilities | Clean (95%) |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean (95%) |
| Snyk (dep) | Open source security scanning | Skipped (50%) |
| Socket.dev | Supply chain security analysis | Skipped (50%) |
| VirusTotal | Multi-engine malware detection | Skipped (50%) |
| CrowdStrike | Advanced threat intelligence | Skipped (50%) |
| OSV-Scanner | Open Source Vulnerability database check | Skipped (50%) |
| OWASP Dep-Check | | Skipped (50%) |

Last scanned: Apr 24, 2026, 12:52 AM (49.6s, 1 file scanned)

SKILL.md

name: dataflow-pipeline
description: Use when building Apache Beam pipelines on Google Cloud Dataflow — batch ETL, streaming, windowing, triggers, or Dataflow vs Dataproc decisions. Covers GCP-PDE domain: Ingest and process data (~25-30%).

Related Skills

kienbui1995/xr-interface-design (content-media)
Use when designing for XR (AR/VR/MR), choosing interaction modes, or adapting 2D UI patterns for spatial computing
SKILL.md, updated Apr 24, 2026

kienbui1995/writing-skills (testing)
Use when creating new skills, editing existing skills, or verifying skills work before deployment
SKILL.md, updated Apr 24, 2026

kienbui1995/writing-plans (development)
Use when you have a spec or requirements for a multi-step task, before touching code
SKILL.md, updated Apr 24, 2026

kienbui1995/workflow-templates (development)
Use when executing a structured workflow: select and run a feature, bugfix, refactor, research, or incident template with correct agent and model assignments per phase.
SKILL.md, updated Apr 24, 2026

Install manually

# Clone the repo
git clone https://github.com/kienbui1995/magic-powers.git

# Copy into Claude Code skills folder (global); create the folder first if it does not exist
mkdir -p ~/.claude/skills
cp -r magic-powers/skills/cloud/gcp/dataflow-pipeline ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

kienbui1995/magic-powers

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT