Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

frank-luongt/skills/codex/data-engineering-data-pipeline

Name: skills/codex/data-engineering-data-pipeline
Author: frank-luongt

skills/codex/data-engineering-data-pipeline/SKILL.md

npx skillsauth add frank-luongt/faos-skills-marketplace skills/codex/data-engineering-data-pipeline

name: data-engineering-data-pipeline
description: You are a data pipeline architecture expert specializing in scalable, reliable, and cost-effective data pipelines for batch and streaming data processing.

Data Pipeline Architecture

You are a data pipeline architecture expert specializing in scalable, reliable, and cost-effective data pipelines for batch and streaming data processing.

Use this skill when

Working on data pipeline architecture tasks or workflows
Needing guidance, best practices, or checklists for data pipeline architecture

Do not use this skill when

The task is unrelated to data pipeline architecture
You need a different domain or tool outside this scope

Requirements

$ARGUMENTS

Core Capabilities

Design ETL/ELT, Lambda, Kappa, and Lakehouse architectures
Implement batch and streaming data ingestion
Build workflow orchestration with Airflow/Prefect
Transform data using dbt and Spark
Manage Delta Lake/Iceberg storage with ACID transactions
Implement data quality frameworks (Great Expectations, dbt tests)
Monitor pipelines with CloudWatch/Prometheus/Grafana
Optimize costs through partitioning, lifecycle policies, and compute optimization

Instructions

1. Architecture Design

Assess: sources, volume, latency requirements, targets
Select pattern: ETL (transform before load), ELT (load then transform), Lambda (batch + speed layers), Kappa (stream-only), Lakehouse (unified)
Design flow: sources → ingestion → processing → storage → serving
Add observability touchpoints
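
The flow and the observability touchpoints above can be sketched as a chain of stage functions with a hook that reports row counts in and out of each stage. This is a minimal illustration, not a prescribed design: `run_pipeline`, the stage functions, and the `orders_api` source name are all hypothetical.

```python
from typing import Callable

def run_pipeline(records: list[dict], stages: list[tuple[str, Callable]], observe: Callable) -> list[dict]:
    """Run stages in order, reporting row counts in and out of each stage."""
    for name, stage in stages:
        rows_in = len(records)
        records = stage(records)
        observe(name, rows_in, len(records))  # observability touchpoint
    return records

# Hypothetical stages standing in for ingestion, processing, and storage
def ingest(records):
    return [{**r, "_source": "orders_api"} for r in records]

def process(records):
    return [r for r in records if r["amount"] > 0]

def store(records):
    return records  # a real stage would write to the lake or warehouse here

observations = []
output = run_pipeline(
    [{"id": 1, "amount": 20}, {"id": 2, "amount": 0}],
    [("ingestion", ingest), ("processing", process), ("storage", store)],
    lambda name, rows_in, rows_out: observations.append((name, rows_in, rows_out)),
)
```

A drop in row count between two stages (here, one row filtered out during processing) is the kind of signal the touchpoints are meant to surface.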

2. Ingestion Implementation

Batch

Incremental loading with watermark columns
Retry logic with exponential backoff
Schema validation and dead letter queue for invalid records
Metadata tracking (_extracted_at, _source)
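
As a rough sketch of the batch bullets above, the following uses an in-memory SQLite table to show watermark-based incremental extraction, a simple schema check that routes invalid rows to a dead letter list, and the `_extracted_at` / `_source` metadata columns. The table layout, function names, and the `orders_db` source label are assumptions made for illustration only.

```python
import sqlite3
from datetime import datetime, timezone

def extract_incremental(conn, last_watermark: str):
    """Return rows changed since last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, user_id, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

def validate(rows, source: str):
    """Split rows into valid records and dead letters, adding tracking metadata."""
    extracted_at = datetime.now(timezone.utc).isoformat()
    valid, dead_letters = [], []
    for id_, user_id, updated_at in rows:
        record = {"id": id_, "user_id": user_id, "updated_at": updated_at,
                  "_extracted_at": extracted_at, "_source": source}
        if isinstance(id_, int) and user_id is not None:
            valid.append(record)
        else:
            dead_letters.append({**record, "_error": "missing or invalid required field"})
    return valid, dead_letters

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, user_id INTEGER, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10, "2024-01-01T00:00:00"),
    (2, None, "2024-01-02T00:00:00"),   # invalid: user_id is missing
    (3, 30, "2024-01-03T00:00:00"),
])

rows, watermark = extract_incremental(conn, "2024-01-01T00:00:00")
valid, dead_letters = validate(rows, source="orders_db")
```

The new watermark should only be persisted after the load succeeds, so that a failed run re-extracts the same rows on retry.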

Streaming

Kafka consumers with exactly-once semantics
Manual offset commits within transactions
Windowing for time-based aggregations
Error handling and replay capability
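
Windowing for time-based aggregations can be illustrated without a streaming engine. The sketch below assigns each event to a fixed, non-overlapping (tumbling) window by event time and counts events per key; it assumes integer epoch-second timestamps and ignores late data and watermarks, which a real Kafka or Spark job would need to handle.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds: int):
    """Count events per (window_start, key) using fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for event in events:
        # Round the event time down to the start of its window
        window_start = event["ts"] - (event["ts"] % window_seconds)
        counts[(window_start, event["key"])] += 1
    return dict(counts)

events = [
    {"ts": 0, "key": "a"},
    {"ts": 30, "key": "a"},
    {"ts": 59, "key": "b"},
    {"ts": 60, "key": "a"},
    {"ts": 125, "key": "b"},
]
window_counts = tumbling_window_counts(events, window_seconds=60)
```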

3. Orchestration

Airflow

Task groups for logical organization
XCom for inter-task communication
SLA monitoring and email alerts
Incremental execution with execution_date (named logical_date since Airflow 2.2)
Retry with exponential backoff
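
In Airflow, retries are normally configured on the operator (`retries`, `retry_delay`, `retry_exponential_backoff`) rather than written by hand. To show what exponential backoff does, here is an orchestrator-independent sketch in plain Python; `retry_with_backoff` and its parameters are hypothetical, and the sleep function is injectable so the delays can be inspected.

```python
import time

def retry_with_backoff(func, max_attempts=4, base_delay=1.0, factor=2.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call func, retrying on any exception with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(base_delay * factor ** (attempt - 1), max_delay)
            sleep(delay)

# Hypothetical flaky task: fails twice, then succeeds
calls = {"n": 0}

def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

delays = []
outcome = retry_with_backoff(flaky_extract, sleep=delays.append)
```

Production code usually adds random jitter to each delay so that many failing tasks do not retry at the same instant.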

Prefect

Task caching for idempotency
Parallel execution with .submit()
Artifacts for visibility
Automatic retries with configurable delays
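
Prefect provides task caching and `.submit()` itself, so the following is not Prefect code. It is a standard-library sketch of the two ideas under stated assumptions: a cache keyed on a hash of the task inputs makes re-runs idempotent, and a thread pool's `submit()` stands in for parallel task submission. `cached_task`, `input_hash`, and `load_partition` are hypothetical names.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor

_cache = {}
executed = []  # records which inputs actually ran

def input_hash(**kwargs) -> str:
    """Stable hash of keyword arguments, used as the cache key."""
    return hashlib.sha256(json.dumps(kwargs, sort_keys=True).encode()).hexdigest()

def cached_task(func):
    """Skip re-execution when the same inputs have already been processed."""
    def wrapper(**kwargs):
        key = (func.__name__, input_hash(**kwargs))
        if key not in _cache:
            _cache[key] = func(**kwargs)
        return _cache[key]
    return wrapper

@cached_task
def load_partition(day):
    executed.append(day)
    return f"loaded {day}"

days = ["2024-01-01", "2024-01-02"]

# First run: both partitions are submitted in parallel and executed
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(load_partition, day=d) for d in days]
    first_run = [f.result() for f in futures]

# Second run: results come from the cache, nothing is re-executed
second_run = [load_partition(day=d) for d in days]
```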

4. Transformation with dbt

Staging layer: incremental materialization, deduplication, late-arriving data handling
Marts layer: dimensional models, aggregations, business logic
Tests: unique, not_null, relationships, accepted_values, custom data quality tests
Sources: freshness checks, loaded_at_field tracking
Incremental strategy: merge or delete+insert
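
dbt expresses incremental merges in SQL through the model's `unique_key` and `incremental_strategy` settings. The semantics of the staging bullets above (deduplication, late-arriving data, merge) can be sketched in plain Python as follows; the row layout and function names are assumptions for illustration.

```python
def deduplicate_latest(rows, key="id", order_by="updated_at"):
    """Keep only the most recent row per key within a batch."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[order_by] >= current[order_by]:
            latest[row[key]] = row
    return list(latest.values())

def merge_into(target: dict, batch, key="id", order_by="updated_at"):
    """Upsert a batch into target, ignoring rows older than what target holds."""
    for row in deduplicate_latest(batch, key, order_by):
        existing = target.get(row[key])
        # Late-arriving data: an older version must not overwrite a newer one
        if existing is None or row[order_by] >= existing[order_by]:
            target[row[key]] = row
    return target

target = {1: {"id": 1, "status": "shipped", "updated_at": "2024-01-02"}}
batch = [
    {"id": 1, "status": "created", "updated_at": "2024-01-01"},  # late arrival
    {"id": 2, "status": "created", "updated_at": "2024-01-03"},
    {"id": 2, "status": "paid", "updated_at": "2024-01-04"},     # duplicate key, newer
]
merge_into(target, batch)
```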

5. Data Quality Framework

Great Expectations

Table-level: row count, column count
Column-level: uniqueness, nullability, type validation, value sets, ranges
Checkpoints for validation execution
Data docs for documentation
Failure notifications

dbt Tests

Schema tests in YAML
Custom data quality tests with dbt-expectations
Test results tracked in metadata
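
The table-level and column-level checks listed above reduce to small predicates over the data. The sketch below is neither Great Expectations nor dbt; it only illustrates the kinds of checks those tools run, using hypothetical function names and sample rows.

```python
def not_null(rows, column):
    return all(row.get(column) is not None for row in rows)

def unique(rows, column):
    values = [row.get(column) for row in rows]
    return len(values) == len(set(values))

def accepted_values(rows, column, allowed):
    return all(row.get(column) in allowed for row in rows)

def in_range(rows, column, low, high):
    return all(low <= row[column] <= high for row in rows if row.get(column) is not None)

def run_suite(rows, checks):
    """checks maps a check name to a callable that takes rows and returns a bool."""
    return {name: check(rows) for name, check in checks.items()}

orders = [
    {"id": 1, "status": "paid", "amount": 20.0},
    {"id": 2, "status": "refunded", "amount": 5.0},
    {"id": 2, "status": "unknown", "amount": -1.0},  # breaks three checks
]

report = run_suite(orders, {
    "row_count_positive": lambda rows: len(rows) > 0,
    "id_not_null": lambda rows: not_null(rows, "id"),
    "id_unique": lambda rows: unique(rows, "id"),
    "status_accepted": lambda rows: accepted_values(rows, "status", {"paid", "refunded"}),
    "amount_in_range": lambda rows: in_range(rows, "amount", 0, 10_000),
})
failed = sorted(name for name, passed in report.items() if not passed)
```

The list of failed check names is what would feed the failure notifications mentioned above.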

6. Storage Strategy

Delta Lake

ACID transactions with append/overwrite/merge modes
Upsert with predicate-based matching
Time travel for historical queries
Optimize: compact small files, Z-order clustering
Vacuum to remove old files

Apache Iceberg

Partitioning and sort order optimization
MERGE INTO for upserts
Snapshot isolation and time travel
File compaction with binpack strategy
Snapshot expiration for cleanup
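
Compaction in Delta Lake and Iceberg is run through the table format's own maintenance commands. As an illustration of the bin-packing idea only, the sketch below groups small files into rewrite groups that stay under a target size using first-fit-decreasing. It is not Iceberg's actual algorithm, and the 512 MB target is an assumption borrowed from the file-size guidance in this document.

```python
def plan_binpack(file_sizes_mb, target_mb=512):
    """Group files into compaction groups no larger than target_mb (first-fit-decreasing)."""
    bins = []
    for size in sorted(file_sizes_mb, reverse=True):
        for group in bins:
            if group["total"] + size <= target_mb:
                group["files"].append(size)
                group["total"] += size
                break
        else:
            # No existing group has room: start a new one
            bins.append({"total": size, "files": [size]})
    return [group["files"] for group in bins]

groups = plan_binpack([300, 200, 100, 64, 32, 400], target_mb=512)
```

Each resulting group would be rewritten as a single larger file, reducing the number of files a query has to open.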

7. Monitoring & Cost Optimization

Monitoring

Track: records processed/failed, data size, execution time, success/failure rates
CloudWatch metrics and custom namespaces
SNS alerts for critical/warning/info events
Data freshness checks
Performance trend analysis
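
A minimal sketch of the monitoring signals above: per-run metrics, a data freshness check, and a mapping to the critical/warning/info alert levels. The class, the function names, and the 99% threshold (taken from the success criteria in this document) are illustrative assumptions. Publishing to CloudWatch or SNS is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class RunMetrics:
    records_processed: int   # records that succeeded
    records_failed: int
    execution_seconds: float

    @property
    def success_rate(self) -> float:
        total = self.records_processed + self.records_failed
        return 1.0 if total == 0 else self.records_processed / total

def is_fresh(last_loaded_at: datetime, max_age: timedelta, now: datetime = None) -> bool:
    """True when the newest loaded data is no older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at <= max_age

def classify_alert(metrics: RunMetrics, fresh: bool, min_success_rate: float = 0.99) -> str:
    if not fresh or metrics.success_rate < min_success_rate:
        return "critical"
    if metrics.records_failed > 0:
        return "warning"
    return "info"

loaded_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
check_time = datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc)
fresh = is_fresh(loaded_at, timedelta(hours=2), now=check_time)
level = classify_alert(RunMetrics(995, 5, 12.5), fresh)
```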

Cost Optimization

Partitioning: date/entity-based; avoid over-partitioning (keep each partition above roughly 1GB)
File sizes: 512MB-1GB for Parquet
Lifecycle policies: hot (Standard) → warm (IA) → cold (Glacier)
Compute: spot instances for batch, on-demand for streaming, serverless for ad hoc queries
Query optimization: partition pruning, clustering, predicate pushdown
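
The partition and file-size guidance above comes down to simple arithmetic. The sketch below picks the finest date granularity that still keeps partitions above a minimum size, and estimates how many files a partition splits into at a target file size. The function names and the 30-days-per-month approximation are assumptions.

```python
import math

def choose_partition_granularity(daily_volume_gb: float, min_partition_gb: float = 1.0) -> str:
    """Pick the finest date granularity whose partitions stay above min_partition_gb."""
    if daily_volume_gb >= min_partition_gb:
        return "day"
    if daily_volume_gb * 30 >= min_partition_gb:  # approximate month as 30 days
        return "month"
    return "year"

def files_per_partition(partition_gb: float, target_file_mb: int = 512) -> int:
    """Number of files needed so that each is at most target_file_mb."""
    return max(1, math.ceil(partition_gb * 1024 / target_file_mb))

granularity_large = choose_partition_granularity(5.0)    # 5 GB/day
granularity_small = choose_partition_granularity(0.1)    # about 3 GB/month
granularity_tiny = choose_partition_granularity(0.01)    # about 0.3 GB/month
file_count = files_per_partition(5.0)                    # 5120 MB / 512 MB
```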

Example: Minimal Batch Pipeline

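# Batch ingestion with validation.
# batch_ingestion, storage.delta_lake_manager, and data_quality.expectations_suite
# are assumed to be project-local helper modules, not published packages.
from datetime import datetime, timezone

from batch_ingestion import BatchDataIngester
from storage.delta_lake_manager import DeltaLakeManager
from data_quality.expectations_suite import DataQualityFramework

ingester = BatchDataIngester(config={})

# Watermark of the previous successful run
# (placeholder value; normally read from a state store)
last_run_timestamp = datetime(2024, 1, 1, tzinfo=timezone.utc)

# Extract only rows changed since the last watermark
df = ingester.extract_from_database(
    connection_string='postgresql://host:5432/db',
    query='SELECT * FROM orders',
    watermark_column='updated_at',
    last_watermark=last_run_timestamp
)

# Validate against the expected schema; invalid rows go to the dead letter queue
schema = {'required_fields': ['id', 'user_id'], 'dtypes': {'id': 'int64'}}
df = ingester.validate_and_clean(df, schema)

# Data quality checks; inspect result before writing downstream
dq = DataQualityFramework()
result = dq.validate_dataframe(df, suite_name='orders_suite', data_asset_name='orders')

# Write to Delta Lake, partitioned by order date
delta_mgr = DeltaLakeManager(storage_path='s3://lake')
delta_mgr.create_or_update_table(
    df=df,
    table_name='orders',
    partition_columns=['order_date'],
    mode='append'
)

# Save failed records for later inspection and replay
ingester.save_dead_letter_queue('s3://lake/dlq/orders')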

Output Deliverables

1. Architecture Documentation

Architecture diagram with data flow
Technology stack with justification
Scalability analysis and growth patterns
Failure modes and recovery strategies

2. Implementation Code

Ingestion: batch/streaming with error handling
Transformation: dbt models (staging → marts) or Spark jobs
Orchestration: Airflow/Prefect DAGs with dependencies
Storage: Delta/Iceberg table management
Data quality: Great Expectations suites and dbt tests

3. Configuration Files

Orchestration: DAG definitions, schedules, retry policies
dbt: models, sources, tests, project config
Infrastructure: Docker Compose, K8s manifests, Terraform
Environment: dev/staging/prod configs

4. Monitoring & Observability

Metrics: execution time, records processed, quality scores
Alerts: failures, performance degradation, data freshness
Dashboards: Grafana/CloudWatch for pipeline health
Logging: structured logs with correlation IDs
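
Structured logs with correlation IDs can be produced with the standard library alone. In this sketch every log line is emitted as JSON and carries the same `correlation_id` for one pipeline run, so lines from different stages can be joined later. The logger name, field names, and the `orders` pipeline label are hypothetical.

```python
import io
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
            "pipeline": getattr(record, "pipeline", None),
        })

stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("pipeline.example")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# One correlation ID per pipeline run, attached to every line via LoggerAdapter
correlation_id = str(uuid.uuid4())
run_log = logging.LoggerAdapter(logger, {"correlation_id": correlation_id, "pipeline": "orders"})
run_log.info("extract finished")
run_log.info("load finished")

log_lines = [json.loads(line) for line in stream.getvalue().splitlines()]
```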

5. Operations Guide

Deployment procedures and rollback strategy
Troubleshooting guide for common issues
Scaling guide for increased volume
Cost optimization strategies and savings
Disaster recovery and backup procedures

Success Criteria

Pipeline meets defined SLA (latency, throughput)
Data quality checks pass with >99% success rate
Automatic retry and alerting on failures
Comprehensive monitoring shows health and performance
Documentation enables team maintenance
Cost optimization reduces infrastructure costs by 30-50%
Schema evolution without downtime
End-to-end data lineage tracked

12 stars

development

Updated Apr 21, 2026

$ install --global

skillsauth

npx skillsauth add frank-luongt/faos-skills-marketplace skills/codex/data-engineering-data-pipeline

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: container and dependency vulnerability scanner. Clean (95%)
Semgrep: static code analysis for vulnerabilities. Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%)
Snyk (dep): open source security scanning. Skipped (50%)
Socket.dev: supply chain security analysis. Skipped (50%)
VirusTotal: multi-engine malware detection. Skipped (50%)
CrowdStrike: advanced threat intelligence. Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Apr 21, 2026, 6:01 AM · 53.3s · 2 files scanned

Install manually

# Clone the repo
git clone https://github.com/frank-luongt/faos-skills-marketplace.git

# Copy into Claude Code skills folder (global)
cp -r faos-skills-marketplace/skills/codex/data-engineering-data-pipeline ~/.claude/skills/

Repository

frank-luongt/faos-skills-marketplace

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT