1kalin/afrexai-data-engineering/SKILL.md
# Data Engineering Command Center

Complete methodology for designing, building, operating, and scaling data pipelines and infrastructure. Zero dependencies: pure agent skill.

## Phase 1: Data Architecture Assessment

Before building anything, understand the landscape.

### Architecture Brief
project_name: ""
business_context: ""
data_consumers:
  - team: ""
    use_case: "" # analytics | ML | operational | reporting | reverse-ETL
    latency_requirement: "" # real-time (<1s) | near-real-time (<5min) | batch (hourly+)
    query_pattern: "" # ad-hoc | scheduled | API | dashboard
current_state:
  sources: [] # list every system producing data
  storage: [] # where data lives today
  pain_points: [] # what's broken, slow, unreliable
data_volume:
  current_gb_per_day: 0
  growth_rate_percent: 0
  retention_months: 0
constraints:
  budget_monthly_usd: 0
  team_size: 0
  skill_level: "" # junior | mid | senior | mixed
  compliance: [] # GDPR, HIPAA, SOX, PCI, none
  cloud_provider: "" # AWS | GCP | Azure | multi | on-prem
| Signal | Pattern | When to Use |
|--------|---------|-------------|
| All consumers need data hourly+ | Batch ETL | Reporting, warehousing, most analytics |
| Some need <5 min latency | Micro-batch | Dashboard freshness, near-real-time analytics |
| Events need <1s processing | Streaming | Fraud detection, real-time pricing, alerts |
| Need both batch + streaming | Lambda | When batch accuracy + real-time speed both matter |
| Want to simplify Lambda | Kappa | When you can reprocess from stream replay |
| Data lake + warehouse combined | Lakehouse | When you need both cheap storage + fast SQL |
| Sources change independently | Data Mesh | Large orgs, domain-owned data, >5 teams |
| ML is primary consumer | Feature Store | ML-heavy orgs with feature reuse needs |
| Tool | Best For | Avoid When |
|------|----------|------------|
| Airflow | Complex DAGs, Python-native teams, mature ecosystem | Simple pipelines (<5 tasks) |
| Dagster | Software-defined assets, strong typing, dev experience | Legacy team resistant to new paradigms |
| Prefect | Dynamic workflows, cloud-native, Python-first | Need on-prem with no cloud dependency |
| dbt | SQL transformations, ELT, analytics engineering | Non-SQL transforms, streaming |
| Temporal | Long-running workflows, retry-heavy, microservices | Simple ETL, small teams |
| Cron + scripts | <3 pipelines, solo engineer, simple schedules | Anything with dependencies or retries |
| Tool | Best For | Avoid When |
|------|----------|------------|
| Spark | >100GB, complex transforms, ML pipelines | <10GB (overkill), real-time streaming |
| DuckDB | Local analytics, <100GB, SQL on files | Distributed processing, production streaming |
| Polars | Single-node, Rust-speed, <50GB, DataFrames | Distributed, need Spark ecosystem |
| Pandas | <1GB, quick analysis, prototyping | Production pipelines, anything >5GB |
| Flink | True streaming, event-time processing | Batch-only, small team (steep learning curve) |
| SQL (warehouse) | ELT in Snowflake/BigQuery/Redshift | Complex ML transforms, binary data |
| Tool | Best For | Avoid When |
|------|----------|------------|
| Snowflake | Analytics, separation of compute/storage, multi-cloud | Tight budget, real-time OLTP |
| BigQuery | GCP-native, serverless, large-scale analytics | Multi-cloud, need fine-grained cost control |
| Redshift | AWS-native, existing AWS ecosystem | Elastic scaling needs, multi-cloud |
| Databricks | ML + analytics unified, Spark-native, lakehouse | Pure SQL analytics, small data |
| PostgreSQL | OLTP + light analytics, <500GB, budget-conscious | >1TB analytics, real-time dashboards on large data |
| S3/GCS/ADLS | Raw data lake, cheap storage, any format | Direct SQL queries (need compute layer) |
| Delta Lake/Iceberg | Table format on data lake, ACID on files | Simple file storage, no lakehouse need |
| Approach | Best For | Key Concept |
|----------|----------|-------------|
| Kimball (Dimensional) | BI/reporting, star schemas | Facts + Dimensions, business-process-centric |
| Inmon (3NF) | Enterprise data warehouse, single source of truth | Normalized, subject-area-centric |
| Data Vault 2.0 | Agile warehousing, auditability, multiple sources | Hubs + Links + Satellites, insert-only |
| One Big Table (OBT) | Simple analytics, few joins, dashboard performance | Pre-joined, denormalized, fast queries |
| Activity Schema | Event analytics, product analytics | Entity + Activity + Feature columns |
fact_table:
  name: "fact_[business_process]"
  grain: "" # one row = one [what]?
  grain_statement: "One row per [transaction/event/snapshot] at [time grain]"
  measures:
    - name: ""
      type: "" # additive | semi-additive | non-additive
      aggregation: "" # SUM | AVG | COUNT | MIN | MAX | COUNT DISTINCT
      business_definition: ""
  degenerate_dimensions: [] # IDs stored in fact (order_number, invoice_id)
  foreign_keys: [] # links to dimension tables
dimensions:
  - name: "dim_[entity]"
    type: "" # Type 1 (overwrite) | Type 2 (history) | Type 3 (previous value)
    natural_key: "" # business key from source
    surrogate_key: "" # warehouse-generated key
    attributes:
      - name: ""
        source: ""
        scd_type: "" # 1 | 2 | 3
    hierarchy: [] # e.g., [country, region, city, store]
| Scenario | SCD Type | Implementation |
|----------|----------|----------------|
| Don't care about history | Type 1 | UPDATE in place |
| Need full history | Type 2 | New row + valid_from/valid_to + is_current flag |
| Only need previous value | Type 3 | Add previous_[column] |
| Track changes with timestamps | Type 4 | Mini-dimension (history table) |
| Hybrid: some attrs Type 1, some Type 2 | Type 6 | Combine 1+2+3 in one table |
Default recommendation: Type 2 for anything business-critical (customer status, product price, employee department). Type 1 for everything else.
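To make the Type 2 mechanics concrete, here is a minimal in-memory sketch. The function name `apply_scd2`, the dict-based row layout, and the `OPEN_END` sentinel are illustrative assumptions, not part of any warehouse API; a real implementation would normally run as SQL in the warehouse.

```python
from datetime import datetime

OPEN_END = datetime(9999, 12, 31)  # assumed sentinel meaning "still valid"


def apply_scd2(dimension, incoming, tracked, now):
    """Apply one source record to a list of dimension rows (dicts), in place.

    `tracked` lists the columns whose changes should create a new version.
    """
    current = next(
        (row for row in dimension
         if row["natural_key"] == incoming["natural_key"] and row["is_current"]),
        None,
    )
    if current is not None:
        if all(current[col] == incoming[col] for col in tracked):
            return dimension  # nothing changed, so no new version
        # Close the old version instead of overwriting it.
        current["is_current"] = False
        current["valid_to"] = now
    new_row = {"natural_key": incoming["natural_key"]}
    new_row.update({col: incoming[col] for col in tracked})
    new_row.update({"valid_from": now, "valid_to": OPEN_END, "is_current": True})
    dimension.append(new_row)
    return dimension


dim = []
apply_scd2(dim, {"natural_key": "C1", "status": "trial"}, ["status"], datetime(2024, 1, 1))
apply_scd2(dim, {"natural_key": "C1", "status": "paid"}, ["status"], datetime(2024, 2, 1))
apply_scd2(dim, {"natural_key": "C1", "status": "paid"}, ["status"], datetime(2024, 3, 1))
```

After the three calls, `dim` holds two rows: the closed `trial` version and the current `paid` version. The third call is a no-op because no tracked column changed.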
| Object | Convention | Example |
|--------|-----------|---------|
| Raw/staging tables | raw_[source]_[table] | raw_stripe_payments |
| Staging models | stg_[source]__[entity] | stg_stripe__payments |
| Intermediate models | int_[entity]_[verb] | int_orders_pivoted |
| Mart/fact tables | fct_[business_process] | fct_orders |
| Dimension tables | dim_[entity] | dim_customers |
| Metrics/aggregates | mrt_[domain]_[metric] | mrt_sales_daily |
| Snapshots | snp_[entity]_[grain] | snp_inventory_daily |
| Columns: boolean | is_[state] or has_[thing] | is_active, has_subscription |
| Columns: timestamp | [event]_at | created_at, shipped_at |
| Columns: date | [event]_date | order_date |
| Columns: ID | [entity]_id | customer_id |
| Columns: amount | [thing]_amount | order_amount |
| Columns: count | [thing]_count | line_item_count |
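The table conventions above can be enforced mechanically, for example in CI. The sketch below is one possible check; the regular expressions are my transcription of the table (lowercase snake_case is an assumption) and `classify_table` is a hypothetical helper name.

```python
import re

# Patterns transcribed from the naming table; lowercase snake_case is assumed.
TABLE_PATTERNS = {
    "raw": re.compile(r"^raw_[a-z0-9]+_[a-z0-9_]+$"),
    "staging": re.compile(r"^stg_[a-z0-9]+__[a-z0-9_]+$"),
    "intermediate": re.compile(r"^int_[a-z0-9]+_[a-z0-9_]+$"),
    "fact": re.compile(r"^fct_[a-z0-9_]+$"),
    "dimension": re.compile(r"^dim_[a-z0-9_]+$"),
    "metric": re.compile(r"^mrt_[a-z0-9]+_[a-z0-9_]+$"),
    "snapshot": re.compile(r"^snp_[a-z0-9]+_[a-z0-9_]+$"),
}


def classify_table(name):
    """Return the layer a table name belongs to, or None if it breaks convention."""
    for layer, pattern in TABLE_PATTERNS.items():
        if pattern.match(name):
            return layer
    return None


examples = ["raw_stripe_payments", "stg_stripe__payments", "fct_orders", "Orders"]
layers = {name: classify_table(name) for name in examples}
```

Here `layers["Orders"]` is `None`, which a CI step could treat as a failure.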
pipeline:
  name: ""
  owner: ""
  schedule: "" # cron expression
  sla_minutes: 0 # max acceptable runtime
  tier: "" # 1 (critical) | 2 (important) | 3 (nice-to-have)
  extract:
    source_system: ""
    connection: ""
    strategy: "" # full | incremental | CDC | log-based
    incremental_key: "" # column for incremental (e.g., updated_at)
    watermark_storage: "" # where to persist last-extracted position
  transform:
    engine: "" # SQL | Spark | Python | dbt
    stages:
      - name: "clean"
        operations: [] # dedupe, null handling, type casting, trimming
      - name: "conform"
        operations: [] # standardize codes, currencies, timezones
      - name: "enrich"
        operations: [] # lookups, calculations, derived fields
      - name: "aggregate"
        operations: [] # rollups, pivots, window functions
  load:
    target_system: ""
    strategy: "" # append | upsert | merge | truncate-reload | partition-swap
    merge_keys: []
    partition_key: ""
    clustering_keys: []
  quality_gates:
    pre_load: [] # checks before writing
    post_load: [] # checks after writing
  error_handling:
    strategy: "" # fail-fast | dead-letter | retry | skip-and-alert
    max_retries: 3
    retry_delay_seconds: 300
    alert_channels: []
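As an illustration of the `retry` strategy with the `max_retries` and `retry_delay_seconds` settings from the template, here is a minimal sketch. `run_with_retries` is a hypothetical helper, and the injectable `sleep` argument exists only so the example runs instantly; orchestrators such as Airflow provide their own retry settings.

```python
import time


def run_with_retries(task, max_retries=3, retry_delay_seconds=300, sleep=time.sleep):
    """Run `task()`; on failure wait and retry, re-raising after the last attempt."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            attempt += 1
            if attempt > max_retries:
                raise  # retries exhausted: surface the error so alerting fires
            sleep(retry_delay_seconds)


calls = {"n": 0}


def flaky_extract():
    """Stand-in for an extract step that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source unavailable")
    return "ok"


waits = []  # record requested delays instead of actually sleeping
result = run_with_retries(flaky_extract, sleep=waits.append)
```

With the defaults the task is attempted at most four times (one initial run plus three retries), with a fixed delay between attempts.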
Is the source a database?
├── Yes → Does it support CDC?
│   ├── Yes → Use CDC (Debezium, AWS DMS, Fivetran)
│   │         Best for: high-volume, low-latency, minimal source impact
│   └── No → Does it have a reliable updated_at column?
│       ├── Yes → Incremental extraction on updated_at
│       │         ⚠️ Won't catch hard deletes; add periodic full reconciliation
│       └── No → Full extraction
│                Only viable for small tables (<1M rows)
├── Is it an API?
│   ├── Supports webhooks? → Event-driven ingestion
│   ├── Has cursor/pagination? → Incremental with cursor bookmark
│   └── No pagination? → Full pull with rate-limit handling
├── Is it files (S3, SFTP, email)?
│   └── Event-triggered (S3 notification, file watcher)
│       Validate: schema, completeness, filename pattern
└── Is it streaming (Kafka, Kinesis, Pub/Sub)?
    └── Consumer group with offset management
        Key decisions: at-least-once vs exactly-once, consumer lag alerting
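The "incremental extraction on updated_at" branch can be sketched with an in-memory SQLite database standing in for the source. The table, the column names, and the `extract_incremental` helper are assumptions for illustration; in practice the watermark would be persisted wherever `watermark_storage` points.

```python
import sqlite3


def extract_incremental(conn, table, incremental_key, watermark):
    """Return rows changed after `watermark`, plus the new watermark.

    Table and column names are interpolated directly, so they must come
    from trusted pipeline config, never from user input.
    """
    rows = conn.execute(
        f"SELECT id, {incremental_key} FROM {table} "
        f"WHERE {incremental_key} > ? ORDER BY {incremental_key}",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][1] if rows else watermark
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2024-01-01T00:00:00"), (2, "2024-01-02T00:00:00")],
)

first_batch, watermark = extract_incremental(conn, "orders", "updated_at", "")
conn.execute("INSERT INTO orders VALUES (3, '2024-01-03T00:00:00')")
second_batch, watermark = extract_incremental(conn, "orders", "updated_at", watermark)
```

The second call returns only row 3. Note that a strict `>` comparison can miss rows that share the watermark timestamp but commit later, so some teams re-read a small overlap window and deduplicate downstream.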
| Strategy | When | Trade-off |
|----------|------|-----------|
| Append | Event/log data, immutable facts | Simple but grows forever; partition + retain |
| Upsert/Merge | Dimension updates, SCD Type 1 | Handles updates but slower on large tables |
| Truncate-Reload | Small tables (<1M), reference data | Simple but window of missing data |
| Partition Swap | Large fact tables, daily loads | Atomic, fast, but needs partition alignment |
| Soft Delete | Need audit trail of deletions | Adds complexity to every downstream query |
Every pipeline MUST be re-runnable without side effects:
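One common way to get that property is to replace a whole partition inside a single transaction, so a second run overwrites rather than appends. The sketch below uses SQLite and a hypothetical `load_partition` helper; the table layout is an assumption.

```python
import sqlite3


def load_partition(conn, run_date, rows):
    """Replace the `run_date` partition atomically so re-runs never duplicate rows."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM fct_orders WHERE order_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO fct_orders (order_id, order_date, order_amount) VALUES (?, ?, ?)",
            [(order_id, run_date, amount) for order_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fct_orders (order_id TEXT, order_date TEXT, order_amount REAL)")

batch = [("A-1", 10.0), ("A-2", 25.5)]
load_partition(conn, "2024-01-01", batch)
load_partition(conn, "2024-01-01", batch)  # simulated re-run of the same day

row_count, total = conn.execute(
    "SELECT COUNT(*), SUM(order_amount) FROM fct_orders"
).fetchone()
```

After both runs the table still contains two rows totalling 35.5, which is the behaviour a retry or backfill relies on.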
| Dimension | Definition | Example Check |
|-----------|-----------|---------------|
| Completeness | No missing values where required | NOT NULL on required fields, row count within range |
| Uniqueness | No unexpected duplicates | Primary key uniqueness, natural key uniqueness |
| Validity | Values within expected domain | Enum checks, range checks, regex patterns |
| Accuracy | Data matches real-world truth | Cross-system reconciliation, manual spot checks |
| Freshness | Data arrives on time | MAX(loaded_at) > NOW() - INTERVAL '2 hours' |
| Consistency | Same data agrees across systems | Sum reconciliation between source and target |
-- Completeness: Required fields not null
SELECT COUNT(*) AS null_violations
FROM {table}
WHERE {required_column} IS NULL;
-- Threshold: 0
-- Uniqueness: No duplicate primary keys
SELECT {pk_column}, COUNT(*) AS dupe_count
FROM {table}
GROUP BY {pk_column}
HAVING COUNT(*) > 1;
-- Threshold: 0
-- Freshness: Data arrived within SLA
SELECT CASE
WHEN MAX({timestamp_col}) > CURRENT_TIMESTAMP - INTERVAL '{sla_hours} hours'
THEN 'PASS' ELSE 'FAIL'
END AS freshness_check
FROM {table};
-- Volume: Row count within expected range
SELECT CASE
WHEN COUNT(*) BETWEEN {min_expected} AND {max_expected}
THEN 'PASS' ELSE 'FAIL'
END AS volume_check
FROM {table}
WHERE {partition_col} = '{run_date}';
-- Referential: FK integrity
SELECT COUNT(*) AS orphan_count
FROM {fact_table} f
LEFT JOIN {dim_table} d ON f.{fk} = d.{pk}
WHERE d.{pk} IS NULL;
-- Threshold: 0
-- Distribution: No unexpected skew
SELECT {column}, COUNT(*) AS cnt,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct
FROM {table}
GROUP BY {column}
ORDER BY cnt DESC;
-- Alert if any single value > {max_pct}%
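-- Cross-system reconciliation
-- (totals are computed in a CTE because most engines do not allow
-- referencing a column alias within the same SELECT list)
WITH totals AS (
  SELECT
    (SELECT SUM(amount) FROM source_system.orders WHERE date = '{date}') AS source_total,
    (SELECT SUM(amount) FROM warehouse.fct_orders WHERE order_date = '{date}') AS target_total
)
SELECT source_total, target_total,
  ABS(source_total - target_total) AS variance
FROM totals;
-- Threshold: variance < 0.01 * source_total (1%)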
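Checks like these are usually run in a loop and compared against their thresholds. The following sketch assumes each check is a query returning a single violation count; `run_checks` and the sample table are illustrative, with SQLite standing in for the warehouse.

```python
import sqlite3


def run_checks(conn, checks):
    """Run (name, sql, threshold) checks; a check passes when the single
    number returned by `sql` is <= threshold."""
    results = {}
    for name, sql, threshold in checks:
        violations = conn.execute(sql).fetchone()[0]
        results[name] = "PASS" if violations <= threshold else "FAIL"
    return results


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fct_orders (order_id TEXT, customer_id TEXT)")
conn.executemany(
    "INSERT INTO fct_orders VALUES (?, ?)",
    [("A-1", "C1"), ("A-2", None), ("A-2", "C2")],
)

checks = [
    ("completeness",
     "SELECT COUNT(*) FROM fct_orders WHERE customer_id IS NULL", 0),
    ("uniqueness",
     "SELECT COUNT(*) FROM (SELECT order_id FROM fct_orders "
     "GROUP BY order_id HAVING COUNT(*) > 1) AS dupes", 0),
    ("volume_floor",
     "SELECT CASE WHEN COUNT(*) >= 1 THEN 0 ELSE 1 END FROM fct_orders", 0),
]
results = run_checks(conn, checks)
```

On this sample data the completeness and uniqueness checks fail (one NULL `customer_id`, one duplicated `order_id`) while the volume check passes.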
contract:
  name: ""
  version: ""
  owner: "" # team responsible for producing this data
  consumers: [] # teams consuming this data
  sla:
    freshness_hours: 0
    availability_percent: 99.9
    support_hours: "" # business-hours | 24x7
  schema:
    - column: ""
      type: ""
      nullable: false
      description: ""
      business_definition: ""
      pii: false
      checks:
        - type: "" # not_null | unique | range | enum | regex | custom
          params: {}
  breaking_change_policy: "" # notify-30-days | version-bump | never-break
  notification_channel: ""
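A contract's `schema` section can be checked against incoming rows before load. The sketch below handles only `nullable`, `range`, and `enum`; the `params` keys (`min`, `max`, `values`) are assumptions, since the template leaves `params` empty, and `validate_row` is a hypothetical helper.

```python
def validate_row(row, schema):
    """Return a list of human-readable contract violations for one row."""
    violations = []
    for column in schema:
        name = column["column"]
        value = row.get(name)
        if value is None:
            if not column.get("nullable", False):
                violations.append(f"{name}: null not allowed")
            continue  # remaining checks do not apply to a missing value
        for check in column.get("checks", []):
            params = check.get("params", {})
            if check["type"] == "range" and not (params["min"] <= value <= params["max"]):
                violations.append(
                    f"{name}: {value} outside [{params['min']}, {params['max']}]"
                )
            elif check["type"] == "enum" and value not in params["values"]:
                violations.append(f"{name}: {value!r} not in allowed values")
    return violations


schema = [
    {"column": "order_id", "nullable": False},
    {"column": "order_amount", "nullable": False,
     "checks": [{"type": "range", "params": {"min": 0, "max": 1000000}}]},
    {"column": "order_status", "nullable": True,
     "checks": [{"type": "enum", "params": {"values": ["pending", "shipped"]}}]},
]

good = validate_row({"order_id": "A-1", "order_amount": 42, "order_status": "pending"}, schema)
bad = validate_row({"order_id": None, "order_amount": -5, "order_status": "lost"}, schema)
```

`good` is an empty list, while `bad` lists one violation per column; rows with violations could be routed to a dead-letter table rather than loaded.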
| Level | Definition | Response |
|-------|-----------|----------|
| P0 (Critical) | Data corruption, wrong numbers in production dashboards, compliance data wrong | Stop pipeline, alert immediately, rollback if possible |
| P1 (High) | Missing data for key reports, SLA breach, >5% of records affected | Alert team, fix within 4 hours, post-mortem required |
| P2 (Medium) | Non-critical field quality, <1% records affected, no downstream impact | Fix in next sprint, add monitoring to prevent recurrence |
| P3 (Low) | Cosmetic issues, edge cases, non-critical data | Backlog, fix when convenient |
| Problem | Fix | Impact |
|---------|-----|--------|
| Full table scan | Add/use partition pruning | 10-100x faster |
| Large joins | Pre-aggregate before joining | 5-50x faster |
| SELECT * | Select only needed columns | 2-10x faster (columnar stores) |
| Correlated subquery | Rewrite as JOIN or window function | 10-100x faster |
| DISTINCT on large result | Fix upstream duplication instead | 2-5x faster |
| ORDER BY without LIMIT | Add LIMIT or remove if not needed | Prevents memory spills |
| String operations in WHERE | Pre-compute, use lookup table | Enables index usage |
| Multiple passes over same data | Combine with CASE WHEN + GROUP BY | 2-5x faster |
| NOT IN with NULLs | Use NOT EXISTS or LEFT JOIN IS NULL | Correctness + performance |
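The "NOT IN with NULLs" row is a correctness issue, not only a performance one. This small SQLite demonstration (table names are illustrative) shows how a single NULL in the subquery makes `NOT IN` return no rows, while `NOT EXISTS` behaves as intended.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER);
    CREATE TABLE orders (customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO orders VALUES (1), (NULL);
""")

# Customers without orders, written with NOT IN: the NULL in orders makes
# every comparison evaluate to UNKNOWN, so no row qualifies.
not_in = conn.execute(
    "SELECT customer_id FROM customers "
    "WHERE customer_id NOT IN (SELECT customer_id FROM orders) "
    "ORDER BY customer_id"
).fetchall()

# Same question with NOT EXISTS: NULLs simply never match.
not_exists = conn.execute(
    "SELECT c.customer_id FROM customers c "
    "WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id) "
    "ORDER BY c.customer_id"
).fetchall()
```

`not_in` comes back empty, whereas `not_exists` returns customers 2 and 3.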
| Problem | Solution |
|---------|----------|
| Shuffle-heavy joins | Broadcast small table (broadcast(df)) if <100MB |
| Data skew | Salt the skewed key: add random prefix, join on salted key, aggregate |
| Small files | Coalesce output: .coalesce(target_files) or use adaptive query execution |
| Too many partitions | spark.sql.shuffle.partitions = 2-3x cluster cores |
| OOM errors | Increase spark.executor.memory, reduce partition size, spill to disk |
| Slow writes | Use Parquet with snappy, partition by date, avoid small writes |
| Repeated computation | .cache() or .persist() DataFrames used >1 time |
| Complex transformations | Push down predicates, filter early, select early |
| Data Type | Partition Key | Why |
|-----------|--------------|-----|
| Transactional/event | Date (daily or monthly) | Most queries filter by time range |
| Multi-tenant | Tenant ID + date | Isolate tenant queries, time-range pruning |
| Geospatial | Region + date | Regional queries are common |
| Log data | Date + hour | High volume needs finer partitions |
| Reference/dimension | Don't partition | Too small, full scan is fine |
Rules:
| Level | Examples | Controls |
|-------|---------|----------|
| Public | Product catalog, published stats | No restrictions |
| Internal | Aggregated metrics, non-PII analytics | Auth required, audit logging |
| Confidential | Customer PII, financial records, HR data | Encryption, column-level access, masking |
| Restricted | SSN, payment cards, health records, passwords | Encryption at rest + transit, tokenization, audit every access, retention limits |
dataset:
  name: ""
  description: ""
  owner_team: ""
  steward: "" # person responsible for quality
  domain: "" # sales | marketing | finance | product | engineering
  tier: "" # gold (trusted) | silver (cleaned) | bronze (raw)
  lineage:
    sources: [] # upstream datasets/systems
    transformations: "" # brief description of key transforms
    downstream: [] # who consumes this
  refresh:
    schedule: ""
    sla_hours: 0
    last_successful_run: ""
  quality:
    tests: [] # list of quality checks
    last_score: 0 # 0-100
    known_issues: []
  access:
    classification: "" # public | internal | confidential | restricted
    pii_columns: []
    access_request_process: "" # how to get access
  usage:
    avg_daily_queries: 0
    top_consumers: []
    cost_monthly_usd: 0
dashboard:
  pipeline_metrics:
    - metric: "Pipeline Success Rate"
      formula: "successful_runs / total_runs * 100"
      target: ">99%"
      alert_threshold: "<95%"
    - metric: "Average Runtime"
      formula: "avg(end_time - start_time) over 7 days"
      target: "<SLA"
      alert_threshold: ">80% of SLA"
    - metric: "Data Freshness"
      formula: "NOW() - MAX(loaded_at)"
      target: "<SLA hours"
      alert_threshold: ">SLA"
    - metric: "Data Volume Variance"
      formula: "abs(today_rows - avg_7d_rows) / avg_7d_rows * 100"
      target: "<20%"
      alert_threshold: ">50%"
    - metric: "Quality Check Pass Rate"
      formula: "passed_checks / total_checks * 100"
      target: "100%"
      alert_threshold: "<95%"
    - metric: "Failed Pipeline Count"
      formula: "count where status = failed in last 24h"
      target: "0"
      alert_threshold: ">0"
    - metric: "Backfill Queue"
      formula: "count of pending backfill requests"
      target: "0"
      alert_threshold: ">5"
    - metric: "Infrastructure Cost"
      formula: "compute + storage + egress"
      target: "<budget"
      alert_threshold: ">110% budget"
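As a worked example of one metric, the "Data Volume Variance" formula can be computed as below. The function names are hypothetical, and the intermediate "warn" band between the 20% target and the 50% alert threshold is my assumption; the dashboard spec only defines the two limits.

```python
def volume_variance_pct(today_rows, last_7d_rows):
    """abs(today_rows - avg_7d_rows) / avg_7d_rows * 100, per the dashboard formula."""
    avg_7d = sum(last_7d_rows) / len(last_7d_rows)
    return abs(today_rows - avg_7d) / avg_7d * 100


def volume_status(variance_pct, target=20, alert_threshold=50):
    """Map a variance percentage to ok / warn / alert (the warn band is assumed)."""
    if variance_pct > alert_threshold:
        return "alert"
    if variance_pct >= target:
        return "warn"
    return "ok"


history = [1000, 1100, 900, 1000, 1050, 950, 1000]  # rows loaded on the last 7 days
variance = volume_variance_pct(400, history)  # today only 400 rows arrived
status = volume_status(variance)
```

With a 7-day average of 1,000 rows, a 400-row day is a 60% deviation and lands in the alert band, while a 1,100-row day (10%) would be within target.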
| Severity | Condition | Response Time | Example |
|----------|-----------|---------------|---------|
| P0 | Revenue/compliance impacting | 15 min | Payment pipeline down, regulatory report delayed |
| P1 | Business-critical dashboard stale | 1 hour | Executive dashboard >4h stale |
| P2 | Non-critical pipeline failed | 4 hours | Marketing attribution delayed |
| P3 | Warning/degradation | Next business day | Pipeline 80% of SLA, minor quality drift |
Every pipeline run MUST log:
{
  "pipeline_name": "",
  "run_id": "",
  "started_at": "",
  "completed_at": "",
  "status": "success|failed|partial",
  "stage": "",
  "rows_extracted": 0,
  "rows_transformed": 0,
  "rows_loaded": 0,
  "rows_rejected": 0,
  "quality_checks_passed": 0,
  "quality_checks_failed": 0,
  "duration_seconds": 0,
  "error_message": "",
  "watermark_before": "",
  "watermark_after": ""
}
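A small helper can guarantee that every run emits exactly these fields. This is a sketch under assumptions: `build_run_log` and `RUN_LOG_FIELDS` are hypothetical names, and timestamps are serialized as ISO 8601 strings, which the template does not specify.

```python
import json
from datetime import datetime, timezone

# Field names and defaults copied from the log template above.
RUN_LOG_FIELDS = {
    "pipeline_name": "", "run_id": "", "started_at": "", "completed_at": "",
    "status": "", "stage": "", "rows_extracted": 0, "rows_transformed": 0,
    "rows_loaded": 0, "rows_rejected": 0, "quality_checks_passed": 0,
    "quality_checks_failed": 0, "duration_seconds": 0, "error_message": "",
    "watermark_before": "", "watermark_after": "",
}


def build_run_log(started_at, completed_at, **fields):
    """Build a complete run-log record, rejecting fields outside the template."""
    unknown = set(fields) - set(RUN_LOG_FIELDS)
    if unknown:
        raise KeyError(f"unknown run-log fields: {sorted(unknown)}")
    record = dict(RUN_LOG_FIELDS, **fields)
    record["started_at"] = started_at.isoformat()
    record["completed_at"] = completed_at.isoformat()
    record["duration_seconds"] = int((completed_at - started_at).total_seconds())
    return record


log = build_run_log(
    datetime(2024, 1, 1, 2, 0, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 2, 1, 30, tzinfo=timezone.utc),
    pipeline_name="orders_daily", run_id="run-001", status="partial",
    rows_extracted=120, rows_transformed=120, rows_loaded=118, rows_rejected=2,
)
line = json.dumps(log)  # one JSON object per run, ready for a log sink
```

Deriving `duration_seconds` from the two timestamps avoids the three values drifting apart.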
| Layer | What to Test | How | When |
|-------|-------------|-----|------|
| Unit | Individual transforms, business logic | pytest with fixtures, dbt unit tests | Every PR |
| Integration | Source connectivity, schema compatibility | Test against staging/dev environment | Daily + PR |
| Contract | Schema hasn't changed, data types stable | Schema registry, contract tests | Every pipeline run |
| Data Quality | Completeness, uniqueness, freshness, validity | Quality framework checks | Every pipeline run |
| E2E | Full pipeline produces correct output | Golden dataset comparison | Weekly + release |
| Performance | Runtime within SLA, no regression | Benchmark against historical runs | Weekly |
# For every model, define at minimum:
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: order_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'confirmed', 'shipped', 'delivered', 'cancelled']
      - name: ordered_at
        tests:
          - not_null
          - dbt_utils.recency:
              datepart: day
              field: ordered_at
              interval: 2
When you need to reprocess historical data:
| Strategy | Savings | Effort |
|----------|---------|--------|
| Right-size compute (auto-scaling) | 20-40% | Low |
| Use spot/preemptible instances for batch | 60-80% | Medium |
| Compress data (Parquet + Snappy/Zstd) | 50-80% storage | Low |
| Lifecycle policies (hot → warm → cold → archive) | 40-70% storage | Low |
| Eliminate unused tables/pipelines | 10-30% | Low |
| Optimize query patterns (partition pruning) | 30-60% compute | Medium |
| Reserved capacity for steady-state | 30-50% | Medium |
| Cache expensive queries | 20-50% compute | Medium |
cost_tracking:
  by_pipeline:
    - pipeline: ""
      compute_monthly_usd: 0
      storage_monthly_usd: 0
      egress_monthly_usd: 0
      total: 0
      cost_per_row: 0 # total / rows_processed
      business_value: "" # what revenue/decision does this enable?
      roi_justified: true # is the cost worth it?
  optimization_opportunities:
    - description: ""
      estimated_savings_usd: 0
      effort: "" # low | medium | high
      priority: 0 # 1 = do now
Pipeline failed →
1. Check error message in logs
   ├── Connection timeout → Check source availability, network, credentials
   ├── Schema mismatch → Source schema changed → update extract + notify
   ├── Data quality check failed → Investigate source data, check for anomalies
   ├── Out of memory → Increase resources or optimize query
   ├── Permission denied → Check IAM roles, token expiry
   ├── Duplicate key violation → Check idempotency, investigate source dupes
   └── Timeout (SLA breach) → Check data volume spike, query plan, cluster health
2. Determine impact
   ├── What dashboards/reports are affected?
   ├── What's the data freshness SLA?
   └── Who needs to be notified?
3. Fix
   ├── Transient (network, timeout) → Retry
   ├── Data issue → Fix source data, re-run with quality gate override if safe
   ├── Schema change → Update pipeline, backfill if needed
   └── Infrastructure → Scale up, file ticket with cloud provider
4. Post-fix
   ├── Verify data correctness
   ├── Update runbook with new failure mode
   └── Add monitoring/alerting to catch earlier next time
When a source system changes schema:
| Scenario | RPO | RTO | Recovery Steps |
|----------|-----|-----|----------------|
| Pipeline code lost | 0 (git) | 1h | Redeploy from git, restore orchestrator state |
| Warehouse data corrupted | Varies | 4h | Restore from Time Travel/snapshot, re-run affected pipelines |
| Source system down | N/A | Wait | Queue extractions, catch up with incremental once restored |
| Cloud region outage | 24h | 8h | Failover to DR region if configured, else wait |
| Credential compromise | 0 | 2h | Rotate all credentials, audit access logs, re-run affected pipelines |
-- Merge pattern for SCD Type 2
MERGE INTO dim_customer AS target
USING (
SELECT * FROM stg_customers
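-- COALESCE covers the first load, when dim_customer is empty and MAX() is NULL
WHERE updated_at > COALESCE((SELECT MAX(valid_from) FROM dim_customer), TIMESTAMP '1900-01-01 00:00:00')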
) AS source
ON target.customer_natural_key = source.customer_id
AND target.is_current = TRUE
-- Update: close old record
WHEN MATCHED AND (
target.customer_name != source.name OR
target.customer_status != source.status
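-- list all Type 2 tracked columns; for nullable columns use IS DISTINCT FROM,
-- because != evaluates to NULL (not TRUE) when either side is NULL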
) THEN UPDATE SET
is_current = FALSE,
valid_to = CURRENT_TIMESTAMP
-- Insert: brand-new customers only (changed customers are matched above,
-- so their new version comes from the follow-up INSERT)
WHEN NOT MATCHED THEN INSERT (
customer_natural_key, customer_name, customer_status,
valid_from, valid_to, is_current
) VALUES (
source.customer_id, source.name, source.status,
CURRENT_TIMESTAMP, '9999-12-31', TRUE
);
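-- Then insert new versions of changed records.
-- Run this in the same transaction as the MERGE: the valid_to = CURRENT_TIMESTAMP
-- match below only works if both statements see the same timestamp
-- (otherwise capture the load timestamp in a variable and use it in both).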
INSERT INTO dim_customer (
customer_natural_key, customer_name, customer_status,
valid_from, valid_to, is_current
)
SELECT customer_id, name, status,
CURRENT_TIMESTAMP, '9999-12-31', TRUE
FROM stg_customers s
WHERE EXISTS (
SELECT 1 FROM dim_customer d
WHERE d.customer_natural_key = s.customer_id
AND d.is_current = FALSE
AND d.valid_to = CURRENT_TIMESTAMP
);
Source DB → Debezium Connector → Kafka Topic →
├── Stream processor (Flink/Spark Streaming) → Target DB
├── S3 sink connector → Data Lake (raw)
└── Elasticsearch sink → Search index
Key decisions:
feature_store:
  entity: "customer"
  entity_key: "customer_id"
  features:
    - name: "total_orders_30d"
      description: "Total orders in last 30 days"
      type: "INT"
      source: "fct_orders"
      computation: "batch" # batch | streaming | on-demand
      freshness: "daily"
      ttl_hours: 48
    - name: "avg_order_value_90d"
      description: "Average order value last 90 days"
      type: "FLOAT"
      source: "fct_orders"
      computation: "batch"
      freshness: "daily"
      ttl_hours: 48
    - name: "last_login_minutes_ago"
      description: "Minutes since last login event"
      type: "INT"
      source: "events_stream"
      computation: "streaming"
      freshness: "real-time"
      ttl_hours: 1
  serving:
    online: true # low-latency feature serving (Redis/DynamoDB)
    offline: true # batch feature retrieval for training
    point_in_time_correct: true # prevent feature leakage in ML training
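The `point_in_time_correct` flag means a training example may only see feature values that existed at its own timestamp. A minimal sketch of that lookup follows; `point_in_time_value` and the sample history are illustrative, and real feature stores perform this as an as-of join over whole tables.

```python
from bisect import bisect_right


def point_in_time_value(feature_history, as_of):
    """Return the latest feature value observed at or before `as_of`.

    `feature_history` is a list of (timestamp, value) pairs sorted by
    timestamp. Returns None when no value existed yet.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, as_of)
    return feature_history[idx - 1][1] if idx else None


# Hypothetical daily snapshots of total_orders_30d for one customer.
history = [("2024-01-01", 2), ("2024-01-15", 5), ("2024-02-01", 9)]

label_time = "2024-01-20"  # when the training label was observed
feature_at_label_time = point_in_time_value(history, label_time)
```

For a label observed on 2024-01-20 the lookup returns 5, not the later value 9; using 9 would leak information from the future into training.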
If operating at scale (>5 data teams):
When NOT to use Data Mesh:
| Dimension | Weight | Scoring |
|-----------|--------|---------|
| Pipeline Reliability | 20 | 0=frequent failures, 10=some failures with manual recovery, 20=99.5%+ success rate with auto-retry |
| Data Quality | 20 | 0=no checks, 10=basic null/unique checks, 20=comprehensive quality framework with contracts |
| Performance | 15 | 0=regularly breaches SLA, 8=meets SLA, 15=well under SLA with optimization |
| Documentation | 10 | 0=none, 5=basic README, 10=full catalog entries with lineage and business definitions |
| Monitoring | 15 | 0=no alerts, 8=failure alerts only, 15=proactive monitoring with dashboards and anomaly detection |
| Testing | 10 | 0=no tests, 5=basic smoke tests, 10=full test pyramid (unit+integration+contract+E2E) |
| Cost Efficiency | 10 | 0=no cost tracking, 5=tracked, 10=optimized with ROI justification per pipeline |
Scoring guide:
TIMESTAMP WITH TIME ZONE, never WITHOUT

Say any of these to activate specific workflows: