dtsong/ml-workflow

Name: ml-workflow
Author: dtsong

skills/council/alchemist/ml-workflow/SKILL.md

npx skillsauth add dtsong/my-claude-setup ml-workflow

ML Workflow

Purpose

Design end-to-end ML workflows covering experiment tracking, feature engineering and storage, model training pipelines, model serving and deployment, A/B testing for models, and monitoring for data and model drift. Produces a workflow architecture, tool selection rationale, and operational runbook.

Scope Constraints

Reads ML code, configuration files, experiment logs, and infrastructure specs for analysis. Does not train models, execute experiments, or deploy to production.

Inputs

ML problem type (classification, regression, ranking, recommendation, NLP, CV)
Data sources and feature candidates
Model complexity range (linear/tree-based vs deep learning)
Serving requirements (batch predictions, real-time inference, edge deployment)
Team size and ML maturity (first model vs established ML platform)
Infrastructure constraints (cloud provider, GPU availability, budget)

Input Sanitization

No user-provided values are used in commands or file paths. All inputs are treated as read-only analysis targets.

Procedure

Progress Checklist

[ ] Step 1: Define the ML problem
[ ] Step 2: Design feature engineering pipeline
[ ] Step 3: Design experiment tracking
[ ] Step 4: Design training pipeline
[ ] Step 5: Design model serving
[ ] Step 6: Design A/B testing
[ ] Step 7: Design monitoring and drift detection

Step 1: Define the ML Problem Clearly

Before any tooling decisions, formalize:

What is the prediction target? What does "correct" look like?
What is the business metric this model optimizes? (Not just accuracy — revenue, conversion, engagement)
What is the baseline? (Rule-based heuristic, current model, random chance)
What is the minimum viable performance to ship?

Document the problem statement, target variable, evaluation metric, and success threshold.
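One way to keep these answers explicit is to record them in a small structure that can also answer the "good enough to ship?" question. The sketch below is only illustrative: the `ProblemDefinition` class, its field names, and the example values are hypothetical, and it assumes a higher evaluation score is better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemDefinition:
    """Hypothetical record of the Step 1 decisions; field names are illustrative."""
    target: str             # what the model predicts
    business_metric: str    # the KPI the model is meant to move
    evaluation_metric: str  # offline metric used during development
    baseline: float         # score of the heuristic or current model
    ship_threshold: float   # minimum viable score to ship

    def ready_to_ship(self, candidate_score: float) -> bool:
        # Assumes higher is better for the evaluation metric.
        return candidate_score >= self.ship_threshold and candidate_score > self.baseline


# Example values are placeholders, not measurements.
problem = ProblemDefinition(
    target="churned_within_30_days",
    business_metric="retained revenue",
    evaluation_metric="PR-AUC",
    baseline=0.41,
    ship_threshold=0.50,
)
print(problem.ready_to_ship(0.47))  # False: beats the baseline but misses the threshold
print(problem.ready_to_ship(0.53))  # True
```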

Step 2: Design the Feature Engineering Pipeline

Map raw data to model-ready features:

Feature identification: Which raw fields become features? What transformations are needed (encoding, scaling, windowing, embedding)?
Temporal features: Aggregations over time windows (last 7 days, last 30 days). Guard against leakage: compute each feature only from data that was available at prediction time, never from events that occur after it.
Feature store evaluation: Does this project warrant a feature store (Feast, Tecton, Hopsworks)? Feature stores add value when: features are shared across models, real-time features are needed, or training-serving skew is a risk.
Feature documentation: Each feature should have: name, description, data type, source, transformation logic, and expected distribution.
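The leakage guard for windowed aggregations can be sketched as a point-in-time filter. This is a minimal illustration, not a feature-store implementation: the `trailing_sum` helper, the event tuples, and the dates are all hypothetical.

```python
from datetime import datetime, timedelta


def trailing_sum(events, as_of, window_days):
    """Sum event values in [as_of - window_days, as_of).

    The upper bound is exclusive, so nothing recorded at or after the
    prediction time can leak into the feature.
    """
    start = as_of - timedelta(days=window_days)
    return sum(value for ts, value in events if start <= ts < as_of)


# Illustrative purchase events: (timestamp, amount).
events = [
    (datetime(2024, 1, 1), 10.0),
    (datetime(2024, 1, 20), 5.0),
    (datetime(2024, 1, 27), 7.0),
    (datetime(2024, 2, 2), 100.0),  # happens after the prediction time below
]
as_of = datetime(2024, 1, 30)

spend_7d = trailing_sum(events, as_of, 7)    # only the Jan 27 event
spend_30d = trailing_sum(events, as_of, 30)  # Jan 1, Jan 20 and Jan 27
print(spend_7d, spend_30d)  # 7.0 22.0
```

The same `as_of` timestamp should drive both training rows and serving requests; computing it differently in the two paths is one common source of training-serving skew.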

Step 3: Design Experiment Tracking

Set up reproducible experiment management:

Tool selection: MLflow (open-source, self-hosted), Weights & Biases (managed, rich visualization), Neptune, or ClearML.
What to track: Hyperparameters, metrics (train/val/test), dataset version, code version (git SHA), environment (dependencies), artifacts (model files, plots).
Experiment organization: Project → Experiment group → Individual runs. Name runs meaningfully (not "run_42").
Comparison workflow: How does the team compare runs? Dashboard? Automated reports?
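As a tool-agnostic sketch of the "what to track" list, the record below gathers the metadata needed to reproduce a run. It deliberately avoids any specific tracker's API; `RunRecord`, `dataset_fingerprint`, and the example values are hypothetical, and a real setup would pass the same fields to whichever tool is chosen.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def dataset_fingerprint(raw_bytes: bytes) -> str:
    """Short content hash used here as a stand-in for a dataset version."""
    return hashlib.sha256(raw_bytes).hexdigest()[:12]


@dataclass
class RunRecord:
    name: str               # meaningful run name, not "run_42"
    git_sha: str            # code version
    dataset_version: str    # data version
    hyperparameters: dict
    metrics: dict = field(default_factory=dict)

    def log_metric(self, split: str, metric: str, value: float) -> None:
        # Keys such as "val/auc" keep train/val/test metrics separate.
        self.metrics[f"{split}/{metric}"] = value

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True, indent=2)


run = RunRecord(
    name="gbdt-depth6-lr0.05",
    git_sha="<git SHA of the training code>",  # placeholder, supplied by the caller
    dataset_version=dataset_fingerprint(b"user_id,label\n1,0\n2,1\n"),
    hyperparameters={"max_depth": 6, "learning_rate": 0.05},
)
run.log_metric("val", "auc", 0.5)  # placeholder value
print(run.to_json())
```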

Step 4: Design the Training Pipeline

Build a reproducible, automated training workflow:

Data split strategy: Time-based splits for temporal data, stratified splits for imbalanced classes. Never random-split time-series data.
Training orchestration: Single script, or DAG-based (Airflow, Kubeflow Pipelines, SageMaker Pipelines)?
Hyperparameter tuning: Grid search, random search, Bayesian optimization (Optuna, Ray Tune)?
Validation strategy: Cross-validation, holdout, or time-series walk-forward?
Model registry: Where are trained models stored? How are they versioned? Who approves promotion to production?
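The time-series walk-forward option can be illustrated with an expanding-window splitter. This is a minimal sketch that assumes rows are already sorted by time; the function name and parameters are hypothetical.

```python
def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, test_indices) for rows sorted by time.

    Each fold trains on everything before its test window (expanding window),
    so the model is never evaluated on rows older than its training data.
    """
    test_size = (n_samples - min_train) // n_folds
    if test_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        train_end = min_train + fold * test_size
        test_end = train_end + test_size
        yield list(range(train_end)), list(range(train_end, test_end))


splits = list(walk_forward_splits(n_samples=10, n_folds=3, min_train=4))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
# train 0..3  test 4..5
# train 0..5  test 6..7
# train 0..7  test 8..9
```

Unlike a random split, every test index here is strictly later than every training index in the same fold.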

Step 5: Design Model Serving

Plan how predictions reach users:

Batch serving: Run predictions on a schedule, store results in a table. Best for recommendations, risk scores, daily reports.
Real-time serving: Model behind an API endpoint. Best for search ranking, fraud detection, dynamic pricing.
Streaming serving: Model embedded in a stream processor. Best for event-driven predictions on Kafka/Kinesis streams.
Edge serving: Model deployed to device/browser. Best for latency-critical or offline-capable applications.

For real-time serving, specify: latency SLA (p50/p99), throughput (requests/second), scaling strategy (auto-scale triggers), and fallback behavior (what happens if the model is unavailable?).
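The fallback and latency items can be sketched together. The wrapper below treats any exception as "model unavailable" and returns a caller-supplied default; the helper names and the stand-in models are hypothetical, and a production service would typically add timeouts as well.

```python
import time
from statistics import quantiles


def predict_with_fallback(model_predict, features, fallback_value):
    """Return (prediction, latency_ms, used_fallback).

    Any exception from the model is treated as "model unavailable" and the
    caller receives the fallback value instead of an error.
    """
    start = time.perf_counter()
    try:
        prediction, used_fallback = model_predict(features), False
    except Exception:
        prediction, used_fallback = fallback_value, True
    latency_ms = (time.perf_counter() - start) * 1000.0
    return prediction, latency_ms, used_fallback


def latency_percentiles(latencies_ms):
    """Approximate p50 and p99 from a sample of request latencies."""
    cuts = quantiles(latencies_ms, n=100)  # 99 cut points; cuts[i] is roughly p(i+1)
    return {"p50": cuts[49], "p99": cuts[98]}


def healthy_model(features):
    return 0.9  # placeholder score


def broken_model(features):
    raise ConnectionError("model server unreachable")


ok = predict_with_fallback(healthy_model, {"amount": 42}, fallback_value=0.0)
degraded = predict_with_fallback(broken_model, {"amount": 42}, fallback_value=0.0)
print(ok[0], ok[2])              # 0.9 False
print(degraded[0], degraded[2])  # 0.0 True
```

Logging the `used_fallback` flag alongside latency makes it possible to alert on fallback rate, not only on errors.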

Step 6: Design A/B Testing for Models

Plan controlled rollout of model changes:

Traffic splitting: How is traffic divided between control (current model) and treatment (new model)?
Metric selection: Primary metric (business KPI), guardrail metrics (latency, error rate), and minimum detectable effect.
Duration calculation: How long must the test run to collect the sample size needed to detect the minimum detectable effect at the chosen significance level and statistical power?
Rollback criteria: What triggers an automatic rollback?
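Traffic splitting and duration can be sketched with a deterministic hash bucket and a standard normal-approximation sample-size formula for comparing two proportions. The function names, the example rates, and the daily traffic figure are assumptions for illustration; other test designs (sequential tests, continuous metrics) need different formulas.

```python
import hashlib
from math import ceil
from statistics import NormalDist


def assign_arm(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user so they see the same model on every request."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def samples_per_arm(baseline_rate, min_detectable_effect, alpha=0.05, power=0.8):
    """Normal-approximation sample size for a two-sided test of two proportions."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def min_duration_days(n_per_arm, daily_users_per_arm):
    """Days needed to reach the required sample size at a given traffic level."""
    return ceil(n_per_arm / daily_users_per_arm)


# Illustrative inputs: 10% baseline conversion, detect an absolute +1 point lift.
n = samples_per_arm(baseline_rate=0.10, min_detectable_effect=0.01)
print(n, min_duration_days(n, daily_users_per_arm=2000))
print(assign_arm("user-123", "ranker-v2"))
```

Including the experiment name in the hash keeps bucket assignments independent across experiments, so the same users are not always in treatment.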

Step 7: Design Monitoring and Drift Detection

Plan ongoing model health monitoring:

Data drift: Monitor input feature distributions for shifts. Tool options: Evidently, WhyLabs, Great Expectations.
Model drift: Monitor prediction distribution and performance metrics over time. Alert when performance falls below a defined threshold.
Concept drift: Monitor the relationship between features and target. Trigger retraining when that relationship changes (seasonality, market shifts).
Operational monitoring: Latency, error rates, throughput, GPU utilization for serving infrastructure.

Define retraining policy: scheduled (weekly/monthly), triggered (drift detected), or continuous (online learning).
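A triggered retraining policy needs a concrete drift statistic and threshold. The sketch below uses the Population Stability Index (PSI) as one possible statistic; it is not taken from the tools listed above, and the 0.2 threshold is a commonly quoted rule of thumb rather than a universal constant. The function names and sample data are hypothetical.

```python
from math import log


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the reference is constant

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(psi_value, threshold=0.2):
    # 0.2 is a rule-of-thumb default; tune it per feature and use case.
    return psi_value > threshold


reference = [i / 100 for i in range(100)]       # training-time feature values
shifted = [0.5 + i / 200 for i in range(100)]   # production values, shifted upward

print(should_retrain(psi(reference, reference)))  # False: identical distributions
print(should_retrain(psi(reference, shifted)))    # True: mass moved to upper bins
```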

Compaction resilience: If context was lost during a long session, re-read the Inputs section to reconstruct what system is being analyzed, check the Progress Checklist for completed steps, then resume from the earliest incomplete step.

Handoff

Hand off to pipeline-design if the workflow reveals data ingestion or ETL orchestration needs.
Hand off to operator/deployment-plan if model serving surfaces deployment or infrastructure architecture concerns.

Output Format

# ML Workflow: [Project/Model Name]

## Problem Definition

| Aspect | Detail |
|--------|--------|
| Problem type | ... |
| Target variable | ... |
| Business metric | ... |
| Evaluation metric | ... |
| Baseline performance | ... |
| Success threshold | ... |

## Feature Engineering

| Feature | Source | Transformation | Type | Leakage Risk |
|---------|--------|---------------|------|-------------|
| ...     | ...    | ...           | ...  | Low/Med/High |

**Feature store:** [Yes/No — tool choice and rationale]

## Experiment Tracking

| Aspect | Choice | Rationale |
|--------|--------|-----------|
| Tool | ... | ... |
| What's tracked | ... | ... |
| Organization | ... | ... |

## Training Pipeline

[ASCII diagram showing data → features → train → evaluate → register]


| Stage | Tool/Method | Notes |
|-------|------------|-------|
| Data split | ... | ... |
| Training | ... | ... |
| Tuning | ... | ... |
| Validation | ... | ... |
| Registry | ... | ... |

## Model Serving

| Aspect | Detail |
|--------|--------|
| Serving mode | Batch / Real-time / Streaming / Edge |
| Latency SLA | ... |
| Throughput | ... |
| Scaling | ... |
| Fallback | ... |

## A/B Testing

| Aspect | Detail |
|--------|--------|
| Traffic split | ... |
| Primary metric | ... |
| Guardrail metrics | ... |
| Min duration | ... |
| Rollback criteria | ... |

## Monitoring and Drift

| Monitor | Tool | Threshold | Action |
|---------|------|-----------|--------|
| Data drift | ... | ... | ... |
| Model drift | ... | ... | ... |
| Concept drift | ... | ... | ... |
| Operational | ... | ... | ... |

**Retraining policy:** [Scheduled / Triggered / Continuous — details]

Quality Checks

[ ] Problem definition includes a clear business metric, not just an ML metric
[ ] Feature engineering documents leakage risk for every temporal feature
[ ] Experiment tracking captures enough metadata to reproduce any run
[ ] Training pipeline uses appropriate split strategy (time-based for temporal data)
[ ] Model registry has a clear promotion workflow (dev → staging → production)
[ ] Serving architecture matches latency and throughput requirements
[ ] A/B testing plan includes statistical power calculation and guardrail metrics
[ ] Drift monitoring covers data, model, and concept drift with defined thresholds
[ ] Retraining policy is documented with clear triggers and automation level
[ ] Fallback behavior is defined for model unavailability

Evolution Notes

Use when designing end-to-end ML workflows. Covers experiment tracking, feature engineering and storage, model training pipelines, serving and deployment, A/B testing, and drift monitoring. Do not use for data warehouse schema design (use schema-evaluation) or ETL pipeline architecture (use pipeline-design).

Stars: 4
Category: testing
Updated: Apr 26, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add dtsong/my-claude-setup ml-workflow

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (Clean, 95%): Container and dependency vulnerability scanner
Semgrep (Clean, 95%): Static code analysis for vulnerabilities
mcp-scan (Snyk) (Clean, 95%): Model Context Protocol security validation
Snyk (dep) (Skipped, 50%): Open source security scanning
Socket.dev (Skipped, 50%): Supply chain security analysis
VirusTotal (Skipped, 50%): Multi-engine malware detection
CrowdStrike (Skipped, 50%): Advanced threat intelligence
OSV-Scanner (Skipped, 50%): Open Source Vulnerability database check
OWASP Dep-Check (Skipped, 50%)

Last scanned: May 1, 2026, 1:04 AM (266.0s, 1 file scanned)

SKILL.md

name: ml-workflow
department: alchemist
description: Use when designing end-to-end ML workflows. Covers experiment tracking, feature engineering and storage, model training pipelines, serving and deployment, A/B testing, and drift monitoring. Do not use for data warehouse schema design (use schema-evaluation) or ETL pipeline architecture (use pipeline-design).
version: 1

Install manually

# Clone the repo
git clone https://github.com/dtsong/my-claude-setup.git

# Copy into Claude Code skills folder (global)
cp -r my-claude-setup/skills/council/alchemist/ml-workflow ~/.claude/skills/

Repository: dtsong/my-claude-setup (4 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT