Using ML Production

Overview

This meta-skill routes you to the right production deployment skill based on your concern. Load this when you need to move ML models to production but aren't sure which specific aspect to address.

Core Principle: Production concerns fall into four categories. Identify the concern first, then route to the appropriate skill. Tools and infrastructure choices are implementation details, not routing criteria.

Capability tiers, not model IDs. Sheets describe model and hardware capabilities (e.g., "frontier reasoning model", "mid-tier serving GPU", "edge accelerator") rather than naming specific SKUs. Vendor model names and exact GPU SKUs change quarterly; capability tiers are stable.

Knowledge-cutoff acknowledgement. Tool inventory is calibrated to 2026-05. Fast-moving areas — LLM serving stacks (vLLM, SGLang, TensorRT-LLM, TGI, Triton), observability platforms (Phoenix, Langfuse, OTel GenAI), quantization toolchains (torch.ao.quantization, AWQ, GPTQ), and MLOps platforms — should be re-checked against vendor docs before architecting new systems. Treat sheet content as a structured starting point, not a substitute for current documentation.

When to Use

Load this skill when:

Deploying ML models (classical or LLM) to production
Optimizing model inference (speed, size, cost)
Setting up MLOps workflows (tracking, automation, CI/CD)
Monitoring or debugging production models, including LLM-specific signals (hallucination rate, tool-call success, prompt-injection attempts)
User mentions: "production", "deploy", "serve model", "MLOps", "monitoring", "optimize inference", "vLLM", "SGLang", "TensorRT-LLM", "Phoenix", "Langfuse"

Don't use for: Training optimization (use training-optimization), model architecture selection (use neural-architectures), PyTorch infrastructure (use pytorch-engineering), prompt/RAG/agent design quality (use llm-specialist).

How to Access Reference Sheets

IMPORTANT: All reference sheets are located in the SAME DIRECTORY as this SKILL.md file.

When this skill is loaded from: skills/using-ml-production/SKILL.md

Reference sheets like quantization-for-inference.md are at: skills/using-ml-production/quantization-for-inference.md

NOT at: skills/quantization-for-inference.md ← WRONG PATH

When you see a link like [quantization-for-inference.md](quantization-for-inference.md), read the file from the same directory as this SKILL.md.
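The path rule above can be sketched in a few lines. This is a minimal illustration, assuming the loader knows the path of the SKILL.md it read; the helper name `resolve_sheet` is hypothetical and not part of any pack tooling.

```python
from pathlib import Path


def resolve_sheet(skill_file: str, link_target: str) -> Path:
    """Resolve a reference-sheet link against the directory that holds SKILL.md.

    `resolve_sheet` is a hypothetical helper used only for illustration.
    """
    return Path(skill_file).parent / link_target


sheet = resolve_sheet(
    "skills/using-ml-production/SKILL.md",
    "quantization-for-inference.md",
)
print(sheet.as_posix())
# skills/using-ml-production/quantization-for-inference.md
```

Resolving against the parent directory of SKILL.md, rather than against `skills/`, is what avoids the wrong path shown above.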

Routing by Concern

Category 1: Model Optimization

Symptoms: "Model too slow", "inference latency high", "model too large", "need to optimize for edge", "reduce model size", "speed up inference", "quantize model", "INT8/INT4", "AWQ", "GPTQ"

When to route here:

Model itself is the bottleneck (not infrastructure)
Need to reduce model size or increase inference speed
Deploying to resource-constrained hardware (edge, mobile)
Cost optimization through model efficiency
Quantization operations (the mechanics of converting weights, calibration, kernel fit)

Routes to:

quantization-for-inference.md - torch.ao.quantization, AWQ, GPTQ, INT8/INT4, post-training quantization, QAT, calibration
model-compression-techniques.md - Pruning, distillation, architecture optimization
hardware-optimization-strategies.md - GPU/CPU/edge tuning, batch sizing, hardware/quant-format fit

Key question to ask: "Is the MODEL the bottleneck, or is it infrastructure/serving?"

LLM quantization split: Choosing which quantization format suits a given LLM task (quality vs throughput tradeoffs, instruction-following degradation) is llm-specialist territory. The operations — wiring up AWQ/GPTQ, running calibration, validating kernel support — live here.

Category 2: Serving Infrastructure

Symptoms: "How to serve model", "need API endpoint", "deploy to production", "containerize model", "scale serving", "load balancing", "traffic management", "vLLM", "SGLang", "TensorRT-LLM", "TGI", "Triton", "continuous batching", "PagedAttention", "speculative decoding", "KV-cache management"

When to route here:

Need to expose model as API or service (classical model or LLM)
Questions about serving patterns (REST, gRPC, batch, streaming)
Deployment strategies (gradual rollout, A/B testing)
Scaling concerns (traffic, replicas, autoscaling)
LLM-serving stacks: vLLM, SGLang, TensorRT-LLM, TGI, Triton — covered in the LLM-serving Part of model-serving-patterns.md alongside continuous batching, PagedAttention, and speculative decoding ops

Routes to:

model-serving-patterns.md - FastAPI, TorchServe (maintenance mode — see routing mistakes), gRPC, ONNX, batching, containerization, plus LLM-serving Part: vLLM / SGLang / TensorRT-LLM / TGI / Triton, continuous batching, PagedAttention, speculative decoding
deployment-strategies.md - A/B testing, canary, shadow mode, rollback procedures
scaling-and-load-balancing.md - Horizontal scaling, autoscaling, load balancing, cost optimization

Key distinction:

Serving patterns = HOW to expose model (API, container, batching, LLM-serving stack)
Deployment strategies = HOW to roll out safely (gradual, testing, rollback)
Scaling = HOW to handle traffic (replicas, autoscaling, balancing)

Category 3: MLOps Tooling

Symptoms: "Track experiments", "version models", "automate deployment", "reproducibility", "CI/CD for ML", "feature store", "model registry", "experiment management", "version prompts", "version RAG eval set"

When to route here:

Need workflow/process improvements
Want to track experiments, version models, version prompts, or version RAG eval sets
Need to automate training-to-deployment pipeline
Team collaboration and reproducibility concerns

Routes to:

experiment-tracking-and-versioning.md - MLflow, Weights & Biases, Comet, Hugging Face Hub, model registries, prompt versioning, RAG/eval-set versioning, lineage
mlops-pipeline-automation.md - CI/CD for ML, feature stores, data validation, automated retraining, orchestration

Key distinction:

Experiment tracking = Research/development phase (track runs, version models, version prompts/eval sets)
Pipeline automation = Production phase (automate workflows, CI/CD)

Multi-concern: Queries like "track experiments AND automate deployment" → route to BOTH skills

Category 4: Observability

Symptoms: "Monitor production", "model degrading", "detect drift", "production debugging", "alert on failures", "model not working in prod", "performance issues in production", "LLM in production", "hallucination rate", "tool-call success", "cost-per-request spiking", "prompt-injection attempts", "drift in RAG retrieval quality"

When to route here:

Model already deployed, need to monitor or debug
Detecting production issues (drift, errors, degradation, accuracy regressions)
Setting up alerts and dashboards
Root cause analysis for production failures
LLM-specific telemetry: trace generations, score outputs, track cost/token spend, capture tool-call traces, surface prompt-injection signals

Routes to:

production-monitoring-and-alerting.md - Metrics, drift detection, dashboards, alerts, SLAs, LLM observability platforms: Phoenix, Langfuse, OpenTelemetry GenAI semantic conventions for trace/eval/cost monitoring
production-debugging-techniques.md - Error analysis, profiling, rollback procedures, post-mortems, tool-call failure forensics

Key distinction:

Monitoring = Proactive (set up metrics, alerts, detect issues early)
Debugging = Reactive (diagnose and fix existing issues)

LLM observability split: The plumbing — Phoenix/Langfuse/OTel-GenAI deployment, retention, dashboards, alerts on cost/latency/error budgets — lives here. The eval methodology feeding those dashboards (which metrics to compute, how to score hallucinations, how to build judges) lives in llm-specialist (llm-evaluation-metrics.md). Wire them together: methodology defines the signal; this pack ships it.

"Performance" ambiguity:

If "performance" = speed/latency → might be Category 1 (optimization) or Category 2 (serving/scaling)
If "performance" = accuracy degradation → Category 4 (observability - drift detection)
If "performance" = LLM output quality → Category 4 (observability — wire Phoenix/Langfuse) and llm-specialist (eval methodology)
Ask clarifying question: "By performance, do you mean inference speed, model accuracy, or LLM output quality?"
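As a rough sketch, the "performance" disambiguation can be written as a lookup that asks the clarifying question whenever the meaning is unknown. The keys `speed`, `accuracy`, and `llm_quality` and the function name `route_performance` are assumptions made for this example.

```python
# Labels for the three meanings are illustrative; the routes mirror the list above.
PERFORMANCE_ROUTES = {
    "speed": ["Category 1: Model Optimization", "Category 2: Serving Infrastructure"],
    "accuracy": ["Category 4: Observability"],
    "llm_quality": ["Category 4: Observability", "llm-specialist"],
}

CLARIFYING_QUESTION = (
    "By performance, do you mean inference speed, model accuracy, "
    "or LLM output quality?"
)


def route_performance(meaning=None):
    """Return candidate routes, or the clarifying question if the meaning is unclear."""
    if meaning not in PERFORMANCE_ROUTES:
        return {"ask": CLARIFYING_QUESTION, "routes": []}
    return {"ask": None, "routes": PERFORMANCE_ROUTES[meaning]}


print(route_performance()["ask"])
print(route_performance("llm_quality")["routes"])
```

For "speed" the sketch returns both candidate categories, because the choice between them still depends on whether the model or the infrastructure is the bottleneck.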

Routing Decision Tree

User query → Identify primary concern

Is model THE problem (size/speed)?
  YES → Category 1: Model Optimization
  NO → Continue

Is it about HOW to expose/deploy model (incl. LLM-serving stack)?
  YES → Category 2: Serving Infrastructure
  NO → Continue

Is it about workflow/process/automation?
  YES → Category 3: MLOps Tooling
  NO → Continue

Is it about monitoring/debugging in production (incl. LLM telemetry)?
  YES → Category 4: Observability
  NO → Continue

Is the question about LLM generation quality, prompt design, RAG retrieval design, or agent behavior design?
  YES → Hand off to llm-specialist (this pack does ops, not generation quality)
  NO → Ask clarifying question

Ambiguous? → Ask ONE question to clarify concern category
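One reading of the tree above, sketched as a first-match function. The `Answers` dataclass and its field names are hypothetical; in practice the yes/no answers come from reading the query or from a clarifying question, not from a structured input.

```python
from dataclasses import dataclass


@dataclass
class Answers:
    """Hypothetical yes/no answers gathered while reading the user query."""
    model_is_problem: bool = False
    about_exposing_or_deploying: bool = False
    about_workflow_or_automation: bool = False
    about_monitoring_or_debugging: bool = False
    about_generation_quality: bool = False


def route(answers: Answers) -> str:
    """Walk the checks in the order the tree lists them; the first YES wins."""
    if answers.model_is_problem:
        return "Category 1: Model Optimization"
    if answers.about_exposing_or_deploying:
        return "Category 2: Serving Infrastructure"
    if answers.about_workflow_or_automation:
        return "Category 3: MLOps Tooling"
    if answers.about_monitoring_or_debugging:
        return "Category 4: Observability"
    if answers.about_generation_quality:
        return "Hand off to llm-specialist"
    # Nothing matched: the query is ambiguous.
    return "Ask ONE clarifying question"


print(route(Answers(about_workflow_or_automation=True)))
# Category 3: MLOps Tooling
```

The sketch returns only the primary concern. Multi-concern queries need every matching category, not just the first one.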

Clarification Questions for Ambiguous Queries

Query: "My model is too slow"

Ask: "Is this inference latency (how fast predictions are), or training time?"

Training → Route to training-optimization (wrong pack)
Inference → Follow-up: "Have you profiled to find bottlenecks?"
- Model is bottleneck → Category 1 (optimization)
- Infrastructure/batching issue → Category 2 (serving)
- LLM-specific (KV cache, batching strategy) → Category 2 LLM-serving Part + llm-specialist llm-inference-optimization.md

Query: "I need to deploy my model"

Ask: "What's your deployment target — cloud server, edge device, batch processing, or LLM endpoint?"

Cloud/server → Category 2 (serving-patterns, then maybe deployment-strategies if gradual rollout needed)
Edge/mobile → Category 1 (optimization first for size/speed) + Category 2 (serving)
Batch → Category 2 (serving-patterns - batch processing)
LLM endpoint → Category 2 LLM-serving Part of model-serving-patterns.md + llm-specialist (llm-inference-optimization.md for capability/strategy choice)

Query: "My model isn't performing well in production"

Ask: "By performance, do you mean inference speed, prediction accuracy, or LLM output quality (hallucination, tool-call failure)?"

Speed → Category 1 (optimization) or Category 2 (serving/scaling)
Accuracy → Category 4 (observability — drift detection, monitoring)
LLM output quality → Category 4 (Phoenix/Langfuse plumbing) + llm-specialist (llm-evaluation-metrics.md for methodology, prompt-engineering-patterns.md / rag-architecture-patterns.md for fixes)

Query: "Set up MLOps for my team"

Ask: "What's the current pain point — experiment tracking, automated deployment, prompt/eval-set versioning, or some combination?"

Tracking/versioning (incl. prompt/eval-set) → Category 3 (experiment-tracking-and-versioning)
Automation/CI/CD → Category 3 (mlops-pipeline-automation)
Multiple → Route to multiple skills

Multi-Concern Scenarios

Some queries span multiple categories. Route to ALL relevant skills in logical order:

| Scenario | Route Order | Why |
|----------|-------------|-----|
| "Optimize and deploy model" | 1. Optimization → 2. Serving | Optimize BEFORE deploying |
| "Deploy and monitor model" | 1. Serving → 2. Observability | Deploy BEFORE monitoring |
| "Track experiments and automate deployment" | 1. Experiment tracking → 2. Pipeline automation | Track BEFORE automating |
| "Quantize model and serve with vLLM" | 1. Quantization (ops) → 2. LLM-serving Part of serving-patterns | Optimize BEFORE serving |
| "Deploy with A/B testing and monitor" | 1. Deployment strategies → 2. Monitoring | Deploy strategy BEFORE monitoring |
| "Deploy LLM to production" | 1. llm-specialist (model/strategy choice) → 2. ml-production serving + 3. ml-production observability | Generation-quality choice frames the ops |
| "RAG production system" | 1. llm-specialist (rag-architecture-patterns.md) → 2. ml-production (vector-store deployment, retrieval-quality observability, latency monitoring) | Architecture decisions frame the ops |
| "Agent in production" | 1. llm-specialist (agentic-patterns-and-mcp.md) → 2. ml-production (tool-call observability, error rates, cost monitoring) → 3. axiom-engineering-foundations (sandboxing, system design) | Agent design frames ops + safety |

Principle: Route in execution order (what needs to happen first).
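A minimal sketch of execution-order routing, assuming each concern has already been identified. The short concern labels and the single global ordering are assumptions: the scenarios above only fix the order within each pair (for example, optimization before serving, serving before observability), so the relative position of unrelated concerns here is arbitrary.

```python
# Assumed global ordering, consistent with the pairwise orders in the scenarios above.
EXECUTION_ORDER = [
    "llm-specialist",
    "optimization",
    "serving",
    "deployment-strategies",
    "experiment-tracking",
    "pipeline-automation",
    "observability",
]


def order_routes(concerns):
    """Deduplicate concerns and sort them by execution order.

    Raises KeyError for a concern label that is not in EXECUTION_ORDER.
    """
    rank = {name: position for position, name in enumerate(EXECUTION_ORDER)}
    return sorted(set(concerns), key=rank.__getitem__)


print(order_routes(["observability", "serving", "optimization"]))
# ['optimization', 'serving', 'observability']
```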

Relationship with Other Packs

With llm-specialist

The two packs split along the ops vs. generation-quality boundary.

ml-production owns (this pack):

General serving stacks: vLLM, SGLang, TensorRT-LLM, TGI, Triton — deployment, batching ops, KV-cache config, scaling
Quantization for inference as ops: torch.ao.quantization, AWQ, GPTQ wiring, calibration, hardware fit
MLOps: CI/CD, registries, feature stores, retraining
Deployment patterns: containers, canary, A/B, rollback
Monitoring & observability platforms: Phoenix, Langfuse, OpenTelemetry GenAI semantic conventions, dashboards, alerts, drift detection

llm-specialist owns (sister pack — 10 reference sheets):

Prompt engineering (prompt-engineering-patterns.md)
Reasoning-model use (reasoning-models.md)
Agentic patterns + MCP (agentic-patterns-and-mcp.md)
RAG architecture (rag-architecture-patterns.md)
Fine-tuning strategy choice (llm-finetuning-strategies.md)
Context engineering & prompt caching (context-engineering-and-prompt-caching.md, context-window-management.md)
Evaluation methodology (llm-evaluation-metrics.md)
Inference-strategy choice (llm-inference-optimization.md)
Safety / alignment (llm-safety-alignment.md)

Concrete bidirectional triggers:

| Query | llm-specialist contributes | ml-production contributes |
|-------|---------------------------|---------------------------|
| "Deploy LLM to production" | Inference-strategy + capability-tier choice | Serving stack, monitoring, scaling |
| "Quantize LLM" | Which format suits the task (quality tradeoffs) | torch.ao.quantization, AWQ/GPTQ ops, hardware fit |
| "LLM observability" | Eval methodology to feed into dashboards | Phoenix / Langfuse / OTel-GenAI plumbing |
| "RAG production system" | RAG architecture + retrieval design | Vector-store deployment, retrieval-quality observability, latency monitoring |
| "Agent in production" | Agentic patterns, MCP, anti-patterns | Tool-call observability, error rates, cost monitoring (also: axiom-engineering-foundations for sandboxing/system design) |

Rule of thumb: If the question is "what should the model/prompt/agent do and how should it be designed?" → llm-specialist. If the question is "how do we run, observe, and operate it at scale?" → ml-production. Most LLM production questions need both.

With training-optimization

Clear boundary:

training-optimization = Training phase (convergence, hyperparameters, training speed)
ml-production = Inference phase (deployment, serving, monitoring)

"Too slow" disambiguation:

Training slow → training-optimization
Inference slow → ml-production

Bidirectional: training-optimization should send users here when they're done training; ml-production should send users back when they need to retrain (drift response, dataset updates).

With pytorch-engineering

pytorch-engineering covers: Foundation (distributed training, profiling, memory management)

ml-production covers: Production-specific (serving APIs, deployment patterns, MLOps, production observability)

When to use both:

"Profile production inference" → pytorch-engineering (profiling techniques) + ml-production (production context, observability wiring)
"Optimize serving performance" → ml-production (serving patterns) + pytorch-engineering (low-level profiling, CUDA/memory)

Bidirectional: pytorch-engineering profiling skills point here for production-context observability; this pack's debugging sheet points back to pytorch-engineering for low-level CUDA/memory work.

With ordis-security-architect

For AI/LLM threat modeling — prompt injection, data exfiltration via tool-calls, supply-chain risk on model artifacts, PII leakage in logs, model-registry access control — route to ordis-security-architect (threat modeling, controls design, security review). This pack handles the operational signals (prompt-injection attempt rate as an observability metric) but the threat model and control design live in security-architect.

Common Routing Mistakes

| Query | Wrong Route | Correct Route | Why |
|-------|-------------|---------------|-----|
| "Model too slow in production" | Immediately to quantization | Ask: inference or training? Then model vs infrastructure? | Could be serving/batching issue, not model |
| "Deploy with Kubernetes" | Defer to Kubernetes docs | Category 2: serving-patterns or deployment-strategies | Kubernetes is tool choice, not routing concern |
| "Set up MLOps" | Route to one skill | Ask about specific pain point: could be tracking AND automation AND prompt/eval-set versioning | MLOps spans multiple skills |
| "Performance issues" | Assume accuracy | Ask: speed, accuracy, or LLM output quality? | Performance is ambiguous |
| "Deploy LLM" | Skip llm-specialist, route only here | Route both: llm-specialist (strategy/capability choice), ml-production (serving stack, observability, scaling) | Generation-quality choices frame the ops |
| "Just use TorchServe" | Route uncritically to TorchServe content | Flag: TorchServe is in maintenance mode (no longer under active development). Route to LLM-serving Part of model-serving-patterns.md (vLLM/SGLang/TensorRT-LLM/TGI/Triton) for LLMs; for classical models route to serving-patterns but recommend evaluating Triton or framework-native serving | Don't recommend a maintenance-mode tool for new systems |
| "We use TorchServe" (existing system) | Skip routing | Route to serving-patterns, note migration path | Existing systems still need ops guidance, but flag the maintenance-mode status |
| "Hallucinations in production" | Route only to llm-specialist | Route both: ml-production (Phoenix/Langfuse to detect+alert) + llm-specialist (eval methodology + fix via prompt/RAG/fine-tune) | Detection is ops; remediation is generation quality |
| "Quantize an LLM" | Route only to quantization-for-inference | Route both: llm-specialist (which format suits the task) + ml-production (AWQ/GPTQ ops, hardware fit) | Format choice and ops are different concerns |

Common Rationalizations (Don't Do These)

| Excuse | Reality |
|--------|---------|
| "User mentioned Kubernetes, route to deployment" | Tools are implementation details. Route by concern first. |
| "Slow = optimization, route to quantization" | Slow could be infrastructure. Clarify model vs serving bottleneck. |
| "They said deploy, must be serving-patterns" | Could need serving + deployment-strategies + monitoring. Don't assume single concern. |
| "MLOps = experiment tracking" | MLOps spans tracking AND automation AND prompt/eval-set versioning. Ask which pain point. |
| "Performance obviously means speed" | Could mean accuracy or LLM output quality. Clarify. |
| "They're technical, skip clarification" | Technical users still benefit from clarifying questions. |
| "LLM question, route only to llm-specialist" | Most LLM production questions need both packs. Default to dual-route. |
| "TorchServe still works, just route there" | TorchServe is in maintenance mode. Flag and offer alternatives. |

Red Flags Checklist

If you catch yourself thinking ANY of these, STOP and clarify:

"I'll guess optimization vs serving" → ASK which is the bottleneck
"Performance probably means speed" → ASK speed, accuracy, or LLM output quality
"Deploy = serving-patterns only" → Consider deployment-strategies and monitoring too
"They mentioned [tool], route based on tool" → Route by CONCERN, not tool
"MLOps = one skill" → Could span experiment tracking, automation, and prompt/eval-set versioning
"Skip question to save time" → Clarifying prevents wrong routing
"LLM in production = llm-specialist alone" → Default to dual-routing

When in doubt: Ask ONE clarifying question. 10 seconds of clarification prevents minutes of wrong-skill loading.
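The "route by concern, not tool" rule can be sketched as a step that records tool mentions without letting them pick the route. The tool sets and the function name `note_tools` are assumptions made for this example; only the maintenance-mode status of TorchServe comes from this document.

```python
import re

# Illustrative tool vocabulary; extend as needed.
KNOWN_TOOLS = {"kubernetes", "torchserve", "vllm", "sglang", "tensorrt-llm", "tgi", "triton"}
MAINTENANCE_MODE_TOOLS = {"torchserve"}


def note_tools(query: str) -> dict:
    """Record mentioned tools and maintenance-mode flags; never choose a route here."""
    words = set(re.findall(r"[a-z0-9-]+", query.lower()))
    mentioned = sorted(words & KNOWN_TOOLS)
    flagged = [tool for tool in mentioned if tool in MAINTENANCE_MODE_TOOLS]
    return {
        "tools_mentioned": mentioned,
        "flag_maintenance_mode": flagged,
        "route_by": "concern",  # the tool list is context only
    }


print(note_tools("Just use TorchServe on Kubernetes"))
```

The output deliberately carries no category. The concern still has to be identified, by a clarifying question if needed, before any sheet is loaded.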

Routing Summary Table

| User Concern | Ask Clarifying | Route To | Also Consider |
|--------------|----------------|----------|---------------|
| Model slow/large | Inference or training? | Optimization skills | If inference, check serving too |
| Deploy classical model | Target (cloud/edge/batch)? | Serving patterns | Deployment strategies for gradual rollout |
| Deploy LLM | Capability tier and target? | LLM-serving Part of serving-patterns + llm-specialist | Observability + cost monitoring |
| Production monitoring | Proactive or reactive? | Monitoring OR debugging | Both if setting up + fixing issues |
| LLM observability | Plumbing or methodology? | Monitoring (plumbing) + llm-specialist (methodology) | Cost + drift alerts |
| MLOps setup | Tracking, automation, or prompt/eval versioning? | Experiment tracking AND/OR automation | Often multiple needed |
| Performance issues | Speed, accuracy, or LLM quality? | Optimization OR observability OR llm-specialist | Depends on clarification |
| Scale serving | Traffic pattern? | Scaling-and-load-balancing | Serving patterns if not set up yet |

Integration Examples

Example 1: Full Production Pipeline

Query: "I trained a model, now I need to put it in production"

Routing:

Ask: "What's your deployment target and are there performance concerns? Is it an LLM or a classical model?"
If "cloud deployment, classical model, fast enough":
- model-serving-patterns.md (expose as API)
- deployment-strategies.md (if gradual rollout needed)
- production-monitoring-and-alerting.md (set up observability)
If "edge device, model too large":
- quantization-for-inference.md (reduce size first)
- model-serving-patterns.md (edge deployment pattern)
- production-monitoring-and-alerting.md (if possible on edge)
If "LLM endpoint":
- llm-specialist llm-inference-optimization.md (capability tier + strategy choice)
- model-serving-patterns.md (LLM-serving Part: vLLM/SGLang/TensorRT-LLM/TGI/Triton)
- production-monitoring-and-alerting.md (Phoenix/Langfuse + cost monitoring)
- llm-specialist llm-evaluation-metrics.md (methodology to feed observability)

Example 2: Optimization Decision

Query: "My inference is slow"

Routing:

Ask: "Have you profiled to find the bottleneck — is it the model, serving infrastructure, or KV-cache/batching for an LLM?"
If "not profiled yet":
- production-debugging-techniques.md (profile first to diagnose)
- Then route based on findings
If "model is bottleneck":
- hardware-optimization-strategies.md (check if hardware tuning helps)
- If not enough → quantization-for-inference.md or model-compression-techniques.md
If "infrastructure/batching is bottleneck":
- model-serving-patterns.md (batching strategies, LLM-serving Part for continuous batching/PagedAttention/speculative decoding)
- scaling-and-load-balancing.md (if traffic-related)

Example 3: MLOps Maturity

Query: "We need better ML workflows"

Routing:

Ask: "What's the current pain point — can't reproduce experiments, manual deployment, prompt/eval-set drift, or some combination?"
If "can't reproduce, need to track experiments":
- experiment-tracking-and-versioning.md
If "manual deployment is slow":
- mlops-pipeline-automation.md
If "prompts and eval sets keep drifting":
- experiment-tracking-and-versioning.md (prompt + eval-set versioning patterns)
- llm-specialist llm-evaluation-metrics.md (what to put in those eval sets)
If multiple:
- experiment-tracking-and-versioning.md (establish tracking first)
- mlops-pipeline-automation.md (then automate workflow)

Example 4: LLM in Production with Quality Issues

Query: "Our LLM keeps hallucinating in production and costs are spiking"

Routing:

production-monitoring-and-alerting.md — wire Phoenix/Langfuse, capture traces, set cost-budget alerts, surface hallucination/eval-score drift
llm-specialist llm-evaluation-metrics.md — define hallucination scoring methodology to feed the observability stack
llm-specialist prompt-engineering-patterns.md and/or rag-architecture-patterns.md — remediate the underlying generation-quality issue
production-debugging-techniques.md — root-cause cost spikes (token bloat? retry storms? bad routing?)

When NOT to Use ml-production Skills

Skip ml-production when:

Still designing/training model → Use neural-architectures, training-optimization
PyTorch infrastructure issues → Use pytorch-engineering
LLM generation quality only (prompts, RAG retrieval design, fine-tuning strategy choice, agent design, eval methodology, safety/alignment) → Use llm-specialist
Classical ML deployment → ml-production still applies; consider gradient boosting / sklearn serving paths inside serving-patterns
AI/LLM threat modeling → Use ordis-security-architect

Red flag: If model isn't trained yet, probably don't need ml-production. Finish training first.

Success Criteria

You've routed correctly when:

Identified concern category (optimization, serving, MLOps, observability)
Asked clarifying question for ambiguous queries
Routed to appropriate skill(s) in logical order
Didn't let tool choices (Kubernetes, TorchServe, vLLM) dictate routing
Recognized multi-concern scenarios and routed to multiple skills
For LLM questions, defaulted to dual-routing across ml-production and llm-specialist unless the question is purely ops or purely generation-quality
Flagged maintenance-mode tools (e.g., TorchServe) instead of recommending them uncritically
Pointed at ordis-security-architect when threat modeling, not just observability, was at issue

ML Production Specialist Skills Catalog

After routing, load the appropriate specialist skill for detailed guidance. This pack contains exactly 10 reference sheets (no new sheets added in this refresh):

quantization-for-inference.md - torch.ao.quantization, AWQ, GPTQ, INT8/INT4 ops, post-training quantization, QAT, calibration, kernel/hardware fit (LLM format choice lives in llm-specialist)
model-compression-techniques.md - Pruning (structured/unstructured), knowledge distillation, architecture optimization, model size reduction
hardware-optimization-strategies.md - GPU/CPU/edge tuning, batch sizing, memory optimization, hardware-specific acceleration (TensorRT, ONNX Runtime), capability-tier hardware selection
model-serving-patterns.md - FastAPI, gRPC, ONNX, batching, containerization (Docker), REST/gRPC APIs; LLM-serving Part: vLLM, SGLang, TensorRT-LLM, TGI, Triton, continuous batching, PagedAttention, speculative decoding, KV-cache management; TorchServe covered as legacy/maintenance-mode
deployment-strategies.md - A/B testing, canary deployment, shadow mode, gradual rollout, rollback procedures, blue-green deployment
scaling-and-load-balancing.md - Horizontal scaling, autoscaling, load balancing, traffic management, cost optimization, replica management
experiment-tracking-and-versioning.md - MLflow, Weights & Biases, Comet, Hugging Face Hub, model registries, prompt versioning, RAG / eval-set versioning, lineage, reproducibility
mlops-pipeline-automation.md - CI/CD for ML, feature stores, data validation, automated retraining, orchestration (Airflow, Kubeflow, Prefect, Dagster)
production-monitoring-and-alerting.md - Metrics tracking, drift detection, dashboards, alerting, SLAs; LLM observability: Phoenix, Langfuse, OpenTelemetry GenAI semantic conventions; tool-call telemetry, prompt-injection signal, RAG-retrieval-quality drift, cost/token monitoring
production-debugging-techniques.md - Error analysis, production profiling, rollback procedures, post-mortems, root cause analysis, tool-call failure forensics

References

Primary router: plugins/yzmir-ai-engineering-expert/skills/using-ai-engineering/SKILL.md
Sister packs (bidirectional):
- yzmir-llm-specialist/using-llm-specialist — generation quality, prompt/RAG/agent/eval/safety design (10 sheets including reasoning-models.md, agentic-patterns-and-mcp.md, context-engineering-and-prompt-caching.md)
- yzmir-training-optimization/using-training-optimization — training-phase convergence and speed
- yzmir-pytorch-engineering/using-pytorch-engineering — distributed training, low-level profiling, CUDA/memory
- ordis-security-architect/using-security-architect — AI/LLM threat modeling, controls, security review

  NO → Continue

Is it about workflow/process/automation?
  YES → Category 3: MLOps Tooling
  NO → Continue

Is it about monitoring/debugging in production (incl. LLM telemetry)?
  YES → Category 4: Observability
  NO → Continue

Is the question about LLM generation quality, prompt design, RAG retrieval design, or agent behavior design?
  YES → Hand off to llm-specialist (this pack does ops, not generation quality)
  NO → Ask clarifying question

Ambiguous? → Ask ONE question to clarify concern category
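
The tree above can be sketched as a first-YES-wins function. This is a minimal illustration, not part of the pack: the `Route` enum, `route_concern`, and its boolean flags are hypothetical names, and deciding the flags from a user query is left to the caller.

```python
from enum import Enum


class Route(Enum):
    MODEL_OPTIMIZATION = "Category 1: Model Optimization"
    SERVING_INFRASTRUCTURE = "Category 2: Serving Infrastructure"
    MLOPS_TOOLING = "Category 3: MLOps Tooling"
    OBSERVABILITY = "Category 4: Observability"
    LLM_SPECIALIST = "Hand off to llm-specialist"
    CLARIFY = "Ask ONE clarifying question"


def route_concern(
    model_is_problem: bool = False,
    about_exposing_or_deploying: bool = False,
    about_workflow_or_automation: bool = False,
    about_monitoring_or_debugging: bool = False,
    about_generation_quality: bool = False,
) -> Route:
    """Check the concerns in the same order as the tree; the first YES wins."""
    if model_is_problem:
        return Route.MODEL_OPTIMIZATION
    if about_exposing_or_deploying:
        return Route.SERVING_INFRASTRUCTURE
    if about_workflow_or_automation:
        return Route.MLOPS_TOOLING
    if about_monitoring_or_debugging:
        return Route.OBSERVABILITY
    if about_generation_quality:
        return Route.LLM_SPECIALIST
    # No check matched: the query is ambiguous, so ask rather than guess.
    return Route.CLARIFY
```

A single return value keeps the sketch short; a query that spans several concerns would need a list of routes instead.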

Clarification Questions for Ambiguous Queries

Query: "My model is too slow"

Ask: "Is this inference latency (how long each prediction takes) or training time?"

Training → Route to training-optimization (wrong pack)
Inference → Follow-up: "Have you profiled to find bottlenecks?"
- Model is bottleneck → Category 1 (optimization)
- Infrastructure/batching issue → Category 2 (serving)
- LLM-specific (KV cache, batching strategy) → Category 2 LLM-serving Part + llm-specialist llm-inference-optimization.md

Query: "I need to deploy my model"

Ask: "What's your deployment target — cloud server, edge device, batch processing, or LLM endpoint?"

Cloud/server → Category 2 (model-serving-patterns.md, then deployment-strategies.md if a gradual rollout is needed)
Edge/mobile → Category 1 (optimization first for size/speed) + Category 2 (serving)
Batch → Category 2 (model-serving-patterns.md, batch processing)
LLM endpoint → Category 2 LLM-serving Part of model-serving-patterns.md + llm-specialist (llm-inference-optimization.md for capability/strategy choice)

Query: "My model isn't performing well in production"

Ask: "By performance, do you mean inference speed, prediction accuracy, or LLM output quality (hallucination, tool-call failure)?"

Speed → Category 1 (optimization) or Category 2 (serving/scaling)
Accuracy → Category 4 (observability — drift detection, monitoring)
LLM output quality → Category 4 (Phoenix/Langfuse plumbing) + llm-specialist (llm-evaluation-metrics.md for methodology, prompt-engineering-patterns.md / rag-architecture-patterns.md for fixes)

Query: "Set up MLOps for my team"

Ask: "What's the current pain point — experiment tracking, automated deployment, prompt/eval-set versioning, or some combination?"

Tracking/versioning (incl. prompt/eval-set) → Category 3 (experiment-tracking-and-versioning.md)
Automation/CI/CD → Category 3 (mlops-pipeline-automation.md)
Multiple → Route to both sheets (tracking first, then automation)

Multi-Concern Scenarios

Some queries span multiple categories. Route to ALL relevant skills in logical order:

Principle: Route in execution order (what needs to happen first).
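
One way to make "execution order" concrete is to give each sheet a stage number and sort the selected sheets by it. The sketch below is illustrative only: `ASSUMED_STAGE` and `in_execution_order` are hypothetical names, and the stage numbers are an assumption loosely based on the orderings in the Integration Examples (optimize, then serve, then monitor; track, then automate), not something the pack defines.

```python
from typing import Dict, List

# ASSUMPTION: stage numbers are illustrative, not defined by the pack.
ASSUMED_STAGE: Dict[str, int] = {
    "experiment-tracking-and-versioning.md": 0,
    "quantization-for-inference.md": 1,
    "model-compression-techniques.md": 1,
    "hardware-optimization-strategies.md": 1,
    "model-serving-patterns.md": 2,
    "deployment-strategies.md": 3,
    "scaling-and-load-balancing.md": 3,
    "mlops-pipeline-automation.md": 4,
    "production-monitoring-and-alerting.md": 5,
    "production-debugging-techniques.md": 6,
}


def in_execution_order(sheets: List[str]) -> List[str]:
    """Sort the selected sheets by assumed stage; ties keep their input order."""
    unknown = [s for s in sheets if s not in ASSUMED_STAGE]
    if unknown:
        raise ValueError(f"not in this pack: {unknown}")
    # sorted() is stable, so sheets in the same stage are not reshuffled.
    return sorted(sheets, key=lambda s: ASSUMED_STAGE[s])
```

For an edge deployment that needs a smaller model, passing the monitoring, serving, and quantization sheets in any order returns quantization first, serving second, and monitoring last.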

Relationship with Other Packs

With llm-specialist

The two packs split along the ops vs. generation-quality boundary.

ml-production owns (this pack):

General serving stacks: vLLM, SGLang, TensorRT-LLM, TGI, Triton — deployment, batching ops, KV-cache config, scaling
Quantization for inference as ops: torch.ao.quantization, AWQ, GPTQ wiring, calibration, hardware fit
MLOps: CI/CD, registries, feature stores, retraining
Deployment patterns: containers, canary, A/B, rollback
Monitoring & observability platforms: Phoenix, Langfuse, OpenTelemetry GenAI semantic conventions, dashboards, alerts, drift detection

llm-specialist owns (sister pack — 10 reference sheets):

Prompt engineering (prompt-engineering-patterns.md)
Reasoning-model use (reasoning-models.md)
Agentic patterns + MCP (agentic-patterns-and-mcp.md)
RAG architecture (rag-architecture-patterns.md)
Fine-tuning strategy choice (llm-finetuning-strategies.md)
Context engineering & prompt caching (context-engineering-and-prompt-caching.md, context-window-management.md)
Evaluation methodology (llm-evaluation-metrics.md)
Inference-strategy choice (llm-inference-optimization.md)
Safety / alignment (llm-safety-alignment.md)

Concrete bidirectional triggers:

With training-optimization

Clear boundary:

training-optimization = Training phase (convergence, hyperparameters, training speed)
ml-production = Inference phase (deployment, serving, monitoring)

"Too slow" disambiguation:

Training slow → training-optimization
Inference slow → ml-production

Bidirectional: training-optimization should send users here when they're done training; ml-production should send users back when they need to retrain (drift response, dataset updates).

With pytorch-engineering

pytorch-engineering covers: Foundation (distributed training, profiling, memory management)

ml-production covers: Production-specific (serving APIs, deployment patterns, MLOps, production observability)

When to use both:

"Profile production inference" → pytorch-engineering (profiling techniques) + ml-production (production context, observability wiring)
"Optimize serving performance" → ml-production (serving patterns) + pytorch-engineering (low-level profiling, CUDA/memory)

Bidirectional: pytorch-engineering profiling skills point here for production-context observability; this pack's debugging sheet points back to pytorch-engineering for low-level CUDA/memory work.

With ordis-security-architect

Common Routing Mistakes

Common Rationalizations (Don't Do These)

Red Flags Checklist

If you catch yourself thinking ANY of these, STOP and clarify:

"I'll guess optimization vs serving" → ASK which is the bottleneck
"Performance probably means speed" → ASK speed, accuracy, or LLM output quality
"Deploy = serving-patterns only" → Consider deployment-strategies and monitoring too
"They mentioned [tool], route based on tool" → Route by CONCERN, not tool
"MLOps = one skill" → Could span experiment tracking, automation, and prompt/eval-set versioning
"Skip question to save time" → Clarifying prevents wrong routing
"LLM in production = llm-specialist alone" → Default to dual-routing

When in doubt: Ask ONE clarifying question. 10 seconds of clarification prevents minutes of wrong-skill loading.

Routing Summary Table

Integration Examples

Example 1: Full Production Pipeline

Query: "I trained a model, now I need to put it in production"

Routing:

Ask: "What's your deployment target and are there performance concerns? Is it an LLM or a classical model?"
If "cloud deployment, classical model, fast enough":
- model-serving-patterns.md (expose as API)
- deployment-strategies.md (if gradual rollout needed)
- production-monitoring-and-alerting.md (set up observability)
If "edge device, model too large":
- quantization-for-inference.md (reduce size first)
- model-serving-patterns.md (edge deployment pattern)
- production-monitoring-and-alerting.md (if possible on edge)
If "LLM endpoint":
- llm-specialist llm-inference-optimization.md (capability tier + strategy choice)
- model-serving-patterns.md (LLM-serving Part: vLLM/SGLang/TensorRT-LLM/TGI/Triton)
- production-monitoring-and-alerting.md (Phoenix/Langfuse + cost monitoring)
- llm-specialist llm-evaluation-metrics.md (methodology to feed observability)

Example 2: Optimization Decision

Query: "My inference is slow"

Routing:

Ask: "Have you profiled to find the bottleneck — is it the model, serving infrastructure, or KV-cache/batching for an LLM?"
If "not profiled yet":
- production-debugging-techniques.md (profile first to diagnose)
- Then route based on findings
If "model is bottleneck":
- hardware-optimization-strategies.md (check if hardware tuning helps)
- If not enough → quantization-for-inference.md or model-compression-techniques.md
If "infrastructure/batching is bottleneck":
- model-serving-patterns.md (batching strategies, LLM-serving Part for continuous batching/PagedAttention/speculative decoding)
- scaling-and-load-balancing.md (if traffic-related)
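
The branches of this example can be written down as a small lookup. This is a sketch under stated assumptions: `route_slow_inference` and its `bottleneck` / `traffic_related` parameters are hypothetical names, and the `"model"` and `"infrastructure"` labels stand in for whatever the profiling step reports.

```python
from typing import List, Optional


def route_slow_inference(bottleneck: Optional[str] = None,
                         traffic_related: bool = False) -> List[str]:
    """Return the sheets to load, in order, for a "My inference is slow" query.

    bottleneck: None (not profiled yet), "model", or "infrastructure".
    """
    if bottleneck is None:
        # Not profiled yet: diagnose first, then call again with the finding.
        return ["production-debugging-techniques.md"]
    if bottleneck == "model":
        # Hardware tuning first; the other two are follow-ups if that is not enough.
        return [
            "hardware-optimization-strategies.md",
            "quantization-for-inference.md",
            "model-compression-techniques.md",
        ]
    if bottleneck == "infrastructure":
        sheets = ["model-serving-patterns.md"]
        if traffic_related:
            sheets.append("scaling-and-load-balancing.md")
        return sheets
    raise ValueError(f"unknown bottleneck: {bottleneck!r}")
```

Raising on an unknown label, rather than returning a default, mirrors the routing rule of asking instead of guessing.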

Example 3: MLOps Maturity

Query: "We need better ML workflows"

Routing:

Ask: "What's the current pain point — can't reproduce experiments, manual deployment, prompt/eval-set drift, or some combination?"
If "can't reproduce, need to track experiments":
- experiment-tracking-and-versioning.md
If "manual deployment is slow":
- mlops-pipeline-automation.md
If "prompts and eval sets keep drifting":
- experiment-tracking-and-versioning.md (prompt + eval-set versioning patterns)
- llm-specialist llm-evaluation-metrics.md (what to put in those eval sets)
If multiple:
- experiment-tracking-and-versioning.md (establish tracking first)
- mlops-pipeline-automation.md (then automate workflow)

Example 4: LLM in Production with Quality Issues

Query: "Our LLM keeps hallucinating in production and costs are spiking"

Routing:

production-monitoring-and-alerting.md — wire Phoenix/Langfuse, capture traces, set cost-budget alerts, surface hallucination/eval-score drift
llm-specialist llm-evaluation-metrics.md — define hallucination scoring methodology to feed the observability stack
llm-specialist prompt-engineering-patterns.md and/or rag-architecture-patterns.md — remediate the underlying generation-quality issue
production-debugging-techniques.md — root-cause cost spikes (token bloat? retry storms? bad routing?)

When NOT to Use ml-production Skills

Skip ml-production when:

Still designing/training model → Use neural-architectures, training-optimization
PyTorch infrastructure issues → Use pytorch-engineering
LLM generation quality only (prompts, RAG retrieval design, fine-tuning strategy choice, agent design, eval methodology, safety/alignment) → Use llm-specialist
Classical ML deployment → Do NOT skip; ml-production still applies (see the gradient boosting / sklearn serving paths in model-serving-patterns.md)
AI/LLM threat modeling → Use ordis-security-architect

Red flag: If the model isn't trained yet, you probably don't need ml-production. Finish training first.

Success Criteria

You've routed correctly when:

Identified concern category (optimization, serving, MLOps, observability)
Asked clarifying question for ambiguous queries
Routed to appropriate skill(s) in logical order
Didn't let tool choices (Kubernetes, TorchServe, vLLM) dictate routing
Recognized multi-concern scenarios and routed to multiple skills
For LLM questions, defaulted to dual-routing across ml-production and llm-specialist unless the question is purely ops or purely generation-quality
Flagged maintenance-mode tools (e.g., TorchServe) instead of recommending them uncritically
Pointed at ordis-security-architect when threat modeling, not just observability, was at issue

ML Production Specialist Skills Catalog

After routing, load the appropriate specialist skill for detailed guidance. This pack contains exactly 10 reference sheets (no new sheets added in this refresh):

quantization-for-inference.md - torch.ao.quantization, AWQ, GPTQ, INT8/INT4 ops, post-training quantization, QAT, calibration, kernel/hardware fit (LLM format choice lives in llm-specialist)
model-compression-techniques.md - Pruning (structured/unstructured), knowledge distillation, architecture optimization, model size reduction
hardware-optimization-strategies.md - GPU/CPU/edge tuning, batch sizing, memory optimization, hardware-specific acceleration (TensorRT, ONNX Runtime), capability-tier hardware selection
model-serving-patterns.md - FastAPI, gRPC, ONNX, batching, containerization (Docker), REST/gRPC APIs; LLM-serving Part: vLLM, SGLang, TensorRT-LLM, TGI, Triton, continuous batching, PagedAttention, speculative decoding, KV-cache management; TorchServe covered as legacy/maintenance-mode
deployment-strategies.md - A/B testing, canary deployment, shadow mode, gradual rollout, rollback procedures, blue-green deployment
scaling-and-load-balancing.md - Horizontal scaling, autoscaling, load balancing, traffic management, cost optimization, replica management
experiment-tracking-and-versioning.md - MLflow, Weights & Biases, Comet, Hugging Face Hub, model registries, prompt versioning, RAG / eval-set versioning, lineage, reproducibility
mlops-pipeline-automation.md - CI/CD for ML, feature stores, data validation, automated retraining, orchestration (Airflow, Kubeflow, Prefect, Dagster)
production-monitoring-and-alerting.md - Metrics tracking, drift detection, dashboards, alerting, SLAs; LLM observability: Phoenix, Langfuse, OpenTelemetry GenAI semantic conventions; tool-call telemetry, prompt-injection signal, RAG-retrieval-quality drift, cost/token monitoring
production-debugging-techniques.md - Error analysis, production profiling, rollback procedures, post-mortems, root cause analysis, tool-call failure forensics
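
For tooling that needs the catalog as data, the ten sheets can be grouped by the four routing categories. This is a hedged sketch: `CATALOG`, `ALL_SHEETS`, and `category_of` are hypothetical names, and the grouping follows the "Routes to" lists in the category sections rather than any file shipped with the pack.

```python
from typing import Dict, List

CATALOG: Dict[str, List[str]] = {
    "Category 1: Model Optimization": [
        "quantization-for-inference.md",
        "model-compression-techniques.md",
        "hardware-optimization-strategies.md",
    ],
    "Category 2: Serving Infrastructure": [
        "model-serving-patterns.md",
        "deployment-strategies.md",
        "scaling-and-load-balancing.md",
    ],
    "Category 3: MLOps Tooling": [
        "experiment-tracking-and-versioning.md",
        "mlops-pipeline-automation.md",
    ],
    "Category 4: Observability": [
        "production-monitoring-and-alerting.md",
        "production-debugging-techniques.md",
    ],
}

# Flattened view, in catalog order.
ALL_SHEETS: List[str] = [s for sheets in CATALOG.values() for s in sheets]


def category_of(sheet: str) -> str:
    """Return the routing category that owns a sheet, or raise KeyError."""
    for category, sheets in CATALOG.items():
        if sheet in sheets:
            return category
    raise KeyError(sheet)
```

A quick check such as `len(set(ALL_SHEETS)) == 10` guards against a sheet being dropped or listed twice when the catalog is edited.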

References

Primary router: plugins/yzmir-ai-engineering-expert/skills/using-ai-engineering/SKILL.md
Sister packs (bidirectional):
- yzmir-llm-specialist/using-llm-specialist — generation quality, prompt/RAG/agent/eval/safety design (10 sheets including reasoning-models.md, agentic-patterns-and-mcp.md, context-engineering-and-prompt-caching.md)
- yzmir-training-optimization/using-training-optimization — training-phase convergence and speed
- yzmir-pytorch-engineering/using-pytorch-engineering — distributed training, low-level profiling, CUDA/memory
- ordis-security-architect/using-security-architect — AI/LLM threat modeling, controls, security review
