plugins/yzmir-ml-production/skills/using-ml-production/SKILL.md
Router skill directing to deployment, optimization, MLOps, and monitoring guides.
This meta-skill routes you to the right production deployment skill based on your concern. Load this when you need to move ML models to production but aren't sure which specific aspect to address.
Core Principle: Production concerns fall into four categories. Identify the concern first, then route to the appropriate skill. Tools and infrastructure choices are implementation details, not routing criteria.
Capability tiers, not model IDs. Sheets describe model and hardware capabilities (e.g., "frontier reasoning model", "mid-tier serving GPU", "edge accelerator") rather than naming specific SKUs. Vendor model names and exact GPU SKUs change quarterly; capability tiers are stable.
Knowledge-cutoff acknowledgement. Tool inventory is calibrated to 2026-05. Fast-moving areas — LLM serving stacks (vLLM, SGLang, TensorRT-LLM, TGI, Triton), observability platforms (Phoenix, Langfuse, OTel GenAI), quantization toolchains (torch.ao.quantization, AWQ, GPTQ), and MLOps platforms — should be re-checked against vendor docs before architecting new systems. Treat sheet content as a structured starting point, not a substitute for current documentation.
Load this skill when:
Don't use for: Training optimization (use training-optimization), model architecture selection (use neural-architectures), PyTorch infrastructure (use pytorch-engineering), prompt/RAG/agent design quality (use llm-specialist).
IMPORTANT: All reference sheets are located in the SAME DIRECTORY as this SKILL.md file.
When this skill is loaded from:
skills/using-ml-production/SKILL.md
Reference sheets like quantization-for-inference.md are at:
skills/using-ml-production/quantization-for-inference.md
NOT at:
skills/quantization-for-inference.md ← WRONG PATH
When you see a link like [quantization-for-inference.md](quantization-for-inference.md), read the file from the same directory as this SKILL.md.
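The path rule above can be sketched in a few lines. This is a minimal illustration, assuming POSIX-style relative paths; the function name `resolve_sheet` is hypothetical and not part of the pack.

```python
from pathlib import PurePosixPath


def resolve_sheet(skill_md: str, link_target: str) -> PurePosixPath:
    """Resolve a relative sheet link against the directory holding SKILL.md."""
    # Reference sheets sit next to SKILL.md, so only its parent directory matters.
    return PurePosixPath(skill_md).parent / link_target


print(resolve_sheet("skills/using-ml-production/SKILL.md",
                    "quantization-for-inference.md"))
# prints: skills/using-ml-production/quantization-for-inference.md
```

Resolving against the parent of SKILL.md, rather than against the `skills/` root, is what avoids the wrong path shown above.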
Symptoms: "Model too slow", "inference latency high", "model too large", "need to optimize for edge", "reduce model size", "speed up inference", "quantize model", "INT8/INT4", "AWQ", "GPTQ"
When to route here:
Routes to:
- torch.ao.quantization, AWQ, GPTQ, INT8/INT4, post-training quantization, QAT, calibration
Key question to ask: "Is the MODEL the bottleneck, or is it infrastructure/serving?"
LLM quantization split: Choosing which quantization format suits a given LLM task (quality vs throughput tradeoffs, instruction-following degradation) is llm-specialist territory. The operations — wiring up AWQ/GPTQ, running calibration, validating kernel support — live here.
Symptoms: "How to serve model", "need API endpoint", "deploy to production", "containerize model", "scale serving", "load balancing", "traffic management", "vLLM", "SGLang", "TensorRT-LLM", "TGI", "Triton", "continuous batching", "PagedAttention", "speculative decoding", "KV-cache management"
When to route here:
Routes to:
Key distinction:
Symptoms: "Track experiments", "version models", "automate deployment", "reproducibility", "CI/CD for ML", "feature store", "model registry", "experiment management", "version prompts", "version RAG eval set"
When to route here:
Routes to:
Key distinction:
Multi-concern: Queries like "track experiments AND automate deployment" → route to BOTH skills
Symptoms: "Monitor production", "model degrading", "detect drift", "production debugging", "alert on failures", "model not working in prod", "performance issues in production", "LLM in production", "hallucination rate", "tool-call success", "cost-per-request spiking", "prompt-injection attempts", "drift in RAG retrieval quality"
When to route here:
Routes to:
Key distinction:
LLM observability split: The plumbing — Phoenix/Langfuse/OTel-GenAI deployment, retention, dashboards, alerts on cost/latency/error budgets — lives here. The eval methodology feeding those dashboards (which metrics to compute, how to score hallucinations, how to build judges) lives in llm-specialist (llm-evaluation-metrics.md). Wire them together: methodology defines the signal; this pack ships it.
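To make "alerts on cost/latency/error budgets" concrete, here is a minimal sketch of a budget check over one monitoring window. Everything in it is an assumption for illustration: the `Budgets` fields, the alert names, and the threshold values are hypothetical, and a real deployment would express the same rules in its observability platform's own alerting configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Budgets:
    """Hypothetical per-window limits; real values come from your own SLOs."""
    max_cost_per_request_usd: float
    max_p95_latency_ms: float
    max_error_rate: float


def breached_budgets(budgets: Budgets, cost_per_request_usd: float,
                     p95_latency_ms: float, errors: int,
                     requests: int) -> List[str]:
    """Return the name of every budget the current window exceeds."""
    # Guard against an empty window so the error rate is always defined.
    error_rate = errors / requests if requests else 0.0
    alerts = []
    if cost_per_request_usd > budgets.max_cost_per_request_usd:
        alerts.append("cost")
    if p95_latency_ms > budgets.max_p95_latency_ms:
        alerts.append("latency")
    if error_rate > budgets.max_error_rate:
        alerts.append("error-rate")
    return alerts


budgets = Budgets(max_cost_per_request_usd=0.02,
                  max_p95_latency_ms=800.0,
                  max_error_rate=0.01)
print(breached_budgets(budgets, 0.05, 500.0, errors=1, requests=1000))
```

Which signals to compute and how to score them remains an llm-specialist question; the sketch only covers the alerting side.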
"Performance" ambiguity:
User query → Identify primary concern
Is model THE problem (size/speed)?
YES → Category 1: Model Optimization
NO → Continue
Is it about HOW to expose/deploy model (incl. LLM-serving stack)?
YES → Category 2: Serving Infrastructure
NO → Continue
Is it about workflow/process/automation?
YES → Category 3: MLOps Tooling
NO → Continue
Is it about monitoring/debugging in production (incl. LLM telemetry)?
YES → Category 4: Observability
NO → Ask clarifying question
Is the question about LLM generation quality, prompt design, RAG retrieval design, or agent behavior design?
YES → Hand off to llm-specialist (this pack does ops, not generation quality)
Ambiguous? → Ask ONE question to clarify concern category
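The tree above can be read as a chain of ordered checks. The following is a sketch only: the `Concern` field names are hypothetical, and placing the generation-quality hand-off after the four categories (rather than as a separate side check) is an assumption about how the tree is meant to be read.

```python
from dataclasses import dataclass


@dataclass
class Concern:
    """Answers to the tree's yes/no questions (field names are illustrative)."""
    model_is_bottleneck: bool = False
    about_exposing_model: bool = False
    about_workflow: bool = False
    about_prod_monitoring: bool = False
    about_generation_quality: bool = False


def route(c: Concern) -> str:
    """Return the first matching branch, checked in the tree's order."""
    if c.model_is_bottleneck:
        return "Category 1: Model Optimization"
    if c.about_exposing_model:
        return "Category 2: Serving Infrastructure"
    if c.about_workflow:
        return "Category 3: MLOps Tooling"
    if c.about_prod_monitoring:
        return "Category 4: Observability"
    if c.about_generation_quality:
        # Assumed placement: this pack does ops, not generation quality.
        return "Hand off to llm-specialist"
    return "Ask ONE clarifying question"


print(route(Concern(about_workflow=True)))
# prints: Category 3: MLOps Tooling
```

The function returns only the primary concern; queries that span several categories still need every relevant skill.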
Ask: "Is this inference latency (how fast predictions are), or training time?"
- training-optimization (wrong pack)
- llm-inference-optimization.md
Ask: "What's your deployment target: cloud server, edge device, batch processing, or LLM endpoint?"
- model-serving-patterns.md + llm-specialist (llm-inference-optimization.md for capability/strategy choice)
Ask: "By performance, do you mean inference speed, prediction accuracy, or LLM output quality (hallucination, tool-call failure)?"
- llm-specialist (llm-evaluation-metrics.md for methodology, prompt-engineering-patterns.md / rag-architecture-patterns.md for fixes)
Ask: "What's the current pain point: experiment tracking, automated deployment, prompt/eval-set versioning, or some combination?"
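The four clarifying questions above lend themselves to a lookup keyed on the ambiguous term. In this sketch the trigger keywords ("slow", "deploy", "performance", "mlops") are assumptions, chosen for illustration; only the questions themselves come from this section.

```python
from typing import Optional

# Trigger keywords are hypothetical; the questions are quoted from the section.
CLARIFYING_QUESTIONS = {
    "slow": "Is this inference latency (how fast predictions are), "
            "or training time?",
    "deploy": "What's your deployment target: cloud server, edge device, "
              "batch processing, or LLM endpoint?",
    "performance": "By performance, do you mean inference speed, prediction "
                   "accuracy, or LLM output quality (hallucination, "
                   "tool-call failure)?",
    "mlops": "What's the current pain point: experiment tracking, automated "
             "deployment, prompt/eval-set versioning, or some combination?",
}


def clarifying_question(query: str) -> Optional[str]:
    """Return the single question for the first ambiguous term found, if any."""
    lowered = query.lower()
    for keyword, question in CLARIFYING_QUESTIONS.items():
        if keyword in lowered:
            return question  # Ask ONE question, never a list of them.
    return None


print(clarifying_question("Set up MLOps"))
```

Returning at most one question mirrors the "ask ONE question" rule; a query with no ambiguous term yields `None` and can be routed directly.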
Some queries span multiple categories. Route to ALL relevant skills in logical order:
| Scenario | Route Order | Why |
|----------|-------------|-----|
| "Optimize and deploy model" | 1. Optimization → 2. Serving | Optimize BEFORE deploying |
| "Deploy and monitor model" | 1. Serving → 2. Observability | Deploy BEFORE monitoring |
| "Track experiments and automate deployment" | 1. Experiment tracking → 2. Pipeline automation | Track BEFORE automating |
| "Quantize model and serve with vLLM" | 1. Quantization (ops) → 2. LLM-serving Part of serving-patterns | Optimize BEFORE serving |
| "Deploy with A/B testing and monitor" | 1. Deployment strategies → 2. Monitoring | Deploy strategy BEFORE monitoring |
| "Deploy LLM to production" | 1. llm-specialist (model/strategy choice) → 2. ml-production serving + 3. ml-production observability | Generation-quality choice frames the ops |
| "RAG production system" | 1. llm-specialist (rag-architecture-patterns.md) → 2. ml-production (vector-store deployment, retrieval-quality observability, latency monitoring) | Architecture decisions frame the ops |
| "Agent in production" | 1. llm-specialist (agentic-patterns-and-mcp.md) → 2. ml-production (tool-call observability, error rates, cost monitoring) → 3. axiom-engineering-foundations (sandboxing, system design) | Agent design frames ops + safety |
Principle: Route in execution order (what needs to happen first).
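Routing in execution order amounts to sorting the identified concerns by a fixed rank. The ranking below is an assumption read off the table rows (design choice, then optimization, then serving, then observability); the names `EXECUTION_RANK` and `order_routes` are hypothetical.

```python
from typing import Iterable, List

# Assumed ranking, inferred from the scenarios in the table above.
EXECUTION_RANK = {
    "llm-specialist": 0,   # design and strategy choices frame the ops
    "optimization": 1,     # optimize BEFORE deploying
    "serving": 2,          # deploy BEFORE monitoring
    "observability": 3,
}


def order_routes(concerns: Iterable[str]) -> List[str]:
    """Deduplicate concerns and sort them into execution order.

    Raises KeyError for a concern that has no assumed rank.
    """
    return sorted(set(concerns), key=lambda name: EXECUTION_RANK[name])


print(order_routes(["observability", "serving"]))
# prints: ['serving', 'observability']
```

Failing loudly on an unknown concern is deliberate here: a concern without a rank is a sign that the query needs a clarifying question first.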
The two packs split along the ops vs. generation-quality boundary.
ml-production owns (this pack):
- torch.ao.quantization, AWQ, GPTQ wiring, calibration, hardware fit
llm-specialist owns (sister pack, 10 reference sheets):
- prompt-engineering-patterns.md
- reasoning-models.md
- agentic-patterns-and-mcp.md
- rag-architecture-patterns.md
- llm-finetuning-strategies.md
- context-engineering-and-prompt-caching.md, context-window-management.md
- llm-evaluation-metrics.md
- llm-inference-optimization.md
- llm-safety-alignment.md
Concrete bidirectional triggers:
| Query | llm-specialist contributes | ml-production contributes |
|-------|---------------------------|---------------------------|
| "Deploy LLM to production" | Inference-strategy + capability-tier choice | Serving stack, monitoring, scaling |
| "Quantize LLM" | Which format suits the task (quality tradeoffs) | torch.ao.quantization, AWQ/GPTQ ops, hardware fit |
| "LLM observability" | Eval methodology to feed into dashboards | Phoenix / Langfuse / OTel-GenAI plumbing |
| "RAG production system" | RAG architecture + retrieval design | Vector-store deployment, retrieval-quality observability, latency monitoring |
| "Agent in production" | Agentic patterns, MCP, anti-patterns | Tool-call observability, error rates, cost monitoring (also: axiom-engineering-foundations for sandboxing/system design) |
Rule of thumb: If the question is "what should the model/prompt/agent do and how should it be designed?" → llm-specialist. If the question is "how do we run, observe, and operate it at scale?" → ml-production. Most LLM production questions need both.
Clear boundary:
"Too slow" disambiguation:
Bidirectional: training-optimization should send users here when they're done training; ml-production should send users back when they need to retrain (drift response, dataset updates).
pytorch-engineering covers: Foundation (distributed training, profiling, memory management)
ml-production covers: Production-specific (serving APIs, deployment patterns, MLOps, production observability)
When to use both:
Bidirectional: pytorch-engineering profiling skills point here for production-context observability; this pack's debugging sheet points back to pytorch-engineering for low-level CUDA/memory work.
For AI/LLM threat modeling — prompt injection, data exfiltration via tool-calls, supply-chain risk on model artifacts, PII leakage in logs, model-registry access control — route to ordis-security-architect (threat modeling, controls design, security review). This pack handles the operational signals (prompt-injection attempt rate as an observability metric) but the threat model and control design live in security-architect.
| Query | Wrong Route | Correct Route | Why |
|-------|-------------|---------------|-----|
| "Model too slow in production" | Immediately to quantization | Ask: inference or training? Then model vs infrastructure? | Could be serving/batching issue, not model |
| "Deploy with Kubernetes" | Defer to Kubernetes docs | Category 2: serving-patterns or deployment-strategies | Kubernetes is tool choice, not routing concern |
| "Set up MLOps" | Route to one skill | Ask about specific pain point — could be tracking AND automation AND prompt/eval-set versioning | MLOps spans multiple skills |
| "Performance issues" | Assume accuracy | Ask: speed, accuracy, or LLM output quality? | Performance is ambiguous |
| "Deploy LLM" | Skip llm-specialist, route only here | Route both: llm-specialist (strategy/capability choice), ml-production (serving stack, observability, scaling) | Generation-quality choices frame the ops |
| "Just use TorchServe" | Route uncritically to TorchServe content | Flag: TorchServe is in maintenance mode (PyTorch deprecated active development). Route to LLM-serving Part of model-serving-patterns.md (vLLM/SGLang/TensorRT-LLM/TGI/Triton) for LLMs; for classical models route to serving-patterns but recommend evaluating Triton or framework-native serving | Don't recommend a maintenance-mode tool for new systems |
| "We use TorchServe" (existing system) | Skip routing | Route to serving-patterns, note migration path | Existing systems still need ops guidance, but flag the deprecation |
| "Hallucinations in production" | Route only to llm-specialist | Route both: ml-production (Phoenix/Langfuse to detect+alert) + llm-specialist (eval methodology + fix via prompt/RAG/fine-tune) | Detection is ops; remediation is generation quality |
| "Quantize an LLM" | Route only to quantization-for-inference | Route both: llm-specialist (which format suits the task) + ml-production (AWQ/GPTQ ops, hardware fit) | Format choice and ops are different concerns |
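The TorchServe rows above describe a flag-then-route habit: tool names do not decide the route, but a maintenance-mode tool should still trigger a note. A minimal sketch, assuming a simple substring match; the `MAINTENANCE_MODE_TOOLS` table and `tool_flags` function are hypothetical.

```python
from typing import List

# Only TorchServe is listed because it is the only tool this document flags.
MAINTENANCE_MODE_TOOLS = {
    "torchserve": "TorchServe is in maintenance mode. "
                  "Flag it and offer alternatives.",
}


def tool_flags(query: str) -> List[str]:
    """Return a note for each maintenance-mode tool named in the query."""
    lowered = query.lower()
    return [note for tool, note in MAINTENANCE_MODE_TOOLS.items()
            if tool in lowered]


print(tool_flags("Just use TorchServe"))
print(tool_flags("Deploy with Kubernetes"))  # no flags: an empty list
```

The flags are advisory and sit alongside routing: an existing TorchServe system still routes to serving guidance, with the note attached.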

| Excuse | Reality |
|--------|---------|
| "User mentioned Kubernetes, route to deployment" | Tools are implementation details. Route by concern first. |
| "Slow = optimization, route to quantization" | Slow could be infrastructure. Clarify model vs serving bottleneck. |
| "They said deploy, must be serving-patterns" | Could need serving + deployment-strategies + monitoring. Don't assume single concern. |
| "MLOps = experiment tracking" | MLOps spans tracking AND automation AND prompt/eval-set versioning. Ask which pain point. |
| "Performance obviously means speed" | Could mean accuracy or LLM output quality. Clarify. |
| "They're technical, skip clarification" | Technical users still benefit from clarifying questions. |
| "LLM question, route only to llm-specialist" | Most LLM production questions need both packs. Default to dual-route. |
| "TorchServe still works, just route there" | TorchServe is in maintenance mode. Flag and offer alternatives. |
If you catch yourself thinking ANY of these, STOP and clarify:
When in doubt: Ask ONE clarifying question. 10 seconds of clarification prevents minutes of wrong-skill loading.
| User Concern | Ask Clarifying | Route To | Also Consider |
|--------------|----------------|----------|---------------|
| Model slow/large | Inference or training? | Optimization skills | If inference, check serving too |
| Deploy classical model | Target (cloud/edge/batch)? | Serving patterns | Deployment strategies for gradual rollout |
| Deploy LLM | Capability tier and target? | LLM-serving Part of serving-patterns + llm-specialist | Observability + cost monitoring |
| Production monitoring | Proactive or reactive? | Monitoring OR debugging | Both if setting up + fixing issues |
| LLM observability | Plumbing or methodology? | Monitoring (plumbing) + llm-specialist (methodology) | Cost + drift alerts |
| MLOps setup | Tracking, automation, or prompt/eval versioning? | Experiment tracking AND/OR automation | Often multiple needed |
| Performance issues | Speed, accuracy, or LLM quality? | Optimization OR observability OR llm-specialist | Depends on clarification |
| Scale serving | Traffic pattern? | Scaling-and-load-balancing | Serving patterns if not set up yet |
Query: "I trained a model, now I need to put it in production"
Routing:
- llm-inference-optimization.md (capability tier + strategy choice)
- llm-evaluation-metrics.md (methodology to feed observability)
Query: "My inference is slow"
Routing:
Query: "We need better ML workflows"
Routing:
- llm-evaluation-metrics.md (what to put in those eval sets)
Query: "Our LLM keeps hallucinating in production and costs are spiking"
Routing:
- llm-evaluation-metrics.md: define hallucination scoring methodology to feed the observability stack
- prompt-engineering-patterns.md and/or rag-architecture-patterns.md: remediate the underlying generation-quality issue
Skip ml-production when:
Red flag: If the model isn't trained yet, you probably don't need ml-production. Finish training first.
You've routed correctly when:
After routing, load the appropriate specialist skill for detailed guidance. This pack contains exactly 10 reference sheets (no new sheets added in this refresh):
- torch.ao.quantization, AWQ, GPTQ, INT8/INT4 ops, post-training quantization, QAT, calibration, kernel/hardware fit (LLM format choice lives in llm-specialist)
- plugins/yzmir-ai-engineering-expert/skills/using-ai-engineering/SKILL.md
- yzmir-llm-specialist/using-llm-specialist: generation quality, prompt/RAG/agent/eval/safety design (10 sheets including reasoning-models.md, agentic-patterns-and-mcp.md, context-engineering-and-prompt-caching.md)
- yzmir-training-optimization/using-training-optimization: training-phase convergence and speed
- yzmir-pytorch-engineering/using-pytorch-engineering: distributed training, low-level profiling, CUDA/memory
- ordis-security-architect/using-security-architect: AI/LLM threat modeling, controls, security review