plugins/yzmir-pytorch-engineering/skills/using-pytorch-engineering/SKILL.md
Routes to appropriate PyTorch specialist skill based on symptoms and problem type
This meta-skill routes you to the right PyTorch specialist based on symptoms. PyTorch engineering problems fall into distinct categories that require specialized knowledge. Load this skill when you encounter PyTorch-specific issues but aren't sure which specialized skill to use.
Core Principle: Different PyTorch problems require different specialists. Match symptoms to the appropriate specialist skill. Don't guess at solutions—route to the expert.
API surface calibrated to PyTorch 2.9+ as of 2026-05. Deprecated torch.cuda.amp aliases have been migrated to torch.amp; FairScale ZeRO references are replaced with native FSDP1/FSDP2. Modern features (torch.compile, FlexAttention, CUDA Graphs, NVTX/Nsight Systems, DTensor, expandable_segments, channels_last, FP8) are covered as first-class topics.
Reconciliation gate — what you can rely on inside this pack:
- torch.cuda.amp.autocast / torch.cuda.amp.GradScaler are deprecated aliases; sheets use torch.amp.autocast(device_type=...) and torch.amp.GradScaler() instead.
- Distributed coverage is native FSDP1 (FullyShardedDataParallel) and FSDP2 (fully_shard, MixedPrecisionPolicy, OffloadPolicy, sharded state dict, init_device_mesh/DTensor). DDP and pipeline parallelism are still covered where appropriate.
- torch.compile (modes, dynamic=, recompilation triage, graph-break debugging) and FlexAttention / scaled_dot_product_attention are first-class, not afterthoughts.
- CUDA Graphs (torch.cuda.graph, make_graphed_callables) and expandable_segments:True for fragmentation are the tools of choice.
- pytorch-code-reviewer and memory-diagnostician, together with the commands (/yzmir-pytorch-engineering:debug-nan, :debug-oom, :profile), are the canonical entry points and route into the sheets refreshed in this pass.
Load this skill when:
Don't use for: Framework-agnostic ML theory, non-PyTorch frameworks, algorithm selection (use yzmir-training-optimization or other packs)
IMPORTANT: All reference sheets are located in the SAME DIRECTORY as this SKILL.md file.
When this skill is loaded from:
skills/using-pytorch-engineering/SKILL.md
Reference sheets like tensor-operations-and-memory.md are at:
skills/using-pytorch-engineering/tensor-operations-and-memory.md
NOT at:
skills/tensor-operations-and-memory.md ← WRONG PATH
When you see a link like [tensor-operations-and-memory.md](tensor-operations-and-memory.md), read the file from the same directory as this SKILL.md.
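The path rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `resolve_sheet` is a hypothetical helper, not part of the skill pack, and the paths are the ones quoted above.

```python
from pathlib import PurePosixPath


def resolve_sheet(skill_md: str, link: str) -> str:
    """Resolve a reference-sheet link relative to the directory holding SKILL.md."""
    # The sheet lives next to SKILL.md, so drop the file name and append the link.
    return str(PurePosixPath(skill_md).parent / link)


sheet = resolve_sheet(
    "skills/using-pytorch-engineering/SKILL.md",
    "tensor-operations-and-memory.md",
)
print(sheet)
```

The result keeps the using-pytorch-engineering directory in the path, which is exactly what the WRONG PATH example omits.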
Symptoms:
- expandable_segments / "reserved but unallocated"
- channels_last memory format questions
Route to: See tensor-operations-and-memory.md for memory management, expandable_segments:True, channels_last, contiguity, and allocator tuning. For OOM diagnostics paired with profiling traces, also see performance-profiling.md.
Why: Memory management is foundational. Must understand tensor lifecycles, allocator behavior, memory format, and profiling before other optimizations.
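As a rough sketch of what enabling expandable_segments:True can look like in practice, the snippet below merges the option into the PYTORCH_CUDA_ALLOC_CONF environment variable, which PyTorch reads as comma-separated key:value pairs. The helper name `with_allocator_option` is hypothetical, and the parser assumes simple options without bracketed list values.

```python
import os


def with_allocator_option(current: str, key: str, value: str) -> str:
    """Return a PYTORCH_CUDA_ALLOC_CONF string with `key` set to `value`."""
    options = {}
    # Assumption: every existing option is a plain key:value pair (no bracketed lists).
    for item in filter(None, (part.strip() for part in current.split(","))):
        name, _, val = item.partition(":")
        options[name] = val
    options[key] = value  # overrides an existing entry, otherwise appends
    return ",".join(f"{name}:{val}" for name, val in options.items())


# Set this before CUDA is initialized; doing it before importing torch is the safe choice.
conf = with_allocator_option(
    os.environ.get("PYTORCH_CUDA_ALLOC_CONF", ""), "expandable_segments", "True"
)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = conf
print(conf)
```

Whether the option actually helps depends on the fragmentation pattern, so the reference sheet's diagnostics should come first.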
Example queries:
Symptoms:
Route to: See module-design-patterns.md for model architecture and nn.Module patterns.
Why: Proper module design prevents bugs and enables features like checkpointing, distributed training, torch.compile, and serialization.
Example queries:
Symptoms:
- fully_shard / FullyShardedDataParallel
- MixedPrecisionPolicy / OffloadPolicy / "sharded state dict"
- init_device_mesh / "device mesh"
Route to: See distributed-training-strategies.md for DDP, FSDP1/FSDP2, DTensor + device mesh, and multi-node setup.
Why: Distributed training has unique setup requirements, synchronization patterns, and pitfalls. Generic advice breaks in distributed settings, and FSDP2 (fully_shard) has materially different ergonomics from FSDP1.
Example queries:
Symptoms:
- torch.cuda.graph / make_graphed_callables
Route to: See performance-profiling.md FIRST for systematic bottleneck identification (PyTorch Profiler, NVTX, Nsight Systems, CUDA Graphs, allocator stats).
Why: MUST profile before optimizing. Many "performance" problems are actually data loading, host-side overhead, or graph-break recompiles — not raw compute. Profile to identify the real bottleneck.
After profiling, may route to:
Example queries:
Mixed Precision, torch.compile, and Optimization
Symptoms:
- torch.amp.autocast / torch.amp.GradScaler
- torch.cuda.amp (deprecated; migration covered)
- torch.compile / "graph breaks" / "recompiles" / "dynamic shapes"
- mode="reduce-overhead" / mode="max-autotune" / fullgraph=True
- scaled_dot_product_attention / "SDPA" / "FlashAttention backend"
Route to: See mixed-precision-and-optimization.md for the modern torch.amp API, BF16/FP16/FP8 selection, gradient scaling, numerical stability, torch.compile modes/dynamic/recompilation triage, FlexAttention, and scaled_dot_product_attention.
Why: Mixed precision requires careful handling of numerical stability, gradient scaling, and operation compatibility. torch.compile shifts where bugs surface (recompiles, graph breaks, guard failures), and attention now has a first-class fused path via SDPA / FlexAttention.
Example queries:
- … torch.amp section)
Symptoms:
Route to: See debugging-techniques.md for systematic NaN/Inf debugging, anomaly mode, and torch.compile debugging (graph-break and recompile triage, TORCH_LOGS, TORCH_COMPILE_DEBUG).
Why: NaN/Inf issues require systematic debugging—checking gradients layer by layer, identifying numerical instability sources, and targeted fixes. Under torch.compile, you also need to know how to disable compilation, dump graphs, and isolate the failing region.
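One way to switch on the compile diagnostics named above is through environment variables on the training process. The sketch below assumes a hypothetical `train.py` entry point and a hypothetical helper `compile_debug_env`; TORCH_LOGS and TORCH_COMPILE_DEBUG are the variables this skill refers to.

```python
import os
import sys


def compile_debug_env(base=None, dump_artifacts=False):
    """Copy an environment and add torch.compile logging switches."""
    env = dict(os.environ if base is None else base)
    # Log each graph break and each recompilation together with its reason.
    env["TORCH_LOGS"] = "graph_breaks,recompiles"
    if dump_artifacts:
        # Also write compile debug artifacts to disk for later inspection.
        env["TORCH_COMPILE_DEBUG"] = "1"
    return env


env = compile_debug_env(dump_artifacts=True)
command = [sys.executable, "train.py"]  # train.py is a hypothetical script
print(command[1], env["TORCH_LOGS"], env["TORCH_COMPILE_DEBUG"])
```

Passing `command` and `env` to `subprocess.run(command, env=env, check=True)` would launch the run with logging enabled, leaving the parent shell's environment untouched.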
Example queries:
Symptoms:
Route to: See checkpointing-and-reproducibility.md for complete state management, including sharded/FSDP checkpoints and RNG/determinism.
Why: Proper checkpointing requires saving ALL state (model, optimizer, scheduler, RNG states, AMP scaler). Reproducibility requires deterministic operations and careful seed management. Distributed checkpoints have additional sharding concerns.
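The "save ALL state" requirement can be expressed as a simple completeness check before a checkpoint is written. This is a minimal sketch: the key names in `REQUIRED_STATE` and the helper `missing_state` are hypothetical, chosen to mirror the list above rather than any fixed PyTorch convention.

```python
# Hypothetical key names mirroring the state listed above.
REQUIRED_STATE = ("model", "optimizer", "scheduler", "rng_states", "scaler")


def missing_state(checkpoint: dict) -> list:
    """List the required entries that are absent from a checkpoint dict."""
    return [key for key in REQUIRED_STATE if key not in checkpoint]


partial = {"model": {}, "optimizer": {}}
print(missing_state(partial))  # → ['scheduler', 'rng_states', 'scaler']
```

A check like this only catches omitted entries; it says nothing about determinism settings or sharded layouts, which the reference sheet covers.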
Example queries:
Symptoms:
Route to: See custom-autograd-functions.md for custom backward passes.
Why: Custom autograd functions require understanding the autograd engine, proper gradient computation, and numerical stability. Compatibility with torch.compile adds further constraints.
Example queries:
Some scenarios require multiple specialized skills in sequence:
Distributed training with memory constraints:
- … expandable_segments, activation checkpointing)
Performance optimization:
Custom module with proper patterns:
Training instability with mixed precision or compile:
Load in order of execution: Setup before optimization, diagnosis before fixes, structure before customization.
When symptom unclear, ASK ONE clarifying question:
"Fix my PyTorch training" → Ask: "What specific issue? Memory? Speed? Accuracy? NaN? Graph breaks under compile?"
"Optimize my model" → Ask: "Optimize what? Training speed? Memory usage? Inference? Compile?"
"Setup distributed training" → Ask: "Single-node multi-GPU or multi-node? DDP, FSDP1, or FSDP2? What's not working?"
"Model not working" → Ask: "What's broken? Training fails? Wrong outputs? Performance? Recompiles?"
Never guess when ambiguous. Ask once, route accurately.
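The "ask once, route accurately" rule can be sketched as a small keyword router. Everything here is illustrative: the keywords are assumptions, not the skill's actual symptom lists, and the entry order is a guess at priority (diagnostic and setup sheets first). Only the sheet names and the clarifying question come from this document.

```python
import re

# (keywords, sheet) pairs; earlier entries are loaded first when several match.
ROUTES = [
    (("fully_shard", "ddp", "device mesh"), "distributed-training-strategies.md"),
    (("slow", "throughput"), "performance-profiling.md"),
    (("nan", "inf loss"), "debugging-techniques.md"),
    (("oom", "out of memory", "expandable_segments"), "tensor-operations-and-memory.md"),
    (("autocast", "gradscaler"), "mixed-precision-and-optimization.md"),
    (("checkpoint", "save model"), "checkpointing-and-reproducibility.md"),
    (("custom backward", "autograd.function"), "custom-autograd-functions.md"),
]
CLARIFY = "What specific issue? Memory? Speed? Accuracy? NaN? Graph breaks under compile?"


def route(query):
    """Return the matching sheets in load order, or one clarifying question."""
    text = query.lower()
    matches = [
        sheet
        for keywords, sheet in ROUTES
        if any(re.search(rf"\b{re.escape(k)}\b", text) for k in keywords)
    ]
    # No recognizable symptom: ask once instead of guessing.
    return matches if matches else CLARIFY


print(route("Fix my PyTorch training"))
print(route("Loss becomes NaN after 200 steps"))
```

The first call returns the clarifying question because no symptom keyword appears; the second returns a single sheet.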
| Symptom / User ask | Wrong Route | Correct Route | Why |
|--------------------|-------------|---------------|-----|
| "Training slow, optimize my optimizer" | mixed-precision / optimizer tuning alone | performance-profiling.md FIRST | Real bottleneck is often torch.compile graph-breaks, FSDP comm overhead, or data loading — verify before changing the optimizer |
| "OOM in distributed" | tensor-memory only | distributed-training-strategies.md FIRST, then memory | Sharding policy / MixedPrecisionPolicy / OffloadPolicy may be the actual issue |
| "Custom layer slow" | performance-profiling | module-design-patterns.md FIRST | Design might be inefficient before profiling helps |
| "NaN with AMP" | mixed-precision | debugging-techniques.md FIRST | Debug NaN source, then fix AMP / scaler |
| "Save model" | module-design | checkpointing-and-reproducibility.md FIRST | Checkpointing is its own specialty (incl. sharded state dict) |
| "Use FairScale ZeRO" | fairscale-flavored advice | distributed-training-strategies.md (FSDP1/FSDP2) | FairScale is unmaintained; native FSDP is the current path |
| "Use torch.cuda.amp" | echo deprecated API | mixed-precision-and-optimization.md (torch.amp API) | torch.cuda.amp is a deprecated alias for torch.amp |
| "Hand-roll attention" | implement raw QKᵀ/√d softmax | mixed-precision-and-optimization.md (scaled_dot_product_attention / FlexAttention) | Fused SDPA / FlexAttention is faster, more memory-efficient, and numerically saner |
| "torch.compile slower than eager" | tweak optimizer / batch | mixed-precision-and-optimization.md (compile section) + debugging-techniques.md | Almost always graph breaks or recompiles — diagnose first |
| "Use torch.cuda.graph everywhere" | apply blindly | performance-profiling.md | CUDA Graphs help only when host-bound with static shapes; profile first |
Key principle: Diagnosis before solutions, setup before optimization, root cause before fixes — and never echo deprecated APIs.
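To make "never echo deprecated APIs" checkable, a short scanner can flag the deprecated torch.cuda.amp spellings in a piece of source text. This is a sketch with a hypothetical helper name, `find_deprecated_amp`. It only catches fully qualified uses and would miss forms such as `from torch.cuda.amp import autocast`.

```python
import re

# Deprecated alias -> modern spelling, following this skill's migration note.
REPLACEMENTS = {
    "torch.cuda.amp.autocast": "torch.amp.autocast(device_type=...)",
    "torch.cuda.amp.GradScaler": "torch.amp.GradScaler()",
}
PATTERN = re.compile(r"torch\.cuda\.amp\.(?:autocast|GradScaler)\b")


def find_deprecated_amp(source: str):
    """Return (line number, deprecated name, suggested replacement) triples."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            hits.append((lineno, match.group(0), REPLACEMENTS[match.group(0)]))
    return hits


snippet = "import torch\nscaler = torch.cuda.amp.GradScaler()\n"
print(find_deprecated_amp(snippet))
# → [(2, 'torch.cuda.amp.GradScaler', 'torch.amp.GradScaler()')]
```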
If you catch yourself about to:
- … channels_last, expandable_segments)
- … torch.cuda.amp.autocast / GradScaler → Route to mixed-precision-and-optimization.md for the torch.amp API
- … torch.compile is "always faster" → Route to performance-profiling.md and the compile section of mixed-precision-and-optimization.md
All of these mean: You're about to give incomplete or stale advice. Route to the specialist instead.
| Excuse | Reality | What To Do |
|--------|---------|------------|
| "User is rushed, skip routing" | Routing takes 5 seconds. Wrong fix wastes minutes. | Route anyway - specialists have quick diagnostics |
| "They already tried X" | May have done X wrong, misunderstood, or X wasn't applicable. | Route to specialist to verify X was done correctly |
| "Authority/senior says Y" | Authority can misdiagnose bottlenecks without profiling. | Profile first, authority second. Respect skills over seniority. |
| "User is tired, don't ask" | Exhaustion makes clarity MORE important, not less. | Ask ONE clarifying question - saves time overall |
| "User suggested Z" | Z might not be best option for their specific case. | Route to specialist to evaluate if Z is right approach |
| "Too complex, can't route" | Complex scenarios need specialists MORE, not less. | Use cross-cutting section - route to multiple skills in sequence |
| "User sounds confident" | Confidence about custom autograd often precedes subtle bugs. | Route to specialist for systematic verification |
| "Just a quick question" | No such thing - symptoms need diagnosis. | Quick questions deserve correct answers - route properly |
| "Simple issue" | Simple symptoms can have complex root causes. | Route based on symptoms, not perceived complexity |
| "Direct answer is helpful" | Wrong direct answer wastes time and frustrates user. | Routing to specialist IS the helpful answer |
If you catch yourself thinking ANY of these, STOP and route to the specialist.
Before giving ANY PyTorch advice, ask yourself:
❓ Did I identify the symptom?
❓ Is this symptom in my routing table?
❓ Am I about to give advice directly?
❓ Is this a diagnosis issue or solution issue?
❓ Is query ambiguous?
❓ Am I feeling pressure to skip routing?
If you failed ANY check above, do NOT give direct advice. Route to specialist or ask clarifying question.
Skip PyTorch pack when:
- … (use yzmir-training-optimization or algorithm packs)
- … (use yzmir-neural-architectures)
- … (use yzmir-training-optimization)
- … (use yzmir-llm-specialist)
- … (use yzmir-ml-production)
PyTorch pack is for: PyTorch-specific implementation, infrastructure, debugging, and optimization issues: the framework-bound layer underneath those other packs.
Critical: Many PyTorch issues require diagnosis before solutions:
| Issue Type | Diagnosis Skill | Then Solution Skill |
|------------|----------------|---------------------|
| Performance | performance-profiling | mixed-precision (incl. compile) / distributed |
| Memory | tensor-memory (profiling section) + performance-profiling (allocator) | tensor-memory (optimization) |
| NaN/Inf | debugging-techniques | mixed-precision / module-design |
| Compile regressions | debugging-techniques (graph-break / recompile triage) | mixed-precision (compile section) |
| Distributed hangs / OOM | distributed-training-strategies | tensor-memory / performance-profiling |
| Training bugs | debugging-techniques | Appropriate fix |
If unclear what's wrong, route to diagnostic skill first.
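The diagnosis-before-solution ordering can be written down as a lookup. This is a minimal sketch with a hypothetical helper, `load_order`; the entries are copied from a subset of the table rows above.

```python
# Issue type -> (diagnosis skill, follow-up solution skill), taken from the table above.
DIAGNOSIS_FIRST = {
    "performance": ("performance-profiling", "mixed-precision (incl. compile) / distributed"),
    "nan/inf": ("debugging-techniques", "mixed-precision / module-design"),
    "compile regressions": ("debugging-techniques", "mixed-precision (compile section)"),
    "distributed hangs / oom": ("distributed-training-strategies", "tensor-memory / performance-profiling"),
    "training bugs": ("debugging-techniques", "appropriate fix"),
}


def load_order(issue_type: str) -> list:
    """Return skills in load order: diagnosis first, then the solution skill."""
    # An unknown issue type raises KeyError; that is the cue to ask a clarifying question.
    diagnosis, solution = DIAGNOSIS_FIRST[issue_type.lower()]
    return [diagnosis, solution]


print(load_order("NaN/Inf"))  # → ['debugging-techniques', 'mixed-precision / module-design']
```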
After routing, load the appropriate specialist skill for detailed guidance. Eight sheets, all in this directory:
- tensor-operations-and-memory.md: channels_last memory format, expandable_segments:True and allocator tuning, gradient checkpointing, fragmentation, OOM mitigation.
- module-design-patterns.md: nn.Module structure, parameter/buffer registration, initialization, composability with checkpointing / FSDP / torch.compile.
- distributed-training-strategies.md: DDP, FSDP1 (FullyShardedDataParallel), FSDP2 (fully_shard, MixedPrecisionPolicy, OffloadPolicy), DTensor + init_device_mesh, sharded state dict, NCCL, multi-node launch. FairScale ZeRO is out, replaced by native FSDP.
- mixed-precision-and-optimization.md: torch.amp.autocast / torch.amp.GradScaler (BF16/FP16/FP8 API and selection), TF32, torch.compile modes / dynamic= / recompilation triage / fullgraph, scaled_dot_product_attention, FlexAttention.
- performance-profiling.md: PyTorch Profiler, NVTX, Nsight Systems (nsys workflows), CUDA Graphs (torch.cuda.graph, make_graphed_callables), allocator stats / expandable_segments for fragmentation, host-bound vs compute-bound vs comm-bound triage.
- debugging-techniques.md: NaN/Inf debugging, anomaly mode, torch.compile debugging (TORCH_LOGS, TORCH_COMPILE_DEBUG, graph-break and recompile triage), distributed deadlock isolation.
- checkpointing-and-reproducibility.md: complete state management, including sharded/FSDP checkpoints and RNG/determinism.
- custom-autograd-functions.md: torch.autograd.Function, custom forward/backward, setup_context / save_for_backward, gradcheck, torch.compile interop.
PyTorch engineering sits underneath several other Yzmir packs. Hand off explicitly when the question is no longer PyTorch-bound:
yzmir-training-optimization ↔ this pack
yzmir-llm-specialist ↔ this pack
- … torch.compile for transformer blocks, KV-cache memory tuning.
yzmir-ml-production ↔ this pack
- … torch.compile modes, CUDA Graphs, SDPA backend selection, channels_last, allocator config: route here.
When in doubt: PyTorch pack handles "the framework is doing something I need to fix or measure." The other packs handle "what should the framework be doing in the first place."