.claude/skills/ml-training-debugger/SKILL.md
# ML Training Debugger
**Version**: 1.0.0
**Type**: Agent-based skill with SDK implementation
**Domain**: Machine learning training diagnostics

## Description

Diagnose machine learning training failures, including loss divergence, mode collapse, gradient issues, architecture problems, and optimization failures. This skill spawns a specialist ML debugging agent that systematically analyzes training artifacts to identify root causes and propose evidence-based fixes.

Use this skill when training fails, when loss curves behave pathologically, when models produce degenerate outputs, when GPU memory issues occur, or when hyperparameter tuning produces inconsistent results.
This skill activates when users request:
The skill handles:
The ML debugging agent handles:
```json
{
  "task": "Diagnose training failure",
  "artifacts": {
    "training_logs": "path/to/logs.txt",
    "loss_curves": "path/to/losses.csv",
    "model_code": ["model.py", "trainer.py"],
    "error_messages": ["error1.txt"],
    "config": "config.yaml"
  },
  "symptoms": [
    "Loss diverged at epoch 7",
    "Mode collapse to single token",
    "Gradient norm exploded"
  ],
  "constraints": {
    "max_analysis_time": "5 minutes",
    "output_format": "structured_diagnosis"
  }
}
```
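
A request in this shape can be checked before the agent is spawned. The following is a minimal sketch, not part of the skill itself: `validate_request` and `REQUIRED_ARTIFACTS` are hypothetical names, and the required keys are assumed to be the ones the SDK implementation reads from the context (`training_logs`, `loss_curves`, `model_code`, and `symptoms`).

```python
import json

# Assumed minimum set: the artifact keys the SDK implementation reads.
REQUIRED_ARTIFACTS = ("training_logs", "loss_curves", "model_code")


def validate_request(request: dict) -> list:
    """Return a list of problems found; an empty list means the request looks usable."""
    problems = []
    if not request.get("task"):
        problems.append("missing 'task'")
    artifacts = request.get("artifacts", {})
    for key in REQUIRED_ARTIFACTS:
        if not artifacts.get(key):
            problems.append(f"missing artifacts.{key}")
    if not isinstance(artifacts.get("model_code", []), list):
        problems.append("artifacts.model_code must be a list of paths")
    if not request.get("symptoms"):
        problems.append("at least one symptom is required")
    return problems


request = json.loads("""
{
  "task": "Diagnose training failure",
  "artifacts": {"training_logs": "path/to/logs.txt", "model_code": ["model.py"]},
  "symptoms": ["Loss diverged at epoch 7"]
}
""")
print(validate_request(request))  # ['missing artifacts.loss_curves']
```

Failing fast here avoids a `KeyError` inside the prompt-formatting step when an artifact path is absent.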
```json
{
  "status": "diagnosis_complete",
  "root_causes": [
    {
      "issue": "Learning rate too high for Muon optimizer",
      "severity": "critical",
      "evidence": ["grad_norm spike at step 24590", "loss increased 15% in epoch 7"],
      "fix": "Reduce muon_lr from 1e-2 to 5e-3",
      "confidence": 0.95
    }
  ],
  "quick_fixes": ["Reduce LR by 50%", "Enable gradient clipping"],
  "analysis_artifacts": {
    "gradient_analysis": "path/to/grad_analysis.md",
    "loss_visualization": "path/to/loss_plot.png"
  }
}
```
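
One way to turn a diagnosis in this shape into a short, readable summary is sketched below. It is illustrative only: `summarize_diagnosis` is a hypothetical helper, and the severity levels other than `critical` are assumptions, since only `critical` appears in the example output.

```python
# Assumed severity scale; only "critical" is attested in the example output.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def summarize_diagnosis(diagnosis: dict) -> list:
    """Render each root cause as one line, most severe and most confident first."""
    causes = sorted(
        diagnosis.get("root_causes", []),
        key=lambda c: (
            SEVERITY_ORDER.get(c["severity"], len(SEVERITY_ORDER)),  # unknown levels sort last
            -c["confidence"],
        ),
    )
    return [
        f"[{c['severity']}] {c['issue']} -> {c['fix']} (confidence {c['confidence']:.0%})"
        for c in causes
    ]


example = {
    "status": "diagnosis_complete",
    "root_causes": [
        {
            "issue": "Learning rate too high for Muon optimizer",
            "severity": "critical",
            "fix": "Reduce muon_lr from 1e-2 to 5e-3",
            "confidence": 0.95,
        }
    ],
}
for line in summarize_diagnosis(example):
    print(line)
```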
```python
import asyncio

from claude_agent_sdk import (
    AssistantMessage,
    ClaudeAgentOptions,
    ClaudeSDKClient,
    TextBlock,
)


async def execute_ml_debugger(context: dict):
    """Spawn ML debugging specialist agent."""
    # Load specialist agent prompt
    with open('agents/ml-debugger-specialist.prompt', 'r') as f:
        specialist_prompt = f.read()

    # Configure agent
    options = ClaudeAgentOptions(
        model='claude-sonnet-4-5',
        system_prompt=specialist_prompt,
        # Standard permission behavior; this alone does not enforce read-only access
        permission_mode='default',
        # Analysis tools; note that Bash can still modify files
        allowed_tools=['Read', 'Grep', 'Bash'],
        setting_sources=['project'],
    )
    client = ClaudeSDKClient(options)
    try:
        await client.connect()

        # Format task for agent
        artifacts = context['artifacts']
        task = f"""Diagnose ML training failure:
Symptoms: {context['symptoms']}
Artifacts available:
- Training logs: {artifacts['training_logs']}
- Loss curves: {artifacts['loss_curves']}
- Model code: {', '.join(artifacts['model_code'])}
Perform systematic analysis and provide structured diagnosis."""
        await client.query(task)

        # Collect diagnosis text; receive_response() stops after the final result message
        diagnosis = []
        async for message in client.receive_response():
            if isinstance(message, AssistantMessage):
                for block in message.content:
                    if isinstance(block, TextBlock):
                        diagnosis.append(block.text)

        # parse_diagnosis() is defined elsewhere; it is not shown in this document
        return parse_diagnosis(diagnosis)
    finally:
        await client.disconnect()
```
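
The implementation above calls `parse_diagnosis`, which this document does not define. A minimal sketch of what such a function might look like follows, assuming it receives a list of text strings and that the agent embeds a single JSON object in its reply. The fallback shape (`"status": "unparsed"`) is an invention for illustration, not part of the skill's contract.

```python
import json


def parse_diagnosis(chunks: list) -> dict:
    """Hypothetical parser: return the first JSON object found in the agent's text."""
    text = "\n".join(chunks)
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start` and ignores trailing text
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON at this brace; try the next one
            start = text.find("{", start + 1)
    # Nothing parseable: hand back the raw text so no output is silently lost
    return {"status": "unparsed", "raw_text": text}


reply = ["Analysis finished.\n", '{"status": "diagnosis_complete", "root_causes": []}']
print(parse_diagnosis(reply)["status"])  # diagnosis_complete
```

Scanning for the first decodable object tolerates prose before and after the JSON, which is common in model output.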
- `scripts/analyze_loss_curve.py` - Loss curve analysis and visualization
- `scripts/check_gradients.py` - Gradient flow analysis
- `scripts/count_parameters.py` - Model parameter counting and distribution
- `scripts/profile_memory.py` - GPU memory profiling
- `references/common-failure-modes.md` - Catalog of ML training failures
- `references/debugging-checklist.md` - Systematic debugging workflow
- `references/fix-templates.md` - Code templates for common fixes
- `extract_training_metrics()` - Parse logs for key metrics
- `visualize_loss_curve()` - Generate loss/gradient plots
- `analyze_architecture()` - Check model architecture balance

User: "My model was training fine until epoch 7, then loss started increasing. Help debug this."
Skill gathers:
- Training logs from epochs 1-10
- Loss curve data
- trainer.py and model.py
- Hyperparameter config
Agent diagnoses:
- Root cause: Learning rate too high for curriculum transition
- Evidence: Loss increased 15% at epoch 7, gradient norm spiked
- Fix: Reduce learning rate by 50%, add cosine annealing
- Confidence: 95%
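
The kind of evidence cited in this example (a loss increase starting at a specific epoch) can be located mechanically. The sketch below is not the skill's `analyze_loss_curve.py`; `find_divergence_epoch` is a hypothetical helper, the 10% threshold is an arbitrary assumption, and the loss values are made up for illustration.

```python
import math
from typing import Optional


def find_divergence_epoch(losses: list, rel_increase: float = 0.10) -> Optional[int]:
    """Return the 1-based epoch where loss first exceeds the best earlier loss
    by more than rel_increase, or None if no such epoch exists.

    Assumes positive loss values; NaN or inf is treated as divergence.
    """
    best = math.inf
    for epoch, loss in enumerate(losses, start=1):
        if not math.isfinite(loss):
            return epoch
        if math.isfinite(best) and loss > best * (1.0 + rel_increase):
            return epoch
        best = min(best, loss)
    return None


# Illustrative per-epoch losses: steady improvement, then a jump at epoch 7
losses = [2.30, 1.90, 1.60, 1.40, 1.25, 1.15, 1.32, 1.60]
print(find_divergence_epoch(losses))  # 7
```

Comparing against the best loss seen so far, rather than the previous epoch, keeps ordinary epoch-to-epoch noise from triggering a false alarm.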
User: "Model only outputs colons (::::) regardless of input. What's wrong?"
Skill gathers:
- Model checkpoint
- Inference test logs
- Training loss history
- Model architecture code
Agent diagnoses:
- Root cause: Embedding layer has 79% of params, transformer underparameterized
- Evidence: Training loss decreased but model has no capacity to learn patterns
- Fix: Rebalance architecture (50% embeddings, 50% transformers)
- Confidence: 90%
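
The parameter imbalance described in this example can be estimated from a few hyperparameters. This is a rough sketch, not the skill's `count_parameters.py`: `embedding_share` is a hypothetical helper, it uses the common approximation of 12 * d_model^2 weights per transformer block, and the configuration values are invented for illustration.

```python
def embedding_share(vocab_size: int, d_model: int, n_layers: int, tie_output: bool = True) -> float:
    """Estimate the fraction of parameters held by token embeddings.

    Approximates each transformer block as 12 * d_model**2 weights
    (4 * d_model**2 for attention, 8 * d_model**2 for a 4x-wide MLP);
    biases and normalization parameters are ignored.
    """
    embedding = vocab_size * d_model
    if not tie_output:
        embedding *= 2  # separate output projection with the same shape
    blocks = n_layers * 12 * d_model ** 2
    return embedding / (embedding + blocks)


# Illustrative configuration: large vocabulary, small and shallow transformer
print(f"{embedding_share(50_000, 256, 4):.0%}")   # 80%
# Adding depth moves the split toward an even balance
print(f"{embedding_share(50_000, 256, 16):.0%}")  # 50%
```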
User: "Getting warning 'var(): degrees of freedom is <= 0' during training"
Skill gathers:
- Full error traceback
- Gradient statistics from logs
- ACT head implementation code
Agent diagnoses:
- Root cause: ACT variance = 0 (all tokens use same halting steps)
- Evidence: Warning appears in ACT loss computation
- Fix: Add diversity regularization to ACT loss
- Confidence: 98%
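
When checking a diagnosis like this one, it helps to separate two conditions. PyTorch emits the degrees-of-freedom warning when the number of elements being reduced, minus the correction, is zero or negative (for example, a sample variance over a single value). Identical halting steps across several tokens give a variance of exactly zero, which is a different situation. The plain-Python sketch below distinguishes the two; `halting_step_variance` is a hypothetical helper and does not reproduce the ACT head code.

```python
import statistics
from typing import Optional


def halting_step_variance(steps: list) -> Optional[float]:
    """Sample variance (n - 1 denominator) of halting steps, or None when undefined."""
    if len(steps) < 2:
        # n - 1 <= 0: the case the degrees-of-freedom warning describes
        return None
    return statistics.variance(steps)


print(halting_step_variance([3.0]))            # None  (too few samples)
print(halting_step_variance([3.0, 3.0, 3.0]))  # 0.0   (all tokens halt at the same step)
print(halting_step_variance([2.0, 4.0]))       # 2.0
```

If the logged variance is undefined rather than zero, the number of elements reaching the variance call is worth inspecting before changing the loss.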
The skill processes agent diagnosis into user-friendly format:
The ML debugging agent must:
This skill can be used in conjunction with:
If the agent cannot diagnose the issue:
The agent should NEVER:
Test the skill with:

- `agents/ml-debugger-specialist.prompt`
- `index.py`
- `ml-training-debugger-process.dot`
- `tests/README.md`

Next Steps: