ML Training Debugger

Version: 1.0.0
Type: Agent-based skill with SDK implementation
Domain: Machine learning training diagnostics

Description

Diagnose machine learning training failures including loss divergence, mode collapse, gradient issues, architecture problems, and optimization failures. This skill spawns a specialist ML debugging agent that systematically analyzes training artifacts to identify root causes and propose evidence-based fixes.

Use this skill when encountering training failures, when loss curves exhibit pathological behavior, when models produce degenerate outputs, when experiencing GPU memory issues, or when hyperparameter tuning produces inconsistent results.

Triggers

This skill activates when users request:

"Debug my training run"
"Why is my loss diverging?"
"Model outputs are all the same token"
"Training failed at epoch X"
"Help diagnose mode collapse"
"Why are gradients exploding/vanishing?"
"Model not learning anything"
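
One way detection could work is plain pattern matching over the request text. The sketch below assumes that approach; the function name `is_ml_debug_request` and the pattern list are hypothetical, derived from the trigger phrases above, and are not taken from the skill's actual implementation.

```python
import re

# Hypothetical patterns derived from the trigger phrases listed above.
TRIGGER_PATTERNS = [
    r"\bdebug\b.*\btraining\b",
    r"\bloss\b.*\b(diverg\w*|increas\w*)",
    r"\bsame token\b",
    r"\btraining failed\b",
    r"\bmode collapse\b",
    r"\bgradients?\b.*\b(explod\w*|vanish\w*)",
    r"\bnot learning\b",
]


def is_ml_debug_request(text: str) -> bool:
    """Return True if the request matches any trigger pattern (case-insensitive)."""
    lowered = text.lower()
    return any(re.search(pattern, lowered) for pattern in TRIGGER_PATTERNS)
```

Keyword matching is cheap but brittle; a real skill may rely on the host's own intent matching instead.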

Skill Architecture

Skill Layer (Lightweight)

The skill handles:

Detection: Identify ML training debugging requests
Context Gathering: Collect training logs, loss curves, model code
Agent Spawning: Invoke ML debugging specialist with context
Result Processing: Format diagnosis and fixes for user

Agent Layer (Specialist)

The ML debugging agent handles:

Systematic Analysis: Apply debugging methodology to artifacts
Root Cause Identification: Diagnose underlying issues
Fix Prioritization: Rank solutions by impact
Evidence-Based Recommendations: Propose fixes with reasoning

Communication Protocol

Skill → Agent Context Package

{
  "task": "Diagnose training failure",
  "artifacts": {
    "training_logs": "path/to/logs.txt",
    "loss_curves": "path/to/losses.csv",
    "model_code": ["model.py", "trainer.py"],
    "error_messages": ["error1.txt"],
    "config": "config.yaml"
  },
  "symptoms": [
    "Loss diverged at epoch 7",
    "Mode collapse to single token",
    "Gradient norm exploded"
  ],
  "constraints": {
    "max_analysis_time": "5 minutes",
    "output_format": "structured_diagnosis"
  }
}
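
A minimal sketch of how the skill layer might assemble this package, assuming artifacts are local file paths. The helper name `build_context_package` and the extra `missing_artifacts` field are hypothetical additions for illustration; they are not part of the schema shown above.

```python
import os


def build_context_package(artifacts: dict, symptoms: list) -> dict:
    """Assemble the skill-to-agent package, keeping only artifacts that exist on disk."""
    available, missing = {}, []
    for key, value in artifacts.items():
        paths = value if isinstance(value, list) else [value]
        found = [p for p in paths if os.path.exists(p)]
        if found:
            # Preserve the original shape: a list stays a list, a string stays a string.
            available[key] = found if isinstance(value, list) else found[0]
        else:
            missing.append(key)
    return {
        "task": "Diagnose training failure",
        "artifacts": available,
        "missing_artifacts": missing,  # hypothetical field, not in the schema above
        "symptoms": symptoms,
        "constraints": {
            "max_analysis_time": "5 minutes",
            "output_format": "structured_diagnosis",
        },
    }
```

Recording what is missing up front lets the agent work with the available data and ask for the rest only if it needs it.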

Agent → Skill Results

{
  "status": "diagnosis_complete",
  "root_causes": [
    {
      "issue": "Learning rate too high for Muon optimizer",
      "severity": "critical",
      "evidence": ["grad_norm spike at step 24590", "loss increased 15% in epoch 7"],
      "fix": "Reduce muon_lr from 1e-2 to 5e-3",
      "confidence": 0.95
    }
  ],
  "quick_fixes": ["Reduce LR by 50%", "Enable gradient clipping"],
  "analysis_artifacts": {
    "gradient_analysis": "path/to/grad_analysis.md",
    "loss_visualization": "path/to/loss_plot.png"
  }
}
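
On the receiving side, the skill could load this result into typed records before formatting it. The sketch below assumes the result arrives as a JSON string with exactly the fields shown above; the `RootCause` class and `parse_result` function are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass
class RootCause:
    issue: str
    severity: str
    evidence: list
    fix: str
    confidence: float


def parse_result(raw: str) -> list:
    """Parse the agent's JSON result into RootCause records.

    Returns an empty list when the status is anything other than
    "diagnosis_complete".
    """
    payload = json.loads(raw)
    if payload.get("status") != "diagnosis_complete":
        return []
    return [RootCause(**item) for item in payload.get("root_causes", [])]
```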

Agent Spawning Logic

import asyncio

from claude_agent_sdk import (
    AssistantMessage,
    ClaudeAgentOptions,
    ClaudeSDKClient,
    TextBlock,
)

async def execute_ml_debugger(context: dict):
    """Spawn ML debugging specialist agent."""

    # Load specialist agent prompt
    with open('agents/ml-debugger-specialist.prompt', 'r') as f:
        specialist_prompt = f.read()

    # Configure agent
    options = ClaudeAgentOptions(
        model='claude-sonnet-4-5',
        system_prompt=specialist_prompt,
        permission_mode='default',  # Standard permission checks (not read-only by itself)
        allowed_tools=['Read', 'Grep', 'Bash'],  # No Write/Edit tools; Bash can still change files
        setting_sources=['project']
    )

    client = ClaudeSDKClient(options)

    try:
        await client.connect()

        # Format task for agent
        task = f"""Diagnose ML training failure:

Symptoms: {context['symptoms']}

Artifacts available:
- Training logs: {context['artifacts']['training_logs']}
- Loss curves: {context['artifacts']['loss_curves']}
- Model code: {', '.join(context['artifacts']['model_code'])}

Perform systematic analysis and provide structured diagnosis."""

        await client.query(task)

        # Collect assistant text until the final result message arrives
        diagnosis = []
        async for message in client.receive_response():
            if isinstance(message, AssistantMessage):
                for block in message.content:
                    if isinstance(block, TextBlock):
                        diagnosis.append(block.text)

        # parse_diagnosis is not defined in this listing; it is assumed to live elsewhere
        return parse_diagnosis(diagnosis)

    finally:
        await client.disconnect()
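
The listing above calls `parse_diagnosis` without defining it. A minimal sketch of what such a helper might look like follows, assuming the agent embeds a single JSON object somewhere in its text output. The fallback shape with `"status": "unparsed"` is a hypothetical convention, not part of the protocol above.

```python
import json


def parse_diagnosis(chunks: list) -> dict:
    """Extract a JSON object spanning the first '{' to the last '}' in the output.

    Falls back to a raw-text result when no valid JSON object is found.
    """
    text = "\n".join(str(chunk) for chunk in chunks)
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            pass  # fall through to the raw-text result
    return {"status": "unparsed", "raw_text": text}
```

Returning the raw text instead of raising keeps a partial diagnosis available to the user when the agent does not follow the output format.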

Resources

Scripts

scripts/analyze_loss_curve.py - Loss curve analysis and visualization
scripts/check_gradients.py - Gradient flow analysis
scripts/count_parameters.py - Model parameter counting and distribution
scripts/profile_memory.py - GPU memory profiling
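
The contents of these scripts are not shown here. As an illustration of the kind of check a loss-curve script might perform, the sketch below flags the first epoch whose loss rises by more than a relative threshold over the previous epoch. The function name `find_divergence` and the 10% default are assumptions.

```python
from typing import List, Optional


def find_divergence(losses: List[float], rel_increase: float = 0.10) -> Optional[int]:
    """Return the 1-based epoch where loss first rises by more than
    rel_increase relative to the previous epoch, or None if it never does."""
    for i in range(1, len(losses)):
        prev, cur = losses[i - 1], losses[i]
        if prev > 0 and (cur - prev) / prev > rel_increase:
            return i + 1  # epochs are numbered from 1
    return None
```

A single-epoch comparison is sensitive to noise; smoothing the curve first may be preferable on step-level data.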

References

references/common-failure-modes.md - Catalog of ML training failures
references/debugging-checklist.md - Systematic debugging workflow
references/fix-templates.md - Code templates for common fixes

Custom Tools

extract_training_metrics() - Parse logs for key metrics
visualize_loss_curve() - Generate loss/gradient plots
analyze_architecture() - Check model architecture balance

Usage Examples

Example 1: Loss Divergence

User: "My model was training fine until epoch 7, then loss started increasing. Help debug this."

Skill gathers:
- Training logs from epochs 1-10
- Loss curve data
- trainer.py and model.py
- Hyperparameter config

Agent diagnoses:
- Root cause: Learning rate too high for curriculum transition
- Evidence: Loss increased 15% at epoch 7, gradient norm spiked
- Fix: Reduce learning rate by 50%, add cosine annealing
- Confidence: 95%
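
The proposed fix combines a halved learning rate with cosine annealing. A minimal sketch of that schedule in plain Python follows; the function name `cosine_lr` is hypothetical, and the base rate in the usage note is illustrative rather than taken from the user's config.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate: base_lr at step 0, min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, with an illustrative original rate of `1e-2`, the halved base rate would be `5e-3`, and `cosine_lr(step, total_steps, 5e-3)` then decays smoothly toward `min_lr` over the run.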

Example 2: Mode Collapse

User: "Model only outputs colons (::::) regardless of input. What's wrong?"

Skill gathers:
- Model checkpoint
- Inference test logs
- Training loss history
- Model architecture code

Agent diagnoses:
- Root cause: Embedding layer has 79% of params, transformer underparameterized
- Evidence: Training loss decreased but model has no capacity to learn patterns
- Fix: Rebalance architecture (50% embeddings, 50% transformers)
- Confidence: 90%
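
To make the imbalance check concrete, the sketch below estimates how parameters split between embeddings and transformer blocks. It assumes a standard block with roughly 4·d² attention weights and 8·d² MLP weights (4x expansion), ignoring biases and layer norms; the function name `param_shares` is hypothetical, and the numbers in the usage note are illustrative, not the model from this example.

```python
def param_shares(vocab_size: int, d_model: int, n_layers: int) -> dict:
    """Rough parameter split for a transformer, ignoring biases and layer norms.

    Assumes 4*d^2 attention weights and 8*d^2 MLP weights per layer.
    """
    embedding = vocab_size * d_model
    transformer = n_layers * 12 * d_model ** 2
    total = embedding + transformer
    return {
        "embedding": embedding / total,
        "transformer": transformer / total,
    }
```

With an illustrative configuration of `vocab_size=50000`, `d_model=256`, and `n_layers=4`, the embedding share comes out well above half, which is the kind of skew the diagnosis describes.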

Example 3: Gradient Issues

User: "Getting warning 'var(): degrees of freedom is <= 0' during training"

Skill gathers:
- Full error traceback
- Gradient statistics from logs
- ACT head implementation code

Agent diagnoses:
- Root cause: ACT variance = 0 (all tokens use same halting steps)
- Evidence: Warning appears in ACT loss computation
- Fix: Add diversity regularization to ACT loss
- Confidence: 98%
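
In PyTorch, this warning is typically raised when `var()` is computed over too few elements for the chosen correction, which is a separate condition from a variance that is exactly zero. A check that tells the two cases apart may help confirm the diagnosis. The sketch below uses only the standard library; the function name `halting_step_variance` and the status labels are hypothetical.

```python
import statistics


def halting_step_variance(steps: list) -> tuple:
    """Return (sample variance, status) for per-token halting steps.

    Status is "too_few_samples" when fewer than two values are given,
    "collapsed" when every token halts at the same step, and "ok" otherwise.
    """
    if len(steps) < 2:
        return 0.0, "too_few_samples"
    variance = float(statistics.variance(steps))
    status = "collapsed" if variance == 0.0 else "ok"
    return variance, status
```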

Result Processing

The skill converts the agent's diagnosis into a user-friendly format:

Extract Root Causes: Parse structured diagnosis
Prioritize Fixes: Rank by impact and confidence
Format Recommendations: Present as actionable steps
Include Evidence: Show supporting data/logs
Generate Visualizations: Create loss plots, gradient heatmaps
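
The ranking and formatting steps above can be sketched as follows. The severity weights and the function name `format_diagnosis` are assumptions; the document does not specify how impact and confidence are combined, so the product used here is only one plausible choice.

```python
# Hypothetical severity scale; unknown severities get weight 0.
SEVERITY_WEIGHT = {"critical": 3, "major": 2, "minor": 1}


def format_diagnosis(root_causes: list) -> str:
    """Rank root causes by severity weight times confidence and render them as steps."""
    ranked = sorted(
        root_causes,
        key=lambda c: SEVERITY_WEIGHT.get(c["severity"], 0) * c["confidence"],
        reverse=True,
    )
    lines = []
    for i, cause in enumerate(ranked, start=1):
        lines.append(f"{i}. {cause['issue']} (confidence {cause['confidence']:.0%})")
        lines.append(f"   Fix: {cause['fix']}")
        for item in cause["evidence"]:
            lines.append(f"   Evidence: {item}")
    return "\n".join(lines)
```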

Quality Standards

The ML debugging agent must:

✅ Identify root cause with >80% confidence or request more data
✅ Provide evidence from actual artifacts (not speculation)
✅ Propose fixes with expected impact and reasoning
✅ Complete analysis within 5 minutes for typical cases
✅ Handle missing artifacts gracefully (work with available data)
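
The first and last standards can be read as a simple gate: report when some root cause clears the 80% bar, otherwise ask for missing data, otherwise return a partial diagnosis. A minimal sketch under that reading follows; the function name `next_action` and the action labels are hypothetical.

```python
CONFIDENCE_THRESHOLD = 0.80  # from the ">80% confidence" standard above


def next_action(root_causes: list, missing_artifacts: list) -> dict:
    """Decide whether to report, request more data, or return a partial diagnosis."""
    confident = [c for c in root_causes if c["confidence"] > CONFIDENCE_THRESHOLD]
    if confident:
        return {"action": "report", "root_causes": confident}
    if missing_artifacts:
        return {"action": "request_artifacts", "artifacts": missing_artifacts}
    # Nothing more to request: surface what is known, flagged as low confidence.
    return {"action": "partial_diagnosis", "root_causes": root_causes}
```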

Integration with Other Skills

This skill can be used in conjunction with:

ml-expertise skill for implementing fixes
code-analyzer skill for architecture review
functionality-audit skill for validating fixes

Failure Modes and Escalation

If the agent cannot diagnose the issue:

Request additional artifacts (specific logs, config files)
Provide partial diagnosis with lower confidence
Suggest alternative debugging approaches
Escalate to user with specific questions

The agent should NEVER:

Guess at root causes without evidence
Propose fixes that could corrupt training state
Modify code directly (read-only mode)

Testing

Test the skill with:

Real Phase 1 training failure (loss divergence at epoch 7)
Synthetic mode collapse scenario
Architecture imbalance case (79% embedding params)
Gradient explosion/vanishing cases
Missing artifacts scenario

Documentation

Agent system prompt: agents/ml-debugger-specialist.prompt
SDK implementation: index.py
Process visualization: ml-training-debugger-process.dot
Testing guide: tests/README.md

Next Steps

Create agent system prompt with ML debugging expertise
Implement SDK-based agent spawning
Add custom analysis tools
Test on Phase 1 training failures
Iterate based on real debugging sessions

