madappgang/performance-tracking

Name: performance-tracking
Author: madappgang

plugins/multimodel/skills/performance-tracking/SKILL.md

npx skillsauth add madappgang/magus performance-tracking

Performance Tracking

Version: 1.0.0
Purpose: Track agent, skill, and model performance metrics for continuous optimization
Status: Production Ready

Overview

Performance tracking transforms workflows from "fire and forget" to data-driven optimization systems. By measuring what actually works, you can route tasks more effectively, identify failing patterns early, and reduce costs.

This skill provides battle-tested patterns for:

Agent success tracking (completion rates, confidence scores, task type affinity)
Skill effectiveness (activation counts, success correlation, usage patterns)
Model performance (latency, cost, quality, provider comparison)
Routing optimization (tier distribution, routing accuracy, cost efficiency)
Historical analysis (trend detection, degradation alerts, pattern discovery)

Performance tracking enables continuous improvement by providing the data needed to make informed decisions about agent selection, model choice, and workflow routing.

Why Track Performance

Optimize Routing:

Identify which agents excel at specific task types
Route complex tasks to high-confidence agents
Avoid agents with low success rates for critical work

Identify Failing Agents:

Detect agents with <70% success rate
Alert when agent performance degrades
Replace or retrain underperforming agents

Reduce Costs:

Find cost-effective model alternatives
Identify expensive agents with low success rates
Optimize tier thresholds based on actual performance

Improve Quality:

Track correlation between confidence scores and success
Identify patterns in successful implementations
Learn which models produce best results for task types

What We Track

Agent Metrics:

Total runs, success/failure counts
Average confidence scores
Task type distribution
Last used timestamp
Individual execution history

Skill Metrics:

Activation counts per skill
Last activation timestamps
Success correlation (success rate of tasks run while the skill is active)
Co-activation patterns

Model Metrics:

Total runs, success/failure counts
Average latency (response time)
Total cost (cumulative spend)
Cost per successful task
Last used timestamp

Routing Metrics:

Tier distribution (how often each tier is selected)
Routing accuracy (did the selected tier match the task's complexity?)
Cost efficiency (tier1 vs tier4 cost ratio)
Decision history with outcomes

Integration with task-complexity-router

The performance tracker provides critical feedback to the task-complexity-router:

Routing Feedback Loop:

1. Router selects tier based on complexity
   → task-complexity-router analyzes task
   → Routes to tier2 (medium complexity)

2. Agent executes task
   → Records: tier=2, agent=ui-developer, result=success

3. Performance tracker updates metrics
   → tier2 usage +1
   → ui-developer success +1
   → Confidence in tier2 routing increases

4. Future routing decisions informed by history
   → Router sees tier2 has 85% success rate
   → Router sees ui-developer excels at UI tasks
   → Router confidently routes similar tasks to tier2

Metrics Schema

JSON Structure (Version 1.0.0)

Store performance metrics in .claude/agent-performance.json:

{
  "schemaVersion": "1.0.0",
  "lastUpdated": "2026-01-28T15:30:00Z",
  "agents": {
    "ui-developer": {
      "totalRuns": 42,
      "successCount": 38,
      "failureCount": 4,
      "avgConfidence": 0.85,
      "lastUsed": "2026-01-28T15:30:00Z",
      "taskTypes": {
        "implement-component": 15,
        "fix-styling": 12,
        "refactor-ui": 8,
        "review-code": 7
      },
      "history": [
        {
          "timestamp": "2026-01-28T15:30:00Z",
          "taskType": "implement-component",
          "result": "success",
          "confidence": 0.90,
          "duration": 45000,
          "tier": 2,
          "model": "sonnet"
        },
        {
          "timestamp": "2026-01-28T14:20:00Z",
          "taskType": "fix-styling",
          "result": "success",
          "confidence": 0.85,
          "duration": 30000,
          "tier": 1,
          "model": "sonnet"
        }
      ]
    },
    "backend-developer": {
      "totalRuns": 28,
      "successCount": 25,
      "failureCount": 3,
      "avgConfidence": 0.88,
      "lastUsed": "2026-01-28T14:00:00Z",
      "taskTypes": {
        "implement-api": 12,
        "fix-bug": 8,
        "database-migration": 5,
        "write-tests": 3
      },
      "history": []
    }
  },
  "skills": {
    "multi-model-validation": {
      "activations": 15,
      "lastActivated": "2026-01-28T15:00:00Z",
      "successCorrelation": 0.92,
      "coActivations": {
        "quality-gates": 12,
        "error-recovery": 8
      }
    },
    "task-complexity-router": {
      "activations": 68,
      "lastActivated": "2026-01-28T15:30:00Z",
      "successCorrelation": 0.85,
      "coActivations": {
        "multi-agent-coordination": 45,
        "hierarchical-coordinator": 30
      }
    }
  },
  "models": {
    "sonnet": {
      "totalRuns": 120,
      "successCount": 108,
      "failureCount": 12,
      "avgLatency": 2500,
      "totalCost": 0.45,
      "lastUsed": "2026-01-28T15:30:00Z",
      "taskTypePerformance": {
        "code-review": { "success": 25, "failure": 2 },
        "implementation": { "success": 40, "failure": 5 },
        "testing": { "success": 20, "failure": 3 }
      }
    },
    "grok": {
      "totalRuns": 35,
      "successCount": 30,
      "failureCount": 5,
      "avgLatency": 1800,
      "totalCost": 0.08,
      "lastUsed": "2026-01-28T13:00:00Z",
      "taskTypePerformance": {
        "code-review": { "success": 18, "failure": 2 },
        "implementation": { "success": 12, "failure": 3 }
      }
    }
  },
  "routing": {
    "tierDistribution": {
      "tier1": 45,
      "tier2": 30,
      "tier3": 15,
      "tier4": 8
    },
    "decisions": [
      {
        "timestamp": "2026-01-28T15:30:00Z",
        "taskType": "implement-component",
        "complexity": "medium",
        "selectedTier": 2,
        "agent": "ui-developer",
        "result": "success",
        "cost": 0.003
      },
      {
        "timestamp": "2026-01-28T14:20:00Z",
        "taskType": "fix-styling",
        "complexity": "low",
        "selectedTier": 1,
        "agent": "ui-developer",
        "result": "success",
        "cost": 0.001
      }
    ]
  }
}

Schema Field Definitions

Agent Metrics:

totalRuns: Total task executions
successCount: Tasks completed successfully
failureCount: Tasks that failed or required retry
avgConfidence: Rolling average of agent confidence scores (0.0-1.0)
lastUsed: ISO-8601 timestamp of last execution
taskTypes: Distribution of task types (understand agent specialization)
history: Array of recent executions (max 100 entries, FIFO)

Skill Metrics:

activations: Total times skill was triggered
lastActivated: ISO-8601 timestamp
successCorrelation: Success rate when this skill is active (0.0-1.0)
coActivations: Skills frequently activated together (detect patterns)

Model Metrics:

totalRuns: Total executions
successCount/failureCount: Outcome tracking
avgLatency: Average response time in milliseconds
totalCost: Cumulative spend in USD
lastUsed: ISO-8601 timestamp
taskTypePerformance: Success/failure breakdown by task type

Routing Metrics:

tierDistribution: Count of tasks routed to each tier
decisions: Array of routing decisions with outcomes (max 100, FIFO)
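
A minimal sketch of reading and writing a file with this schema from Python. The helper names (empty_metrics, load_metrics, save_metrics) are hypothetical and not part of the skill; only the top-level keys and the 1.0.0 version string come from the schema above.

```python
import json
from pathlib import Path

SCHEMA_VERSION = "1.0.0"


def empty_metrics(now: str) -> dict:
    """Return an empty document with the top-level keys of schema 1.0.0."""
    return {
        "schemaVersion": SCHEMA_VERSION,
        "lastUpdated": now,
        "agents": {},
        "skills": {},
        "models": {},
        "routing": {"tierDistribution": {}, "decisions": []},
    }


def load_metrics(path: Path, now: str) -> dict:
    """Load the metrics file, or start from an empty document if it is missing."""
    if not path.exists():
        return empty_metrics(now)
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def save_metrics(path: Path, metrics: dict, now: str) -> None:
    """Stamp lastUpdated and write the document back to disk."""
    metrics["lastUpdated"] = now
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metrics, indent=2), encoding="utf-8")
```

The caller passes the ISO-8601 timestamp in, which keeps the helpers deterministic and easy to test.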

Tracking Patterns

Pattern 1: Capturing Agent Performance

After Agent Completes Task:

Execution Flow:

1. Agent executes task
   Agent: ui-developer
   Input: "Implement login form component"
   Result: Success
   Confidence: 0.90
   Duration: 45 seconds
   Tier: 2
   Model: sonnet

2. Update agent metrics
   Read: .claude/agent-performance.json
   Update:
     agents["ui-developer"].totalRuns += 1
     agents["ui-developer"].successCount += 1
     agents["ui-developer"].avgConfidence = rolling_avg(0.90)
     agents["ui-developer"].lastUsed = NOW
     agents["ui-developer"].taskTypes["implement-component"] += 1
     agents["ui-developer"].history.push({
       timestamp: NOW,
       taskType: "implement-component",
       result: "success",
       confidence: 0.90,
       duration: 45000,
       tier: 2,
       model: "sonnet"
     })
   Trim history if > 100 entries
   Write: .claude/agent-performance.json

3. Calculate derived metrics
   Success rate: successCount / totalRuns = 38/42 = 90.5%
   Avg duration: sum(history.duration) / history.length
   Task affinity: taskTypes sorted by count

After Agent Fails:

Failure Flow:

1. Agent fails task
   Agent: backend-developer
   Input: "Implement complex payment flow"
   Result: Failure (error, timeout, or low quality)
   Confidence: 0.65
   Tier: 3

2. Update failure metrics
   agents["backend-developer"].totalRuns += 1
   agents["backend-developer"].failureCount += 1
   agents["backend-developer"].avgConfidence = rolling_avg(0.65)
   agents["backend-developer"].history.push({
     timestamp: NOW,
     taskType: "implement-api",
     result: "failure",
     confidence: 0.65,
     duration: 120000,
     tier: 3,
     error: "Exceeded max iterations"
   })

3. Check for degradation
   If failureCount / totalRuns > 0.30:
     Alert: "backend-developer has >30% failure rate"
     Recommendation: "Review recent failures, retrain, or replace"
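
The success and failure flows above could be folded into one update function, sketched here under some assumptions: record_agent_run and is_degraded are hypothetical names, and rolling_avg is interpreted as an incremental mean over all runs (the skill does not define it, so a windowed or exponential average would be equally valid).

```python
HISTORY_LIMIT = 100  # max history entries per agent (FIFO), per the schema


def record_agent_run(metrics: dict, agent: str, entry: dict) -> dict:
    """Fold one execution record into metrics["agents"][agent]."""
    stats = metrics["agents"].setdefault(agent, {
        "totalRuns": 0, "successCount": 0, "failureCount": 0,
        "avgConfidence": 0.0, "lastUsed": None,
        "taskTypes": {}, "history": [],
    })
    prior_runs = stats["totalRuns"]
    stats["totalRuns"] = prior_runs + 1
    if entry["result"] == "success":
        stats["successCount"] += 1
    else:
        stats["failureCount"] += 1
    # Incremental mean: an assumed reading of rolling_avg(...)
    stats["avgConfidence"] = (
        stats["avgConfidence"] * prior_runs + entry["confidence"]
    ) / stats["totalRuns"]
    stats["lastUsed"] = entry["timestamp"]
    task_type = entry["taskType"]
    stats["taskTypes"][task_type] = stats["taskTypes"].get(task_type, 0) + 1
    stats["history"].append(entry)
    del stats["history"][:-HISTORY_LIMIT]  # drop the oldest entries beyond the limit
    return stats


def is_degraded(stats: dict, max_failure_rate: float = 0.30) -> bool:
    """True when the failure rate exceeds the 30% alert threshold."""
    runs = stats["totalRuns"]
    return runs > 0 and stats["failureCount"] / runs > max_failure_rate
```

Calling is_degraded after every failed run gives the degradation check from step 3 of the failure flow.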

Pattern 2: Tracking Model Performance

After Model Execution:

Execution Flow:

1. Model completes task
   Model: grok
   Task: Code review
   Latency: 1800ms
   Cost: $0.002
   Result: Success

2. Update model metrics
   models["grok"].totalRuns += 1
   models["grok"].successCount += 1
   models["grok"].avgLatency = rolling_avg(1800)
   models["grok"].totalCost += 0.002
   models["grok"].lastUsed = NOW
   models["grok"].taskTypePerformance["code-review"].success += 1

3. Compare model performance
   Claude Sonnet: avgLatency=2500ms, cost=$0.45 (120 runs)
   Grok Fast: avgLatency=1800ms, cost=$0.08 (35 runs)

   Analysis:
   - Grok is 28% faster (1800ms vs 2500ms)
   - Grok is 39% cheaper per run ($0.0023 vs $0.0038)
   - Both have similar success rates (86% vs 90%)

   Recommendation:
   - Use Grok for cost-sensitive tasks
   - Use Claude for critical tasks (higher success rate)
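
As a rough sketch of the model update and comparison steps (record_model_run and relative_saving are hypothetical names, and the incremental mean for avgLatency is an assumption about what rolling_avg means):

```python
def record_model_run(metrics, model, task_type, latency_ms, cost_usd, success, now):
    """Apply one model execution to metrics["models"][model]."""
    stats = metrics["models"].setdefault(model, {
        "totalRuns": 0, "successCount": 0, "failureCount": 0,
        "avgLatency": 0.0, "totalCost": 0.0, "lastUsed": None,
        "taskTypePerformance": {},
    })
    prior_runs = stats["totalRuns"]
    stats["totalRuns"] = prior_runs + 1
    # Incremental mean over all runs (assumed meaning of rolling_avg)
    stats["avgLatency"] = (stats["avgLatency"] * prior_runs + latency_ms) / stats["totalRuns"]
    stats["totalCost"] += cost_usd
    stats["lastUsed"] = now
    outcome = "success" if success else "failure"
    stats[outcome + "Count"] += 1
    per_type = stats["taskTypePerformance"].setdefault(
        task_type, {"success": 0, "failure": 0}
    )
    per_type[outcome] += 1
    return stats


def relative_saving(baseline: float, candidate: float) -> float:
    """Fraction by which candidate is lower than baseline."""
    return (baseline - candidate) / baseline


# Figures from the comparison above: total cost divided by run count
sonnet_per_run = 0.45 / 120
grok_per_run = 0.08 / 35
print(f"cost per run: {relative_saving(sonnet_per_run, grok_per_run):.0%} lower")  # 39% lower
print(f"latency: {relative_saving(2500, 1800):.0%} lower")  # 28% lower
```

Comparing per-run cost rather than cumulative totalCost matters here because the two models have very different run counts.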

Pattern 3: Recording Skill Activation

After Skill Activation:

Activation Flow:

1. Skill triggers
   Skill: multi-model-validation
   Context: User requested /review with 3 models

2. Update skill metrics
   skills["multi-model-validation"].activations += 1
   skills["multi-model-validation"].lastActivated = NOW

3. Track co-activation
   Active skills: ["multi-model-validation", "quality-gates"]
   skills["multi-model-validation"].coActivations["quality-gates"] += 1

4. Calculate success correlation
   Tasks with this skill active: 15
   Successful tasks: 14
   Success correlation: 14/15 = 93.3%

5. Pattern detection
   Observation: multi-model-validation + quality-gates = 100% success (12/12)
   Recommendation: Always pair these skills for high-quality reviews
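
The activation flow above might look like this in Python. This is only a sketch: record_skill_activation and success_correlation are hypothetical helpers, and the caller is assumed to know which skills are active alongside the one that triggered.

```python
def record_skill_activation(metrics, skill, active_skills, now):
    """Count one activation of skill and its co-activation with other active skills."""
    stats = metrics["skills"].setdefault(skill, {
        "activations": 0, "lastActivated": None,
        "successCorrelation": 0.0, "coActivations": {},
    })
    stats["activations"] += 1
    stats["lastActivated"] = now
    # Every other skill active at the same time counts as a co-activation
    for other in sorted(set(active_skills) - {skill}):
        stats["coActivations"][other] = stats["coActivations"].get(other, 0) + 1
    return stats


def success_correlation(successes: int, tasks_with_skill: int) -> float:
    """Success rate of tasks that ran with the skill active (0.0-1.0)."""
    return successes / tasks_with_skill if tasks_with_skill else 0.0


metrics = {"skills": {}}
record_skill_activation(
    metrics, "multi-model-validation",
    ["multi-model-validation", "quality-gates"], "2026-01-28T15:00:00Z",
)
print(round(success_correlation(14, 15), 3))  # 0.933
```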

Pattern 4: Routing Decision Tracking

After Routing Decision:

Routing Flow:

1. Router selects tier
   Task: "Implement user profile page"
   Analysis: Medium complexity (multiple components, state management)
   Selected tier: 2
   Agent: ui-developer
   Model: sonnet

2. Record routing decision
   routing.tierDistribution["tier2"] += 1
   routing.decisions.push({
     timestamp: NOW,
     taskType: "implement-component",
     complexity: "medium",
     selectedTier: 2,
     agent: "ui-developer",
     result: "pending"
   })

3. After task completes
   Update decision with result:
   routing.decisions[last].result = "success"
   routing.decisions[last].cost = 0.003

4. Trim decision history if > 100 entries
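
One way to implement the routing flow above, sketched with hypothetical helpers (record_decision, complete_decision). Instead of addressing decisions[last], this variant hands the pending record back to the caller, which is a design choice of the sketch: it stays correct if another decision is recorded before the task finishes.

```python
DECISION_LIMIT = 100  # max stored routing decisions (FIFO)


def record_decision(routing: dict, decision: dict) -> dict:
    """Append a pending decision and return it so it can be completed later."""
    tier_key = f"tier{decision['selectedTier']}"
    dist = routing["tierDistribution"]
    dist[tier_key] = dist.get(tier_key, 0) + 1
    record = {**decision, "result": "pending"}
    routing["decisions"].append(record)
    del routing["decisions"][:-DECISION_LIMIT]  # trim oldest entries
    return record


def complete_decision(record: dict, result: str, cost: float) -> None:
    """Fill in the outcome once the routed task has finished."""
    record["result"] = result
    record["cost"] = cost


routing = {"tierDistribution": {}, "decisions": []}
pending = record_decision(routing, {
    "timestamp": "2026-01-28T15:30:00Z",
    "taskType": "implement-component",
    "complexity": "medium",
    "selectedTier": 2,
    "agent": "ui-developer",
})
complete_decision(pending, "success", 0.003)
```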

Pattern 5: Session-Level Aggregation

End of Session Summary:

Session Summary Flow:

1. Aggregate session metrics
   Session ID: 2026-01-28-session-15
   Duration: 2 hours
   Tasks executed: 15
   Success rate: 14/15 = 93.3%
   Total cost: $0.045
   Models used: Claude (12), Grok (3)

2. Create session snapshot
   File: ai-docs/performance-history/2026-01-28-session-15.json
   Content:
     {
       "sessionId": "2026-01-28-session-15",
       "startTime": "2026-01-28T13:00:00Z",
       "endTime": "2026-01-28T15:00:00Z",
       "duration": 7200000,
       "tasks": 15,
       "successRate": 0.933,
       "totalCost": 0.045,
       "modelUsage": { "claude": 12, "grok": 3 },
       "topAgents": ["ui-developer", "backend-developer"],
       "activeSkills": ["task-complexity-router", "multi-model-validation"]
     }

3. Update rolling metrics
   .claude/agent-performance.json (persistent)
   ai-docs/performance-history/ (snapshots)

4. Cleanup old snapshots
   Keep last 100 session snapshots
   Delete older entries
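
A sketch of the snapshot and cleanup steps, assuming the snapshot layout shown above. write_snapshot and prune_snapshots are hypothetical names. Files are ordered by their startTime field rather than by filename, because names such as session-10 and session-2 do not sort chronologically as strings.

```python
import json
from pathlib import Path

MAX_SNAPSHOTS = 100  # retention limit for session snapshots


def write_snapshot(directory: Path, snapshot: dict) -> Path:
    """Write one session snapshot as {sessionId}.json."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{snapshot['sessionId']}.json"
    path.write_text(json.dumps(snapshot, indent=2), encoding="utf-8")
    return path


def prune_snapshots(directory: Path, keep: int = MAX_SNAPSHOTS) -> list:
    """Delete the oldest snapshots beyond keep; return the removed paths."""
    def start_time(path: Path) -> str:
        # ISO-8601 UTC timestamps in one format sort correctly as strings
        return json.loads(path.read_text(encoding="utf-8"))["startTime"]

    files = sorted(directory.glob("*-session-*.json"), key=start_time)
    excess = files[: max(0, len(files) - keep)]
    for path in excess:
        path.unlink()
    return excess
```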

File Location and Management

Primary Performance File

Location: .claude/agent-performance.json

Purpose: Persistent, project-level performance tracking

When to Update:

After every agent execution
After every model execution
After every skill activation
After every routing decision

Format: JSON schema version 1.0.0 (see Metrics Schema section)

Rotation: Keep aggregate counters indefinitely, but trim each history and decisions array to its 100 most recent entries

Session Snapshots

Location: ai-docs/performance-history/

Purpose: Point-in-time session summaries for historical analysis

Naming: {YYYY-MM-DD}-session-{N}.json

Example:

ai-docs/performance-history/
  2026-01-28-session-1.json
  2026-01-28-session-2.json
  2026-01-27-session-1.json
  ...

Retention: Keep last 100 sessions, delete older

Integration with Existing Files

Relationship with ai-docs/llm-performance.json:

Comparison:

llm-performance.json (existing):
  - Model-specific performance
  - Cost tracking per model
  - Response time tracking
  - Used by multi-model-validation

agent-performance.json (new):
  - Agent-level metrics (multi-run aggregation)
  - Skill activation tracking
  - Routing decision history
  - Task type affinity

Integration:
  - agent-performance.json imports model data from llm-performance.json
  - Both files updated in parallel
  - llm-performance.json focuses on single-run details
  - agent-performance.json focuses on aggregate trends

Migration Path:

Step 1: Create .claude/agent-performance.json with schema 1.0.0
Step 2: Import historical data from llm-performance.json
Step 3: Update both files going forward
Step 4: Deprecate llm-performance.json after 6 months (optional)

Data Cleanup and Rotation

Automatic Cleanup:

Cleanup Rules:

1. Agent history arrays
   Max entries: 100
   Strategy: FIFO (oldest removed first)
   Trigger: After every agent execution

2. Routing decision arrays
   Max entries: 100
   Strategy: FIFO
   Trigger: After every routing decision

3. Session snapshots
   Max files: 100
   Strategy: FIFO (delete oldest session files)
   Trigger: After every session ends

4. Skill co-activation maps
   Max entries per skill: 50
   Strategy: Keep top 50 by count
   Trigger: Weekly cleanup
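
Rule 4 differs from the FIFO rules because it keeps entries by count rather than by age. A small sketch of that trimming step (trim_co_activations is a hypothetical helper name):

```python
def trim_co_activations(skill_stats: dict, keep: int = 50) -> None:
    """Keep only the `keep` most frequent co-activation partners of a skill."""
    co = skill_stats["coActivations"]
    top = sorted(co.items(), key=lambda item: item[1], reverse=True)[:keep]
    skill_stats["coActivations"] = dict(top)
```

Ties at the cut-off are resolved by insertion order, since Python's sort is stable.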

Manual Cleanup:

When to manually reset:

1. After major workflow changes
   - Agent capabilities changed
   - New skills added
   - Routing logic updated
   → Reset metrics to start fresh

2. After agent retraining
   - Agent prompt updated
   - Agent model changed
   → Reset agent-specific metrics

3. After prolonged period (>6 months)
   - Metrics may be outdated
   → Archive old data, start fresh

How to reset:
  Backup: cp .claude/agent-performance.json .claude/agent-performance-backup-{DATE}.json
  Reset: echo '{"schemaVersion":"1.0.0","lastUpdated":"...","agents":{},...}' > .claude/agent-performance.json

Using Metrics for Optimization

Optimization 1: Identify Underperforming Agents

Detection:

Analyze agent success rates:

agents["ui-developer"]:
  successCount: 38
  totalRuns: 42
  success rate: 38/42 = 90.5% ✅ GOOD

agents["test-architect"]:
  successCount: 15
  totalRuns: 25
  success rate: 15/25 = 60% ❌ UNDERPERFORMING

Threshold: <70% success rate = underperforming
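
The detection step can be expressed as a short filter over the agents map. This is a sketch; underperforming_agents and its min_runs parameter are assumptions added to avoid flagging agents with too little data.

```python
def underperforming_agents(agents: dict, threshold: float = 0.70, min_runs: int = 1) -> list:
    """Return names of agents whose success rate is below the threshold."""
    flagged = []
    for name, stats in agents.items():
        runs = stats["totalRuns"]
        if runs >= min_runs and stats["successCount"] / runs < threshold:
            flagged.append(name)
    return sorted(flagged)


agents = {
    "ui-developer": {"successCount": 38, "totalRuns": 42},    # 90.5%
    "test-architect": {"successCount": 15, "totalRuns": 25},  # 60%
}
print(underperforming_agents(agents))  # ['test-architect']
```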

Action:

For test-architect (60% success):

1. Analyze failure patterns
   Review history entries where result="failure"
   Common failure reasons (one failure can have several):
     - "Tests too brittle" (8 occurrences)
     - "Missing test coverage" (5 occurrences)
     - "Test timeout" (2 occurrences)

2. Identify root cause
   Pattern: test-architect struggles with async/timing tests
   Evidence: All timeout failures involved async code

3. Take action
   Option A: Retrain agent
     - Update prompt with async testing best practices
     - Add examples of proper async test patterns
     - Reset metrics after retraining

   Option B: Route differently
     - Route async test tasks to backend-developer (90% success on async)
     - Keep test-architect for synchronous unit tests

   Option C: Replace agent
     - Create new specialized-async-test-architect
     - Deprecate test-architect for async work

Optimization 2: Find Cost-Effective Model Alternatives

Analysis:

Compare model cost-effectiveness:

Model: sonnet
  Total cost: $0.45
  Total runs: 120
  Success count: 108
  Cost per task: $0.0038
  Cost per success: $0.0042
  Success rate: 90%

Model: grok
  Total cost: $0.08
  Total runs: 35
  Success count: 30
  Cost per task: $0.0023
  Cost per success: $0.0027
  Success rate: 86%

Model: gemini
  Total cost: $0.02
  Total runs: 20
  Success count: 16
  Cost per task: $0.0010
  Cost per success: $0.0013
  Success rate: 80%

Cost Efficiency Ranking:
  1. Gemini Flash: $0.0013 per success (80% success rate)
  2. Grok Fast: $0.0027 per success (86% success rate)
  3. Claude Sonnet: $0.0042 per success (90% success rate)

Quality-Cost Tradeoff:
  - Gemini: 69% cheaper than Claude per success, but 10 percentage points lower success rate
  - Grok: 36% cheaper than Claude per success, but 4 percentage points lower success rate
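
The ranking above follows from dividing cumulative cost by successful runs. A minimal sketch using the figures from this analysis (cost_per_success is a hypothetical helper):

```python
def cost_per_success(stats: dict) -> float:
    """Cumulative spend divided by successful runs; infinite if none succeeded."""
    if stats["successCount"] == 0:
        return float("inf")
    return stats["totalCost"] / stats["successCount"]


models = {
    "sonnet": {"totalCost": 0.45, "totalRuns": 120, "successCount": 108},
    "grok": {"totalCost": 0.08, "totalRuns": 35, "successCount": 30},
    "gemini": {"totalCost": 0.02, "totalRuns": 20, "successCount": 16},
}

# Cheapest cost per successful task first
ranking = sorted(models, key=lambda name: cost_per_success(models[name]))
print(ranking)  # ['gemini', 'grok', 'sonnet']
```

Ranking by cost per success rather than cost per task penalizes models that are cheap but fail often.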

Action:

Optimization strategy:

Tier 1 (Simple tasks):
  Use: Gemini Flash
  Reason: Lowest cost, acceptable success rate for simple work
  Example: "Fix typo in comment", "Format code"

Tier 2 (Medium tasks):
  Use: Grok Fast
  Reason: Good balance of cost and quality
  Example: "Implement CRUD endpoint", "Add validation"

Tier 3 (Complex tasks):
  Use: Claude Sonnet
  Reason: Highest success rate justifies cost
  Example: "Design architecture", "Complex refactoring"

Tier 4 (Critical tasks):
  Use: Claude Sonnet + Multi-model validation
  Reason: Quality > cost for critical work
  Example: "Security review", "Production bug fix"

Expected savings:
  Current: 90% Claude usage × $0.0042 = $0.00378 avg per task (Claude share only)
  Optimized: 20% Claude + 50% Grok + 30% Gemini = 0.2 × $0.0042 + 0.5 × $0.0027 + 0.3 × $0.0013 = $0.00258 avg per task
  Savings: 32% cost reduction with minimal quality impact
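
The expected-savings estimate is a weighted average of the per-success costs. A sketch of the arithmetic, where blended_cost is a hypothetical helper and the usage mix is the one proposed above:

```python
def blended_cost(mix: dict, unit_cost: dict) -> float:
    """Weighted average cost for a usage mix whose shares sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("usage shares must sum to 1")
    return sum(share * unit_cost[model] for model, share in mix.items())


unit_cost = {"claude": 0.0042, "grok": 0.0027, "gemini": 0.0013}  # USD per success
optimized = blended_cost({"claude": 0.20, "grok": 0.50, "gemini": 0.30}, unit_cost)
current = 0.90 * unit_cost["claude"]  # the Claude share of current usage

print(f"optimized: ${optimized:.5f}")              # optimized: $0.00258
print(f"savings: {(current - optimized) / current:.0%}")  # savings: 32%
```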

Optimization 3: Optimize Routing Tier Thresholds

Analysis:

Review tier distribution:

routing.tierDistribution:
  tier1: 45 tasks (45.9%)
  tier2: 30 tasks (30.6%)
  tier3: 15 tasks (15.3%)
  tier4: 8 tasks (8.2%)

Analyze tier accuracy:

Tier 1 (Simple):
  Tasks: 45
  Success: 42
  Failures: 3
  Success rate: 93.3% ✅
  Verdict: Well-calibrated

Tier 2 (Medium):
  Tasks: 30
  Success: 25
  Failures: 5
  Success rate: 83.3% ⚠️
  Verdict: Slightly low (target 90%)

Tier 3 (Complex):
  Tasks: 15
  Success: 12
  Failures: 3
  Success rate: 80.0% ⚠️
  Verdict: Too low (target 90%)

Tier 4 (Critical):
  Tasks: 8
  Success: 8
  Failures: 0
  Success rate: 100% ✅
  Verdict: Well-calibrated
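
Per-tier success rates are not stored directly in the schema, so they would need to be derived from routing.decisions. A sketch of that derivation, with tier_success_rates and tiers_below_target as hypothetical helpers and the 90% target taken from the verdicts above:

```python
from collections import defaultdict


def tier_success_rates(decisions: list) -> dict:
    """Success rate per selectedTier, ignoring decisions still pending."""
    counts = defaultdict(lambda: {"success": 0, "failure": 0})
    for decision in decisions:
        if decision["result"] in ("success", "failure"):
            counts[decision["selectedTier"]][decision["result"]] += 1
    return {
        tier: c["success"] / (c["success"] + c["failure"])
        for tier, c in counts.items()
    }


def tiers_below_target(rates: dict, target: float = 0.90) -> list:
    """Tiers whose success rate misses the target, lowest tier first."""
    return sorted(tier for tier, rate in rates.items() if rate < target)
```

Note that the decisions array is capped at 100 entries, so these rates reflect recent routing only.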

Action:

Adjust tier thresholds:

Current thresholds (task-complexity-router):
  tier1: complexity score 0-3
  tier2: complexity score 4-6
  tier3: complexity score 7-9
  tier4: complexity score 10+

Problem: tier2 and tier3 have lower success rates
Root cause: Tasks slightly too complex for assigned tier

Optimized thresholds:
  tier1: complexity score 0-2 (narrower range)
  tier2: complexity score 3-5 (shift down)
  tier3: complexity score 6-8 (shift down)
  tier4: complexity score 9+ (broader range)

Rationale:
  - Shift more borderline tasks to higher tiers
  - Accept slightly higher cost for better success rates
  - tier2/tier3 success should improve to 90%+

Expected impact:
  - tier1 usage: 45 → 35 tasks (fewer simple tasks)
  - tier2 usage: 30 → 32 tasks (more medium tasks)
  - tier3 usage: 15 → 18 tasks (more complex tasks)
  - tier4 usage: 8 → 13 tasks (more critical tasks)
  - Overall success rate: 88% → 92%
  - Overall cost: +15% (acceptable tradeoff for quality)

Optimization 4: Detect Model-Task Affinity Patterns

Analysis:

Analyze task type performance by model:

Task type: code-review

Claude Sonnet:
  Success: 25, Failure: 2
  Success rate: 92.6% ✅

Grok Fast:
  Success: 18, Failure: 2
  Success rate: 90.0% ✅

Gemini Flash:
  Success: 10, Failure: 4
  Success rate: 71.4% ⚠️

→ Pattern: Claude and Grok excel at code review, Gemini struggles

Task type: implementation

Claude Sonnet:
  Success: 40, Failure: 5
  Success rate: 88.9% ✅

Grok Fast:
  Success: 12, Failure: 3
  Success rate: 80.0% ⚠️

Gemini Flash:
  Success: 6, Failure: 1
  Success rate: 85.7% ✅

→ Pattern: Claude best for implementation, Grok/Gemini acceptable

Task type: testing

Claude Sonnet:
  Success: 20, Failure: 3
  Success rate: 87.0% ✅

Grok Fast:
  Success: 0, Failure: 0
  Success rate: N/A

Gemini Flash:
  Success: 0, Failure: 0
  Success rate: N/A

→ Pattern: Only Claude has testing data (others not used for this)
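
The per-task comparison can be automated from taskTypePerformance. In this sketch, best_model_for is a hypothetical helper and min_runs is an assumed guard so that models with little or no data for a task type (as with testing above) are not selected.

```python
def best_model_for(models: dict, task_type: str, min_runs: int = 5):
    """Return the model with the highest success rate for a task type, or None."""
    rates = {}
    for name, stats in models.items():
        perf = stats["taskTypePerformance"].get(task_type)
        if perf is None:
            continue
        runs = perf["success"] + perf["failure"]
        if runs >= min_runs:
            rates[name] = perf["success"] / runs
    if not rates:
        return None
    return max(rates, key=rates.get)


models = {
    "sonnet": {"taskTypePerformance": {
        "code-review": {"success": 25, "failure": 2},
        "implementation": {"success": 40, "failure": 5},
        "testing": {"success": 20, "failure": 3},
    }},
    "grok": {"taskTypePerformance": {
        "code-review": {"success": 18, "failure": 2},
        "implementation": {"success": 12, "failure": 3},
    }},
    "gemini": {"taskTypePerformance": {
        "code-review": {"success": 10, "failure": 4},
        "implementation": {"success": 6, "failure": 1},
    }},
}
print(best_model_for(models, "testing"))  # sonnet
```

This picks purely on success rate; a cost-aware router would combine it with cost per success.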

Action:

Task-specific model routing:

code-review tasks:
  tier1: Grok Fast (90% success, low cost)
  tier2: Grok Fast (90% success, low cost)
  tier3: Claude Sonnet (93% success, high quality)
  tier4: Multi-model (Claude + Grok consensus)

implementation tasks:
  tier1: Gemini Flash (86% success, lowest cost)
  tier2: Grok Fast (80% success, medium cost)
  tier3: Claude Sonnet (89% success, highest quality)
  tier4: Claude Sonnet (89% success, proven)

testing tasks:
  tier1-4: Claude Sonnet (only model with proven testing capability)

Expected impact:
  - 25% cost savings on code reviews (use Grok instead of Claude)
  - 10% cost savings on implementation (use Gemini for simple)
  - Maintain quality (route by proven success rates)

Optimization 5: Alert on Performance Degradation

Detection:

Monitor for degradation:

Week 1 (baseline):
  ui-developer success rate: 90.5%
  Average task duration: 45s

Week 2:
  ui-developer success rate: 88.2% (↓2.3%)
  Average task duration: 48s (↑3s)

Week 3:
  ui-developer success rate: 85.1% (↓5.4% from baseline)
  Average task duration: 52s (↑7s from baseline)

Week 4:
  ui-developer success rate: 78.3% (↓12.2% from baseline) 🚨
  Average task duration: 58s (↑13s from baseline) 🚨

Threshold exceeded:
  ❌ Success rate dropped >10% (78.3% vs 90.5%)
  ❌ Duration increased >20% (58s vs 45s)

→ ALERT: ui-developer performance degraded significantly
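
A sketch of the threshold check, assuming the success-rate threshold means a drop of more than 10 percentage points and the duration threshold means a relative increase of more than 20% (the helper name degradation_alerts is hypothetical):

```python
def degradation_alerts(baseline: dict, current: dict,
                       max_rate_drop: float = 0.10,
                       max_duration_increase: float = 0.20) -> list:
    """Compare a period against the baseline and list which thresholds it exceeds."""
    alerts = []
    # Absolute drop in success rate (percentage points as a fraction)
    if baseline["successRate"] - current["successRate"] > max_rate_drop:
        alerts.append("success-rate")
    # Relative increase in average task duration
    increase = (current["avgDuration"] - baseline["avgDuration"]) / baseline["avgDuration"]
    if increase > max_duration_increase:
        alerts.append("duration")
    return alerts


week1 = {"successRate": 0.905, "avgDuration": 45}
week3 = {"successRate": 0.851, "avgDuration": 52}
week4 = {"successRate": 0.783, "avgDuration": 58}
print(degradation_alerts(week1, week3))  # []
print(degradation_alerts(week1, week4))  # ['success-rate', 'duration']
```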

Action:

Degradation response:

1. Investigate root cause
   Review recent history:
     - Task complexity increased? (Check taskTypes distribution)
     - Model changed? (Check model field in history)
     - Failures clustered around specific task type?

   Finding: All recent failures on "complex-state-management" tasks
   Root cause: New task type introduced, agent not trained for it

2. Take corrective action
   Option A: Retrain agent
     - Update prompt with state management patterns
     - Add examples of successful state management
     - Reset metrics after retraining

   Option B: Route differently
     - Route state management tasks to specialized agent
     - Keep ui-developer for simpler UI tasks

   Option C: Escalate to human
     - Alert: "ui-developer performance degraded"
     - Request: "Manual review of recent failures needed"

3. Monitor recovery
   Week 5 (after retraining):
     Success rate: 85.0% (recovering)
   Week 6:
     Success rate: 89.2% (near baseline)
   Week 7:
     Success rate: 91.0% (recovered ✅)

Integration with Orchestration Plugin

Integration 1: multi-model-validation

How multi-model-validation records model performance:

Multi-Model Review Flow:

1. Execute parallel review
   Models: [claude-sonnet, grok-fast, gemini-flash]
   Task: Code review of auth.ts

2. Collect model responses
   Each model returns:
     - Review findings
     - Confidence score
     - Latency
     - Cost

3. Record individual model performance
   For each model:
     models[modelId].totalRuns += 1
     models[modelId].avgLatency = rolling_avg(latency)
     models[modelId].totalCost += cost

4. Determine success/failure
   If review completed and returned findings (or a clean report) → success (doing its job)
   If review crashed/errored → failure

5. Update success counts
   models[modelId].successCount += 1  (or failureCount)
   models[modelId].taskTypePerformance["code-review"].success += 1  (or .failure)

6. Consolidate findings
   Generate consensus report
   Track which models agreed (co-occurrence patterns)

7. User feedback (optional)
   User rates review quality: "Helpful" | "Not helpful"
   Update successCorrelation for multi-model-validation skill

Integration 2: task-complexity-router

How task-complexity-router reads performance data:

Routing Decision Flow:

1. Analyze task complexity
   Input: "Implement user authentication with OAuth"
   Analysis: Complex (multiple components, external API, security)
   Base tier: 3

2. Read performance history
   Load: .claude/agent-performance.json
   Check: routing.tierDistribution (usage counts) and routing.decisions (outcomes per tier)

3. Adjust tier based on history
   tier3 historical success rate: 80% (below 90% target)
   tier4 historical success rate: 100%

   Decision: Bump to tier4 for higher success probability

4. Select agent based on task type affinity
   Task type: "implement-api"
   Candidates: backend-developer, full-stack-developer

   Check affinity:
     backend-developer.taskTypes["implement-api"]: 12 (high affinity)
     full-stack-developer.taskTypes["implement-api"]: 3 (low affinity)

   Decision: Select backend-developer (proven track record)

5. Select model based on tier + task type
   tier4 + implement-api:
     models["sonnet"].taskTypePerformance["implementation"]: 89% success
     models["grok"].taskTypePerformance["implementation"]: 80% success

   Decision: Select Claude (higher success rate for tier4)

6. Record routing decision
   routing.decisions.push({
     timestamp: NOW,
     taskType: "implement-api",
     complexity: "complex",
     selectedTier: 4,
     agent: "backend-developer",
     model: "sonnet",
     result: "pending"
   })

7. After execution, update result
   routing.decisions[last].result = "success"
   routing.decisions[last].cost = 0.005
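
Steps 3 and 4 of this flow could be sketched as two small functions. adjust_tier and pick_agent are hypothetical names, and treating a tier with no recorded history as meeting the target is an assumption of this sketch, not something the router is documented to do.

```python
def adjust_tier(base_tier: int, tier_rates: dict,
                target: float = 0.90, max_tier: int = 4) -> int:
    """Bump the tier upward while its historical success rate misses the target."""
    tier = base_tier
    # Tiers without history default to 1.0, i.e. they are not bumped (assumption)
    while tier < max_tier and tier_rates.get(tier, 1.0) < target:
        tier += 1
    return tier


def pick_agent(agents: dict, task_type: str, candidates: list) -> str:
    """Choose the candidate with the most recorded runs of this task type."""
    return max(candidates, key=lambda name: agents[name]["taskTypes"].get(task_type, 0))


tier_rates = {3: 0.80, 4: 1.00}
agents = {
    "backend-developer": {"taskTypes": {"implement-api": 12}},
    "full-stack-developer": {"taskTypes": {"implement-api": 3}},
}
print(adjust_tier(3, tier_rates))  # 4
print(pick_agent(agents, "implement-api", ["backend-developer", "full-stack-developer"]))  # backend-developer
```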

Integration 3: hierarchical-coordinator

How hierarchical-coordinator tracks phase success:

Phase Execution Tracking:

1. Execute workflow phases
   Phase 1: Planning (architect agent)
   Phase 2: Implementation (developer agent)
   Phase 3: Testing (tester agent)
   Phase 4: Review (reviewer agent)

2. Track phase-level metrics
   Create phase-specific tracking:

   agents["architect"].phasePerformance = {
     "planning": { success: 15, failure: 2 },
     "architecture": { success: 8, failure: 1 }
   }

   agents["developer"].phasePerformance = {
     "implementation": { success: 25, failure: 5 },
     "refactoring": { success: 10, failure: 2 }
   }

3. Detect phase-specific issues
   Analysis: developer has a 16.7% failure rate on the implementation phase (5/30)
   But: developer has an 83% success rate overall (35/42)

   Insight: Most failures (5 of 7) occur in the implementation phase

4. Optimize phase assignment
   Current: developer handles all implementation
   Optimized: Split by complexity
     - Simple implementation → junior-developer (cheaper)
     - Complex implementation → senior-developer (higher success)

5. Track coordinator effectiveness
   skills["hierarchical-coordinator"].activations += 1
   skills["hierarchical-coordinator"].successCorrelation = 0.92

   Insight: Workflows using coordinator have 92% success (vs 80% without)

Integration 4: quality-gates

How quality-gates uses performance thresholds:

Quality Gate Decision:

1. Agent completes task
   Agent: ui-developer
   Task: "Implement dashboard component"
   Confidence: 0.75

2. Check agent performance history
   Load: agents["ui-developer"]
   Historical avg confidence: 0.85
   Current confidence: 0.75 (below average 🚨)

3. Apply quality gate
   Threshold: If confidence <= avg - 0.10, trigger validation

   Decision: 0.75 <= 0.85 - 0.10 = 0.75 (borderline, gate triggers)
   Action: Trigger designer validation (extra quality check)

4. Designer validates
   Result: Found 3 minor issues
   Verdict: Quality gate prevented low-quality work from proceeding

5. Update metrics
   Without gate: ui-developer would have 1 more failure
   With gate: Issues caught early, fixed before user sees

   Recompute skills["quality-gates"].successCorrelation with this task counted as a success
   (Success correlation is a 0.0-1.0 rate; it rises when the gate prevents failures)

6. Continuous improvement
   Pattern: Low-confidence tasks benefit from extra validation
   Threshold: Automatically adjust based on correlation data
   Future: If confidence < 0.80, always trigger validation
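
The gate in step 3 could be sketched as below. needs_validation is a hypothetical helper; it treats the borderline case (confidence exactly 0.10 under the average) as triggering, which matches the decision in the worked example, and compares in whole hundredths to sidestep floating-point rounding.

```python
def needs_validation(confidence: float, avg_confidence: float, margin: float = 0.10) -> bool:
    """True when confidence is at or below the agent's average minus the margin."""
    # Compare as integer hundredths so 0.85 - 0.10 vs 0.75 is not affected by float error
    return round(confidence * 100) <= round((avg_confidence - margin) * 100)


print(needs_validation(0.75, 0.85))  # True
print(needs_validation(0.80, 0.85))  # False
```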

Best Practices

Do

✅ Track all agent executions (success and failure provide learning signal)
✅ Record model latency and cost (optimize for cost-effectiveness)
✅ Maintain execution history (detect patterns and trends)
✅ Set success rate thresholds (<70% = investigate, <50% = replace)
✅ Alert on performance degradation (>10% drop from baseline)
✅ Use task type affinity (route tasks to agents with proven success)
✅ Compare model cost-effectiveness (cost per success, not just cost per task)
✅ Track skill co-activation (identify successful skill combinations)
✅ Rotate history data (keep last 100 entries, prevent unbounded growth)
✅ Create session snapshots (point-in-time analysis)
✅ Integrate with routing (feed performance data back to router)

Don't

❌ Track only successes (failures provide valuable learning signal)
❌ Ignore degradation (small drops compound into big problems)
❌ Use stale data (>6 months old metrics may not reflect current state)
❌ Over-optimize on cost alone (balance cost and quality)
❌ Forget to update metrics (incomplete data leads to poor decisions)
❌ Store unbounded history (trim arrays to prevent file bloat)
❌ Mix session metrics (isolate session data for cleaner analysis)
❌ Ignore task type affinity (agents specialize, use it)
❌ Skip validation after major changes (reset metrics when workflows change)

Privacy Considerations

What to Track:

Aggregate metrics (counts, averages, distributions)
Task types (generic categories like "implement-component")
Success/failure outcomes
Model performance data
Timing and cost data

What NOT to Track:

User-specific data (usernames, emails)
Sensitive code snippets
API keys or credentials
Personal information
Business logic details

Data Retention:

Keep aggregate metrics indefinitely (no PII)
Rotate detailed history after 100 entries
Delete session snapshots after 100 sessions
Archive old metrics before major resets

When to Reset Metrics

Situations Requiring Reset:

1. Agent capabilities changed
   - Prompt updated significantly
   - Agent model changed
   - Agent skills added/removed
   → Reset agent-specific metrics

2. Workflow architecture changed
   - New routing logic
   - New tier definitions
   - New skill combinations
   → Reset routing and skill metrics

3. Model pricing changed
   - Cost per token updated
   - New pricing tier
   → Reset cost calculations (keep counts)

4. After prolonged period (>6 months)
   - Metrics may be outdated
   - Workflow patterns changed
   → Archive and reset all metrics

How to Reset:

# Backup current metrics
cp .claude/agent-performance.json .claude/agent-performance-backup-$(date +%Y%m%d).json

# Reset to empty state (unquoted EOF so the timestamp below is expanded)
cat > .claude/agent-performance.json << EOF
{
  "schemaVersion": "1.0.0",
  "lastUpdated": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "agents": {},
  "skills": {},
  "models": {},
  "routing": {
    "tierDistribution": {},
    "decisions": []
  }
}
EOF

# Archive old session snapshots
mkdir -p ai-docs/performance-history/archive-$(date +%Y%m%d)
mv ai-docs/performance-history/*.json ai-docs/performance-history/archive-$(date +%Y%m%d)/

Metric Hygiene

Regular Maintenance:

Weekly:
  - Review top agents (ensure success rates >70%)
  - Check model cost trends (identify cost spikes)
  - Trim co-activation maps (keep top 50 per skill)

Monthly:
  - Analyze task type affinity changes
  - Compare model cost-effectiveness
  - Review tier distribution accuracy
  - Archive old session snapshots (keep last 100)

Quarterly:
  - Deep analysis of performance trends
  - Optimize routing thresholds
  - Identify underperforming patterns
  - Consider agent retraining or replacement

Annually:
  - Full metrics review and reset (if needed)
  - Archive all historical data
  - Update baseline success rates
  - Document lessons learned
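
The weekly "trim co-activation maps" task can be sketched as a small helper. This is a minimal illustration, assuming `coActivations` has the name-to-count shape used in the metrics schema; the function name `trimCoActivations` is hypothetical and not part of the skill.

```javascript
// Keep only the `limit` most frequent co-activation partners for one skill.
// `coActivations` maps skill name -> count, as in the metrics schema.
function trimCoActivations(coActivations, limit = 50) {
  const top = Object.entries(coActivations)
    .sort(([, a], [, b]) => b - a) // highest count first
    .slice(0, limit);
  return Object.fromEntries(top);
}

// Example with a limit of 2 so the effect is visible on a tiny map.
const trimmedMap = trimCoActivations(
  { "quality-gates": 12, "error-recovery": 8, "rarely-used": 1 },
  2
);
console.log(trimmedMap);
```

The helper returns a new object instead of mutating its input, so the caller decides when to write the trimmed map back to the metrics file.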

Examples

Example 1: Tracking a Multi-Model Review Session

Scenario: User requests /review with 3 models (Claude, Grok, Gemini)

Execution:

Step 1: Initialize session
  Session ID: 2026-01-28-session-15
  Start time: 15:00:00Z

Step 2: Execute multi-model review
  Models: [claude-sonnet, grok-fast, gemini-flash]
  Task: Code review of auth/login.ts (450 lines)

Step 3: Track individual model executions

  Model: sonnet
    Start: 15:00:05Z
    End: 15:00:08Z
    Latency: 3000ms
    Cost: $0.003
    Result: Found 5 issues (2 CRITICAL, 3 HIGH)
    Outcome: Success

  Update metrics:
    models["sonnet"].totalRuns = 121
    models["sonnet"].successCount = 109
    models["sonnet"].avgLatency = 2520ms
    models["sonnet"].totalCost = $0.453
    models["sonnet"].taskTypePerformance["code-review"].success = 26

  Model: grok
    Start: 15:00:05Z
    End: 15:00:07Z
    Latency: 2000ms
    Cost: $0.002
    Result: Found 4 issues (2 CRITICAL, 2 HIGH)
    Outcome: Success

  Update metrics:
    models["grok"].totalRuns = 36
    models["grok"].successCount = 31
    models["grok"].avgLatency = 1820ms
    models["grok"].totalCost = $0.082
    models["grok"].taskTypePerformance["code-review"].success = 19

  Model: gemini
    Start: 15:00:05Z
    End: 15:00:06Z
    Latency: 1500ms
    Cost: $0.001
    Result: Found 3 issues (1 CRITICAL, 2 MEDIUM)
    Outcome: Success

  Update metrics:
    models["gemini"].totalRuns = 21
    models["gemini"].successCount = 17
    models["gemini"].avgLatency = 1480ms
    models["gemini"].totalCost = $0.021
    models["gemini"].taskTypePerformance["code-review"].success = 11

Step 4: Track skill activation
  skills["multi-model-validation"].activations = 16
  skills["multi-model-validation"].lastActivated = 15:00:08Z
  skills["multi-model-validation"].coActivations["quality-gates"] = 13

Step 5: Consolidate findings
  Consensus issues (all 3 models agreed):
    - CRITICAL: SQL injection vulnerability (UNANIMOUS)
    - CRITICAL: Missing authentication check (UNANIMOUS)

  Majority issues (2/3 models agreed):
    - HIGH: Insufficient input validation (Claude, Grok)
    - HIGH: Missing error handling (Claude, Grok)

  Divergent issues (1/3 models):
    - MEDIUM: Code duplication (Gemini only)

Step 6: Record session summary
  Session complete:
    Duration: 8 seconds
    Models used: 3
    Total cost: $0.006
    Issues found: 5 (2 unanimous, 2 majority, 1 divergent)
    Result: Success

  Create snapshot:
    ai-docs/performance-history/2026-01-28-session-15.json

Step 7: Update aggregate metrics
  Overall session success rate: 3/3 models successful = 100%
  Cost efficiency: $0.002 per model = good value

Insights from Tracking:

Performance comparison:
  Fastest: Gemini (1500ms) - 50% faster than Claude
  Most thorough: Claude (5 issues) - Found 1 extra issue
  Best value: Gemini ($0.001, 3 issues) - Lowest cost, good coverage

Cost analysis:
  Total: $0.006 for 3-model review
  vs Single Claude: $0.003 (double the cost, for two additional independent reviews)
  ROI: Found 2 CRITICAL issues all models agreed on = high confidence

Consensus validation:
  UNANIMOUS issues (100% confidence) → Fix immediately
  MAJORITY issues (67% confidence) → Fix before merge
  DIVERGENT issues (33% confidence) → Low priority (possible false positive)

Recommendation:
  Multi-model validation worth the cost for critical code (auth, payments, security)
  Single-model sufficient for non-critical code (UI components, docs)
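
The consolidation step (unanimous, majority, divergent) can be expressed as a vote count per issue. The sketch below is one possible way to do it, not the skill's actual implementation; `classifyConsensus` and the issue identifiers are hypothetical names chosen to mirror the consolidated findings above.

```javascript
// findingsByModel: { modelName: [issueId, ...] } (issue IDs are hypothetical labels)
function classifyConsensus(findingsByModel) {
  const models = Object.keys(findingsByModel);
  const votes = new Map(); // issueId -> Set of models that reported it
  for (const model of models) {
    for (const issue of findingsByModel[model]) {
      if (!votes.has(issue)) votes.set(issue, new Set());
      votes.get(issue).add(model);
    }
  }
  const groups = { unanimous: [], majority: [], divergent: [] };
  for (const [issue, voters] of votes) {
    const share = voters.size / models.length;
    if (share === 1) groups.unanimous.push(issue);
    else if (share > 0.5) groups.majority.push(issue);
    else groups.divergent.push(issue);
  }
  return groups;
}

const consensus = classifyConsensus({
  claude: ["sql-injection", "missing-auth-check", "input-validation", "error-handling"],
  grok: ["sql-injection", "missing-auth-check", "input-validation", "error-handling"],
  gemini: ["sql-injection", "missing-auth-check", "code-duplication"],
});
console.log(consensus);
```

Using a `Set` per issue means a model that reports the same issue twice still counts as one vote.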

Example 2: Identifying Model Performance Differences

Scenario: After 100 tasks, compare model performance for optimization

Execution:

Step 1: Load performance data
  Read: .claude/agent-performance.json

Step 2: Extract model metrics

  Claude Sonnet:
    Total runs: 120
    Success: 108, Failure: 12
    Success rate: 90.0%
    Avg latency: 2500ms
    Total cost: $0.45
    Cost per task: $0.00375
    Cost per success: $0.00417

  Grok Fast:
    Total runs: 35
    Success: 30, Failure: 5
    Success rate: 85.7%
    Avg latency: 1800ms
    Total cost: $0.08
    Cost per task: $0.00229
    Cost per success: $0.00267

  Gemini Flash:
    Total runs: 20
    Success: 16, Failure: 4
    Success rate: 80.0%
    Avg latency: 1500ms
    Total cost: $0.02
    Cost per task: $0.00100
    Cost per success: $0.00125

Step 3: Analyze task type performance

  Code Review:
    Claude: 25 success, 2 failure = 92.6%
    Grok: 18 success, 2 failure = 90.0%
    Gemini: 10 success, 4 failure = 71.4%

    Winner: Claude (highest quality)
    Best value: Grok (90% at lower cost)

  Implementation:
    Claude: 40 success, 5 failure = 88.9%
    Grok: 12 success, 3 failure = 80.0%
    Gemini: 6 success, 1 failure = 85.7%

    Winner: Claude (highest quality)
    Surprising: Gemini performs well here (86% success)

  Testing:
    Claude: 20 success, 3 failure = 87.0%
    Grok: No data
    Gemini: No data

    Winner: Claude (only option)
    Action: Try Grok/Gemini for testing tasks to gather data

Step 4: Calculate cost-effectiveness by task type

  Code Review Cost-Effectiveness:
    Claude: $0.00417 per success, 92.6% quality
    Grok: $0.00267 per success, 90.0% quality (36% cheaper, -2.6% quality)
    Gemini: $0.00125 per success, 71.4% quality (70% cheaper, -21.2% quality)

    Recommendation: Use Grok for cost-effective reviews (minimal quality loss)

  Implementation Cost-Effectiveness:
    Claude: $0.00417 per success, 88.9% quality
    Grok: $0.00267 per success, 80.0% quality (36% cheaper, -8.9% quality)
    Gemini: $0.00125 per success, 85.7% quality (70% cheaper, -3.2% quality)

    Recommendation: Use Gemini for simple implementation (best value)

Step 5: Generate optimization plan

  Current usage (120 total tasks):
    Claude: 100 tasks (83%)
    Grok: 15 tasks (13%)
    Gemini: 5 tasks (4%)

  Optimized usage (maintain quality >85%):
    tier1 (Simple): Gemini (30% of tasks)
    tier2 (Medium): Grok (40% of tasks)
    tier3 (Complex): Claude (25% of tasks)
    tier4 (Critical): Claude + Multi-model (5% of tasks)

  Expected impact:
    Current avg cost: $0.00375 per task
    Optimized avg cost: $0.00240 per task
    Savings: 36% cost reduction

    Current avg success: 88.5%
    Optimized avg success: 86.2% (projected)
    Quality impact: -2.3% (acceptable tradeoff)

Step 6: Implement gradual rollout

  Week 1: Route 20% of tier1 tasks to Gemini
    Monitor: Success rate, cost savings
    Target: >80% success rate

  Week 2: Route 40% of tier2 tasks to Grok
    Monitor: Success rate, cost savings
    Target: >85% success rate

  Week 3: Evaluate results
    If successful: Increase percentages
    If unsuccessful: Rollback and investigate

Step 7: Track optimization results

  After 2 weeks:
    Gemini tier1 success: 82% ✅ (above 80% target)
    Grok tier2 success: 87% ✅ (above 85% target)
    Cost savings: 28% ✅ (approaching 36% target)

  Decision: Continue rollout
  Next: Route 50% tier1 to Gemini, 60% tier2 to Grok
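
The per-model figures in Step 2 follow directly from the stored totals. As a rough sketch (the helper name `costEffectiveness` is an assumption, and the input uses the `totalRuns`, `successCount`, and `totalCost` fields from the metrics schema), the ranking by cost per success could be computed like this:

```javascript
function costEffectiveness(model) {
  const { totalRuns, successCount, totalCost } = model;
  return {
    successRate: totalRuns > 0 ? successCount / totalRuns : 0,
    costPerTask: totalRuns > 0 ? totalCost / totalRuns : 0,
    // Infinity makes a model with no successes sort last.
    costPerSuccess: successCount > 0 ? totalCost / successCount : Infinity,
  };
}

// Totals taken from Step 2 of this example.
const modelStats = {
  sonnet: { totalRuns: 120, successCount: 108, totalCost: 0.45 },
  grok: { totalRuns: 35, successCount: 30, totalCost: 0.08 },
  gemini: { totalRuns: 20, successCount: 16, totalCost: 0.02 },
};

const rankedModels = Object.entries(modelStats)
  .map(([name, stats]) => ({ name, ...costEffectiveness(stats) }))
  .sort((a, b) => a.costPerSuccess - b.costPerSuccess); // cheapest success first

for (const m of rankedModels) {
  console.log(m.name, m.costPerSuccess.toFixed(5), (m.successRate * 100).toFixed(1) + "%");
}
```

Ranking on cost per success rather than cost per task penalizes a cheap model whose failures have to be retried elsewhere.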

Insights from Analysis:

Key findings:
  1. Grok is best value for code reviews (90% quality at 36% lower cost)
  2. Gemini surprisingly good for implementation (86% vs 89% Claude)
  3. Claude still best for critical work (92% code review success)
  4. Latency varies significantly (Gemini 40% faster than Claude)

Optimization strategy:
  - Use Gemini for simple, latency-sensitive tasks
  - Use Grok for medium-complexity, cost-sensitive tasks
  - Use Claude for critical, quality-sensitive tasks
  - Use multi-model for maximum confidence (despite cost)

Expected ROI:
  - 36% cost reduction (from $0.00375 to $0.00240 per task)
  - 2.3% quality tradeoff (from 88.5% to 86.2% success)
  - Worth it: Save $135 per 100,000 tasks with minimal quality impact

Example 3: Optimizing Routing Based on Accumulated Data

Scenario: After 100 routing decisions, optimize tier thresholds

Execution:

Step 1: Load routing data
  Read: .claude/agent-performance.json
  Focus: routing.tierDistribution, routing.decisions

Step 2: Analyze tier distribution

  Current distribution:
    tier1: 45 tasks (45.9%)
    tier2: 30 tasks (30.6%)
    tier3: 15 tasks (15.3%)
    tier4: 8 tasks (8.2%)

  Skew analysis:
    Heavy on tier1 (46%) - Router prefers simple classification
    Light on tier4 (8%) - Router rarely escalates

Step 3: Calculate tier success rates

  tier1 (Simple tasks):
    Total: 45
    Success: 42, Failure: 3
    Success rate: 93.3% ✅
    Avg cost: $0.001
    Avg duration: 25s

  tier2 (Medium tasks):
    Total: 30
    Success: 25, Failure: 5
    Success rate: 83.3% ⚠️ (target: 90%)
    Avg cost: $0.002
    Avg duration: 45s

  tier3 (Complex tasks):
    Total: 15
    Success: 12, Failure: 3
    Success rate: 80.0% ⚠️ (target: 90%)
    Avg cost: $0.004
    Avg duration: 90s

  tier4 (Critical tasks):
    Total: 8
    Success: 8, Failure: 0
    Success rate: 100% ✅
    Avg cost: $0.008
    Avg duration: 120s

Step 4: Analyze tier2/tier3 failures

  tier2 failures (5 tasks):
    1. "Implement complex state management" (complexity: 6)
       - Should have been tier3 (underestimated)
    2. "Add authentication to API" (complexity: 6)
       - Should have been tier3 (security = critical)
    3. "Refactor component with hooks" (complexity: 5)
       - Should have been tier2 (correctly routed, agent issue)
    4. "Implement drag-and-drop" (complexity: 6)
       - Should have been tier3 (complex interaction)
    5. "Add real-time updates" (complexity: 6)
       - Should have been tier3 (WebSocket complexity)

  Pattern: 4/5 failures were borderline tier2/tier3 (complexity 6)
  Root cause: tier2 upper threshold too high (should be 5, not 6)

  tier3 failures (3 tasks):
    1. "Design microservices architecture" (complexity: 9)
       - Should have been tier4 (architecture = critical)
    2. "Implement payment processing" (complexity: 9)
       - Should have been tier4 (money = critical)
    3. "Refactor authentication system" (complexity: 8)
       - Correctly routed, agent struggled with complexity

  Pattern: 2/3 failures should have been tier4 (complexity 9)
  Root cause: tier3 upper threshold too high (should be 8, not 9)

Step 5: Propose threshold adjustments

  Current thresholds:
    tier1: complexity 0-3
    tier2: complexity 4-6
    tier3: complexity 7-9
    tier4: complexity 10+

  Problem: Borderline tasks (6, 9) cause failures

  Optimized thresholds:
    tier1: complexity 0-2 (narrower, more confident)
    tier2: complexity 3-5 (shift down, avoid borderline 6)
    tier3: complexity 6-8 (shift down, avoid borderline 9)
    tier4: complexity 9+ (broader, include borderline cases)

  Rationale:
    - Move borderline complexity 6 from tier2 → tier3
    - Move borderline complexity 9 from tier3 → tier4
    - Accept roughly 15% higher cost for a higher overall success rate (quantified in Steps 6 and 7)

Step 6: Simulate new distribution

  Reclassify historical tasks with new thresholds:

  tier1 (0-2): 35 tasks (35%)
    Success rate: 34/35 = 97.1% ↑ (was 93.3%)

  tier2 (3-5): 32 tasks (32%)
    Success rate: 30/32 = 93.8% ↑ (was 83.3%)

  tier3 (6-8): 18 tasks (18%)
    Success rate: 17/18 = 94.4% ↑ (was 80.0%)

  tier4 (9+): 13 tasks (13%)
    Success rate: 13/13 = 100% ✓ (was 100%)

  Overall success rate: 94/98 = 95.9% ↑ (was 87/98 = 88.8%)

Step 7: Calculate cost impact

  Current avg cost: $0.00240 per task
  Optimized avg cost: $0.00276 per task (+15%)

  Cost breakdown:
    tier1 (35%): $0.001 × 0.35 = $0.00035
    tier2 (32%): $0.002 × 0.32 = $0.00064
    tier3 (18%): $0.004 × 0.18 = $0.00072
    tier4 (13%): $0.008 × 0.13 = $0.00104
    Total: $0.00275 (quoted above as $0.00276; the tier shares are rounded down and sum to 98%)

  ROI calculation:
    Cost increase: +$0.00036 per task (+15%)
    Success increase: +7.1 points (from 88.8% to 95.9%)
    Failure reduction: 11 → 4 failures (64% reduction)

  Value: Preventing about 7 failures per 100 tasks is worth the 15% cost increase

Step 8: Implement new thresholds

  Update task-complexity-router skill:
    OLD:
      if (complexity <= 3) return "tier1";
      if (complexity <= 6) return "tier2";
      if (complexity <= 9) return "tier3";
      return "tier4";

    NEW:
      if (complexity <= 2) return "tier1";
      if (complexity <= 5) return "tier2";
      if (complexity <= 8) return "tier3";
      return "tier4";

  Document change:
    Reason: Performance data showed borderline tasks caused failures
    Expected: ~7-point success rate improvement, 15% cost increase
    Monitoring: Track next 100 tasks to validate improvement

Step 9: Monitor post-optimization

  After 50 tasks with new thresholds:
    tier1: 18 tasks, 18 success = 100% ✅
    tier2: 16 tasks, 15 success = 93.8% ✅
    tier3: 10 tasks, 9 success = 90.0% ✅
    tier4: 6 tasks, 6 success = 100% ✅

  Overall: 48/50 = 96.0% success ✅ (matches projection)
  Avg cost: $0.00280 ✅ (close to the $0.00276 projection)

  Verdict: Optimization successful, keep new thresholds
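
The OLD and NEW routing rules in Step 8 differ only in their upper bounds, so they can be driven by a threshold array. The following is a sketch under that assumption; `tierForComplexity`, `countReroutes`, and the sample complexity scores are hypothetical and only illustrate how historical tasks could be reclassified before changing the router.

```javascript
// Inclusive upper bounds for tier1..tier3; anything higher is tier4.
const OLD_THRESHOLDS = [3, 6, 9];
const NEW_THRESHOLDS = [2, 5, 8];

function tierForComplexity(complexity, thresholds) {
  const index = thresholds.findIndex((upper) => complexity <= upper);
  return index === -1 ? `tier${thresholds.length + 1}` : `tier${index + 1}`;
}

// Count how many historical tasks would change tier under new thresholds.
function countReroutes(complexities, before, after) {
  return complexities.filter(
    (c) => tierForComplexity(c, before) !== tierForComplexity(c, after)
  ).length;
}

const sampleComplexities = [2, 3, 5, 6, 6, 8, 9, 10]; // hypothetical scores
console.log(tierForComplexity(6, OLD_THRESHOLDS)); // tier2
console.log(tierForComplexity(6, NEW_THRESHOLDS)); // tier3
console.log(countReroutes(sampleComplexities, OLD_THRESHOLDS, NEW_THRESHOLDS));
```

Keeping the bounds in data makes it easy to replay recorded decisions against a candidate threshold set before touching the live router.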

Insights from Optimization:

Key findings:
  1. Borderline complexity scores (6, 9) caused most failures
  2. Router was too aggressive in keeping tasks at lower tiers
  3. Small threshold adjustments (6→5, 9→8) had big impact

Optimization results:
  - Success rate: 88.8% → 96.0% (+7.2 points)
  - Failure rate: 11.2% → 4.0% (-64%)
  - Cost per task: $0.00240 → $0.00280 (+17%)
  - ROI: Strong (quality improvement worth cost increase)

Lessons learned:
  - Track tier success rates, not just overall success
  - Borderline cases benefit from tier escalation
  - Performance data reveals routing blind spots
  - Continuous monitoring enables iterative improvement

Next steps:
  - Continue monitoring for 100 more tasks
  - Consider dynamic thresholds (adjust based on live data)
  - Explore agent-specific routing (some agents handle complexity better)
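
Tracking success per tier, as the lessons above recommend, only needs the `selectedTier` and `result` fields already stored in `routing.decisions`. A minimal sketch follows; the function names and the 90% default target are assumptions taken from the targets quoted in this example.

```javascript
function tierSuccessRates(decisions) {
  const tiers = {};
  for (const d of decisions) {
    if (d.result === "pending") continue; // outcome not known yet
    const key = `tier${d.selectedTier}`;
    tiers[key] = tiers[key] || { total: 0, success: 0 };
    tiers[key].total += 1;
    if (d.result === "success") tiers[key].success += 1;
  }
  for (const stats of Object.values(tiers)) {
    stats.successRate = stats.success / stats.total;
  }
  return tiers;
}

// Tiers below the target are candidates for a threshold review.
function tiersBelowTarget(rates, target = 0.9) {
  return Object.keys(rates).filter((tier) => rates[tier].successRate < target);
}

const sampleRates = tierSuccessRates([
  { selectedTier: 1, result: "success" },
  { selectedTier: 1, result: "success" },
  { selectedTier: 2, result: "success" },
  { selectedTier: 2, result: "failure" },
  { selectedTier: 3, result: "pending" },
]);
console.log(sampleRates, tiersBelowTarget(sampleRates));
```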

Troubleshooting

Problem: agent-performance.json file growing too large

Cause: History arrays not being trimmed

Solution: Implement automatic trimming after each update

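// `metrics` is the parsed contents of .claude/agent-performance.json
function updateAgentMetrics(agentId, execution) {
  // Create the entry on first use so new agents don't throw
  if (!metrics.agents[agentId]) {
    metrics.agents[agentId] = { totalRuns: 0, successCount: 0, failureCount: 0, history: [] };
  }
  const agent = metrics.agents[agentId];

  // Update aggregates
  agent.totalRuns += 1;
  if (execution.result === "success") {
    agent.successCount += 1;
  } else {
    agent.failureCount += 1;
  }

  // Add to history
  agent.history.push(execution);

  // Trim to max 100 entries (FIFO: the oldest entries are dropped)
  if (agent.history.length > 100) {
    agent.history = agent.history.slice(-100);
  }
}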

Problem: Metrics don't reflect recent changes

Cause: Stale data from old workflows

Solution: Reset metrics after major changes

# Backup current metrics
cp .claude/agent-performance.json .claude/agent-performance-backup-$(date +%Y%m%d).json

# Reset relevant sections (keep models, reset agents)
# Edit .claude/agent-performance.json manually or use script
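
For the "use script" option, a partial reset can be sketched as a pure function over the parsed metrics object. This is an illustration only: `resetSections` is a hypothetical helper, and reading and writing the JSON file is left to the caller.

```javascript
// Return a copy of `metrics` with the named top-level sections emptied.
function resetSections(metrics, sections, now = new Date()) {
  const empty = {
    agents: {},
    skills: {},
    models: {},
    routing: { tierDistribution: {}, decisions: [] },
  };
  const next = { ...metrics, lastUpdated: now.toISOString() };
  for (const section of sections) {
    if (!(section in empty)) throw new Error(`Unknown section: ${section}`);
    next[section] = empty[section];
  }
  return next;
}

// Reset agents but keep model history, as in the comment above.
const currentMetrics = {
  schemaVersion: "1.0.0",
  lastUpdated: "2026-01-28T15:30:00Z",
  agents: { "ui-developer": { totalRuns: 42 } },
  skills: {},
  models: { sonnet: { totalRuns: 120 } },
  routing: { tierDistribution: { tier1: 45 }, decisions: [] },
};
const resetResult = resetSections(currentMetrics, ["agents"]);
console.log(Object.keys(resetResult.agents).length, resetResult.models.sonnet.totalRuns);
```

The copy is shallow: sections that are not reset are shared with the input object, which is fine when the result is serialized straight back to disk.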

Problem: Success rate calculations seem wrong

Cause: Inconsistent result values ("success", "SUCCESS", "completed", etc.)

Solution: Normalize result values

function normalizeResult(result) {
  const successValues = ["success", "completed", "pass"];
  const failureValues = ["failure", "error", "fail"];

  // Compare case-insensitively so "SUCCESS" and "Success" both match
  const value = String(result).trim().toLowerCase();
  if (successValues.includes(value)) return "success";
  if (failureValues.includes(value)) return "failure";
  return "unknown";
}

// Use normalized values in metrics (`agent` and `execution` as in updateAgentMetrics)
const normalizedResult = normalizeResult(execution.result);
agent.successCount += normalizedResult === "success" ? 1 : 0;
agent.failureCount += normalizedResult === "failure" ? 1 : 0;

Summary

Performance tracking enables data-driven orchestration optimization through:

Agent success tracking (identify high-performers and underperformers)
Model performance comparison (find cost-effective alternatives)
Skill effectiveness analysis (discover successful patterns)
Routing optimization (adjust tier thresholds based on actual results)
Historical trend detection (alert on degradation, celebrate improvements)

Key metrics to monitor:

Agent success rate (target >70%, alert if <60%)
Model cost-effectiveness (cost per success, not just cost per task)
Routing tier accuracy (target >90% success per tier)
Skill activation correlation (identify high-value skills)
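
The agent success-rate thresholds above (target >70%, alert if <60%) translate into a simple status check. The sketch below is one way to encode them; the function name, the status labels, and the `minRuns` guard against judging an agent on too few runs are all assumptions added for illustration.

```javascript
function agentHealth(agent, { target = 0.7, alert = 0.6, minRuns = 10 } = {}) {
  if (agent.totalRuns < minRuns) return "insufficient-data";
  const rate = agent.successCount / agent.totalRuns;
  if (rate < alert) return "alert"; // below 60%: investigate now
  if (rate <= target) return "watch"; // between 60% and 70%: keep an eye on it
  return "healthy"; // above the 70% target
}

console.log(agentHealth({ totalRuns: 42, successCount: 38 })); // healthy
console.log(agentHealth({ totalRuns: 20, successCount: 11 })); // alert
```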

Master performance tracking and your orchestration workflows will continuously improve, delivering better results at lower costs.

Inspired By:

/review command (multi-model performance tracking)
/dev command (agent success rate monitoring)
task-complexity-router skill (routing feedback loops)
Production workflows (cost optimization, quality tracking)


Updated May 7, 2026


SKILL.md

name: performance-tracking
description: Track agent, skill, and model performance metrics for optimization. Use when measuring agent success rates, tracking model latency, analyzing routing effectiveness, or optimizing cost-per-task.
disable-model-invocation: true
version: 0.1.0
tags: [orchestration, performance, metrics, tracking, agent-performance, optimization]
keywords: [performance, metrics, tracking, agent, skill, success-rate, latency, cost, optimization, history, routing]
plugin: multimodel
updated: 2026-01-28
user-invocable: false

Performance Tracking

Version: 1.0.0
Purpose: Track agent, skill, and model performance metrics for continuous optimization
Status: Production Ready

Overview

This skill provides battle-tested patterns for:

Agent success tracking (completion rates, confidence scores, task type affinity)
Skill effectiveness (activation counts, success correlation, usage patterns)
Model performance (latency, cost, quality, provider comparison)
Routing optimization (tier distribution, routing accuracy, cost efficiency)
Historical analysis (trend detection, degradation alerts, pattern discovery)

Performance tracking enables continuous improvement by providing the data needed to make informed decisions about agent selection, model choice, and workflow routing.

Why Track Performance

Optimize Routing:

Identify which agents excel at specific task types
Route complex tasks to high-confidence agents
Avoid agents with low success rates for critical work

Identify Failing Agents:

Detect agents with <70% success rate
Alert when agent performance degrades
Replace or retrain underperforming agents

Reduce Costs:

Find cost-effective model alternatives
Identify expensive agents with low success rates
Optimize tier thresholds based on actual performance

Improve Quality:

Track correlation between confidence scores and success
Identify patterns in successful implementations
Learn which models produce best results for task types

What We Track

Agent Metrics:

Total runs, success/failure counts
Average confidence scores
Task type distribution
Last used timestamp
Individual execution history

Skill Metrics:

Activation counts per skill
Last activation timestamps
Success correlation (when skill active, what's success rate?)
Co-activation patterns

Model Metrics:

Total runs, success/failure counts
Average latency (response time)
Total cost (cumulative spend)
Cost per successful task
Last used timestamp

Routing Metrics:

Tier distribution (how often each tier selected)
Routing accuracy (did tier match complexity?)
Cost efficiency (tier1 vs tier4 cost ratio)
Decision history with outcomes

Integration with task-complexity-router

The performance tracker provides critical feedback to the task-complexity-router:

Routing Feedback Loop:

1. Router selects tier based on complexity
   → task-complexity-router analyzes task
   → Routes to tier2 (medium complexity)

2. Agent executes task
   → Records: tier=2, agent=ui-developer, result=success

3. Performance tracker updates metrics
   → tier2 usage +1
   → ui-developer success +1
   → Confidence in tier2 routing increases

4. Future routing decisions informed by history
   → Router sees tier2 has 85% success rate
   → Router sees ui-developer excels at UI tasks
   → Router confidently routes similar tasks to tier2

Metrics Schema

JSON Structure (Version 1.0.0)

Store performance metrics in .claude/agent-performance.json:

{
  "schemaVersion": "1.0.0",
  "lastUpdated": "2026-01-28T15:30:00Z",
  "agents": {
    "ui-developer": {
      "totalRuns": 42,
      "successCount": 38,
      "failureCount": 4,
      "avgConfidence": 0.85,
      "lastUsed": "2026-01-28T15:30:00Z",
      "taskTypes": {
        "implement-component": 15,
        "fix-styling": 12,
        "refactor-ui": 8,
        "review-code": 7
      },
      "history": [
        {
          "timestamp": "2026-01-28T15:30:00Z",
          "taskType": "implement-component",
          "result": "success",
          "confidence": 0.90,
          "duration": 45000,
          "tier": 2,
          "model": "sonnet"
        },
        {
          "timestamp": "2026-01-28T14:20:00Z",
          "taskType": "fix-styling",
          "result": "success",
          "confidence": 0.85,
          "duration": 30000,
          "tier": 1,
          "model": "sonnet"
        }
      ]
    },
    "backend-developer": {
      "totalRuns": 28,
      "successCount": 25,
      "failureCount": 3,
      "avgConfidence": 0.88,
      "lastUsed": "2026-01-28T14:00:00Z",
      "taskTypes": {
        "implement-api": 12,
        "fix-bug": 8,
        "database-migration": 5,
        "write-tests": 3
      },
      "history": []
    }
  },
  "skills": {
    "multi-model-validation": {
      "activations": 15,
      "lastActivated": "2026-01-28T15:00:00Z",
      "successCorrelation": 0.92,
      "coActivations": {
        "quality-gates": 12,
        "error-recovery": 8
      }
    },
    "task-complexity-router": {
      "activations": 68,
      "lastActivated": "2026-01-28T15:30:00Z",
      "successCorrelation": 0.85,
      "coActivations": {
        "multi-agent-coordination": 45,
        "hierarchical-coordinator": 30
      }
    }
  },
  "models": {
    "sonnet": {
      "totalRuns": 120,
      "successCount": 108,
      "failureCount": 12,
      "avgLatency": 2500,
      "totalCost": 0.45,
      "lastUsed": "2026-01-28T15:30:00Z",
      "taskTypePerformance": {
        "code-review": { "success": 25, "failure": 2 },
        "implementation": { "success": 40, "failure": 5 },
        "testing": { "success": 20, "failure": 3 }
      }
    },
    "grok": {
      "totalRuns": 35,
      "successCount": 30,
      "failureCount": 5,
      "avgLatency": 1800,
      "totalCost": 0.08,
      "lastUsed": "2026-01-28T13:00:00Z",
      "taskTypePerformance": {
        "code-review": { "success": 18, "failure": 2 },
        "implementation": { "success": 12, "failure": 3 }
      }
    }
  },
  "routing": {
    "tierDistribution": {
      "tier1": 45,
      "tier2": 30,
      "tier3": 15,
      "tier4": 8
    },
    "decisions": [
      {
        "timestamp": "2026-01-28T15:30:00Z",
        "taskType": "implement-component",
        "complexity": "medium",
        "selectedTier": 2,
        "agent": "ui-developer",
        "result": "success",
        "cost": 0.003
      },
      {
        "timestamp": "2026-01-28T14:20:00Z",
        "taskType": "fix-styling",
        "complexity": "low",
        "selectedTier": 1,
        "agent": "ui-developer",
        "result": "success",
        "cost": 0.001
      }
    ]
  }
}

Schema Field Definitions

Agent Metrics:

totalRuns: Total task executions
successCount: Tasks completed successfully
failureCount: Tasks that failed or required retry
avgConfidence: Rolling average of agent confidence scores (0.0-1.0)
lastUsed: ISO-8601 timestamp of last execution
taskTypes: Distribution of task types (understand agent specialization)
history: Array of recent executions (max 100 entries, FIFO)

Skill Metrics:

activations: Total times skill was triggered
lastActivated: ISO-8601 timestamp
successCorrelation: Success rate when this skill is active (0.0-1.0)
coActivations: Skills frequently activated together (detect patterns)

Model Metrics:

totalRuns: Total executions
successCount/failureCount: Outcome tracking
avgLatency: Average response time in milliseconds
totalCost: Cumulative spend in USD
lastUsed: ISO-8601 timestamp
taskTypePerformance: Success/failure breakdown by task type

Routing Metrics:

tierDistribution: Count of tasks routed to each tier
decisions: Array of routing decisions with outcomes (max 100, FIFO)

Tracking Patterns

Pattern 1: Capturing Agent Performance

After Agent Completes Task:

Execution Flow:

1. Agent executes task
   Agent: ui-developer
   Input: "Implement login form component"
   Result: Success
   Confidence: 0.90
   Duration: 45 seconds
   Tier: 2
   Model: sonnet

2. Update agent metrics
   Read: .claude/agent-performance.json
   Update:
     agents["ui-developer"].totalRuns += 1
     agents["ui-developer"].successCount += 1
     agents["ui-developer"].avgConfidence = rolling_avg(0.90)
     agents["ui-developer"].lastUsed = NOW
     agents["ui-developer"].taskTypes["implement-component"] += 1
     agents["ui-developer"].history.push({
       timestamp: NOW,
       taskType: "implement-component",
       result: "success",
       confidence: 0.90,
       duration: 45000,
       tier: 2,
       model: "sonnet"
     })
   Trim history if > 100 entries
   Write: .claude/agent-performance.json

3. Calculate derived metrics
   Success rate: successCount / totalRuns = 38/42 = 90.5%
   Avg duration: sum(history.duration) / history.length
   Task affinity: taskTypes sorted by count

After Agent Fails:

Failure Flow:

1. Agent fails task
   Agent: backend-developer
   Input: "Implement complex payment flow"
   Result: Failure (error, timeout, or low quality)
   Confidence: 0.65
   Tier: 3

2. Update failure metrics
   agents["backend-developer"].totalRuns += 1
   agents["backend-developer"].failureCount += 1
   agents["backend-developer"].avgConfidence = rolling_avg(0.65)
   agents["backend-developer"].history.push({
     timestamp: NOW,
     taskType: "implement-api",
     result: "failure",
     confidence: 0.65,
     duration: 120000,
     tier: 3,
     error: "Exceeded max iterations"
   })

3. Check for degradation
   If failureCount / totalRuns > 0.30:
     Alert: "backend-developer has >30% failure rate"
     Recommendation: "Review recent failures, retrain, or replace"
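
The success and failure flows above can share one update routine. The sketch below is a possible shape, not the skill's implementation: `recordAgentExecution` is a hypothetical name, and `rolling_avg` is interpreted here as a cumulative mean, which is an assumption since the flow does not define it.

```javascript
const MAX_HISTORY = 100;

function recordAgentExecution(agent, execution) {
  agent.totalRuns += 1;
  // Anything other than "success" is counted as a failure in this sketch.
  if (execution.result === "success") agent.successCount += 1;
  else agent.failureCount += 1;

  // Cumulative mean: new_avg = old_avg + (x - old_avg) / n
  agent.avgConfidence += (execution.confidence - agent.avgConfidence) / agent.totalRuns;
  agent.lastUsed = execution.timestamp;
  agent.taskTypes[execution.taskType] = (agent.taskTypes[execution.taskType] || 0) + 1;

  agent.history.push(execution);
  if (agent.history.length > MAX_HISTORY) {
    agent.history = agent.history.slice(-MAX_HISTORY); // drop the oldest entries
  }

  const failureRate = agent.failureCount / agent.totalRuns;
  return { failureRate, degraded: failureRate > 0.3 };
}

const backendDeveloper = {
  totalRuns: 28, successCount: 25, failureCount: 3, avgConfidence: 0.88,
  lastUsed: "2026-01-28T14:00:00Z", taskTypes: { "implement-api": 12 }, history: [],
};
const agentStatus = recordAgentExecution(backendDeveloper, {
  timestamp: "2026-01-28T15:45:00Z", taskType: "implement-api", result: "failure",
  confidence: 0.65, duration: 120000, tier: 3, error: "Exceeded max iterations",
});
console.log(agentStatus.degraded); // false: 4 failures in 29 runs is below 30%
```

Returning the degradation flag from the update keeps the alert check next to the data that triggers it.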

Pattern 2: Tracking Model Performance

After Model Execution:

Execution Flow:

1. Model completes task
   Model: grok
   Task: Code review
   Latency: 1800ms
   Cost: $0.002
   Result: Success

2. Update model metrics
   models["grok"].totalRuns += 1
   models["grok"].successCount += 1
   models["grok"].avgLatency = rolling_avg(1800)
   models["grok"].totalCost += 0.002
   models["grok"].lastUsed = NOW
   models["grok"].taskTypePerformance["code-review"].success += 1

3. Compare model performance
   Claude Sonnet: avgLatency=2500ms, cost=$0.45 (120 runs)
   Grok Fast: avgLatency=1800ms, cost=$0.08 (35 runs)

   Analysis:
   - Grok is 28% faster (1800ms vs 2500ms)
   - Grok is about 39% cheaper per run ($0.0023 vs $0.0038)
   - Both have similar success rates (86% vs 90%)

   Recommendation:
   - Use Grok for cost-sensitive tasks
   - Use Claude for critical tasks (higher success rate)
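
The update and comparison steps above reduce to a few lines of arithmetic. The following sketch assumes `rolling_avg` means a cumulative mean (the flow does not define it) and uses hypothetical helper names `recordModelRun` and `percentLower`; the totals are the ones quoted in the comparison.

```javascript
function recordModelRun(model, run) {
  model.totalRuns += 1;
  if (run.success) model.successCount += 1;
  else model.failureCount += 1;
  // Cumulative mean of latency in milliseconds
  model.avgLatency += (run.latencyMs - model.avgLatency) / model.totalRuns;
  model.totalCost += run.cost;
}

// How much lower `candidate` is than `baseline`, in percent.
function percentLower(candidate, baseline) {
  return ((baseline - candidate) / baseline) * 100;
}

const sonnetModel = { totalRuns: 120, successCount: 108, failureCount: 12, avgLatency: 2500, totalCost: 0.45 };
const grokModel = { totalRuns: 35, successCount: 30, failureCount: 5, avgLatency: 1800, totalCost: 0.08 };

const latencyGain = percentLower(grokModel.avgLatency, sonnetModel.avgLatency);
const perRunCostGain = percentLower(
  grokModel.totalCost / grokModel.totalRuns,
  sonnetModel.totalCost / sonnetModel.totalRuns
);
console.log(latencyGain.toFixed(0) + "% lower latency", perRunCostGain.toFixed(0) + "% lower cost per run");
```

Comparing cost per run rather than cumulative `totalCost` avoids rewarding a model simply for having been used less.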

Pattern 3: Recording Skill Activation

After Skill Activation:

Activation Flow:

1. Skill triggers
   Skill: multi-model-validation
   Context: User requested /review with 3 models

2. Update skill metrics
   skills["multi-model-validation"].activations += 1
   skills["multi-model-validation"].lastActivated = NOW

3. Track co-activation
   Active skills: ["multi-model-validation", "quality-gates"]
   skills["multi-model-validation"].coActivations["quality-gates"] += 1

4. Calculate success correlation
   Tasks with this skill active: 15
   Successful tasks: 14
   Success correlation: 14/15 = 93.3%

5. Pattern detection
   Observation: multi-model-validation + quality-gates = 100% success (12/12)
   Recommendation: Always pair these skills for high-quality reviews
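
Steps 2 and 3 of the activation flow can be sketched as a single update against the `skills` section of the metrics object. `recordSkillActivation` is a hypothetical helper; the field names follow the schema.

```javascript
function recordSkillActivation(skills, skillName, activeSkills, timestamp) {
  // Create the entry on first activation.
  if (!skills[skillName]) {
    skills[skillName] = { activations: 0, lastActivated: null, coActivations: {} };
  }
  const skill = skills[skillName];
  skill.activations += 1;
  skill.lastActivated = timestamp;
  for (const other of activeSkills) {
    if (other === skillName) continue; // a skill does not co-activate with itself
    skill.coActivations[other] = (skill.coActivations[other] || 0) + 1;
  }
  return skill;
}

const skillMetrics = {
  "multi-model-validation": {
    activations: 15,
    lastActivated: "2026-01-28T15:00:00Z",
    coActivations: { "quality-gates": 12, "error-recovery": 8 },
  },
};
recordSkillActivation(
  skillMetrics,
  "multi-model-validation",
  ["multi-model-validation", "quality-gates"],
  "2026-01-28T15:00:08Z"
);
console.log(skillMetrics["multi-model-validation"]);
```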

Pattern 4: Routing Decision Tracking

After Routing Decision:

Routing Flow:

1. Router selects tier
   Task: "Implement user profile page"
   Analysis: Medium complexity (multiple components, state management)
   Selected tier: 2
   Agent: ui-developer
   Model: sonnet

2. Record routing decision
   routing.tierDistribution["tier2"] += 1
   routing.decisions.push({
     timestamp: NOW,
     taskType: "implement-component",
     complexity: "medium",
     selectedTier: 2,
     agent: "ui-developer",
     result: "pending"
   })

3. After task completes
   Update decision with result:
   routing.decisions[last].result = "success"
   routing.decisions[last].cost = 0.003

4. Trim decision history if > 100 entries
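
The routing flow records a decision as pending and fills in the outcome later. One way to sketch this, assuming hypothetical helpers `recordRoutingDecision` and `completeRoutingDecision`, is to hand back the stored entry so the caller can complete it without relying on it still being the last element.

```javascript
const MAX_DECISIONS = 100;

function recordRoutingDecision(routing, decision) {
  const tierKey = `tier${decision.selectedTier}`;
  routing.tierDistribution[tierKey] = (routing.tierDistribution[tierKey] || 0) + 1;
  const entry = { ...decision, result: "pending" };
  routing.decisions.push(entry);
  if (routing.decisions.length > MAX_DECISIONS) {
    routing.decisions = routing.decisions.slice(-MAX_DECISIONS); // FIFO trim
  }
  return entry; // keep this handle to record the outcome later
}

function completeRoutingDecision(entry, result, cost) {
  entry.result = result;
  entry.cost = cost;
}

const routingMetrics = { tierDistribution: { tier2: 30 }, decisions: [] };
const pendingDecision = recordRoutingDecision(routingMetrics, {
  timestamp: "2026-01-28T15:30:00Z",
  taskType: "implement-component",
  complexity: "medium",
  selectedTier: 2,
  agent: "ui-developer",
});
completeRoutingDecision(pendingDecision, "success", 0.003);
console.log(routingMetrics.decisions[0]);
```

Holding a reference to the entry matters when several tasks are in flight, because another decision may be pushed before the first one finishes.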

Pattern 5: Session-Level Aggregation

End of Session Summary:

Session Summary Flow:

1. Aggregate session metrics
   Session ID: 2026-01-28-session-15
   Duration: 2 hours
   Tasks executed: 15
   Success rate: 14/15 = 93.3%
   Total cost: $0.045
   Models used: Claude (12), Grok (3)

2. Create session snapshot
   File: ai-docs/performance-history/2026-01-28-session-15.json
   Content:
     {
       "sessionId": "2026-01-28-session-15",
       "startTime": "2026-01-28T13:00:00Z",
       "endTime": "2026-01-28T15:00:00Z",
       "duration": 7200000,
       "tasks": 15,
       "successRate": 0.933,
       "totalCost": 0.045,
       "modelUsage": { "claude": 12, "grok": 3 },
       "topAgents": ["ui-developer", "backend-developer"],
       "activeSkills": ["task-complexity-router", "multi-model-validation"]
     }

3. Update rolling metrics
   .claude/agent-performance.json (persistent)
   ai-docs/performance-history/ (snapshots)

4. Cleanup old snapshots
   Keep last 100 session snapshots
   Delete older entries
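
The snapshot fields in step 2 can be derived from a list of per-task records. The sketch below assumes each record carries `result`, `model`, and `cost`; that record shape and the name `summarizeSession` are illustrative assumptions, and writing the file is left out.

```javascript
function summarizeSession(sessionId, startTime, endTime, tasks) {
  const successes = tasks.filter((t) => t.result === "success").length;
  const modelUsage = {};
  let totalCost = 0;
  for (const t of tasks) {
    modelUsage[t.model] = (modelUsage[t.model] || 0) + 1;
    totalCost += t.cost;
  }
  return {
    sessionId,
    startTime,
    endTime,
    duration: Date.parse(endTime) - Date.parse(startTime), // milliseconds
    tasks: tasks.length,
    successRate: tasks.length > 0 ? successes / tasks.length : 0,
    totalCost,
    modelUsage,
  };
}

const sessionSnapshot = summarizeSession(
  "2026-01-28-session-15",
  "2026-01-28T13:00:00Z",
  "2026-01-28T15:00:00Z",
  [
    { result: "success", model: "claude", cost: 0.003 },
    { result: "success", model: "claude", cost: 0.003 },
    { result: "failure", model: "grok", cost: 0.002 },
  ]
);
console.log(JSON.stringify(sessionSnapshot, null, 2));
```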

File Location and Management

Primary Performance File

Location: .claude/agent-performance.json

Purpose: Persistent, project-level performance tracking

When to Update:

After every agent execution
After every model execution
After every skill activation
After every routing decision

Format: JSON schema version 1.0.0 (see Metrics Schema section)

Rotation: Keep aggregate counters indefinitely, but trim each individual history array to its most recent 100 entries

Session Snapshots

Location: ai-docs/performance-history/

Purpose: Point-in-time session summaries for historical analysis

Naming: {YYYY-MM-DD}-session-{N}.json

Example:

ai-docs/performance-history/
  2026-01-28-session-1.json
  2026-01-28-session-2.json
  2026-01-27-session-1.json
  ...

Retention: Keep last 100 sessions, delete older
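
One way to apply the retention rule is to work out which snapshot files fall outside the newest 100, based only on the naming scheme above. The helper name `snapshotsToDelete` is hypothetical, and the sketch assumes session numbers increase over the course of a day; actually listing and deleting the files is left to the caller.

```javascript
// Hypothetical helper: given file names from ai-docs/performance-history/,
// return the names that fall outside the newest `keep` snapshots.
function snapshotsToDelete(fileNames, keep = 100) {
  const pattern = /^(\d{4}-\d{2}-\d{2})-session-(\d+)\.json$/;

  const parsed = [];
  for (const name of fileNames) {
    const match = pattern.exec(name);
    // Files that do not follow the naming scheme are never deleted.
    if (match) parsed.push({ name, date: match[1], n: Number(match[2]) });
  }

  // Newest first: later date wins, then higher session number.
  parsed.sort((a, b) => {
    if (a.date !== b.date) return a.date < b.date ? 1 : -1;
    return b.n - a.n;
  });

  return parsed.slice(keep).map((f) => f.name);
}
```

Parsing the session number as an integer matters here: a plain string sort would place session-10 before session-2.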

Integration with Existing Files

Relationship with ai-docs/llm-performance.json:

Comparison:

llm-performance.json (existing):
  - Model-specific performance
  - Cost tracking per model
  - Response time tracking
  - Used by multi-model-validation

agent-performance.json (new):
  - Agent-level metrics (multi-run aggregation)
  - Skill activation tracking
  - Routing decision history
  - Task type affinity

Integration:
  - agent-performance.json imports model data from llm-performance.json
  - Both files updated in parallel
  - llm-performance.json focuses on single-run details
  - agent-performance.json focuses on aggregate trends

Migration Path:

Step 1: Create .claude/agent-performance.json with schema 1.0.0
Step 2: Import historical data from llm-performance.json
Step 3: Update both files going forward
Step 4: Deprecate llm-performance.json after 6 months (optional)

Data Cleanup and Rotation

Automatic Cleanup:

Cleanup Rules:

1. Agent history arrays
   Max entries: 100
   Strategy: FIFO (oldest removed first)
   Trigger: After every agent execution

2. Routing decision arrays
   Max entries: 100
   Strategy: FIFO
   Trigger: After every routing decision

3. Session snapshots
   Max files: 100
   Strategy: FIFO (delete oldest session files)
   Trigger: After every session ends

4. Skill co-activation maps
   Max entries per skill: 50
   Strategy: Keep top 50 by count
   Trigger: Weekly cleanup
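
Rules 1 to 3 are plain FIFO trims; rule 4 is a little different because it keeps the highest counts rather than the newest entries. A minimal sketch, assuming a co-activation map is a plain object of skill name to count (the helper name `trimCoActivations` is hypothetical):

```javascript
// Hypothetical helper for cleanup rule 4: keep the top `maxEntries`
// co-activated skills by count and drop the rest.
function trimCoActivations(coActivations, maxEntries = 50) {
  const top = Object.entries(coActivations)
    .sort(([, countA], [, countB]) => countB - countA)
    .slice(0, maxEntries);
  return Object.fromEntries(top);
}
```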

Manual Cleanup:

When to manually reset:

1. After major workflow changes
   - Agent capabilities changed
   - New skills added
   - Routing logic updated
   → Reset metrics to start fresh

2. After agent retraining
   - Agent prompt updated
   - Agent model changed
   → Reset agent-specific metrics

3. After prolonged period (>6 months)
   - Metrics may be outdated
   → Archive old data, start fresh

How to reset:
  Backup: cp .claude/agent-performance.json .claude/agent-performance-backup-{DATE}.json
  Reset: echo '{"schemaVersion":"1.0.0","lastUpdated":"...","agents":{},...}' > .claude/agent-performance.json

Using Metrics for Optimization

Optimization 1: Identify Underperforming Agents

Detection:

Analyze agent success rates:

agents["ui-developer"]:
  successCount: 38
  totalRuns: 42
  success rate: 38/42 = 90.5% ✅ GOOD

agents["test-architect"]:
  successCount: 15
  totalRuns: 25
  success rate: 15/25 = 60% ❌ UNDERPERFORMING

Threshold: <70% success rate = underperforming
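
The detection step can be sketched as a filter over the `agents` map. The function name and the `minRuns` guard are assumptions added for illustration; the guard skips agents with no runs so the rate is never computed as 0/0.

```javascript
// Hypothetical helper: list agents whose success rate is below the threshold,
// worst first. Agents with fewer than `minRuns` runs are skipped.
function findUnderperformingAgents(agents, threshold = 0.7, minRuns = 1) {
  return Object.entries(agents)
    .filter(([, agent]) => agent.totalRuns >= minRuns)
    .map(([id, agent]) => ({ id, successRate: agent.successCount / agent.totalRuns }))
    .filter((agent) => agent.successRate < threshold)
    .sort((a, b) => a.successRate - b.successRate);
}
```

With the figures above, only test-architect (15/25) is returned; ui-developer (38/42) stays above the 70% threshold.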

Action:

For test-architect (60% success):

1. Analyze failure patterns
   Review history entries where result="failure"
   Common failure reasons (the counts exceed the 10 failed runs, so a run may log more than one reason):
     - "Tests too brittle" (8 occurrences)
     - "Missing test coverage" (5 occurrences)
     - "Test timeout" (2 occurrences)

2. Identify root cause
   Pattern: test-architect struggles with async/timing tests
   Evidence: All timeout failures involved async code

3. Take action
   Option A: Retrain agent
     - Update prompt with async testing best practices
     - Add examples of proper async test patterns
     - Reset metrics after retraining

   Option B: Route differently
     - Route async test tasks to backend-developer (90% success on async)
     - Keep test-architect for synchronous unit tests

   Option C: Replace agent
     - Create new specialized-async-test-architect
     - Deprecate test-architect for async work

Optimization 2: Find Cost-Effective Model Alternatives

Analysis:

Compare model cost-effectiveness:

Model: sonnet
  Total cost: $0.45
  Total runs: 120
  Success count: 108
  Cost per task: $0.0038
  Cost per success: $0.0042
  Success rate: 90%

Model: grok
  Total cost: $0.08
  Total runs: 35
  Success count: 30
  Cost per task: $0.0023
  Cost per success: $0.0027
  Success rate: 86%

Model: gemini
  Total cost: $0.02
  Total runs: 20
  Success count: 16
  Cost per task: $0.0010
  Cost per success: $0.0013
  Success rate: 80%

Cost Efficiency Ranking:
  1. Gemini Flash: $0.0013 per success (80% success rate)
  2. Grok Fast: $0.0027 per success (86% success rate)
  3. Claude Sonnet: $0.0042 per success (90% success rate)

Quality-Cost Tradeoff:
  - Gemini: 69% cheaper per success than Claude, but a 10-point lower success rate (80% vs 90%)
  - Grok: 36% cheaper per success than Claude, but a 4-point lower success rate (86% vs 90%)
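
The comparison above boils down to three ratios per model. A minimal sketch, assuming each model record carries `totalCost`, `totalRuns`, and `successCount` as in the analysis (the function name is hypothetical):

```javascript
// Hypothetical helper: compute per-model ratios and rank by cost per success.
// Models with no successes are skipped, since cost per success is undefined.
function rankByCostPerSuccess(models) {
  return Object.entries(models)
    .filter(([, m]) => m.successCount > 0)
    .map(([id, m]) => ({
      id,
      successRate: m.successCount / m.totalRuns,
      costPerTask: m.totalCost / m.totalRuns,
      costPerSuccess: m.totalCost / m.successCount,
    }))
    .sort((a, b) => a.costPerSuccess - b.costPerSuccess);
}
```

Ranking by cost per success rather than cost per task charges each model for its failed runs, which is why the two columns differ in the tables above.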

Action:

Optimization strategy:

Tier 1 (Simple tasks):
  Use: Gemini Flash
  Reason: Lowest cost, acceptable success rate for simple work
  Example: "Fix typo in comment", "Format code"

Tier 2 (Medium tasks):
  Use: Grok Fast
  Reason: Good balance of cost and quality
  Example: "Implement CRUD endpoint", "Add validation"

Tier 3 (Complex tasks):
  Use: Claude Sonnet
  Reason: Highest success rate justifies cost
  Example: "Design architecture", "Complex refactoring"

Tier 4 (Critical tasks):
  Use: Claude Sonnet + Multi-model validation
  Reason: Quality > cost for critical work
  Example: "Security review", "Production bug fix"

Expected savings:
  Current: 90% Claude usage × $0.0042 = $0.00378 avg per task
  Optimized: 20% Claude + 50% Grok + 30% Gemini = (0.20 × $0.0042) + (0.50 × $0.0027) + (0.30 × $0.0013) = $0.00258 avg per task
  Savings: 32% cost reduction with minimal quality impact
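
The savings estimate is a weighted average of per-model costs. A small sketch of that arithmetic, with a hypothetical `blendedCost` helper that takes the usage mix as shares summing to 1:

```javascript
// Hypothetical helper: weighted average cost for a usage mix.
// Each entry is { share, costPerSuccess }; shares must sum to 1.
function blendedCost(mix) {
  const totalShare = mix.reduce((sum, m) => sum + m.share, 0);
  if (Math.abs(totalShare - 1) > 1e-9) {
    throw new Error("usage shares must sum to 1");
  }
  return mix.reduce((sum, m) => sum + m.share * m.costPerSuccess, 0);
}

// The optimized mix from the strategy above, using the rounded per-success costs.
const optimizedMixCost = blendedCost([
  { share: 0.2, costPerSuccess: 0.0042 }, // Claude
  { share: 0.5, costPerSuccess: 0.0027 }, // Grok
  { share: 0.3, costPerSuccess: 0.0013 }, // Gemini
]);
```

With these rounded inputs the mix comes to roughly $0.0026 per task, about a third below the $0.00378 baseline.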

Optimization 3: Optimize Routing Tier Thresholds

Analysis:

Review tier distribution:

routing.tierDistribution:
  tier1: 45 tasks (45.9%)
  tier2: 30 tasks (30.6%)
  tier3: 15 tasks (15.3%)
  tier4: 8 tasks (8.2%)

Analyze tier accuracy:

Tier 1 (Simple):
  Tasks: 45
  Success: 42
  Failures: 3
  Success rate: 93.3% ✅
  Verdict: Well-calibrated

Tier 2 (Medium):
  Tasks: 30
  Success: 25
  Failures: 5
  Success rate: 83.3% ⚠️
  Verdict: Slightly low (target 90%)

Tier 3 (Complex):
  Tasks: 15
  Success: 12
  Failures: 3
  Success rate: 80.0% ⚠️
  Verdict: Too low (target 90%)

Tier 4 (Critical):
  Tasks: 8
  Success: 8
  Failures: 0
  Success rate: 100% ✅
  Verdict: Well-calibrated
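
Per-tier success rates are not stored in `routing.tierDistribution` (which only holds counts), so one plausible approach is to derive them from `routing.decisions`. A sketch under that assumption, with a hypothetical function name and the 90% target used above:

```javascript
// Hypothetical helper: derive per-tier success rates from routing.decisions.
// Decisions still marked "pending" are ignored.
function tierSuccessRates(decisions, target = 0.9) {
  const tiers = {};
  for (const d of decisions) {
    if (d.result !== "success" && d.result !== "failure") continue;
    const key = `tier${d.selectedTier}`;
    if (!tiers[key]) tiers[key] = { success: 0, failure: 0 };
    tiers[key][d.result] += 1;
  }
  for (const stats of Object.values(tiers)) {
    stats.successRate = stats.success / (stats.success + stats.failure);
    stats.meetsTarget = stats.successRate >= target;
  }
  return tiers;
}
```

Because the decision array is trimmed to 100 entries, rates computed this way reflect only recent routing, which is usually what calibration needs.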

Action:

Adjust tier thresholds:

Current thresholds (task-complexity-router):
  tier1: complexity score 0-3
  tier2: complexity score 4-6
  tier3: complexity score 7-9
  tier4: complexity score 10+

Problem: tier2 and tier3 have lower success rates
Root cause: Tasks slightly too complex for assigned tier

Optimized thresholds:
  tier1: complexity score 0-2 (narrower range)
  tier2: complexity score 3-5 (shift down)
  tier3: complexity score 6-8 (shift down)
  tier4: complexity score 9+ (broader range)

Rationale:
  - Shift more borderline tasks to higher tiers
  - Accept slightly higher cost for better success rates
  - tier2/tier3 success should improve to 90%+

Expected impact:
  - tier1 usage: 45 → 35 tasks (fewer simple tasks)
  - tier2 usage: 30 → 32 tasks (more medium tasks)
  - tier3 usage: 15 → 18 tasks (more complex tasks)
  - tier4 usage: 8 → 13 tasks (more critical tasks)
  - Overall success rate: 88% → 92%
  - Overall cost: +15% (acceptable tradeoff for quality)

Optimization 4: Detect Model-Task Affinity Patterns

Analysis:

Analyze task type performance by model:

Task type: code-review

Claude Sonnet:
  Success: 25, Failure: 2
  Success rate: 92.6% ✅

Grok Fast:
  Success: 18, Failure: 2
  Success rate: 90.0% ✅

Gemini Flash:
  Success: 10, Failure: 4
  Success rate: 71.4% ⚠️

→ Pattern: Claude and Grok excel at code review, Gemini struggles

Task type: implementation

Claude Sonnet:
  Success: 40, Failure: 5
  Success rate: 88.9% ✅

Grok Fast:
  Success: 12, Failure: 3
  Success rate: 80.0% ⚠️

Gemini Flash:
  Success: 6, Failure: 1
  Success rate: 85.7% ✅

→ Pattern: Claude best for implementation, Grok/Gemini acceptable

Task type: testing

Claude Sonnet:
  Success: 20, Failure: 3
  Success rate: 87.0% ✅

Grok Fast:
  Success: 0, Failure: 0
  Success rate: N/A

Gemini Flash:
  Success: 0, Failure: 0
  Success rate: N/A

→ Pattern: Only Claude has testing data (others not used for this)
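
The per-task-type comparison can be computed from each model's `taskTypePerformance` map. In this sketch the function name is hypothetical, and a model with no data for the task type gets `null` instead of a rate, mirroring the N/A entries above:

```javascript
// Hypothetical helper: success rate per model for one task type.
// Returns null for models with no recorded runs of that type.
function taskTypeSuccessRates(models, taskType) {
  const rates = {};
  for (const [id, model] of Object.entries(models)) {
    const perf = (model.taskTypePerformance || {})[taskType];
    const total = perf ? perf.success + perf.failure : 0;
    rates[id] = total === 0 ? null : perf.success / total;
  }
  return rates;
}
```

Keeping "no data" distinct from "0% success" prevents the router from treating an untested model as a failing one.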

Action:

Task-specific model routing:

code-review tasks:
  tier1: Grok Fast (90% success, low cost)
  tier2: Grok Fast (90% success, low cost)
  tier3: Claude Sonnet (93% success, high quality)
  tier4: Multi-model (Claude + Grok consensus)

implementation tasks:
  tier1: Gemini Flash (86% success, lowest cost)
  tier2: Grok Fast (80% success, medium cost)
  tier3: Claude Sonnet (89% success, highest quality)
  tier4: Claude Sonnet (89% success, proven)

testing tasks:
  tier1-4: Claude Sonnet (only model with proven testing capability)

Expected impact:
  - 25% cost savings on code reviews (use Grok instead of Claude)
  - 10% cost savings on implementation (use Gemini for simple)
  - Maintain quality (route by proven success rates)

Optimization 5: Alert on Performance Degradation

Detection:

Monitor for degradation:

Week 1 (baseline):
  ui-developer success rate: 90.5%
  Average task duration: 45s

Week 2:
  ui-developer success rate: 88.2% (↓2.3%)
  Average task duration: 48s (↑3s)

Week 3:
  ui-developer success rate: 85.1% (↓5.4% from baseline)
  Average task duration: 52s (↑7s from baseline)

Week 4:
  ui-developer success rate: 78.3% (↓12.2% from baseline) 🚨
  Average task duration: 58s (↑13s from baseline) 🚨

Threshold exceeded:
  ❌ Success rate dropped by more than 10 percentage points (78.3% vs 90.5%)
  ❌ Duration increased by more than 20% (58s vs 45s, +29%)

→ ALERT: ui-developer performance degraded significantly
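
A minimal sketch of the check, assuming success rates are stored as percentages and durations in seconds. The function name and the `limits` shape are hypothetical, and the sketch raises an alert when either threshold is exceeded, which is one reading of the rule above.

```javascript
// Hypothetical helper: compare a current week against the baseline week.
// successRate is in percent (e.g. 90.5); avgDurationSec is in seconds.
function detectDegradation(baseline, current, limits = { successDropPoints: 10, durationIncreaseRatio: 0.2 }) {
  // Drop measured in percentage points, not relative percent.
  const successDrop = baseline.successRate - current.successRate;
  // Relative increase in duration versus the baseline.
  const durationIncrease = (current.avgDurationSec - baseline.avgDurationSec) / baseline.avgDurationSec;

  const reasons = [];
  if (successDrop > limits.successDropPoints) reasons.push("success-rate");
  if (durationIncrease > limits.durationIncreaseRatio) reasons.push("duration");

  return { alert: reasons.length > 0, reasons, successDrop, durationIncrease };
}
```

With the weekly figures above, Week 3 stays below both limits while Week 4 trips both.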

Action:

Degradation response:

1. Investigate root cause
   Review recent history:
     - Task complexity increased? (Check taskTypes distribution)
     - Model changed? (Check model field in history)
     - Failures clustered around specific task type?

   Finding: All recent failures on "complex-state-management" tasks
   Root cause: New task type introduced, agent not trained for it

2. Take corrective action
   Option A: Retrain agent
     - Update prompt with state management patterns
     - Add examples of successful state management
     - Reset metrics after retraining

   Option B: Route differently
     - Route state management tasks to specialized agent
     - Keep ui-developer for simpler UI tasks

   Option C: Escalate to human
     - Alert: "ui-developer performance degraded"
     - Request: "Manual review of recent failures needed"

3. Monitor recovery
   Week 5 (after retraining):
     Success rate: 85.0% (recovering)
   Week 6:
     Success rate: 89.2% (near baseline)
   Week 7:
     Success rate: 91.0% (recovered ✅)

Integration with Orchestration Plugin

Integration 1: multi-model-validation

How multi-model-validation records model performance:

Multi-Model Review Flow:

1. Execute parallel review
   Models: [claude-sonnet, grok-fast, gemini-flash]
   Task: Code review of auth.ts

2. Collect model responses
   Each model returns:
     - Review findings
     - Confidence score
     - Latency
     - Cost

3. Record individual model performance
   For each model:
     models[modelId].totalRuns += 1
     models[modelId].avgLatency = rolling_avg(latency)
     models[modelId].totalCost += cost

4. Determine success/failure
   If review completed and returned findings (critical issues included, or none at all) → success (doing its job)
   If review crashed/errored → failure

5. Update success counts
   models[modelId].successCount += 1  (or failureCount)
   models[modelId].taskTypePerformance["code-review"].success += 1

6. Consolidate findings
   Generate consensus report
   Track which models agreed (co-occurrence patterns)

7. User feedback (optional)
   User rates review quality: "Helpful" | "Not helpful"
   Update successCorrelation for multi-model-validation skill
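
Steps 3 to 5 can be sketched as a single update function. The name `recordModelRun` and the `run` fields (`latencyMs`, `cost`, `taskType`, `errored`) are assumptions for illustration, and `rolling_avg` is interpreted here as a cumulative mean; an exponential moving average would be an equally reasonable reading.

```javascript
// Hypothetical helper covering steps 3-5 of the review flow.
function recordModelRun(models, modelId, run) {
  if (!models[modelId]) {
    models[modelId] = {
      totalRuns: 0, successCount: 0, failureCount: 0,
      avgLatency: 0, totalCost: 0, taskTypePerformance: {},
    };
  }
  const model = models[modelId];

  model.totalRuns += 1;
  // Incremental cumulative mean: no need to store every latency sample.
  model.avgLatency += (run.latencyMs - model.avgLatency) / model.totalRuns;
  model.totalCost += run.cost;

  if (!model.taskTypePerformance[run.taskType]) {
    model.taskTypePerformance[run.taskType] = { success: 0, failure: 0 };
  }
  const perf = model.taskTypePerformance[run.taskType];

  // A run counts as a failure only when the review crashed or errored.
  if (run.errored) {
    model.failureCount += 1;
    perf.failure += 1;
  } else {
    model.successCount += 1;
    perf.success += 1;
  }
  return model;
}
```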

Integration 2: task-complexity-router

How task-complexity-router reads performance data:

Routing Decision Flow:

1. Analyze task complexity
   Input: "Implement user authentication with OAuth"
   Analysis: Complex (multiple components, external API, security)
   Base tier: 3

2. Read performance history
   Load: .claude/agent-performance.json
   Check: routing.tierDistribution (task counts) and routing.decisions (per-tier outcomes)

3. Adjust tier based on history
   tier3 historical success rate: 80% (below 90% target)
   tier4 historical success rate: 100%

   Decision: Bump to tier4 for higher success probability

4. Select agent based on task type affinity
   Task type: "implement-api"
   Candidates: backend-developer, full-stack-developer

   Check affinity:
     backend-developer.taskTypes["implement-api"]: 12 (high affinity)
     full-stack-developer.taskTypes["implement-api"]: 3 (low affinity)

   Decision: Select backend-developer (proven track record)

5. Select model based on tier + task type
   tier4 + implement-api:
     models["sonnet"].taskTypePerformance["implementation"]: 89% success
     models["grok"].taskTypePerformance["implementation"]: 80% success

   Decision: Select sonnet (higher success rate for tier4)

6. Record routing decision
   routing.decisions.push({
     timestamp: NOW,
     taskType: "implement-api",
     complexity: "complex",
     selectedTier: 4,
     agent: "backend-developer",
     model: "sonnet",
     result: "pending"
   })

7. After execution, update result
   routing.decisions[last].result = "success"
   routing.decisions[last].cost = 0.005
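
Steps 3 and 4 of this flow can be sketched as two pure functions. Both names are hypothetical; `tierStats` is assumed to hold per-tier `{ success, failure }` counts derived from the decision history, and `taskTypes` is the per-agent count map used in the affinity check above.

```javascript
// Hypothetical helper for step 3: escalate while the current tier's
// historical success rate is below target. Tiers with no data are trusted.
function adjustTier(baseTier, tierStats, target = 0.9, maxTier = 4) {
  let tier = baseTier;
  while (tier < maxTier) {
    const stats = tierStats[`tier${tier}`];
    const total = stats ? stats.success + stats.failure : 0;
    if (total === 0 || stats.success / total >= target) break;
    tier += 1;
  }
  return tier;
}

// Hypothetical helper for step 4: pick the candidate with the most
// recorded runs of this task type (first candidate wins ties).
function selectAgentByAffinity(agents, candidates, taskType) {
  let best = null;
  let bestCount = -1;
  for (const id of candidates) {
    const taskTypes = (agents[id] || {}).taskTypes || {};
    const count = taskTypes[taskType] || 0;
    if (count > bestCount) {
      best = id;
      bestCount = count;
    }
  }
  return best;
}
```

One caveat with this escalation rule: if every intermediate tier is below target, all borderline work lands in the top tier, so it is worth pairing with the threshold tuning described under Optimization 3.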

Integration 3: hierarchical-coordinator

How hierarchical-coordinator tracks phase success:

Phase Execution Tracking:

1. Execute workflow phases
   Phase 1: Planning (architect agent)
   Phase 2: Implementation (developer agent)
   Phase 3: Testing (tester agent)
   Phase 4: Review (reviewer agent)

2. Track phase-level metrics
   Create phase-specific tracking:

   agents["architect"].phasePerformance = {
     "planning": { success: 15, failure: 2 },
     "architecture": { success: 8, failure: 1 }
   }

   agents["developer"].phasePerformance = {
     "implementation": { success: 25, failure: 5 },
     "refactoring": { success: 10, failure: 2 }
   }

3. Detect phase-specific issues
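   Analysis: developer fails 5 of 30 implementation runs (16.7%) and 2 of 12 refactoring runs (16.7%)
   Overall: developer succeeds on 35 of 42 runs (83%)

   Insight: 5 of the 7 failures fall in the implementation phase, so improving that phase removes the most failures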

4. Optimize phase assignment
   Current: developer handles all implementation
   Optimized: Split by complexity
     - Simple implementation → junior-developer (cheaper)
     - Complex implementation → senior-developer (higher success)

5. Track coordinator effectiveness
   skills["hierarchical-coordinator"].activations += 1
   skills["hierarchical-coordinator"].successCorrelation = 0.92

   Insight: Workflows using coordinator have 92% success (vs 80% without)

Integration 4: quality-gates

How quality-gates uses performance thresholds:

Quality Gate Decision:

1. Agent completes task
   Agent: ui-developer
   Task: "Implement dashboard component"
   Confidence: 0.75

2. Check agent performance history
   Load: agents["ui-developer"]
   Historical avg confidence: 0.85
   Current confidence: 0.75 (below average 🚨)

3. Apply quality gate
   Threshold: If confidence <= avg - 0.10, trigger validation

   Decision: 0.75 <= 0.85 - 0.10 = 0.75 (borderline, gate triggers)
   Action: Trigger designer validation (extra quality check)

4. Designer validates
   Result: Found 3 minor issues
   Verdict: Quality gate prevented low-quality work from proceeding

5. Update metrics
   Without gate: ui-developer would have 1 more failure
   With gate: Issues caught early, fixed before user sees

   Recompute skills["quality-gates"].successCorrelation with this run counted as a success
   (Success correlation is a 0-1 ratio; it rises when the gate prevents failures)

6. Continuous improvement
   Pattern: Low-confidence tasks benefit from extra validation
   Threshold: Automatically adjust based on correlation data
   Future: If confidence < 0.80, always trigger validation
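
The gate in step 3 is a one-line comparison, sketched here with a hypothetical function name. The borderline case (confidence exactly at the cutoff) is treated as a trigger, as in the decision above, and a small epsilon is added because 0.85 - 0.10 is not guaranteed to be exactly 0.75 in floating point.

```javascript
// Hypothetical helper: should extra validation run for this task?
function shouldTriggerValidation(confidence, historicalAvg, margin = 0.1) {
  const EPSILON = 1e-9; // tolerance for floating-point rounding at the boundary
  return confidence <= historicalAvg - margin + EPSILON;
}
```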

Best Practices

Do

✅ Track all agent executions (success and failure provide learning signal)
✅ Record model latency and cost (optimize for cost-effectiveness)
✅ Maintain execution history (detect patterns and trends)
✅ Set success rate thresholds (<70% = investigate, <50% = replace)
✅ Alert on performance degradation (>10% drop from baseline)
✅ Use task type affinity (route tasks to agents with proven success)
✅ Compare model cost-effectiveness (cost per success, not just cost per task)
✅ Track skill co-activation (identify successful skill combinations)
✅ Rotate history data (keep last 100 entries, prevent unbounded growth)
✅ Create session snapshots (point-in-time analysis)
✅ Integrate with routing (feed performance data back to router)

Don't

❌ Track only successes (failures provide valuable learning signal)
❌ Ignore degradation (small drops compound into big problems)
❌ Use stale data (>6 months old metrics may not reflect current state)
❌ Over-optimize on cost alone (balance cost and quality)
❌ Forget to update metrics (incomplete data leads to poor decisions)
❌ Store unbounded history (trim arrays to prevent file bloat)
❌ Mix session metrics (isolate session data for cleaner analysis)
❌ Ignore task type affinity (agents specialize, use it)
❌ Keep stale metrics after major changes (reset metrics when workflows change)

Privacy Considerations

What to Track:

Aggregate metrics (counts, averages, distributions)
Task types (generic categories like "implement-component")
Success/failure outcomes
Model performance data
Timing and cost data

What NOT to Track:

User-specific data (usernames, emails)
Sensitive code snippets
API keys or credentials
Personal information
Business logic details

Data Retention:

Keep aggregate metrics indefinitely (no PII)
Rotate detailed history after 100 entries
Delete session snapshots after 100 sessions
Archive old metrics before major resets

When to Reset Metrics

Situations Requiring Reset:

1. Agent capabilities changed
   - Prompt updated significantly
   - Agent model changed
   - Agent skills added/removed
   → Reset agent-specific metrics

2. Workflow architecture changed
   - New routing logic
   - New tier definitions
   - New skill combinations
   → Reset routing and skill metrics

3. Model pricing changed
   - Cost per token updated
   - New pricing tier
   → Reset cost calculations (keep counts)

4. After prolonged period (>6 months)
   - Metrics may be outdated
   - Workflow patterns changed
   → Archive and reset all metrics

How to Reset:

# Backup current metrics
cp .claude/agent-performance.json .claude/agent-performance-backup-$(date +%Y%m%d).json

# Reset to empty state
cat > .claude/agent-performance.json << 'EOF'
{
  "schemaVersion": "1.0.0",
  "lastUpdated": "2026-01-28T16:00:00Z",
  "agents": {},
  "skills": {},
  "models": {},
  "routing": {
    "tierDistribution": {},
    "decisions": []
  }
}
EOF

# Archive old session snapshots
mkdir -p ai-docs/performance-history/archive-$(date +%Y%m%d)
mv ai-docs/performance-history/*.json ai-docs/performance-history/archive-$(date +%Y%m%d)/

Metric Hygiene

Regular Maintenance:

Weekly:
  - Review top agents (ensure success rates >70%)
  - Check model cost trends (identify cost spikes)
  - Trim co-activation maps (keep top 50 per skill)

Monthly:
  - Analyze task type affinity changes
  - Compare model cost-effectiveness
  - Review tier distribution accuracy
  - Archive old session snapshots (keep last 100)

Quarterly:
  - Deep analysis of performance trends
  - Optimize routing thresholds
  - Identify underperforming patterns
  - Consider agent retraining or replacement

Annually:
  - Full metrics review and reset (if needed)
  - Archive all historical data
  - Update baseline success rates
  - Document lessons learned

Examples

Example 1: Tracking a Multi-Model Review Session

Scenario: User requests /review with 3 models (Claude, Grok, Gemini)

Execution:

Step 1: Initialize session
  Session ID: 2026-01-28-session-15
  Start time: 15:00:00Z

Step 2: Execute multi-model review
  Models: [claude-sonnet, grok-fast, gemini-flash]
  Task: Code review of auth/login.ts (450 lines)

Step 3: Track individual model executions

  Model: sonnet
    Start: 15:00:05Z
    End: 15:00:08Z
    Latency: 3000ms
    Cost: $0.003
    Result: Found 5 issues (2 CRITICAL, 3 HIGH)
    Outcome: Success

  Update metrics:
    models["sonnet"].totalRuns = 121
    models["sonnet"].successCount = 109
    models["sonnet"].avgLatency = 2520ms
    models["sonnet"].totalCost = $0.453
    models["sonnet"].taskTypePerformance["code-review"].success = 26

  Model: grok
    Start: 15:00:05Z
    End: 15:00:07Z
    Latency: 2000ms
    Cost: $0.002
    Result: Found 4 issues (2 CRITICAL, 2 HIGH)
    Outcome: Success

  Update metrics:
    models["grok"].totalRuns = 36
    models["grok"].successCount = 31
    models["grok"].avgLatency = 1820ms
    models["grok"].totalCost = $0.082
    models["grok"].taskTypePerformance["code-review"].success = 19

  Model: gemini
    Start: 15:00:05Z
    End: 15:00:06.5Z
    Latency: 1500ms
    Cost: $0.001
    Result: Found 3 issues (2 CRITICAL, 1 MEDIUM)
    Outcome: Success

  Update metrics:
    models["gemini"].totalRuns = 21
    models["gemini"].successCount = 17
    models["gemini"].avgLatency = 1480ms
    models["gemini"].totalCost = $0.021
    models["gemini"].taskTypePerformance["code-review"].success = 11

Step 4: Track skill activation
  skills["multi-model-validation"].activations = 16
  skills["multi-model-validation"].lastActivated = 15:00:08Z
  skills["multi-model-validation"].coActivations["quality-gates"] = 13

Step 5: Consolidate findings
  Consensus issues (all 3 models agreed):
    - CRITICAL: SQL injection vulnerability (UNANIMOUS)
    - CRITICAL: Missing authentication check (UNANIMOUS)

  Majority issues (2/3 models agreed):
    - HIGH: Insufficient input validation (Claude, Grok)
    - HIGH: Missing error handling (Claude, Grok)

  Divergent issues (1/3 models):
    - HIGH: One further issue (Claude only, not itemized here)
    - MEDIUM: Code duplication (Gemini only)

Step 6: Record session summary
  Session complete:
    Duration: 8 seconds
    Models used: 3
    Total cost: $0.006
    Issues found: 6 (2 unanimous, 2 majority, 2 divergent)
    Result: Success

  Create snapshot:
    ai-docs/performance-history/2026-01-28-session-15.json

Step 7: Update aggregate metrics
  Overall session success rate: 3/3 models successful = 100%
  Cost efficiency: $0.002 per model = good value

Insights from Tracking:

Performance comparison:
  Fastest: Gemini (1500ms) - half of Claude's latency (3000ms)
  Most thorough: Claude (5 issues) - Found 1 extra issue
  Best value: Gemini ($0.001, 3 issues) - Lowest cost, good coverage

Cost analysis:
  Total: $0.006 for 3-model review
  vs Single Claude: $0.003 (multi-model costs 2x, for 3 independent reviews)
  ROI: Found 2 CRITICAL issues all models agreed on = high confidence

Consensus validation:
  UNANIMOUS issues (100% confidence) → Fix immediately
  MAJORITY issues (67% confidence) → Fix before merge
  DIVERGENT issues (33% confidence) → Low priority (possible false positive)

Recommendation:
  Multi-model validation worth the cost for critical code (auth, payments, security)
  Single-model sufficient for non-critical code (UI components, docs)
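
The consensus levels described above can be computed by counting how many models reported each issue. This sketch assumes findings have already been normalized to matching issue keys, which in practice is the hard part; the function name and key format are hypothetical.

```javascript
// Hypothetical helper: classify issues by how many models reported them.
// findingsByModel maps a model id to a list of normalized issue keys.
function classifyFindings(findingsByModel) {
  const modelIds = Object.keys(findingsByModel);
  const reporters = new Map();

  for (const id of modelIds) {
    // A Set guards against a model listing the same issue twice.
    for (const issue of new Set(findingsByModel[id])) {
      if (!reporters.has(issue)) reporters.set(issue, []);
      reporters.get(issue).push(id);
    }
  }

  return [...reporters.entries()].map(([issue, models]) => {
    const agreement = models.length / modelIds.length;
    let level = "DIVERGENT";
    if (agreement === 1) level = "UNANIMOUS";
    else if (agreement > 0.5) level = "MAJORITY";
    return { issue, models, agreement, level };
  });
}
```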

Example 2: Identifying Model Performance Differences

Scenario: After 100 tasks, compare model performance for optimization

Execution:

Step 1: Load performance data
  Read: .claude/agent-performance.json

Step 2: Extract model metrics

  Claude Sonnet:
    Total runs: 120
    Success: 108, Failure: 12
    Success rate: 90.0%
    Avg latency: 2500ms
    Total cost: $0.45
    Cost per task: $0.00375
    Cost per success: $0.00417

  Grok Fast:
    Total runs: 35
    Success: 30, Failure: 5
    Success rate: 85.7%
    Avg latency: 1800ms
    Total cost: $0.08
    Cost per task: $0.00229
    Cost per success: $0.00267

  Gemini Flash:
    Total runs: 20
    Success: 16, Failure: 4
    Success rate: 80.0%
    Avg latency: 1500ms
    Total cost: $0.02
    Cost per task: $0.00100
    Cost per success: $0.00125

Step 3: Analyze task type performance

  Code Review:
    Claude: 25 success, 2 failure = 92.6%
    Grok: 18 success, 2 failure = 90.0%
    Gemini: 10 success, 4 failure = 71.4%

    Winner: Claude (highest quality)
    Best value: Grok (90% at lower cost)

  Implementation:
    Claude: 40 success, 5 failure = 88.9%
    Grok: 12 success, 3 failure = 80.0%
    Gemini: 6 success, 1 failure = 85.7%

    Winner: Claude (highest quality)
    Surprising: Gemini performs well here (86% success)

  Testing:
    Claude: 20 success, 3 failure = 87.0%
    Grok: No data
    Gemini: No data

    Winner: Claude (only option)
    Action: Try Grok/Gemini for testing tasks to gather data

Step 4: Calculate cost-effectiveness by task type

  Code Review Cost-Effectiveness:
    Claude: $0.00417 per success, 92.6% quality
    Grok: $0.00267 per success, 90.0% quality (36% cheaper, -2.6% quality)
    Gemini: $0.00125 per success, 71.4% quality (70% cheaper, -21.2% quality)

    Recommendation: Use Grok for cost-effective reviews (minimal quality loss)

  Implementation Cost-Effectiveness:
    Claude: $0.00417 per success, 88.9% quality
    Grok: $0.00267 per success, 80.0% quality (36% cheaper, -8.9% quality)
    Gemini: $0.00125 per success, 85.7% quality (70% cheaper, -3.2% quality)

    Recommendation: Use Gemini for simple implementation (best value)

Step 5: Generate optimization plan

  Current usage (120 total tasks):
    Claude: 100 tasks (83%)
    Grok: 15 tasks (13%)
    Gemini: 5 tasks (4%)

  Optimized usage (maintain quality >85%):
    tier1 (Simple): Gemini (30% of tasks)
    tier2 (Medium): Grok (40% of tasks)
    tier3 (Complex): Claude (25% of tasks)
    tier4 (Critical): Claude + Multi-model (5% of tasks)

  Expected impact:
    Current avg cost: $0.00375 per task
    Optimized avg cost: $0.00240 per task
    Savings: 36% cost reduction

    Current avg success: 88.5%
    Optimized avg success: 86.2% (projected)
    Quality impact: -2.3% (acceptable tradeoff)

Step 6: Implement gradual rollout

  Week 1: Route 20% of tier1 tasks to Gemini
    Monitor: Success rate, cost savings
    Target: >80% success rate

  Week 2: Route 40% of tier2 tasks to Grok
    Monitor: Success rate, cost savings
    Target: >85% success rate

  Week 3: Evaluate results
    If successful: Increase percentages
    If unsuccessful: Rollback and investigate

Step 7: Track optimization results

  After 2 weeks:
    Gemini tier1 success: 82% ✅ (above 80% target)
    Grok tier2 success: 87% ✅ (above 85% target)
    Cost savings: 28% ✅ (approaching 36% target)

  Decision: Continue rollout
  Next: Route 50% tier1 to Gemini, 60% tier2 to Grok

Insights from Analysis:

Key findings:
  1. Grok is best value for code reviews (90% quality at 36% lower cost)
  2. Gemini surprisingly good for implementation (86% vs 89% Claude)
  3. Claude still best for critical work (92% code review success)
  4. Latency varies significantly (Gemini 40% faster than Claude)

Optimization strategy:
  - Use Gemini for simple, latency-sensitive tasks
  - Use Grok for medium-complexity, cost-sensitive tasks
  - Use Claude for critical, quality-sensitive tasks
  - Use multi-model for maximum confidence (despite cost)

Expected ROI:
  - 36% cost reduction (from $0.00375 to $0.00240 per task)
  - 2.3% quality tradeoff (from 88.5% to 86.2% success)
  - Worth it: Save $135 per 100,000 tasks with minimal quality impact

Example 3: Optimizing Routing Based on Accumulated Data

Scenario: After about 100 routing decisions (98 recorded tasks), optimize tier thresholds

Execution:

Step 1: Load routing data
  Read: .claude/agent-performance.json
  Focus: routing.tierDistribution, routing.decisions

Step 2: Analyze tier distribution

  Current distribution:
    tier1: 45 tasks (45.9%)
    tier2: 30 tasks (30.6%)
    tier3: 15 tasks (15.3%)
    tier4: 8 tasks (8.2%)

  Skew analysis:
    Heavy on tier1 (46%) - Router prefers simple classification
    Light on tier4 (8%) - Router rarely escalates

Step 3: Calculate tier success rates

  tier1 (Simple tasks):
    Total: 45
    Success: 42, Failure: 3
    Success rate: 93.3% ✅
    Avg cost: $0.001
    Avg duration: 25s

  tier2 (Medium tasks):
    Total: 30
    Success: 25, Failure: 5
    Success rate: 83.3% ⚠️ (target: 90%)
    Avg cost: $0.002
    Avg duration: 45s

  tier3 (Complex tasks):
    Total: 15
    Success: 12, Failure: 3
    Success rate: 80.0% ⚠️ (target: 90%)
    Avg cost: $0.004
    Avg duration: 90s

  tier4 (Critical tasks):
    Total: 8
    Success: 8, Failure: 0
    Success rate: 100% ✅
    Avg cost: $0.008
    Avg duration: 120s

Step 4: Analyze tier2/tier3 failures

  tier2 failures (5 tasks):
    1. "Implement complex state management" (complexity: 6)
       - Should have been tier3 (underestimated)
    2. "Add authentication to API" (complexity: 6)
       - Should have been tier3 (security = critical)
    3. "Refactor component with hooks" (complexity: 5)
       - Correctly routed to tier2 (agent issue, not a routing issue)
    4. "Implement drag-and-drop" (complexity: 6)
       - Should have been tier3 (complex interaction)
    5. "Add real-time updates" (complexity: 6)
       - Should have been tier3 (WebSocket complexity)

  Pattern: 4/5 failures were borderline tier2/tier3 (complexity 6)
  Root cause: tier2 upper threshold too high (should be 5, not 6)

  tier3 failures (3 tasks):
    1. "Design microservices architecture" (complexity: 9)
       - Should have been tier4 (architecture = critical)
    2. "Implement payment processing" (complexity: 9)
       - Should have been tier4 (money = critical)
    3. "Refactor authentication system" (complexity: 8)
       - Correctly routed, agent struggled with complexity

  Pattern: 2/3 failures should have been tier4 (complexity 9)
  Root cause: tier3 upper threshold too high (should be 8, not 9)

Step 5: Propose threshold adjustments

  Current thresholds:
    tier1: complexity 0-3
    tier2: complexity 4-6
    tier3: complexity 7-9
    tier4: complexity 10+

  Problem: Borderline tasks (6, 9) cause failures

  Optimized thresholds:
    tier1: complexity 0-2 (narrower, more confident)
    tier2: complexity 3-5 (shift down, avoid borderline 6)
    tier3: complexity 6-8 (shift down, avoid borderline 9)
    tier4: complexity 9+ (broader, include borderline cases)

  Rationale:
    - Move borderline complexity 6 from tier2 → tier3
    - Move borderline complexity 9 from tier3 → tier4
    - Accept 15% higher cost for 10% better success rate

Step 6: Simulate new distribution

  Reclassify historical tasks with new thresholds:

  tier1 (0-2): 35 tasks (35.7%)
    Success rate: 34/35 = 97.1% ↑ (was 93.3%)

  tier2 (3-5): 32 tasks (32.7%)
    Success rate: 30/32 = 93.8% ↑ (was 83.3%)

  tier3 (6-8): 18 tasks (18.4%)
    Success rate: 17/18 = 94.4% ↑ (was 80.0%)

  tier4 (9+): 13 tasks (13.3%)
    Success rate: 13/13 = 100% ✓ (was 100%)

  Overall success rate: 94/98 = 95.9% ↑ (was 87/98 = 88.8%)

Step 7: Calculate cost impact

  Current avg cost: $0.00240 per task
  Optimized avg cost: $0.00281 per task (+17%)

  Cost breakdown (tier avg cost × share of the 98 tasks):
    tier1 (35.7%): $0.001 × 0.357 = $0.00036
    tier2 (32.7%): $0.002 × 0.327 = $0.00065
    tier3 (18.4%): $0.004 × 0.184 = $0.00074
    tier4 (13.3%): $0.008 × 0.133 = $0.00106
    Total: $0.00281

  ROI calculation:
    Cost increase: +$0.00041 per task (+17%)
    Success increase: +7.1 points (from 88.8% to 95.9%)
    Failure reduction: 11 → 4 failures (64% reduction)

  Value: Preventing 7 failures per 98 tasks is worth the 17% cost increase

Step 8: Implement new thresholds

  Update task-complexity-router skill:
    OLD:
      if (complexity <= 3) return "tier1";
      if (complexity <= 6) return "tier2";
      if (complexity <= 9) return "tier3";
      return "tier4";

    NEW:
      if (complexity <= 2) return "tier1";
      if (complexity <= 5) return "tier2";
      if (complexity <= 8) return "tier3";
      return "tier4";

  Document change:
    Reason: Performance data showed borderline tasks caused failures
    Expected: about 7-point success rate improvement, about 17% cost increase
    Monitoring: Track next 100 tasks to validate improvement
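
To make threshold changes easier to simulate against historical data, the OLD and NEW branches can be expressed as one function that takes the tier upper bounds as data. This is a sketch with hypothetical names, not the router's actual code.

```javascript
// Hypothetical helper: map a complexity score to a tier.
// upperBounds holds the inclusive upper bound of each tier except the last,
// e.g. [3, 6, 9] for the old thresholds and [2, 5, 8] for the new ones.
function tierFor(complexity, upperBounds) {
  for (let i = 0; i < upperBounds.length; i++) {
    if (complexity <= upperBounds[i]) return `tier${i + 1}`;
  }
  return `tier${upperBounds.length + 1}`;
}

// Count how a list of historical complexity scores would be distributed.
function tierCounts(complexities, upperBounds) {
  const counts = {};
  for (const c of complexities) {
    const tier = tierFor(c, upperBounds);
    counts[tier] = (counts[tier] || 0) + 1;
  }
  return counts;
}
```

Running `tierCounts` over the stored complexity scores with both bound sets reproduces the kind of before/after distribution shown in Step 6 without touching the live router.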

Step 9: Monitor post-optimization

  After 50 tasks with new thresholds:
    tier1: 18 tasks, 18 success = 100% ✅
    tier2: 16 tasks, 15 success = 93.8% ✅
    tier3: 10 tasks, 9 success = 90.0% ✅
    tier4: 6 tasks, 6 success = 100% ✅

  Overall: 48/50 = 96.0% success ✅ (matches the 95.9% projection)
  Avg cost: $0.00280 ✅ (close to the $0.00276 projection)

  Verdict: Optimization successful, keep new thresholds
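The Step 9 check can be automated with a simple tolerance comparison. This is only a sketch: `validateRollout` is a hypothetical helper, and the 5% relative tolerance is an assumed default, not a value from the skill.

```javascript
// True when `observed` is within `tolerance` (relative) of `projected`.
function withinTolerance(observed, projected, tolerance) {
  return Math.abs(observed - projected) / projected <= tolerance;
}

// Compare observed post-rollout numbers against the projection.
function validateRollout(observed, projected, tolerance = 0.05) {
  const observedRate = observed.successes / observed.total;
  return {
    observedRate,
    successOk: withinTolerance(observedRate, projected.successRate, tolerance),
    costOk: withinTolerance(observed.avgCost, projected.avgCost, tolerance),
  };
}

// Numbers from the walkthrough: 48/50 successes at $0.00280 per task,
// against a projection of 95.9% success at $0.00276 per task.
const rollout = validateRollout(
  { total: 50, successes: 48, avgCost: 0.0028 },
  { successRate: 0.959, avgCost: 0.00276 }
);
console.log(rollout); // { observedRate: 0.96, successOk: true, costOk: true }
```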

Insights from Optimization:

Key findings:
  1. Borderline complexity scores (6, 9) caused most failures
  2. Router was too aggressive in keeping tasks at lower tiers
  3. Small threshold adjustments (6→5, 9→8) had big impact

Optimization results:
  - Success rate: 87.8% → 96.0% (+8.2 percentage points)
  - Failure rate: 12.2% → 4.0% (-67%)
  - Cost per task: $0.00240 → $0.00280 (+17%)
  - ROI: Strong (quality improvement worth cost increase)

Lessons learned:
  - Track tier success rates, not just overall success
  - Borderline cases benefit from tier escalation
  - Performance data reveals routing blind spots
  - Continuous monitoring enables iterative improvement

Next steps:
  - Continue monitoring for 100 more tasks
  - Consider dynamic thresholds (adjust based on live data)
  - Explore agent-specific routing (some agents handle complexity better)

Troubleshooting

Problem: agent-performance.json file growing too large

Cause: History arrays not being trimmed

Solution: Implement automatic trimming after each update

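const MAX_HISTORY = 100;

// Assumes `metrics` holds the parsed contents of agent-performance.json
function updateAgentMetrics(agentId, execution) {
  // Create the agent entry on first use so the updates below cannot throw
  if (!metrics.agents[agentId]) {
    metrics.agents[agentId] = { totalRuns: 0, successCount: 0, history: [] };
  }
  const agent = metrics.agents[agentId];

  // Update aggregates
  agent.totalRuns += 1;
  agent.successCount += execution.result === "success" ? 1 : 0;

  // Add to history
  agent.history.push(execution);

  // Keep only the most recent MAX_HISTORY entries (oldest dropped first)
  if (agent.history.length > MAX_HISTORY) {
    agent.history = agent.history.slice(-MAX_HISTORY);
  }
}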

Problem: Metrics don't reflect recent changes

Cause: Stale data from old workflows

Solution: Reset metrics after major changes

# Backup current metrics
cp .claude/agent-performance.json .claude/agent-performance-backup-$(date +%Y%m%d).json

# Reset relevant sections (keep models, reset agents)
# Edit .claude/agent-performance.json manually or use script
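For the "use script" option, a small Node.js script along these lines could do the backup and reset in one step. It is a sketch under assumptions: the top-level `agents` key matches the `metrics.agents` usage elsewhere in this document, while the idea that everything else (such as a `models` section) should be preserved untouched is inferred from the comment above. `resetAgents` and `resetMetricsFile` are hypothetical names.

```javascript
const fs = require("node:fs");

// Return a copy of the metrics object with the agents section cleared.
// All other top-level sections (for example, models) are kept as-is.
function resetAgents(metrics) {
  return { ...metrics, agents: {} };
}

// Back up the metrics file next to the original, then write the reset version.
// Assumes `path` ends in ".json". The date stamp uses UTC.
function resetMetricsFile(path) {
  const stamp = new Date().toISOString().slice(0, 10).replace(/-/g, "");
  const backupPath = path.replace(/\.json$/, `-backup-${stamp}.json`);
  fs.copyFileSync(path, backupPath);

  const metrics = JSON.parse(fs.readFileSync(path, "utf8"));
  fs.writeFileSync(path, JSON.stringify(resetAgents(metrics), null, 2));
  return backupPath;
}

// Usage (uncomment to run against the real file):
// resetMetricsFile(".claude/agent-performance.json");
```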

Problem: Success rate calculations seem wrong

Cause: Inconsistent result values ("success", "SUCCESS", "completed", etc.)

Solution: Normalize result values

function normalizeResult(result) {
  const successValues = ["success", "completed", "pass"];
  const failureValues = ["failure", "error", "fail"];

  // Compare case-insensitively so "SUCCESS", "Success", and "success" all match
  const value = String(result).trim().toLowerCase();

  if (successValues.includes(value)) return "success";
  if (failureValues.includes(value)) return "failure";
  return "unknown";
}

// Use normalized values in metrics
const normalizedResult = normalizeResult(execution.result);
agent.successCount += normalizedResult === "success" ? 1 : 0;
agent.failureCount += normalizedResult === "failure" ? 1 : 0;

Summary

Performance tracking enables data-driven orchestration optimization through:

Agent success tracking (identify high-performers and underperformers)
Model performance comparison (find cost-effective alternatives)
Skill effectiveness analysis (discover successful patterns)
Routing optimization (adjust tier thresholds based on actual results)
Historical trend detection (alert on degradation, celebrate improvements)

Key metrics to monitor:

Agent success rate (target >70%, alert if <60%)
Model cost-effectiveness (cost per success, not just cost per task)
Routing tier accuracy (target >90% success per tier)
Skill activation correlation (identify high-value skills)
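Two of the metrics above translate directly into small checks. The sketch below uses the 70% target and 60% alert thresholds from the list; the status labels, the function names, and the two example models are hypothetical.

```javascript
// Classify an agent by success rate: alert below 60%, on target above 70%.
function agentStatus(successRate) {
  if (successRate < 0.6) return "alert";
  if (successRate > 0.7) return "on-target";
  return "watch"; // between the alert line and the target
}

// Cost per successful task. Returns Infinity when nothing succeeded.
function costPerSuccess(totalCost, successCount) {
  return successCount > 0 ? totalCost / successCount : Infinity;
}

// Hypothetical models: B is cheaper per task but fails more often.
const modelA = { costPerTask: 0.004, runs: 100, successes: 95 };
const modelB = { costPerTask: 0.003, runs: 100, successes: 60 };

for (const m of [modelA, modelB]) {
  const total = m.costPerTask * m.runs;
  console.log(costPerSuccess(total, m.successes).toFixed(5));
}
// "0.00421" for model A, "0.00500" for model B
```

In this made-up comparison the model with the lower cost per task ends up more expensive per success, which is the reason to track cost per success rather than cost per task alone.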

With performance tracking in place, your orchestration workflows can improve continuously: you can see which changes raise success rates, what they cost, and whether the trade-off is worth it.

Inspired By:

/review command (multi-model performance tracking)
/dev command (agent success rate monitoring)
task-complexity-router skill (routing feedback loops)
Production workflows (cost optimization, quality tracking)

Install manually

# Clone the repo
git clone https://github.com/madappgang/magus.git

# Copy into Claude Code skills folder (global)
cp -r magus/plugins/multimodel/skills/performance-tracking ~/.claude/skills/

Repository

madappgang/magus

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT