plugins/multimodel/skills/performance-tracking/SKILL.md
Track agent, skill, and model performance metrics for optimization. Use when measuring agent success rates, tracking model latency, analyzing routing effectiveness, or optimizing cost-per-task.
npx skillsauth add madappgang/magus performance-trackingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Version: 1.0.0 Purpose: Track agent, skill, and model performance metrics for continuous optimization Status: Production Ready
Performance tracking transforms workflows from "fire and forget" to data-driven optimization systems. By measuring what actually works, you can route tasks more effectively, identify failing patterns early, and reduce costs.
This skill provides battle-tested patterns for:
Performance tracking enables continuous improvement by providing the data needed to make informed decisions about agent selection, model choice, and workflow routing.
Optimize Routing:
Identify Failing Agents:
Reduce Costs:
Improve Quality:
Agent Metrics:
Skill Metrics:
Model Metrics:
Routing Metrics:
The performance tracker provides critical feedback to the task-complexity-router:
Routing Feedback Loop:
1. Router selects tier based on complexity
→ task-complexity-router analyzes task
→ Routes to tier2 (medium complexity)
2. Agent executes task
→ Records: tier=2, agent=ui-developer, result=success
3. Performance tracker updates metrics
→ tier2 usage +1
→ ui-developer success +1
→ Confidence in tier2 routing increases
4. Future routing decisions informed by history
→ Router sees tier2 has 85% success rate
→ Router sees ui-developer excels at UI tasks
→ Router confidently routes similar tasks to tier2
Store performance metrics in .claude/agent-performance.json:
{
"schemaVersion": "1.0.0",
"lastUpdated": "2026-01-28T15:30:00Z",
"agents": {
"ui-developer": {
"totalRuns": 42,
"successCount": 38,
"failureCount": 4,
"avgConfidence": 0.85,
"lastUsed": "2026-01-28T15:30:00Z",
"taskTypes": {
"implement-component": 15,
"fix-styling": 12,
"refactor-ui": 8,
"review-code": 7
},
"history": [
{
"timestamp": "2026-01-28T15:30:00Z",
"taskType": "implement-component",
"result": "success",
"confidence": 0.90,
"duration": 45000,
"tier": 2,
"model": "sonnet"
},
{
"timestamp": "2026-01-28T14:20:00Z",
"taskType": "fix-styling",
"result": "success",
"confidence": 0.85,
"duration": 30000,
"tier": 1,
"model": "sonnet"
}
]
},
"backend-developer": {
"totalRuns": 28,
"successCount": 25,
"failureCount": 3,
"avgConfidence": 0.88,
"lastUsed": "2026-01-28T14:00:00Z",
"taskTypes": {
"implement-api": 12,
"fix-bug": 8,
"database-migration": 5,
"write-tests": 3
},
"history": []
}
},
"skills": {
"multi-model-validation": {
"activations": 15,
"lastActivated": "2026-01-28T15:00:00Z",
"successCorrelation": 0.92,
"coActivations": {
"quality-gates": 12,
"error-recovery": 8
}
},
"task-complexity-router": {
"activations": 68,
"lastActivated": "2026-01-28T15:30:00Z",
"successCorrelation": 0.85,
"coActivations": {
"multi-agent-coordination": 45,
"hierarchical-coordinator": 30
}
}
},
"models": {
"sonnet": {
"totalRuns": 120,
"successCount": 108,
"failureCount": 12,
"avgLatency": 2500,
"totalCost": 0.45,
"lastUsed": "2026-01-28T15:30:00Z",
"taskTypePerformance": {
"code-review": { "success": 25, "failure": 2 },
"implementation": { "success": 40, "failure": 5 },
"testing": { "success": 20, "failure": 3 }
}
},
"grok": {
"totalRuns": 35,
"successCount": 30,
"failureCount": 5,
"avgLatency": 1800,
"totalCost": 0.08,
"lastUsed": "2026-01-28T13:00:00Z",
"taskTypePerformance": {
"code-review": { "success": 18, "failure": 2 },
"implementation": { "success": 12, "failure": 3 }
}
}
},
"routing": {
"tierDistribution": {
"tier1": 45,
"tier2": 30,
"tier3": 15,
"tier4": 8
},
"decisions": [
{
"timestamp": "2026-01-28T15:30:00Z",
"taskType": "implement-component",
"complexity": "medium",
"selectedTier": 2,
"agent": "ui-developer",
"result": "success",
"cost": 0.003
},
{
"timestamp": "2026-01-28T14:20:00Z",
"taskType": "fix-styling",
"complexity": "low",
"selectedTier": 1,
"agent": "ui-developer",
"result": "success",
"cost": 0.001
}
]
}
}
Agent Metrics:
totalRuns: Total task executionssuccessCount: Tasks completed successfullyfailureCount: Tasks that failed or required retryavgConfidence: Rolling average of agent confidence scores (0.0-1.0)lastUsed: ISO-8601 timestamp of last executiontaskTypes: Distribution of task types (understand agent specialization)history: Array of recent executions (max 100 entries, FIFO)Skill Metrics:
activations: Total times skill was triggeredlastActivated: ISO-8601 timestampsuccessCorrelation: Success rate when this skill is active (0.0-1.0)coActivations: Skills frequently activated together (detect patterns)Model Metrics:
totalRuns: Total executionssuccessCount/failureCount: Outcome trackingavgLatency: Average response time in millisecondstotalCost: Cumulative spend in USDlastUsed: ISO-8601 timestamptaskTypePerformance: Success/failure breakdown by task typeRouting Metrics:
tierDistribution: Count of tasks routed to each tierdecisions: Array of routing decisions with outcomes (max 100, FIFO)After Agent Completes Task:
Execution Flow:
1. Agent executes task
Task: ui-developer
Input: "Implement login form component"
Result: Success
Confidence: 0.90
Duration: 45 seconds
Tier: 2
Model: sonnet
2. Update agent metrics
Read: .claude/agent-performance.json
Update:
agents["ui-developer"].totalRuns += 1
agents["ui-developer"].successCount += 1
agents["ui-developer"].avgConfidence = rolling_avg(0.90)
agents["ui-developer"].lastUsed = NOW
agents["ui-developer"].taskTypes["implement-component"] += 1
agents["ui-developer"].history.push({
timestamp: NOW,
taskType: "implement-component",
result: "success",
confidence: 0.90,
duration: 45000,
tier: 2,
model: "sonnet"
})
Trim history if > 100 entries
Write: .claude/agent-performance.json
3. Calculate derived metrics
Success rate: successCount / totalRuns = 38/42 = 90.5%
Avg duration: sum(history.duration) / history.length
Task affinity: taskTypes sorted by count
After Agent Fails:
Failure Flow:
1. Agent fails task
Task: backend-developer
Input: "Implement complex payment flow"
Result: Failure (error, timeout, or low quality)
Confidence: 0.65
Tier: 3
2. Update failure metrics
agents["backend-developer"].totalRuns += 1
agents["backend-developer"].failureCount += 1
agents["backend-developer"].avgConfidence = rolling_avg(0.65)
agents["backend-developer"].history.push({
timestamp: NOW,
taskType: "implement-api",
result: "failure",
confidence: 0.65,
duration: 120000,
tier: 3,
error: "Exceeded max iterations"
})
3. Check for degradation
If failureCount / totalRuns > 0.30:
Alert: "backend-developer has >30% failure rate"
Recommendation: "Review recent failures, retrain, or replace"
After Model Execution:
Execution Flow:
1. Model completes task
Model: grok
Task: Code review
Latency: 1800ms
Cost: $0.002
Result: Success
2. Update model metrics
models["grok"].totalRuns += 1
models["grok"].successCount += 1
models["grok"].avgLatency = rolling_avg(1800)
models["grok"].totalCost += 0.002
models["grok"].lastUsed = NOW
models["grok"].taskTypePerformance["code-review"].success += 1
3. Compare model performance
Claude Sonnet: avgLatency=2500ms, cost=$0.45 (120 runs)
Grok Fast: avgLatency=1800ms, cost=$0.08 (35 runs)
Analysis:
- Grok is 28% faster (1800ms vs 2500ms)
- Grok is 82% cheaper per run ($0.0023 vs $0.0038)
- Both have similar success rates (86% vs 90%)
Recommendation:
- Use Grok for cost-sensitive tasks
- Use Claude for critical tasks (higher success rate)
After Skill Activation:
Activation Flow:
1. Skill triggers
Skill: multi-model-validation
Context: User requested /review with 3 models
2. Update skill metrics
skills["multi-model-validation"].activations += 1
skills["multi-model-validation"].lastActivated = NOW
3. Track co-activation
Active skills: ["multi-model-validation", "quality-gates"]
skills["multi-model-validation"].coActivations["quality-gates"] += 1
4. Calculate success correlation
Tasks with this skill active: 15
Successful tasks: 14
Success correlation: 14/15 = 93.3%
5. Pattern detection
Observation: multi-model-validation + quality-gates = 100% success (12/12)
Recommendation: Always pair these skills for high-quality reviews
After Routing Decision:
Routing Flow:
1. Router selects tier
Task: "Implement user profile page"
Analysis: Medium complexity (multiple components, state management)
Selected tier: 2
Agent: ui-developer
Model: sonnet
2. Record routing decision
routing.tierDistribution["tier2"] += 1
routing.decisions.push({
timestamp: NOW,
taskType: "implement-component",
complexity: "medium",
selectedTier: 2,
agent: "ui-developer",
result: "pending"
})
3. After task completes
Update decision with result:
routing.decisions[last].result = "success"
routing.decisions[last].cost = 0.003
4. Trim decision history if > 100 entries
End of Session Summary:
Session Summary Flow:
1. Aggregate session metrics
Session ID: 2026-01-28-session-15
Duration: 2 hours
Tasks executed: 15
Success rate: 14/15 = 93.3%
Total cost: $0.045
Models used: Claude (12), Grok (3)
2. Create session snapshot
File: ai-docs/performance-history/2026-01-28-session-15.json
Content:
{
"sessionId": "2026-01-28-session-15",
"startTime": "2026-01-28T13:00:00Z",
"endTime": "2026-01-28T15:00:00Z",
"duration": 7200000,
"tasks": 15,
"successRate": 0.933,
"totalCost": 0.045,
"modelUsage": { "claude": 12, "grok": 3 },
"topAgents": ["ui-developer", "backend-developer"],
"activeSkills": ["task-complexity-router", "multi-model-validation"]
}
3. Update rolling metrics
.claude/agent-performance.json (persistent)
ai-docs/performance-history/ (snapshots)
4. Cleanup old snapshots
Keep last 100 session snapshots
Delete older entries
Location: .claude/agent-performance.json
Purpose: Persistent, project-level performance tracking
When to Update:
Format: JSON schema version 1.0.0 (see Metrics Schema section)
Rotation: Keep full history, but trim individual history arrays to 100 entries
Location: ai-docs/performance-history/
Purpose: Point-in-time session summaries for historical analysis
Naming: {YYYY-MM-DD}-session-{N}.json
Example:
ai-docs/performance-history/
2026-01-28-session-1.json
2026-01-28-session-2.json
2026-01-27-session-1.json
...
Retention: Keep last 100 sessions, delete older
Relationship with ai-docs/llm-performance.json:
Comparison:
llm-performance.json (existing):
- Model-specific performance
- Cost tracking per model
- Response time tracking
- Used by multi-model-validation
agent-performance.json (new):
- Agent-level metrics (multi-run aggregation)
- Skill activation tracking
- Routing decision history
- Task type affinity
Integration:
- agent-performance.json imports model data from llm-performance.json
- Both files updated in parallel
- llm-performance.json focuses on single-run details
- agent-performance.json focuses on aggregate trends
Migration Path:
Step 1: Create .claude/agent-performance.json with schema 1.0.0
Step 2: Import historical data from llm-performance.json
Step 3: Update both files going forward
Step 4: Deprecate llm-performance.json after 6 months (optional)
Automatic Cleanup:
Cleanup Rules:
1. Agent history arrays
Max entries: 100
Strategy: FIFO (oldest removed first)
Trigger: After every agent execution
2. Routing decision arrays
Max entries: 100
Strategy: FIFO
Trigger: After every routing decision
3. Session snapshots
Max files: 100
Strategy: FIFO (delete oldest session files)
Trigger: After every session ends
4. Skill co-activation maps
Max entries per skill: 50
Strategy: Keep top 50 by count
Trigger: Weekly cleanup
Manual Cleanup:
When to manually reset:
1. After major workflow changes
- Agent capabilities changed
- New skills added
- Routing logic updated
→ Reset metrics to start fresh
2. After agent retraining
- Agent prompt updated
- Agent model changed
→ Reset agent-specific metrics
3. After prolonged period (>6 months)
- Metrics may be outdated
→ Archive old data, start fresh
How to reset:
Backup: cp .claude/agent-performance.json .claude/agent-performance-backup-{DATE}.json
Reset: echo '{"schemaVersion":"1.0.0","lastUpdated":"...","agents":{},...}' > .claude/agent-performance.json
Detection:
Analyze agent success rates:
agents["ui-developer"]:
successCount: 38
totalRuns: 42
success rate: 38/42 = 90.5% ✅ GOOD
agents["test-architect"]:
successCount: 15
totalRuns: 25
success rate: 15/25 = 60% ❌ UNDERPERFORMING
Threshold: <70% success rate = underperforming
Action:
For test-architect (60% success):
1. Analyze failure patterns
Review history entries where result="failure"
Common failure reasons:
- "Tests too brittle" (8 occurrences)
- "Missing test coverage" (5 occurrences)
- "Test timeout" (2 occurrences)
2. Identify root cause
Pattern: test-architect struggles with async/timing tests
Evidence: All timeout failures involved async code
3. Take action
Option A: Retrain agent
- Update prompt with async testing best practices
- Add examples of proper async test patterns
- Reset metrics after retraining
Option B: Route differently
- Route async test tasks to backend-developer (90% success on async)
- Keep test-architect for synchronous unit tests
Option C: Replace agent
- Create new specialized-async-test-architect
- Deprecate test-architect for async work
Analysis:
Compare model cost-effectiveness:
Model: sonnet
Total cost: $0.45
Total runs: 120
Success count: 108
Cost per task: $0.0038
Cost per success: $0.0042
Success rate: 90%
Model: grok
Total cost: $0.08
Total runs: 35
Success count: 30
Cost per task: $0.0023
Cost per success: $0.0027
Success rate: 86%
Model: gemini
Total cost: $0.02
Total runs: 20
Success count: 16
Cost per task: $0.0010
Cost per success: $0.0013
Success rate: 80%
Cost Efficiency Ranking:
1. Gemini Flash: $0.0013 per success (80% success rate)
2. Grok Fast: $0.0027 per success (86% success rate)
3. Claude Sonnet: $0.0042 per success (90% success rate)
Quality-Cost Tradeoff:
- Gemini: 69% cheaper than Claude, but 10% lower success rate
- Grok: 36% cheaper than Claude, but 4% lower success rate
Action:
Optimization strategy:
Tier 1 (Simple tasks):
Use: Gemini Flash
Reason: Lowest cost, acceptable success rate for simple work
Example: "Fix typo in comment", "Format code"
Tier 2 (Medium tasks):
Use: Grok Fast
Reason: Good balance of cost and quality
Example: "Implement CRUD endpoint", "Add validation"
Tier 3 (Complex tasks):
Use: Claude Sonnet
Reason: Highest success rate justifies cost
Example: "Design architecture", "Complex refactoring"
Tier 4 (Critical tasks):
Use: Claude Sonnet + Multi-model validation
Reason: Quality > cost for critical work
Example: "Security review", "Production bug fix"
Expected savings:
Current: 90% Claude usage × $0.0042 = $0.00378 avg per task
Optimized: 20% Claude + 50% Grok + 30% Gemini = $0.00257 avg per task
Savings: 32% cost reduction with minimal quality impact
Analysis:
Review tier distribution:
routing.tierDistribution:
tier1: 45 tasks (45.9%)
tier2: 30 tasks (30.6%)
tier3: 15 tasks (15.3%)
tier4: 8 tasks (8.2%)
Analyze tier accuracy:
Tier 1 (Simple):
Tasks: 45
Success: 42
Failures: 3
Success rate: 93.3% ✅
Verdict: Well-calibrated
Tier 2 (Medium):
Tasks: 30
Success: 25
Failures: 5
Success rate: 83.3% ⚠️
Verdict: Slightly low (target 90%)
Tier 3 (Complex):
Tasks: 15
Success: 12
Failures: 3
Success rate: 80.0% ⚠️
Verdict: Too low (target 90%)
Tier 4 (Critical):
Tasks: 8
Success: 8
Failures: 0
Success rate: 100% ✅
Verdict: Well-calibrated
Action:
Adjust tier thresholds:
Current thresholds (task-complexity-router):
tier1: complexity score 0-3
tier2: complexity score 4-6
tier3: complexity score 7-9
tier4: complexity score 10+
Problem: tier2 and tier3 have lower success rates
Root cause: Tasks slightly too complex for assigned tier
Optimized thresholds:
tier1: complexity score 0-2 (narrower range)
tier2: complexity score 3-5 (shift down)
tier3: complexity score 6-8 (shift down)
tier4: complexity score 9+ (broader range)
Rationale:
- Shift more borderline tasks to higher tiers
- Accept slightly higher cost for better success rates
- tier2/tier3 success should improve to 90%+
Expected impact:
- tier1 usage: 45 → 35 tasks (fewer simple tasks)
- tier2 usage: 30 → 32 tasks (more medium tasks)
- tier3 usage: 15 → 18 tasks (more complex tasks)
- tier4 usage: 8 → 13 tasks (more critical tasks)
- Overall success rate: 88% → 92%
- Overall cost: +15% (acceptable tradeoff for quality)
Analysis:
Analyze task type performance by model:
Task type: code-review
Claude Sonnet:
Success: 25, Failure: 2
Success rate: 92.6% ✅
Grok Fast:
Success: 18, Failure: 2
Success rate: 90.0% ✅
Gemini Flash:
Success: 10, Failure: 4
Success rate: 71.4% ⚠️
→ Pattern: Claude and Grok excel at code review, Gemini struggles
Task type: implementation
Claude Sonnet:
Success: 40, Failure: 5
Success rate: 88.9% ✅
Grok Fast:
Success: 12, Failure: 3
Success rate: 80.0% ⚠️
Gemini Flash:
Success: 6, Failure: 1
Success rate: 85.7% ✅
→ Pattern: Claude best for implementation, Grok/Gemini acceptable
Task type: testing
Claude Sonnet:
Success: 20, Failure: 3
Success rate: 87.0% ✅
Grok Fast:
Success: 0, Failure: 0
Success rate: N/A
Gemini Flash:
Success: 0, Failure: 0
Success rate: N/A
→ Pattern: Only Claude has testing data (others not used for this)
Action:
Task-specific model routing:
code-review tasks:
tier1: Grok Fast (90% success, low cost)
tier2: Grok Fast (90% success, low cost)
tier3: Claude Sonnet (93% success, high quality)
tier4: Multi-model (Claude + Grok consensus)
implementation tasks:
tier1: Gemini Flash (86% success, lowest cost)
tier2: Grok Fast (80% success, medium cost)
tier3: Claude Sonnet (89% success, highest quality)
tier4: Claude Sonnet (89% success, proven)
testing tasks:
tier1-4: Claude Sonnet (only model with proven testing capability)
Expected impact:
- 25% cost savings on code reviews (use Grok instead of Claude)
- 10% cost savings on implementation (use Gemini for simple)
- Maintain quality (route by proven success rates)
Detection:
Monitor for degradation:
Week 1 (baseline):
ui-developer success rate: 90.5%
Average task duration: 45s
Week 2:
ui-developer success rate: 88.2% (↓2.3%)
Average task duration: 48s (↑3s)
Week 3:
ui-developer success rate: 85.1% (↓5.4% from baseline)
Average task duration: 52s (↑7s from baseline)
Week 4:
ui-developer success rate: 78.3% (↓12.2% from baseline) 🚨
Average task duration: 58s (↑13s from baseline) 🚨
Threshold exceeded:
❌ Success rate dropped >10% (78.3% vs 90.5%)
❌ Duration increased >20% (58s vs 45s)
→ ALERT: ui-developer performance degraded significantly
Action:
Degradation response:
1. Investigate root cause
Review recent history:
- Task complexity increased? (Check taskTypes distribution)
- Model changed? (Check model field in history)
- Failures clustered around specific task type?
Finding: All recent failures on "complex-state-management" tasks
Root cause: New task type introduced, agent not trained for it
2. Take corrective action
Option A: Retrain agent
- Update prompt with state management patterns
- Add examples of successful state management
- Reset metrics after retraining
Option B: Route differently
- Route state management tasks to specialized agent
- Keep ui-developer for simpler UI tasks
Option C: Escalate to human
- Alert: "ui-developer performance degraded"
- Request: "Manual review of recent failures needed"
3. Monitor recovery
Week 5 (after retraining):
Success rate: 85.0% (recovering)
Week 6:
Success rate: 89.2% (near baseline)
Week 7:
Success rate: 91.0% (recovered ✅)
How multi-model-validation records model performance:
Multi-Model Review Flow:
1. Execute parallel review
Models: [claude-sonnet, grok-fast, gemini-flash]
Task: Code review of auth.ts
2. Collect model responses
Each model returns:
- Review findings
- Confidence score
- Latency
- Cost
3. Record individual model performance
For each model:
models[modelId].totalRuns += 1
models[modelId].avgLatency = rolling_avg(latency)
models[modelId].totalCost += cost
4. Determine success/failure
If review found critical issues → success (doing its job)
If review crashed/errored → failure
5. Update success counts
models[modelId].successCount += 1 (or failureCount)
models[modelId].taskTypePerformance["code-review"].success += 1
6. Consolidate findings
Generate consensus report
Track which models agreed (co-occurrence patterns)
7. User feedback (optional)
User rates review quality: "Helpful" | "Not helpful"
Update successCorrelation for multi-model-validation skill
How task-complexity-router reads performance data:
Routing Decision Flow:
1. Analyze task complexity
Input: "Implement user authentication with OAuth"
Analysis: Complex (multiple components, external API, security)
Base tier: 3
2. Read performance history
Load: .claude/agent-performance.json
Check: routing.tierDistribution
3. Adjust tier based on history
tier3 historical success rate: 80% (below 90% target)
tier4 historical success rate: 100%
Decision: Bump to tier4 for higher success probability
4. Select agent based on task type affinity
Task type: "implement-api"
Candidates: backend-developer, full-stack-developer
Check affinity:
backend-developer.taskTypes["implement-api"]: 12 (high affinity)
full-stack-developer.taskTypes["implement-api"]: 3 (low affinity)
Decision: Select backend-developer (proven track record)
5. Select model based on tier + task type
tier4 + implement-api:
models[claude].taskTypePerformance["implementation"]: 89% success
models[grok].taskTypePerformance["implementation"]: 80% success
Decision: Select Claude (higher success rate for tier4)
6. Record routing decision
routing.decisions.push({
timestamp: NOW,
taskType: "implement-api",
complexity: "complex",
selectedTier: 4,
agent: "backend-developer",
model: "sonnet",
result: "pending"
})
7. After execution, update result
routing.decisions[last].result = "success"
routing.decisions[last].cost = 0.005
How hierarchical-coordinator tracks phase success:
Phase Execution Tracking:
1. Execute workflow phases
Phase 1: Planning (architect agent)
Phase 2: Implementation (developer agent)
Phase 3: Testing (tester agent)
Phase 4: Review (reviewer agent)
2. Track phase-level metrics
Create phase-specific tracking:
agents["architect"].phasePerformance = {
"planning": { success: 15, failure: 2 },
"architecture": { success: 8, failure: 1 }
}
agents["developer"].phasePerformance = {
"implementation": { success: 25, failure: 5 },
"refactoring": { success: 10, failure: 2 }
}
3. Detect phase-specific issues
Analysis: developer has 20% failure rate on implementation phase
But: developer has 83% success rate overall
Insight: Failures concentrated in specific phase
4. Optimize phase assignment
Current: developer handles all implementation
Optimized: Split by complexity
- Simple implementation → junior-developer (cheaper)
- Complex implementation → senior-developer (higher success)
5. Track coordinator effectiveness
skills["hierarchical-coordinator"].activations += 1
skills["hierarchical-coordinator"].successCorrelation = 0.92
Insight: Workflows using coordinator have 92% success (vs 80% without)
How quality-gates uses performance thresholds:
Quality Gate Decision:
1. Agent completes task
Agent: ui-developer
Task: "Implement dashboard component"
Confidence: 0.75
2. Check agent performance history
Load: agents["ui-developer"]
Historical avg confidence: 0.85
Current confidence: 0.75 (below average 🚨)
3. Apply quality gate
Threshold: If confidence < avg - 0.10, trigger validation
Decision: 0.75 < 0.75 (borderline)
Action: Trigger designer validation (extra quality check)
4. Designer validates
Result: Found 3 minor issues
Verdict: Quality gate prevented low-quality work from proceeding
5. Update metrics
Without gate: ui-developer would have 1 more failure
With gate: Issues caught early, fixed before user sees
skills["quality-gates"].successCorrelation += 1
(Success correlation increases when gate prevents failures)
6. Continuous improvement
Pattern: Low-confidence tasks benefit from extra validation
Threshold: Automatically adjust based on correlation data
Future: If confidence < 0.80, always trigger validation
What to Track:
What NOT to Track:
Data Retention:
Situations Requiring Reset:
Agent capabilities changed
Workflow architecture changed
Model pricing changed
After prolonged period (>6 months)
How to Reset:
# Backup current metrics
cp .claude/agent-performance.json .claude/agent-performance-backup-$(date +%Y%m%d).json
# Reset to empty state
cat > .claude/agent-performance.json << 'EOF'
{
"schemaVersion": "1.0.0",
"lastUpdated": "2026-01-28T16:00:00Z",
"agents": {},
"skills": {},
"models": {},
"routing": {
"tierDistribution": {},
"decisions": []
}
}
EOF
# Archive old session snapshots
mkdir -p ai-docs/performance-history/archive-$(date +%Y%m%d)
mv ai-docs/performance-history/*.json ai-docs/performance-history/archive-$(date +%Y%m%d)/
Regular Maintenance:
Weekly:
- Review top agents (ensure success rates >70%)
- Check model cost trends (identify cost spikes)
- Trim co-activation maps (keep top 50 per skill)
Monthly:
- Analyze task type affinity changes
- Compare model cost-effectiveness
- Review tier distribution accuracy
- Archive old session snapshots (keep last 100)
Quarterly:
- Deep analysis of performance trends
- Optimize routing thresholds
- Identify underperforming patterns
- Consider agent retraining or replacement
Annually:
- Full metrics review and reset (if needed)
- Archive all historical data
- Update baseline success rates
- Document lessons learned
Scenario: User requests /review with 3 models (Claude, Grok, Gemini)
Execution:
Step 1: Initialize session
Session ID: 2026-01-28-session-15
Start time: 15:00:00Z
Step 2: Execute multi-model review
Models: [claude-sonnet, grok-fast, gemini-flash]
Task: Code review of auth/login.ts (450 lines)
Step 3: Track individual model executions
Model: sonnet
Start: 15:00:05Z
End: 15:00:08Z
Latency: 3000ms
Cost: $0.003
Result: Found 5 issues (2 CRITICAL, 3 HIGH)
Outcome: Success
Update metrics:
models["sonnet"].totalRuns = 121
models["sonnet"].successCount = 109
models["sonnet"].avgLatency = 2520ms
models["sonnet"].totalCost = $0.453
models["sonnet"].taskTypePerformance["code-review"].success = 26
Model: grok
Start: 15:00:05Z
End: 15:00:07Z
Latency: 2000ms
Cost: $0.002
Result: Found 4 issues (2 CRITICAL, 2 HIGH)
Outcome: Success
Update metrics:
models["grok"].totalRuns = 36
models["grok"].successCount = 31
models["grok"].avgLatency = 1820ms
models["grok"].totalCost = $0.082
models["grok"].taskTypePerformance["code-review"].success = 19
Model: gemini
Start: 15:00:05Z
End: 15:00:06Z
Latency: 1500ms
Cost: $0.001
Result: Found 3 issues (1 CRITICAL, 2 MEDIUM)
Outcome: Success
Update metrics:
models["gemini"].totalRuns = 21
models["gemini"].successCount = 17
models["gemini"].avgLatency = 1480ms
models["gemini"].totalCost = $0.021
models["gemini"].taskTypePerformance["code-review"].success = 11
Step 4: Track skill activation
skills["multi-model-validation"].activations = 16
skills["multi-model-validation"].lastActivated = 15:00:08Z
skills["multi-model-validation"].coActivations["quality-gates"] = 13
Step 5: Consolidate findings
Consensus issues (all 3 models agreed):
- CRITICAL: SQL injection vulnerability (UNANIMOUS)
- CRITICAL: Missing authentication check (UNANIMOUS)
Majority issues (2/3 models agreed):
- HIGH: Insufficient input validation (Claude, Grok)
- HIGH: Missing error handling (Claude, Grok)
Divergent issues (1/3 models):
- MEDIUM: Code duplication (Gemini only)
Step 6: Record session summary
Session complete:
Duration: 8 seconds
Models used: 3
Total cost: $0.006
Issues found: 5 (2 unanimous, 2 majority, 1 divergent)
Result: Success
Create snapshot:
ai-docs/performance-history/2026-01-28-session-15.json
Step 7: Update aggregate metrics
Overall session success rate: 3/3 models successful = 100%
Cost efficiency: $0.002 per model = good value
Insights from Tracking:
Performance comparison:
Fastest: Gemini (1500ms) - 50% faster than Claude
Most thorough: Claude (5 issues) - Found 1 extra issue
Best value: Gemini ($0.001, 3 issues) - Lowest cost, good coverage
Cost analysis:
Total: $0.006 for 3-model review
vs Single Claude: $0.003 (double cost, but 2x validation)
ROI: Found 2 CRITICAL issues all models agreed on = high confidence
Consensus validation:
UNANIMOUS issues (100% confidence) → Fix immediately
MAJORITY issues (67% confidence) → Fix before merge
DIVERGENT issues (33% confidence) → Low priority (possible false positive)
Recommendation:
Multi-model validation worth the cost for critical code (auth, payments, security)
Single-model sufficient for non-critical code (UI components, docs)
Scenario: After 100 tasks, compare model performance for optimization
Execution:
Step 1: Load performance data
Read: .claude/agent-performance.json
Step 2: Extract model metrics
Claude Sonnet:
Total runs: 120
Success: 108, Failure: 12
Success rate: 90.0%
Avg latency: 2500ms
Total cost: $0.45
Cost per task: $0.00375
Cost per success: $0.00417
Grok Fast:
Total runs: 35
Success: 30, Failure: 5
Success rate: 85.7%
Avg latency: 1800ms
Total cost: $0.08
Cost per task: $0.00229
Cost per success: $0.00267
Gemini Flash:
Total runs: 20
Success: 16, Failure: 4
Success rate: 80.0%
Avg latency: 1500ms
Total cost: $0.02
Cost per task: $0.00100
Cost per success: $0.00125
Step 3: Analyze task type performance
Code Review:
Claude: 25 success, 2 failure = 92.6%
Grok: 18 success, 2 failure = 90.0%
Gemini: 10 success, 4 failure = 71.4%
Winner: Claude (highest quality)
Best value: Grok (90% at lower cost)
Implementation:
Claude: 40 success, 5 failure = 88.9%
Grok: 12 success, 3 failure = 80.0%
Gemini: 6 success, 1 failure = 85.7%
Winner: Claude (highest quality)
Surprising: Gemini performs well here (86% success)
Testing:
Claude: 20 success, 3 failure = 87.0%
Grok: No data
Gemini: No data
Winner: Claude (only option)
Action: Try Grok/Gemini for testing tasks to gather data
Step 4: Calculate cost-effectiveness by task type
Code Review Cost-Effectiveness:
Claude: $0.00417 per success, 92.6% quality
Grok: $0.00267 per success, 90.0% quality (36% cheaper, -2.6% quality)
Gemini: $0.00125 per success, 71.4% quality (70% cheaper, -21.2% quality)
Recommendation: Use Grok for cost-effective reviews (minimal quality loss)
Implementation Cost-Effectiveness:
Claude: $0.00417 per success, 88.9% quality
Grok: $0.00267 per success, 80.0% quality (36% cheaper, -8.9% quality)
Gemini: $0.00125 per success, 85.7% quality (70% cheaper, -3.2% quality)
Recommendation: Use Gemini for simple implementation (best value)
Step 5: Generate optimization plan
Current usage (120 total tasks):
Claude: 100 tasks (83%)
Grok: 15 tasks (13%)
Gemini: 5 tasks (4%)
Optimized usage (maintain quality >85%):
tier1 (Simple): Gemini (30% of tasks)
tier2 (Medium): Grok (40% of tasks)
tier3 (Complex): Claude (25% of tasks)
tier4 (Critical): Claude + Multi-model (5% of tasks)
Expected impact:
Current avg cost: $0.00375 per task
Optimized avg cost: $0.00240 per task
Savings: 36% cost reduction
Current avg success: 88.5%
Optimized avg success: 86.2% (projected)
Quality impact: -2.3% (acceptable tradeoff)
Step 6: Implement gradual rollout
Week 1: Route 20% of tier1 tasks to Gemini
Monitor: Success rate, cost savings
Target: >80% success rate
Week 2: Route 40% of tier2 tasks to Grok
Monitor: Success rate, cost savings
Target: >85% success rate
Week 3: Evaluate results
If successful: Increase percentages
If unsuccessful: Rollback and investigate
Step 7: Track optimization results
After 2 weeks:
Gemini tier1 success: 82% ✅ (above 80% target)
Grok tier2 success: 87% ✅ (above 85% target)
Cost savings: 28% ✅ (approaching 36% target)
Decision: Continue rollout
Next: Route 50% tier1 to Gemini, 60% tier2 to Grok
Insights from Analysis:
Key findings:
1. Grok is best value for code reviews (90% quality at 36% lower cost)
2. Gemini surprisingly good for implementation (86% vs 89% Claude)
3. Claude still best for critical work (92% code review success)
4. Latency varies significantly (Gemini 40% faster than Claude)
Optimization strategy:
- Use Gemini for simple, latency-sensitive tasks
- Use Grok for medium-complexity, cost-sensitive tasks
- Use Claude for critical, quality-sensitive tasks
- Use multi-model for maximum confidence (despite cost)
Expected ROI:
- 36% cost reduction (from $0.00375 to $0.00240 per task)
- 2.3% quality tradeoff (from 88.5% to 86.2% success)
- Worth it: Save $135 per 100,000 tasks with minimal quality impact
Scenario: After 100 routing decisions, optimize tier thresholds
Execution:
Step 1: Load routing data
Read: .claude/agent-performance.json
Focus: routing.tierDistribution, routing.decisions
Step 2: Analyze tier distribution
Current distribution:
tier1: 45 tasks (45.9%)
tier2: 30 tasks (30.6%)
tier3: 15 tasks (15.3%)
tier4: 8 tasks (8.2%)
Skew analysis:
Heavy on tier1 (46%) - Router prefers simple classification
Light on tier4 (8%) - Router rarely escalates
Step 3: Calculate tier success rates
tier1 (Simple tasks):
Total: 45
Success: 42, Failure: 3
Success rate: 93.3% ✅
Avg cost: $0.001
Avg duration: 25s
tier2 (Medium tasks):
Total: 30
Success: 25, Failure: 5
Success rate: 83.3% ⚠️ (target: 90%)
Avg cost: $0.002
Avg duration: 45s
tier3 (Complex tasks):
Total: 15
Success: 12, Failure: 3
Success rate: 80.0% ⚠️ (target: 90%)
Avg cost: $0.004
Avg duration: 90s
tier4 (Critical tasks):
Total: 8
Success: 8, Failure: 0
Success rate: 100% ✅
Avg cost: $0.008
Avg duration: 120s
Step 4: Analyze tier2/tier3 failures
tier2 failures (5 tasks):
1. "Implement complex state management" (complexity: 6)
- Should have been tier3 (underestimated)
2. "Add authentication to API" (complexity: 6)
- Should have been tier3 (security = critical)
3. "Refactor component with hooks" (complexity: 5)
- Should have been tier2 (correctly routed, agent issue)
4. "Implement drag-and-drop" (complexity: 6)
- Should have been tier3 (complex interaction)
5. "Add real-time updates" (complexity: 6)
- Should have been tier3 (WebSocket complexity)
Pattern: 4/5 failures were borderline tier2/tier3 (complexity 6)
Root cause: tier2 upper threshold too high (should be 5, not 6)
tier3 failures (3 tasks):
1. "Design microservices architecture" (complexity: 9)
- Should have been tier4 (architecture = critical)
2. "Implement payment processing" (complexity: 9)
- Should have been tier4 (money = critical)
3. "Refactor authentication system" (complexity: 8)
- Correctly routed, agent struggled with complexity
Pattern: 2/3 failures should have been tier4 (complexity 9)
Root cause: tier3 upper threshold too high (should be 8, not 9)
Step 5: Propose threshold adjustments
Current thresholds:
tier1: complexity 0-3
tier2: complexity 4-6
tier3: complexity 7-9
tier4: complexity 10+
Problem: Borderline tasks (6, 9) cause failures
Optimized thresholds:
tier1: complexity 0-2 (narrower, more confident)
tier2: complexity 3-5 (shift down, avoid borderline 6)
tier3: complexity 6-8 (shift down, avoid borderline 9)
tier4: complexity 9+ (broader, include borderline cases)
Rationale:
- Move borderline complexity 6 from tier2 → tier3
- Move borderline complexity 9 from tier3 → tier4
- Accept 15% higher cost for 10% better success rate
Step 6: Simulate new distribution
Reclassify historical tasks with new thresholds:
tier1 (0-2): 35 tasks (35%)
Success rate: 34/35 = 97.1% ↑ (was 93.3%)
tier2 (3-5): 32 tasks (32%)
Success rate: 30/32 = 93.8% ↑ (was 83.3%)
tier3 (6-8): 18 tasks (18%)
Success rate: 17/18 = 94.4% ↑ (was 80.0%)
tier4 (9+): 13 tasks (13%)
Success rate: 13/13 = 100% ✓ (was 100%)
Overall success rate: 94/98 = 95.9% ↑ (was 87.8%)
Step 7: Calculate cost impact
Current avg cost: $0.00240 per task
Optimized avg cost: $0.00276 per task (+15%)
Cost breakdown:
tier1 (35%): $0.001 × 0.35 = $0.00035
tier2 (32%): $0.002 × 0.32 = $0.00064
tier3 (18%): $0.004 × 0.18 = $0.00072
tier4 (13%): $0.008 × 0.13 = $0.00104
Total: $0.00275 (rounded $0.00276)
ROI calculation:
Cost increase: +$0.00036 per task (+15%)
Success increase: +8.1% (from 87.8% to 95.9%)
Failure reduction: 12 → 4 failures (67% reduction)
Value: Preventing 8 failures per 100 tasks worth the 15% cost increase
Step 8: Implement new thresholds
Update task-complexity-router skill:
OLD:
if (complexity <= 3) return "tier1";
if (complexity <= 6) return "tier2";
if (complexity <= 9) return "tier3";
return "tier4";
NEW:
if (complexity <= 2) return "tier1";
if (complexity <= 5) return "tier2";
if (complexity <= 8) return "tier3";
return "tier4";
Document change:
Reason: Performance data showed borderline tasks caused failures
Expected: 8% success rate improvement, 15% cost increase
Monitoring: Track next 100 tasks to validate improvement
Step 9: Monitor post-optimization
After 50 tasks with new thresholds:
tier1: 18 tasks, 18 success = 100% ✅
tier2: 16 tasks, 15 success = 93.8% ✅
tier3: 10 tasks, 9 success = 90.0% ✅
tier4: 6 tasks, 6 success = 100% ✅
Overall: 48/50 = 96.0% success ✅ (matches projection)
Avg cost: $0.00280 ✅ (matches projection)
Verdict: Optimization successful, keep new thresholds
Insights from Optimization:
Key findings:
1. Borderline complexity scores (6, 9) caused most failures
2. Router was too aggressive in keeping tasks at lower tiers
3. Small threshold adjustments (6→5, 9→8) had big impact
Optimization results:
- Success rate: 87.8% → 96.0% (+8.2%)
- Failure rate: 12.2% → 4.0% (-67%)
- Cost per task: $0.00240 → $0.00280 (+15%)
- ROI: Strong (quality improvement worth cost increase)
Lessons learned:
- Track tier success rates, not just overall success
- Borderline cases benefit from tier escalation
- Performance data reveals routing blind spots
- Continuous monitoring enables iterative improvement
Next steps:
- Continue monitoring for 100 more tasks
- Consider dynamic thresholds (adjust based on live data)
- Explore agent-specific routing (some agents handle complexity better)
Problem: Agent performance.json file growing too large
Cause: History arrays not being trimmed
Solution: Implement automatic trimming after each update
function updateAgentMetrics(agentId, execution) {
const agent = metrics.agents[agentId];
// Update aggregates
agent.totalRuns += 1;
agent.successCount += execution.result === "success" ? 1 : 0;
// Add to history
agent.history.push(execution);
// Trim to max 100 entries (FIFO)
if (agent.history.length > 100) {
agent.history = agent.history.slice(-100);
}
}
Problem: Metrics don't reflect recent changes
Cause: Stale data from old workflows
Solution: Reset metrics after major changes
# Backup current metrics
cp .claude/agent-performance.json .claude/agent-performance-backup-$(date +%Y%m%d).json
# Reset relevant sections (keep models, reset agents)
# Edit .claude/agent-performance.json manually or use script
Problem: Success rate calculations seem wrong
Cause: Inconsistent result values ("success", "SUCCESS", "completed", etc.)
Solution: Normalize result values
function normalizeResult(result) {
const successValues = ["success", "SUCCESS", "completed", "PASS"];
const failureValues = ["failure", "FAILURE", "error", "ERROR", "FAIL"];
if (successValues.includes(result)) return "success";
if (failureValues.includes(result)) return "failure";
return "unknown";
}
// Use normalized values in metrics
const normalizedResult = normalizeResult(execution.result);
agent.successCount += normalizedResult === "success" ? 1 : 0;
agent.failureCount += normalizedResult === "failure" ? 1 : 0;
Performance tracking enables data-driven orchestration optimization through:
Key metrics to monitor:
Master performance tracking and your orchestration workflows will continuously improve, delivering better results at lower costs.
Inspired By:
/review command (multi-model performance tracking)/dev command (agent success rate monitoring)testing
A test skill for validation testing. Use when testing skill parsing and validation logic.
tools
--- name: bad-skill description: This skill has invalid YAML in frontmatter allowed-tools: [invalid, array, syntax prerequisites: not-an-array --- # Bad Skill This skill has malformed frontmatter that should fail parsing. The YAML has: - Unclosed array bracket - Wrong type for prerequisites (should be array, not string)
development
Sync model aliases from the curated Firebase database. Fetches default model assignments, short aliases, team compositions, and known model metadata from the claudish API. Run this to get fresh model recommendations.
tools
Release one or more Magus plugins to the distribution repos (magus, magus-alpha, magus-marketing). Handles version inference from git history, marketplace.json updates, tagging, and force-push to lean dist repos. Use whenever the user says "release kanban", "release the dev plugin", "cut a new version of gtd", "bump kanban to 1.7", or hands you a batch like "release kanban and gtd". Also use for multi-plugin releases and for checking what a release would contain before committing.