madappgang/model-tracking-protocol

Name: model-tracking-protocol
Author: madappgang

plugins/multimodel/skills/model-tracking-protocol/SKILL.md

npx skillsauth add madappgang/magus model-tracking-protocol


Model Tracking Protocol

Version: 1.0.0
Purpose: MANDATORY tracking protocol for multi-model validation to prevent incomplete reviews
Status: Production Ready

Overview

This skill defines the MANDATORY tracking protocol for multi-model validation. It provides templates and procedures that make proper tracking hard to skip.

The Problem This Solves:

Agents often launch multiple external AI models but fail to:

Create structured tracking tables before launch
Collect timing and performance data during execution
Document failures with error messages
Perform consensus analysis comparing model findings
Present results in a structured format

The Solution:

This skill provides MANDATORY checklists, templates, and protocols that ensure complete tracking. Missing ANY of these steps = INCOMPLETE review.

MANDATORY Pre-Launch Checklist
Tracking Table Templates
Per-Model Status Updates
Failure Documentation Protocol
Consensus Analysis Requirements
Results Presentation Template
Common Failures and Prevention
Integration Examples

MANDATORY Pre-Launch Checklist

You MUST complete ALL items before launching ANY external models.

This is NOT optional. If you skip this, your multi-model validation is INCOMPLETE.

Checklist (Copy and Complete)

PRE-LAUNCH VERIFICATION (complete before Task calls):

[ ] 1. SESSION_ID created: ________________________
[ ] 2. SESSION_DIR created: ________________________
[ ] 3. Tracking table written to: $SESSION_DIR/tracking.md
[ ] 4. Start time recorded: SESSION_START=$(date +%s)
[ ] 5. Model list confirmed (comma-separated): ________________________
[ ] 6. Per-model timing arrays initialized
[ ] 7. Code context written to session directory
[ ] 8. Tracking marker created: /tmp/.claude-multi-model-active

If ANY item is unchecked, STOP and complete it before proceeding.
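
Part of the checklist can be checked mechanically. The sketch below is one possible approach and is not part of the skill itself: `verify_prelaunch` is a hypothetical helper name, and it covers only the items that leave a trace on disk or in a variable (items 1 to 4 and 8). The marker path is passed as an optional fourth argument so the check can be tried against a scratch directory.

```bash
#!/bin/bash
# Hypothetical helper: verify the pre-launch items that can be checked mechanically.
# Arguments: session id, session directory, session start (epoch seconds), marker path.
verify_prelaunch() {
  local session_id="$1" session_dir="$2" session_start="$3"
  local marker="${4:-/tmp/.claude-multi-model-active}"
  local failed=0

  [ -n "$session_id" ] || { echo "MISSING: SESSION_ID" >&2; failed=1; }
  [ -d "$session_dir" ] || { echo "MISSING: SESSION_DIR" >&2; failed=1; }
  [ -f "$session_dir/tracking.md" ] || { echo "MISSING: tracking table" >&2; failed=1; }
  # SESSION_START must be a non-empty string of digits
  case "$session_start" in
    ''|*[!0-9]*) echo "MISSING: SESSION_START" >&2; failed=1 ;;
  esac
  [ -f "$marker" ] || { echo "MISSING: tracking marker" >&2; failed=1; }

  return "$failed"
}
```

A typical call would be `verify_prelaunch "$SESSION_ID" "$SESSION_DIR" "$SESSION_START" || exit 1`, placed just before the first Task call.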

Why Pre-Launch Matters

Without pre-launch setup, you will:

Lose timing data (cannot calculate speed accurately)
Miss failed model details (no structured place to record)
Skip consensus analysis (no model list to compare)
Present incomplete results (no tracking table to populate)

Pre-Launch Script Template

CRITICAL: Use file-based detection (a marker file) instead of environment variables, so that hooks running in separate processes can tell that tracking is active.

#!/bin/bash
# Run this BEFORE launching any Task calls

# 1. Create unique session
SESSION_ID="review-$(date +%Y%m%d-%H%M%S)-$(head -c 4 /dev/urandom | xxd -p)"
SESSION_DIR="/tmp/${SESSION_ID}"
mkdir -p "$SESSION_DIR"

# 2. Record start time
SESSION_START=$(date +%s)

# 3. Create tracking table
cat > "$SESSION_DIR/tracking.md" << EOF
# Multi-Model Tracking

## Session Info
- Session ID: ${SESSION_ID}
- Started: $(date -u +%Y-%m-%dT%H:%M:%SZ)
- Models Requested: [FILL]

## Model Status

| Model | Agent ID | Status | Start | End | Duration | Issues | Quality | Notes |
|-------|----------|--------|-------|-----|----------|--------|---------|-------|
| [MODEL 1] | | pending | | | | | | |
| [MODEL 2] | | pending | | | | | | |
| [MODEL 3] | | pending | | | | | | |

## Failures

| Model | Failure Type | Error Message | Retry? |
|-------|--------------|---------------|--------|

## Consensus

| Issue | Model 1 | Model 2 | Model 3 | Agreement |
|-------|---------|---------|---------|-----------|

EOF

# 4. Initialize timing arrays
declare -A MODEL_START_TIMES
declare -A MODEL_END_TIMES
declare -A MODEL_STATUS

# 5. Create tracking marker file (CRITICAL FIX)
# This allows hooks to detect that tracking is active
echo "$SESSION_DIR" > /tmp/.claude-multi-model-active

echo "Pre-launch setup complete. Session: $SESSION_ID"
echo "Directory: $SESSION_DIR"
echo "Tracking table: $SESSION_DIR/tracking.md"

Strict Mode (Optional)

For stricter enforcement, set:

export CLAUDE_STRICT_TRACKING=true

When enabled, hooks will BLOCK execution if tracking is not set up, rather than just warning.
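
The difference between warning and blocking can be sketched as follows. This is an illustration only: the function name `check_tracking` and its return codes are assumptions, and the exit status a real hook must use to block execution depends on the hook system and is not specified in this skill.

```bash
#!/bin/bash
# Sketch of a hook-side check; the function name and return codes are assumptions.
# Returns 0 when tracking is active, or when it is missing but strict mode is off
# (a warning is printed). Returns 1 when tracking is missing and strict mode is on.
check_tracking() {
  local marker="${1:-/tmp/.claude-multi-model-active}"
  local session_dir=""

  if [ -f "$marker" ]; then
    session_dir=$(cat "$marker")
    if [ -n "$session_dir" ] && [ -f "$session_dir/tracking.md" ]; then
      return 0
    fi
  fi

  if [ "${CLAUDE_STRICT_TRACKING:-false}" = "true" ]; then
    echo "BLOCKED: multi-model tracking is not set up" >&2
    return 1
  fi
  echo "WARNING: multi-model tracking is not set up" >&2
  return 0
}
```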

Tracking Table Templates

Template A: Simple Model Tracking (3-5 models)

| Model | Status | Time | Issues | Quality | Cost |
|-------|--------|------|--------|---------|------|
| claude-embedded | pending | - | - | - | FREE |
| grok | pending | - | - | - | - |
| qwen | pending | - | - | - | FREE |

Update as each completes:

| Model | Status | Time | Issues | Quality | Cost |
|-------|--------|------|--------|---------|------|
| claude-embedded | success | 32s | 8 | 95% | FREE |
| grok | success | 45s | 6 | 87% | $0.002 |
| qwen | timeout | - | - | - | - |

Template B: Detailed Model Tracking (6+ models)

## Model Execution Status

### Summary
- Total Requested: 8
- Completed: 0
- In Progress: 0
- Failed: 0
- Pending: 8

### Detailed Status

| # | Model | Provider | Status | Start | Duration | Issues | Quality | Cost | Error |
|---|-------|----------|--------|-------|----------|--------|---------|------|-------|
| 1 | claude-embedded | Anthropic | pending | - | - | - | - | FREE | - |
| 2 | grok | X-ai | pending | - | - | - | - | - | - |
| 3 | qwen | Qwen | pending | - | - | - | - | FREE | - |
| 4 | gemini | Google | pending | - | - | - | - | - | - |
| 5 | gpt | OpenAI | pending | - | - | - | - | - | - |
| 6 | devstral | Mistral | pending | - | - | - | - | FREE | - |
| 7 | deepseek-r1 | DeepSeek | pending | - | - | - | - | - | - |
| 8 | claude-sonnet | Anthropic | pending | - | - | - | - | - | - |

Template C: Session-Based Tracking File

Create this file at $SESSION_DIR/tracking.md:

# Multi-Model Validation Tracking
Session: ${SESSION_ID}
Started: ${TIMESTAMP}

## Pre-Launch Verification
- [x] Session directory created: ${SESSION_DIR}
- [x] Tracking table initialized
- [x] Start time recorded: ${SESSION_START}
- [x] Model list: ${MODEL_LIST}

## Model Status

| Model | Status | Start | Duration | Issues | Quality |
|-------|--------|-------|----------|--------|---------|
| claude | pending | - | - | - | - |
| grok | pending | - | - | - | - |
| gemini | pending | - | - | - | - |

## Failures
(populated as failures occur)

## Consensus
(populated after all complete)

Update Protocol

As each model completes, IMMEDIATELY update:

Status: pending -> in_progress -> success/failed/timeout
Duration: Calculate from start time
Issues: Number of issues found
Quality: Percentage if calculable
Error: If failed, brief error message

DO NOT wait until all models finish. Update as each completes.
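
The status progression above can be guarded with a small check. This is a minimal sketch under the assumption that only the transitions named in the update protocol are allowed; `is_valid_transition` is a hypothetical name, and `cancelled` is left out because the protocol does not say from which states it can be reached.

```bash
#!/bin/bash
# Hypothetical guard: succeed only for the transitions named in the update protocol.
is_valid_transition() {
  local from="$1" to="$2"
  case "$from:$to" in
    pending:in_progress) return 0 ;;
    in_progress:success|in_progress:failed|in_progress:timeout) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_valid_transition pending success` fails, which would catch a model marked as finished without ever having been marked as started.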

Per-Model Status Update Protocol

IMMEDIATELY After Each Model Completes

Do NOT wait until all models finish. Update tracking AS EACH COMPLETES.

Update Script

# Call this when each model completes.
# Requires SESSION_DIR, SESSION_START, and MODEL_START_TIMES from the pre-launch script.
# track_model_performance is not defined in this document and must be provided separately.
update_model_status() {
  local model="$1"
  local status="$2"
  local issues="${3:-0}"
  local quality="${4:-}"
  local error="${5:-}"

  # Fall back to the session start if no per-model start time was recorded,
  # so the duration is never computed from an empty value
  local end_time start_time duration
  end_time=$(date +%s)
  start_time="${MODEL_START_TIMES[$model]:-$SESSION_START}"
  duration=$((end_time - start_time))

  # Update arrays
  MODEL_END_TIMES["$model"]=$end_time
  MODEL_STATUS["$model"]="$status"

  # Log update to session tracking file
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) - Model: $model, Status: $status, Duration: ${duration}s" >> "$SESSION_DIR/execution.log"

  # Append a row to tracking.md. Rows land at the end of the file,
  # after whatever section was written last.
  echo "| $model | $status | ${duration}s | $issues | ${quality:-N/A} | ${error:-} |" >> "$SESSION_DIR/tracking.md"

  # Track performance in global statistics
  if [[ "$status" == "success" ]]; then
    track_model_performance "$model" "success" "$duration" "$issues" "$quality"
  else
    track_model_performance "$model" "$status" "$duration" 0 ""
  fi
}

# Usage examples:
update_model_status "claude-embedded" "success" 8 95
update_model_status "grok" "success" 6 87
update_model_status "some-model" "timeout" 0 "" "Exceeded 120s limit"
update_model_status "other-model" "failed" 0 "" "API 500 error"

Status Values

| Status | Meaning | Action |
|--------|---------|--------|
| pending | Not started | Wait |
| in_progress | Currently executing | Monitor |
| success | Completed successfully | Collect results |
| failed | Error during execution | Document error |
| timeout | Exceeded time limit | Note timeout |
| cancelled | User cancelled | Note cancellation |

Real-Time Progress Display

Show user progress as models complete:

Model Status (3/5 complete):
✓ claude-embedded (32s, 8 issues)
✓ grok (45s, 6 issues)
✓ qwen (52s, 5 issues)
⏳ gpt (in progress, 60s elapsed)
⏳ gemini (in progress, 48s elapsed)
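
A display like this can be produced from per-model records. The sketch below assumes a pipe-separated record format (`model|status|seconds|issues`) that is invented for this example; the skill does not define one. Models in a terminal state (success, failed, timeout) count as complete.

```bash
#!/bin/bash
# Sketch: render a progress summary from "model|status|seconds|issues" records on stdin.
# The record format is an assumption made for this example.
render_progress() {
  awk -F'|' '
    {
      total++
      if ($2 == "success") {
        finished++
        lines[total] = sprintf("✓ %s (%ss, %s issues)", $1, $3, $4)
      } else if ($2 == "in_progress") {
        lines[total] = sprintf("⏳ %s (in progress, %ss elapsed)", $1, $3)
      } else if ($2 == "pending") {
        lines[total] = sprintf("- %s (pending)", $1)
      } else {
        # failed, timeout, cancelled: finished, but without results
        finished++
        lines[total] = sprintf("✗ %s (%s)", $1, $2)
      }
    }
    END {
      printf "Model Status (%d/%d complete):\n", finished, total
      for (i = 1; i <= total; i++) print lines[i]
    }
  '
}
```

Usage: `printf '%s\n' 'grok|success|45|6' 'gpt|in_progress|60|' | render_progress`.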

Failure Documentation Protocol

EVERY failed model MUST be documented with:

Model name
Failure type (timeout, API error, parse error, etc.)
Error message (exact or summarized)
Whether retry was attempted

Failure Report Template

## Failed Models Report

### Model: grok
- **Failure Type:** API Error
- **Error Message:** "500 Internal Server Error from OpenRouter"
- **Retry Attempted:** Yes, 1 retry, same error
- **Impact:** Review results based on 3/4 models instead of 4
- **Recommendation:** Check OpenRouter status, retry later

### Model: gemini
- **Failure Type:** Timeout
- **Error Message:** "Exceeded 120s limit, response incomplete"
- **Retry Attempted:** No, time constraints
- **Impact:** Lost Gemini perspective, consensus based on remaining models
- **Recommendation:** Extend timeout to 180s for this model

Failure Categorization

| Category | Common Causes | Recovery |
|----------|---------------|----------|
| Timeout | Model slow, large input, network latency | Retry with extended timeout |
| API Error | Provider down, rate limit, auth issue | Wait and retry, check API status |
| Parse Error | Malformed response, encoding issue | Retry, simplify prompt |
| Auth Error | Invalid API key, expired token | Check credentials |
| Context Limit | Input too large for model | Reduce context, split task |
| Rate Limit | Too many requests | Wait, implement backoff |
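
To fill in the failure type consistently, a raw error message can be mapped to one of these categories. The sketch below is illustrative: `classify_failure` is a hypothetical helper, the patterns are guesses rather than provider-specific rules, and falling back to "API Error" for unrecognized messages is an assumption.

```bash
#!/bin/bash
# Sketch: map a raw error message to one of the failure categories.
# The matching patterns are illustrative guesses, not provider-specific rules.
classify_failure() {
  local msg
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *"rate limit"*|*429*)                          echo "Rate Limit" ;;
    *"context length"*|*"too large"*|*"too long"*) echo "Context Limit" ;;
    *unauthorized*|*"api key"*|*401*|*403*)        echo "Auth Error" ;;
    *timeout*|*"timed out"*|*"exceeded "*"s limit"*) echo "Timeout" ;;
    *parse*|*malformed*|*"invalid json"*)          echo "Parse Error" ;;
    *)                                             echo "API Error" ;;  # assumed fallback
  esac
}
```

The order of the patterns matters: more specific categories are tested first so that, for instance, a "rate limit exceeded" message is not reported as a timeout.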

Failure Summary Table

Always include this in final results:

## Execution Summary

| Metric | Value |
|--------|-------|
| Models Requested | 8 |
| Successful | 5 (62.5%) |
| Failed | 3 (37.5%) |

### Failed Models

| Model | Failure | Recoverable? | Action |
|-------|---------|--------------|--------|
| grok | API 500 | Yes - retry later | Check OpenRouter status |
| gemini | Timeout | Yes - extend limit | Use 180s timeout |
| deepseek-r1 | Auth Error | No - check key | Verify API key valid |

Writing Failures to Session Directory

# Document failure immediately when it occurs
document_failure() {
  local model="$1"
  local failure_type="$2"
  local error_msg="$3"
  local retry_attempted="${4:-No}"

  cat >> "$SESSION_DIR/failures.md" << EOF

### Model: $model
- **Failure Type:** $failure_type
- **Error Message:** "$error_msg"
- **Retry Attempted:** $retry_attempted
- **Timestamp:** $(date -u +%Y-%m-%dT%H:%M:%SZ)

EOF

  echo "Failure documented: $model ($failure_type)" >&2
}

# Usage:
document_failure "grok" "API Error" "500 Internal Server Error" "Yes, 1 retry"
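
Several recovery strategies in the categorization table come down to retrying after a wait. A minimal retry-with-backoff sketch follows; `retry_with_backoff` is a hypothetical helper, and the doubling delay is one common choice rather than something this skill prescribes.

```bash
#!/bin/bash
# Sketch: retry a command with exponential backoff.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while true; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))        # double the wait before the next attempt
    attempt=$((attempt + 1))
  done
}
```

When the final attempt fails, the caller can pass the attempt count on to `document_failure` as the retry field.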

Consensus Analysis Requirements

After ALL models complete (or max wait time), you MUST perform consensus analysis.

This is NOT optional. Even with 2 successful models, compare their findings.

Minimum Viable Consensus (2 models)

With only 2 models, consensus is simple:

AGREE: Both found the same issue
DISAGREE: Only one found the issue

| Issue | Model 1 | Model 2 | Consensus |
|-------|---------|---------|-----------|
| SQL injection | Yes | Yes | AGREE |
| Missing validation | Yes | No | Model 1 only |
| Weak hashing | No | Yes | Model 2 only |

Standard Consensus (3-5 models)

| Issue | Claude | Grok | Gemini | Agreement |
|-------|--------|------|--------|-----------|
| SQL injection | Yes | Yes | Yes | UNANIMOUS (3/3) |
| Missing validation | Yes | Yes | No | STRONG (2/3) |
| Rate limiting | Yes | No | No | DIVERGENT (1/3) |
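
The agreement labels follow from the fraction of successful models that flagged an issue. The sketch below uses the thresholds listed in this skill (100%, 67%+, 50%+, below 50%); `agreement_level` is a hypothetical helper name. Integer cross-multiplication is used so that 2 of 3 is classified as STRONG, matching the table above, without any rounding.

```bash
#!/bin/bash
# Sketch: classify an issue by how many of the successful models flagged it.
# Usage: agreement_level FLAGGED TOTAL
agreement_level() {
  local flagged="$1" total="$2"
  if [ "$total" -le 0 ] || [ "$flagged" -gt "$total" ]; then
    echo "INVALID"
    return 1
  fi
  if [ "$flagged" -eq "$total" ]; then
    echo "UNANIMOUS"
  elif [ $((flagged * 3)) -ge $((total * 2)) ]; then
    echo "STRONG"       # at least two thirds
  elif [ $((flagged * 2)) -ge "$total" ]; then
    echo "MAJORITY"     # at least half
  else
    echo "DIVERGENT"
  fi
}
```

With only two models, a 1 of 2 result is exactly 50% and lands in MAJORITY under these thresholds, whereas the two-model table reports it as a single-model finding.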

Extended Consensus (6+ models)

For 6+ models, add summary statistics:

## Consensus Summary

- **Unanimous Issues (100%):** 3 issues
- **Strong Consensus (67%+):** 5 issues
- **Majority (50%+):** 2 issues
- **Divergent (<50%):** 4 issues

## Top 5 by Consensus

1. [6/6] SQL injection in search - FIX IMMEDIATELY
2. [6/6] Missing input validation - FIX IMMEDIATELY
3. [5/6] Weak password hashing - RECOMMENDED
4. [4/6] Missing rate limiting - CONSIDER
5. [3/6] Error handling gaps - INVESTIGATE

Consensus Analysis Script

# Perform consensus analysis on all model findings
analyze_consensus() {
  local session_dir="$1"
  local num_models="$2"

  echo "## Consensus Analysis" > "$session_dir/consensus.md"
  echo "" >> "$session_dir/consensus.md"
  echo "Based on $num_models model reviews:" >> "$session_dir/consensus.md"
  echo "" >> "$session_dir/consensus.md"

  # Read all review files and extract issues
  # (simplified - actual implementation would parse review markdown)
  for review in "$session_dir"/*-review.md; do
    [ -e "$review" ] || continue  # skip the literal pattern when no review files exist
    echo "Processing: $review"
    # Extract issues, compare, categorize by agreement level
  done

  # Calculate consensus levels
  echo "### Consensus Levels" >> "$session_dir/consensus.md"
  echo "" >> "$session_dir/consensus.md"
  echo "- UNANIMOUS: All $num_models models agree" >> "$session_dir/consensus.md"
  echo "- STRONG: ≥67% of models agree" >> "$session_dir/consensus.md"
  echo "- MAJORITY: ≥50% of models agree" >> "$session_dir/consensus.md"
  echo "- DIVERGENT: <50% of models agree" >> "$session_dir/consensus.md"
}
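
The extraction step that the script leaves as a comment could look like the sketch below. It assumes, hypothetically, that every review file lists each finding on a line of the form `ISSUE: <title>`; the real review format is not defined in this skill, and matching issues by exact title is a simplification.

```bash
#!/bin/bash
# Sketch: count how many review files mention each issue.
# Assumes (hypothetically) one "ISSUE: <title>" line per finding in each *-review.md.
# Output: one "<count><TAB><title>" line per issue, highest count first.
count_issue_votes() {
  local session_dir="$1" review
  for review in "$session_dir"/*-review.md; do
    [ -e "$review" ] || continue
    # De-duplicate within a file so each model casts at most one vote per issue
    sed -n 's/^ISSUE: //p' "$review" | sort -u
  done |
    sort | uniq -c | sort -k1,1nr -k2 |
    awk '{ n = $1; $1 = ""; sub(/^ /, ""); print n "\t" $0 }'
}
```

Each count, together with the number of successful models, gives the agreement level for that issue.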

NO Consensus Analysis = INCOMPLETE Review

If you present results without a consensus comparison, your review is INCOMPLETE.

Minimum Requirements:

✅ Compare findings across ALL successful models
✅ Categorize by agreement level (unanimous, strong, majority, divergent)
✅ Prioritize issues by consensus + severity
✅ Document in $SESSION_DIR/consensus.md

Results Presentation Template

Your final output MUST include ALL of these sections.

Required Output Format

## Multi-Model Review Complete

### Execution Summary

| Metric | Value |
|--------|-------|
| Session ID | review-20251224-143052-a3f2 |
| Session Directory | /tmp/review-20251224-143052-a3f2 |
| Models Requested | 5 |
| Successful | 4 (80%) |
| Failed | 1 (20%) |
| Total Duration | 68s (parallel) |
| Sequential Equivalent | 197s |
| Speedup | 2.9x |

### Model Performance

| Model | Time | Issues | Quality | Status | Cost |
|-------|------|--------|---------|--------|------|
| claude-embedded | 32s | 8 | 95% | Success | FREE |
| grok | 45s | 6 | 87% | Success | $0.002 |
| qwen | 52s | 5 | 82% | Success | FREE |
| gpt | 68s | 7 | 89% | Success | $0.015 |
| devstral | - | - | - | Timeout | - |

### Failed Models

| Model | Failure | Error |
|-------|---------|-------|
| devstral | Timeout | Exceeded 120s limit |

### Top Issues by Consensus

1. **[UNANIMOUS]** SQL injection in search endpoint
   - Flagged by: claude-embedded, grok, qwen, gpt (4/4)
   - Severity: CRITICAL
   - Action: FIX IMMEDIATELY

2. **[UNANIMOUS]** Missing input validation
   - Flagged by: claude-embedded, grok, qwen, gpt (4/4)
   - Severity: CRITICAL
   - Action: FIX IMMEDIATELY

3. **[STRONG]** Weak password hashing
   - Flagged by: claude-embedded, grok, gpt (3/4)
   - Severity: HIGH
   - Action: RECOMMENDED

### Detailed Reports

- Session directory: /tmp/review-20251224-143052-a3f2
- Consolidated review: /tmp/review-20251224-143052-a3f2/consolidated-review.md
- Individual reviews: /tmp/review-20251224-143052-a3f2/{model}-review.md
- Tracking data: /tmp/review-20251224-143052-a3f2/tracking.md
- Consensus analysis: /tmp/review-20251224-143052-a3f2/consensus.md

### Statistics Saved

- Performance data logged to: ai-docs/llm-performance.json
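
The timing rows of the execution summary can be derived rather than typed in: parallel time is the longest individual duration, the sequential equivalent is the sum, and the speedup is their ratio. A minimal sketch, with `compute_speedup` as a hypothetical helper that takes the durations in seconds of the models that completed:

```bash
#!/bin/bash
# Sketch: derive parallel time (max), sequential equivalent (sum), and speedup
# from per-model durations given in seconds.
compute_speedup() {
  printf '%s\n' "$@" | awk '
    { sum += $1; if ($1 > max) max = $1 }
    END {
      if (max > 0)
        printf "parallel=%ds sequential=%ds speedup=%.1fx\n", max, sum, sum / max
      else
        print "parallel=0s sequential=0s speedup=n/a"
    }
  '
}

compute_speedup 32 45   # prints: parallel=45s sequential=77s speedup=1.7x
```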

Missing Section Detection

Before presenting, verify ALL sections are present:

verify_output_complete() {
  local output="$1"

  local required=(
    "Execution Summary"
    "Model Performance"
    "Top Issues"
    "Detailed Reports"
    "Statistics"
  )

  local missing=()
  for section in "${required[@]}"; do
    if ! grep -qF -- "$section" <<< "$output"; then
      missing+=("$section")
    fi
  done

  if [ ${#missing[@]} -gt 0 ]; then
    echo "ERROR: Missing required sections: ${missing[*]}" >&2
    return 1
  fi

  return 0
}

Checklist before presenting results:

[ ] Execution Summary (models requested/successful/failed)
[ ] Model Performance table (per-model times and quality)
[ ] Failed Models section (if any failed)
[ ] Top Issues by Consensus (prioritized list)
[ ] Detailed Reports (session directory, file paths)
[ ] Statistics confirmation (llm-performance.json updated)

Common Failures and Prevention

Failure 1: No Tracking Table Created

Symptom: Results presented as prose, not structured data

What went wrong:

"I ran 5 models. 3 succeeded and found various issues."
(No table, no structure)

Prevention:

Always run pre-launch script FIRST
Create $SESSION_DIR/tracking.md before Task calls
Populate table as models complete

Detection: SubagentStop hook warns if no tracking found

Failure 2: Timing Not Recorded

Symptom: "Duration: unknown" or missing speed stats

What went wrong:

# Launched models without recording start time
Task: reviewer1
Task: reviewer2
# No SESSION_START, cannot calculate duration!

Prevention:

# ALWAYS do this first
SESSION_START=$(date +%s)
declare -A MODEL_START_TIMES   # must be associative to use model names as keys
MODEL_START_TIMES["model1"]=$SESSION_START

Detection: Hook checks for timing data in output

Failure 3: Failed Models Not Documented

Symptom: "2 of 8 succeeded" with no failure details

What went wrong:

"Launched 8 models. 2 succeeded."
(No info on why 6 failed)

Prevention:

# Immediately when model fails
document_failure "model-name" "Timeout" "Exceeded 120s" "No"

Detection: Hook checks for failure section when success < total

Failure 4: No Consensus Analysis

Symptom: Individual model results listed without comparison

What went wrong:

"Model 1 found: A, B, C
 Model 2 found: B, D, E"
(No comparison: which issues do they agree on?)

Prevention:

After all complete, ALWAYS run consolidation
Create consensus table comparing findings
Prioritize by agreement level

Detection: Hook checks for consensus keywords

Failure 5: Statistics Not Saved

Symptom: No record in ai-docs/llm-performance.json

What went wrong:

# Forgot to call tracking functions
# No record of this session

Prevention:

# ALWAYS call these
track_model_performance "model" "status" duration issues quality
record_session_stats total success failed parallel sequential speedup

Detection: Hook checks file modification time
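
How the hook performs this check is not shown in this skill. One way to approximate it, assuming the tracking marker written during pre-launch serves as the reference point, is the `-nt` ("newer than") file test; `stats_updated_since` is a hypothetical helper name.

```bash
#!/bin/bash
# Sketch: succeed if the statistics file was modified after the reference file
# (for example the tracking marker written during pre-launch).
stats_updated_since() {
  local stats_file="$1" reference="$2"
  [ -f "$stats_file" ] && [ -f "$reference" ] && [ "$stats_file" -nt "$reference" ]
}
```

Usage: `stats_updated_since ai-docs/llm-performance.json /tmp/.claude-multi-model-active || echo "Statistics not saved for this session" >&2`.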

Prevention Checklist

Before presenting results, verify:

[ ] Tracking table exists at $SESSION_DIR/tracking.md
[ ] Tracking table is populated with all model results
[ ] All model times recorded (or "timeout"/"failed" noted)
[ ] All failures documented in $SESSION_DIR/failures.md
[ ] Consensus analysis performed in $SESSION_DIR/consensus.md
[ ] Results match required output format
[ ] Statistics saved to ai-docs/llm-performance.json
[ ] Session directory contains all artifacts

Integration Examples

Example 1: Complete Multi-Model Review Workflow

#!/bin/bash
# Full multi-model review with complete tracking

# ============================================================================
# PHASE 1: PRE-LAUNCH (MANDATORY)
# ============================================================================

# 1. Create unique session
SESSION_ID="review-$(date +%Y%m%d-%H%M%S)-$(head -c 4 /dev/urandom | xxd -p)"
SESSION_DIR="/tmp/${SESSION_ID}"
mkdir -p "$SESSION_DIR"

# 2. Record start time
SESSION_START=$(date +%s)

# 3. Create tracking table
cat > "$SESSION_DIR/tracking.md" << EOF
# Multi-Model Validation Tracking

## Session: $SESSION_ID
Started: $(date -u +%Y-%m-%dT%H:%M:%SZ)

## Model Status
| Model | Status | Duration | Issues | Quality |
|-------|--------|----------|--------|---------|
EOF

# 4. Initialize timing arrays
declare -A MODEL_START_TIMES
declare -A MODEL_END_TIMES

# 5. Create tracking marker
echo "$SESSION_DIR" > /tmp/.claude-multi-model-active

# 6. Write code context
git diff > "$SESSION_DIR/code-context.md"

echo "Pre-launch complete. Session: $SESSION_ID"

# ============================================================================
# PHASE 2: MODEL EXECUTION (Parallel Task calls)
# ============================================================================

# Record start times for each model
MODEL_START_TIMES["claude-embedded"]=$(date +%s)
MODEL_START_TIMES["grok"]=$(date +%s)
MODEL_START_TIMES["qwen"]=$(date +%s)

# Launch all models in single message (parallel execution)
# (These would be actual Task calls in practice)
echo "Launching 3 models in parallel..."

# ============================================================================
# PHASE 3: RESULTS COLLECTION (as each completes)
# ============================================================================

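# Update status immediately after each completes
# (track_model_performance is assumed to be provided elsewhere; it is not defined in this script)
update_model_status() {
  local model="$1" status="$2" issues="${3:-0}" quality="${4:-}"
  local end_time start_time duration
  end_time=$(date +%s)
  # Read the start time with an explicit expansion; fall back to the session start
  start_time="${MODEL_START_TIMES[$model]:-$SESSION_START}"
  duration=$((end_time - start_time))

  echo "| $model | $status | ${duration}s | $issues | ${quality:-N/A} |" >> "$SESSION_DIR/tracking.md"
  track_model_performance "$model" "$status" "$duration" "$issues" "$quality"
}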

# Example completions
update_model_status "claude-embedded" "success" 8 95
update_model_status "grok" "success" 6 87
update_model_status "qwen" "timeout"

# ============================================================================
# PHASE 4: CONSENSUS ANALYSIS (MANDATORY)
# ============================================================================

# Consolidate and compare findings
echo "Performing consensus analysis..."
# (Would launch consolidation agent here)

# ============================================================================
# PHASE 5: STATISTICS & PRESENTATION
# ============================================================================

# Calculate session stats
PARALLEL_TIME=52  # max of all durations
SEQUENTIAL_TIME=129  # sum of all durations
SPEEDUP=2.5

# Record session
record_session_stats 3 2 1 "$PARALLEL_TIME" "$SEQUENTIAL_TIME" "$SPEEDUP"

# Present results
cat << RESULTS
## Multi-Model Review Complete

Session: $SESSION_ID
Directory: $SESSION_DIR

Models: 3 requested, 2 successful, 1 failed

See tracking table: $SESSION_DIR/tracking.md
See consensus: $SESSION_DIR/consensus.md
Statistics saved to: ai-docs/llm-performance.json
RESULTS

# Cleanup marker
rm -f /tmp/.claude-multi-model-active

Example 2: Minimal 2-Model Comparison

# Simplest viable multi-model validation

# Pre-launch
SESSION_ID="review-$(date +%s)"
SESSION_DIR="/tmp/$SESSION_ID"
mkdir -p "$SESSION_DIR"
SESSION_START=$(date +%s)
echo "$SESSION_DIR" > /tmp/.claude-multi-model-active

# Launch
echo "Launching Claude + Grok..."
# Task: claude-embedded
# Task: claudish grok

# Track
track_model_performance "claude" "success" 32 8 95
track_model_performance "grok" "success" 45 6 87

# Consensus
echo "Issues both found: SQL injection, missing validation" > "$SESSION_DIR/consensus.md"

# Stats
record_session_stats 2 2 0 45 77 1.7

# Cleanup
rm -f /tmp/.claude-multi-model-active

Example 3: Handling Failures

# Multi-model with failure handling

# Pre-launch (same as Example 1)
# ... setup code ...

# Launch 4 models
# ... Task calls ...

# Count successes as results arrive
SUCCESS_COUNT=0

# Model 1: Success (arguments: model, status, issues, quality)
update_model_status "claude" "success" 8 95
SUCCESS_COUNT=$((SUCCESS_COUNT + 1))

# Model 2: Success
update_model_status "grok" "success" 6 87
SUCCESS_COUNT=$((SUCCESS_COUNT + 1))

# Model 3: Timeout
update_model_status "gemini" "timeout"
document_failure "gemini" "Timeout" "Exceeded 120s limit" "No"

# Model 4: API Error
update_model_status "gpt5" "failed"
document_failure "gpt5" "API Error" "500 from OpenRouter" "Yes, 1 retry"

# Proceed with 2 successful models
if [ "$SUCCESS_COUNT" -ge 2 ]; then
  echo "Proceeding with $SUCCESS_COUNT successful models"
  # Consensus with partial data
else
  echo "ERROR: Only $SUCCESS_COUNT succeeded, need minimum 2"
fi

Integration with Other Skills

With `multi-model-validation`

The multi-model-validation skill defines the execution patterns (4-Message Pattern, parallel execution, proxy mode). This skill (model-tracking-protocol) defines the tracking infrastructure.

Use together:

skills: multimodel:multi-model-validation, multimodel:model-tracking-protocol

Workflow:

Read multi-model-validation for execution patterns
Read model-tracking-protocol for tracking setup
Pre-launch (tracking protocol)
Execute (validation patterns)
Track (protocol updates)
Present (protocol templates)

With `quality-gates`

Use quality gates to ensure tracking is complete before proceeding:

# After tracking setup, verify completeness
if [ ! -f "$SESSION_DIR/tracking.md" ]; then
  echo "QUALITY GATE FAILED: No tracking table"
  exit 1
fi

# Before presenting results, verify all sections present
verify_output_complete "$OUTPUT" || exit 1

With `task-orchestration`

Track progress through multi-model phases:

Tasks:
1. [~] Pre-launch setup (tracking protocol)
2. [ ] Launch models (validation patterns)
3. [ ] Collect results (tracking updates)
4. [ ] Consensus analysis (protocol requirement)
5. [ ] Present results (protocol template)

Quick Reference

File-Based Tracking Marker (CONSENSUS FIX)

Create marker after pre-launch setup:

echo "$SESSION_DIR" > /tmp/.claude-multi-model-active

Check if tracking active (in hooks):

if [[ -f /tmp/.claude-multi-model-active ]]; then
  SESSION_DIR=$(cat /tmp/.claude-multi-model-active)
  [[ -f "$SESSION_DIR/tracking.md" ]] && echo "Tracking active"
fi

Remove marker when done:

rm -f /tmp/.claude-multi-model-active
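
If the workflow runs as a single script, the removal can be registered up front so that it also happens when the script exits early. This is a sketch of that idea, not something the skill requires; `MARKER` and `cleanup_marker` are names chosen for this example.

```bash
#!/bin/bash
# Sketch: register marker removal so it runs even if the script exits early.
MARKER="${MARKER:-/tmp/.claude-multi-model-active}"   # name chosen for this example

cleanup_marker() {
  rm -f "$MARKER"
}
trap cleanup_marker EXIT
```

A stale marker would otherwise make later hook checks report tracking as active for a session that no longer exists.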

Pre-Launch Commands

SESSION_ID="review-$(date +%Y%m%d-%H%M%S)-$(head -c 4 /dev/urandom | xxd -p)"
SESSION_DIR="/tmp/${SESSION_ID}"
mkdir -p "$SESSION_DIR"
SESSION_START=$(date +%s)
echo "$SESSION_DIR" > /tmp/.claude-multi-model-active

Tracking Commands

update_model_status "model" "status" issues quality
document_failure "model" "type" "error" "retry"
track_model_performance "model" "status" duration issues quality
record_session_stats total success failed parallel sequential speedup

Verification Commands

verify_output_complete "$OUTPUT"
[ -f "$SESSION_DIR/tracking.md" ] && echo "Tracking exists"
[ -f ai-docs/llm-performance.json ] && echo "Statistics saved"

Summary

This skill provides MANDATORY tracking infrastructure for multi-model validation:

Pre-Launch Checklist - 8 items to complete before launching models
Tracking Tables - Templates for 3-5 models and 6+ models
Status Updates - Per-model completion tracking
Failure Documentation - Required format for all failures
Consensus Analysis - Comparing findings across models
Results Template - Required output format
Common Failures - Prevention strategies
Integration Examples - Complete workflows

Key Innovation: File-based tracking marker (/tmp/.claude-multi-model-active) allows hooks to detect active tracking without relying on environment variables.

Use this skill when: Running 2+ external AI models in parallel for validation, review, or consensus analysis.

Missing tracking = INCOMPLETE validation.

MANDATORY tracking protocol for multi-model validation. Creates structured tracking tables BEFORE launching models, tracks progress during execution, and ensures complete results presentation. Use when running 2+ external AI models in parallel.

4 stars

data-ai

Updated May 7, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add madappgang/magus model-tracking-protocol

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: May 7, 2026, 6:40 AM (122.7s, 1 file scanned)

SKILL.md

name: model-tracking-protocol
description: MANDATORY tracking protocol for multi-model validation. Creates structured tracking tables BEFORE launching models, tracks progress during execution, and ensures complete results presentation. Use when running 2+ external AI models in parallel.
disable-model-invocation: true
version: 1.0.0
tags: [orchestration, tracking, multi-model, statistics, mandatory]
keywords: [tracking, mandatory, pre-launch, statistics, consensus, results, failures]
plugin: multimodel
updated: 2026-01-20
user-invocable: false

Model Tracking Protocol

Version: 1.0.0 Purpose: MANDATORY tracking protocol for multi-model validation to prevent incomplete reviews Status: Production Ready

Overview

This skill defines the MANDATORY tracking protocol for multi-model validation. It provides templates and procedures that make proper tracking unforgettable.

The Problem This Solves:

Agents often launch multiple external AI models but fail to:

Create structured tracking tables before launch
Collect timing and performance data during execution
Document failures with error messages
Perform consensus analysis comparing model findings
Present results in a structured format

The Solution:

This skill provides MANDATORY checklists, templates, and protocols that ensure complete tracking. Missing ANY of these steps = INCOMPLETE review.

MANDATORY Pre-Launch Checklist
Tracking Table Templates
Per-Model Status Updates
Failure Documentation Protocol
Consensus Analysis Requirements
Results Presentation Template
Common Failures and Prevention
Integration Examples

MANDATORY Pre-Launch Checklist

You MUST complete ALL items before launching ANY external models.

This is NOT optional. If you skip this, your multi-model validation is INCOMPLETE.

Checklist (Copy and Complete)

PRE-LAUNCH VERIFICATION (complete before Task calls):

[ ] 1. SESSION_ID created: ________________________
[ ] 2. SESSION_DIR created: ________________________
[ ] 3. Tracking table written to: $SESSION_DIR/tracking.md
[ ] 4. Start time recorded: SESSION_START=$(date +%s)
[ ] 5. Model list confirmed (comma-separated): ________________________
[ ] 6. Per-model timing arrays initialized
[ ] 7. Code context written to session directory
[ ] 8. Tracking marker created: /tmp/.claude-multi-model-active

If ANY item is unchecked, STOP and complete it before proceeding.

Why Pre-Launch Matters

Without pre-launch setup, you will:

Lose timing data (cannot calculate speed accurately)
Miss failed model details (no structured place to record)
Skip consensus analysis (no model list to compare)
Present incomplete results (no tracking table to populate)

Pre-Launch Script Template

CRITICAL CONSENSUS FIX APPLIED: Use file-based detection instead of environment variables.

#!/bin/bash
# Run this BEFORE launching any Task calls

# 1. Create unique session
SESSION_ID="review-$(date +%Y%m%d-%H%M%S)-$(head -c 4 /dev/urandom | xxd -p)"
SESSION_DIR="/tmp/${SESSION_ID}"
mkdir -p "$SESSION_DIR"

# 2. Record start time
SESSION_START=$(date +%s)

# 3. Create tracking table
cat > "$SESSION_DIR/tracking.md" << EOF
# Multi-Model Tracking

## Session Info
- Session ID: ${SESSION_ID}
- Started: $(date -u +%Y-%m-%dT%H:%M:%SZ)
- Models Requested: [FILL]

## Model Status

| Model | Agent ID | Status | Start | End | Duration | Issues | Quality | Notes |
|-------|----------|--------|-------|-----|----------|--------|---------|-------|
| [MODEL 1] | | pending | | | | | | |
| [MODEL 2] | | pending | | | | | | |
| [MODEL 3] | | pending | | | | | | |

## Failures

| Model | Failure Type | Error Message | Retry? |
|-------|--------------|---------------|--------|

## Consensus

| Issue | Model 1 | Model 2 | Model 3 | Agreement |
|-------|---------|---------|---------|-----------|

EOF

# 4. Initialize timing arrays
declare -A MODEL_START_TIMES
declare -A MODEL_END_TIMES
declare -A MODEL_STATUS

# 5. Create tracking marker file (CRITICAL FIX)
# This allows hooks to detect that tracking is active
echo "$SESSION_DIR" > /tmp/.claude-multi-model-active

echo "Pre-launch setup complete. Session: $SESSION_ID"
echo "Directory: $SESSION_DIR"
echo "Tracking table: $SESSION_DIR/tracking.md"

Strict Mode (Optional)

For stricter enforcement, set:

export CLAUDE_STRICT_TRACKING=true

When enabled, hooks will BLOCK execution if tracking is not set up, rather than just warning.
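
The hooks themselves are not shown in this skill, so the following is only a sketch of how such a check might behave. The function name `check_tracking` and the return codes (0 = active, 1 = warn, 2 = block) are assumptions made for illustration; only the marker path and the `CLAUDE_STRICT_TRACKING` variable come from the text above.

```shell
# Hypothetical hook check. Return codes are an assumption:
# 0 = tracking active, 1 = warn only, 2 = block execution.
check_tracking() {
  local marker="${1:-/tmp/.claude-multi-model-active}"
  local session_dir=""

  if [ -f "$marker" ]; then
    # The marker file contains the session directory path
    session_dir=$(cat "$marker")
    [ -f "$session_dir/tracking.md" ] && return 0
  fi

  if [ "${CLAUDE_STRICT_TRACKING:-false}" = "true" ]; then
    echo "BLOCKED: multi-model tracking is not set up" >&2
    return 2
  fi
  echo "WARNING: multi-model tracking is not set up" >&2
  return 1
}
```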

Tracking Table Templates

Template A: Simple Model Tracking (3-5 models)

| Model | Status | Time | Issues | Quality | Cost |
|-------|--------|------|--------|---------|------|
| claude-embedded | pending | - | - | - | FREE |
| grok | pending | - | - | - | - |
| qwen | pending | - | - | - | FREE |

Update as each completes:

| Model | Status | Time | Issues | Quality | Cost |
|-------|--------|------|--------|---------|------|
| claude-embedded | success | 32s | 8 | 95% | FREE |
| grok | success | 45s | 6 | 87% | $0.002 |
| qwen | timeout | - | - | - | - |

Template B: Detailed Model Tracking (6+ models)

## Model Execution Status

### Summary
- Total Requested: 8
- Completed: 0
- In Progress: 0
- Failed: 0
- Pending: 8

### Detailed Status

| # | Model | Provider | Status | Start | Duration | Issues | Quality | Cost | Error |
|---|-------|----------|--------|-------|----------|--------|---------|------|-------|
| 1 | claude-embedded | Anthropic | pending | - | - | - | - | FREE | - |
| 2 | grok | xAI | pending | - | - | - | - | - | - |
| 3 | qwen | Qwen | pending | - | - | - | - | FREE | - |
| 4 | gemini | Google | pending | - | - | - | - | - | - |
| 5 | gpt | OpenAI | pending | - | - | - | - | - | - |
| 6 | devstral | Mistral | pending | - | - | - | - | FREE | - |
| 7 | deepseek-r1 | DeepSeek | pending | - | - | - | - | - | - |
| 8 | claude-sonnet | Anthropic | pending | - | - | - | - | - | - |

Template C: Session-Based Tracking File

Create this file at $SESSION_DIR/tracking.md:

# Multi-Model Validation Tracking
Session: ${SESSION_ID}
Started: ${TIMESTAMP}

## Pre-Launch Verification
- [x] Session directory created: ${SESSION_DIR}
- [x] Tracking table initialized
- [x] Start time recorded: ${SESSION_START}
- [x] Model list: ${MODEL_LIST}

## Model Status

| Model | Status | Start | Duration | Issues | Quality |
|-------|--------|-------|----------|--------|---------|
| claude | pending | - | - | - | - |
| grok | pending | - | - | - | - |
| gemini | pending | - | - | - | - |

## Failures
(populated as failures occur)

## Consensus
(populated after all complete)

Update Protocol

As each model completes, IMMEDIATELY update:

Status: pending -> in_progress -> success/failed/timeout
Duration: Calculate from start time
Issues: Number of issues found
Quality: Percentage if calculable
Error: If failed, brief error message

DO NOT wait until all models finish. Update as each completes.
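
One way to apply these updates is to rewrite the model's existing row instead of appending a new one. The sketch below assumes the Template A column layout (Model | Status | Time | Issues | Quality | Cost) and leaves the Cost column untouched. `set_model_row` is a hypothetical helper, not something the skill defines.

```shell
# Hypothetical helper: rewrite one model's row in a Template A style table.
set_model_row() {
  local file="$1" model="$2" status="$3" elapsed="$4" issues="$5" quality="$6"
  local tmp
  tmp=$(mktemp) || return 1

  awk -F'|' -v OFS='|' -v m="$model" -v s="$status" -v t="$elapsed" \
      -v i="$issues" -v q="$quality" '
    {
      name = $2
      gsub(/^ +| +$/, "", name)      # trim padding around the model name
      if (NF >= 7 && name == m) {
        $3 = " " s " "; $4 = " " t " "; $5 = " " i " "; $6 = " " q " "
      }
      print
    }' "$file" > "$tmp" && mv "$tmp" "$file"
}
```

For example, `set_model_row "$SESSION_DIR/tracking.md" grok success 45s 6 87%` turns the pending `grok` row into a completed one while every other row stays as it was.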

Per-Model Status Update Protocol

IMMEDIATELY After Each Model Completes

Do NOT wait until all models finish. Update tracking AS EACH COMPLETES.

Update Script

# Call this when each model completes
update_model_status() {
  local model="$1"
  local status="$2"
  local issues="${3:-0}"
  local quality="${4:-}"
  local error="${5:-}"

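  # Assign separately from `local` so a failing command is not masked
  local end_time start_time duration
  end_time=$(date +%s)
  # Fall back to the session start (or zero elapsed) if no per-model start was recorded
  start_time="${MODEL_START_TIMES[$model]:-${SESSION_START:-$end_time}}"
  duration=$((end_time - start_time))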

  # Update arrays
  MODEL_END_TIMES["$model"]=$end_time
  MODEL_STATUS["$model"]="$status"

  # Log update to session tracking file
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) - Model: $model, Status: $status, Duration: ${duration}s" >> "$SESSION_DIR/execution.log"

  # Append a result row to tracking.md (columns: Model | Status | Duration | Issues | Quality | Error)
  echo "| $model | $status | ${duration}s | $issues | ${quality:-N/A} | ${error:-} |" >> "$SESSION_DIR/tracking.md"

  # Track performance in global statistics
  if [[ "$status" == "success" ]]; then
    track_model_performance "$model" "success" "$duration" "$issues" "$quality"
  else
    track_model_performance "$model" "$status" "$duration" 0 ""
  fi
}

# Usage examples:
update_model_status "claude-embedded" "success" 8 95
update_model_status "grok" "success" 6 87
update_model_status "some-model" "timeout" 0 "" "Exceeded 120s limit"
update_model_status "other-model" "failed" 0 "" "API 500 error"

Status Values

Valid values, as used in the Update Protocol: pending, in_progress, success, failed, timeout.

Real-Time Progress Display

Show user progress as models complete:

Model Status (3/5 complete):
✓ claude-embedded (32s, 8 issues)
✓ grok (45s, 6 issues)
✓ qwen (52s, 5 issues)
⏳ gpt (in progress, 60s elapsed)
⏳ gemini (in progress, 48s elapsed)
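
A display like the one above could be rendered from a simple status feed. This is a sketch under stated assumptions: the input format (`<model> <status> <seconds> <issues>`, one model per line on stdin) and the function name `show_progress` are hypothetical, as are the markers used for pending and failed models.

```shell
# Hypothetical sketch: render progress from lines of
# "<model> <status> <seconds> <issues>" read on stdin.
show_progress() {
  awk '
    { total++ }
    $2 == "success"     { done++; lines = lines sprintf("✓ %s (%ds, %d issues)\n", $1, $3, $4); next }
    $2 == "in_progress" { lines = lines sprintf("⏳ %s (in progress, %ds elapsed)\n", $1, $3); next }
    $2 == "pending"     { lines = lines sprintf("· %s (pending)\n", $1); next }
                        { done++; lines = lines sprintf("✗ %s (%s)\n", $1, $2) }   # failed, timeout
    END { printf "Model Status (%d/%d complete):\n%s", done, total, lines }'
}
```

Failed and timed-out models count as complete here because no further output is expected from them. Whether the skill intends that is not stated.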

Failure Documentation Protocol

EVERY failed model MUST be documented with:

Model name
Failure type (timeout, API error, parse error, etc.)
Error message (exact or summarized)
Whether retry was attempted

Failure Report Template

## Failed Models Report

### Model: grok
- **Failure Type:** API Error
- **Error Message:** "500 Internal Server Error from OpenRouter"
- **Retry Attempted:** Yes, 1 retry, same error
- **Impact:** Review results based on 3/4 models instead of 4
- **Recommendation:** Check OpenRouter status, retry later

### Model: gemini
- **Failure Type:** Timeout
- **Error Message:** "Exceeded 120s limit, response incomplete"
- **Retry Attempted:** No, time constraints
- **Impact:** Lost Gemini perspective, consensus based on remaining models
- **Recommendation:** Extend timeout to 180s for this model

Failure Categorization

Failure Summary Table

Always include this in final results:

## Execution Summary

| Metric | Value |
|--------|-------|
| Models Requested | 8 |
| Successful | 5 (62.5%) |
| Failed | 3 (37.5%) |

### Failed Models

| Model | Failure | Recoverable? | Action |
|-------|---------|--------------|--------|
| grok | API 500 | Yes - retry later | Check OpenRouter status |
| gemini | Timeout | Yes - extend limit | Use 180s timeout |
| deepseek-r1 | Auth Error | No - check key | Verify API key valid |
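
The counts and percentages in the Execution Summary follow directly from the number of models requested and the number that succeeded. A minimal sketch, assuming both counts are already known; `execution_summary` is a hypothetical helper name.

```shell
# Hypothetical helper: print the Execution Summary rows from two counts.
execution_summary() {
  local requested="$1" successful="$2"
  awk -v r="$requested" -v s="$successful" '
    function pct(n, total,   p) {
      p = sprintf("%.1f", 100 * n / total)
      sub(/\.0$/, "", p)             # show 80% rather than 80.0%
      return p "%"
    }
    BEGIN {
      if (r <= 0) { print "requested must be positive" > "/dev/stderr"; exit 1 }
      printf "| Models Requested | %d |\n", r
      printf "| Successful | %d (%s) |\n", s, pct(s, r)
      printf "| Failed | %d (%s) |\n", r - s, pct(r - s, r)
    }'
}
```

Calling `execution_summary 8 5` reproduces the three metric rows of the table above.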

Writing Failures to Session Directory

# Document failure immediately when it occurs
document_failure() {
  local model="$1"
  local failure_type="$2"
  local error_msg="$3"
  local retry_attempted="${4:-No}"

  cat >> "$SESSION_DIR/failures.md" << EOF

### Model: $model
- **Failure Type:** $failure_type
- **Error Message:** "$error_msg"
- **Retry Attempted:** $retry_attempted
- **Timestamp:** $(date -u +%Y-%m-%dT%H:%M:%SZ)

EOF

  echo "Failure documented: $model ($failure_type)" >&2
}

# Usage:
document_failure "grok" "API Error" "500 Internal Server Error" "Yes, 1 retry"

Consensus Analysis Requirements

After ALL models complete (or the maximum wait time elapses), you MUST perform consensus analysis.

This is NOT optional. Even with 2 successful models, compare their findings.

Minimum Viable Consensus (2 models)

With only 2 models, consensus is simple:

AGREE: Both found the same issue
DISAGREE: Only one found the issue (record which model, e.g. "Model 1 only")

| Issue | Model 1 | Model 2 | Consensus |
|-------|---------|---------|-----------|
| SQL injection | Yes | Yes | AGREE |
| Missing validation | Yes | No | Model 1 only |
| Weak hashing | No | Yes | Model 2 only |

Standard Consensus (3-5 models)

| Issue | Claude | Grok | Gemini | Agreement |
|-------|--------|------|--------|-----------|
| SQL injection | Yes | Yes | Yes | UNANIMOUS (3/3) |
| Missing validation | Yes | Yes | No | STRONG (2/3) |
| Rate limiting | Yes | No | No | DIVERGENT (1/3) |

Extended Consensus (6+ models)

For 6+ models, add summary statistics:

## Consensus Summary

- **Unanimous Issues (100%):** 3 issues
- **Strong Consensus (67%+):** 5 issues
- **Majority (50%+):** 2 issues
- **Divergent (<50%):** 4 issues

## Top 5 by Consensus

1. [6/6] SQL injection in search - FIX IMMEDIATELY
2. [6/6] Missing input validation - FIX IMMEDIATELY
3. [5/6] Weak password hashing - RECOMMENDED
4. [4/6] Missing rate limiting - CONSIDER
5. [3/6] Error handling gaps - INVESTIGATE
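
The agreement levels can be assigned with integer arithmetic. The sketch below is one possible reading of the thresholds, not the skill's own implementation: it treats STRONG as two-thirds or more, so that 2/3 counts as STRONG as in the Standard Consensus table, even though 2/3 is fractionally below a literal 67%. `agreement_level` is a hypothetical helper name.

```shell
# Hypothetical helper: map "agree out of total" to an agreement level.
agreement_level() {
  local agree="$1" total="$2"
  if [ "$total" -le 0 ] || [ "$agree" -lt 0 ] || [ "$agree" -gt "$total" ]; then
    echo "invalid counts: $agree/$total" >&2
    return 1
  fi
  if [ "$agree" -eq "$total" ]; then
    echo "UNANIMOUS"
  elif [ $((agree * 3)) -ge $((total * 2)) ]; then   # agree/total >= 2/3
    echo "STRONG"
  elif [ $((agree * 2)) -ge "$total" ]; then         # agree/total >= 1/2
    echo "MAJORITY"
  else
    echo "DIVERGENT"
  fi
}
```

With these rules, the Top 5 entries above classify as UNANIMOUS (6/6), STRONG (5/6 and 4/6) and MAJORITY (3/6).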

Consensus Analysis Script

# Perform consensus analysis on all model findings
analyze_consensus() {
  local session_dir="$1"
  local num_models="$2"

  echo "## Consensus Analysis" > "$session_dir/consensus.md"
  echo "" >> "$session_dir/consensus.md"
  echo "Based on $num_models model reviews:" >> "$session_dir/consensus.md"
  echo "" >> "$session_dir/consensus.md"

  # Read all review files and extract issues
  # (simplified - actual implementation would parse review markdown)
  for review in "$session_dir"/*-review.md; do
    # Skip the literal pattern when no review files exist
    [ -e "$review" ] || continue
    echo "Processing: $review"
    # Extract issues, compare, categorize by agreement level
  done

  # Calculate consensus levels
  echo "### Consensus Levels" >> "$session_dir/consensus.md"
  echo "" >> "$session_dir/consensus.md"
  echo "- UNANIMOUS: All $num_models models agree" >> "$session_dir/consensus.md"
  echo "- STRONG: ≥67% of models agree" >> "$session_dir/consensus.md"
  echo "- MAJORITY: ≥50% of models agree" >> "$session_dir/consensus.md"
  echo "- DIVERGENT: <50% of models agree" >> "$session_dir/consensus.md"
}
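
The extraction step is left as a stub above. As one illustration of how it might be filled in, the sketch below assumes each `<model>-review.md` lists one finding per line as `ISSUE: <title>`. That line format is an assumption for this example only and is not something the skill defines. `rank_issues` is likewise a hypothetical name.

```shell
# Hypothetical sketch: count how many reviews mention each issue.
# Assumes findings appear as lines of the form "ISSUE: <title>".
rank_issues() {
  local session_dir="$1" review total=0
  local all
  all=$(mktemp) || return 1

  for review in "$session_dir"/*-review.md; do
    [ -e "$review" ] || continue
    total=$((total + 1))
    # De-duplicate within one review so each model counts once per issue
    sed -n 's/^ISSUE: //p' "$review" | sort -u >> "$all"
  done

  # Most agreed-upon issues first; ties are ordered by title
  sort "$all" | uniq -c | sort -k1,1nr -k2 |
    while read -r count title; do
      echo "[$count/$total] $title"
    done
  rm -f "$all"
}
```

The output uses the same `[n/total]` prefix as the Top 5 list, so it can be pasted into `consensus.md` and then labeled by agreement level.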

NO Consensus Analysis = INCOMPLETE Review

If you present results without a consensus comparison, your review is INCOMPLETE.

Minimum Requirements:

✅ Compare findings across ALL successful models
✅ Categorize by agreement level (unanimous, strong, majority, divergent)
✅ Prioritize issues by consensus + severity
✅ Document in $SESSION_DIR/consensus.md

Results Presentation Template

Your final output MUST include ALL of these sections.

Required Output Format

## Multi-Model Review Complete

### Execution Summary

| Metric | Value |
|--------|-------|
| Session ID | review-20251224-143052-a3f2 |
| Session Directory | /tmp/review-20251224-143052-a3f2 |
| Models Requested | 5 |
| Successful | 4 (80%) |
| Failed | 1 (20%) |
| Total Duration | 68s (parallel) |
| Sequential Equivalent | 197s |
| Speedup | 2.9x |

### Model Performance

| Model | Time | Issues | Quality | Status | Cost |
|-------|------|--------|---------|--------|------|
| claude-embedded | 32s | 8 | 95% | Success | FREE |
| grok | 45s | 6 | 87% | Success | $0.002 |
| qwen | 52s | 5 | 82% | Success | FREE |
| gpt | 68s | 7 | 89% | Success | $0.015 |
| devstral | - | - | - | Timeout | - |

### Failed Models

| Model | Failure | Error |
|-------|---------|-------|
| devstral | Timeout | Exceeded 120s limit |

### Top Issues by Consensus

1. **[UNANIMOUS]** SQL injection in search endpoint
   - Flagged by: claude, grok, qwen, gpt (4/4)
   - Severity: CRITICAL
   - Action: FIX IMMEDIATELY

2. **[UNANIMOUS]** Missing input validation
   - Flagged by: claude, grok, qwen, gpt (4/4)
   - Severity: CRITICAL
   - Action: FIX IMMEDIATELY

3. **[STRONG]** Weak password hashing
   - Flagged by: claude, grok, gpt (3/4)
   - Severity: HIGH
   - Action: RECOMMENDED

### Detailed Reports

- Session directory: /tmp/review-20251224-143052-a3f2
- Consolidated review: /tmp/review-20251224-143052-a3f2/consolidated-review.md
- Individual reviews: /tmp/review-20251224-143052-a3f2/{model}-review.md
- Tracking data: /tmp/review-20251224-143052-a3f2/tracking.md
- Consensus analysis: /tmp/review-20251224-143052-a3f2/consensus.md

### Statistics Saved

- Performance data logged to: ai-docs/llm-performance.json
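
The timing rows of the Execution Summary can be derived from per-model durations: parallel time is the slowest model, sequential time is the sum, and speedup is their ratio. A minimal sketch, assuming durations are passed in seconds; `session_timing` is a hypothetical helper, and whether timed-out models should be included is left to the caller because the skill does not specify it.

```shell
# Hypothetical helper: derive parallel time, sequential time and speedup
# from per-model durations given in seconds.
session_timing() {
  [ "$#" -gt 0 ] || { echo "usage: session_timing SECONDS..." >&2; return 1; }
  printf '%s\n' "$@" | awk '
    { sum += $1; if ($1 > max) max = $1 }
    END {
      if (max <= 0) exit 1
      printf "parallel=%ds sequential=%ds speedup=%.1fx\n", max, sum, sum / max
    }'
}
```

For durations of 32, 45 and 52 seconds this prints `parallel=52s sequential=129s speedup=2.5x`.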

Missing Section Detection

Before presenting, verify ALL sections are present:

verify_output_complete() {
  local output="$1"

  local required=(
    "Execution Summary"
    "Model Performance"
    "Top Issues"
    "Detailed Reports"
    "Statistics"
  )

  local missing=()
  for section in "${required[@]}"; do
    if ! printf '%s\n' "$output" | grep -qF -- "$section"; then
      missing+=("$section")
    fi
  done

  if [ ${#missing[@]} -gt 0 ]; then
    echo "ERROR: Missing required sections: ${missing[*]}" >&2
    return 1
  fi

  return 0
}

Checklist before presenting results:

[ ] Execution Summary (models requested/successful/failed)
[ ] Model Performance table (per-model times and quality)
[ ] Failed Models section (if any failed)
[ ] Top Issues by Consensus (prioritized list)
[ ] Detailed Reports (session directory, file paths)
[ ] Statistics confirmation (llm-performance.json updated)

Common Failures and Prevention

Failure 1: No Tracking Table Created

Symptom: Results presented as prose, not structured data

What went wrong:

"I ran 5 models. 3 succeeded and found various issues."
(No table, no structure)

Prevention:

Always run pre-launch script FIRST
Create $SESSION_DIR/tracking.md before Task calls
Populate table as models complete

Detection: SubagentStop hook warns if no tracking found

Failure 2: Timing Not Recorded

Symptom: "Duration: unknown" or missing speed stats

What went wrong:

# Launched models without recording start time
Task: reviewer1
Task: reviewer2
# No SESSION_START, cannot calculate duration!

Prevention:

# ALWAYS do this first (the associative array must be declared before use)
declare -A MODEL_START_TIMES
SESSION_START=$(date +%s)
MODEL_START_TIMES["model1"]=$SESSION_START

Detection: Hook checks for timing data in output

Failure 3: Failed Models Not Documented

Symptom: "2 of 8 succeeded" with no failure details

What went wrong:

"Launched 8 models. 2 succeeded."
(No info on why 6 failed)

Prevention:

# Immediately when model fails
document_failure "model-name" "Timeout" "Exceeded 120s" "No"

Detection: Hook checks for failure section when success < total

Failure 4: No Consensus Analysis

Symptom: Individual model results listed without comparison

What went wrong:

"Model 1 found: A, B, C
 Model 2 found: B, D, E"
(No comparison: which issues do they agree on?)

Prevention:

After all complete, ALWAYS run consolidation
Create consensus table comparing findings
Prioritize by agreement level

Detection: Hook checks for consensus keywords

Failure 5: Statistics Not Saved

Symptom: No record in ai-docs/llm-performance.json

What went wrong:

# Forgot to call tracking functions
# No record of this session

Prevention:

# ALWAYS call these
track_model_performance "model" "status" duration issues quality
record_session_stats total success failed parallel sequential speedup

Detection: Hook checks file modification time
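
The hook is not shown in this skill, so the following is only a sketch of what a modification-time check could look like. It assumes the statistics count as saved when the file is newer than the tracking marker written during pre-launch. The function name and the choice of the marker as the reference point are both assumptions.

```shell
# Hypothetical check: succeed only if the statistics file was modified
# after the tracking marker was created.
stats_saved_this_session() {
  local stats_file="${1:-ai-docs/llm-performance.json}"
  local marker="${2:-/tmp/.claude-multi-model-active}"
  [ -f "$stats_file" ] && [ -f "$marker" ] || return 1
  # find prints the path only when stats_file is newer than marker
  [ -n "$(find "$stats_file" -newer "$marker" 2>/dev/null)" ]
}
```

Because the check depends on the marker, it would need to run before the cleanup step removes `/tmp/.claude-multi-model-active`.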

Prevention Checklist

Before presenting results, verify:

[ ] Tracking table exists at $SESSION_DIR/tracking.md
[ ] Tracking table is populated with all model results
[ ] All model times recorded (or "timeout"/"failed" noted)
[ ] All failures documented in $SESSION_DIR/failures.md
[ ] Consensus analysis performed in $SESSION_DIR/consensus.md
[ ] Results match required output format
[ ] Statistics saved to ai-docs/llm-performance.json
[ ] Session directory contains all artifacts

Integration Examples

Example 1: Complete Multi-Model Review Workflow

#!/bin/bash
# Full multi-model review with complete tracking

# ============================================================================
# PHASE 1: PRE-LAUNCH (MANDATORY)
# ============================================================================

# 1. Create unique session
SESSION_ID="review-$(date +%Y%m%d-%H%M%S)-$(head -c 4 /dev/urandom | xxd -p)"
SESSION_DIR="/tmp/${SESSION_ID}"
mkdir -p "$SESSION_DIR"

# 2. Record start time
SESSION_START=$(date +%s)

# 3. Create tracking table
cat > "$SESSION_DIR/tracking.md" << EOF
# Multi-Model Validation Tracking

## Session: $SESSION_ID
Started: $(date -u +%Y-%m-%dT%H:%M:%SZ)

## Model Status
| Model | Status | Duration | Issues | Quality |
|-------|--------|----------|--------|---------|
EOF

# 4. Initialize timing arrays
declare -A MODEL_START_TIMES
declare -A MODEL_END_TIMES

# 5. Create tracking marker
echo "$SESSION_DIR" > /tmp/.claude-multi-model-active

# 6. Write code context
git diff > "$SESSION_DIR/code-context.md"

echo "Pre-launch complete. Session: $SESSION_ID"

# ============================================================================
# PHASE 2: MODEL EXECUTION (Parallel Task calls)
# ============================================================================

# Record start times for each model
MODEL_START_TIMES["claude-embedded"]=$(date +%s)
MODEL_START_TIMES["grok"]=$(date +%s)
MODEL_START_TIMES["qwen"]=$(date +%s)

# Launch all models in single message (parallel execution)
# (These would be actual Task calls in practice)
echo "Launching 3 models in parallel..."

# ============================================================================
# PHASE 3: RESULTS COLLECTION (as each completes)
# ============================================================================

# Update status immediately after each completes
update_model_status() {
  local model="$1" status="$2" issues="${3:-0}" quality="${4:-}"
  local end_time duration
  end_time=$(date +%s)
  duration=$((end_time - ${MODEL_START_TIMES[$model]:-$SESSION_START}))

  echo "| $model | $status | ${duration}s | $issues | ${quality:-N/A} |" >> "$SESSION_DIR/tracking.md"
  track_model_performance "$model" "$status" "$duration" "$issues" "$quality"
}

# Example completions
update_model_status "claude-embedded" "success" 8 95
update_model_status "grok" "success" 6 87
update_model_status "qwen" "timeout"

# ============================================================================
# PHASE 4: CONSENSUS ANALYSIS (MANDATORY)
# ============================================================================

# Consolidate and compare findings
echo "Performing consensus analysis..."
# (Would launch consolidation agent here)

# ============================================================================
# PHASE 5: STATISTICS & PRESENTATION
# ============================================================================

# Calculate session stats
PARALLEL_TIME=52  # max of all durations
SEQUENTIAL_TIME=129  # sum of all durations
SPEEDUP=2.5

# Record session
record_session_stats 3 2 1 "$PARALLEL_TIME" "$SEQUENTIAL_TIME" "$SPEEDUP"

# Present results
cat << RESULTS
## Multi-Model Review Complete

Session: $SESSION_ID
Directory: $SESSION_DIR

Models: 3 requested, 2 successful, 1 failed

See tracking table: $SESSION_DIR/tracking.md
See consensus: $SESSION_DIR/consensus.md
Statistics saved to: ai-docs/llm-performance.json
RESULTS

# Cleanup marker
rm -f /tmp/.claude-multi-model-active

Example 2: Minimal 2-Model Comparison

# Simplest viable multi-model validation

# Pre-launch
SESSION_ID="review-$(date +%s)"
SESSION_DIR="/tmp/$SESSION_ID"
mkdir -p "$SESSION_DIR"
SESSION_START=$(date +%s)
echo "$SESSION_DIR" > /tmp/.claude-multi-model-active

# Launch
echo "Launching Claude + Grok..."
# Task: claude-embedded
# Task: claudish grok

# Track
track_model_performance "claude" "success" 32 8 95
track_model_performance "grok" "success" 45 6 87

# Consensus
echo "Issues both found: SQL injection, missing validation" > "$SESSION_DIR/consensus.md"

# Stats
record_session_stats 2 2 0 45 77 1.7

# Cleanup
rm -f /tmp/.claude-multi-model-active

Example 3: Handling Failures

# Multi-model with failure handling

# Pre-launch (same as Example 1)
# ... setup code ...

# Launch 4 models
# ... Task calls ...

# Model 1: Success (arguments: model, status, issues, quality)
update_model_status "claude" "success" 8 95

# Model 2: Success
update_model_status "grok" "success" 6 87

# Model 3: Timeout
update_model_status "gemini" "timeout"
document_failure "gemini" "Timeout" "Exceeded 120s limit" "No"

# Model 4: API Error
update_model_status "gpt5" "failed"
document_failure "gpt5" "API Error" "500 from OpenRouter" "Yes, 1 retry"

# Proceed with 2 successful models
SUCCESS_COUNT=2  # claude and grok succeeded above
if [ "$SUCCESS_COUNT" -ge 2 ]; then
  echo "Proceeding with $SUCCESS_COUNT successful models"
  # Consensus with partial data
else
  echo "ERROR: Only $SUCCESS_COUNT succeeded, need minimum 2"
fi
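
The success count used in the check above could be derived from the session log rather than maintained by hand. This sketch assumes the `execution.log` line format written by the Update Script version of `update_model_status` (`... - Model: X, Status: success, Duration: Ns`). `count_successes` is a hypothetical helper.

```shell
# Hypothetical helper: count successful models recorded in execution.log.
count_successes() {
  local log="$1"
  [ -f "$log" ] || { echo 0; return 0; }
  # grep -c prints 0 but exits non-zero when nothing matches, so ignore its status
  grep -c ', Status: success,' "$log" || true
}
```

Usage would then be `SUCCESS_COUNT=$(count_successes "$SESSION_DIR/execution.log")` before the minimum-models check.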

Integration with Other Skills

With `multi-model-validation`

The multi-model-validation skill defines the execution patterns (4-Message Pattern, parallel execution, proxy mode). This skill (model-tracking-protocol) defines the tracking infrastructure.

Use together:

skills: multimodel:multi-model-validation, multimodel:model-tracking-protocol

Workflow:

Read multi-model-validation for execution patterns
Read model-tracking-protocol for tracking setup
Pre-launch (tracking protocol)
Execute (validation patterns)
Track (protocol updates)
Present (protocol templates)

With `quality-gates`

Use quality gates to ensure tracking is complete before proceeding:

# After tracking setup, verify completeness
if [ ! -f "$SESSION_DIR/tracking.md" ]; then
  echo "QUALITY GATE FAILED: No tracking table"
  exit 1
fi

# Before presenting results, verify all sections present
verify_output_complete "$OUTPUT" || exit 1

With `task-orchestration`

Track progress through multi-model phases:

Tasks:
1. [~] Pre-launch setup (tracking protocol)
2. [ ] Launch models (validation patterns)
3. [ ] Collect results (tracking updates)
4. [ ] Consensus analysis (protocol requirement)
5. [ ] Present results (protocol template)

Quick Reference

File-Based Tracking Marker (CONSENSUS FIX)

Create marker after pre-launch setup:

echo "$SESSION_DIR" > /tmp/.claude-multi-model-active

Check if tracking active (in hooks):

if [[ -f /tmp/.claude-multi-model-active ]]; then
  SESSION_DIR=$(cat /tmp/.claude-multi-model-active)
  [[ -f "$SESSION_DIR/tracking.md" ]] && echo "Tracking active"
fi

Remove marker when done:

rm -f /tmp/.claude-multi-model-active

Pre-Launch Commands

SESSION_ID="review-$(date +%Y%m%d-%H%M%S)-$(head -c 4 /dev/urandom | xxd -p)"
SESSION_DIR="/tmp/${SESSION_ID}"
mkdir -p "$SESSION_DIR"
SESSION_START=$(date +%s)
echo "$SESSION_DIR" > /tmp/.claude-multi-model-active

Tracking Commands

update_model_status "model" "status" issues quality
document_failure "model" "type" "error" "retry"
track_model_performance "model" "status" duration issues quality
record_session_stats total success failed parallel sequential speedup

Verification Commands

verify_output_complete "$OUTPUT"
[ -f "$SESSION_DIR/tracking.md" ] && echo "Tracking exists"
[ -f ai-docs/llm-performance.json ] && echo "Statistics saved"

Summary

This skill provides MANDATORY tracking infrastructure for multi-model validation:

Pre-Launch Checklist - 8 items to complete before launching models
Tracking Tables - Templates for 3-5 models and 6+ models
Status Updates - Per-model completion tracking
Failure Documentation - Required format for all failures
Consensus Analysis - Comparing findings across models
Results Template - Required output format
Common Failures - Prevention strategies
Integration Examples - Complete workflows

Key Innovation: File-based tracking marker (/tmp/.claude-multi-model-active) allows hooks to detect active tracking without relying on environment variables.

Use this skill when: Running 2+ external AI models in parallel for validation, review, or consensus analysis.

Missing tracking = INCOMPLETE validation.

Install manually

# Clone the repo
git clone https://github.com/madappgang/magus.git

# Copy into Claude Code skills folder (global)
cp -r magus/plugins/multimodel/skills/model-tracking-protocol ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

madappgang/magus

4 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT
