Troubleshoot Skill

Arguments: $ARGUMENTS

Systematically troubleshoot issues through multi-agent orchestration: discover → investigate → fix → verify → commit.

Configuration

# Source resolve-config: marketplace installs get ${CLAUDE_PLUGIN_ROOT} substituted
# inline before bash runs; ./install.sh users fall back to ~/.claude. If neither
# path resolves, fail loudly rather than letting resolve_artifact be undefined.
if [ -f "${CLAUDE_PLUGIN_ROOT}/shared/resolve-config.sh" ]; then
  source "${CLAUDE_PLUGIN_ROOT}/shared/resolve-config.sh"
elif [ -f "$HOME/.claude/shared/resolve-config.sh" ]; then
  source "$HOME/.claude/shared/resolve-config.sh"
else
  echo "ERROR: resolve-config.sh not found. Install via marketplace or run ./install.sh" >&2
  exit 1
fi
TROUBLESHOOT_EXEC_MODE=$(resolve_exec_mode troubleshoot team)

# Optional --spec PATH: opt-in source for per-AC verification (Phase 6.3).
# Troubleshoot is ticket-agnostic by default; this flag is the ONLY way a spec
# enters the run — no inference from CWD, ticket, or work dir.
# Quote-aware parse: handles `--spec PATH`, `--spec=PATH`, and quoted paths
# with spaces; a bare `--spec` (no value) resolves to empty and is ignored.
SPEC=""
case " $ARGUMENTS " in
  *" --spec "*|*" --spec="*)
    rest=${ARGUMENTS#*--spec}; rest=${rest#[ =]}    # text after the flag, minus one space/=
    case $rest in
      \"*) SPEC=${rest#\"}; SPEC=${SPEC%%\"*} ;;     # double-quoted path (allows spaces)
      \'*) SPEC=${rest#\'}; SPEC=${SPEC%%\'*} ;;     # single-quoted path
      *)   SPEC=${rest%% *} ;;                        # bare token, up to next space
    esac
    ;;
esac
if [ -n "$SPEC" ] && [ ! -f "$SPEC" ]; then
  echo "WARNING: --spec '$SPEC' not found; per-AC verification skipped." >&2
  SPEC=""
fi

Use $TROUBLESHOOT_EXEC_MODE to determine team vs sub-agent behavior in Phase 6 (verify fix).

Write Safety

When running verification agents in parallel (Phase 6), agents MUST NOT write to the same file:

security-auditor: Writes only to its own scoped output (returned via Task result)
quality-guard: Writes only to its own scoped output (returned via Task result)
Source code fixes: Only the lead applies fixes (Phase 5), sequentially, never in parallel

See ~/.claude/shared/write-safety.md for the full conventions.

Usage

/troubleshoot "Endpoint /api/users returns 202 instead of 200"
/troubleshoot "Login fails when password contains special characters"
/troubleshoot "Database query times out on large datasets"
/troubleshoot "Endpoint returns 500 after deploy" --spec .claude/work/PROJ-1-login/spec.md

--spec PATH (optional) opts into per-AC verification: when supplied, Phase 6.3 verifies the fix against the spec's acceptance criteria and appends a per-AC PASS/FAIL section. Omit it for an ordinary ad-hoc run — troubleshoot stays ticket-agnostic and infers no spec on its own.
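To see how the accepted forms resolve, the parse from the Configuration block can be wrapped in a function and exercised on sample argument strings. This is a minimal sketch: `parse_spec_arg` is a hypothetical name introduced here for illustration, and the file-existence check from Configuration is left out so the examples run without any files on disk.

```shell
# Hypothetical wrapper around the --spec parse shown in Configuration.
# Prints the resolved path, or an empty line when no usable --spec is present.
parse_spec_arg() {
  args=$1
  spec=""
  case " $args " in
    *" --spec "*|*" --spec="*)
      rest=${args#*--spec}; rest=${rest#[ =]}      # text after the flag, minus one space/=
      case $rest in
        \"*) spec=${rest#\"}; spec=${spec%%\"*} ;; # double-quoted path (allows spaces)
        \'*) spec=${rest#\'}; spec=${spec%%\'*} ;; # single-quoted path
        *)   spec=${rest%% *} ;;                   # bare token, up to next space
      esac
      ;;
  esac
  printf '%s\n' "$spec"
}

parse_spec_arg '"Endpoint returns 500" --spec docs/spec.md'     # prints docs/spec.md
parse_spec_arg '"Endpoint returns 500" --spec=docs/spec.md'     # prints docs/spec.md
parse_spec_arg '"Endpoint returns 500" --spec "my dir/spec.md"' # prints my dir/spec.md
parse_spec_arg '"Endpoint returns 500" --spec'                  # prints an empty line
```

A bare `--spec` and a missing flag both resolve to the empty string, which is why the ad-hoc run needs no special casing.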

When to Use This Skill

Endpoint returns wrong status code
Feature not working as expected
Error/exception being thrown
Performance issue
Data inconsistency
Test failing unexpectedly

Workflow Overview

┌──────────────────────────────────────────────────────────────────────┐
│ Phase 1: Parse Issue              → Extract what's wrong             │
│ Phase 2: Discover Code            → Find relevant code               │
│ Phase 3: Investigate              → Root cause analysis              │
│ Phase 4: Determine Fix Strategy   → Code fix, test fix, or clarify   │
│ Phase 5: Apply Fix                → Apply fix (code or tests)        │
│ Phase 6: Verify                   → Run tests, ensure fix works      │
│ Phase 7: Commit                   → Save the fix                     │
└──────────────────────────────────────────────────────────────────────┘

Phase 0: Enter Worktree (Conditional)

Skip if resolve_worktree_enabled returns "false".

Single mode (WORKSPACE_MODE == "single"):

Call EnterWorktree(name: "troubleshoot-{short_slug}") where {short_slug} is derived from the issue description (e.g., troubleshoot-login-500)
CWD moves to worktree; $WORK_DIR still resolves to original workspace root
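The skill does not specify how `{short_slug}` is derived, so the following is only one possible sketch: lowercase the description, collapse every run of non-alphanumeric characters into a hyphen, and keep the first few words. The function name `short_slug` and the default of three words are assumptions made for illustration.

```shell
# Hypothetical helper: derive a short, branch-safe slug from an issue description.
# $1: issue description, $2: maximum number of words to keep (default 3, an assumption)
short_slug() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//' \
    | cut -d- -f1-"${2:-3}"
}

echo "troubleshoot-$(short_slug 'Login 500!!')"   # prints troubleshoot-login-500
```

Keeping the slug to letters, digits, and hyphens means it is safe to reuse in both the worktree directory name and the `troubleshoot/{short_slug}` branch name.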

Multi mode (WORKSPACE_MODE == "multi"):

Create per-service worktrees using each service's current branch:

WT_ROOT=$(resolve_worktree_root)
TROUBLESHOOT_WORKSPACE="${WT_ROOT}/troubleshoot-{short_slug}"
mkdir -p "$TROUBLESHOOT_WORKSPACE"

for svc in $(resolve_services); do
  svc_path=$(resolve_service_path "$svc")
  wt_path="${TROUBLESHOOT_WORKSPACE}/${svc}"
  [[ -d "$wt_path" ]] && continue
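  # --show-current prints nothing (and exits 0) on a detached HEAD, so default explicitly
  CURRENT_BRANCH=$(git -C "$svc_path" branch --show-current 2>/dev/null)
  CURRENT_BRANCH=${CURRENT_BRANCH:-HEAD}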
  git -C "$svc_path" worktree add "$wt_path" -b "troubleshoot/{short_slug}" 2>/dev/null \
    || git -C "$svc_path" worktree add "$wt_path" "$CURRENT_BRANCH"
  echo "Created worktree: ${svc}/ → ${wt_path}"
done

All subsequent agent prompts use $TROUBLESHOOT_WORKSPACE/{service}/ paths

After Phase 7 (Commit): Single mode → ExitWorktree(action: "remove"). Multi mode → remove worktrees:

for svc in $(resolve_services); do
  svc_path=$(resolve_service_path "$svc")
  wt_path="${TROUBLESHOOT_WORKSPACE}/${svc}"
  [[ -d "$wt_path" ]] && git -C "$svc_path" worktree remove "$wt_path" --force 2>/dev/null
done
rmdir "$TROUBLESHOOT_WORKSPACE" 2>/dev/null

Phase 1: Parse Issue

Goal: Understand what's wrong and what's expected.

Extract from user description:

What: What component/endpoint/feature is broken?
Expected: What should happen?
Actual: What actually happens?
Context: Error messages, reproduction steps

Example:

Input: "Endpoint /api/users returns 202 instead of 200"

Parsed:
  Component: /api/users endpoint
  Expected: HTTP 200 status
  Actual: HTTP 202 status
  Type: Response status mismatch

Confirm understanding with user if unclear.
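For the status-mismatch case in the example above, the Expected/Actual split can be illustrated mechanically. This is a rough sketch that only recognizes the literal phrasing "returns NNN instead of NNN"; `parse_status_mismatch` is a hypothetical name, and real descriptions will usually need the agent's own judgment rather than a pattern.

```shell
# Hypothetical helper: pull actual/expected HTTP status codes out of a description
# phrased as "... returns <actual> instead of <expected>". Prints nothing otherwise.
parse_status_mismatch() {
  printf '%s\n' "$1" \
    | sed -n 's/.*returns \([0-9]\{3\}\) instead of \([0-9]\{3\}\).*/actual=\1 expected=\2/p'
}

parse_status_mismatch 'Endpoint /api/users returns 202 instead of 200'
# prints actual=202 expected=200
```

An empty result is the signal to fall back to asking the user, as the line above suggests.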

Phase 2: Discover Code

Goal: Find the relevant code and understand the flow.

Use Explore agent to locate:

Entry point (controller, route, handler)
Business logic (services, repositories)
Related tests
Configuration files

Agent delegation: Pass purpose, not just a query — state the symptom and that the result feeds root-cause investigation, so the agent scopes its trace accordingly. If it returns no concrete anchors (file:line, symbols), re-dispatch with a refined query (≤3 cycles). See ~/.claude/shared/subagent-context-discipline.md.

Task(Explore, "Troubleshooting: /api/users returns 202 instead of 200. Find the endpoint definition and trace the code flow through controllers and services so we can locate where the status is set. Return file:line anchors.")

Explore agent returns:

File paths and line numbers
Call chain
Dependencies
Related tests

If code is complex or legacy: (same dispatch discipline — carry the symptom and the investigation goal, not just the endpoint name)

Task(archaeologist, "Troubleshooting why /api/users returns 202: deep-dive the endpoint and its call chain to find what sets the status. Return file:line anchors and any historical clues (TODOs, workarounds).")

Output to user:

## Code Discovery

Found endpoint: src/Controller/UserController.php:45
Route: GET /api/users → UserController::index()
Service: UserService::getUsers()
Tests: tests/Feature/UserApiTest.php

Call chain:
  Route → Controller → Service → Repository → Database

Phase 3: Investigate Root Cause

Goal: Understand WHY the issue occurs.

Investigation steps:

3.1 Read the code

Read controller/handler
Read service methods
Check conditional logic (if/else that might trigger different responses)

3.2 Check git history

# When did this start?
git log -p --all -S "202" -- path/to/controller

# Recent changes to this file
git log --oneline -10 -- path/to/controller

3.3 Check existing tests

# What do tests expect?
grep -r "api/users" tests/ -A 5 -B 5

3.4 Perform systematic investigation

Investigate the root cause:

Reproduce - Identify exact conditions
Isolate - When did it last work?
Investigate - Trace through code
Hypothesize - Form theory about cause
Document - Provide root cause analysis

Output to user:

## Root Cause Analysis

Issue: UserController returns HTTP 202 (Accepted) for async processing
Location: UserController.php:45
Introduced: commit abc123f (3 days ago)

Root Cause:
  Code was changed to use async job processing, which returns 202
  to indicate request was accepted but not yet processed.

Decision needed:
  ☐ Keep 202 (correct for async) and update tests
  ☐ Revert to 200 (synchronous processing)

Phase 4: Determine Fix Strategy

Goal: Decide what needs to be fixed.

Three scenarios:

Scenario A: Code is wrong

Bug in implementation
Logic error
Regression from recent change

Action: Fix the code

Scenario B: Test is wrong

Code change was intentional
Test expectations outdated
Requirements changed

Action: Update tests

Scenario C: Unclear

Ambiguous requirements
Missing documentation

Action: Ask user for clarification

Use AskUserQuestion if decision needed:

AskUserQuestion:
  Question: "Should /api/users be synchronous (200) or asynchronous (202)?"
  Options:
    1. Synchronous (200) - Users wait for result
    2. Asynchronous (202) - Background processing
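The three scenarios reduce to one question: was the behavior change intentional? A minimal sketch of that mapping follows; the function name and the `yes`/`no`/anything-else input convention are assumptions for illustration, not part of the skill.

```shell
# Hypothetical mapping from "was the change intentional?" to a Phase 5 action.
# $1: yes | no | anything else (treated as unknown)
choose_fix_strategy() {
  case "$1" in
    no)  echo "fix-code" ;;      # Scenario A: bug, logic error, or regression
    yes) echo "update-tests" ;;  # Scenario B: intentional change, outdated tests
    *)   echo "ask-user" ;;      # Scenario C: unclear, so clarify before touching anything
  esac
}

choose_fix_strategy unknown   # prints ask-user
```

Defaulting to `ask-user` keeps an ambiguous case from silently turning into a code or test change.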

Phase 5: Apply Fix

Goal: Implement the solution.

5.1 Code Fix (Scenario A)

If code needs fixing:

Apply the fix directly using Edit tool
Keep changes minimal and focused
Add comments if logic is complex

Example:

// Before
return new JsonResponse($data, 202); // Async processing

// After
return new JsonResponse($data, 200); // Synchronous response

5.2 Test Fix (Scenario B)

If tests need updating:

Update test expectations
Add new test cases if edge case was missed

Example:

// Before
$response->assertStatus(200);

// After
$response->assertStatus(202); // Updated for async processing

5.3 Write Missing Tests

If tests are missing:

Task(test-writer, "Write test for /api/users endpoint expecting 200 status code for successful response")

Phase 6: Verify Fix

Goal: Ensure the fix works and doesn't break anything.

Execution mode: Determined by $TROUBLESHOOT_EXEC_MODE.

6.1 Run relevant tests

# Run specific test file
./vendor/bin/phpunit tests/Feature/UserApiTest.php

# Or run all tests
./vendor/bin/phpunit

6.2 If tests fail

Delegate to test-fixer:

Task(test-fixer, "Fix failing test after changing /api/users to return 200 instead of 202")

6.3 Verification review

If $TROUBLESHOOT_EXEC_MODE = "subagent":

Run verification agents in parallel:

[PARALLEL EXECUTION - Single message with multiple Task calls]

Task 1: subagent_type: "security-auditor"
Prompt: Quick security audit of {endpoint/component} after {change description}.
Check for: injection risks, auth bypass, data exposure from the fix.

Task 2: subagent_type: "quality-guard"
Prompt: Verify the troubleshoot fix (Level 2 — Implementation Validation).
Fix diff: {git_diff}
Root cause: {root_cause_analysis}
Verify:
1. Does the fix actually address the root cause, or just the symptom?
2. Are there other code paths with the same bug pattern?
3. Do the tests cover the specific condition that triggered the bug?
When `--spec` was supplied (`$SPEC` resolved to a file): also read `$SPEC`, verify the fix against each acceptance criterion, and prefix every gate that maps to an AC with its AC ID(s) — e.g., `GATE 2: AC-3.1 — ...` — citing grader-typed evidence per `~/.claude/shared/eval-concepts.md`.
Produce a Quality Review Gates report.

If the skeptic (the quality-guard agent) raises BLOCKING gates, address them before committing.

Deadlock protocol: If the skeptic rejects the fix 3 times, STOP iterating. Escalate to the user with: (a) the fix diff, (b) the skeptic's objections across all rounds, (c) your attempts to address them. The user decides: override, provide guidance, or abort.

If $TROUBLESHOOT_EXEC_MODE = "team" (default):

TeamCreate(team_name="troubleshoot-verify")

TaskCreate: "Security audit of fix" (T1)
TaskCreate: "Challenge the fix" (T2) — depends on T1

[PARALLEL]
Task tool: name: "troubleshoot-security", subagent_type: "security-auditor", team_name: "troubleshoot-verify"
Task tool: name: "troubleshoot-skeptic", subagent_type: "quality-guard", team_name: "troubleshoot-verify"

Skeptic waits for security-auditor, then challenges. Agents resolve via SendMessage. Collect results and TeamDelete.

Deadlock protocol: Max 3 rejection cycles. After 3 rejections from the skeptic, stop iterating and escalate to the user with all objections and attempted fixes. The user decides: override, provide guidance, or abort.
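The deadlock protocol is a bounded retry loop: review, and on rejection either revise or, at the third rejection, stop and escalate. The sketch below shows only that control flow. `verify_with_escalation` and its two callback arguments are hypothetical stand-ins for the skeptic's review and the lead's revision step, which in practice are agent interactions rather than shell commands.

```shell
MAX_REJECTIONS=3   # from the deadlock protocol above

# Hypothetical driver.
# $1: command that returns 0 when the reviewer approves, non-zero when it rejects
# $2: command that revises the fix after a rejection
verify_with_escalation() {
  review_cmd=$1
  fix_cmd=$2
  rejections=0
  while ! "$review_cmd"; do
    rejections=$((rejections + 1))
    if [ "$rejections" -ge "$MAX_REJECTIONS" ]; then
      echo "escalate"    # hand diff, objections, and attempts to the user
      return 1
    fi
    "$fix_cmd"
  done
  echo "approved"
}
```

With a reviewer that always rejects, this performs three reviews and two revisions before escalating, so the user is consulted instead of a fourth round being started.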

Output to user:

## Verification

✓ Tests passing: 15/15
✓ Security audit: {No issues | Issues found}
✓ Skeptic validation: {APPROVED | CONDITIONAL}
✓ Manual verification: Endpoint returns 200

Per-AC Verification (--spec only — one row per AC):
  | AC ID  | Verdict | Grader | Evidence                      |
  |--------|---------|--------|-------------------------------|
  | AC-3.1 | PASS    | code   | UserApiTest::testStatus → 200 |

Fix verified successfully.

Per-AC section (only when --spec PATH was supplied and $SPEC resolved to a file): assemble one row per AC from the quality-guard gate output (AC-tagged) against the spec's AC list — same rules as /implement Phase 4.5; source is the quality-guard output and evidence follows the grader type. Re-verification reliability matters here: a fix that passes once is pass@1, not pass^k — flag a re-verified flaky fix as such (see ~/.claude/shared/eval-concepts.md). When no --spec is supplied (the default ad-hoc run), omit this section entirely — no error, no placeholder.
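The pass@1 versus pass^k distinction can be made concrete by re-running the same verification command k times and requiring every run to succeed. This is a sketch under the usual reading of pass^k (all k trials pass); the helper name `pass_hat_k` and its output strings are assumptions, and `~/.claude/shared/eval-concepts.md` remains the authority on how the skill defines and reports these terms.

```shell
# Hypothetical helper: run a command K times; succeed only if every run passes.
# Usage: pass_hat_k K CMD [ARGS...]
pass_hat_k() {
  k=$1
  shift
  i=1
  while [ "$i" -le "$k" ]; do
    # A single failing run is enough to deny pass^k.
    "$@" >/dev/null 2>&1 || { echo "flaky-or-failing (failed run $i of $k)"; return 1; }
    i=$((i + 1))
  done
  echo "pass^$k"
}

pass_hat_k 3 true    # prints pass^3
```

For the running example the command would be the targeted test run from 6.1, for instance `pass_hat_k 3 ./vendor/bin/phpunit tests/Feature/UserApiTest.php`.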

Phase 7: Commit

Goal: Save the fix with proper documentation.

Run inline — the hook enforces credential scan and branch protection automatically:

git add <files>
git commit -m "$(cat <<'EOF'
[TICKET-123] fix(scope): description
EOF
)"
# If pushing: record security-auditor confirmation first (after a clean scan)
bash "${CLAUDE_PLUGIN_ROOT}/hooks/record-audit.sh"
git push

Commit message format:

[TICKET-123] fix(api): change /api/users to return 200 instead of 202

- Changed UserController to use synchronous processing
- Updated tests to expect 200 status code
- Root cause: Async processing was unintended change in commit abc123f

Output to user:

## Debug Complete ✓

Issue: /api/users returns 202 instead of 200
Root Cause: Unintended async processing change
Fix Applied: Reverted to synchronous response
Tests: All passing
Commit: {fix_commit_sha}
Status: RESOLVED

Error Handling

If code location not found

❌ Could not locate /api/users endpoint

Suggestions:
  • Check if route exists: grep -r "api/users" routes/
  • Check if endpoint was removed
  • Try broader search: grep -r "users" src/Controller/

If root cause unclear after investigation

⚠️ Root cause not definitively identified

Next steps:
  1. Add debug logging around suspected code
  2. Check production logs for error patterns
  3. Reproduce issue locally with debugging enabled
  4. Consider pairing with developer familiar with this code

If fix breaks other tests

⚠️ Fix broke 3 other tests

Rolling back change...
Delegating to test-fixer for comprehensive test fix...

Task(test-fixer, "Fix all failing tests after changing /api/users status code")

Agent Orchestration Summary

| Phase | Agent(s) Used | Purpose |
|-------|---------------|---------|
| Discovery | Explore, archaeologist | Find and understand code |
| Investigation | Direct analysis | Root cause analysis |
| Fix | Direct (Edit tool) | Apply code/test changes |
| Verification | test-writer, test-fixer | Ensure fix works |
| Review | security-auditor, quality-guard | Validate fix quality and security |
| Commit | git-operator | Save and document fix |

Tips for Effective Debugging

Provide clear issue descriptions:

✅ "Login endpoint returns 500 when password is empty"
✅ "User creation fails with unique constraint error on email"
✅ "Dashboard loads slowly (>5s) with 1000+ items"

❌ "It's broken"
❌ "Fix the login"
❌ "Make it faster"

Include context when available:

Error messages
Stack traces
Reproduction steps
Expected vs actual behavior
Recent changes

Example:

/troubleshoot "Login endpoint returns 500 when password is empty
Error: Call to a member function hash() on null
Stack trace shows error in AuthService::validatePassword()
Expected: 400 Bad Request with validation error
Actual: 500 Internal Server Error"

Quality Checklist

Before completing troubleshoot session:

[ ] Root cause identified and documented
[ ] Fix applied (code or tests)
[ ] Tests passing
[ ] No regressions (other tests still pass)
[ ] Security check (if response/auth changed)
[ ] Committed with descriptive message
[ ] Issue resolved confirmation

Example Session

User:

/troubleshoot "Endpoint /api/users returns 202 instead of 200"

Skill Workflow:

Parse: Endpoint issue, status code mismatch
Discover:
- Task(Explore) → Found UserController.php:45
- Traced to UserService::getUsers()
Investigate:
- Direct analysis → Root cause: async processing change
- Git history shows commit abc123f changed to async
Decide: User confirms should be synchronous (200)
Fix: Change response status to 200
Verify:
- Run tests → All passing
- Task(security-auditor) → No issues
Commit:
- Task(git-operator) → Committed fix

Result: Issue resolved, tests passing, fix committed.
