Kepner-Tregoe Problem Analysis

Overview

Kepner-Tregoe (KT) is a structured root-cause method. This skill focuses on Problem Analysis (PA) — the IS/IS-NOT boundary contrast — which is the high-value KT process for debugging. When a defect is selective (some cases affected, others not), the boundary between IS and IS-NOT reveals the distinction that points at the root cause.

Decision Analysis (DA) and Potential Problem Analysis (PPA) are de-emphasized here. For pure decision-making among alternatives, use thinking-opportunity-cost. For risk anticipation before a change, use thinking-pre-mortem. Those skills are purpose-built for those tasks; KT's DA/PPA add overhead without unique mechanism.

Situation Analysis (SA) is retained as a lightweight triage step when facing multiple concerns, but it is not a required preamble — jump directly to PA when the problem is already clear.

Core Principle: The boundary between what IS affected and what IS NOT affected encodes the root cause. Find the distinction, find the cause.

When to Use

A defect is selective: affects some endpoints/regions/users/times but NOT others — there is an IS-vs-IS-NOT boundary to contrast
The cause is unclear and not obvious from a stack trace, error message, or a single recent change
Multiple possible causes exist and you need a systematic way to narrow them
A complex situation has multiple concerns that need triage before diving in

Decision flow:

Defect is selective (not 100%)? → No → IS/IS-NOT has no signal; use direct debugging or thinking-systems
                                → Yes → Cause obvious from stack trace/recent change? → Yes → Just fix it
                                                                                      → No → APPLY KT PROBLEM ANALYSIS

When NOT to Use

The failure is uniform (affects 100% of requests/everything) — there is no IS-vs-IS-NOT boundary to contrast; PA gives no signal. Use thinking-systems or direct debugging.
The cause is already obvious from a stack trace, error message, or a single recent change — just fix it; IS/IS-NOT is overhead here.
A quick hypothesis is cheaply testable — test it (thinking-occams-razor) before building a full specification matrix.
Pure decision-making with no deviation to diagnose — use thinking-opportunity-cost, not KT's Decision Analysis.
Risk assessment for a planned change — use thinking-pre-mortem, not KT's Potential Problem Analysis.

Trigger Card

When a defect is selective (some cases affected, others not) and the cause is unclear:

State the problem precisely — what is the deviation? In what object? Where/when does it occur?
Map IS vs IS-NOT — what IS affected vs what IS NOT, side by side. The boundary is the signal.
Find the distinction — what is different about the IS cases vs the IS-NOT cases? That distinction IS the cause.

Skip if the failure is uniform (100%) — there's no boundary to contrast; use direct debugging. If the cause is obvious from a stack trace or recent change, just fix it. For a single cheaply-testable hypothesis, test it first.

Procedure

Step 1 (optional): Situation Analysis — Triage Multiple Concerns

Only when facing several problems at once. List all concerns, separate them if compound, and prioritize by Timing/Impact/Trend:

| Concern | Timing | Impact | Trend | Priority | |---------|--------|--------|-------|----------| | API latency spike | Urgent | High | Worsening | P0 | | Checkout errors | Soon | High | Stable | P1 |

For each concern, decide: Problem Analysis (PA), or delegate to another skill.

Step 2: State the Problem Precisely

Describe the deviation from expected behavior with specificity:

"API response time increased from 200ms to 800ms for /checkout endpoint,
US-East only, starting Monday 9 AM, affecting ~30% of requests."

Step 3: Build the IS/IS-NOT Matrix

Specify the problem across four dimensions. The power is in the distinction column — what's unique about the IS side?

| Dimension | IS (affected) | IS NOT (not affected) | Distinction | |-----------|---------------|----------------------|-------------| | WHAT — object | /checkout endpoint | /cart, /product, /user | Payment processing | | WHAT — defect | 4x latency increase | Errors, timeouts, data corruption | Performance only | | WHERE — location | Production US-East | EU, US-West, staging | Single region | | WHERE — on object | Database query phase | Auth, validation, serialization | DB layer | | WHEN — first seen | Monday 9:00 AM | Before Monday, after 6 PM | Business hours | | WHEN — pattern | During checkout submit | During browsing, cart add | Write operations | | EXTENT — how many | ~30% of requests | 100% of requests | Intermittent | | EXTENT — trend | Stable since Tuesday | Getting worse | Plateaued |

Step 4: Extract Distinctions

For each row, ask: "What's unique or distinctive about the IS side compared to the IS-NOT side?"

Distinctions:
- Only /checkout (payment processing) — not other endpoints
- Only US-East (specific DB replica) — not other regions
- Only during business hours (load-related?) — not off-peak
- Only ~30% of requests (specific query pattern?) — not all
- Started Monday 9 AM — what changed?

Step 5: Identify Changes

What changed in, on, around, or about the distinctions near the first observation time?

Changes near Monday 9 AM:
- Payment provider SDK updated (Sunday night deploy)
- Database index rebuild scheduled (Sunday maintenance)
- New fraud detection rules enabled (Monday 8:45 AM)

Step 6: Generate and Test Possible Causes

Each candidate cause must explain BOTH the IS and the IS-NOT:

| Possible Cause | Explains IS? | Explains IS-NOT? | Verdict | |----------------|-------------|------------------|---------| | Fraud rules adding DB queries | ✓ Only checkout, only write ops | ✓ Not other endpoints | Pursue | | Payment SDK change | ✓ Only checkout | ✗ Would affect all regions | Ruled out | | Index rebuild | ✓ DB layer | ✗ Would affect all queries | Ruled out |

Step 7: Verify the True Cause

Design a test to confirm or rule out the leading candidate:

Verification for "Fraud detection rules":
1. Check: Rules enabled 8:45 AM (matches timeline)
2. Check: Rules only on checkout (matches scope)
3. Test: Disable rules in canary, measure latency
4. Examine: Query logs for fraud check queries

Output Contract

A completed KT Problem Analysis produces:

Problem Statement — specific, measurable deviation
IS/IS-NOT Matrix — all four dimensions with distinctions extracted
Changes List — what changed near the distinctions around the first observation
Cause Test Table — each candidate tested against IS and IS-NOT
Confirmed Root Cause — with verification evidence
If used, SA Triage — prioritized concern list with assigned processes

Anti-Patterns

| Anti-Pattern | Symptom | Correction | |---|---|---| | KT on uniform failure | Running PA when 100% of requests fail | No boundary to contrast; use direct debugging or thinking-systems | | Over-specifying the matrix | Filling every IS/IS-NOT cell for a simple bug | Stop when the distinction is clear; don't ritualize | | DA/PPA sprawl | Running full Decision Analysis or Potential Problem Analysis for routine tasks | Redirect to thinking-opportunity-cost (decisions) or thinking-pre-mortem (risks) | | Skipping cause testing | Pursuing the first plausible cause without testing against IS-NOT | Every cause must explain BOTH IS and IS-NOT | | SA as mandatory preamble | Running full Situation Analysis before every PA | Jump directly to PA when the problem is already clear | | Ignoring the distinction | Building the matrix but not extracting what's unique about IS | The distinction IS the signal; without it, the matrix is just a table |

Kepner-Tregoe Problem Analysis

Overview

Situation Analysis (SA) is retained as a lightweight triage step when facing multiple concerns, but it is not a required preamble — jump directly to PA when the problem is already clear.

Core Principle: The boundary between what IS affected and what IS NOT affected encodes the root cause. Find the distinction, find the cause.

When to Use

A defect is selective: affects some endpoints/regions/users/times but NOT others — there is an IS-vs-IS-NOT boundary to contrast
The cause is unclear and not obvious from a stack trace, error message, or a single recent change
Multiple possible causes exist and you need a systematic way to narrow them
A complex situation has multiple concerns that need triage before diving in

Decision flow:

Defect is selective (not 100%)? → No → IS/IS-NOT has no signal; use direct debugging or thinking-systems
                                → Yes → Cause obvious from stack trace/recent change? → Yes → Just fix it
                                                                                      → No → APPLY KT PROBLEM ANALYSIS

When NOT to Use

The failure is uniform (affects 100% of requests/everything) — there is no IS-vs-IS-NOT boundary to contrast; PA gives no signal. Use thinking-systems or direct debugging.
The cause is already obvious from a stack trace, error message, or a single recent change — just fix it; IS/IS-NOT is overhead here.
A quick hypothesis is cheaply testable — test it (thinking-occams-razor) before building a full specification matrix.
Pure decision-making with no deviation to diagnose — use thinking-opportunity-cost, not KT's Decision Analysis.
Risk assessment for a planned change — use thinking-pre-mortem, not KT's Potential Problem Analysis.

Trigger Card

When a defect is selective (some cases affected, others not) and the cause is unclear:

State the problem precisely — what is the deviation? In what object? Where/when does it occur?
Map IS vs IS-NOT — what IS affected vs what IS NOT, side by side. The boundary is the signal.
Find the distinction — what is different about the IS cases vs the IS-NOT cases? That distinction IS the cause.

Procedure

Step 1 (optional): Situation Analysis — Triage Multiple Concerns

Only when facing several problems at once. List all concerns, separate them if compound, and prioritize by Timing/Impact/Trend:

For each concern, decide: Problem Analysis (PA), or delegate to another skill.

Step 2: State the Problem Precisely

Describe the deviation from expected behavior with specificity:

"API response time increased from 200ms to 800ms for /checkout endpoint,
US-East only, starting Monday 9 AM, affecting ~30% of requests."

Step 3: Build the IS/IS-NOT Matrix

Specify the problem across four dimensions. The power is in the distinction column — what's unique about the IS side?

Step 4: Extract Distinctions

For each row, ask: "What's unique or distinctive about the IS side compared to the IS-NOT side?"

Distinctions:
- Only /checkout (payment processing) — not other endpoints
- Only US-East (specific DB replica) — not other regions
- Only during business hours (load-related?) — not off-peak
- Only ~30% of requests (specific query pattern?) — not all
- Started Monday 9 AM — what changed?

Step 5: Identify Changes

What changed in, on, around, or about the distinctions near the first observation time?

Changes near Monday 9 AM:
- Payment provider SDK updated (Sunday night deploy)
- Database index rebuild scheduled (Sunday maintenance)
- New fraud detection rules enabled (Monday 8:45 AM)

Step 6: Generate and Test Possible Causes

Each candidate cause must explain BOTH the IS and the IS-NOT:

Step 7: Verify the True Cause

Design a test to confirm or rule out the leading candidate:

Verification for "Fraud detection rules":
1. Check: Rules enabled 8:45 AM (matches timeline)
2. Check: Rules only on checkout (matches scope)
3. Test: Disable rules in canary, measure latency
4. Examine: Query logs for fraud check queries

Output Contract

A completed KT Problem Analysis produces:

Problem Statement — specific, measurable deviation
IS/IS-NOT Matrix — all four dimensions with distinctions extracted
Changes List — what changed near the distinctions around the first observation
Cause Test Table — each candidate tested against IS and IS-NOT
Confirmed Root Cause — with verification evidence
If used, SA Triage — prioritized concern list with assigned processes

Adoption

tjboudreaux/thinking-kepner-tregoe

$ install --global

Security Scan Results

SKILL.md

Kepner-Tregoe Problem Analysis

Overview

When to Use

When NOT to Use

Trigger Card

Procedure

Step 1 (optional): Situation Analysis — Triage Multiple Concerns

Step 2: State the Problem Precisely

Step 3: Build the IS/IS-NOT Matrix

Step 4: Extract Distinctions

Step 5: Identify Changes

Step 6: Generate and Test Possible Causes

Step 7: Verify the True Cause

Output Contract

Anti-Patterns

Related Skills

tjboudreaux/thinking-via-negativa

tjboudreaux/thinking-triz

tjboudreaux/thinking-thought-experiment

tjboudreaux/thinking-theory-of-constraints

tjboudreaux/thinking-kepner-tregoe

$ install --global

Security Scan Results

SKILL.md

Kepner-Tregoe Problem Analysis

Overview

When to Use

When NOT to Use

Trigger Card

Procedure

Step 1 (optional): Situation Analysis — Triage Multiple Concerns

Step 2: State the Problem Precisely

Step 3: Build the IS/IS-NOT Matrix

Step 4: Extract Distinctions

Step 5: Identify Changes

Step 6: Generate and Test Possible Causes

Step 7: Verify the True Cause

Output Contract

Anti-Patterns

Related Skills

tjboudreaux/thinking-via-negativa

tjboudreaux/thinking-triz

tjboudreaux/thinking-thought-experiment

tjboudreaux/thinking-theory-of-constraints