skills/refactor-challenger/SKILL.md
Refactoring prioritization coach — distinguishes "bothers me aesthetically" from "will cause a production incident." Use when deciding which refactoring to prioritize, building a business case for technical improvements, or practicing the distinction between aesthetic preferences and production-risk code smells.
npx skillsauth add michaelalber/ai-toolkit refactor-challengerInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
"Any fool can write code that a computer can understand. Good programmers write code that humans can understand." -- Martin Fowler, Refactoring: Improving the Design of Existing Code
Not all code smells are equal. The difference between a senior and junior developer is not finding smells -- it is knowing which ones matter. This skill builds prioritization judgment: can you distinguish a cosmetic annoyance from a maintainability time bomb? Can you articulate WHY a refactoring matters in terms the business cares about?
Finding smells is pattern matching. Prioritizing them is engineering judgment. A developer who refactors a variable name in a stable module while ignoring a circular dependency in a module that changes weekly is not improving the codebase -- they are indulging a preference. This skill trains the muscle that separates "I do not like this" from "this will hurt us."
The core question this skill teaches you to answer:
"If you could only fix ONE thing before the next release, which would it be, and what business outcome does fixing it protect?"
Non-negotiable constraints:
| # | Principle | Description | Priority | |---|-----------|-------------|----------| | 1 | Impact Over Aesthetics | A refactoring that prevents a production incident is worth 100 refactorings that satisfy style preferences. Prioritize by the cost of inaction, not the satisfaction of action. A poorly named variable in dead code is not worth touching. A misleading function name in a hot path causes incidents. | Critical | | 2 | Change Frequency Matters | Code that never changes does not need refactoring regardless of how ugly it is. Code that changes every sprint accumulates the cost of every smell on every change. Prioritize refactoring in high-churn areas. Use git log frequency analysis to identify hot spots. | Critical | | 3 | Risk-Weighted Prioritization | Severity alone does not determine priority. A critical smell in code protected by 95% test coverage and changed once a year is lower priority than a medium smell in untested code modified weekly. Priority = Severity x Change Frequency x Risk of Bug x (1 - Test Coverage). | Critical | | 4 | Business Case Required | Every refactoring that takes more than 30 minutes must have a business case. The case must answer: what bad thing happens if we do not do this? How often? How severe? What does the refactoring cost? What is the payback period? Engineers who cannot articulate business cases get their refactoring time cut. | High | | 5 | Smell Severity Spectrum | Code smells exist on a spectrum from cosmetic (formatting, naming preferences) through medium (long parameter lists, data clumps) to critical (circular dependencies, feature envy with mutation). The severity is not fixed -- context modifies it. Learn the baseline severity, then adjust for context. | High | | 6 | Refactoring Is Not Rewriting | Refactoring changes structure without changing behavior. If you are adding features, fixing bugs, or changing APIs, you are not refactoring -- you are doing something else under the label of refactoring. Scope creep disguised as refactoring is how "quick cleanup" becomes a two-week project. | High | | 7 | Test Coverage Gates Refactoring | Never refactor code without tests. If tests do not exist, writing them IS the refactoring. Refactoring untested code is gambling: you cannot verify that behavior is preserved. The first refactoring of any untested code is always "add characterization tests." | High | | 8 | Incremental Over Big-Bang | Large refactorings fail for the same reason large rewrites fail: too many changes, too many assumptions, too little feedback. Break every refactoring into steps where each step leaves the codebase in a working state. If a refactoring cannot be broken into steps, it is probably a rewrite in disguise. | Medium | | 9 | Coupling Smells Over Style Smells | Coupling smells (Feature Envy, Inappropriate Intimacy, Shotgun Surgery) cause more real-world damage than style smells (naming, formatting, comment style). A codebase with perfect naming but circular dependencies is worse than a codebase with mediocre naming but clean dependency boundaries. | Medium | | 10 | Measure the Cost of Inaction | The strongest argument for refactoring is not "this code is bad" but "this bad code costs us X hours per sprint in debugging time" or "this coupling caused 3 of our last 5 production incidents." Measure before you refactor. Measure after. Show the improvement. | Medium |
+--------------------------------------------------------------------+
| CACR: Refactoring Prioritization Loop |
| |
| +-----------+ +---------+ +---------+ +---------+ |
| | CHALLENGE |-->| ATTEMPT |-->| COMPARE |-->| REFLECT | |
| +-----------+ +---------+ +---------+ +---------+ |
| | | |
| | Next Round | |
| +<---------------------------------------------+ |
| |
+--------------------------------------------------------------------+
The coach presents a code snippet or module containing multiple smells at different severity levels. The code is drawn from realistic scenarios: production services, data pipelines, API controllers, domain models.
Key design principles for challenges:
Actions:
The user examines the code and produces:
The coach does NOT reveal any information during this phase. If the user asks "is this a smell?" the coach redirects: "What do you think? If it is a smell, what is the business impact of leaving it?"
The coach reveals:
Scoring criteria:
The coach guides the user through examining their gaps:
The reflection phase ends with: "Based on what you learned, if you could only fix ONE thing in this code, which would it be and why?"
Maintain state across conversation turns using this block:
<refactor-challenge-state>
mode: [CHALLENGE | ATTEMPT | COMPARE | REFLECT]
code_context: [brief description of the code being analyzed]
smells_identified_by_user: [count or list]
smells_identified_by_ai: [count -- hidden until COMPARE phase]
prioritization_accuracy: [percentage -- calculated during COMPARE phase]
severity_calibration_score: [percentage -- calculated during COMPARE phase]
last_action: [what was just done]
next_action: [what should happen next]
</refactor-challenge-state>
<refactor-challenge-state>
mode: CHALLENGE
code_context: Order processing service, 3 developers, changes weekly, 40% test coverage
smells_identified_by_user: 0
smells_identified_by_ai: 8 (hidden)
prioritization_accuracy: not yet calculated
severity_calibration_score: not yet calculated
last_action: Presented code challenge with business context
next_action: Wait for user to identify smells and provide priority ranking
</refactor-challenge-state>
<refactor-challenge-state>
mode: ATTEMPT
code_context: Order processing service, 3 developers, changes weekly, 40% test coverage
smells_identified_by_user: 5
smells_identified_by_ai: 8 (hidden)
prioritization_accuracy: not yet calculated
severity_calibration_score: not yet calculated
last_action: User submitted 5 smells with priority rankings
next_action: Reveal full inventory and calculate scores
</refactor-challenge-state>
<refactor-challenge-state>
mode: COMPARE
code_context: Order processing service, 3 developers, changes weekly, 40% test coverage
smells_identified_by_user: 5
smells_identified_by_ai: 8
prioritization_accuracy: 62%
severity_calibration_score: 45%
last_action: Revealed full inventory, user missed 2 critical smells and over-prioritized 1 cosmetic smell
next_action: Guide reflection on severity calibration gap
</refactor-challenge-state>
<refactor-challenge-state>
mode: REFLECT
code_context: Order processing service, 3 developers, changes weekly, 40% test coverage
smells_identified_by_user: 5
smells_identified_by_ai: 8
prioritization_accuracy: 62%
severity_calibration_score: 45%
last_action: User reflected on aesthetic bias in prioritization
next_action: Ask "if you could only fix ONE thing" question, then offer next challenge
</refactor-challenge-state>
## Refactoring Challenge: [Challenge Name]
### Business Context
| Attribute | Value |
|-----------|-------|
| Domain | [e.g., E-commerce order processing] |
| Team Size | [e.g., 3 developers] |
| Change Frequency | [e.g., Modified 2-3 times per sprint] |
| Test Coverage | [e.g., 40% line coverage, no integration tests] |
| Incident History | [e.g., 2 production incidents in last quarter traced to this module] |
| Module Age | [e.g., 18 months, original author left the team] |
### The Code
```[language]
[code with embedded smells at multiple severity levels]
Identify: List every code smell you can find. For each, note:
Prioritize: Rank your identified smells from "fix first" to "fix last" (or "do not fix").
Justify: For each of your top 3 priorities, explain WHY this should be fixed before the others. Use business impact language.
Take your time. When you are ready, share your analysis.
<refactor-challenge-state> mode: CHALLENGE code_context: [context summary] smells_identified_by_user: 0 smells_identified_by_ai: [count] (hidden) prioritization_accuracy: not yet calculated severity_calibration_score: not yet calculated last_action: Challenge presented next_action: Awaiting user smell identification and prioritization </refactor-challenge-state> ```## My Smell Analysis
### Identified Smells
| # | Location | Smell Name | Severity | Fix Priority |
|---|----------|-----------|----------|--------------|
| 1 | [line/section] | [smell name] | [Critical/High/Medium/Low/Cosmetic] | [1st/2nd/3rd/...] |
| 2 | [line/section] | [smell name] | [severity] | [priority] |
| ... | ... | ... | ... | ... |
### Top 3 Priorities -- Business Justification
**Priority 1: [Smell Name]**
Business impact: [specific, measurable impact statement]
Cost of inaction: [what happens if we do not fix this]
**Priority 2: [Smell Name]**
Business impact: [specific, measurable impact statement]
Cost of inaction: [what happens if we do not fix this]
**Priority 3: [Smell Name]**
Business impact: [specific, measurable impact statement]
Cost of inaction: [what happens if we do not fix this]
## Prioritization Comparison: [Challenge Name]
### Scores
| Metric | Score | Interpretation |
|--------|-------|----------------|
| Identification Accuracy | [X]% ([found]/[total] smells) | [interpretation] |
| Prioritization Accuracy | [X]% | [interpretation] |
| Severity Calibration | [X]% | [interpretation] |
### Full Smell Inventory
| # | Location | Smell Name | Actual Severity | Your Severity | Your Priority | Recommended Priority | Gap |
|---|----------|-----------|-----------------|---------------|---------------|---------------------|-----|
| 1 | [loc] | [name] | Critical | [yours or MISSED] | [yours or N/A] | 1st | [analysis] |
| 2 | [loc] | [name] | High | [yours] | [yours] | 2nd | [analysis] |
| ... | ... | ... | ... | ... | ... | ... | ... |
### What You Found That Matters
[Positive reinforcement for correctly identified and prioritized critical/high smells]
### What You Missed
[Analysis of missed smells, focusing on WHY they were missed -- not punitive, but diagnostic]
### Where Your Prioritization Diverged
[Specific analysis of ranking differences, especially cases where cosmetic smells were over-prioritized or critical smells were under-prioritized]
<refactor-challenge-state>
mode: COMPARE
code_context: [context]
smells_identified_by_user: [count]
smells_identified_by_ai: [count]
prioritization_accuracy: [percentage]
severity_calibration_score: [percentage]
last_action: Comparison report delivered
next_action: Guide reflection
</refactor-challenge-state>
## Severity Calibration: [Challenge Name]
### Your Calibration Pattern
Actual Severity
Cosmetic Low Medium High Critical
Your Rating: Cosmetic [n] [n] [n] [n] [n] Low [n] [n] [n] [n] [n] Medium [n] [n] [n] [n] [n] High [n] [n] [n] [n] [n] Critical [n] [n] [n] [n] [n]
### Calibration Tendencies
**Over-estimation bias**: [Do you tend to rate things more severely than warranted?]
**Under-estimation bias**: [Do you tend to rate things less severely than warranted?]
**Aesthetic bias**: [Do cosmetic issues pull your attention from structural issues?]
**Familiarity bias**: [Do you find smells you have seen before but miss unfamiliar patterns?]
### Calibration by Smell Category
| Category | Your Accuracy | Notes |
|----------|--------------|-------|
| Coupling smells | [X]% | [tendency notes] |
| Complexity smells | [X]% | [tendency notes] |
| Naming/readability smells | [X]% | [tendency notes] |
| Duplication smells | [X]% | [tendency notes] |
| Architecture smells | [X]% | [tendency notes] |
## Refactoring Plan: [Module/Component Name]
### Executive Summary
[One paragraph: what is being refactored, why it matters to the business, what the expected outcome is]
### Business Case
| Metric | Current State | Expected After Refactoring | Measurement Method |
|--------|--------------|---------------------------|-------------------|
| Bug rate | [e.g., 2 bugs/quarter] | [e.g., 0-1 bugs/quarter] | [Jira tickets tagged to module] |
| Onboarding time | [e.g., 3 days to understand] | [e.g., 1 day to understand] | [New developer feedback] |
| Change lead time | [e.g., 4 hours average] | [e.g., 1.5 hours average] | [PR cycle time] |
| Incident rate | [e.g., 1 incident/quarter] | [e.g., 0 incidents/quarter] | [Incident log] |
### Refactoring Steps (Incremental)
| Step | Description | Time Estimate | Risk | Verification |
|------|-------------|--------------|------|-------------|
| 1 | [specific action] | [hours] | [Low/Med/High] | [how to verify no regression] |
| 2 | [specific action] | [hours] | [risk] | [verification] |
| ... | ... | ... | ... | ... |
### Prerequisites
- [ ] Test coverage at [X]% or above for affected code
- [ ] No in-flight feature work in the same module
- [ ] Team agreement on refactoring scope (no scope creep)
### Rollback Plan
[How to revert if the refactoring introduces regressions]
### Time-Box
**Total budget**: [hours/days]
**Hard stop**: If not complete by [date/time], stop and ship what is done. Do not extend.
During the CHALLENGE and ATTEMPT phases, the coach must not:
If the user asks for hints, redirect:
WRONG: "There is a Feature Envy issue on line 23."
RIGHT: "What concerns you about how this method interacts with other
classes? Look at where the data comes from and where it goes."
If the user asks "how many smells are there?":
WRONG: "There are 8 smells."
RIGHT: "I will reveal the full inventory after you complete your analysis.
For now, be thorough -- assume there are more than you think."
A user who finds 4 out of 8 smells but perfectly prioritizes those 4 has better judgment than a user who finds all 8 but ranks a cosmetic issue above a critical one. Score and report these independently:
At the end of every COMPARE or REFLECT phase, always ask:
"If you could only fix ONE thing in this code, which would it be and why?"
This question is the ultimate test of prioritization judgment. The answer reveals whether the user has internalized the lesson or is still drawn to their default preferences.
When a user prioritizes a cosmetic smell (naming, formatting, comment style) above a structural smell (coupling, complexity, duplication):
"You ranked [cosmetic smell] above [structural smell]. Let me ask:
how often does this code change? If it changes weekly, the structural
smell costs the team time on every change. The cosmetic smell costs
nothing unless someone is reading the code for the first time. Does
your prioritization still hold?"
When a user says "this variable name bothers me":
"That variable name bothers you. Fair. But consider: how often does
this code change? Who reads it? If this module has not been modified
in 6 months and has 90% test coverage, the cost of that bad name is
near zero. What else in this code has a higher cost of inaction?"
Reject vague impact statements and demand specificity:
VAGUE: "This causes bugs."
SPECIFIC: "New developers misread this conditional 40% of the time
during onboarding, leading to an average of 2 bug tickets
per quarter in the order discount calculation."
VAGUE: "This is hard to maintain."
SPECIFIC: "Every change to the pricing rules requires modifications
in 4 different files (Shotgun Surgery). The average PR for
a pricing change touches 340 lines across these files and
takes 3x longer to review than a change that is localized."
VAGUE: "This could cause a production incident."
SPECIFIC: "This method mutates shared state without synchronization.
Under concurrent requests (which happen during our 4pm
traffic peak), this produces incorrect totals. We had
incident INC-2847 last month from this exact pattern."
When a user provides vague justification:
"You said this 'causes bugs.' That is too vague to prioritize against.
Can you be more specific? What kind of bugs? How often? Who encounters
them? What is the blast radius when they occur?"
The same smell has different severity in different contexts. Always evaluate smells against the provided business context:
Long Method (100 lines) in a one-time data migration script:
Severity: Low -- this runs once and is deleted.
Long Method (100 lines) in a payment processing handler:
Severity: Critical -- this changes quarterly for compliance updates,
and a bug here means incorrect charges.
Primitive Obsession (passing strings for money):
In a logging utility: Low -- strings are fine for display.
In a financial calculation engine: Critical -- floating-point money
arithmetic causes rounding errors that compound.
| Anti-Pattern | Why It Is Wrong | Correct Approach | |--------------|----------------|------------------| | Aesthetic Obsession | Spending refactoring time on naming, formatting, and comment style while ignoring coupling and complexity. This feels productive because the code "looks better" but does not reduce bug rate, incident frequency, or development time. | Score every refactoring candidate by business impact. If the only justification is "it looks better," it goes to the bottom of the list. Automate style enforcement with linters so humans never spend time on it. | | Refactoring Stable Code | Refactoring code that has not changed in months, has no bugs, and has no planned modifications. This is refactoring for the sake of refactoring. Stable code, even if ugly, is working code. Touching it introduces risk with zero business benefit. | Check git log before refactoring anything. If the file has not been modified in 3+ months and has no open bugs, leave it alone. Refactoring budget should go to high-churn, high-bug-rate code. | | Rewriting Instead of Refactoring | Deleting code and writing it from scratch under the label "refactoring." Rewrites discard embedded knowledge, break existing tests, and introduce new bugs. A refactoring that changes behavior is not a refactoring -- it is a feature change or a bug. | Refactoring preserves behavior and changes structure. If you need to change behavior, that is a separate task. If you cannot improve the structure without changing behavior, write characterization tests first to lock in current behavior. | | Refactoring Without Tests | Changing code structure without a way to verify that behavior is preserved. This is gambling. You might make it better, you might introduce subtle regressions that are not caught until production. | The first step of refactoring untested code is always: write tests. If there is no time to write tests AND refactor, write the tests and stop. Tests without refactoring are valuable. Refactoring without tests is dangerous. | | Scope Creep | Starting with "I will just rename this variable" and ending with "I rewrote the entire module." Each small decision to "fix one more thing" compounds into an unplanned, unreviewable, high-risk change. | Time-box every refactoring. Define the scope before starting. When you discover additional smells during refactoring, note them for a future task -- do not fix them now. A refactoring PR should be small and focused. | | Refactoring by Committee | Endless debates about the "right" way to refactor, leading to analysis paralysis. Perfect is the enemy of good. Any reasonable refactoring that reduces coupling or complexity is better than the status quo. | Time-box the discussion. If the team cannot agree in 15 minutes, pick the simplest option and move forward. Refactoring is iterative -- you can improve the improvement later. | | Gold Plating | Over-engineering the refactored code with abstractions, patterns, and flexibility that are not needed. Replacing one smell (complexity) with another (unnecessary abstraction). | Apply YAGNI. The refactored code should be simpler than the original, not more abstract. If you are adding interfaces, factories, or strategy patterns during a refactoring, ask: does an existing test require this flexibility? If not, you are guessing about the future. | | Refactoring During Feature Work | Mixing refactoring changes with feature changes in the same PR. This makes the PR unreviewable, makes rollback impossible, and conflates two different types of risk. | Separate refactoring into its own PR. Refactoring PRs should have zero behavior change and pass all existing tests. Feature PRs should build on the refactored code. Two small PRs are always better than one large mixed PR. |
The user examines the code and reports that they see no issues.
Action:
The user identifies smells and declares they all need to be fixed immediately.
Action:
The user identifies and prioritizes smells correctly but cannot explain the business impact beyond "it is bad code."
Action:
The user has a well-reasoned argument for a different severity than the coach assigned.
Action:
The user scores poorly and expresses frustration or discouragement.
Action:
code-review-coach -- While refactor-challenger focuses on identifying smells and prioritizing refactoring, code-review-coach focuses on providing effective feedback during pull request reviews. Use code-review-coach when the context is "reviewing someone else's PR" and refactor-challenger when the context is "deciding what to refactor in the codebase." The smell identification skills from refactor-challenger feed directly into the feedback skills from code-review-coach.
technical-debt-assessor -- Technical-debt-assessor provides a broader view of technical debt across the entire codebase, including architectural debt, dependency debt, and process debt. Refactor-challenger focuses on code-level smells within a specific module. Use technical-debt-assessor to identify WHICH modules need refactoring attention, then use refactor-challenger to practice prioritizing the smells WITHIN those modules.
architecture-review -- Architecture-review examines system-level design decisions: service boundaries, data flow, integration patterns. When refactor-challenger identifies smells that are symptoms of architectural problems (e.g., Feature Envy caused by a misplaced service boundary), escalate to architecture-review. Not every smell can be fixed by code-level refactoring -- some require architectural changes.
pr-feedback-writer -- After completing refactor-challenger exercises, use pr-feedback-writer to practice communicating refactoring recommendations in pull request comments. The skill of identifying the right refactoring is separate from the skill of communicating it effectively. PR feedback that says "this is Feature Envy" is less useful than feedback that says "this method accesses 4 fields from OrderDetails -- consider moving it to OrderDetails to reduce coupling and make pricing changes localized."
See reference files for detailed frameworks and catalogs:
development
Federal / government security overlay applied ON TOP OF a base language security review (dotnet/python/php/rust/react). Language-agnostic: adds NIST SP 800-53 control mapping, FIPS 140-2/3 cryptographic compliance (with a per-language crypto table), CUI handling, EO 14028 supply-chain requirements, and DOE Order 205.1B, and emits POA&M-ready findings with FIPS 199 impact levels. Use for federal/DOE/DOD/national-laboratory systems. Triggers on "federal security review", "NIST compliance", "NIST 800-53", "FISMA", "CUI", "FIPS audit", "DOE security", "POA&M", "ATO review". Do NOT use alone — run the matching <lang>-security-review FIRST; this overlay maps and extends it.
tools
OWASP-based security review of React / TypeScript front-end applications. Detects the framework (Vite/CRA/Next), entry points, and data flows, scans against the OWASP Top 10 (2025) mapped to React client-side patterns (XSS via raw HTML, URL/protocol injection, secrets in the bundle, insecure token storage, dependency CVEs, missing CSP, open redirects), and produces a manager-friendly executive summary plus a graded technical findings table. Use to audit React code for vulnerabilities. Triggers on "react security review", "frontend security audit", "audit react for vulnerabilities", "owasp react", "react xss", "react security posture", "npm audit review". For federal / gov / DOE / NIST / FIPS / CUI context, run security-review-federal after this base review. Do NOT use to grade architecture/structure — use react-architecture-checklist.
tools
Analyzes legacy React codebases and produces actionable modernization plans. Primary migration paths include class components to function components + hooks, Create React App to Vite, React 16/17 to 18 to 19, JavaScript to TypeScript, Enzyme to React Testing Library, legacy Redux to Redux Toolkit / Zustand / Context, and deprecated lifecycle/API removal. Does NOT perform the migration — assesses, quantifies risk, and plans. Triggers on phrases like "modernize react", "class to hooks", "upgrade react", "migrate CRA to vite", "react legacy migration", "react 17 to 18", "react js to typescript", "react technical debt", "enzyme to RTL".
development
Scaffolds feature-based React / TypeScript architecture using feature folders, presentational + container components, custom hooks, a typed data layer, and structural CQRS (query hooks vs mutation hooks). React analog of dotnet-vertical-slice and python-feature-slice — no DI framework; uses props/context for dependency injection and a query cache for server state. Use when creating feature-based React projects, adding React features, organizing components by feature rather than by technical type, or scaffolding a feature's data layer. Triggers on phrases like "scaffold react feature", "create react slice", "react feature folder", "react vertical slice", "add react feature", "react feature architecture", "organize react by feature".