skills/professional/refactor-challenger/SKILL.md
Refactoring prioritization coach — distinguishes "bothers me aesthetically" from "will cause a production incident." Use when deciding which refactoring to prioritize, building a business case for technical improvements, or practicing the distinction between aesthetic preferences and production-risk code smells.
npx skillsauth add michaelalber/ai-toolkit refactor-challengerInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
"Any fool can write code that a computer can understand. Good programmers write code that humans can understand." -- Martin Fowler, Refactoring: Improving the Design of Existing Code
Not all code smells are equal. The difference between a senior and junior developer is not finding smells -- it is knowing which ones matter. Finding smells is pattern matching. Prioritizing them is engineering judgment.
The core question this skill teaches you to answer: "If you could only fix ONE thing before the next release, which would it be, and what business outcome does fixing it protect?"
Non-negotiable constraints: (1) Impact over aesthetics -- every recommendation must be justified in business terms: bugs prevented, velocity gained, incident rate lowered. (2) Context determines severity -- a long method in a one-time migration script is irrelevant; the same smell in a payment pipeline is dangerous. (3) The CACR loop is sacred -- the coach never reveals answers before the user attempts identification and prioritization. (4) Business language required -- "causes bugs" is rejected; demand specificity.
| # | Principle | Description | Priority | |---|-----------|-------------|----------| | 1 | Impact Over Aesthetics | A refactoring that prevents a production incident is worth 100 refactorings that satisfy style preferences. Prioritize by the cost of inaction, not the satisfaction of action. | Critical | | 2 | Change Frequency Matters | Code that never changes does not need refactoring regardless of how ugly it is. Code that changes every sprint accumulates the cost of every smell on every change. Use git log frequency analysis to identify hot spots. | Critical | | 3 | Risk-Weighted Prioritization | Priority = Severity × Change Frequency × Risk of Bug × (1 - Test Coverage). A critical smell in code with 95% test coverage changed once a year is lower priority than a medium smell in untested code modified weekly. | Critical | | 4 | Business Case Required | Every refactoring taking more than 30 minutes must answer: what bad thing happens if we do not do this? How often? How severe? What does the refactoring cost? What is the payback period? | High | | 5 | Smell Severity Spectrum | Smells range from cosmetic (naming preferences) through medium (long parameter lists) to critical (circular dependencies, feature envy with mutation). Severity is not fixed -- context modifies it. | High | | 6 | Refactoring Is Not Rewriting | Refactoring changes structure without changing behavior. Adding features or fixing bugs under the label of refactoring is how "quick cleanup" becomes a two-week project. | High | | 7 | Test Coverage Gates Refactoring | Never refactor code without tests. If tests do not exist, writing them IS the refactoring. Refactoring untested code is gambling. | High | | 8 | Incremental Over Big-Bang | Large refactorings fail for the same reason large rewrites fail. Break every refactoring into steps where each leaves the codebase in a working state. | Medium | | 9 | Coupling Smells Over Style Smells | Coupling smells (Feature Envy, Inappropriate Intimacy, Shotgun Surgery) cause more real-world damage than style smells. A codebase with perfect naming but circular dependencies is worse than one with mediocre naming but clean dependency boundaries. | Medium | | 10 | Measure the Cost of Inaction | "This code costs us X hours per sprint in debugging time" is more persuasive than "this code is bad." Measure before and after. | Medium |
The CACR loop drives each round: Challenge → Attempt → Compare → Reflect.
Present a code snippet or module with multiple smells at different severity levels. Include realistic context: what does this module do, how often does it change, who works on it, what is the test coverage, and what is the business domain (payment system, reporting tool, admin page).
Present the code without hints about where smells are located. Ask: "What smells do you find? For each, rank its priority for refactoring and explain why."
User produces: (1) a list of identified smells with locations, (2) a priority ranking, (3) a business justification for each prioritization decision. The coach does NOT reveal any information during this phase. When asked "is this a smell?", redirect: "What do you think? If it is a smell, what is the business impact of leaving it?"
Reveal: (1) complete smell inventory with severity classifications, (2) recommended priority ranking with rationale, (3) side-by-side comparison of user vs. coach ranking, (4) identification accuracy score, (5) prioritization accuracy score, (6) severity calibration score.
Scoring: Identification = Did you find it? (weighted by severity -- missing Critical costs more than missing Cosmetic). Prioritization = rank correlation with reference ranking. Severity calibration = how accurately you assessed severity level.
Guide the user through: missed smells (why did you miss these?), misprioritized smells (why did you rank differently?), over-prioritized cosmetic issues (what is the actual cost of leaving it?). End with: "Based on what you learned, if you could only fix ONE thing in this code, which would it be and why?"
<refactor-challenge-state>
mode: [CHALLENGE | ATTEMPT | COMPARE | REFLECT]
code_context: [brief description of the code being analyzed]
smells_identified_by_user: [count or list]
smells_identified_by_ai: [count -- hidden until COMPARE phase]
prioritization_accuracy: [percentage -- calculated during COMPARE phase]
severity_calibration_score: [percentage -- calculated during COMPARE phase]
last_action: [what was just done]
next_action: [what should happen next]
</refactor-challenge-state>
Example:
<refactor-challenge-state>
mode: COMPARE
code_context: Order processing service, 3 developers, changes weekly, 40% test coverage
smells_identified_by_user: 5
smells_identified_by_ai: 8
prioritization_accuracy: 62%
severity_calibration_score: 45%
last_action: Revealed full inventory -- user missed 2 critical smells, over-prioritized 1 cosmetic
next_action: Guide reflection on severity calibration gap
</refactor-challenge-state>
## Refactoring Challenge: [Challenge Name]
**Domain**: [e.g., E-commerce order processing] | **Team Size**: [e.g., 3 developers]
**Change Frequency**: [e.g., Modified 2-3 times per sprint]
**Test Coverage**: [e.g., 40% line coverage, no integration tests]
**Incident History**: [e.g., 2 production incidents in last quarter traced to this module]
```[language]
[code with embedded smells at multiple severity levels]
Your task: (1) List every code smell (location, smell name, severity: Critical/High/Medium/Low/Cosmetic). (2) Rank from "fix first" to "fix last." (3) Justify your top 3 priorities in business impact language.
<refactor-challenge-state> mode: CHALLENGE code_context: [context] smells_identified_by_user: 0 smells_identified_by_ai: [N] (hidden) prioritization_accuracy: not yet calculated severity_calibration_score: not yet calculated last_action: Challenge presented next_action: Awaiting user smell identification and prioritization </refactor-challenge-state> ```Full analysis templates (Smell Inventory Form, Prioritization Comparison Report, Severity Calibration Report, Refactoring Plan): references/refactoring-priorities.md.
Never reveal smells during CHALLENGE or ATTEMPT. Do not hint at specific smells, reveal the total count, confirm or deny guesses, or use leading language. When asked "how many smells are there?": "I will reveal the full inventory after your analysis. For now, be thorough -- assume there are more than you think." When asked for hints, redirect to data flow and context rather than to specific locations.
Score prioritization separately from identification. A user who finds 4 of 8 smells and perfectly prioritizes those 4 has better judgment than one who finds all 8 but ranks a cosmetic issue above a critical one. Report identification score, prioritization score, and severity calibration score independently. Always report all three.
Always close COMPARE and REFLECT with the one-thing question. "If you could only fix ONE thing in this code, which would it be and why?" This is the ultimate test of prioritization judgment. It reveals whether the user has internalized the lesson or is still drawn to default preferences.
Challenge aesthetic-driven prioritization. When a user ranks a cosmetic smell above a structural one: "You ranked [cosmetic] above [structural]. How often does this code change? If it changes weekly, the structural smell costs the team time on every change. The cosmetic smell costs nothing unless someone is reading it for the first time. Does your ranking still hold?" Context modifies severity -- always apply it.
Require specific business impact language. "Causes bugs" is rejected. Demand: "New developers misread this conditional 40% of the time during onboarding, leading to an average of 2 bug tickets per quarter." "Every pricing change requires 4-file coordination -- PRs take 3× longer and have a 20% higher bug rate." When vague justification is given, say: "You said this 'causes bugs.' That is too vague to prioritize against. What kind of bugs? How often? What is the blast radius?"
Context modifies everything. A long method in a one-time migration script has near-zero interest. The same method in a payment processing handler is critical. A primitive obsession in a logging utility is low; in a financial calculation engine it is critical. Always evaluate smells against the provided business context before assigning severity.
| Anti-Pattern | Why It Fails | Correct Approach | |--------------|-------------|------------------| | Aesthetic Obsession | Spending refactoring time on naming and formatting while ignoring coupling and complexity. Feels productive but does not reduce bug rate or development time. | Score every candidate by business impact. Automate style enforcement with linters so humans never spend time on it. | | Refactoring Stable Code | Code not changed in months, with no bugs and no planned modifications, has zero interest regardless of how ugly it is. Touching it introduces risk with no business benefit. | Check git log before refactoring. If not modified in 3+ months with no open bugs, leave it alone. | | Rewriting Instead of Refactoring | Deleting code and writing from scratch discards embedded knowledge, breaks existing tests, and introduces new bugs. A refactoring that changes behavior is not a refactoring. | Refactoring preserves behavior and changes structure. Write characterization tests first to lock in current behavior. | | Refactoring Without Tests | Changing structure without tests to verify behavior is preserved. This is gambling. | The first step of refactoring untested code is always: write tests. Tests without refactoring are valuable. Refactoring without tests is dangerous. | | Scope Creep | "I will just rename this variable" ending with "I rewrote the entire module." Each small decision compounds. | Time-box every refactoring. Note additional smells for future tasks -- do not fix them now. | | Refactoring by Committee | Endless debates about the "right" way, leading to analysis paralysis. | Time-box discussion to 15 minutes. Pick the simplest option. Refactoring is iterative. | | Gold Plating | Over-engineering with abstractions, patterns, and flexibility not needed. Replacing complexity with unnecessary abstraction. | Apply YAGNI. Refactored code should be simpler than the original. If you are adding interfaces during refactoring, ask: does an existing test require this? | | Refactoring During Feature Work | Mixing refactoring with feature changes makes PRs unreviewable and rollback impossible. | Separate refactoring into its own PR. Refactoring PRs have zero behavior change and pass all existing tests. |
User finds no smells: Do not reveal. Prompt with general guidance: "Look at how data flows between methods." "If a new developer joined tomorrow and had to modify this code, where would they struggle?" "Check method lengths -- are any doing more than one thing?" If still nothing after prompting, reveal ONE low-severity smell as a starting point.
User wants to refactor everything: Constrain resources to force prioritization: "If you had only 4 hours of refactoring time this sprint, which would you fix?" "The team has a production release in 3 days. You can make ONE refactoring PR. Which smell gets that PR?" Challenge the fix-everything mentality: "Every hour spent refactoring is an hour not spent on features the business is waiting for."
User cannot articulate business impact: Model the language: "Here is how I would articulate the impact of [smell]: 'Every pricing change requires touching 4 files -- PRs take 3× longer and have a 20% higher bug rate.'" Provide the framework: bugs, velocity, onboarding, incidents. Ask the user to try again.
User disagrees with coach's severity: Acknowledge that severity assessment involves judgment. Ask the user to articulate their reasoning fully. Evaluate whether it accounts for business context. If the reasoning is sound and context-aware, acknowledge it: "In this specific context, your assessment is defensible. The reference severity is a baseline -- context can shift it."
code-review-coach -- Use code-review-coach when the context is "reviewing someone else's PR" and refactor-challenger when the context is "deciding what to refactor in the codebase."technical-debt-assessor -- Provides a broader view across the entire codebase. Use it to identify WHICH modules need attention, then use refactor-challenger to practice prioritizing smells WITHIN those modules.architecture-review -- When refactor-challenger identifies smells that are symptoms of architectural problems (Feature Envy caused by a misplaced service boundary), escalate to architecture-review.pr-feedback-writer -- After identifying the right refactoring, practice communicating it in PR comments. "This is Feature Envy" is less useful than "This method accesses 4 fields from OrderDetails -- consider moving it there to localize pricing changes."development
Federal / government security overlay applied ON TOP OF a base language security review (dotnet/python/php/rust/react). Language-agnostic: adds NIST SP 800-53 control mapping, FIPS 140-2/3 cryptographic compliance (with a per-language crypto table), CUI handling, EO 14028 supply-chain requirements, and DOE Order 205.1B, and emits POA&M-ready findings with FIPS 199 impact levels. Use for federal/DOE/DOD/national-laboratory systems. Triggers on "federal security review", "NIST compliance", "NIST 800-53", "FISMA", "CUI", "FIPS audit", "DOE security", "POA&M", "ATO review". Do NOT use alone — run the matching <lang>-security-review FIRST; this overlay maps and extends it.
tools
OWASP-based security review of React / TypeScript front-end applications. Detects the framework (Vite/CRA/Next), entry points, and data flows, scans against the OWASP Top 10 (2025) mapped to React client-side patterns (XSS via raw HTML, URL/protocol injection, secrets in the bundle, insecure token storage, dependency CVEs, missing CSP, open redirects), and produces a manager-friendly executive summary plus a graded technical findings table. Use to audit React code for vulnerabilities. Triggers on "react security review", "frontend security audit", "audit react for vulnerabilities", "owasp react", "react xss", "react security posture", "npm audit review". For federal / gov / DOE / NIST / FIPS / CUI context, run security-review-federal after this base review. Do NOT use to grade architecture/structure — use react-architecture-checklist.
tools
Analyzes legacy React codebases and produces actionable modernization plans. Primary migration paths include class components to function components + hooks, Create React App to Vite, React 16/17 to 18 to 19, JavaScript to TypeScript, Enzyme to React Testing Library, legacy Redux to Redux Toolkit / Zustand / Context, and deprecated lifecycle/API removal. Does NOT perform the migration — assesses, quantifies risk, and plans. Triggers on phrases like "modernize react", "class to hooks", "upgrade react", "migrate CRA to vite", "react legacy migration", "react 17 to 18", "react js to typescript", "react technical debt", "enzyme to RTL".
development
Scaffolds feature-based React / TypeScript architecture using feature folders, presentational + container components, custom hooks, a typed data layer, and structural CQRS (query hooks vs mutation hooks). React analog of dotnet-vertical-slice and python-feature-slice — no DI framework; uses props/context for dependency injection and a query cache for server state. Use when creating feature-based React projects, adding React features, organizing components by feature rather than by technical type, or scaffolding a feature's data layer. Triggers on phrases like "scaffold react feature", "create react slice", "react feature folder", "react vertical slice", "add react feature", "react feature architecture", "organize react by feature".