skills/failure-aware-enhancements-code-generation/SKILL.md
Diagnose why generated code fails and apply the right fix strategy (self-critique, RAG, multi-model, or progressive prompting) based on a data-driven decision framework from empirical research on 25 GitHub projects. Trigger phrases: "my generated code doesn't work", "fix this code generation failure", "why does this code keep failing", "help me debug LLM-generated code", "improve code generation quality", "the AI-generated code is wrong"
npx skillsauth add ndpvt-web/arxiv-claude-skills failure-aware-enhancements-code-generation
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill equips Claude to systematically diagnose why generated code fails and select the most effective repair strategy based on failure type rather than trial-and-error. Drawing from an empirical study of 25 GitHub projects (Shen, Peng & Owen, 2026), the approach classifies code generation failures into distinct categories -- logic errors, missing edge cases, integration issues, and specification gaps -- then maps each category to the enhancement method with the highest empirical success rate: self-critique for logic errors, RAG for implementation pattern gaps, multi-model reasoning for low-confidence outputs, and progressive prompting for unclear specifications.
The core insight: not all code generation failures respond to the same fix. The study found that progressive prompting raises average task completion from 80.5% to 96.9% (Cohen's d=1.63, p<0.001), but the remaining failures require targeted interventions. Self-critique works well for code-reviewable logic errors but achieves 0% improvement on external service integration failures. RAG achieves the highest completion rate across all failure types with superior efficiency. The wrong enhancement wastes tokens and time.
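The decision framework maps failure patterns to methods: logic errors to self-critique, specification gaps to progressive prompting, pattern gaps to RAG, and compound failures to multi-model reasoning.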
Why this matters: developers typically default to "just retry with a better prompt," which the study shows is suboptimal. Matching the enhancement to the failure type reduces wasted iterations and produces higher-quality code on fewer attempts.
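A minimal sketch of that routing as a lookup table may make the idea concrete. The enum and function names (`FailureCategory`, `Enhancement`, `choose_enhancement`) are hypothetical and chosen for illustration only; the pairings follow the mapping this skill describes, not code from the study.

```python
from enum import Enum


class FailureCategory(Enum):
    # Hypothetical labels for the four failure categories used in this skill.
    LOGIC_ERROR = "logic error"
    SPECIFICATION_GAP = "specification gap"
    PATTERN_GAP = "pattern gap"
    COMPOUND = "compound failure"


class Enhancement(Enum):
    SELF_CRITIQUE = "self-critique"
    PROGRESSIVE_PROMPTING = "progressive prompting"
    RAG = "RAG"
    MULTI_MODEL = "multi-model reasoning"


# Category-to-method routing as described in this skill.
ENHANCEMENT_FOR = {
    FailureCategory.LOGIC_ERROR: Enhancement.SELF_CRITIQUE,
    FailureCategory.SPECIFICATION_GAP: Enhancement.PROGRESSIVE_PROMPTING,
    FailureCategory.PATTERN_GAP: Enhancement.RAG,
    FailureCategory.COMPOUND: Enhancement.MULTI_MODEL,
}


def choose_enhancement(category: FailureCategory) -> Enhancement:
    """Return the enhancement method routed to the given failure category."""
    return ENHANCEMENT_FOR[category]
```

Keeping the routing in one table means a project can adjust it as it learns which method actually resolves which failure.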
1. Generate initial code using progressive prompting: decompose the requirement into ordered sub-tasks (data model → core logic → edge cases → integration → output formatting) and generate code for each sequentially, feeding prior outputs as context.
2. Run validation against available tests, type checks, or manual inspection. Collect all errors, warnings, and unmet requirements into a failure list.
3. Classify each failure into one of four categories: logic error, specification gap, pattern gap, or compound failure.
4. Apply self-critique for logic errors: Re-read the generated code line by line against the requirement. State explicitly what the code does vs. what it should do. Generate a minimal, targeted diff that fixes only the identified discrepancy. Do not rewrite surrounding code.
5. Apply progressive prompting for specification gaps: Identify the missing requirement. Formulate a focused follow-up prompt that asks specifically about the unaddressed aspect. Generate the additional code and integrate it into the existing solution.
6. Apply RAG for pattern gaps: Search documentation, codebases, or known examples for the correct usage pattern. Use the retrieved context to rewrite only the misused API calls or library interactions. Verify against official docs if accessible.
7. Apply multi-model reasoning for compound failures: Generate 2-3 alternative implementations of the failing section using different approaches. Compare outputs to identify which aspects each gets right. Synthesize the strongest elements into a single solution.
8. Re-validate after each fix: Run the same validation from step 2. If new failures appear, classify and address them. Track which enhancement method resolved which failure to build a feedback loop.
9. Document the failure-fix mapping: For each resolved failure, note the category and the method that worked. This becomes a project-specific decision guide for future iterations.
10. Finalize with integration testing: Once all individual failures are resolved, run end-to-end validation to catch interaction effects between the fixes.
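The validate, classify, fix, and re-validate cycle described above can be sketched as a small loop. This is an illustration under stated assumptions, not the study's implementation: `repair_loop`, its parameters, and the idea of passing the validator, classifier, and per-category fixers as plain callables are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type aliases for this sketch.
Failure = str
Validator = Callable[[str], List[Failure]]   # code -> list of failures
Classifier = Callable[[Failure], str]        # failure -> category name
Fixer = Callable[[str, Failure], str]        # (code, failure) -> revised code


def repair_loop(
    code: str,
    validate: Validator,
    classify: Classifier,
    fixers: Dict[str, Fixer],
    max_rounds: int = 5,
) -> Tuple[str, List[Tuple[Failure, str]], List[Failure]]:
    """Fix one failure per round and re-validate after every fix.

    Returns the final code, a log of (failure, category) pairs in the
    order fixes were applied, and any failures still open at the end.
    """
    applied: List[Tuple[Failure, str]] = []
    failures = validate(code)
    rounds = 0
    while failures and rounds < max_rounds:
        failure = failures[0]
        category = classify(failure)
        # Route the failure to the fixer registered for its category.
        code = fixers[category](code, failure)
        applied.append((failure, category))
        # Same validation as the first pass, so new failures surface here.
        failures = validate(code)
        rounds += 1
    return code, applied, failures
```

The `applied` log is one possible starting point for the failure-fix mapping, and a non-empty third return value signals that the round limit was reached with failures still open.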
Example 1: Logic error in sorting algorithm
User: "Write a function that returns the top K most frequent elements from an array."
Initial generation:
def top_k_frequent(nums, k):
    freq = {}
    for n in nums:
        freq[n] = freq.get(n, 0) + 1
    sorted_freq = sorted(freq.items(), key=lambda x: x[1])
    return [x[0] for x in sorted_freq[:k]]
Failure: Returns LEAST frequent elements (sorted ascending, takes first k).
Classification: Logic error -- code runs but produces wrong results.
Enhancement: Self-critique.
Reasoning: The sort order is ascending but we need descending.
The slice takes [:k] which grabs the lowest values.
Fix: Either reverse the sort or take [-k:].
Fixed code:
def top_k_frequent(nums, k):
    freq = {}
    for n in nums:
        freq[n] = freq.get(n, 0) + 1
    sorted_freq = sorted(freq.items(), key=lambda x: x[1], reverse=True)
    return [x[0] for x in sorted_freq[:k]]
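One way to keep this failure from returning is a small regression check on an input where the ascending sort visibly gives the wrong answer. The sketch below also swaps in `collections.Counter.most_common`, a standard-library alternative that is not part of the original example, to show the same behavior in fewer lines.

```python
from collections import Counter


def top_k_frequent(nums, k):
    """Return the k most frequent values, most frequent first."""
    return [value for value, _count in Counter(nums).most_common(k)]


# Regression check: the ascending-sort version returned [3, 2] for this input.
assert top_k_frequent([1, 1, 1, 2, 2, 3], 2) == [1, 2]
```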
Example 2: External API integration failure
User: "Add Stripe payment processing to my checkout endpoint."
Initial generation uses stripe.Charge.create() -- part of the older Charges API, which Stripe's documentation recommends replacing with PaymentIntents.
Failure: stripe.error.InvalidRequestError -- Charges API no longer
recommended, PaymentIntents required.
Classification: Pattern gap -- incorrect API usage pattern.
Enhancement: RAG (self-critique would fail here; the study shows
0% improvement from self-critique on external service integration).
Action:
1. Retrieve current Stripe docs for PaymentIntents API
2. Identify correct method: stripe.PaymentIntent.create()
3. Note required parameters: amount, currency, payment_method, confirm
4. Rewrite only the payment processing section:
Fixed code:
intent = stripe.PaymentIntent.create(
    amount=amount_cents,
    currency="usd",
    payment_method=payment_method_id,
    confirm=True,
    automatic_payment_methods={"enabled": True, "allow_redirects": "never"},
)
Example 3: Underspecified requirements with compound failures
User: "Build a caching layer for my database queries."
Initial generation: Simple dict-based cache with no expiration.
Failures (multiple):
- F1: No TTL / expiration (specification gap)
- F2: No thread safety (specification gap)
- F3: Unbounded memory growth (logic error)
- F4: Cache key doesn't account for query parameters (logic error)
Approach -- address each by category:
Step 1 (Progressive prompting for F1, F2):
"The cache needs TTL-based expiration. What should the default TTL be?"
"This runs in a multi-threaded web server. Add thread-safe access."
Step 2 (Self-critique for F3, F4):
F3: "The cache grows without bound. Add an LRU eviction policy
with a configurable max size."
F4: "Cache key is just the query string. It must include
parameterized values: key = hash(query + str(params))."
Step 3: Validate the combined solution handles all four failures.
Output: Thread-safe LRU cache with TTL expiration and
parameter-aware cache keys.
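A rough sketch of what such an output might look like follows. It is one possible shape, not the result of the example run: the class name `TTLLRUCache`, its method names, the default sizes, and the injectable `clock` are all assumptions. It also uses a `(query, params)` tuple as the key instead of the `hash(query + str(params))` form quoted above.

```python
import threading
import time
from collections import OrderedDict


class TTLLRUCache:
    """Thread-safe LRU cache with per-entry TTL (illustrative sketch)."""

    def __init__(self, max_size=128, ttl_seconds=60.0, clock=time.monotonic):
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock              # injectable so expiry is testable without sleeping
        self._lock = threading.Lock()    # F2: serialize access across threads
        self._entries = OrderedDict()    # key -> (expires_at, value), oldest first

    @staticmethod
    def make_key(query, params=()):
        # F4: include parameter values so the same query text with different
        # arguments gets distinct entries. Assumes params are hashable.
        return (query, tuple(params))

    def get(self, query, params=()):
        key = self.make_key(query, params)
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                return None
            expires_at, value = entry
            if self._clock() >= expires_at:
                del self._entries[key]       # F1: drop the expired entry
                return None
            self._entries.move_to_end(key)   # mark as most recently used
            return value

    def put(self, query, params, value):
        key = self.make_key(query, params)
        with self._lock:
            self._entries[key] = (self._clock() + self._ttl, value)
            self._entries.move_to_end(key)
            while len(self._entries) > self._max_size:
                self._entries.popitem(last=False)  # F3: evict least recently used
```

Because `get` returns `None` on a miss, this sketch cannot distinguish a cached `None` from an absent entry; a sentinel object would be one way to handle that if it matters.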
Do:
Avoid:
| Situation | Response |
|-----------|----------|
| Self-critique identifies no discrepancy but tests still fail | Reclassify as pattern gap or compound failure; switch to RAG or multi-model |
| RAG returns no relevant documentation | Fall back to multi-model reasoning; generate alternatives and test empirically |
| Progressive prompting produces contradictory sub-task outputs | Consolidate requirements into a single coherent specification before regenerating |
| Fix introduces new failures | Classify the new failures independently; do not assume they share the original category |
| All enhancement methods fail | The requirement may exceed single-generation capability; recommend decomposing into separate modules with clear interfaces |
Shen, J., Peng, Z., & Owen, L. (2026). Failure-Aware Enhancements for Large Language Model (LLM) Code Generation: An Empirical Study on Decision Framework. SANER 2026. arXiv:2602.02896 -- Read for: the failure taxonomy (Section 3), decision framework mapping failures to enhancement methods (Section 4), and empirical results showing self-critique's 0% success rate on integration failures vs. RAG's cross-category effectiveness (Section 5).