skills/error-taxonomy-guided-prompt-optimization/SKILL.md
Optimize LLM prompts by systematically collecting errors, building a taxonomy of failure modes, and augmenting prompts with targeted guidance for the most frequent error categories. Based on ETGPO (Singh et al., 2026). Trigger phrases: "optimize this prompt", "my prompt keeps failing", "improve prompt accuracy", "debug prompt errors", "fix LLM failures", "prompt isn't working well"
npx skillsauth add ndpvt-web/arxiv-claude-skills error-taxonomy-guided-prompt-optimization
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to optimize LLM prompts using a top-down, error-taxonomy-driven approach. Instead of trial-and-error prompt tweaking, ETGPO collects failures from running a prompt against test cases, categorizes those failures into a structured taxonomy ranked by frequency, and generates targeted guidance text that addresses the most prevalent error patterns. The result is a prompt that systematically closes the most impactful failure modes in a single pass, using roughly one third the token budget of iterative methods.
Top-down vs. bottom-up. Most prompt optimization methods work bottom-up: they look at individual failures, tweak the prompt, re-evaluate, and repeat. This risks overfitting to specific examples and losing sight of the global error landscape. ETGPO inverts this. It first collects all failures across multiple stochastic runs, then builds a complete error taxonomy before making any prompt changes. This single-pass, top-down strategy produces prompts that generalize better while consuming far fewer tokens.
The error taxonomy. The taxonomy is a structured categorization of every failure the model makes on the validation set. Each category has: a name and summary, a detailed description of what goes wrong, self-contained examples of the error, an explanation of why it leads to wrong answers, and prevalence statistics (failure count and unique problem count). Categories with fewer than 2 unique problems are discarded to avoid overfitting. The remaining categories are ranked by failure count, and the top G (typically 10) are selected for guidance generation.
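A taxonomy entry of this shape could be represented as a small record. The sketch below is illustrative only: the class name `ErrorCategory` and every field name are assumptions, not identifiers from the paper. It mainly shows how the two prevalence statistics differ when one problem fails several times across runs.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorCategory:
    """One taxonomy entry; all field names here are hypothetical."""
    name: str
    summary: str
    description: str
    why_wrong: str
    examples: list[str] = field(default_factory=list)
    # Each assigned failure is stored as (problem_id, run_index).
    failures: list[tuple[str, int]] = field(default_factory=list)

    @property
    def failure_count(self) -> int:
        # Total failed traces assigned to this category across all runs.
        return len(self.failures)

    @property
    def unique_problem_count(self) -> int:
        # Distinct problems with at least one failure of this kind.
        return len({problem_id for problem_id, _ in self.failures})


category = ErrorCategory(
    name="Ignoring domain constraints",
    summary="Answer violates a stated bound",
    description="The algebra is right but a constraint is never re-checked.",
    why_wrong="An out-of-domain root is reported as the final answer.",
    failures=[("p17", 0), ("p17", 3), ("p42", 1)],
)
print(category.failure_count, category.unique_problem_count)
```

Here problem `p17` fails in two different runs, so the category has three failures but only two unique problems. The unique-problem count is the one compared against the discard threshold.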
Guidance augmentation. For each selected error category, the optimizer LLM generates actionable guidance text containing: a description of the error pattern, concrete advice for avoiding it, a WRONG example demonstrating the error, and a CORRECT example showing proper reasoning. This guidance is appended to the original prompt with a preamble, producing the final optimized prompt in one shot.
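The layout of a guidance item can be sketched as a plain template. In ETGPO the optimizer LLM writes the text itself; the helpers below (`render_guidance_item` and `build_guidance_block`, both hypothetical names) only show how the four parts and the preamble might be arranged and appended.

```python
def render_guidance_item(index: int, title: str, advice: str,
                         wrong: str, correct: str) -> str:
    """Format one error category as a guidance entry with WRONG/CORRECT examples."""
    return "\n".join([
        f"### {index}. {title}",
        advice,
        f"WRONG: {wrong}",
        f"CORRECT: {correct}",
    ])


def build_guidance_block(preamble: str, items: list[str]) -> str:
    """Join the preamble and the rendered items into one block of text."""
    return "\n\n".join([preamble, *items])


item = render_guidance_item(
    1,
    "Ignoring Domain Constraints",
    "Re-read the problem constraints before finalizing an answer.",
    "Returning x=-3 when the problem states x > 0",
    "Finding x=-3 and x=5, then selecting x=5 since x > 0",
)
block = build_guidance_block("## Common Pitfalls to Avoid", [item])
# Appending the block to the original prompt yields the optimized prompt.
optimized_prompt = "Solve the problem step by step." + "\n\n" + block
print(optimized_prompt)
```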
Collect the validation set. Gather 50-200 input-output pairs where the ground truth is known. These are the cases you will use to diagnose failures. Separate a held-out test set for final evaluation.
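One way to separate the held-out test set is a seeded random split. This is a minimal sketch: the function name, the 30% test fraction, and the seed are assumptions chosen for illustration, not values from the source.

```python
import random


def split_validation_test(pairs, test_fraction=0.3, seed=0):
    """Shuffle once with a fixed seed, then carve off the held-out test set."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # Returns (validation_set, test_set); the two never overlap.
    return shuffled[n_test:], shuffled[:n_test]


pairs = [(f"question {i}", f"answer {i}") for i in range(100)]
validation_set, test_set = split_validation_test(pairs)
print(len(validation_set), len(test_set))
```

Fixing the seed keeps the split reproducible, so the baseline and optimized prompts are later scored on the same test items.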
Run K stochastic passes. Execute the current prompt against the validation set K times (K=5 is a good default). Use temperature > 0 so different runs can surface different failure modes. Record every failed trace: the input, the expected output, the model's reasoning chain, and the model's incorrect answer.
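The collection loop might look like the sketch below. `run_model` is a hypothetical callable standing in for whatever LLM client is in use; it is assumed to sample with temperature > 0 and return the reasoning chain plus the final answer. Exact string matching against the ground truth is also an assumption, and real tasks may need a task-specific checker.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailedTrace:
    problem_id: str
    run_index: int
    input_text: str
    expected: str
    reasoning: str
    answer: str


def collect_failures(
    prompt: str,
    validation_set: dict[str, tuple[str, str]],
    run_model: Callable[[str, str], tuple[str, str]],
    k: int = 5,
) -> list[FailedTrace]:
    """Run the prompt k times over the validation set and keep every failure."""
    failures = []
    for run_index in range(k):
        for problem_id, (input_text, expected) in validation_set.items():
            # Hypothetical wrapper: returns (reasoning_chain, final_answer).
            reasoning, answer = run_model(prompt, input_text)
            if answer.strip() != expected.strip():  # exact match is an assumption
                failures.append(FailedTrace(
                    problem_id, run_index, input_text, expected, reasoning, answer))
    return failures


def stub_model(prompt: str, input_text: str) -> tuple[str, str]:
    # Stand-in for a real LLM call: always answers "4".
    return ("I think the answer is 4.", "4")


validation_set = {"p1": ("2 + 2", "4"), "p2": ("3 + 3", "6")}
failures = collect_failures("Solve the problem.", validation_set, stub_model, k=5)
print(len(failures), {f.problem_id for f in failures})
```

Storing `run_index` alongside `problem_id` makes it possible to compute both the total failure count and the unique problem count later.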
Batch failures for taxonomy creation. Split the collected failed traces into batches of B (B=6 works well). For the first batch, ask the LLM to identify where reasoning first goes wrong in each trace, determine the nature of the error, and group traces into named categories with descriptions and examples. Output as structured JSON.
Incrementally build the taxonomy. For each subsequent batch, provide the current taxonomy (category names, descriptions, and prevalence counts) and instruct the LLM to assign new traces to existing categories when possible, creating new categories only when a failure is fundamentally different from all existing ones.
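The batching and incremental assignment described in the last two steps can be sketched as follows. `categorize_batch` is a hypothetical callable standing in for the LLM call; it receives the batch plus the current category names and counts, and returns a category name per trace. The real prompt would also pass descriptions and request structured JSON, which this sketch omits.

```python
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int = 6) -> Iterator[list[str]]:
    """Yield consecutive batches of at most batch_size traces."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def build_taxonomy(
    traces: list[str],
    categorize_batch: Callable[[list[str], dict[str, int]], dict[str, str]],
    batch_size: int = 6,
) -> dict[str, list[str]]:
    """Grow the taxonomy one batch at a time, reusing categories when possible."""
    taxonomy: dict[str, list[str]] = {}
    for batch in batched(traces, batch_size):
        # Snapshot of existing categories and their prevalence so far.
        known_counts = {name: len(members) for name, members in taxonomy.items()}
        assignments = categorize_batch(batch, known_counts)
        for trace, category in assignments.items():
            taxonomy.setdefault(category, []).append(trace)
    return taxonomy


seen_snapshots = []


def stub_categorize(batch: list[str], known_counts: dict[str, int]) -> dict[str, str]:
    # Stand-in for the LLM: categorizes by a prefix in the trace text.
    seen_snapshots.append(dict(known_counts))
    return {t: "arithmetic slip" if t.startswith("calc") else "misread question"
            for t in batch}


traces = [f"calc-{i}" for i in range(9)] + [f"read-{i}" for i in range(5)]
taxonomy = build_taxonomy(traces, stub_categorize)
print({name: len(members) for name, members in taxonomy.items()})
```

With 14 traces and B=6, the categorizer is called three times; each call sees the counts accumulated from the earlier batches.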
Filter and rank categories. Discard any category backed by fewer than 2 unique problems (these are likely problem-specific, not systematic). Rank the remaining categories by total failure count in descending order.
Select top G categories. Pick the top G error categories (G=10 is a reasonable default; use G=5 for smaller or simpler domains). These represent the highest-impact failure modes.
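Filtering, ranking, and selection reduce to a few lines. In this sketch the function name and the input shape (category name mapped to a `(failure_count, unique_problem_count)` pair) are assumptions, and the numbers are toy values, not results.

```python
def select_top_categories(
    stats: dict[str, tuple[int, int]],
    g: int = 10,
    min_unique_problems: int = 2,
) -> list[str]:
    """Drop problem-specific categories, rank by failure count, keep the top g."""
    kept = [
        (name, failure_count)
        for name, (failure_count, unique_problems) in stats.items()
        if unique_problems >= min_unique_problems
    ]
    kept.sort(key=lambda row: row[1], reverse=True)
    return [name for name, _ in kept[:g]]


# name -> (failure_count, unique_problem_count); toy values for illustration.
stats = {
    "algebra slip": (14, 6),
    "pattern overgeneralization": (9, 4),
    "one-off parsing issue": (11, 1),
    "ignored constraint": (5, 3),
}
print(select_top_categories(stats, g=2))
```

Note that "one-off parsing issue" has the second-highest failure count but is dropped, because all of its failures come from a single problem.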
Generate targeted guidance. For each selected category, prompt the LLM to produce guidance text structured as: (a) description of the error pattern, (b) actionable advice to avoid it, (c) a WRONG example showing the error in action, (d) a CORRECT example showing proper handling. Also generate a preamble that introduces the guidance block.
Assemble the optimized prompt. Add the preamble and all guidance items to the original prompt, placing them after the task instructions. If the prompt contains few-shot examples, insert the guidance before them; otherwise append it at the end.
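Placement relative to few-shot examples could be handled as in the sketch below. It assumes the few-shot section starts at a recognizable marker line (`## Examples` here, a hypothetical convention); prompts without such a marker simply get the guidance appended.

```python
def insert_guidance(prompt: str, guidance_block: str,
                    few_shot_marker: str = "## Examples") -> str:
    """Place guidance after the task instructions and before any few-shot section."""
    head, marker, tail = prompt.partition(few_shot_marker)
    if not marker:
        # No few-shot section found: append the guidance at the end.
        return prompt.rstrip() + "\n\n" + guidance_block + "\n"
    return head.rstrip() + "\n\n" + guidance_block + "\n\n" + marker + tail


prompt = (
    "Classify the sentiment of the review.\n\n"
    "## Examples\nReview: great\nLabel: positive\n"
)
guidance = "## Common Pitfalls to Avoid\n### 1. Sarcasm\nCheck for ironic praise."
print(insert_guidance(prompt, guidance))
```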
Evaluate on the held-out test set. Run the optimized prompt against the test set to measure improvement. Compare against the baseline prompt accuracy.
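The comparison itself is a plain accuracy computation over the same test items. The sketch below uses exact-match scoring, which is an assumption, and the answer lists are toy data for illustration rather than measured results.

```python
def accuracy(predictions: list[str], expected: list[str]) -> float:
    """Fraction of predictions that exactly match the ground truth."""
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must have the same length")
    correct = sum(p.strip() == e.strip() for p, e in zip(predictions, expected))
    return correct / len(expected)


# Toy answers on a four-item test set (illustrative only).
expected = ["4", "6", "9", "2"]
baseline_answers = ["4", "7", "9", "1"]
optimized_answers = ["4", "6", "9", "1"]

baseline_acc = accuracy(baseline_answers, expected)
optimized_acc = accuracy(optimized_answers, expected)
print(f"baseline={baseline_acc:.2f} optimized={optimized_acc:.2f}")
```

Because sampling is stochastic, averaging accuracy over several runs of each prompt gives a steadier comparison than a single run.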
Iterate only if needed. If a specific error category still dominates failures on the test set, generate additional guidance for that category. ETGPO is designed to work in a single pass, so further iteration should be rare.
Example 1: Optimizing a math reasoning prompt
User: "My prompt for solving competition math problems gets 60% on my test set. Help me improve it."
Approach:
Output (guidance appended to prompt):
## Common Pitfalls to Avoid
The following guidance addresses systematic errors observed in mathematical reasoning.
### 1. Algebraic Calculation Errors
When expanding, factoring, or simplifying expressions, verify each step explicitly.
Do not skip intermediate steps in polynomial expansion or fraction simplification.
WRONG: "Expanding (x+2)^3 = x^3 + 6x + 8" (missing terms)
CORRECT: "Expanding (x+2)^3 = x^3 + 3(x^2)(2) + 3(x)(4) + 8 = x^3 + 6x^2 + 12x + 8"
### 2. Incorrect Generalization from Patterns
Do not assume a pattern observed in small cases holds generally. Always prove or
verify the pattern for the general case before applying it.
WRONG: "f(1)=2, f(2)=4, f(3)=8, so f(n)=2^n" (without proving the recurrence)
CORRECT: "f(1)=2, f(2)=4, f(3)=8. Hypothesis: f(n)=2^n. Verify: the recurrence
f(n)=2*f(n-1) with f(1)=2 gives f(n)=2^n by induction."
### 3. Ignoring Domain Constraints
Always re-read the problem constraints before finalizing an answer. Check that your
solution satisfies all stated bounds, integrality requirements, and domain restrictions.
WRONG: Solving for x and returning x=-3 when the problem states x > 0
CORRECT: Solving for x, finding x=-3 and x=5, then selecting x=5 since x > 0
Example 2: Improving a multi-hop QA prompt
User: "My RAG pipeline's answer-generation prompt mishandles questions requiring reasoning over multiple documents. Can you optimize it?"
Approach:
Output (excerpt):
## Multi-Document Reasoning Guidelines
### Bridge Entity Resolution
When answering requires information from multiple sources, explicitly identify the
connecting entity between documents. State which entity links the sources before
reasoning about the answer.
WRONG: "Document 1 says the CEO founded the company in 2010, and John Smith is the CEO. The founder is John Smith."
(ignores Document 2, which shows that the CEO changed in 2015)
CORRECT: "Document 1 mentions John Smith as CEO. Document 2 clarifies Smith became
CEO in 2015, replacing Jane Doe, who founded the company in 2010. The founder is Jane Doe."
Example 3: Fixing a logical reasoning prompt
User: "My prompt for evaluating logical arguments gets confused by negations and quantifiers."
Approach:
Paper: Singh, Yadav, and Blanco. "Error Taxonomy-Guided Prompt Optimization." arXiv:2602.00997v1, February 2026. Key takeaway: Section 3.2 (taxonomy creation via batched LLM analysis) and Section 3.4 (guidance generation with WRONG/CORRECT examples) contain the core algorithmic details. Table 2 shows ETGPO uses ~1/3 the tokens of competing methods at equal or better accuracy.