skills/evocodebench-human-performance-benchmark-self-evol/SKILL.md
Self-evolving code generation with iterative reflection and revision. Applies a feedback-driven loop where code is submitted, judged, analyzed for failures, and rewritten up to 3 times — tracking correctness, runtime, memory, and algorithmic improvement at each iteration. Use when: 'solve this coding problem and optimize it', 'iteratively improve this solution', 'refine my code until it passes all tests', 'benchmark my solution against human performance', 'reduce the time complexity of this code', 'fix and re-attempt this failing solution'.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add ndpvt-web/arxiv-claude-skills evocodebench-human-performance-benchmark-self-evol
This skill implements the EvoCodeBench self-evolution methodology: a structured loop where Claude generates a solution, receives execution feedback (verdict, runtime, memory, error details), reflects on failures or inefficiencies, and produces an improved revision — repeating up to 3 rounds. Unlike one-shot code generation, this approach tracks correctness and efficiency dynamics across iterations, targeting not just passing tests but achieving competitive runtime and memory usage calibrated against human programmer distributions.
The EvoCodeBench methodology separates code evolution into two distinct reflection modes. Bugfix reflection activates when a submission fails (Wrong Answer, Runtime Error, Compile Error, Time Limit Exceeded, or Memory Limit Exceeded). The agent analyzes the specific verdict and diagnostic output, identifies root causes such as missed corner cases, off-by-one errors, incorrect data structures, or language-specific syntax issues, and produces a targeted fix. Optimization reflection activates when a solution is accepted but its runtime or memory percentile is below a competitive threshold. Here the agent proposes algorithmic or implementation-level improvements: replacing O(n^2) scans with hash maps, switching from recursion to iteration to cut stack overhead, or using language-specific optimizations such as the GCC/Clang builtin __builtin_popcount in C++ instead of manual bit counting.
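The choice between the two reflection modes can be sketched as a small dispatch function. This is an illustrative sketch, not the paper's implementation: the function name `choose_reflection`, the short verdict codes, and the default threshold of 50 are assumptions made for the example.

```python
from typing import Optional

# Verdict codes assumed for this sketch: Wrong Answer, Runtime Error,
# Compile Error, Time Limit Exceeded, Memory Limit Exceeded.
FAILURE_VERDICTS = {"WA", "RE", "CE", "TLE", "MLE"}


def choose_reflection(
    verdict: str,
    beats_percent: Optional[float],
    threshold: float = 50.0,  # assumed "competitive" cut-off, not from the source
) -> Optional[str]:
    """Return 'bugfix', 'optimization', or None when no reflection is needed."""
    if verdict in FAILURE_VERDICTS:
        return "bugfix"
    if verdict == "AC":
        # Accepted: reflect again only if the solution is not yet competitive.
        if beats_percent is not None and beats_percent < threshold:
            return "optimization"
        return None
    raise ValueError(f"unknown verdict: {verdict!r}")
```

A failing verdict always routes to bugfix reflection; an accepted one routes to optimization reflection only when its percentile is known and falls below the threshold.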
The critical insight is that self-evolution must be bounded and structured. The loop runs at most 3 reflection-revision rounds and terminates early on acceptance with competitive performance. Each iteration preserves the full trajectory: the original code, the verdict, the reflection analysis, and the revised code. This trajectory record guards against regression, because the agent can see what it already tried and avoid repeating failed strategies. The paper reports that this approach yields 10-27% relative pass-rate improvements and 7-46% runtime reductions depending on language, with the largest gains in compiled languages, where compilation errors are systematically eliminated across iterations.
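A minimal sketch of the bounded loop follows, assuming the judge, the reviser, and the stopping test are supplied as plain callables. The names `Attempt`, `evolve`, `judge`, `revise`, and `is_done` are hypothetical, and the sketch reads "at most 3 rounds" as one initial submission plus up to three revisions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

MAX_ROUNDS = 3  # upper bound on reflection-revision rounds


@dataclass
class Attempt:
    iteration: int
    code: str
    verdict: str
    reflection: Optional[str] = None  # analysis that led to this attempt's code


def evolve(
    initial_code: str,
    judge: Callable[[str], str],                          # code -> verdict
    revise: Callable[[List[Attempt]], Tuple[str, str]],   # trajectory -> (reflection, new code)
    is_done: Callable[[str], bool],                       # verdict -> stop early?
) -> List[Attempt]:
    """Submit, judge, and revise until done or the round budget is spent."""
    trajectory: List[Attempt] = []
    code, reflection = initial_code, None
    for iteration in range(1, MAX_ROUNDS + 2):  # 1 initial + up to 3 revisions
        verdict = judge(code)
        trajectory.append(Attempt(iteration, code, verdict, reflection))
        if is_done(verdict) or iteration == MAX_ROUNDS + 1:
            break
        # The reviser sees the whole trajectory, so it can avoid repeating itself.
        reflection, code = revise(trajectory)
    return trajectory
```

Passing the full trajectory to `revise` is the design point here: every earlier attempt and its verdict stay visible when the next revision is produced.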
A key finding is that performance degrades from high-resource languages (Python, C++, Java) to long-tail languages (Kotlin, Go) due to training data imbalance. The self-evolution loop is especially valuable for these languages because it catches compilation errors and API misuse that one-shot generation misses entirely.
Parse the problem specification. Extract constraints (input size bounds, time/memory limits), input/output format, edge cases from examples, and required algorithmic concepts. Identify the target language.
Generate an initial solution with explicit reasoning. Before writing code, state the chosen algorithm and its time/space complexity. Select data structures deliberately. For compiled languages (C++, Java, Go, Kotlin), pay extra attention to type declarations, imports, and language-specific API usage.
Execute and collect feedback. Run the solution against test cases. Record the verdict category: Accepted (AC), Wrong Answer (WA), Time Limit Exceeded (TLE), Memory Limit Exceeded (MLE), Compile Error (CE), or Runtime Error (RE). For AC, record runtime and memory. For failures, capture error messages, failing test case input/output if available, and the number of test cases passed before failure.
Classify the failure mode and enter the appropriate reflection. If CE: fix syntax, imports, or type issues. If RE: check null/nil access, array bounds, stack overflow, integer overflow. If WA: trace logic against the failing case, check edge cases (empty input, single element, maximum values). If TLE: analyze complexity and identify the bottleneck loop or recursion. If MLE: reduce auxiliary data structure size or switch to in-place algorithms.
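The classification above amounts to a lookup from verdict to a checklist of things to inspect. The sketch below encodes it as a dictionary; the name `reflection_checklist` and the short verdict codes are assumptions for illustration, while the checklist entries are taken from the paragraph above.

```python
from typing import Dict, List

# Checklist items per failing verdict, as listed in the step above.
REFLECTION_CHECKLISTS: Dict[str, List[str]] = {
    "CE": ["syntax", "imports", "type issues"],
    "RE": ["null/nil access", "array bounds", "stack overflow", "integer overflow"],
    "WA": ["trace logic against the failing case", "empty input",
           "single element", "maximum values"],
    "TLE": ["analyze complexity", "find the bottleneck loop or recursion"],
    "MLE": ["shrink auxiliary data structures", "switch to in-place algorithms"],
}


def reflection_checklist(verdict: str) -> List[str]:
    """Return the items to inspect for a verdict; accepted code needs no bugfix list."""
    if verdict == "AC":
        return []
    try:
        return REFLECTION_CHECKLISTS[verdict]
    except KeyError:
        raise ValueError(f"unknown verdict: {verdict!r}") from None
```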
Perform structured reflection. Write an explicit analysis with three fields: (a) Root cause: what specifically went wrong and why, (b) Improved reasoning: the corrected or optimized approach with complexity analysis, (c) Improved code: the full revised solution. Never patch blindly — always explain the fix rationale.
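One way to enforce the three-field reflection is a small record type that refuses empty fields, so a revision cannot be recorded without a stated rationale. The class name `Reflection` and its `render` method are hypothetical; only the three field names come from the step above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reflection:
    root_cause: str          # (a) what went wrong and why
    improved_reasoning: str  # (b) corrected approach with complexity analysis
    improved_code: str       # (c) full revised solution

    def __post_init__(self) -> None:
        # "Never patch blindly": every field must carry real content.
        for name in ("root_cause", "improved_reasoning", "improved_code"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")

    def render(self) -> str:
        """Format the reflection in the layout used by the examples in this document."""
        return (
            f"Root cause: {self.root_cause}\n"
            f"Improved reasoning: {self.improved_reasoning}\n"
            f"Improved code:\n{self.improved_code}"
        )
```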
Resubmit the revised solution. Execute against the same test suite. Compare the new verdict and metrics against the previous iteration. If the verdict improved (e.g., WA to AC, or TLE to AC) but performance is suboptimal, continue to optimization reflection.
Track the iteration trajectory. Maintain a record of each attempt: [iteration, verdict, tests_passed, runtime_ms, memory_mb, change_summary]. Use this to prevent regression — if attempt 2 passes fewer tests than attempt 1, revert to attempt 1 and try a different strategy.
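The trajectory record and the regression check can be sketched as below. The field names follow the record layout given above; `Record`, `is_regression`, and `best_attempt` are hypothetical names, and comparing against the best earlier attempt (rather than only the immediately preceding one) is an assumption that generalizes the "attempt 2 vs attempt 1" rule.

```python
from typing import List, NamedTuple, Optional


class Record(NamedTuple):
    iteration: int
    verdict: str
    tests_passed: int
    runtime_ms: Optional[float]  # None when the attempt was not accepted
    memory_mb: Optional[float]
    change_summary: str


def is_regression(trajectory: List[Record]) -> bool:
    """True if the latest attempt passes fewer tests than the best earlier one."""
    if len(trajectory) < 2:
        return False
    *earlier, latest = trajectory
    return latest.tests_passed < max(r.tests_passed for r in earlier)


def best_attempt(trajectory: List[Record]) -> Record:
    """Most tests passed; ties go to the lower runtime, then the earlier iteration."""
    def key(r: Record):
        runtime = r.runtime_ms if r.runtime_ms is not None else float("inf")
        return (r.tests_passed, -runtime, -r.iteration)
    return max(trajectory, key=key)
```

When `is_regression` returns true, `best_attempt` identifies the attempt to revert to before trying a different strategy.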
Apply optimization reflection if accepted but slow. Compare runtime against known human baselines or problem constraints. If the solution is slower than the median of typical submissions (that is, it beats fewer than 50% of them), analyze: can the algorithm be improved (e.g., sorting + two-pointer instead of brute force)? Can constant factors be reduced (e.g., StringBuilder instead of string concatenation in Java)?
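The percentile comparison can be made concrete with a small helper. This sketch assumes a list of baseline runtimes is available and treats "competitive" as beating at least 50% of them; the names `beats_percent` and `needs_optimization` and the baseline numbers in the usage note are hypothetical.

```python
from typing import Sequence


def beats_percent(runtime_ms: float, baseline_runtimes_ms: Sequence[float]) -> float:
    """Percentage of baseline submissions that are strictly slower than ours."""
    if not baseline_runtimes_ms:
        raise ValueError("need at least one baseline runtime")
    slower = sum(1 for b in baseline_runtimes_ms if b > runtime_ms)
    return 100.0 * slower / len(baseline_runtimes_ms)


def needs_optimization(
    runtime_ms: float,
    baseline_runtimes_ms: Sequence[float],
    threshold: float = 50.0,  # assumed cut-off: beat at least half of the baselines
) -> bool:
    """True when an accepted solution should go through optimization reflection."""
    return beats_percent(runtime_ms, baseline_runtimes_ms) < threshold
```

With made-up baselines of 40, 50, 60, ..., 130 ms, a 65 ms solution beats 70% of them and is left alone, while a 115 ms solution beats only 20% and would be sent to optimization reflection.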
Terminate after at most 3 revision rounds or on competitive acceptance. Report the final solution alongside the full trajectory showing how correctness and efficiency evolved. If the problem remains unsolved after 3 rounds, report the best attempt and the unresolved failure mode.
Provide cross-language guidance if applicable. When the user needs the solution in multiple languages, translate the accepted solution while watching for language-specific pitfalls: integer overflow in Java (use long), Kotlin's nullable types, Go's lack of generics in older versions, C++ iterator invalidation.
Example 1: Accepted on the first submission (no reflection needed)
User: "Solve two-sum in Python and make sure it handles all edge cases."
Iteration 1:
Reasoning: Use a hash map for O(n) lookup.
Code:
    def twoSum(nums, target):
        seen = {}  # value -> index of its first occurrence
        for i, n in enumerate(nums):
            if target - n in seen:
                return [seen[target - n], i]
            seen[n] = i
Verdict: AC | Runtime: 48ms | Memory: 17.2MB
Runtime Beats: 72% of human submissions
Decision: Accepted with competitive performance. No further iteration needed.
Trajectory: [(1, AC, all_passed, 48ms, 17.2MB, "initial hash map approach")]
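Since the request in this example asks for edge-case coverage, a few quick checks can back up the accepted verdict. The sketch below repeats the solution so that it runs on its own; the specific test inputs are illustrative choices, not cases taken from the benchmark.

```python
def twoSum(nums, target):
    seen = {}  # value -> index of its first occurrence
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    # Falls through and returns None when no pair sums to target.


def check_edge_cases():
    """Exercise inputs that commonly break hash-map two-sum solutions."""
    assert twoSum([3, 3], 6) == [0, 1]                   # duplicate values
    assert twoSum([3, 2, 4], 6) == [1, 2]                # must not reuse index 0 twice
    assert twoSum([-1, -2, -3, -4, -5], -8) == [2, 4]    # negative numbers
    assert twoSum([0, 4, 3, 0], 0) == [0, 3]             # zeros
    assert twoSum([1, 2], 10) is None                    # no valid pair
    return True
```

Checking the complement before inserting the current value is what keeps the `[3, 2, 4]` case from pairing an element with itself.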
Example 2: TLE resolved by algorithmic improvement
User: "Find the longest increasing subsequence length. My O(n^2) DP solution times out on large inputs."
Iteration 1 (user's code):
Algorithm: O(n^2) DP with nested loops
Verdict: TLE after 45/55 test cases
Reflection:
Root cause: O(n^2) is too slow for n=2500. Need O(n log n).
Improved reasoning: Use patience sorting — maintain a list of smallest
tail elements. For each number, binary search for its insertion point.
This gives O(n log n) time, O(n) space.
Iteration 2:
Code:
    import bisect

    def lengthOfLIS(nums):
        tails = []  # tails[k] = smallest tail of an increasing subsequence of length k + 1
        for x in nums:
            pos = bisect.bisect_left(tails, x)
            if pos == len(tails):
                tails.append(x)
            else:
                tails[pos] = x
        return len(tails)
Verdict: AC | Runtime: 62ms | Memory: 16.8MB
Runtime Beats: 85% of human submissions
Trajectory:
[(1, TLE, 45/55, -, -, "O(n^2) DP"),
(2, AC, 55/55, 62ms, 16.8MB, "O(n log n) patience sort")]
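When an algorithm is swapped wholesale, as in this example, one way to guard against a silent Wrong Answer is to cross-check the new version against the slow but simple one on small random inputs. The sketch below assumes the user's original solution was the standard quadratic DP; the function names, trial count, and value ranges are illustrative.

```python
import bisect
import random


def lis_quadratic(nums):
    """O(n^2) DP: dp[i] is the length of the longest increasing subsequence ending at i."""
    if not nums:
        return 0
    dp = [1] * len(nums)
    for i in range(len(nums)):
        for j in range(i):
            if nums[j] < nums[i]:
                dp[i] = max(dp[i], dp[j] + 1)
    return max(dp)


def lis_patience(nums):
    """O(n log n): keep the smallest tail for each subsequence length."""
    tails = []
    for x in nums:
        pos = bisect.bisect_left(tails, x)  # bisect_left keeps the subsequence strict
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)


def cross_check(trials=200, seed=0):
    """Compare both implementations on small random arrays, including empty ones."""
    rng = random.Random(seed)
    for _ in range(trials):
        nums = [rng.randint(-10, 10) for _ in range(rng.randint(0, 30))]
        assert lis_quadratic(nums) == lis_patience(nums), nums
    return True
```

The small value range forces many duplicates, which is where a mistaken `bisect_right` would show up as a mismatch.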
Example 3: Cross-language compilation error fix (Kotlin)
User: "Port this Java solution to Kotlin. It uses PriorityQueue for Dijkstra's algorithm."
Iteration 1:
Direct translation from Java to Kotlin.
Verdict: CE — "Type mismatch: inferred type is Comparator<IntArray> but
Comparator<in IntArray!>! was expected"
Reflection:
Root cause: Kotlin's type system treats Java generics differently.
PriorityQueue comparator needs explicit typing.
Improved reasoning: Use compareBy lambda instead of Comparator.comparing.
Iteration 2:
Code: val pq = PriorityQueue<IntArray>(compareBy { it[1] })
Verdict: AC | Runtime: 340ms | Memory: 52MB
Runtime Beats: 61%
Reflection (optimization):
Root cause: IntArray boxing overhead in PriorityQueue.
Improved reasoning: Replace the object-based queue with an indexed array and a manually maintained heap.
Iteration 3:
Switched to indexed array with manual heap. Reduced boxing.
Verdict: AC | Runtime: 215ms | Memory: 45MB
Runtime Beats: 84%
Trajectory:
[(1, CE, 0/0, -, -, "Java-to-Kotlin type mismatch"),
(2, AC, all, 340ms, 52MB, "fixed comparator syntax"),
(3, AC, all, 215ms, 45MB, "eliminated boxing overhead")]
| Failure Mode | Diagnostic Signal | Recovery Strategy |
|---|---|---|
| Wrong Answer | Number of tests passed before failure | Trace logic on the smallest failing case. Check edge cases: empty input, single element, duplicates, negative numbers, maximum constraint values. |
| Time Limit Exceeded | Tests passed before timeout | Analyze the dominant loop's complexity. Replace O(n^2) with O(n log n) or O(n) approaches. Consider: sorting, binary search, hash maps, monotonic stacks, segment trees. |
| Memory Limit Exceeded | Memory spike pattern | Switch from explicit storage to streaming/in-place computation. Replace 2D DP with rolling array. Use bitsets instead of boolean arrays. |
| Compile Error | Compiler error message | Fix the exact line cited. Common across languages: missing imports, type mismatches, syntax differences. In Kotlin/Go, check API compatibility. |
| Runtime Error | Stack trace or signal | Check: null/nil dereference, array index out of bounds, stack overflow from deep recursion (convert to iteration), integer division by zero. |
| Regression across iterations | Trajectory comparison | Revert to the best prior attempt. Analyze why the new change broke passing tests. Try an orthogonal fix strategy. |
Paper: EvoCodeBench: A Human-Performance Benchmark for Self-Evolving LLM-Driven Coding Systems — Zhang et al., 2026. Focus on Section 3 (benchmark design and the reflection-revision protocol), Section 4 (self-evolving agent construction), and Section 5 (results showing iteration-over-iteration gains and human-relative percentile analysis).