skills/egss-entropy-guided-stepwise-scaling/SKILL.md
Apply Entropy-Guided Stepwise Scaling (EGSS) to complex software engineering tasks like bug fixing, code generation, and refactoring. Uses entropy-based uncertainty detection to selectively branch exploration at high-uncertainty decision points, then consolidates test suites across trajectories and uses multi-model voting to select the best patch. Trigger phrases: "use EGSS to fix this bug", "entropy-guided scaling", "stepwise scaling for this task", "try multiple approaches with EGSS", "scale test-time compute for this fix", "use adaptive branching to solve this".
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add ndpvt-web/arxiv-claude-skills egss-entropy-guided-stepwise-scaling
EGSS enables Claude to tackle complex software engineering tasks (bug fixing, feature implementation, large refactors) by dynamically allocating computational effort where it matters most. Instead of uniformly generating many candidate solutions (expensive) or relying on a single attempt (unreliable), EGSS measures the entropy of tool/action choices at each step to identify high-uncertainty decision points, branches exploration selectively at those points, consolidates debugging signals from all trajectories into a robust test suite, and uses structured multi-criteria voting to select the best candidate patch. This yields 5-10% higher resolution rates while using 28%+ fewer tokens than naive ensemble approaches.
Entropy as a branching signal. At each step in a coding task, the agent chooses among actions: edit a file, run a test, search for a symbol, read documentation, and so on. The distribution over these choices has an entropy: H(a_t | s_t) = -sum_a P(a | s_t) log P(a | s_t), where s_t is the agent's state at step t and the sum runs over the candidate actions a. Most steps are near-deterministic routine operations with low entropy, such as reading the next relevant file or running the obvious test command. At critical junctures (choosing which file to edit, deciding on a fix strategy, selecting between two plausible root causes), entropy spikes. EGSS exploits the empirical finding that the entropy distribution is right-skewed: the vast majority of steps are low-entropy, and only a sparse tail of high-entropy steps represents semantically consequential branching points. By branching only at these high-entropy moments (resampling 4 candidate continuations and scoring them with a judge), EGSS concentrates compute on the decisions that actually matter.
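To make the entropy signal concrete, here is a minimal sketch of the formula above applied to two toy action distributions. The function names and the `threshold` value of 1.0 nats are illustrative assumptions; the source does not state which threshold EGSS uses.

```python
import math


def action_entropy(probs):
    """Shannon entropy (in nats) of a distribution over candidate actions."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_branch(probs, threshold=1.0):
    """Branch only when entropy exceeds the threshold.

    The threshold is a hypothetical value chosen for this example.
    """
    return action_entropy(probs) > threshold


# A routine step: one action dominates, so entropy is low.
routine_step = [0.97, 0.01, 0.01, 0.01]
# A critical juncture: several actions look roughly equally valid.
critical_step = [0.3, 0.3, 0.2, 0.2]

print(should_branch(routine_step))   # False
print(should_branch(critical_step))  # True
```

With four candidate actions the entropy can never exceed log(4), about 1.386 nats, so any threshold has to be chosen relative to the size of the action space.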
Test Consolidation Augmentation. Single-trajectory self-verification is unreliable — studies show ~36% of trajectories that exhibit explicit self-verification still produce incorrect patches ("self-deceptive debugging"). EGSS addresses this by extracting debugging actions and test intents from multiple trajectories, then synthesizing a consolidated test suite that covers functional completeness, boundary robustness, and behavioral consistency. Candidate patches are evaluated against this augmented suite, and only those exceeding a pass-rate threshold survive to the selection phase.
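The consolidation and filtering idea can be sketched roughly as follows. This is a toy illustration, not the paper's Algorithm 1: tests are modelled as named predicates, patches as plain callables, and the 0.9 threshold is an assumption borrowed from the ">90%" guidance in the workflow steps.

```python
def consolidate_tests(per_trajectory_tests):
    """Merge test intents from several trajectories, keeping the first check seen per name."""
    suite = {}
    for tests in per_trajectory_tests:
        for name, check in tests.items():
            suite.setdefault(name, check)
    return suite


def pass_rate(patch, suite):
    """Fraction of consolidated tests that the patch passes."""
    if not suite:
        return 0.0
    return sum(1 for check in suite.values() if check(patch)) / len(suite)


def surviving_patches(patches, suite, threshold=0.9):
    """Names of patches whose pass rate exceeds the (assumed) threshold."""
    return [name for name, patch in patches.items()
            if pass_rate(patch, suite) > threshold]


# Hypothetical task: clamp a value into the range [0, 10].
trajectory_1 = {"in_range": lambda clamp: clamp(5) == 5}
trajectory_2 = {
    "in_range": lambda clamp: clamp(5) == 5,
    "upper_bound": lambda clamp: clamp(11) == 10,
    "lower_bound": lambda clamp: clamp(-1) == 0,
}

patches = {
    "patch_a": lambda x: max(0, min(10, x)),
    "patch_b": lambda x: min(10, x),  # forgets the lower bound
}

suite = consolidate_tests([trajectory_1, trajectory_2])
print(surviving_patches(patches, suite))  # ['patch_a']
```

Here `patch_b` would look correct if checked only against the first trajectory's single test; the merged suite is what exposes the missing lower bound.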
Multi-Criteria Preference Selection. Rather than picking the patch that passes the most tests, EGSS employs multiple independent Preference Selectors that evaluate candidates across six dimensions: requirement relevance, code accuracy, change precision, dependency awareness, code quality, and functionality validation. Majority voting across selectors produces the final consensus patch.
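A minimal sketch of the voting step, under stated assumptions: each selector rates every candidate from 0 to 5 on the six dimensions and prefers the highest equal-weight total. The source does not specify how a selector combines the dimensions, so the summation, the rating scale, and all names below are hypothetical.

```python
from collections import Counter

DIMENSIONS = (
    "requirement_relevance",
    "code_accuracy",
    "change_precision",
    "dependency_awareness",
    "code_quality",
    "functionality_validation",
)


def ratings(*values):
    """Helper: pair six numeric ratings with the six dimension names."""
    return dict(zip(DIMENSIONS, values))


def selector_choice(scores):
    """One selector's pick: the candidate with the highest summed rating (assumed rule)."""
    return max(scores, key=lambda cand: sum(scores[cand][d] for d in DIMENSIONS))


def majority_vote(all_selector_scores):
    """Consensus patch: the candidate chosen by the most selectors."""
    votes = Counter(selector_choice(scores) for scores in all_selector_scores)
    return votes.most_common(1)[0][0]


# Three hypothetical selectors rating two candidate patches.
selectors = [
    {"A": ratings(5, 4, 4, 4, 3, 5), "B": ratings(3, 4, 5, 4, 4, 3)},  # prefers A
    {"A": ratings(4, 4, 3, 4, 4, 4), "B": ratings(4, 5, 5, 4, 4, 4)},  # prefers B
    {"A": ratings(5, 5, 4, 3, 4, 5), "B": ratings(4, 4, 4, 4, 4, 4)},  # prefers A
]

print(majority_vote(selectors))  # A
```

Using an odd number of selectors avoids ties between two candidates; with a tie, `Counter.most_common` returns whichever candidate was counted first.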
1. Analyze the task and identify uncertainty. Read the issue description, relevant source files, and existing tests. Form 2-3 hypotheses about the root cause or implementation strategy. If only one plausible path exists, proceed directly without branching.
2. Generate an initial trajectory. Begin working on the most likely hypothesis. At each decision step, assess your confidence: Are you choosing between meaningfully different actions (e.g., editing file A vs. file B), or is the next step obvious?
3. Detect high-entropy branching points. A step where multiple actions seem roughly equally valid (different fix strategies, different files to modify, different test approaches) is a high-entropy moment. Flag it explicitly.
4. Branch at high-entropy points. At each flagged branching point, generate 2-4 distinct continuation strategies. For each, write a brief rationale (1-2 sentences) explaining the approach. Do NOT branch at routine steps like "read the file I already know I need."
5. Score and prune branches. For each candidate continuation, evaluate: Does it address the stated requirements? Is the code change minimal and precise? Does it respect existing dependencies? Prune branches that score poorly, keeping the 1-2 strongest candidates to continue exploring.
6. Develop surviving candidates into complete patches. For each surviving branch, complete the implementation through to a testable state. Produce a concrete diff for each candidate.
7. Consolidate test coverage across candidates. Examine the debugging and testing actions from all branches. Synthesize a combined test suite that covers: (a) the core functional requirement, (b) edge cases that different branches revealed, (c) regression tests for behavior that should not change. Write these tests explicitly.
8. Evaluate all candidate patches against the consolidated test suite. Run each patch against the full test suite. Record pass rates. Discard any candidate that falls below the pass-rate threshold (aim for >90% of consolidated tests passing).
9. Apply multi-criteria selection to surviving candidates. For each surviving patch, evaluate across six dimensions: requirement relevance (does it address the actual issue?), code accuracy (is the logic correct?), change precision (are changes minimal?), dependency awareness (does it break imports/interfaces?), code quality (is it readable and maintainable?), functionality validation (does it actually work end-to-end?). Select the candidate with the strongest aggregate profile.
10. Present the selected patch with rationale. Show the user the chosen solution, explain why it was selected over alternatives, and report which tests pass. If multiple candidates are close in quality, present the top 2 with tradeoffs.
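The score-and-prune step in the workflow above could be sketched like this. The three 0-5 ratings mirror the three questions asked of each candidate continuation; the rating scale, the equal weighting, and the branch names and scores are illustrative assumptions rather than anything the source prescribes.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    name: str
    rationale: str               # the 1-2 sentence explanation written when branching
    addresses_requirements: int  # hypothetical 0-5 rating
    minimal_and_precise: int     # hypothetical 0-5 rating
    respects_dependencies: int   # hypothetical 0-5 rating

    def score(self) -> int:
        """Equal-weight total of the three ratings (an assumed aggregation)."""
        return (self.addresses_requirements
                + self.minimal_and_precise
                + self.respects_dependencies)


def prune(branches, keep=2):
    """Keep the `keep` highest-scoring branches; ties keep their original order."""
    return sorted(branches, key=lambda b: b.score(), reverse=True)[:keep]


branches = [
    Branch("A", "Fix the problem at the validation layer.", 5, 5, 4),
    Branch("B", "Change the storage configuration.", 2, 2, 3),
    Branch("C", "Patch the serialization layer.", 3, 4, 4),
]

print([b.name for b in prune(branches)])  # ['A', 'C']
```

Keeping two survivors rather than one leaves room for the later test consolidation and selection steps to overturn an early ranking.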
Example 1: Ambiguous bug fix with multiple plausible root causes
User: "The /api/users endpoint returns 500 when the email field contains unicode characters. Fix it."
Approach:
Trace the email field at /api/users through validation, database insertion, and response serialization.
Branch A: Update the email validator to use a Unicode-aware regex pattern (\p{L} instead of [a-zA-Z]).
Branch B: Alter the database column collation to utf8mb4_unicode_ci.
Branch C: Add explicit UTF-8 encoding in the serialization layer.
Consolidated tests cover: an email with an accented name, emoji in the local part, CJK characters, mixed ASCII/Unicode, and existing ASCII-only emails (regression).
Output:
Fix: Updated email validator in src/validators/user.ts to use Unicode-aware
regex pattern \p{L} instead of [a-zA-Z], and added \p{Emoji} support.
Tests added (all passing):
- unicode_email_accented_chars ........... PASS
- unicode_email_emoji_local_part ......... PASS
- unicode_email_cjk_characters ........... PASS
- unicode_email_mixed_ascii_unicode ...... PASS
- regression_ascii_only_email ............ PASS
- regression_existing_user_lookup ........ PASS
Alternative considered: Database collation change (Branch B) — rejected because
the failure occurs at the validation layer before the DB is reached.
Example 2: Feature implementation with unclear design direction
User: "Add rate limiting to our Express API. We're getting hammered by bots."
Approach:
Output:
Two viable rate-limiting implementations ready:
Option A (Recommended for multi-instance deployments):
- Redis-backed sliding window in src/middleware/rateLimit.ts
- Requires: redis package, REDIS_URL env variable
- Survives process restarts, works across load balancer
Option B (Recommended for single-instance or quick deploy):
- In-memory token bucket in src/middleware/rateLimit.ts
- Zero new dependencies
- Rate limits reset on process restart
Both pass all 5 rate-limiting tests. Choose based on your infrastructure.
Example 3: Refactoring with cascading uncertainty
User: "Refactor the OrderProcessor class — it's 800 lines and does too much."
Approach:
Map the responsibilities of OrderProcessor: validation, pricing, inventory, payment, notification.
[...] processOrder() end-to-end flow.
Paper: EGSS: Entropy-guided Stepwise Scaling for Reliable Software Engineering (Mao et al., 2026). Look for: Algorithm 1 (Test Consolidation Augmentation), the tool entropy formula (Equation 1), the trajectory scoring function (Equation 2), and the empirical analysis of entropy distribution showing right-skewed concentration in the low-entropy regime.