skills/aero-autonomous-evolutionary-reasoning/SKILL.md
Apply the AERO dual-loop self-evolution framework to iteratively improve reasoning on complex tasks. Uses entropy-based difficulty calibration, counterfactual verification, and staggered role refinement to solve hard problems without external oracles. Triggers: 'reason through this step by step with self-correction', 'solve this hard problem autonomously', 'verify your reasoning with counterfactuals', 'self-critique and improve your answer', 'use dual-loop reasoning on this', 'iteratively refine your solution'
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add ndpvt-web/arxiv-claude-skills aero-autonomous-evolutionary-reasoning
This skill enables Claude to apply the AERO dual-loop reasoning framework from Gao et al. (2026) to complex coding, math, and logic tasks. Rather than producing a single-shot answer, Claude decomposes its reasoning into three functional roles — Questioner (problem decomposition), Solver (multi-path solution generation), and Critic (counterfactual verification) — cycling through an inner synthesis loop and an outer refinement loop. The key insight is entropy-based difficulty calibration: focus effort on sub-problems at the boundary of your capability (the "Zone of Proximal Development"), skip what's trivially solved, and flag what's genuinely intractable.
Dual-Loop Self-Evolution. AERO structures reasoning as two nested loops. The inner loop is an experience synthesis cycle: (1) decompose the problem into sub-tasks, (2) generate multiple independent solution trajectories for each sub-task, (3) measure response entropy across trajectories to calibrate difficulty, and (4) apply counterfactual correction to verify correctness. The outer loop aggregates verified insights to refine the overall approach, feeding improved understanding back into the next inner-loop iteration.
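The control flow of the two nested loops can be sketched as follows. This is a minimal illustration, not the authors' implementation: `dual_loop`, `inner_pass`, `review`, and `max_passes` are hypothetical names, and the toy stubs stand in for the actual decompose/solve/verify work.

```python
from typing import Callable

# Hypothetical signatures (assumptions, not taken from the AERO code base):
#   inner_pass(problem, insights) -> mapping of sub-task name to verified answer
#   review(solution)              -> list of issues found in the assembled solution
InnerPass = Callable[[str, list[str]], dict[str, str]]
Review = Callable[[dict[str, str]], list[str]]


def dual_loop(problem: str, inner_pass: InnerPass, review: Review,
              max_passes: int = 3) -> tuple[dict[str, str], int]:
    """Repeat inner passes until the outer review is clean or the budget runs out."""
    insights: list[str] = []
    solution: dict[str, str] = {}
    for passes in range(1, max_passes + 1):
        solution = inner_pass(problem, insights)  # inner loop: decompose, solve, verify
        issues = review(solution)                 # outer loop: holistic review
        if not issues:
            return solution, passes
        insights.extend(issues)                   # feed issues into the next inner pass
    return solution, max_passes


# Toy stubs: the edge case is only handled after the review has flagged it.
def toy_inner(problem: str, insights: list[str]) -> dict[str, str]:
    handled = "empty input not handled" in insights
    return {"algorithm": "two-pointer scan",
            "edge_cases": "handled" if handled else "missing"}


def toy_review(solution: dict[str, str]) -> list[str]:
    return [] if solution["edge_cases"] == "handled" else ["empty input not handled"]


solution, passes = dual_loop("merge two sorted lists", toy_inner, toy_review)
```

With these stubs the first pass is flagged by the review and the second pass comes back clean, so the loop stops after two passes. The pass budget is an added safeguard for the sketch, not something the framework description specifies.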
Entropy-Based ZPD Positioning. For each sub-problem, AERO generates N independent solution attempts and clusters the answers. Normalized Shannon entropy across clusters tells you where the problem sits: low entropy (all attempts agree) means the problem is mastered — move on. High entropy (answers are scattered) means the problem is too hard or ill-defined — flag it. Medium entropy (some agreement, some divergence) identifies the Zone of Proximal Development — this is where focused reasoning effort pays off. The formula is H̄ = -1/log₂(n) * Σ P(cⱼ)log₂P(cⱼ), normalized to [0,1].
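A small sketch of the normalized entropy computation may help. It rests on two assumptions that the text above does not settle: answers are clustered by exact string match, and n is the number of attempts (so the result is 1.0 only when every attempt lands in its own cluster). The function name `normalized_entropy` is illustrative.

```python
import math
from collections import Counter


def normalized_entropy(answers: list[str]) -> float:
    """Shannon entropy of the answer clusters, scaled to [0, 1].

    Assumption: clusters are exact-match groups, and n is the number of
    attempts, so log2(n) is the largest entropy the attempts can produce.
    """
    n = len(answers)
    if n < 2:
        return 0.0  # a single attempt carries no disagreement signal
    counts = Counter(answers)
    # p * log2(1/p) summed over clusters, with p = count / n
    h = sum((c / n) * math.log2(n / c) for c in counts.values())
    return h / math.log2(n)


agree = normalized_entropy(["42", "42", "42", "42"])    # all attempts agree
outlier = normalized_entropy(["42", "42", "42", "41"])  # one outlier
scattered = normalized_entropy(["1", "2", "3", "4"])    # no two attempts agree
```

Under these assumptions, four attempts with a single outlier score roughly 0.41, between the fully agreeing case (0.0) and the fully scattered case (1.0).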
Independent Counterfactual Correction (ICC). Instead of simply checking whether an answer "looks right," ICC forces the Critic role to re-solve the problem under the counterfactual assumption that the proposed answer is wrong. If this independent re-derivation converges to the same answer, confidence is high. If it diverges, the original answer likely contains an error. This breaks confirmation bias — the Critic can't just rubber-stamp the Solver's work because it's required to reason from an adversarial starting point.
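As a rough illustration of the convergence test, the sketch below compares a candidate answer against an independent re-derivation that never sees the candidate's reasoning. `icc_check` and `sum_by_loop` are hypothetical names, and the toy task (summing 1 through 100) stands in for a real sub-problem.

```python
from typing import Callable


def icc_check(candidate: int, rederive: Callable[[], int]) -> tuple[bool, int]:
    """Return (verified, independent_answer).

    `rederive` must solve the problem from scratch, by a different route,
    as if the candidate were known to be wrong.
    """
    independent = rederive()
    return independent == candidate, independent


def sum_by_loop(n: int) -> int:
    """Independent route: add the terms one by one instead of using a formula."""
    total = 0
    for k in range(1, n + 1):
        total += k
    return total


good_candidate = 100 * 101 // 2   # closed form n(n+1)/2
buggy_candidate = 100 * 99 // 2   # off-by-one variant of the closed form

verified, _ = icc_check(good_candidate, lambda: sum_by_loop(100))
rejected, alternative = icc_check(buggy_candidate, lambda: sum_by_loop(100))
```

The correct closed form converges with the loop-based re-derivation, while the off-by-one variant diverges and surfaces the competing answer for inspection.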
Decompose the problem into sub-tasks. Parse the user's request into discrete reasoning units. For a coding problem, this means: understanding the spec, designing the algorithm, handling edge cases, implementing, and verifying. Label each sub-task explicitly.
Generate multiple solution trajectories per sub-task. For each non-trivial sub-task, produce 2-4 independent reasoning paths. Vary your approach: try brute force vs. optimized, recursive vs. iterative, different data structures. Keep each trajectory self-contained.
Cluster answers and compute entropy. Group the trajectories by their conclusions. If all paths agree (low entropy, H̄ < 0.2), accept the consensus and move on. If paths mostly agree with one outlier (medium entropy, 0.2 ≤ H̄ ≤ 0.7), this sub-task is in the ZPD — focus effort here. If paths wildly disagree (high entropy, H̄ > 0.7), the sub-task may need reframing or the user's input may be ambiguous.
Apply counterfactual correction to ZPD sub-tasks. For each sub-task in the ZPD, take the top candidate answer and assume it is wrong. Re-derive the answer from scratch under this adversarial assumption. If the re-derivation converges to the same result, mark it as verified. If it diverges, investigate the discrepancy, since this is where bugs hide.
Construct a verified solution from confirmed sub-task outputs. Assemble the full solution using only verified sub-task results. For unverified sub-tasks, explicitly note uncertainty and present the competing alternatives to the user.
Run the outer refinement loop. Review the assembled solution holistically. Check for cross-sub-task inconsistencies (e.g., a variable renamed in one part but not another, an edge case handled in the algorithm but not the implementation). Feed any issues back into a second inner loop pass.
Apply staggered role emphasis. On the first pass, prioritize problem decomposition quality (Questioner role). On subsequent passes, shift emphasis to solution correctness (Solver) and verification rigor (Critic). This prevents the common failure mode where you keep refining answers to questions that were poorly framed to begin with.
Present the final answer with a confidence map. For each sub-task, indicate whether it was trivially solved (mastered zone, low entropy), verified through ICC (ZPD), or remains uncertain (high-entropy chaos zone, or a ZPD sub-task whose ICC re-derivation diverged). Give the user actionable information about where the solution is robust and where it needs human review.
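The zone thresholds and the confidence map from the steps above can be tied together in a short sketch. It assumes exact-match clustering and takes n in the entropy formula to be the number of trajectories. `confidence_map`, `icc_verified`, and the `UNCERTAIN` label are illustrative names rather than part of the framework.

```python
import math
from collections import Counter
from typing import Callable

LOW, HIGH = 0.2, 0.7  # zone boundaries from the clustering step


def normalized_entropy(answers: list[str]) -> float:
    """Cluster entropy scaled to [0, 1]; n is assumed to be the attempt count."""
    n = len(answers)
    if n < 2:
        return 0.0
    counts = Counter(answers)
    return sum((c / n) * math.log2(n / c) for c in counts.values()) / math.log2(n)


def confidence_map(trajectories: dict[str, list[str]],
                   icc_verified: Callable[[str, str], bool]) -> dict[str, str]:
    """Label each sub-task MASTERED, VERIFIED, or UNCERTAIN."""
    labels: dict[str, str] = {}
    for subtask, answers in trajectories.items():
        h = normalized_entropy(answers)
        if h < LOW:
            labels[subtask] = "MASTERED"   # consensus, accept and move on
        elif h > HIGH:
            labels[subtask] = "UNCERTAIN"  # scattered answers, needs reframing
        else:
            # ZPD: run the counterfactual check on the most common answer
            top = Counter(answers).most_common(1)[0][0]
            labels[subtask] = "VERIFIED" if icc_verified(subtask, top) else "UNCERTAIN"
    return labels


# Toy trajectories; the lambda stands in for a real counterfactual re-derivation.
labels = confidence_map(
    {
        "topological sort": ["kahn", "kahn", "kahn"],
        "dp formulation": ["layered", "layered", "layered", "greedy"],
        "tie-breaking rule": ["by weight", "by deadline", "by start time"],
    },
    icc_verified=lambda subtask, answer: answer == "layered",
)
```

Here the first sub-task lands in the mastered zone, the second falls in the ZPD and passes the stand-in check, and the third is flagged for human review.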
Example 1: Debugging a Concurrency Bug
User: "My Go worker pool deadlocks intermittently. Here's the code. Find and fix the bug."
Approach:
context.WithCancel for clean shutdown, present both changes with confidence map
Output:
// VERIFIED (ICC-confirmed): Buffer result channel to task count, not pool size
results := make(chan Result, len(tasks)) // was: make(chan Result)
// VERIFIED (consensus): Add context-based shutdown
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
Example 2: Optimizing a SQL Query
User: "This query takes 12 seconds on 2M rows. Make it fast."
Approach:
Output:
-- VERIFIED: Materialize the user-date aggregation
WITH user_latest AS MATERIALIZED (
SELECT user_id, MAX(created_at) AS latest
FROM events
WHERE status = 'active'
GROUP BY user_id
)
SELECT u.*, e.*
FROM users u
JOIN user_latest ul ON u.id = ul.user_id
JOIN events e ON e.user_id = ul.user_id AND e.created_at = ul.latest;
-- RECOMMENDED (additive): Composite index
CREATE INDEX CONCURRENTLY idx_events_user_created
ON events (user_id, created_at) WHERE status = 'active';
Example 3: Implementing a Complex Algorithm
User: "Implement an interval scheduling algorithm that handles weighted jobs with deadlines and dependencies."
Approach:
Output:
Confidence Map:
[MASTERED] Topological sort (Kahn's) — all trajectories agreed
[VERIFIED] DP formulation (layered with dependency constraints) — ICC confirmed
[MASTERED] Deadline filtering (pre-DP feasibility check) — all trajectories agreed
[VERIFIED] Binary search for compatible intervals — ICC confirmed
Algorithm: O(n log n) time, O(n) space after topological preprocessing
Gao, Z., Ma, J., Li, X., Li, P., & Qu, N. (2026). AERO: Autonomous Evolutionary Reasoning Optimization via Endogenous Dual-Loop Feedback. arXiv:2602.03084v2. https://arxiv.org/abs/2602.03084v2
Key sections to read: Section 3.2 (entropy-based ZPD positioning and the normalized entropy formula), Section 3.3 (Independent Counterfactual Correction mechanism), and Section 3.4 (Staggered Training Strategy for role synchronization). Code: https://github.com/mira-ai-lab/AERO