skills/evolve-skill-from-traces/SKILL.md
Evolve SKILL.md files from agent execution traces using a three-stage pipeline: trajectory collection from observed runs, parallel multi-agent patch proposal for error and success analysis, and conflict-free consolidation of overlapping edits via prevalence-weighting. Based on the Trace2Skill methodology.
Transform raw agent execution traces into a validated SKILL.md through a three-stage pipeline: trajectory collection, parallel multi-agent patch proposal, and conflict-free consolidation. This skill bridges the gap between observed agent behavior and documented procedures, turning successful runs into reproducible skills.
- traces -- set of agent execution logs or session transcripts (minimum 10 successful runs recommended)
- target_skill -- path to an existing SKILL.md to evolve, or "new" for skill extraction from scratch
- analyst_count -- number of parallel analyst agents to spawn (default: 4)
- held_out_ratio -- fraction of traces reserved for validation, not used in drafting (default: 0.2)

Gather agent session logs, tool-call sequences, or conversation transcripts that demonstrate the target behavior. Filter for runs tagged as successful. Normalize into a standard trace format: a sequence of (state, action, outcome) triples with timestamps.
trace_entry:
  state: <context before the action>
  action: <tool call, command, or decision made>
  outcome: <result, output, or state change>
  timestamp: <ISO 8601>
Reserve held_out_ratio (default 20%) of the traces for validation in Step 7, and use the remainder for Steps 2-6.

# Example: count available traces and compute partition
total_traces=$(ls traces/*.json | wc -l)
held_out=$(echo "$total_traces * 0.2 / 1" | bc)
drafting=$((total_traces - held_out))
echo "Drafting: $drafting traces, Held-out: $held_out traces"
Expected: A normalized trace set partitioned into drafting (80%) and held-out (20%) subsets. Each trace entry contains state, action, outcome, and timestamp fields.
On failure: If fewer than 10 successful traces are available, collect more before proceeding. Small trace sets produce overfitted skills that fail on novel inputs. If traces lack timestamps, assign ordinal sequence numbers instead.
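The partition described above can also be sketched in Python. This is a minimal sketch, not part of the skill itself: the helper name `partition_traces`, the fixed shuffle seed, and the truncating rounding (chosen to mirror the `bc` example) are assumptions made for illustration.

```python
import random

MIN_TRACES = 10  # minimum number of successful runs recommended above


def partition_traces(trace_ids, held_out_ratio=0.2, seed=0):
    """Split trace IDs into (drafting, held_out) lists.

    Hypothetical helper: the seed and the truncating rounding are
    illustrative choices, not prescribed by the skill.
    """
    trace_ids = list(trace_ids)
    if len(trace_ids) < MIN_TRACES:
        raise ValueError(f"need at least {MIN_TRACES} traces, got {len(trace_ids)}")
    shuffled = list(trace_ids)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_held_out = int(len(shuffled) * held_out_ratio)  # truncates, like the bc example
    return shuffled[n_held_out:], shuffled[:n_held_out]


drafting, held_out = partition_traces(range(1, 26))
print(f"Drafting: {len(drafting)} traces, Held-out: {len(held_out)} traces")
```

Shuffling before the split avoids reserving only the most recent runs for validation; whether that matters depends on how the traces were collected.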
Group normalized traces by outcome pattern. Identify the invariant core (steps present in all successful trajectories) versus variant branches (steps that differ across runs). The invariant core becomes the skeleton for the skill procedure.
invariant_core:
  - action: "read_input_file"
    frequency: 100%
  - action: "validate_schema"
    frequency: 100%
  - action: "transform_data"
    frequency: 100%
variant_branches:
  - action: "retry_on_timeout"
    frequency: 35%
    condition: "network latency > 2s"
  - action: "fallback_to_cache"
    frequency: 15%
    condition: "API returns 503"
Expected: A clear separation between invariant core actions (present in all successful traces) and variant branches (conditional, present in a subset). Each variant branch has a frequency count and triggering condition.
On failure: If no invariant core emerges (traces are too heterogeneous), the target behavior may actually be multiple distinct skills. Split traces into coherent subgroups by outcome type and process each group separately.
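One way to separate the invariant core from the variant branches is to count, per action, how many traces contain it. The sketch below assumes each trace has already been reduced to a list of action names; the function name `split_core_and_variants` is hypothetical, and triggering conditions are not inferred here (they still need to be read from the trace states).

```python
from collections import Counter


def split_core_and_variants(traces):
    """Separate actions seen in every trace from those seen in only some.

    `traces` is a list of action-name lists, one per successful run.
    Returns (invariant_core, variant_branches): the core keeps first-seen
    order, and variant_branches maps each action to its frequency.
    """
    if not traces:
        return [], {}
    # Count each action once per trace, however often it repeats in that trace.
    counts = Counter(action for trace in traces for action in set(trace))
    total = len(traces)
    first_seen = list(dict.fromkeys(a for trace in traces for a in trace))
    core = [a for a in first_seen if counts[a] == total]
    variants = {a: counts[a] / total for a in first_seen if counts[a] < total}
    return core, variants


runs = [
    ["read_input_file", "validate_schema", "transform_data"],
    ["read_input_file", "validate_schema", "retry_on_timeout", "transform_data"],
    ["read_input_file", "validate_schema", "transform_data"],
    ["read_input_file", "validate_schema", "fallback_to_cache", "transform_data"],
]
core, variants = split_core_and_variants(runs)
print(core)
print(variants)
```

An empty core from this kind of count is the signal described above that the traces may cover several distinct skills.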
From the invariant core, generate an initial SKILL.md with frontmatter, When to Use (derived from entry conditions across traces), Inputs (parameters that varied across runs), and a Procedure section with one step per invariant action.
# Scaffold the skeleton if creating a new skill
mkdir -p skills/<skill-name>/
# Skeleton structure
## When to Use
- <derived from common entry conditions>
## Inputs
- **Required**: <parameters present in all traces>
- **Optional**: <parameters present in some traces>
## Procedure
### Step N: <invariant action label>
<most common implementation from traces>
**Expected:** <most common success outcome>
**On failure:** <placeholder -- refined in Steps 4-6>
Expected: A syntactically valid SKILL.md skeleton with frontmatter, When to Use, Inputs, and a Procedure section containing one step per invariant core action. Expected blocks reflect observed outcomes; On failure blocks are placeholders.
On failure: If the skeleton exceeds 500 lines before adding variant branches, the invariant core is too granular. Merge adjacent actions that always occur together into single steps. Target 5-10 procedure steps.
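The Procedure part of the skeleton can be generated mechanically from the invariant core. The following is a rough sketch under the assumption that each core action maps to exactly one step; `render_procedure` is a hypothetical helper, and the placeholder lines are copied from the skeleton structure shown above.

```python
def render_procedure(core_actions):
    """Render one placeholder step per invariant action, in the skeleton's layout."""
    lines = ["## Procedure"]
    for n, action in enumerate(core_actions, start=1):
        lines += [
            f"### Step {n}: {action}",
            "<most common implementation from traces>",
            "**Expected:** <most common success outcome>",
            "**On failure:** <placeholder -- refined in Steps 4-6>",
        ]
    return "\n".join(lines)


skeleton = render_procedure(["read_input_file", "validate_schema", "transform_data"])
print(skeleton)
```

If the rendered step count falls outside the 5-10 target, merging adjacent core actions before rendering keeps the skeleton within the line budget.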
Spawn N analyst agents (4-6 recommended), each reviewing the full drafting trace set against the draft skeleton from a different analytical lens. Each agent produces a structured patch: section, old text, new text, rationale.
Assign one lens per analyst:
| Analyst | Lens | Focus |
|---------|------|-------|
| 1 | Correctness | Does the skeleton capture all success paths? Are any invariant steps missing? |
| 2 | Efficiency | Are there redundant steps? Can any steps be merged or parallelized? |
| 3 | Robustness | Which failure modes are unhandled? What should On failure blocks contain? |
| 4 | Edge Cases | Which variant branches should become conditional steps or pitfalls? |
| 5 (optional) | Clarity | Is each step unambiguous? Can an agent follow it mechanically? |
| 6 (optional) | Generalizability | Are there trace-specific artifacts that should be abstracted? |
Each analyst agent receives the drafting traces, the draft skeleton from Step 3, and its assigned lens.
Each analyst returns a list of structured patches:
patch:
  analyst: "robustness"
  section: "Procedure > Step 3"
  old_text: "**On failure:** <placeholder>"
  new_text: "**On failure:** If the API returns 503, wait 5 seconds and retry up to 3 times. If retries are exhausted, fall back to the cached response from the previous successful run."
  rationale: "Traces #4, #7, #12 show 503 errors resolved by retry. Trace #15 shows cache fallback when retries fail."
  supporting_traces: [4, 7, 12, 15]
Expected: Each analyst returns 3-10 structured patches with section references, old/new text, rationale, and supporting trace IDs. All patches are collected into a single patch set.
On failure: If an analyst returns no patches, their lens may not apply to this skill. This is acceptable -- not every lens surfaces issues. If an analyst returns vague patches without trace references, reject and re-prompt with the requirement for concrete supporting_traces.
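The rule of rejecting patches that lack trace references can be expressed as a small filter over the collected patch set. This is a sketch only: the `Patch` dataclass mirrors the fields of the patch example above, and `triage_patches` is a hypothetical helper name.

```python
from dataclasses import dataclass, field


@dataclass
class Patch:
    analyst: str
    section: str
    old_text: str
    new_text: str
    rationale: str
    supporting_traces: list = field(default_factory=list)


def triage_patches(patches):
    """Return (accepted, rejected); a patch with no supporting traces is rejected."""
    accepted = [p for p in patches if p.supporting_traces]
    rejected = [p for p in patches if not p.supporting_traces]
    return accepted, rejected


grounded = Patch(
    analyst="robustness",
    section="Procedure > Step 3",
    old_text="**On failure:** <placeholder>",
    new_text="**On failure:** Retry up to 3 times, then fall back to the cache.",
    rationale="Traces #4, #7, #12 show 503 errors resolved by retry.",
    supporting_traces=[4, 7, 12, 15],
)
vague = Patch("clarity", "Procedure > Step 1", "old", "new", "reads better")
accepted, rejected = triage_patches([grounded, vague])
print(len(accepted), "accepted;", len(rejected), "to re-prompt")
```

Rejected patches go back to their analyst with the request for concrete supporting_traces rather than being dropped silently.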
Compare all patches from Step 4 for overlapping edits. Classify each pair of overlapping patches into one of three categories.
| Conflict Type | Definition | Resolution |
|---------------|-----------|------------|
| Compatible | Different sections, no overlap | Merge directly |
| Complementary | Same section, additive (both add content, no contradiction) | Combine text |
| Contradictory | Same section, mutually exclusive (one adds X, other removes X or adds Y instead) | Needs resolution in Step 6 |
conflict_report:
  total_patches: 24
  compatible: 18
  complementary: 4
  contradictory: 2
  contradictions:
    - section: "Procedure > Step 5"
      patch_a: {analyst: "efficiency", action: "remove step"}
      patch_b: {analyst: "robustness", action: "add retry logic"}
      supporting_traces_a: [2, 8, 11]
      supporting_traces_b: [4, 7, 12, 15]
Expected: A conflict report listing all patch pairs, their classification, and for contradictions, the supporting trace counts for each side.
On failure: If the classification is ambiguous (a patch both adds and modifies text in the same section), split it into two patches: one additive, one modifying. Re-classify the smaller patches.
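The three-way classification can be sketched as a pairwise check. The sketch assumes each patch carries an `additive` flag set by its analyst (true when the patch only adds content); that flag is not part of the patch format shown earlier, and two additive patches can still contradict each other in meaning, so a human or agent review of the complementary bucket remains advisable. Note that this sketch counts pairs, whereas the example report above tallies patches.

```python
from itertools import combinations


def classify_pair(a, b):
    """Classify two patches as compatible, complementary, or contradictory."""
    if a["section"] != b["section"]:
        return "compatible"  # different sections, no overlap
    if a["additive"] and b["additive"]:
        return "complementary"  # same section, both only add content
    return "contradictory"  # same section, at least one removes or replaces


def conflict_report(patches):
    """Count every unordered pair of patches by conflict type."""
    report = {"compatible": 0, "complementary": 0, "contradictory": 0}
    for a, b in combinations(patches, 2):
        report[classify_pair(a, b)] += 1
    return report


patches = [
    {"analyst": "efficiency", "section": "Procedure > Step 5", "additive": False},
    {"analyst": "robustness", "section": "Procedure > Step 5", "additive": True},
    {"analyst": "clarity", "section": "Procedure > Step 2", "additive": True},
    {"analyst": "edge_cases", "section": "Procedure > Step 2", "additive": True},
]
print(conflict_report(patches))
```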
Merge all patches into a single consolidated SKILL.md using a three-tier resolution strategy.
When the supporting-trace counts of two contradictory patches are tied, use the argumentation skill to evaluate which patch better serves the skill's stated purpose.

consolidation_log:
  applied_directly: 18
  combined: 4
  resolved_by_prevalence: 1
  resolved_by_argumentation: 1
  rejected_alternatives_documented: 2
After consolidation, verify the resulting SKILL.md, for instance with the review-skill-format skill for agentskills.io compliance.
Expected: A single consolidated SKILL.md incorporating patches from all analysts. Contradictions are resolved with documented rationale. The rejected alternative for each contradiction appears as a pitfall or note.
On failure: If consolidation produces an internally inconsistent document (e.g., Step 3 assumes a file exists but Step 2 was removed by an efficiency patch), revert the conflicting edit and keep the original skeleton text for that section. Flag the inconsistency for manual review.
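Prevalence-based resolution of a contradiction can be sketched as a comparison of supporting-trace counts. This assumes that "prevalence" means the number of distinct traces backing each patch, which is how the conflict report presents it; `resolve_by_prevalence` is a hypothetical name, and a tie is returned as `None` so the caller can escalate it.

```python
def resolve_by_prevalence(patch_a, patch_b):
    """Pick the contradictory patch backed by more distinct traces.

    Returns (winner, loser), or None on a tie so the caller can escalate.
    """
    support_a = len(set(patch_a["supporting_traces"]))
    support_b = len(set(patch_b["supporting_traces"]))
    if support_a == support_b:
        return None
    return (patch_a, patch_b) if support_a > support_b else (patch_b, patch_a)


efficiency = {"analyst": "efficiency", "supporting_traces": [2, 8, 11]}
robustness = {"analyst": "robustness", "supporting_traces": [4, 7, 12, 15]}
winner, loser = resolve_by_prevalence(efficiency, robustness)
print("apply:", winner["analyst"], "| document as pitfall:", loser["analyst"])
```

The losing patch is kept rather than discarded, since the rejected alternative is meant to appear in the consolidated skill as a pitfall or note.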
Run the consolidated skill mentally against held-out traces (the 20% reserved in Step 1). Verify that Expected/On failure blocks match observed outcomes in traces the skill has never seen.
validation_results:
  held_out_traces: 5
  full_match: 4
  partial_match: 1
  no_match: 0
  mismatches:
    - trace_id: 23
      step: 4
      expected: "API returns 200"
      actual: "API returns 429 (rate limited)"
      action: "Add rate-limit handling to On failure block"
- For a newly extracted skill, follow create-skill for directory creation, registry entry, and symlink setup
- For an existing skill, follow evolve-skill for version bumping and translation sync

# Final validation: line count
lines=$(wc -l < skills/<skill-name>/SKILL.md)
[ "$lines" -le 500 ] && echo "OK ($lines lines)" || echo "FAIL: $lines lines > 500"
Expected: At least 80% of held-out traces match the skill procedure end-to-end. The skill is registered in skills/_registry.yml with correct metadata.
On failure: If validation fails (>20% mismatch), the skill has overfit to the drafting traces. Add the mismatched traces to the drafting set and re-run from Step 2. If validation continues to fail after two iterations, the behavior may be too variable for a single skill -- consider splitting into multiple skills by outcome type.
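The 80% acceptance threshold can be checked directly from the validation results. In this sketch only full matches count toward the threshold; treating partial matches as failures is an assumption based on the "end-to-end" wording, and `validation_passes` is a hypothetical helper name.

```python
def validation_passes(results, threshold=0.8):
    """True when the share of end-to-end matches meets the threshold."""
    total = results["full_match"] + results["partial_match"] + results["no_match"]
    if total == 0:
        raise ValueError("no held-out traces to validate against")
    return results["full_match"] / total >= threshold


results = {"full_match": 4, "partial_match": 1, "no_match": 0}
print("validation passed" if validation_passes(results) else "re-run from Step 2")
```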
- evolve-skill -- simpler human-directed evolution (complementary: use when traces are unavailable)
- create-skill -- for newly extracted skills that do not exist yet; used in Step 7 for registration
- review-skill-format -- validation after consolidation to ensure agentskills.io compliance
- argumentation -- used in Step 6 for resolving contradictory patches when prevalence is tied
- verify-agent-output -- evidence trails for patch proposals; validates analyst outputs in Step 4