ace/cli/skills/kayba-pipeline/stage-5-action-plan/SKILL.md
Triage each insight into discard/code-fix/prompt-fix and produce a prioritized action plan with specific recommendations. Trigger when the user says "run stage 5", "make action plan", "triage skills", or when invoked by the kayba-pipeline orchestrator. Requires eval outputs from stages 1-4.
Install this skill globally with one command: `npx skillsauth add kayba-ai/agentic-context-engine kayba-stage-5-action-plan`. Works with Claude Code, Cursor, and Windsurf.
Triage each insight and produce a concrete, prioritized action plan.
- `eval/stage1_insights_summary.md`: insights from Kayba
- `eval/stage2_domain_context.md`: domain context
- `eval/baseline_metrics.md`: the evaluation rubric
- `eval/baseline_metrics.json`: baseline values
- `eval/compute_baselines.py`: measurement code

Read all files before starting.
For each insight/skill, answer three questions in order: Is it valid? Is it already handled? Is it a code fix or prompt fix?
Do not rely on memory or assumption. Run these checks and cite what you find:
- Grep for keywords related to the insight, e.g. `cancel`, `eligibility`, `criteria`.
- Check `AGENT_INSTRUCTION` in the agent file and the domain policy file. Quote any existing language that addresses this behavior.

Walk through this tree for every non-discarded insight:
Q1: Can the agent fix this by following different instructions?
    (Does it have the right tools, correct data in tool responses,
     and sufficient context to behave correctly?)
    │
    ├─ YES → PROMPT FIX
    │        The agent has everything it needs but acts wrong.
    │        A system prompt addition would fix it.
    │
    └─ NO → Q2: What is the agent missing?
             │
             ├─ Tool doesn't exist, schema is wrong, API returns
             │  incomplete data, infrastructure drops information,
             │  timeout/error not surfaced to agent
             │  → CODE FIX
             │  Name the file, function, and specific change.
             │
             └─ The agent has partial information but the prompt
                can't fully compensate (e.g., needs a new tool
                but a heuristic prompt workaround exists)
                → PROMPT FIX (primary) + CODE FIX (optional)
                Note both. Mark the code fix as "optional" with
                a one-sentence justification for why it's lower priority.
Ambiguity default: When genuinely uncertain, default to prompt fix and add a note: "Classification uncertain — defaulting to prompt fix. Revisit if prompt change doesn't move metrics." This is safer because prompt fixes are cheaper to test and revert, and Stage 7 handles prompt fixes and code fixes through different paths.
Use the reflector's reasoning from Stage 1 insights — it often explicitly identifies root causes that clarify the code-vs-prompt distinction.
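The decision tree and the ambiguity default can be sketched as a small function. This is only an illustration: the `Insight` record, its field names, and the returned dictionary keys are hypothetical and not part of the pipeline. The triage judgment (does the agent have what it needs?) still has to come from reading the traces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Insight:
    """Hypothetical record of what triage concluded for one insight."""
    insight_id: str
    # Q1: does the agent already have the tools, data, and context it needs?
    # None means the triager is genuinely uncertain.
    agent_has_what_it_needs: Optional[bool]
    # Q2 (only read when Q1 is False): is there a heuristic prompt workaround?
    prompt_workaround_exists: bool = False


def classify(insight: Insight) -> dict:
    """Walk the Q1/Q2 tree and return a verdict plus flags."""
    if insight.agent_has_what_it_needs is None:
        # Ambiguity default: prompt fixes are cheaper to test and revert.
        return {"verdict": "prompt fix", "optional_code_fix": False, "uncertain": True}
    if insight.agent_has_what_it_needs:
        # Q1 = YES: the agent has everything it needs but acts wrong.
        return {"verdict": "prompt fix", "optional_code_fix": False, "uncertain": False}
    if insight.prompt_workaround_exists:
        # Q2, second branch: prompt fix is primary, code fix is optional.
        return {"verdict": "prompt fix", "optional_code_fix": True, "uncertain": False}
    # Q2, first branch: missing tool, wrong schema, dropped data, hidden errors.
    return {"verdict": "code fix", "optional_code_fix": False, "uncertain": False}
```

When `uncertain` is true, the action plan entry should carry the "Classification uncertain" note quoted above.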
Before writing recommendations, merge insights that are redundant. Two insights should merge when ALL three conditions hold:
When NOT to merge — two insights about the same tool or domain area but different failure modes should remain separate. Example: "agent doesn't check cancellation eligibility" and "agent doesn't execute cancellation after user confirms" both involve cancel_reservation but are completely different behavioral failures with different prompt fixes. Keep them separate.
For each merge, document:
For each insight (after merging):
- For prompt fixes: the proposed wording, where it goes (in `AGENT_INSTRUCTION`, added to domain policy, or as a standalone skill block), and why this wording over alternatives.

For each non-discarded fix, assess whether the change could break currently-working behaviors:
| Risk | Definition | Example |
|------|-----------|---------|
| None | Change is additive; no existing behavior could be affected | Adding a new metric to compute_baselines.py |
| Low | Change targets a behavior that is currently failing; working cases are unrelated | Adding a cancellation checklist when current cancellation compliance is 0% |
| Medium | Change modifies a behavior where some cases already work correctly | Strengthening confirmation protocol when 28.6% already succeed: could the new wording break the working 28.6%? |
| High | Change rewrites or constrains a behavior that mostly works | Restricting tool-call patterns when 41.4% already comply; overly rigid wording could cause the agent to under-call tools |
For Medium and High risk fixes, add a one-sentence mitigation: what to watch for, or how to word the prompt to preserve working cases.
Some insights from Stage 3 may be flagged as "unmeasurable." These still get fixes. An insight that the agent fabricates data or violates policy is a real problem whether or not we can measure it programmatically. Treat them the same as any other insight:
Only relegate an insight to a non-actionable "Monitor Items" section if the triage concludes it should be discarded (not valid or not actionable). Being unmeasurable is NOT a reason to skip fixing it.
For each non-discarded fix, identify which metric(s) from the rubric would move if this fix is implemented. Use the metric IDs from eval/baseline_metrics.md (e.g., M1, M2).
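As a rough illustration of building the metric link with baseline values, the sketch below looks up metric IDs in a dictionary. The layout of `eval/baseline_metrics.json` is not specified here, so `SAMPLE_BASELINES` is a hypothetical stand-in (ID mapped to an object with a `value` fraction), and the metric names in it are invented for the example.

```python
import json

# Hypothetical stand-in for eval/baseline_metrics.json; the real schema may differ.
SAMPLE_BASELINES = json.loads(
    '{"M1": {"name": "cancellation_compliance", "value": 0.0},'
    ' "M2": {"name": "confirmation_success", "value": 0.286}}'
)


def metric_link(metric_ids, baselines):
    """Format metric IDs with their baseline values for the 'Metric link' field."""
    parts = []
    for metric_id in metric_ids:
        entry = baselines.get(metric_id)
        if entry is None:
            # Flag missing IDs instead of guessing a value.
            parts.append(f"{metric_id} (no baseline found)")
        else:
            parts.append(f"{metric_id} (baseline {entry['value']:.1%})")
    return ", ".join(parts)


print(metric_link(["M1", "M2"], SAMPLE_BASELINES))
# prints: M1 (baseline 0.0%), M2 (baseline 28.6%)
```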
Rank non-discarded fixes using this formula:
Priority Score = Impact × Confidence × Tier Bonus ÷ Risk Factor
Where:
baseline_metrics.json:
You do not need to compute exact scores to three decimal places. The formula is a tiebreaker and sanity check. The point is:
After scoring, apply one manual adjustment pass: if a fix is a prerequisite for another fix (e.g., "confirmation protocol" must exist before "post-confirmation execution" can be measured), promote the prerequisite even if its standalone score is lower.
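The scoring formula and the prerequisite adjustment can be sketched as follows. All numbers are placeholders: the `RISK_FACTOR` divisors, the 0 to 1 scales for impact and confidence, and the `Fix` record itself are assumptions made for illustration, not values defined by this skill.

```python
from dataclasses import dataclass, field

# Placeholder divisors per risk level (assumed; substitute the real factors).
RISK_FACTOR = {"None": 1.0, "Low": 1.25, "Medium": 1.5, "High": 2.0}


@dataclass
class Fix:
    """Hypothetical record for one non-discarded fix."""
    name: str
    impact: float       # assumed 0..1
    confidence: float   # assumed 0..1
    tier_bonus: float   # assumed multiplier
    risk: str           # "None" | "Low" | "Medium" | "High"
    prerequisites: list = field(default_factory=list)  # names of fixes that must land first

    @property
    def score(self) -> float:
        # Priority Score = Impact x Confidence x Tier Bonus / Risk Factor
        return self.impact * self.confidence * self.tier_bonus / RISK_FACTOR[self.risk]


def rank(fixes):
    """Sort by score, then promote prerequisites ahead of their dependents."""
    ordered = sorted(fixes, key=lambda f: f.score, reverse=True)
    for _ in range(len(ordered)):  # bounded, so a dependency cycle cannot loop forever
        moved = False
        for dependent in list(ordered):
            for pre_name in dependent.prerequisites:
                pos = {f.name: i for i, f in enumerate(ordered)}
                if pre_name in pos and pos[pre_name] > pos[dependent.name]:
                    # Move the prerequisite to sit just before its dependent.
                    ordered.insert(pos[dependent.name], ordered.pop(pos[pre_name]))
                    moved = True
        if not moved:
            break
    return ordered
```

With these placeholder inputs, a "post-confirmation execution" fix that scores higher on its own still ranks below the "confirmation protocol" fix it depends on, which is the manual adjustment described above.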
Write to eval/action_plan.md:
# Action Plan
## Summary
- Total insights: N
- Discarded: X (with reasons)
- Code fixes: Y
- Prompt fixes: Z
- Fixes without programmatic metric (verify manually): Q
## Implementation Priority
| Rank | Fix | Type | Metrics | Risk | Score rationale |
|------|-----|------|---------|------|-----------------|
| 1 | [name] | prompt | M1, M2 | Low | [one-line: why this ranks here] |
| 2 | ... | ... | ... | ... | ... |
---
## Skill: [insight ID(s)] — [title]
**Summary:** [one-line description of what the skill addresses]
**Verdict:** `prompt fix` | `code fix` | `discard`
**Classification path:** [which branch of the decision tree — e.g., "Agent has tools and data but acts wrong → prompt fix"]
**Rationale:** [why this verdict — reference specific trace evidence from insights]
**Risk:** None | Low | Medium | High — [one-sentence justification]
**Risk mitigation:** [for Medium/High only — what to watch for or how to preserve working cases]
**Recommendation:** [specific change to make]
**Files to modify:** [list of files, for code fixes]
**Metric link:** [which metrics would move, with baseline values]
**Already-handled check:** [what you grepped, what existing prompt text you found, verdict]
---
[repeat for each insight]
## Consolidated Prompt Skills
[After all per-insight entries, list the final merged prompt skill texts in priority order, ready for Stage 7 to implement]
## Monitor Items (Non-Actionable Only)
[Only insights that were triaged as genuinely non-actionable — e.g., the agent cannot change this behavior, or the insight is noise. Unmeasurable insights that are still real problems should appear in the priority list above, NOT here.]
Group related insights under cluster headings when they address the same underlying behavior. For merged insights, list all constituent insight IDs in the heading.
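The Implementation Priority table from the template can be generated from already-ranked rows. This is a minimal sketch: the row dictionary keys are hypothetical, and printing "verify manually" for fixes with no linked metric is an assumption based on the Summary line in the template.

```python
def priority_table(rows):
    """Render ranked fixes as the Implementation Priority markdown table.

    rows: list of dicts with keys fix, type, metrics (list of IDs), risk,
    rationale, already sorted by priority. Key names are illustrative only.
    """
    lines = [
        "| Rank | Fix | Type | Metrics | Risk | Score rationale |",
        "|------|-----|------|---------|------|-----------------|",
    ]
    for rank, row in enumerate(rows, start=1):
        # Fixes without a programmatic metric are flagged for manual checks.
        metrics = ", ".join(row["metrics"]) or "verify manually"
        lines.append(
            f"| {rank} | {row['fix']} | {row['type']} | {metrics} "
            f"| {row['risk']} | {row['rationale']} |"
        )
    return "\n".join(lines)
```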
Output: `eval/action_plan.md`