ace/cli/skills/kayba-pipeline/stage-3-metrics/SKILL.md
Define metrics from Kayba insights, implement them as Python measurement code, run against traces, and iterate until the metrics are clean and meaningful. Trigger when the user says "run stage 3", "define metrics", "build metrics", "compute baselines", or when invoked by the kayba-pipeline orchestrator. Requires eval/stage1_insights_summary.md and eval/stage2_domain_context.md to exist.
Define metrics from insights, implement as code, run, review, iterate.
TRACES_FOLDER — path to directory containing trace JSON files
eval/stage1_insights_summary.md — output from Stage 1
eval/stage2_domain_context.md — output from Stage 2
Read both input files before starting.
This stage is iterative. You cycle through define → implement → run → review, with a hard cap of 3 iterations. A metric set is "clean" when ALL of the following hold:
Every metric with a small denominator (< 5) is marked "confidence": "directional-only" and excluded from priority sorting.
Every metric at exactly 0% or 100% carries an "extreme_justification" field.
No pair of metrics has event overlap above 0.70, measured as |events_A ∩ events_B| / min(|events_A|, |events_B|). If > 0.70, merge or drop one.
If after 3 iterations the set is not fully clean, ship what you have and log remaining issues in eval/baseline_metrics.json under a top-level "warnings" key.
For each insight from the Kayba analysis, use the evidence fields to identify observable signals in the traces:
Recovery detectors — consecutive calls to the same function where first has error, next succeeds
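def has_recovery(calls, function_name):
    """Return True if a failed call to function_name is immediately followed by a successful retry."""
    # is_error / is_success are domain-specific helpers that you must supply.
    for i in range(len(calls) - 1):
        if calls[i]['name'] == function_name and is_error(calls[i]['output']):
            if calls[i+1]['name'] == function_name and is_success(calls[i+1]['output']):
                return True
    return False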
Loop detectors — N+ consecutive calls to the same function (stuck agent)
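A loop detector can be sketched in the same style as the recovery detector. This is a minimal sketch, assuming each call is a dict with a 'name' key as in `has_recovery`; the function names and the default of 3 repeats are illustrative assumptions, since the right N depends on the domain.

```python
from itertools import groupby


def longest_run(calls):
    """Return (function_name, length) for the longest run of consecutive calls to one function."""
    best_name, best_len = None, 0
    # groupby groups only adjacent items, which is exactly what "consecutive" needs.
    for name, group in groupby(calls, key=lambda c: c["name"]):
        run_len = sum(1 for _ in group)
        if run_len > best_len:
            best_name, best_len = name, run_len
    return best_name, best_len


def has_loop(calls, min_repeats=3):
    """True if any function is called min_repeats or more times in a row.

    min_repeats=3 is an assumed default, not a value taken from the pipeline.
    """
    return longest_run(calls)[1] >= min_repeats
```

Returning the run length as well as the flag makes it easier to tune the threshold after the first run.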
Give-up detectors — regex match agent output for abandonment phrases ("I'm unable to", "cannot complete", "beyond my capabilities")
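One way to implement the give-up detector is a single compiled, case-insensitive pattern over the agent's messages. The sketch below uses only the three phrases listed above; `GIVE_UP_PATTERN` and `has_give_up` are hypothetical names, and passing messages as a list of strings is an assumption about how the trace is parsed.

```python
import re

# Abandonment phrases from the list above; extend with phrases seen in your traces.
GIVE_UP_PATTERN = re.compile(
    r"\b(?:I(?:'|’)m unable to|cannot complete|beyond my capabilities)\b",
    re.IGNORECASE,
)


def has_give_up(agent_messages):
    """True if any agent message contains an abandonment phrase."""
    return any(GIVE_UP_PATTERN.search(message) for message in agent_messages)
```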
Error classifiers — match function outputs against domain-specific error patterns. Build a pattern table:
ERROR_PATTERNS = {
    'pattern_name': r'regex matching the error',
    # one entry per distinct error type
}
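A pattern table like this is typically consumed by a small classifier. The following is a sketch under assumptions: the two patterns are invented for illustration and are not taken from any real trace set, and `classify_error` is a hypothetical helper name.

```python
import re

# Hypothetical patterns for illustration only; replace with the error types in your traces.
EXAMPLE_ERROR_PATTERNS = {
    "not_found": r"not found|does not exist",
    "permission_denied": r"permission denied|not authorized",
}


def classify_error(output, patterns=EXAMPLE_ERROR_PATTERNS):
    """Return the name of the first pattern that matches the output, or None."""
    for name, pattern in patterns.items():
        if re.search(pattern, output, re.IGNORECASE):
            return name
    return None
```

Because the first match wins, more specific patterns should be listed before broader ones.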
Over-exploration detectors — ratio of explore vs action calls. Use the tool categories from Stage 2. If explore ratio exceeds threshold AND task didn't complete → analysis paralysis
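The over-exploration rule can be sketched as follows. This assumes the ratio is the share of explore calls among all calls, which is one reading of "explore vs action", and that Stage 2 yields a set of explore tool names. The 0.8 threshold and both function names are placeholders, not values from the pipeline.

```python
def explore_ratio(calls, explore_tools):
    """Share of calls whose tool name is in the explore category (0.0 for an empty thread)."""
    if not calls:
        return 0.0
    explore_calls = sum(1 for c in calls if c["name"] in explore_tools)
    return explore_calls / len(calls)


def is_analysis_paralysis(calls, explore_tools, task_completed, threshold=0.8):
    """True when the explore share exceeds the threshold AND the task did not complete.

    threshold=0.8 is an assumed default; tune it against your own traces.
    """
    return explore_ratio(calls, explore_tools) > threshold and not task_completed
```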
Ground-truth comparison detectors — agent claims a value (dollar amount, flight number, policy rule) in natural language, and the preceding tool response contains the actual value. Extract candidate values from agent text via regex, then compare against structured fields in the tool response JSON. Examples:
import re

# Extract dollar amounts from agent text
DOLLAR_PATTERN = r'\$\s?([\d,]+(?:\.\d{2})?)'
# Extract flight numbers (2-3 letters + 3-4 digits)
FLIGHT_PATTERN = r'\b([A-Z]{2,3}\d{3,4})\b'

def check_agent_claims_against_tool(agent_text, preceding_tool_response):
    """Compare values the agent states against the tool response ground truth."""
    claimed_amounts = re.findall(DOLLAR_PATTERN, agent_text)
    # extract_amounts_from_json and matches are helpers that you must supply
    actual_amounts = extract_amounts_from_json(preceding_tool_response)
    # A claim is fabricated if it doesn't match any actual value
    fabricated = [c for c in claimed_amounts if not any(matches(c, a) for a in actual_amounts)]
    return len(fabricated) == 0, fabricated
This pattern covers data accuracy (fabricated prices/flights), post-action verification (quoted vs actual cost), and policy accuracy (claimed restrictions vs policy text). These are NOT qualitative-only — regex + JSON comparison is noisy but produces a real signal. Build the detector even if it's imperfect; a noisy metric that produces a fix is better than a clean classification that produces nothing.
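The detector above relies on two helpers, `extract_amounts_from_json` and `matches`, which are not defined here. One possible sketch follows, assuming the tool response is JSON (a string or an already-parsed object), that every numeric field is a candidate amount, and that a one-cent tolerance is acceptable. All three are assumptions to adjust per domain.

```python
import json
from decimal import Decimal, InvalidOperation


def extract_amounts_from_json(tool_response):
    """Collect every numeric value found anywhere in a JSON tool response."""
    data = json.loads(tool_response) if isinstance(tool_response, str) else tool_response
    amounts = []

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)
        elif isinstance(node, bool):
            return  # bool is a subclass of int; skip it explicitly
        elif isinstance(node, (int, float)):
            amounts.append(Decimal(str(node)))

    walk(data)
    return amounts


def matches(claimed, actual, tolerance=Decimal("0.01")):
    """True if a claimed amount string (e.g. '1,234.50') equals actual within tolerance."""
    try:
        claimed_value = Decimal(claimed.replace(",", ""))
    except InvalidOperation:
        return False  # e.g. the regex captured only commas
    return abs(claimed_value - actual) <= tolerance
```

Collecting every number is deliberately permissive: it lowers false "fabricated" flags at the cost of occasionally matching an unrelated field, which fits the noisy-but-useful stance taken above.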
Ordering/sequencing detectors — agent performs actions in the wrong order (e.g., searches for flights before checking if the reservation is even modifiable). Check whether tool call A appears before tool call B when B should come first.
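An ordering check can be as small as comparing the first positions of two tool names. The sketch below assumes calls are dicts with a 'name' key; `violates_order` and the tool names in the usage note are hypothetical.

```python
def violates_order(calls, must_come_first, must_come_after):
    """True if must_come_after is called before any call to must_come_first."""
    names = [c["name"] for c in calls]
    if must_come_after not in names:
        return False  # the dependent action never happened, so nothing was violated
    first_dependent = names.index(must_come_after)
    # Violation: the prerequisite does not appear anywhere before the dependent call.
    return must_come_first not in names[:first_dependent]
```

For the example above, `violates_order(calls, "check_modifiable", "search_flights")` would flag threads that search before checking, including threads that never check at all.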
Clean success — threads where all tasks completed with no errors and no other tags
Write eval/compute_baselines.py with:
CLI args: --traces-dir (required), --output (default: eval/baseline_metrics.json)
load_traces(traces_dir) — loads all JSON trace files
tag_thread(thread) — combines all detectors, returns list of tags
Metric computations, each expressed as numerator / denominator
compute_all_baselines(traces_dir) — runs all metrics, returns dict
Run it:
python eval/compute_baselines.py --traces-dir {TRACES_FOLDER} --output eval/baseline_metrics.json
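The script layout described above might be skeletonized as follows. This is a sketch, not the pipeline's actual script: it assumes one thread per `*.json` file, leaves the detectors out of `tag_thread`, and uses a placeholder metric (`clean_thread_rate`) only to show the output shape. `ratio_metric`, `parse_args` and `main` are hypothetical names.

```python
import argparse
import json
from pathlib import Path


def load_traces(traces_dir):
    """Load every *.json file in traces_dir (assumes one thread per file)."""
    return [json.loads(p.read_text(encoding="utf-8"))
            for p in sorted(Path(traces_dir).glob("*.json"))]


def tag_thread(thread):
    """Combine all detectors and return a list of tags (detectors omitted in this sketch)."""
    tags = []
    # e.g. if has_recovery(thread_calls, "some_tool"): tags.append("recovery")
    return tags


def ratio_metric(name, numerator, denominator):
    """Express a metric as numerator / denominator; value is None when the denominator is 0."""
    value = numerator / denominator if denominator else None
    return {"name": name, "value": value, "numerator": numerator, "denominator": denominator}


def compute_all_baselines(traces_dir):
    """Run all metrics and return a dict keyed by metric id."""
    threads = load_traces(traces_dir)
    untagged = sum(1 for t in threads if not tag_thread(t))
    # Placeholder metric to illustrate the structure; replace with insight-driven metrics.
    return {"M1": ratio_metric("clean_thread_rate", untagged, len(threads))}


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Compute baseline metrics from traces.")
    parser.add_argument("--traces-dir", required=True)
    parser.add_argument("--output", default="eval/baseline_metrics.json")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    metrics = compute_all_baselines(args.traces_dir)
    output_path = Path(args.output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(json.dumps(metrics, indent=2), encoding="utf-8")
    return metrics
```

Calling `main()` under an `if __name__ == "__main__":` guard turns this into the runnable script. Keeping `argv` as a parameter lets the same entry point be exercised from tests.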
Run these checks in order after every run. Each check either passes or produces a concrete fix action.
Check A — Script health. Did the script error or produce null values? → fix and re-run. This fix costs no iteration; don't count it toward the 3-iteration cap.
Check B — Small-sample guard. For each metric, examine the denominator:
If the denominator is below 5: set "confidence": "directional-only" in the output JSON. The metric stays in the report but is excluded from priority sorting in Stage 4. Do NOT drop it — small-sample metrics can still inform qualitative analysis.
If the denominator is 0: the behavior never occurs in the traces (set "confidence": "not-observed" and move on).
Check C — Extreme-value triage. For any metric at exactly 0% or 100%:
If the extreme value is legitimate: record an "extreme_justification" in the output. Example: M5=0% is correct because both cancellations in the dataset were on ineligible reservations.
If the metric is already at its best possible value: add "at_ceiling": true (or "at_floor": true) to its entry in the output JSON. This signals to Stage 4 (direction setting) and Stage 5 (action planning) that the metric is already optimal and should NOT be listed as needing improvement. Stage 4 must set its direction to "↑ maintain" or "— already optimal", never bare "↑".
Check D — Correlation / overlap audit. For every pair of metrics, compute event overlap: |denom_A ∩ denom_B| / min(|denom_A|, |denom_B|). If > 0.70, merge the pair or drop one of the two.
Check E — Coverage (strict). For EVERY Stage 1 insight, verify it has a corresponding metric. If an insight has no metric, attempt to build a detector for it before classifying it as unmeasurable.
After checks, if any produced a fix action: apply fixes and re-run (counts as one iteration). If all checks pass → the metric set is clean. Stop iterating.
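Checks B and C lend themselves to a mechanical pass over the metric entries. The sketch below assumes each entry is a dict with "value" and "denominator" keys as in the output format, uses the denominator < 5 cutoff stated in this document, and treats a zero denominator as "not-observed". `annotate_confidence` and `needs_extreme_review` are hypothetical names.

```python
def annotate_confidence(metric, min_denominator=5):
    """Return a copy of the metric with a "confidence" label based on its denominator."""
    annotated = dict(metric)
    denominator = metric["denominator"]
    if denominator == 0:
        annotated["confidence"] = "not-observed"
    elif denominator < min_denominator:
        annotated["confidence"] = "directional-only"
    else:
        annotated["confidence"] = "full"
    return annotated


def needs_extreme_review(metric):
    """True if the metric sits at exactly 0% or 100% and has not been triaged yet."""
    if metric.get("value") not in (0.0, 1.0):
        return False
    handled_keys = ("extreme_justification", "at_ceiling", "at_floor")
    return not any(key in metric for key in handled_keys)
```

A run is a candidate for "clean" on these two checks once `needs_extreme_review` returns False for every entry.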
Target one metric per insight. Every insight should have a metric unless it is genuinely unmeasurable (see above). If you end up with fewer metrics than insights, you are being too conservative. Directional-only metrics (denominator < 5) still count — they produce fixes in Stage 5. Only apply the redundancy check (Check D) to merge metrics that truly overlap; do not use the metric count as a reason to skip building detectors.
Express every metric as a ratio or percentage. Absolute counts aren't comparable across trace sets.
Prefer per-event denominators over per-thread. "% of EditScript calls with errors" is sharper than "% of threads with any EditScript error." Per-thread denominators compress information — a thread with 10 violations and a thread with 1 both count the same.
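The difference between the two denominators is easy to see in code. This sketch assumes each thread is a list of call dicts with 'name' and 'output' keys. `is_error` here is a deliberately crude stand-in, and both rate functions are hypothetical names.

```python
def is_error(output):
    # Hypothetical stand-in: treat any output mentioning "error" as a failure.
    return "error" in str(output).lower()


def per_thread_error_rate(threads, function_name):
    """Share of threads using function_name that hit at least one error on it."""
    relevant = [t for t in threads if any(c["name"] == function_name for c in t)]
    if not relevant:
        return None
    failing = sum(
        1 for t in relevant
        if any(c["name"] == function_name and is_error(c["output"]) for c in t)
    )
    return failing / len(relevant)


def per_event_error_rate(threads, function_name):
    """Share of individual function_name calls that returned an error."""
    calls = [c for t in threads for c in t if c["name"] == function_name]
    if not calls:
        return None
    return sum(1 for c in calls if is_error(c["output"])) / len(calls)
```

With one thread of 10 failing calls and another with 1 failure in 10 calls, the per-thread rate is 1.0 for both threads combined, while the per-event rate of 11/20 keeps the difference in severity visible.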
One metric per behavioral change. If two would always move together, keep only the sharper one. Use Check D (overlap audit) to enforce this mechanically, not just by intuition.
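The overlap audit from Check D can be enforced with a few lines. The sketch assumes each metric exposes the set of event identifiers in its denominator; how those identifiers are built (for example thread id plus call index) is left open, and `overlap` and `find_redundant_pairs` are hypothetical names.

```python
from itertools import combinations


def overlap(events_a, events_b):
    """|A ∩ B| / min(|A|, |B|), or 0.0 when either set is empty."""
    a, b = set(events_a), set(events_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def find_redundant_pairs(metric_events, threshold=0.70):
    """Return (metric_a, metric_b, overlap) for every pair above the threshold."""
    flagged = []
    for name_a, name_b in combinations(sorted(metric_events), 2):
        score = overlap(metric_events[name_a], metric_events[name_b])
        if score > threshold:
            flagged.append((name_a, name_b, score))
    return flagged
```

Dividing by the smaller set means a metric whose events are a subset of another's scores 1.0, which is the case the merge-or-drop rule is meant to catch.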
Build a metric for EVERY insight. "Unmeasurable" is a last resort, not a default. Before classifying an insight as unmeasurable, you MUST attempt to build a programmatic detector. The bar for "unmeasurable" is: you tried a concrete approach, it fundamentally cannot work (not just "it's noisy"), and you can explain why in one sentence.
Specifically:
If the denominator is small (< 5), mark the metric "confidence": "directional-only". Do NOT skip building the metric. A directional-only metric still produces a fix in Stage 5.
If after genuine effort an insight truly cannot be measured programmatically, classify it as:
"qualitative-only" — requires semantic understanding that regex/JSON comparison cannot approximate. Must explain what specific semantic judgment is needed and why pattern matching fails.
"insufficient-data" — detector exists but denominator is 0 (not just small — literally zero applicable events). Note what scenarios would need to appear in traces.
"needs-ground-truth" — requires task-specific expected outcomes that aren't in the trace format.
Record any remaining unmeasurable insights in the output JSON under an "unmeasurable" key. The goal is for this list to be as short as possible — ideally empty.
eval/compute_baselines.py — runnable script with --traces-dir and --output CLI args
eval/baseline_metrics.json — computed baseline values, structured as:
{
  "M1": {
    "name": "single_tool_call_compliance",
    "value": 0.414,
    "numerator": 12,
    "denominator": 29,
    "confidence": "full"
  },
  "M5": {
    "name": "cancellation_policy_compliance",
    "value": 0.0,
    "numerator": 0,
    "denominator": 2,
    "confidence": "directional-only",
    "extreme_justification": "0% correct: both cancellations in dataset were on ineligible reservations"
  },
  "warnings": ["M5 and M6 have denominator < 5; excluded from priority ranking"],
  "unmeasurable": [
    {
      "insight_id": "d7494740",
      "name": "Cabin Change Constraints",
      "classification": "insufficient-data",
      "reason": "Only 1 update_reservation_flights call in dataset"
    }
  ]
}