.claude/skills/outcome-reflection/SKILL.md
Feed actual task results back into agent memory for calibration. Compares predicted vs actual outcomes, records accuracy scores, and tracks estimation quality, prediction quality, and decision quality over time to improve future agent performance.
npx skillsauth add oimiragieo/agent-studio outcome-reflection
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Closes the feedback loop between prediction and reality in agent task execution. After a task completes, this skill compares the predicted outcome (recorded at planning time) against the actual outcome (observed at completion), scores the accuracy on three dimensions, and persists the calibration record to memory for future use.
Over time, accumulated calibration data reveals systematic biases (e.g., consistent underestimation of implementation tasks) that planners can use to improve future predictions.
Invoke immediately after any task that had a predicted outcome recorded:
Do not invoke for:
TaskUpdate(completed)
NO CALIBRATION WITHOUT A PRIOR PREDICTION
If no prediction was recorded at planning time, outcome-reflection cannot score accuracy. The fix is to ensure planners record predictions. See plan-generator for prediction metadata format.
Measures how accurately the agent estimated measurable quantities:
Score: 0.0–1.0 (1.0 = exact; the score falls toward 0.0 as the miss grows, e.g. a 10x miss scores 0.1 under the formula below)
score = max(0, 1 - abs(predicted - actual) / max(predicted, actual))
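The formula above can be sketched as a small Node.js function. This is a minimal illustration, not the skill's actual implementation: the name `estimationScore` is hypothetical, and the handling of two zero values is an assumption, since the formula is undefined in that case.

```javascript
// Sketch of the estimation score formula; the function name is hypothetical.
function estimationScore(predicted, actual) {
  const larger = Math.max(predicted, actual);
  // Assumption: treat 0 vs 0 as an exact match to avoid dividing by zero.
  if (larger === 0) return 1;
  return Math.max(0, 1 - Math.abs(predicted - actual) / larger);
}

// Values taken from the sample calibration record in this document.
console.log(estimationScore(5000, 7200).toFixed(3)); // tokens: 0.694
console.log(estimationScore(3, 5).toFixed(3)); // files: 0.600
```

Because the denominator is the larger of the two values, over- and under-estimates of the same ratio receive the same score.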
Measures how accurately the agent predicted qualitative outcomes:
Score: 0.0–1.0 (1.0 = predicted outcome exactly matched, 0.0 = outcome unrecognized)
Scored by reading task completion metadata and comparing against task creation metadata.
Measures whether the decisions made during the task were appropriate in retrospect:
Score: 0.0–1.0 (1.0 = no rework, smooth execution; 0.0 = multiple pivots or failure)
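The rework signals that this document lists for the decision quality step (0 loops → 1.0, 1 → 0.75, 2 → 0.5, 3+ → 0.25, failed → 0.0) could be encoded roughly as follows. The function name, the argument shape, and the default `status` value are assumptions made for illustration.

```javascript
// Sketch of the rework-based mapping; name and argument shape are hypothetical.
function decisionQuality({ reworkLoops = 0, status = 'completed' } = {}) {
  if (status === 'failed') return 0.0; // a failed task scores zero regardless of loops
  if (reworkLoops <= 0) return 1.0;
  if (reworkLoops === 1) return 0.75;
  if (reworkLoops === 2) return 0.5;
  return 0.25; // three or more rework loops
}

console.log(decisionQuality({ reworkLoops: 1 })); // 0.75
console.log(decisionQuality({ reworkLoops: 4 })); // 0.25
console.log(decisionQuality({ reworkLoops: 0, status: 'failed' })); // 0
```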
Each outcome-reflection run produces one calibration record:
{
"taskId": "task-N",
"taskType": "implementation|planning|estimation|architecture|security",
"completedAt": "ISO-8601",
"agentType": "developer|planner|architect|...",
"predictions": {
"estimatedTokens": 5000,
"estimatedFiles": 3,
"estimatedSteps": 5,
"predictedOutcome": "Add JWT auth with refresh tokens",
"predictedBlockers": ["Redis not available"],
"confidence": "Medium"
},
"actuals": {
"actualTokens": 7200,
"actualFiles": 5,
"actualSteps": 8,
"actualOutcome": "Added JWT auth with refresh tokens; added Redis fallback",
"actualBlockers": ["Redis not available", "JWT library version mismatch"],
"reworkLoops": 1
},
"scores": {
"estimationAccuracy": 0.72,
"predictionQuality": 0.85,
"decisionQuality": 0.8,
"overall": 0.79
},
"flags": [],
"notes": "Token estimate was 44% low. Consider 1.5x buffer for JWT auth tasks."
}
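The document does not state how `overall` is derived. In the sample record it equals the unweighted mean of the three dimension scores rounded to two decimals, so the sketch below assumes that rule. The function name `overallScore` is hypothetical.

```javascript
// Assumption: overall is the unweighted mean of the three scores, rounded to 2 decimals.
function overallScore(scores) {
  const { estimationAccuracy, predictionQuality, decisionQuality } = scores;
  const mean = (estimationAccuracy + predictionQuality + decisionQuality) / 3;
  return Math.round(mean * 100) / 100;
}

// Scores from the sample record above.
const overall = overallScore({
  estimationAccuracy: 0.72,
  predictionQuality: 0.85,
  decisionQuality: 0.8,
});
console.log(overall); // 0.79
```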
# Check calibration history for this agent type
grep "outcome-reflection" .claude/context/memory/learnings.md 2>/dev/null | tail -10
Read the completed task's metadata using TaskGet:
const taskData = TaskGet({ taskId: 'task-N' });
// Access: taskData.metadata.predictions, taskData.metadata.actuals
node .claude/skills/outcome-reflection/hooks/pre-execute.cjs \
'{"taskId":"task-N","predictions":{},"actuals":{}}'
Expected output: { "valid": true } or error listing missing fields.
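As a rough illustration of the check described here, the sketch below validates the three fields the pre-execution hook is said to require (`taskId`, `predictions`, `actuals`). It is not the contents of `hooks/pre-execute.cjs`. The function name and the shape of the failure result (`valid: false` plus a `missing` array) are assumptions.

```javascript
// Hypothetical validator; the real hooks/pre-execute.cjs may check more than this.
function validateInput(input) {
  const missing = [];
  if (typeof input?.taskId !== 'string' || input.taskId.length === 0) {
    missing.push('taskId');
  }
  for (const key of ['predictions', 'actuals']) {
    const value = input?.[key];
    // Must be a plain object: reject null, arrays, and primitives.
    if (value === null || typeof value !== 'object' || Array.isArray(value)) {
      missing.push(key);
    }
  }
  return missing.length === 0 ? { valid: true } : { valid: false, missing };
}

console.log(JSON.stringify(validateInput({ taskId: 'task-N', predictions: {}, actuals: {} })));
// {"valid":true}
console.log(JSON.stringify(validateInput({ taskId: 'task-N' })));
// {"valid":false,"missing":["predictions","actuals"]}
```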
For each measurable quantity where a prediction exists:
estimationScore = max(0, 1 - abs(predicted - actual) / max(predicted, actual))
Command to compute:
node .claude/skills/outcome-reflection/scripts/main.cjs \
--taskId task-N \
--predicted '{"tokens":5000,"files":3,"steps":5}' \
--actual '{"tokens":7200,"files":5,"steps":8}'
Expected output: JSON with per-dimension scores and overall score.
Read task creation metadata (what was predicted qualitatively) and task completion metadata (what actually happened). Score on 0.0–1.0:
| Outcome Match           | Score   |
| ----------------------- | ------- |
| Exact match             | 1.0     |
| Minor deviations        | 0.8–0.9 |
| Moderate deviations     | 0.6–0.7 |
| Significant differences | 0.3–0.5 |
| Completely wrong        | 0.0–0.2 |
Check completion metadata for rework signals:
- reworkLoops: 0 → Decision quality: 1.0
- reworkLoops: 1 → Decision quality: 0.75
- reworkLoops: 2 → Decision quality: 0.5
- reworkLoops: 3+ → Decision quality: 0.25
- status: "failed" → Decision quality: 0.0

Append to .claude/context/memory/learnings.md:
## [DATE] Calibration: task-N (AGENT_TYPE)
- Estimation: SCORE (note what was under/over-estimated)
- Prediction: SCORE (note what was missed)
- Decision: SCORE (note rework loops)
- Overall: SCORE
- Action: [flag for reflection | no action needed]
Also emit structured record via MemoryRecord for semantic search:
MemoryRecord({
type: 'pattern',
text: `Calibration for ${taskType} tasks: estimation=${scores.estimationAccuracy}, prediction=${scores.predictionQuality}, decision=${scores.decisionQuality}`,
area: 'calibration',
});
If overall < 0.6, append a reflection request:
echo '{"id":"'$(node -e "console.log(require('crypto').randomUUID())")'","trigger":"calibration-miss","priority":"low","context":"Task task-N had calibration score SCORE. Identify root cause of estimation/prediction miss."}' >> .claude/context/runtime/reflection-spawn-request.json
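The same request can be built in Node.js, which avoids shell quoting issues when the context string contains special characters. This is a sketch under assumptions: the helper name `buildReflectionRequest` is hypothetical, and the field names simply mirror the shell command above.

```javascript
const { randomUUID } = require('node:crypto');

// Hypothetical helper: returns one JSON line, or null when no reflection is needed.
function buildReflectionRequest(taskId, overallScore, threshold = 0.6) {
  if (overallScore >= threshold) return null;
  return JSON.stringify({
    id: randomUUID(),
    trigger: 'calibration-miss',
    priority: 'low',
    context: `Task ${taskId} had calibration score ${overallScore}. Identify root cause of estimation/prediction miss.`,
  });
}

const line = buildReflectionRequest('task-N', 0.45);
console.log(line);
```

To persist the request, the returned line could be appended to `.claude/context/runtime/reflection-spawn-request.json` with `fs.appendFileSync`, mirroring the `>>` redirect in the shell version.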
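// Resolve from the project root (cwd); a bare '.claude/...' specifier would be looked up in node_modules.
const path = require('path');
const { sendEvent } = require(path.resolve('.claude/tools/observability/send-event.cjs'));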
sendEvent({
tool_name: 'outcome-reflection',
agent_id: process.env.AGENT_ID || 'unknown',
session_id: process.env.SESSION_ID || 'unknown',
outcome: 'success',
metadata: { taskId, overallScore, flagged: overallScore < 0.6 },
});
After 5+ calibrations for the same task type and agent, the planner should query the trend:
node .claude/skills/outcome-reflection/scripts/main.cjs \
--analyze --agentType developer --taskType implementation --last 10
Expected output: Mean, median, trend direction (improving/degrading/stable), and top 3 miss patterns.
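The document does not specify how `--analyze` derives its summary, so the following is only one plausible sketch. It computes the mean and median of a list of overall scores and labels the trend by comparing the older half against the newer half. The function name, the half-split rule, and the 0.05 tolerance are all assumptions, and miss-pattern extraction is not covered.

```javascript
// Hypothetical trend summary; the real --analyze mode may compute these differently.
function summarizeTrend(scores, tolerance = 0.05) {
  if (scores.length === 0) throw new Error('need at least one score');
  const mean = (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length;

  const sorted = [...scores].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;

  // Assumption: scores are ordered oldest-first; compare older half vs newer half.
  let trend = 'stable';
  if (scores.length >= 2) {
    const half = Math.floor(scores.length / 2);
    const delta = mean(scores.slice(-half)) - mean(scores.slice(0, half));
    if (delta > tolerance) trend = 'improving';
    else if (delta < -tolerance) trend = 'degrading';
  }
  return { mean: mean(scores), median, trend };
}

console.log(summarizeTrend([0.5, 0.55, 0.6, 0.7, 0.8, 0.85]).trend); // improving
```

A half-split comparison is crude but stable with only 5 to 10 records; a regression slope would be an alternative once more calibrations accumulate.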
Use trend data to adjust future estimates:
Input validated against schemas/input.schema.json before execution.
Output contract defined in schemas/output.schema.json.
Pre-execution hook: hooks/pre-execute.cjs — validates taskId, predictions object, actuals object.
Post-execution hook: hooks/post-execute.cjs — emits observability event via send-event.cjs.
- plan-generator: records predictions at planning time (prerequisite for calibration)
- instinct-learning: records atomic learned patterns (complementary to calibration)
- reflection-agent: investigates high-miss tasks flagged by outcome-reflection
- verification-before-completion: gate that runs before completion (runs before this skill)

Before starting: Read .claude/context/memory/learnings.md to find prior calibrations for the same agent type and task type. This provides baseline context.
After completing: Append calibration record to .claude/context/memory/learnings.md. Use MemoryRecord tool for structured pattern recording. Do NOT write directly to patterns.json.
Assume interruption: If calibration context is lost, re-read the completed task's metadata from TaskGet — all predictions and actuals should be in task metadata.