cursor/.cursor/skills/agent-observability/SKILL.md
Track agent decisions, metrics, token costs, and generate reports. Use when logging task execution, tracking agent performance, or generating observability reports.
npx skillsauth add akshay-na/dotfiles agent-observabilityInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Instruments agent operations with per-task metrics, decision audit trails, token cost attribution, and reporting.
| Operation | Description |
|-----------|-------------|
| log_metric | Record a task metric entry |
| log_decision | Record a decision audit entry |
| generate_report | Generate session, weekly, or agent performance report |
| query_metrics | Query metrics with filters |
Store task execution metrics using this schema:
---
entity_name: session.current.metric.{task_id}
namespace: session.current
category: metric
status: accepted
tags: [observability, task-tracking, {task_type}]
created_at: ISO-8601
updated_at: ISO-8601
task_id: string
task_type: string
pipeline: string
target_repo: string
stages_executed:
- stage_id: string
agent: string
started_at: ISO-8601
completed_at: ISO-8601
duration_ms: number
outcome: success | failed | skipped
retry_count: number
token_estimate:
input: number
output: number
total_duration_ms: number
total_retries: number
final_outcome: success | failed | dead_letter | cancelled
decision_trail: [] # References to decision entries
---
Summary of task execution. Key decisions and outcomes.
Field descriptions:
| Field | Type | Description |
|-------|------|-------------|
| task_id | string | Unique identifier for the task |
| task_type | string | Classification (feature, bug-fix, refactor, etc.) |
| pipeline | string | Pipeline used (full-feature, bug-fix, etc.) |
| target_repo | string | Repository being modified |
| stages_executed | array | Per-stage execution records |
| total_duration_ms | number | Total task duration in milliseconds |
| total_retries | number | Sum of all retries across stages |
| final_outcome | enum | Final task status |
| decision_trail | array | References to decision entries for this task |
Record decisions for traceability:
---
entity_name: session.current.decision.{task_id}.{decision_id}
namespace: session.current
category: decision
status: accepted
tags: [observability, decision-audit, {agent}]
created_at: ISO-8601
task_id: string
decision_id: string
agent: string
decision_type: routing | strategy | escalation | override
decision: string
rationale: string
alternatives_considered: string[]
context_summary: string
---
Full decision description and reasoning.
Decision types:
| Type | When Used |
|------|-----------|
| routing | Task classification and pipeline selection |
| strategy | Retry or recovery strategy selection |
| escalation | Escalation to human or senior agent |
| override | Deviation from routing table recommendation |
Agents must write metrics at these events:
| Event | What to Capture | When | |-------|-----------------|------| | Task classified | task_type, signals matched, confidence | After orchestrator classifies | | Agent selected | agent name, reason (routing table vs override) | After routing decision | | Stage started | stage_id, agent, timestamp | Before stage execution | | Stage completed | duration, outcome, retry_count, tokens | After stage finishes | | Retry triggered | failure pattern, strategy selected, retry number | When retry loop fires | | Dead letter | full error context, strategies attempted | When retries exhausted | | Pipeline complete | total duration, total retries, final outcome | After all stages done | | Override made | what was overridden, why, alternatives | When agent overrides routing |
Task classified:
decision_type: routing
decision: "Classified as 'feature'"
rationale: "Signals matched: implement, create, new functionality"
alternatives_considered: ["bug-fix", "refactor"]
Stage completed:
stages_executed:
- stage_id: implementation
agent: senior-dev
started_at: "2024-01-15T10:30:00Z"
completed_at: "2024-01-15T10:45:00Z"
duration_ms: 900000
outcome: success
retry_count: 1
token_estimate:
input: 2500
output: 1800
Override made:
decision_type: override
decision: "Used vp-architecture instead of senior-dev"
rationale: "Complex distributed system design requires architecture expertise"
alternatives_considered: ["senior-dev", "staff-engineer"]
context_summary: "Task involves new microservice with cross-service data flow"
Since exact token counts aren't available to agents, use this estimation:
Input tokens: count_words(input_text) × 1.3
Output tokens: count_words(output_text) × 1.3
Guidelines:
Example calculation:
Input: 500 words prompt + 2000 words context = 2500 words
Estimated input tokens: 2500 × 1.3 = 3250
Output: 800 words response + 1500 words code = 2300 words
Estimated output tokens: 2300 × 1.3 = 2990
Session metrics live in session.current/ during execution. Promote to project-scoped storage for persistence.
| Trigger | Action | |---------|--------| | Pipeline completes (any outcome) | Promote task metrics | | User requests report | Promote and aggregate | | Session ends | Flush all pending metrics |
~/.cursor/memory/projects/{project_name}/metrics/
File naming: metric-{task_id}-{date}.md
Example: metric-task-001-2024-01-15.md
Maintain an index for efficient queries:
~/.cursor/memory/projects/{project_name}/metrics/_index.md
Index format:
---
entity_name: projects.{name}.metrics._index
namespace: projects.{name}.metrics
category: index
updated_at: ISO-8601
metrics:
- task_id: task-001
date: "2024-01-15"
outcome: success
pipeline: full-feature
- task_id: task-002
date: "2024-01-15"
outcome: failed
pipeline: bug-fix
---
Generate session reports using this template:
## Session Report: {date}
### Summary
- Tasks completed: X
- Success rate: Y%
- Total duration: Zmin
- Total token estimate: Nk
### Tasks
| Task ID | Type | Pipeline | Duration | Retries | Outcome |
|---------|------|----------|----------|---------|---------|
| task-001 | feature | full-feature | 45min | 1 | success |
### Agent Usage
| Agent | Invocations | Avg Duration | Token Est. | Failure Rate |
|-------|-------------|--------------|------------|--------------|
| tech-lead | 4 | 15min | ~15k | 25% |
| cto | 2 | 10min | ~8k | 0% |
### Failure Analysis
| Pattern | Count | Auto-resolved | Escalated |
|---------|-------|---------------|-----------|
| lint-error | 3 | 3 | 0 |
| test-failure | 1 | 0 | 1 |
### Decision Audit
- task-001: Classified as 'feature' (signals: implement, create).
Routing table selected full-feature pipeline.
Architecture review skipped (complexity: medium).
Aggregate across sessions for trend analysis:
## Weekly Report: {week_start} to {week_end}
### Summary
- Total tasks: X
- Success rate: Y%
- Total estimated tokens: Nk
- Avg task duration: Zmin
### Trends
- Most common task type: {type} ({count})
- Most used pipeline: {pipeline} ({count})
- Most active agent: {agent} ({invocations})
### Failure Patterns
| Pattern | This Week | Last Week | Trend |
|---------|-----------|-----------|-------|
| lint-error | 5 | 3 | ↑ |
| test-failure | 2 | 4 | ↓ |
### Recommendations
- Consider improving {area} based on failure patterns
- Agent {name} has high retry rate — review prompts
Query metrics using these operations:
| Operation | Parameters | Returns |
|-----------|------------|---------|
| query_by_date_range | start, end | metrics[] |
| query_by_task_type | type | metrics[] |
| query_by_agent | agent | metrics[] |
| query_by_outcome | outcome | metrics[] |
| aggregate_by_agent | - | { agent: stats } |
| aggregate_by_pattern | - | { pattern: counts } |
Query by date range:
query_filter:
operation: query_by_date_range
start: "2024-01-01"
end: "2024-01-15"
Aggregate by agent:
query_filter:
operation: aggregate_by_agent
Returns:
metrics:
tech-lead:
invocations: 12
total_duration_ms: 3600000
avg_duration_ms: 300000
token_estimate: 45000
failure_rate: 0.08
senior-dev:
invocations: 25
total_duration_ms: 7200000
avg_duration_ms: 288000
token_estimate: 95000
failure_rate: 0.04
The task-orchestration skill should log:
Integration point:
# After task classification
log_decision:
task_id: {task_id}
decision_type: routing
decision: "Selected {pipeline} pipeline"
rationale: "Task classified as {task_type} with signals: {signals}"
The pipeline-executor skill should log:
Integration point:
# After each stage
log_metric:
task_id: {task_id}
stage:
stage_id: {stage}
agent: {agent}
duration_ms: {duration}
outcome: {outcome}
token_estimate: {estimate}
The closed-loop-execution skill should log:
Integration point:
# On retry
log_decision:
task_id: {task_id}
decision_type: strategy
decision: "Retry with {strategy}"
rationale: "Pattern matched: {pattern}"
The context-memory skill provides the storage backend:
write for creating metric/decision entriesread for queryinglist for aggregationsession.current/ during execution, projects/{name}/metrics/ for persistence1. Validate metric_data against schema
2. Generate entity_name: session.current.metric.{task_id}
3. Write to context-memory with category: metric
4. Return { status: success, entry_path: ... }
1. Validate decision_data against schema
2. Generate decision_id if not provided
3. Generate entity_name: session.current.decision.{task_id}.{decision_id}
4. Write to context-memory with category: decision
5. Append decision_id to task metric's decision_trail
6. Return { status: success, entry_path: ... }
1. Determine report_type (session, weekly, agent_performance)
2. Query metrics from session.current/ or projects/{name}/metrics/
3. Query decisions for decision audit section
4. Aggregate statistics
5. Format using appropriate template
6. Return { status: success, report: { ... } }
1. Parse query_filter
2. Execute appropriate query operation
3. For aggregations, compute statistics
4. Return { status: success, metrics: [...] }
Track the effectiveness of feedback loops introduced by the feedback loop improvements.
feedback_loop_metrics:
# Pre-execution validation metrics
pre_validation_runs: number
pre_validation_pass_rate: number # % of validations that pass
pre_validation_catches: string[] # Types of errors caught pre-write
# Self-critique metrics
self_critique_runs: number
self_critique_catch_rate: number # % of self-reviews that find issues
self_critique_categories: string[] # Categories of issues found
# Cross-stage feedback metrics
cross_stage_feedback_iterations: number
cross_stage_feedback_resolutions: number # Fixed without user intervention
cross_stage_escalations: number
# Regression detection metrics
regression_checks_run: number
regression_checks_skipped: number # Skipped due to low complexity
regressions_detected: number
regressions_auto_fixed: number
# Strategy effectiveness (from failure-patterns.yml tracking)
strategy_effectiveness:
per_strategy:
- strategy: string
uses: number
success_rate: number
avg_attempts: number
| Event | What to Capture | When | |-------|-----------------|------| | Pre-validation complete | gates_run, passed, catches | After VALIDATE phase in closed-loop | | Self-critique finding | category, severity, file | During pre-execution-validation self_review | | Cross-stage feedback | source_stage, target_stage, items, action | During pipeline feedback loop | | Regression check | complexity, skipped, detected, resolved | During regression check in closed-loop | | Strategy execution | pattern_id, strategy, succeeded, attempts | After each strategy in closed-loop |
Include in session and weekly reports:
### Strategy Effectiveness
| Strategy | Uses | Success Rate | Avg Attempts | Trend |
|----------|------|-------------|-------------|-------|
| auto_fix | {n} | {rate}% | {avg} | {trend} |
| context_expand | {n} | {rate}% | {avg} | {trend} |
| analyze_then_fix | {n} | {rate}% | {avg} | {trend} |
| dependency_check | {n} | {rate}% | {avg} | {trend} |
| retry_with_backoff | {n} | {rate}% | {avg} | {trend} |
#### Strategy Insights
- Most effective: {strategy} ({rate}% success)
- Needs review: {strategy} (below 30% success after 10+ uses)
- Common failure chains: {strategy_a} → {strategy_b}
Include in session reports:
### Feedback Loop Summary
#### Pre-Execution Validation
- Runs: {n}
- Pass rate: {rate}%
- Top catches: {categories}
- Estimated cycles saved: {n} (errors caught before write)
#### Cross-Stage Feedback
- Feedback loops: {n}
- Resolution rate: {rate}% (without escalation)
- Avg iterations: {avg}
- Most active source: {stage}
#### Regression Detection
- Checks run: {n}
- Skipped (low complexity): {n}
- Regressions detected: {n}
- Auto-resolved: {n}
Update existing capture points to include feedback loop data:
# After each stage completes
log_metric:
task_id: {task_id}
stage:
stage_id: {stage}
agent: {agent}
duration_ms: {duration}
outcome: {outcome}
token_estimate: {estimate}
# NEW: Feedback loop data
validation_passed: boolean | null
validation_catches: string[] | null
feedback_items_produced: number | null
feedback_loop_triggered: boolean | null
# On pipeline completion
log_metric:
task_id: {task_id}
final_outcome: {outcome}
# NEW: Feedback loop aggregates
feedback_loops:
pre_validation_runs: number
pre_validation_catches: number
cross_stage_iterations: number
regression_checks: number
regressions_found: number
development
Discovery + naming convention reference for typed dev/SME/QA/devops team members in any workspace folder. Primary consumer: `tech-lead` (org-tier).
devops
Automated task classification, agent selection, and state tracking. Use when routing tasks to agents, selecting pipelines, or managing task state.
testing
Use when designing scalable systems, evaluating consistency models, planning state management, making architectural decisions, or when trade-offs around coupling, failure isolation, and reversibility need explicit reasoning before implementation.
tools
CTO/tech-lead helper — split work into disjoint shard briefs with caps (instance_cap, partition_basis, determinism keys).