packages/skills/skills/ct-grade/SKILL.md
CLEO session grading and A/B behavioral analysis with token tracking. Evaluates agent session quality via a 5-dimension rubric (S1 session discipline, S2 discovery efficiency, S3 task hygiene, S4 error protocol, S5 progressive disclosure). Supports three modes: (1) scenario — run playbook scenarios S1-S5 via CLI; (2) ab — blind A/B comparison of different CLI configurations for same domain operations with token cost measurement; (3) blind — spawn two agents with different configurations, blind-comparator picks winner, analyzer produces recommendation. Use when grading agent sessions, running grade playbook scenarios, comparing behavioral differences, measuring token usage across configurations, or performing multi-run blind A/B evaluation with statistical analysis and comparative report. Triggers on: grade session, evaluate agent behavior, A/B test CLEO configurations, run grade scenario, token usage analysis, behavioral rubric, protocol compliance scoring.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add kryptobaseddev/cleo ct-grade
Session grading evaluates agent behavioral patterns against the CLEO protocol. It reads the audit log for a completed session and applies a 5-dimension rubric to produce a score (0-100), letter grade (A-F), and diagnostic flags.
Use grading when you need to:
Grading requires audit data. Sessions must be started with the --grade flag to enable audit log capture.
```bash
# Start a session with grading enabled
ct session start --scope epic:T001 --name "Feature work" --grade

# The --grade flag enables detailed audit logging
# All CLI operations are recorded for later analysis
```
The grading rubric evaluates 5 behavioral scenarios that map to protocol compliance:
Tests whether the agent checks existing sessions and tasks before starting work. Evaluates session.list and tasks.find calls at session start.
Tests whether task creation follows protocol: descriptions provided, parent existence verified before subtask creation, no duplicate tasks.
Tests whether the agent handles errors correctly: follows up E_NOT_FOUND with recovery lookups (tasks.find), avoids duplicate creates after failures.
Tests session discipline end-to-end: session listed before task ops, session properly ended, CLI usage patterns.
Tests progressive disclosure: use of admin.help or skill lookups, use of progressive disclosure for programmatic access.
```bash
# Grade a specific session
ct grade <sessionId>

# List all past grade results
ct grade --list
```
Each dimension scores 0-20 points, totaling 0-100.
| Points | Criteria |
|--------|----------|
| 10 | session.list called before first task operation |
| 10 | session.end called when work is complete |
What it measures: Does the agent check existing sessions before starting, and properly close sessions when done?
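
The two criteria above can be sketched in a few lines. This is an illustrative sketch only, not CLEO's implementation: it assumes the audit log has already been reduced to an ordered list of operation names, and the function name `score_session_discipline` is hypothetical. It also assumes a session with no task operations still earns the first 10 points if session.list was called.

```python
def score_session_discipline(ops: list[str]) -> int:
    """Score S1 (0-20) from an ordered list of operation names (assumed log shape)."""
    score = 0
    # Index of the first task operation, or the end of the log if there is none.
    first_task = next(
        (i for i, op in enumerate(ops) if op.startswith("tasks.")),
        len(ops),
    )
    # 10 points: session.list appears before the first task operation.
    if "session.list" in ops[:first_task]:
        score += 10
    # 10 points: session.end was called at some point in the session.
    if "session.end" in ops:
        score += 10
    return score
```

Under these assumptions, a log of session.list, tasks.find, session.end earns the full 20, while a log that opens with a task operation and never ends the session earns 0.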
| Points | Criteria |
|--------|----------|
| 0-15 | find:list ratio >= 80% earns full 15; scales linearly below |
| 5 | tasks.show used for detail retrieval |
What it measures: Does the agent prefer tasks.find (low context cost) over tasks.list (high context cost) for discovery?
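
One way to read the ratio rule is sketched below. Several details are assumptions, since the table does not spell them out: the ratio is taken as find calls divided by find plus list calls, "scales linearly" is taken to mean a straight line from 0 points at a ratio of 0 to 15 points at 80%, and a session with no discovery calls earns 0 ratio points. The function name is hypothetical.

```python
def score_discovery_efficiency(ops: list[str]) -> float:
    """Score S2 (0-20) from an ordered list of operation names (assumed log shape)."""
    finds = ops.count("tasks.find")
    lists = ops.count("tasks.list")
    total = finds + lists
    # Assumption: ratio = find / (find + list); no discovery calls gives a ratio of 0.
    ratio = finds / total if total else 0.0
    # Full 15 points at >= 80%; below that, scale linearly toward 0.
    ratio_points = 15.0 if ratio >= 0.8 else 15.0 * ratio / 0.8
    # 5 points if tasks.show was used at least once for detail retrieval.
    show_points = 5.0 if "tasks.show" in ops else 0.0
    return ratio_points + show_points
```

With four tasks.find calls, one tasks.list call, and one tasks.show call, the ratio is exactly 80% and the sketch returns the full 20.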
Starts at 20 and deducts for violations:
| Deduction | Violation |
|-----------|-----------|
| -5 each | tasks.add without a description |
| -3 | Subtasks created without tasks.find {exact:true} parent check |
What it measures: Does the agent create well-formed tasks with descriptions and verify parents before creating subtasks?
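
The deduction model above might look like the following sketch. The entry shape (dicts with `op` and `params` keys, a `parent` parameter on subtask creation) is an assumption made for illustration, as is the function name. The sketch also simplifies the parent check: any earlier tasks.find call with exact set to true counts as verification, and the score is floored at 0.

```python
def score_task_hygiene(entries: list[dict]) -> int:
    """Score S3 (0-20) from audit entries shaped like {"op": ..., "params": {...}} (assumed)."""
    score = 20
    parent_checked = False      # has an exact tasks.find been seen so far?
    unverified_subtask = False  # was any subtask created before such a check?
    for entry in entries:
        op = entry.get("op")
        params = entry.get("params", {})
        if op == "tasks.find" and params.get("exact") is True:
            parent_checked = True
        elif op == "tasks.add":
            # -5 for each task created without a description.
            if not params.get("description"):
                score -= 5
            # Track subtasks created before any exact parent lookup.
            if params.get("parent") and not parent_checked:
                unverified_subtask = True
    # -3 once if any subtask was created without a prior parent check.
    if unverified_subtask:
        score -= 3
    return max(score, 0)  # assumption: the score never goes negative
```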
Starts at 20 and deducts for violations:
| Deduction | Violation |
|-----------|-----------|
| -5 each | E_NOT_FOUND error not followed by recovery lookup within 5 ops |
| -5 | Duplicate task creates detected (same title in session) |
What it measures: Does the agent recover gracefully from errors and avoid creating duplicate tasks?
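
The "recovery lookup within 5 ops" rule is a sliding-window check, sketched below. The entry shape (an `error` key holding the error code, a `params.title` on tasks.add) and the function name are assumptions for illustration; only tasks.find is treated as a recovery lookup here, and the score is floored at 0.

```python
def score_error_protocol(entries: list[dict], window: int = 5) -> int:
    """Score S4 (0-20) from audit entries; the entry shape is an assumption."""
    score = 20
    for i, entry in enumerate(entries):
        if entry.get("error") == "E_NOT_FOUND":
            # Look at the next `window` operations for a recovery lookup.
            following = entries[i + 1 : i + 1 + window]
            if not any(e.get("op") == "tasks.find" for e in following):
                score -= 5  # -5 for each unrecovered E_NOT_FOUND
    # Collect titles of all task creations in the session.
    titles = [
        e["params"]["title"]
        for e in entries
        if e.get("op") == "tasks.add" and "title" in e.get("params", {})
    ]
    # -5 once if the same title was created more than once.
    if len(titles) != len(set(titles)):
        score -= 5
    return max(score, 0)  # assumption: the score never goes negative
```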
| Points | Criteria |
|--------|----------|
| 10 | admin.help or skill lookup calls made |
| 10 | Progressive disclosure used for programmatic access |
What it measures: Does the agent use progressive disclosure (help/skills) for efficient protocol access?
| Grade | Score Range | Meaning |
|-------|-------------|---------|
| A | 90-100 | Excellent protocol adherence. Agent follows all best practices. |
| B | 75-89 | Good. Minor gaps in one or two dimensions. |
| C | 60-74 | Acceptable. Several protocol violations need attention. |
| D | 45-59 | Below expectations. Significant anti-patterns present. |
| F | 0-44 | Failing. Major protocol violations across multiple dimensions. |
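
The score ranges in the grade table translate directly into a threshold lookup. The function name `letter_grade` below is hypothetical, and rejecting out-of-range scores is a choice made for this sketch rather than documented behavior.

```python
def letter_grade(score: float) -> str:
    """Map a 0-100 score to a letter grade using the ranges in the table above."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    # Thresholds are the lower bound of each range, checked from highest to lowest.
    for threshold, grade in ((90, "A"), (75, "B"), (60, "C"), (45, "D")):
        if score >= threshold:
            return grade
    return "F"
```

For example, a score of 85 falls in the 75-89 range and maps to B.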
The grade result includes a score (for example, 85/100), a letter grade, and diagnostic flags.
Flags are actionable diagnostic messages. Each flag identifies a specific behavioral issue:
- session.list never called -- Check existing sessions before starting new ones
- session.end never called -- Always end sessions when done
- tasks.list used Nx -- Prefer tasks.find for discovery
- tasks.add without description -- Always provide task descriptions
- Subtasks created without parent existence check -- Verify parent exists first
- E_NOT_FOUND not followed by recovery lookup -- Follow errors with tasks.find
- No admin.help or skill lookup calls -- Load ct-cleo for protocol guidance
- No progressive disclosure calls -- Use admin.help or skill lookups

| Anti-pattern | Impact | Fix |
|-------------|--------|-----|
| Skipping session.list at start | -10 S1 | Always check existing sessions first |
| Forgetting session.end | -10 S1 | End sessions when work is complete |
| Using tasks.list instead of tasks.find | up to -15 S2 | Use find for discovery, list only for known parent children |
| Creating tasks without descriptions | -5 each S3 | Always provide a description with tasks.add |
| Ignoring E_NOT_FOUND errors | -5 each S4 | Follow up with tasks.find or tasks.exists |
| Creating duplicate tasks | -5 S4 | Check for existing tasks before creating new ones |
| Never using admin.help | -10 S5 | Use progressive disclosure for protocol guidance |
| No progressive disclosure calls | -10 S5 | Use admin.help or skill lookups for protocol guidance |
Grade results are stored in .cleo/metrics/GRADES.jsonl as append-only JSONL. Each entry conforms to schemas/grade.schema.json with these fields:
- sessionId (string, required) -- Session that was graded
- taskId (string, optional) -- Associated task ID
- totalScore (number, 0-100) -- Aggregate score
- maxScore (number, default 100) -- Maximum possible score
- dimensions (object) -- Per-dimension { score, max, evidence[] }
- flags (string[]) -- Specific violations or suggestions
- timestamp (ISO 8601) -- When the grade was computed
- entryCount (number) -- Audit entries analyzed
- evaluator (auto | manual) -- How the grade was computed

| Command | Description |
|---------|-------------|
| ct grade <sessionId> | Grade a specific session |
| ct grade --list | List past grade results |