
kryptobaseddev/ct-grade

Name: ct-grade
Author: kryptobaseddev

packages/skills/skills/ct-grade/SKILL.md

npx skillsauth add kryptobaseddev/cleo ct-grade


Session Grading Guide

Session grading evaluates agent behavioral patterns against the CLEO protocol. It reads the audit log for a completed session and applies a 5-dimension rubric to produce a score (0-100), letter grade (A-F), and diagnostic flags.

When to Use Grade Mode

Use grading when you need to:

Evaluate how well an agent followed CLEO protocol during a session
Identify behavioral anti-patterns (skipped discovery, missing session.end, etc.)
Track improvement over time across multiple sessions
Validate that orchestrated subagents followed protocol

Grading requires audit data. Sessions must be started with the --grade flag to enable audit log capture.

Starting a Grade Session

CLI

# Start a session with grading enabled
ct session start --scope epic:T001 --name "Feature work" --grade

# The --grade flag enables detailed audit logging
# All CLI operations are recorded for later analysis

Running Scenarios

The grading rubric evaluates 5 behavioral scenarios that map to protocol compliance:

1. Fresh Discovery

Tests whether the agent checks existing sessions and tasks before starting work. Evaluates session.list and tasks.find calls at session start.

2. Task Hygiene

Tests whether task creation follows protocol: descriptions provided, parent existence verified before subtask creation, no duplicate tasks.

3. Error Recovery

Tests whether the agent handles errors correctly: follows up E_NOT_FOUND with recovery lookups (tasks.find), avoids duplicate creates after failures.

4. Full Lifecycle

Tests session discipline end-to-end: session listed before task ops, session properly ended, CLI usage patterns.

5. Multi-Domain Analysis

Tests progressive disclosure: use of admin.help or skill lookups, use of progressive disclosure for programmatic access.

Evaluating Results

CLI

# Grade a specific session
ct grade <sessionId>

# List all past grade results
ct grade --list

Understanding the 5 Dimensions

Each dimension scores 0-20 points, totaling 0-100.

S1: Session Discipline (20 pts)

| Points | Criteria |
|--------|----------|
| 10 | session.list called before first task operation |
| 10 | session.end called when work is complete |

What it measures: Does the agent check existing sessions before starting, and properly close sessions when done?
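The S1 rule can be sketched in a few lines. This is a minimal illustration, not the grader's actual implementation: it assumes the audit log has already been reduced to a list of operation names in call order, and it treats any name starting with `tasks.` as a task operation. Both are assumptions made for the example.

```python
def score_session_discipline(ops: list[str]) -> int:
    """Return an S1 score (0, 10, or 20) for operation names in call order."""
    score = 0
    # Index of the first task operation; if there is none, use the end of the log.
    first_task = next(
        (i for i, op in enumerate(ops) if op.startswith("tasks.")),
        len(ops),
    )
    # 10 points: session.list appears before the first task operation.
    if "session.list" in ops[:first_task]:
        score += 10
    # 10 points: session.end was called at some point in the session.
    if "session.end" in ops:
        score += 10
    return score
```

Under these assumptions, a session that calls `tasks.find` before `session.list` loses the first 10 points even if it lists sessions later.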

S2: Discovery Efficiency (20 pts)

| Points | Criteria |
|--------|----------|
| 0-15 | find:list ratio >= 80% earns full 15; scales linearly below |
| 5 | tasks.show used for detail retrieval |

What it measures: Does the agent prefer tasks.find (low context cost) over tasks.list (high context cost) for discovery?
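One plausible reading of the S2 rule is sketched below. The guide does not define the ratio precisely, so this example assumes it means the share of `tasks.find` calls among all `tasks.find` and `tasks.list` calls, that "scales linearly" means 15 × ratio / 0.8, and that a session with no discovery calls earns no ratio points. All three are assumptions.

```python
def score_discovery_efficiency(ops: list[str]) -> float:
    """Return an S2 score (0-20) for operation names in call order."""
    finds = ops.count("tasks.find")
    lists = ops.count("tasks.list")
    total = finds + lists
    # Assumption: with no discovery calls at all, the ratio counts as 0.
    ratio = finds / total if total else 0.0
    # Full 15 points at a ratio of 80% or more, linear scaling below that.
    ratio_points = 15.0 if ratio >= 0.8 else 15.0 * ratio / 0.8
    # 5 points when tasks.show was used for detail retrieval.
    show_points = 5.0 if "tasks.show" in ops else 0.0
    return ratio_points + show_points
```

With one `tasks.find` and one `tasks.list` the ratio is 50%, which under this reading yields 15 × 0.5 / 0.8 = 9.375 ratio points.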

S3: Task Hygiene (20 pts)

Starts at 20 and deducts for violations:

| Deduction | Violation |
|-----------|-----------|
| -5 each | tasks.add without a description |
| -3 | Subtasks created without tasks.find {exact:true} parent check |

What it measures: Does the agent create well-formed tasks with descriptions and verify parents before creating subtasks?
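The deduction model for S3 might look like the following sketch. The audit entry shape here is hypothetical: each entry is a dict with an `op` name and a `params` dict holding `description`, `parent`, and `exact` keys. The example also simplifies the parent check by accepting any earlier `tasks.find` with `exact` set to true, and it floors the score at 0; neither detail is confirmed by the guide.

```python
def score_task_hygiene(entries: list[dict]) -> int:
    """Return an S3 score (0-20) from hypothetical audit entries."""
    score = 20
    parent_checked = False     # a tasks.find with exact=True has been seen
    unchecked_subtask = False  # a subtask was added before any such check
    for entry in entries:
        op = entry["op"]
        params = entry.get("params", {})
        if op == "tasks.find" and params.get("exact") is True:
            parent_checked = True
        elif op == "tasks.add":
            # -5 for each task created without a description.
            if not params.get("description"):
                score -= 5
            if params.get("parent") and not parent_checked:
                unchecked_subtask = True
    # -3 once if any subtask was created without a parent check.
    if unchecked_subtask:
        score -= 3
    return max(score, 0)
```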

S4: Error Protocol (20 pts)

Starts at 20 and deducts for violations:

| Deduction | Violation |
|-----------|-----------|
| -5 each | E_NOT_FOUND error not followed by recovery lookup within 5 ops |
| -5 | Duplicate task creates detected (same title in session) |

What it measures: Does the agent recover gracefully from errors and avoid creating duplicate tasks?
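The "within 5 ops" window in S4 can be illustrated with a simple slice over the entries that follow each error. As before, the entry shape (`op`, `error`, `params.title`) is hypothetical. Treating both `tasks.find` and `tasks.exists` as recovery lookups and flooring the score at 0 are assumptions.

```python
RECOVERY_OPS = {"tasks.find", "tasks.exists"}  # assumed recovery lookups

def score_error_protocol(entries: list[dict]) -> int:
    """Return an S4 score (0-20) from hypothetical audit entries."""
    score = 20
    for i, entry in enumerate(entries):
        if entry.get("error") == "E_NOT_FOUND":
            # Look at the next five operations for a recovery lookup.
            window = entries[i + 1 : i + 6]
            if not any(nxt["op"] in RECOVERY_OPS for nxt in window):
                score -= 5
    # -5 once if any task title was created more than once in the session.
    titles = [
        e.get("params", {}).get("title")
        for e in entries
        if e["op"] == "tasks.add" and e.get("params", {}).get("title")
    ]
    if len(titles) != len(set(titles)):
        score -= 5
    return max(score, 0)
```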

S5: Progressive Disclosure Use (20 pts)

| Points | Criteria |
|--------|----------|
| 10 | admin.help or skill lookup calls made |
| 10 | Progressive disclosure used for programmatic access |

What it measures: Does the agent use progressive disclosure (help/skills) for efficient protocol access?

Interpreting Scores

Letter Grades

| Grade | Score Range | Meaning |
|-------|-------------|---------|
| A | 90-100 | Excellent protocol adherence. Agent follows all best practices. |
| B | 75-89 | Good. Minor gaps in one or two dimensions. |
| C | 60-74 | Acceptable. Several protocol violations need attention. |
| D | 45-59 | Below expectations. Significant anti-patterns present. |
| F | 0-44 | Failing. Major protocol violations across multiple dimensions. |
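The score bands above translate directly into a threshold lookup. This is a small sketch of that mapping; the function name is illustrative, and the handling of fractional scores (a 89.5 counts as B) is an assumption, since the table only lists whole-number ranges.

```python
def letter_grade(score: float) -> str:
    """Map a 0-100 score to a letter grade using the published bands."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    # Lower bound of each band, checked from highest to lowest.
    for floor, grade in ((90, "A"), (75, "B"), (60, "C"), (45, "D")):
        if score >= floor:
            return grade
    return "F"
```

For example, `letter_grade(85)` falls in the 75-89 band and returns `"B"`.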

Reading the Output

The grade result includes:

score/maxScore: Raw numeric score (e.g., 85/100)
percent: Percentage score
grade: Letter grade (A-F)
dimensions: Per-dimension breakdown with score, max, and evidence
flags: Specific violations or improvement suggestions
entryCount: Number of audit entries analyzed

Flags

Flags are actionable diagnostic messages. Each flag identifies a specific behavioral issue:

session.list never called -- Check existing sessions before starting new ones
session.end never called -- Always end sessions when done
tasks.list used Nx -- Prefer tasks.find for discovery
tasks.add without description -- Always provide task descriptions
Subtasks created without parent existence check -- Verify parent exists first
E_NOT_FOUND not followed by recovery lookup -- Follow errors with tasks.find
No admin.help or skill lookup calls -- Load ct-cleo for protocol guidance
No progressive disclosure calls -- Use admin.help or skill lookups

Common Anti-patterns

| Anti-pattern | Impact | Fix |
|--------------|--------|-----|
| Skipping session.list at start | -10 S1 | Always check existing sessions first |
| Forgetting session.end | -10 S1 | End sessions when work is complete |
| Using tasks.list instead of tasks.find | up to -15 S2 | Use find for discovery, list only for known parent children |
| Creating tasks without descriptions | -5 each S3 | Always provide a description with tasks.add |
| Ignoring E_NOT_FOUND errors | -5 each S4 | Follow up with tasks.find or tasks.exists |
| Creating duplicate tasks | -5 S4 | Check for existing tasks before creating new ones |
| Never using admin.help | -10 S5 | Use progressive disclosure for protocol guidance |
| No progressive disclosure calls | -10 S5 | Use admin.help or skill lookups for protocol guidance |

Grade Result Schema

Grade results are stored in .cleo/metrics/GRADES.jsonl as append-only JSONL. Each entry conforms to schemas/grade.schema.json with these fields:

sessionId (string, required) -- Session that was graded
taskId (string, optional) -- Associated task ID
totalScore (number, 0-100) -- Aggregate score
maxScore (number, default 100) -- Maximum possible score
dimensions (object) -- Per-dimension { score, max, evidence[] }
flags (string[]) -- Specific violations or suggestions
timestamp (ISO 8601) -- When the grade was computed
entryCount (number) -- Audit entries analyzed
evaluator (auto | manual) -- How the grade was computed
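Because the results file is append-only JSONL, tracking improvement over time amounts to parsing one JSON object per line and ordering by timestamp. The sketch below uses only the fields listed above; the helper names are illustrative, and sorting timestamps as strings assumes they all share the same ISO 8601 format and time zone.

```python
import json
from typing import Iterable

def parse_grades(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL grade entries, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

def scores_by_time(grades: list[dict]) -> list[tuple[str, float]]:
    """Return (timestamp, percent) pairs in chronological order."""
    # Assumption: timestamps share one ISO 8601 format, so string order is time order.
    ordered = sorted(grades, key=lambda g: g["timestamp"])
    return [
        # maxScore defaults to 100 when the field is absent.
        (g["timestamp"], 100 * g["totalScore"] / g.get("maxScore", 100))
        for g in ordered
    ]

# Illustrative sample data, not real grade results.
SAMPLE = [
    '{"sessionId": "s2", "totalScore": 85, "maxScore": 100, "timestamp": "2026-04-02T10:00:00Z"}',
    '{"sessionId": "s1", "totalScore": 60, "timestamp": "2026-04-01T10:00:00Z"}',
]
history = scores_by_time(parse_grades(SAMPLE))
```

To read real results, pass an open file handle for .cleo/metrics/GRADES.jsonl to `parse_grades` in place of `SAMPLE`.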

CLI Grade Operations

| Command | Description |
|---------|-------------|
| ct grade <sessionId> | Grade a specific session |
| ct grade --list | List past grade results |


CLEO session grading and A/B behavioral analysis with token tracking. Evaluates agent session quality via a 5-dimension rubric (S1 session discipline, S2 discovery efficiency, S3 task hygiene, S4 error protocol, S5 progressive disclosure). Supports three modes: (1) scenario — run playbook scenarios S1-S5 via CLI; (2) ab — blind A/B comparison of different CLI configurations for same domain operations with token cost measurement; (3) blind — spawn two agents with different configurations, blind-comparator picks winner, analyzer produces recommendation. Use when grading agent sessions, running grade playbook scenarios, comparing behavioral differences, measuring token usage across configurations, or performing multi-run blind A/B evaluation with statistical analysis and comparative report. Triggers on: grade session, evaluate agent behavior, A/B test CLEO configurations, run grade scenario, token usage analysis, behavioral rubric, protocol compliance scoring.

149 stars

tools

Updated Apr 15, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add kryptobaseddev/cleo ct-grade

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 24, 2026, 8:47 PM (1.8s, 1 file scanned)

SKILL.md

name: ct-grade
description: >-
  analysis and comparative report. Triggers on: grade session, evaluate agent behavior,
version: 2.1.0
argument-hint: [mode=scenario|ab|blind] [scenario=s1-s5|all] [runs=N] [session-id=<id>]
allowed-tools: ["Bash(python *)", "Bash(cleo-dev *)", "Bash(cleo *)", "Bash(kill *)", "Bash(lsof *)", "Agent", "Read", "Write", "Glob"]
tier: 2
core: false
category: quality
protocol: null
dependencies: []
sharedResources: []
license: MIT


Related Skills

kryptobaseddev/signaldock-connect (tools, updated Apr 15, 2026)

Connect any AI agent to SignalDock for agent-to-agent messaging. Use when an agent needs to: (1) register on api.signaldock.io, (2) install the signaldock runtime CLI, (3) send/receive messages to other agents, (4) set up SSE real-time streaming, (5) poll for messages, (6) check inbox, or (7) connect to the SignalDock platform. Triggers on: "connect to signaldock", "register agent", "send message to agent", "agent messaging", "signaldock setup", "install signaldock", "agent-to-agent".

kryptobaseddev/ct-validator (development, updated Apr 15, 2026)

Compliance validation for verifying systems, documents, or code against requirements, schemas, or standards. Performs schema validation, code compliance checks, document validation, and protocol compliance verification with detailed pass/fail reporting. Use when validating compliance, checking schemas, verifying code standards, or auditing protocol implementations. Triggers on validation tasks, compliance checks, or quality verification needs.

kryptobaseddev/ct-task-executor (testing, updated Apr 15, 2026)

General implementation task execution for completing assigned CLEO tasks by following instructions and producing concrete deliverables. Handles coding, configuration, and documentation work with quality verification against acceptance criteria and progress reporting. Use when executing implementation tasks, completing assigned work, or producing task deliverables. Triggers on implementation tasks, general execution needs, or task completion work.

kryptobaseddev/ct-stickynote (tools, updated Apr 15, 2026)

Quick ephemeral sticky notes for project-wide capture before formal classification.


Install manually


# Clone the repo
git clone https://github.com/kryptobaseddev/cleo.git

# Copy into Claude Code skills folder (global)
cp -r cleo/packages/skills/skills/ct-grade ~/.claude/skills/


Repository

kryptobaseddev/cleo

149 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT