letta-ai/analyzing-eval-errors

Name: analyzing-eval-errors
Author: letta-ai

.skills/analyzing-eval-errors/SKILL.md

npx skillsauth add letta-ai/letta-evals analyzing-eval-errors

Analyzing Eval Errors

Use this skill when:

- An eval run has errored samples that need investigation
- CLI crashes need diagnosis (exit code 1, empty stderr)
- Scores seem wrong and you suspect false failures
- You need to determine if agents actually completed on the server despite being recorded as errors

Quick Start: Run the Analysis Script

For an initial overview, run scripts/analyze_errors.py:

# Classify errors from JSONL (no API calls)
python <skill-dir>/scripts/analyze_errors.py --results-dir path/to/results

# Full analysis with Letta server cross-reference
python <skill-dir>/scripts/analyze_errors.py --results-dir path/to/results --check-server

The script writes error_analysis.json containing the classified errors and, when --check-server is passed, the server state. Read this output to understand the error landscape before diving deeper.

Investigation Workflow

Step 1: Parse and Classify

Read results.jsonl and summary.json. See references/results-schema.md for the data format.

Classify errors into buckets:

- timeout: "timed out" appears in the error message. Usually expected; skip unless investigating slow models.
- cli_crash: "return code" appears in the error message. The letta CLI subprocess crashed. Most common bug category.
- extraction: ExtractionError. The agent ran but produced no extractable submission.
- grading: Grading failed after extraction succeeded.
- other: Anything else.
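The bucketing rules above can be sketched as a small pure function. This is only an illustration, not the logic of scripts/analyze_errors.py: the check order, the case-insensitive matching, and the "grading" substring test are assumptions, since the source only gives the "timed out" and "return code" substrings and the ExtractionError name.

```python
from collections import Counter


def classify_error(message: str) -> str:
    """Map a raw error message to one of the five buckets.

    Assumptions (not from the skill): matching is case-insensitive,
    earlier checks win, and grading errors mention the word "grading".
    """
    text = message.lower()
    if "timed out" in text:
        return "timeout"
    if "return code" in text:
        return "cli_crash"
    if "extractionerror" in text:
        return "extraction"
    if "grading" in text:  # hypothetical marker for grading failures
        return "grading"
    return "other"


def count_buckets(messages: list[str]) -> Counter:
    """Tally how many error messages fall into each bucket."""
    return Counter(classify_error(m) for m in messages)
```

Counting buckets first makes it easy to see whether cli_crash dominates before spending time on server cross-referencing.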

Step 2: Cross-Reference with Server

For non-timeout errors, check what actually happened on the server. See references/letta-sdk-inspection.md for API details.

For each errored agent:

1. Check agent state: client.agents.retrieve(agent_id) → is last_stop_reason "end_turn" (normal) or "error"?
2. Check messages: client.agents.messages.list(agent_id, limit=200, order="asc") → did the agent produce a final assistant_message?
3. Compare: If JSONL says error but server shows assistant_message at end → false failure.
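The comparison in the last step can be expressed without touching the server, once the messages have been fetched. The sketch below works on plain dicts standing in for SDK message objects; the "message_type" key is an assumption about how the message type is exposed, and the list is assumed to be in ascending order as requested by order="asc".

```python
from typing import Mapping, Sequence


def ended_with_assistant_message(messages: Sequence[Mapping[str, str]]) -> bool:
    """True if the last message (ascending order) is an assistant_message."""
    if not messages:
        return False
    # "message_type" is a hypothetical key name for this illustration.
    return messages[-1].get("message_type") == "assistant_message"


def is_false_failure(
    recorded_as_error: bool, messages: Sequence[Mapping[str, str]]
) -> bool:
    """JSONL says error, but the server shows a completed final answer."""
    return recorded_as_error and ended_with_assistant_message(messages)
```

Keeping the check separate from the API calls means it can be unit-tested on fixtures and reused for every errored agent.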

Step 3: Investigate Discrepancies

For false failures (agent completed on server but recorded as error):

1. Find ghost runs: Compare client.runs.list(agent_id) against run_ids from messages. Runs with zero messages are ghost runs.
2. Inspect ghost runs: client.runs.retrieve(run_id) → check metadata.error for the actual error detail.
3. Check timing: Compare ghost run created_at vs last message date. Ghost runs typically appear 0.5 to 2 s after the agent's final message.
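A minimal sketch of the ghost-run comparison and timing check above, assuming the run IDs and message run_ids have already been pulled from the server into plain Python values. The function names are hypothetical helpers, not part of the Letta SDK.

```python
from datetime import datetime
from typing import Iterable, Optional


def find_ghost_runs(
    run_ids: Iterable[str], message_run_ids: Iterable[Optional[str]]
) -> list[str]:
    """Return runs that no message refers to (runs with zero messages)."""
    seen = {rid for rid in message_run_ids if rid is not None}
    return [rid for rid in run_ids if rid not in seen]


def seconds_after_last_message(
    run_created_at: datetime, last_message_at: datetime
) -> float:
    """Gap between the agent's final message and the ghost run's creation."""
    return (run_created_at - last_message_at).total_seconds()
```

A gap of roughly 0.5 to 2 seconds after the final message is consistent with the stale-approval pattern described under Known Error Patterns; a much larger gap may point to something else.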

For extraction errors (agent never responded):

1. Check run steps: client.runs.steps.list(run_id) → check completion_tokens. Zero tokens with status="success" means the provider returned an empty response.
2. Check provider: step.provider_name identifies which LLM provider is responsible.
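The zero-token check can be sketched as a filter over run steps. Here each step is a plain dict using the field names mentioned above (completion_tokens, status, provider_name); treating them as dict keys rather than SDK attributes is an assumption made to keep the example self-contained.

```python
from typing import Mapping, Sequence


def empty_completion_providers(steps: Sequence[Mapping[str, object]]) -> list[str]:
    """Providers whose steps reported success but generated zero tokens."""
    return [
        str(step.get("provider_name", "unknown"))
        for step in steps
        if step.get("status") == "success" and step.get("completion_tokens") == 0
    ]
```

An empty result means no step matched the zero-token pattern, so the extraction error likely has a different cause.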

Step 4: Generate Report

Write a structured markdown report with:

- Summary: Total errors, breakdown by model and error type
- Per-bug section: For each distinct error pattern found:
  - Description of what happens
  - Evidence (agent IDs, run IDs, timestamps)
  - Impact (false failure count, corrected scores)
- Agent ID table: For debugging, include agent IDs and ghost run IDs so the team can inspect directly
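As one possible shape for the Summary part of the report, the sketch below renders a markdown table of error counts by model and error type. The input format (dicts with "model" and "bucket" keys) is hypothetical and would need to be adapted to whatever the classification step actually produces.

```python
from collections import Counter
from typing import Mapping, Sequence


def render_summary(errors: Sequence[Mapping[str, str]]) -> str:
    """Render total errors and a model-by-error-type breakdown as markdown."""
    counts = Counter((e["model"], e["bucket"]) for e in errors)
    lines = [
        "## Summary",
        "",
        f"Total errors: {len(errors)}",
        "",
        "| Model | Error type | Count |",
        "| --- | --- | --- |",
    ]
    # Sort so the table is stable across runs.
    for (model, bucket), n in sorted(counts.items()):
        lines.append(f"| {model} | {bucket} | {n} |")
    return "\n".join(lines)
```

The per-bug sections and the agent ID table still need to be written by hand, since they depend on the evidence gathered in Steps 2 and 3.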

Known Error Patterns

Ghost Run (CLI sends stale approval after agent completion)

Symptom: The CLI exits with code 1 and empty stderr. The agent completed on the server with an assistant_message. A ghost run exists with the error "Cannot process approval response: No tool call is currently awaiting approval".

Cause: In --yolo mode, the CLI sends a delayed approval after the agent's final run has already ended. This creates a new run that immediately fails.

Affected models: minimax-m2.5 (~50% crash rate), kimi-k2.5 (~18%), glm-5 (~6%).

Zero-Token Completion (provider returns empty response)

Symptom: Extraction error. Agent has 2 messages (system + user). Run step shows completion_tokens=0, status="success", stop_reason="end_turn".

Cause: The LLM provider returns an empty response that the server treats as a valid end-of-turn.

Approval Race Conditions (older letta-code versions)

Symptom: Various errors — "Failed to fetch pending approvals for resync", "CONFLICT: Cannot send a new message", "Unexpected stop reason: error". Agent may be stuck with last_message_type=approval_request_message.

Cause: The CLI loses sync with the server's approval state during --yolo mode execution. Mostly fixed in newer versions, but the ghost run pattern persists.

Investigate errors in letta_evals runs by parsing results JSONL, cross-referencing agent and run state on the Letta server via the Python SDK, and producing structured error reports. Use when an eval run has errors, crashes, or unexpected failures that need diagnosis.

67 stars

development

Updated Apr 6, 2026

npx skillsauth add letta-ai/letta-evals analyzing-eval-errors

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

- Trivy (container and dependency vulnerability scanner): Clean, 95%
- Semgrep (static code analysis for vulnerabilities): Clean, 95%
- mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
- Snyk (dep) (open source security scanning): Skipped, 50%
- Socket.dev (supply chain security analysis): Skipped, 50%
- VirusTotal (multi-engine malware detection): Skipped, 50%
- CrowdStrike (advanced threat intelligence): Skipped, 50%
- OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
- OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 20, 2026, 10:36 AM (74.3s, 1 file scanned)

Install manually

# Clone the repo
git clone https://github.com/letta-ai/letta-evals.git

# Copy into Claude Code skills folder (global)
cp -r letta-evals/.skills/analyzing-eval-errors ~/.claude/skills/

Repository: letta-ai/letta-evals

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT