claude/skills/read-inspect-eval/SKILL.md
Read and analyze Inspect AI evaluation log files using the Python API. Extract samples, messages, events, and metrics from .eval files.
This skill provides instructions for reading and analyzing Inspect AI evaluation log files using the Python API.
Use this skill when the user needs to read or analyze Inspect AI `.eval` log files.
Requires the `inspect_ai` package:
```bash
pip install inspect-ai
```
The inspect_ai.log module provides these primary functions:
| Function | Purpose |
|----------|---------|
| list_eval_logs() | List all eval logs at a location |
| read_eval_log() | Read an EvalLog from a file path |
| read_eval_log_sample() | Retrieve a single EvalSample |
| read_eval_log_samples() | Read all samples incrementally (generator) |
| read_eval_log_sample_summaries() | Access summary-level info for all samples |
```python
from inspect_ai import log

# List available logs
logs = log.list_eval_logs("./logs")

# Read a specific log
eval_log = log.read_eval_log("path/to/file.eval")

# Check status
print(f"Status: {eval_log.status}")  # "started", "success", "cancelled", or "error"
print(f"Results: {eval_log.results}")

# Read only header/metadata (fast, no sample data)
eval_log = log.read_eval_log("file.eval", header_only=True)

# Resolve de-duplicated attachments
eval_log = log.read_eval_log("file.eval", resolve_attachments=True)

# Require that every sample is present. all_samples_required is a parameter of
# read_eval_log_samples(), not read_eval_log()
samples = log.read_eval_log_samples("file.eval", all_samples_required=True)
```
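The function table lists `read_eval_log_sample()`, which none of the calls above use. A minimal sketch of fetching one sample by id, assuming the function accepts `id` and `epoch` keyword arguments; `load_sample` is a hypothetical wrapper, not part of `inspect_ai`.

```python
from typing import Any, Union


def load_sample(log_path: str, sample_id: Union[int, str], epoch: int = 1) -> Any:
    """Return a single EvalSample from a log without reading the other samples."""
    # Imported inside the function so this module still imports when
    # inspect_ai is not installed (hypothetical convenience, not required).
    from inspect_ai.log import read_eval_log_sample

    return read_eval_log_sample(log_path, id=sample_id, epoch=epoch)
```

Calling `load_sample("file.eval", sample_id=1)` would then return the first-epoch sample with id `1`, if one exists in the log.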
For large files, iterate samples without loading all into memory:
```python
from inspect_ai import log

# Read samples as generator (memory efficient)
for sample in log.read_eval_log_samples("file.eval"):
    print(f"Sample {sample.id}: {sample.scores}")

# Or get summaries only (even faster)
for summary in log.read_eval_log_sample_summaries("file.eval"):
    print(f"Sample {summary.id}: score={summary.scores}")
```
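Because `read_eval_log_samples()` yields one sample at a time, aggregate statistics can be accumulated in a single pass. The sketch below is one way to do that; `tally_scores` is a hypothetical helper, and it assumes each entry in `sample.scores` exposes a `.value` attribute.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def tally_scores(samples: Iterable[Any]) -> Dict[str, Counter]:
    """Count score values per scorer in one pass over the samples."""
    tallies: Dict[str, Counter] = {}
    for sample in samples:
        # sample.scores may be None when a sample was never scored
        for scorer, score in (sample.scores or {}).items():
            tallies.setdefault(scorer, Counter())[str(score.value)] += 1
    return tallies
```

Passing the generator directly, as in `tally_scores(log.read_eval_log_samples("file.eval"))`, avoids holding every sample in memory at once.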
The EvalLog object contains:
```python
eval_log.status   # "started", "success", "cancelled", or "error"
eval_log.samples  # List[EvalSample] - individual evaluations
eval_log.results  # Aggregate metrics
eval_log.stats    # Token usage data
eval_log.error    # Exception details if failed
eval_log.eval     # Eval configuration (task, model, etc.)
```
Each sample contains:
```python
sample.id        # Sample identifier
sample.epoch     # Epoch number
sample.input     # Input to the model
sample.target    # Expected target
sample.output    # Model output
sample.scores    # Dict of scorer results
sample.messages  # List of conversation messages
sample.metadata  # Custom metadata
sample.error     # Error if sample failed
```
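The fields above can be flattened into a plain dictionary for quick inspection or logging. This is a minimal sketch rather than an `inspect_ai` feature: `sample_row` is a hypothetical helper, and it assumes each score object has a `.value` attribute.

```python
from typing import Any, Dict


def sample_row(sample: Any) -> Dict[str, Any]:
    """Flatten selected sample fields into a plain dict."""
    scores = sample.scores or {}
    row: Dict[str, Any] = {
        "id": sample.id,
        "epoch": sample.epoch,
        "n_messages": len(sample.messages or []),
        "failed": sample.error is not None,
    }
    # Add one "score_<scorer>" key per scorer
    for name, score in scores.items():
        row[f"score_{name}"] = score.value
    return row
```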
The inspect_ai.analysis module provides functions to extract dataframes:
```python
from inspect_ai import analysis

# One row per eval log
df = analysis.evals_df("./logs")

# Custom columns
df = analysis.evals_df("./logs", columns=analysis.EvalInfo + analysis.EvalResults)

# One row per sample (summary info - fast)
df = analysis.samples_df("./logs")

# Full sample data (slower, use parallel)
df = analysis.samples_df("./logs", full=True, parallel=True)

# One row per message
df = analysis.messages_df("./logs", parallel=True)

# Filter by role
df = analysis.messages_df("./logs", filter=["assistant"], parallel=True)

# One row per event (model calls, tool use, etc.)
df = analysis.events_df(
    "logs",
    columns=analysis.EventTiming + analysis.ModelEventColumns,
    filter=lambda e: e.event == "model",
    parallel=True,
)
```
Pre-defined column groups for `evals_df`:
- `EvalInfo` - Metadata (created, tags, git commit)
- `EvalTask` - Task configuration
- `EvalModel` - Model details
- `EvalResults` - Status, errors, headline metrics
- `EvalScores` - All scores as separate columns

Pre-defined column groups for `samples_df`:
- `SampleSummary` - Default lightweight columns
- `SampleScores` - Score details (answer, metadata, explanation)
- `SampleMessages` - Message content as strings

```python
from inspect_ai import log, analysis

# Load eval
eval_log = log.read_eval_log("results.eval")
if eval_log.status == "success":
    print(f"Total samples: {len(eval_log.samples)}")
    print(f"Results: {eval_log.results}")

# Accuracy by task. samples_df() is given the log path here; "task" and "score"
# are placeholder column names, so check df.columns for the actual names
df = analysis.samples_df("results.eval")
print(df.groupby("task")["score"].mean())
```
```python
from inspect_ai import log

# Print the first 200 characters of every message in every sample
for sample in log.read_eval_log_samples("results.eval"):
    print(f"\n=== Sample {sample.id} ===")
    for msg in sample.messages:
        role = msg.role
        content = msg.content if isinstance(msg.content, str) else str(msg.content)
        print(f"{role}: {content[:200]}...")
```
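Calling `str()` on a list of content parts prints object representations rather than readable text. A minimal sketch of an alternative, assuming text-bearing parts expose a `.text` attribute; `content_text` is a hypothetical helper name.

```python
from typing import Any


def content_text(content: Any) -> str:
    """Return readable text for message content that is a str or a list of parts."""
    if isinstance(content, str):
        return content
    # Keep only parts that carry text; images and other parts are skipped
    return "\n".join(part.text for part in content if getattr(part, "text", None))
```

In the loop above, `content_text(msg.content)[:200]` could stand in for the `isinstance` expression.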
```python
from inspect_ai import log

# Report samples that failed
for sample in log.read_eval_log_samples("results.eval"):
    if sample.error:
        print(f"Sample {sample.id} failed: {sample.error}")
```
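When many samples fail, grouping them by error message is often more useful than printing each one. The sketch below assumes the error object carries a `.message` attribute and falls back to `str()` otherwise; `error_counts` is a hypothetical helper.

```python
from collections import Counter
from typing import Any, Iterable


def error_counts(samples: Iterable[Any]) -> Counter:
    """Count failed samples by error message."""
    counts: Counter = Counter()
    for sample in samples:
        if sample.error:
            # Fall back to str() if the error object has no .message attribute
            counts[getattr(sample.error, "message", str(sample.error))] += 1
    return counts
```

`error_counts(...).most_common(5)` then lists the five most frequent failure messages.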
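```python
from inspect_ai import analysis

# Compare models. Column names depend on the selected column groups and
# scorers, so check df.columns before grouping
df = analysis.evals_df("./logs")
print(df.groupby("model")["accuracy"].mean().sort_values(ascending=False))
```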
```python
from inspect_ai import analysis

# Export sample-level and eval-level data to CSV
samples = analysis.samples_df("./logs", parallel=True)
samples.to_csv("samples.csv", index=False)

evals = analysis.evals_df("./logs")
evals.to_csv("evals.csv", index=False)
```
If you downloaded an .eval file using the download-inspect-eval skill, use the standard inspect_ai functions:
```python
from inspect_ai import log

# Read the downloaded file directly
eval_log = log.read_eval_log("2025-12-17T05-42-03+00-00_debug_xyz.eval")

# Iterate samples
for sample in log.read_eval_log_samples("file.eval"):
    print(sample.id, sample.scores)
```
As a last resort, you can manually extract the archive (.eval files are zip archives):
```python
import json
import zipfile

with zipfile.ZipFile("file.eval", "r") as z:
    z.extractall("extracted")

# Read sample JSON directly
with open("extracted/samples/task_epoch_1.json") as f:
    sample_data = json.load(f)
```
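Extracting the whole archive is not required to inspect it. The sketch below reads sample JSON straight from the zip using only the standard library. It assumes, as in the path above, that samples are stored as `.json` members under `samples/`; `read_raw_samples` is a hypothetical helper.

```python
import json
import zipfile
from typing import Any, Dict


def read_raw_samples(eval_path: str) -> Dict[str, Any]:
    """Map each samples/*.json member name to its parsed JSON."""
    raw: Dict[str, Any] = {}
    with zipfile.ZipFile(eval_path) as z:
        for name in z.namelist():
            if name.startswith("samples/") and name.endswith(".json"):
                # Parse the member in place instead of extracting it to disk
                with z.open(name) as f:
                    raw[name] = json.load(f)
    return raw
```

The result is plain dictionaries rather than `EvalSample` objects, so this remains a fallback for when the `inspect_ai` readers cannot be used.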
```python
from inspect_ai import log, analysis

# Handle missing fields gracefully
df, errors = analysis.evals_df("./logs", strict=False)
if errors:
    print(f"Errors: {errors}")

# Check log status before processing
eval_log = log.read_eval_log("file.eval")
if eval_log.status == "error":
    print(f"Evaluation failed: {eval_log.error}")
elif eval_log.status == "started":
    print("Evaluation incomplete")
```
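The same status check extends to a whole directory of logs. A minimal sketch, assuming only that each log object has a `.status` string; `group_by_status` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def group_by_status(eval_logs: Iterable[Any]) -> Dict[str, List[Any]]:
    """Group logs by their status string so each group can be handled separately."""
    groups: Dict[str, List[Any]] = defaultdict(list)
    for eval_log in eval_logs:
        groups[eval_log.status].append(eval_log)
    return dict(groups)
```

Reading each log with `header_only=True` before grouping keeps this cheap, since no sample data is loaded.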
- Use `header_only=True` when you only need metadata
- Use the `read_eval_log_samples()` generator for large files
- Use `SampleSummary` columns (default) for fast `samples_df()`
- Use `parallel=True` for full sample/message/event extraction
- Filter `messages_df()` and `events_df()` to reduce data