claude/skills/hawk-view-results/SKILL.md
View and analyze Hawk evaluation results. Use when the user wants to see eval-set results, check evaluation status, list samples, view transcripts, or analyze agent behavior from a completed evaluation run.
When the user wants to analyze evaluation results, use these hawk CLI commands:
If the user does not know the eval set ID, you can list all eval sets:
hawk list eval-sets
Shows: eval set ID, creation date, creator.
You can increase the number of results returned with --limit N.
hawk list eval-sets --limit 50
Or you can search for a specific eval set by using --search QUERY.
hawk list eval-sets --search pico
With an eval set ID, you can list all evaluations in that eval set:
hawk list evals [EVAL_SET_ID]
Shows: task name, model, status (success/error/cancelled), and sample counts.
Or you can list individual samples and their scores:
hawk list samples [EVAL_SET_ID] [--eval FILE] [--limit N]
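When the sample listing is scripted, it can help to build the command in one place. The following is a minimal Python sketch, not part of the hawk CLI: the helper name `list_samples_argv` is hypothetical, and it only uses the arguments shown above (`EVAL_SET_ID`, `--eval`, `--limit`). It clamps the limit to 500, the maximum this document states for hawk list samples.

```python
from typing import List, Optional

# Cap stated in this document: the API returns 422 for higher values.
MAX_SAMPLES_LIMIT = 500


def list_samples_argv(
    eval_set_id: str,
    eval_file: Optional[str] = None,
    limit: Optional[int] = None,
) -> List[str]:
    """Build the argv for `hawk list samples` (hypothetical helper)."""
    argv = ["hawk", "list", "samples", eval_set_id]
    if eval_file is not None:
        argv += ["--eval", eval_file]
    if limit is not None:
        if limit < 1:
            raise ValueError("limit must be a positive integer")
        # Clamp instead of letting the API reject the request.
        argv += ["--limit", str(min(limit, MAX_SAMPLES_LIMIT))]
    return argv


if __name__ == "__main__":
    # prints: hawk list samples eval-set-xxx --limit 500
    print(" ".join(list_samples_argv("eval-set-xxx", limit=1000)))
```

The resulting list can be passed to `subprocess.run` on a machine where hawk is installed and authenticated.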
To get the full conversation for a specific sample:
hawk transcript <UUID>
The transcript includes the full conversation with tool calls, scores, and metadata.
For more detail, you can request the raw data with --raw:
hawk transcript <UUID> --raw
You can also download all transcripts for an entire eval set:
# Fetch all samples in an eval set
hawk transcripts <EVAL_SET_ID>
# Write to individual files in a directory
hawk transcripts <EVAL_SET_ID> --output-dir ./transcripts
# Limit number of samples
hawk transcripts <EVAL_SET_ID> --limit 10
# Raw JSON output (one JSON per line to stdout, or .json files with --output-dir)
hawk transcripts <EVAL_SET_ID> --raw
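Since --raw writes one JSON document per line to stdout, the output can be consumed as a stream rather than loaded at once. The sketch below is a hedged illustration: the function name `iter_raw_transcripts` is hypothetical, and it makes no assumption about the fields inside each record, because the raw schema is not described here.

```python
import json
from typing import Any, Iterable, Iterator


def iter_raw_transcripts(lines: Iterable[str]) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of --raw output."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # Report which line failed so a truncated download is easy to spot.
            raise ValueError(f"line {line_number} is not valid JSON") from exc
```

One way to use it is to pipe the CLI output into a small script that calls `iter_raw_transcripts(sys.stdin)`, which processes records one at a time.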
- hawk list samples has a max --limit of 500 (API returns 422 for higher values)
- hawk transcript and hawk transcripts time out on large eval files (100MB+), common with side-task evals that have thousands of samples × multiple epochs
- hawk list samples does not index score values; the score_value field is often None

For querying sample-level data across eval sets (scores, limits, errors, token counts), use the warehouse-query skill instead of downloading eval files from S3 via inspect_ai.log.read_eval_log(). The warehouse has eval, sample, score, and message tables. A SQL query takes seconds vs minutes/hours for large eval files.
Example — find all samples that hit the working limit:
SELECT s.id AS sample_id, s.epoch, e.eval_set_id, e.model, e.task_args
FROM sample s
JOIN eval e ON s.eval_pk = e.pk
WHERE e.eval_set_id = 'eval-set-xxx'
AND s."limit" = 'working';
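To see the shape of this join without warehouse access, the query can be run against a toy in-memory database. This is only a sketch under loud assumptions: it uses SQLite from the Python standard library (the real warehouse engine is not stated here), the tables contain only the columns that appear in the query above, the rows are made up, and the `?` placeholder is SQLite's parameter syntax.

```python
import sqlite3

# Same query as above, with the eval set ID passed as a parameter.
QUERY = """
SELECT s.id AS sample_id, s.epoch, e.eval_set_id, e.model, e.task_args
FROM sample s
JOIN eval e ON s.eval_pk = e.pk
WHERE e.eval_set_id = ?
  AND s."limit" = 'working'
"""


def build_toy_warehouse() -> sqlite3.Connection:
    """Create a tiny stand-in schema with invented rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE eval (pk INTEGER PRIMARY KEY, eval_set_id TEXT,
                           model TEXT, task_args TEXT);
        CREATE TABLE sample (id TEXT, epoch INTEGER, eval_pk INTEGER,
                             "limit" TEXT);
        INSERT INTO eval VALUES (1, 'eval-set-xxx', 'model-a', '{}');
        INSERT INTO sample VALUES ('s1', 1, 1, 'working');
        INSERT INTO sample VALUES ('s2', 1, 1, NULL);
        INSERT INTO sample VALUES ('s3', 2, 1, 'working');
        """
    )
    return conn


def samples_hitting_working_limit(conn: sqlite3.Connection, eval_set_id: str):
    """Return the rows for samples whose limit column equals 'working'."""
    return conn.execute(QUERY, (eval_set_id,)).fetchall()
```

Note that `limit` is quoted in the query because it is a reserved word in SQL; the toy schema keeps the same quoting.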
Use hawk transcript <uuid> only when you need the full conversation transcript for a specific sample.
1. Run hawk list eval-sets to see available eval sets
2a. Run hawk list evals <EVAL_SET_ID> to see available evaluations
2b. or run hawk list samples <EVAL_SET_ID> to find samples of interest (max 500 per request)
3. For sample-level queries, use the warehouse-query skill with SQL
4. Use hawk transcript <uuid> only for full conversation details on individual samples

Production (https://api.inspect-ai.internal.metr.org) is used by default. Set HAWK_API_URL only when targeting non-production environments:
| Environment | URL |
|-------------|-----|
| Staging | https://api.inspect-ai.staging.metr-dev.org |
| Dev1 | https://api.inspect-ai.dev1.staging.metr-dev.org |
| Dev2 | https://api.inspect-ai.dev2.staging.metr-dev.org |
| Dev3 | https://api.inspect-ai.dev3.staging.metr-dev.org |
| Dev4 | https://api.inspect-ai.dev4.staging.metr-dev.org |
Example:
HAWK_API_URL=https://api.inspect-ai.staging.metr-dev.org hawk list eval-sets
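For scripted use, the environment table can be turned into a lookup that sets HAWK_API_URL only for non-production targets. This is a rough sketch, assuming the short keys (`staging`, `dev1`, and so on) and the helper name `hawk_env`, both of which are hypothetical; the URLs are copied from the table above.

```python
import os
from typing import Dict, Mapping, Optional

# URLs copied from the environment table; the short keys are illustrative.
NON_PRODUCTION_URLS = {
    "staging": "https://api.inspect-ai.staging.metr-dev.org",
    "dev1": "https://api.inspect-ai.dev1.staging.metr-dev.org",
    "dev2": "https://api.inspect-ai.dev2.staging.metr-dev.org",
    "dev3": "https://api.inspect-ai.dev3.staging.metr-dev.org",
    "dev4": "https://api.inspect-ai.dev4.staging.metr-dev.org",
}


def hawk_env(
    environment: str = "production",
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a process environment that targets the given Hawk environment."""
    env = dict(os.environ if base_env is None else base_env)
    key = environment.lower()
    if key == "production":
        # Production is the default, so make sure no override leaks through.
        env.pop("HAWK_API_URL", None)
        return env
    if key not in NON_PRODUCTION_URLS:
        raise ValueError(f"unknown environment: {environment!r}")
    env["HAWK_API_URL"] = NON_PRODUCTION_URLS[key]
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when invoking hawk, which keeps the override scoped to that one command rather than the whole shell session.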