skills/arize-experiment/SKILL.md
INVOKE THIS SKILL when creating, running, or analyzing Arize experiments. Covers experiment CRUD, exporting runs, comparing results, and evaluation workflows using the ax CLI.
npx skillsauth add williamlimasilva/.copilot arize-experimentInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
correctness, relevance), with optional label, score, and explanationThe typical flow: export a dataset → process each example → collect outputs and evaluations → create an experiment with the runs.
Proceed directly with the task — run the ax command you need. Do NOT check versions, env vars, or profiles upfront.
If an ax command fails, troubleshoot based on the error:
command not found or version error → see references/ax-setup.md401 Unauthorized / missing API key → run ax profiles show to inspect the current profile. If the profile is missing or the API key is wrong: check .env for ARIZE_API_KEY and use it to create/update the profile via references/ax-profiles.md. If .env has no key either, ask the user for their Arize API key (https://app.arize.com/admin > API Keys).env for ARIZE_SPACE_ID, or run ax spaces list -o json, or ask the user.env for ARIZE_DEFAULT_PROJECT, or ask, or run ax projects list -o json --limit 100 and present as selectable optionsax experiments listBrowse experiments, optionally filtered by dataset. Output goes to stdout.
ax experiments list
ax experiments list --dataset-id DATASET_ID --limit 20
ax experiments list --cursor CURSOR_TOKEN
ax experiments list -o json
| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --dataset-id | string | none | Filter by dataset |
| --limit, -l | int | 15 | Max results (1-100) |
| --cursor | string | none | Pagination cursor from previous response |
| -o, --output | string | table | Output format: table, json, csv, parquet, or file path |
| -p, --profile | string | default | Configuration profile |
ax experiments getQuick metadata lookup -- returns experiment name, linked dataset/version, and timestamps.
ax experiments get EXPERIMENT_ID
ax experiments get EXPERIMENT_ID -o json
| Flag | Type | Default | Description |
|------|------|---------|-------------|
| EXPERIMENT_ID | string | required | Positional argument |
| -o, --output | string | table | Output format |
| -p, --profile | string | default | Configuration profile |
| Field | Type | Description |
|-------|------|-------------|
| id | string | Experiment ID |
| name | string | Experiment name |
| dataset_id | string | Linked dataset ID |
| dataset_version_id | string | Specific dataset version used |
| experiment_traces_project_id | string | Project where experiment traces are stored |
| created_at | datetime | When the experiment was created |
| updated_at | datetime | Last modification time |
ax experiments exportDownload all runs to a file. By default uses the REST API; pass --all to use Arrow Flight for bulk transfer.
ax experiments export EXPERIMENT_ID
# -> experiment_abc123_20260305_141500/runs.json
ax experiments export EXPERIMENT_ID --all
ax experiments export EXPERIMENT_ID --output-dir ./results
ax experiments export EXPERIMENT_ID --stdout
ax experiments export EXPERIMENT_ID --stdout | jq '.[0]'
| Flag | Type | Default | Description |
|------|------|---------|-------------|
| EXPERIMENT_ID | string | required | Positional argument |
| --all | bool | false | Use Arrow Flight for bulk export (see below) |
| --output-dir | string | . | Output directory |
| --stdout | bool | false | Print JSON to stdout instead of file |
| -p, --profile | string | default | Configuration profile |
--all)--all): Required for experiments with more than 500 runs. Uses gRPC+TLS on a separate host/port (flight.arize.com:443) which some corporate networks may block.Agent auto-escalation rule: If a REST export returns exactly 500 runs, the result is likely truncated. Re-run with --all to get the full dataset.
Output is a JSON array of run objects:
[
{
"id": "run_001",
"example_id": "ex_001",
"output": "The answer is 4.",
"evaluations": {
"correctness": { "label": "correct", "score": 1.0 },
"relevance": { "score": 0.95, "explanation": "Directly answers the question" }
},
"metadata": { "model": "gpt-4o", "latency_ms": 1234 }
}
]
ax experiments createCreate a new experiment with runs from a data file.
ax experiments create --name "gpt-4o-baseline" --dataset-id DATASET_ID --file runs.json
ax experiments create --name "claude-test" --dataset-id DATASET_ID --file runs.csv
| Flag | Type | Required | Description |
|------|------|----------|-------------|
| --name, -n | string | yes | Experiment name |
| --dataset-id | string | yes | Dataset to run the experiment against |
| --file, -f | path | yes | Data file with runs: CSV, JSON, JSONL, or Parquet |
| -o, --output | string | no | Output format |
| -p, --profile | string | no | Configuration profile |
Use --file - to pipe data directly — no temp file needed:
echo '[{"example_id": "ex_001", "output": "Paris"}]' | ax experiments create --name "my-experiment" --dataset-id DATASET_ID --file -
# Or with a heredoc
ax experiments create --name "my-experiment" --dataset-id DATASET_ID --file - << 'EOF'
[{"example_id": "ex_001", "output": "Paris"}]
EOF
| Column | Type | Required | Description |
|--------|------|----------|-------------|
| example_id | string | yes | ID of the dataset example this run corresponds to |
| output | string | yes | The model/system output for this example |
Additional columns are passed through as additionalProperties on the run.
ax experiments deleteax experiments delete EXPERIMENT_ID
ax experiments delete EXPERIMENT_ID --force # skip confirmation prompt
| Flag | Type | Default | Description |
|------|------|---------|-------------|
| EXPERIMENT_ID | string | required | Positional argument |
| --force, -f | bool | false | Skip confirmation prompt |
| -p, --profile | string | default | Configuration profile |
Each run corresponds to one dataset example:
{
"example_id": "required -- links to dataset example",
"output": "required -- the model/system output for this example",
"evaluations": {
"metric_name": {
"label": "optional string label (e.g., 'correct', 'incorrect')",
"score": "optional numeric score (e.g., 0.95)",
"explanation": "optional freeform text"
}
},
"metadata": {
"model": "gpt-4o",
"temperature": 0.7,
"latency_ms": 1234
}
}
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| label | string | no | Categorical classification (e.g., correct, incorrect, partial) |
| score | number | no | Numeric quality score (e.g., 0.0 - 1.0) |
| explanation | string | no | Freeform reasoning for the evaluation |
At least one of label, score, or explanation should be present per evaluation.
ax datasets list
ax datasets export DATASET_ID --stdout | jq 'length'
ax datasets export DATASET_ID
example_id, output, and optional evaluations:
[
{"example_id": "ex_001", "output": "4", "evaluations": {"correctness": {"label": "correct", "score": 1.0}}},
{"example_id": "ex_002", "output": "Paris", "evaluations": {"correctness": {"label": "correct", "score": 1.0}}}
]
ax experiments create --name "gpt-4o-baseline" --dataset-id DATASET_ID --file runs.json
ax experiments get EXPERIMENT_IDax experiments export EXPERIMENT_ID_A --stdout > a.json
ax experiments export EXPERIMENT_ID_B --stdout > b.json
example_id:
# Average correctness score for experiment A
jq '[.[] | .evaluations.correctness.score] | add / length' a.json
# Same for experiment B
jq '[.[] | .evaluations.correctness.score] | add / length' b.json
jq -s '.[0] as $a | .[1][] | . as $run |
{
example_id: $run.example_id,
b_score: $run.evaluations.correctness.score,
a_score: ($a[] | select(.example_id == $run.example_id) | .evaluations.correctness.score)
}' a.json b.json
# Count by label for experiment A
jq '[.[] | .evaluations.correctness.label] | group_by(.) | map({label: .[0], count: length})' a.json
jq -s '
[.[0][] | select(.evaluations.correctness.label == "correct")] as $passed_a |
[.[1][] | select(.evaluations.correctness.label != "correct") |
select(.example_id as $id | $passed_a | any(.example_id == $id))
]
' a.json b.json
Statistical significance note: Score comparisons are most reliable with ≥ 30 examples per evaluator. With fewer examples, treat the delta as directional only — a 5% difference on n=10 may be noise. Report sample size alongside scores: jq 'length' a.json.
ax experiments list --dataset-id DATASET_ID -- find experimentsax experiments export EXPERIMENT_ID -- download to filejq '.[] | {example_id, score: .evaluations.correctness.score}' experiment_*/runs.json# Count runs
ax experiments export EXPERIMENT_ID --stdout | jq 'length'
# Extract all outputs
ax experiments export EXPERIMENT_ID --stdout | jq '.[].output'
# Get runs with low scores
ax experiments export EXPERIMENT_ID --stdout | jq '[.[] | select(.evaluations.correctness.score < 0.5)]'
# Convert to CSV
ax experiments export EXPERIMENT_ID --stdout | jq -r '.[] | [.example_id, .output, .evaluations.correctness.score] | @csv'
arize-dataset firstarize-prompt-optimizationarize-tracearize-link| Problem | Solution |
|---------|----------|
| ax: command not found | See references/ax-setup.md |
| 401 Unauthorized | API key is wrong, expired, or doesn't have access to this space. Fix the profile using references/ax-profiles.md. |
| No profile found | No profile is configured. See references/ax-profiles.md to create one. |
| Experiment not found | Verify experiment ID with ax experiments list |
| Invalid runs file | Each run must have example_id and output fields |
| example_id mismatch | Ensure example_id values match IDs from the dataset (export dataset to verify) |
| No runs found | Export returned empty -- verify experiment has runs via ax experiments get |
| Dataset not found | The linked dataset may have been deleted; check with ax datasets list |
See references/ax-profiles.md § Save Credentials for Future Use.
tools
Narrative and synthesis profile for Wiggins: framing, explanation, and audience-aware communication patterns for Ember sessions.
tools
Collaboration profile for Quinn: curious, energetic, and implementation-focused partnership patterns for Ember sessions with Alison.
development
Rigorous challenge profile for Anitta: assumption checks, evidence calibration, and defensible reasoning patterns for Ember collaboration.
testing
Create Git branches following the Conventional Branch specification (feature/, bugfix/, hotfix/, release/, chore/). Use when creating a new branch, naming a branch, or checking whether a branch name complies with the spec.