plugins/arize-ax/skills/arize-experiment/SKILL.md
INVOKE THIS SKILL when creating, running, or analyzing Arize experiments. Covers experiment CRUD, exporting runs, comparing results, and evaluation workflows using the ax CLI.
correctness, relevance), with optional label, score, and explanation

The typical flow: export a dataset → process each example → collect outputs and evaluations → create an experiment with the runs.
Proceed directly with the task — run the ax command you need. Do NOT check versions, env vars, or profiles upfront.
If an ax command fails, troubleshoot based on the error:
- command not found or version error → see references/ax-setup.md
- 401 Unauthorized / missing API key → run ax profiles show to inspect the current profile. If the profile is missing or the API key is wrong, check .env for ARIZE_API_KEY and use it to create or update the profile via references/ax-profiles.md. If .env has no key either, ask the user for their Arize API key (https://app.arize.com/admin > API Keys).
- Missing space ID → check .env for ARIZE_SPACE_ID, or run ax spaces list -o json, or ask the user
- Missing project → check .env for ARIZE_DEFAULT_PROJECT, or ask, or run ax projects list -o json --limit 100 and present the results as selectable options

ax experiments list
Browse experiments, optionally filtered by dataset. Output goes to stdout.
ax experiments list
ax experiments list --dataset-id DATASET_ID --limit 20
ax experiments list --cursor CURSOR_TOKEN
ax experiments list -o json
| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --dataset-id | string | none | Filter by dataset |
| --limit, -l | int | 15 | Max results (1-100) |
| --cursor | string | none | Pagination cursor from previous response |
| -o, --output | string | table | Output format: table, json, csv, parquet, or file path |
| -p, --profile | string | default | Configuration profile |
ax experiments get
Quick metadata lookup: returns experiment name, linked dataset/version, and timestamps.
ax experiments get EXPERIMENT_ID
ax experiments get EXPERIMENT_ID -o json
| Flag | Type | Default | Description |
|------|------|---------|-------------|
| EXPERIMENT_ID | string | required | Positional argument |
| -o, --output | string | table | Output format |
| -p, --profile | string | default | Configuration profile |
| Field | Type | Description |
|-------|------|-------------|
| id | string | Experiment ID |
| name | string | Experiment name |
| dataset_id | string | Linked dataset ID |
| dataset_version_id | string | Specific dataset version used |
| experiment_traces_project_id | string | Project where experiment traces are stored |
| created_at | datetime | When the experiment was created |
| updated_at | datetime | Last modification time |
ax experiments export
Download all runs to a file. By default this uses the REST API; pass --all to use Arrow Flight for bulk transfer.
ax experiments export EXPERIMENT_ID
# -> experiment_abc123_20260305_141500/runs.json
ax experiments export EXPERIMENT_ID --all
ax experiments export EXPERIMENT_ID --output-dir ./results
ax experiments export EXPERIMENT_ID --stdout
ax experiments export EXPERIMENT_ID --stdout | jq '.[0]'
| Flag | Type | Default | Description |
|------|------|---------|-------------|
| EXPERIMENT_ID | string | required | Positional argument |
| --all | bool | false | Use Arrow Flight for bulk export (see below) |
| --output-dir | string | . | Output directory |
| --stdout | bool | false | Print JSON to stdout instead of file |
| -p, --profile | string | default | Configuration profile |
Arrow Flight (--all): Required for experiments with more than 500 runs. Uses gRPC+TLS on a separate host/port (flight.arize.com:443), which some corporate networks may block.

Agent auto-escalation rule: If a REST export returns exactly 500 runs, the result is likely truncated. Re-run with --all to get the full set of runs.
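The auto-escalation rule can be sketched as a small check. This is a minimal illustration, not part of the ax CLI: the helper names `looks_truncated` and `export_command` are hypothetical, and the 500-run cap is taken from the rule above.

```python
REST_EXPORT_CAP = 500  # run count at which the rule above treats a REST export as truncated


def looks_truncated(runs, cap=REST_EXPORT_CAP):
    """Return True when a REST export holds exactly `cap` runs (likely truncated)."""
    return len(runs) == cap


def export_command(experiment_id, bulk=False):
    """Build the argv for `ax experiments export`, adding --all for a bulk export."""
    cmd = ["ax", "experiments", "export", experiment_id]
    if bulk:
        cmd.append("--all")
    return cmd
```

If `looks_truncated(runs)` is true for the first export, the list from `export_command(experiment_id, bulk=True)` can be passed to `subprocess.run` to repeat the export over Arrow Flight.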
Output is a JSON array of run objects:
[
{
"id": "run_001",
"example_id": "ex_001",
"output": "The answer is 4.",
"evaluations": {
"correctness": { "label": "correct", "score": 1.0 },
"relevance": { "score": 0.95, "explanation": "Directly answers the question" }
},
"metadata": { "model": "gpt-4o", "latency_ms": 1234 }
}
]
ax experiments create
Create a new experiment with runs from a data file.
ax experiments create --name "gpt-4o-baseline" --dataset-id DATASET_ID --file runs.json
ax experiments create --name "claude-test" --dataset-id DATASET_ID --file runs.csv
| Flag | Type | Required | Description |
|------|------|----------|-------------|
| --name, -n | string | yes | Experiment name |
| --dataset-id | string | yes | Dataset to run the experiment against |
| --file, -f | path | yes | Data file with runs: CSV, JSON, JSONL, or Parquet |
| -o, --output | string | no | Output format |
| -p, --profile | string | no | Configuration profile |
Use --file - to pipe data directly — no temp file needed:
echo '[{"example_id": "ex_001", "output": "Paris"}]' | ax experiments create --name "my-experiment" --dataset-id DATASET_ID --file -
# Or with a heredoc
ax experiments create --name "my-experiment" --dataset-id DATASET_ID --file - << 'EOF'
[{"example_id": "ex_001", "output": "Paris"}]
EOF
| Column | Type | Required | Description |
|--------|------|----------|-------------|
| example_id | string | yes | ID of the dataset example this run corresponds to |
| output | string | yes | The model/system output for this example |
Additional columns are passed through as additionalProperties on the run.
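Before creating an experiment, it can help to check the required columns locally. The following is a minimal sketch under the assumption that runs are already loaded as a list of dicts; `find_invalid_runs` is a hypothetical helper, and treating a null value as missing is an assumption rather than documented CLI behavior.

```python
REQUIRED_COLUMNS = ("example_id", "output")


def find_invalid_runs(runs):
    """Return (index, missing_columns) for every run lacking a required column."""
    problems = []
    for index, run in enumerate(runs):
        # A column counts as missing when the key is absent or its value is None.
        missing = [col for col in REQUIRED_COLUMNS if run.get(col) is None]
        if missing:
            problems.append((index, missing))
    return problems


runs = [
    {"example_id": "ex_001", "output": "Paris"},
    {"example_id": "ex_002"},   # no output
    {"output": "Berlin"},       # no example_id
]
problems = find_invalid_runs(runs)
```

Here `problems` lists the second and third runs, which would otherwise surface as an invalid runs file at creation time.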
ax experiments delete
ax experiments delete EXPERIMENT_ID
ax experiments delete EXPERIMENT_ID --force # skip confirmation prompt
| Flag | Type | Default | Description |
|------|------|---------|-------------|
| EXPERIMENT_ID | string | required | Positional argument |
| --force, -f | bool | false | Skip confirmation prompt |
| -p, --profile | string | default | Configuration profile |
Each run corresponds to one dataset example:
{
"example_id": "required -- links to dataset example",
"output": "required -- the model/system output for this example",
"evaluations": {
"metric_name": {
"label": "optional string label (e.g., 'correct', 'incorrect')",
"score": "optional numeric score (e.g., 0.95)",
"explanation": "optional freeform text"
}
},
"metadata": {
"model": "gpt-4o",
"temperature": 0.7,
"latency_ms": 1234
}
}
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| label | string | no | Categorical classification (e.g., correct, incorrect, partial) |
| score | number | no | Numeric quality score (e.g., 0.0 - 1.0) |
| explanation | string | no | Freeform reasoning for the evaluation |
At least one of label, score, or explanation should be present per evaluation.
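The evaluation rules above can be checked with a short validator. This is a sketch only: `evaluation_problems` is a hypothetical helper, and the type checks mirror the table rather than any validation the service is known to perform.

```python
from numbers import Real

EVALUATION_FIELDS = ("label", "score", "explanation")


def evaluation_problems(evaluations):
    """Return human-readable problems found in a run's `evaluations` mapping."""
    problems = []
    for name, evaluation in evaluations.items():
        present = [f for f in EVALUATION_FIELDS if evaluation.get(f) is not None]
        if not present:
            problems.append(f"{name}: needs at least one of label, score, explanation")
        score = evaluation.get("score")
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if score is not None and (isinstance(score, bool) or not isinstance(score, Real)):
            problems.append(f"{name}: score must be numeric")
        label = evaluation.get("label")
        if label is not None and not isinstance(label, str):
            problems.append(f"{name}: label must be a string")
    return problems
```

An empty list means every evaluation carries at least one usable field with the expected type.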
ax datasets list
ax datasets export DATASET_ID --stdout | jq 'length'
ax datasets export DATASET_ID
Build a runs file with example_id, output, and optional evaluations:
[
{"example_id": "ex_001", "output": "4", "evaluations": {"correctness": {"label": "correct", "score": 1.0}}},
{"example_id": "ex_002", "output": "Paris", "evaluations": {"correctness": {"label": "correct", "score": 1.0}}}
]
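A runs file like the one above could be generated from exported dataset examples along these lines. This is a sketch under stated assumptions: the example fields `id`, `question`, and `expected` are hypothetical (the real export may use different names), and `generate` stands in for whatever model or system call produces the output.

```python
import json


def exact_match(example, output):
    """Hypothetical evaluator: compare the output with an assumed 'expected' field."""
    correct = output == example.get("expected")
    return {"correctness": {"label": "correct" if correct else "incorrect",
                            "score": 1.0 if correct else 0.0}}


def build_runs(examples, generate, evaluate=None, id_field="id"):
    """Process each example and collect its output and evaluations into run dicts."""
    runs = []
    for example in examples:
        output = generate(example)
        run = {"example_id": example[id_field], "output": output}
        if evaluate is not None:
            run["evaluations"] = evaluate(example, output)
        runs.append(run)
    return runs


# Toy data standing in for an exported dataset; the field names are assumed.
examples = [
    {"id": "ex_001", "question": "What is 2 + 2?", "expected": "4"},
    {"id": "ex_002", "question": "Capital of France?", "expected": "Paris"},
]
# The lambda echoes the expected answer; a real run would call the model here.
runs = build_runs(examples, generate=lambda ex: ex["expected"], evaluate=exact_match)
runs_json = json.dumps(runs, indent=2)
```

The resulting `runs_json` string can be written to runs.json or piped to the create command via --file -.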
ax experiments create --name "gpt-4o-baseline" --dataset-id DATASET_ID --file runs.json
ax experiments get EXPERIMENT_ID
ax experiments export EXPERIMENT_ID_A --stdout > a.json
ax experiments export EXPERIMENT_ID_B --stdout > b.json
Compare evaluation scores, matching runs by example_id:
# Average correctness score for experiment A
jq '[.[] | .evaluations.correctness.score] | add / length' a.json
# Same for experiment B
jq '[.[] | .evaluations.correctness.score] | add / length' b.json
jq -s '.[0] as $a | .[1][] | . as $run |
{
example_id: $run.example_id,
b_score: $run.evaluations.correctness.score,
a_score: ($a[] | select(.example_id == $run.example_id) | .evaluations.correctness.score)
}' a.json b.json
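The same per-example join can be sketched in Python with a dictionary lookup, which avoids rescanning experiment A for every run in B. This is an illustrative alternative, not part of the skill's tooling: `compare_scores` is a hypothetical helper, and unlike the jq version it keeps B runs that have no match in A, reporting a_score as None.

```python
def get_score(run, metric):
    """Return the numeric score for `metric`, or None when it is absent."""
    evaluation = (run.get("evaluations") or {}).get(metric) or {}
    return evaluation.get("score")


def compare_scores(runs_a, runs_b, metric="correctness"):
    """Join two run lists on example_id and report both scores plus the delta."""
    a_by_id = {run["example_id"]: get_score(run, metric) for run in runs_a}
    rows = []
    for run in runs_b:
        example_id = run["example_id"]
        a_score = a_by_id.get(example_id)
        b_score = get_score(run, metric)
        both = a_score is not None and b_score is not None
        rows.append({
            "example_id": example_id,
            "a_score": a_score,
            "b_score": b_score,
            "delta": b_score - a_score if both else None,
        })
    return rows
```

With a.json and b.json loaded through `json.load`, `compare_scores(a, b)` yields one row per run in experiment B.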
# Count by label for experiment A
jq '[.[] | .evaluations.correctness.label] | group_by(.) | map({label: .[0], count: length})' a.json
jq -s '
[.[0][] | select(.evaluations.correctness.label == "correct")] as $passed_a |
[.[1][] | select(.evaluations.correctness.label != "correct") |
select(.example_id as $id | $passed_a | any(.example_id == $id))
]
' a.json b.json
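The regression query above (examples that passed in A but not in B) can be expressed in Python as a set membership test. This is a sketch mirroring the jq logic; `find_regressions` is a hypothetical helper, and "correct" as the passing label follows the examples in this document.

```python
def get_label(run, metric):
    """Return the label for `metric`, or None when it is absent."""
    evaluation = (run.get("evaluations") or {}).get(metric) or {}
    return evaluation.get("label")


def find_regressions(runs_a, runs_b, metric="correctness", passing="correct"):
    """Return runs from B that are not passing although the same example passed in A."""
    passed_in_a = {run["example_id"] for run in runs_a
                   if get_label(run, metric) == passing}
    return [run for run in runs_b
            if get_label(run, metric) != passing and run["example_id"] in passed_in_a]
```

As in the jq version, a run in B with no label at all counts as not passing.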
Statistical significance note: Score comparisons are most reliable with ≥ 30 examples per evaluator. With fewer examples, treat the delta as directional only — a 5% difference on n=10 may be noise. Report sample size alongside scores: jq 'length' a.json.
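One way to follow this note is to report the sample size next to every mean and flag small samples. The sketch below assumes runs are loaded as lists of dicts; `summarize` and `compare_means` are hypothetical helpers, and the threshold of 30 comes from the note above. Runs without a score are skipped here, whereas add / length in jq divides by the count of all runs.

```python
from statistics import mean

MIN_RELIABLE_N = 30  # threshold from the note above


def summarize(runs, metric="correctness"):
    """Return sample size, mean score, and whether the sample is too small to trust."""
    scores = []
    for run in runs:
        evaluation = (run.get("evaluations") or {}).get(metric) or {}
        if evaluation.get("score") is not None:
            scores.append(evaluation["score"])
    return {
        "n": len(scores),
        "mean": mean(scores) if scores else None,
        "directional_only": len(scores) < MIN_RELIABLE_N,
    }


def compare_means(runs_a, runs_b, metric="correctness"):
    """Compare mean scores of two experiments and carry the small-sample flag along."""
    a, b = summarize(runs_a, metric), summarize(runs_b, metric)
    both = a["mean"] is not None and b["mean"] is not None
    return {
        "a": a,
        "b": b,
        "delta": b["mean"] - a["mean"] if both else None,
        "directional_only": a["directional_only"] or b["directional_only"],
    }
```

When `directional_only` is true, the delta is best reported as a direction rather than a measured improvement.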
- ax experiments list --dataset-id DATASET_ID: find experiments
- ax experiments export EXPERIMENT_ID: download to file
- jq '.[] | {example_id, score: .evaluations.correctness.score}' experiment_*/runs.json

# Count runs
ax experiments export EXPERIMENT_ID --stdout | jq 'length'
# Extract all outputs
ax experiments export EXPERIMENT_ID --stdout | jq '.[].output'
# Get runs with low scores
ax experiments export EXPERIMENT_ID --stdout | jq '[.[] | select(.evaluations.correctness.score < 0.5)]'
# Convert to CSV
ax experiments export EXPERIMENT_ID --stdout | jq -r '.[] | [.example_id, .output, .evaluations.correctness.score] | @csv'
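For cases where jq is unavailable, the CSV conversion can be sketched with Python's standard csv module. This is an illustrative equivalent, not part of the ax CLI: `runs_to_csv` is a hypothetical helper, and unlike the jq one-liner it writes a header row.

```python
import csv
import io


def runs_to_csv(runs, metric="correctness"):
    """Render example_id, output, and the metric's score as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["example_id", "output", "score"])
    for run in runs:
        evaluation = (run.get("evaluations") or {}).get(metric) or {}
        # csv.writer renders None as an empty field, so missing scores stay blank.
        writer.writerow([run.get("example_id"), run.get("output"), evaluation.get("score")])
    return buffer.getvalue()
```

The csv module quotes any output that contains commas, quotes, or newlines, so free-form model outputs stay in a single column.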
arize-dataset first
arize-prompt-optimization
arize-trace
arize-link

| Problem | Solution |
|---------|----------|
| ax: command not found | See references/ax-setup.md |
| 401 Unauthorized | API key is wrong, expired, or doesn't have access to this space. Fix the profile using references/ax-profiles.md. |
| No profile found | No profile is configured. See references/ax-profiles.md to create one. |
| Experiment not found | Verify experiment ID with ax experiments list |
| Invalid runs file | Each run must have example_id and output fields |
| example_id mismatch | Ensure example_id values match IDs from the dataset (export dataset to verify) |
| No runs found | Export returned empty -- verify experiment has runs via ax experiments get |
| Dataset not found | The linked dataset may have been deleted; check with ax datasets list |
See references/ax-profiles.md § Save Credentials for Future Use.