skills/quality-engineer/SKILL.md
Test and evaluate AI assistant behavior. Create test cases, run evaluations, analyze results, simulate conversations, and compare playbook versions. Use when asked to test an assistant, create QA scenarios, run evals, check assertion pass rates, or verify assistant behavior.
npx skillsauth add studiochat/skills quality-engineer
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Create test cases, run evaluations, and simulate conversations to verify AI assistant behavior. All API calls are authenticated automatically via environment variables. The API base URL (https://api.studiochat.io) is hardcoded in the scripts.
Assistants and playbooks are the same concept. In the API, the term "playbook" is used everywhere — but users refer to them as "assistants," "bots," or "agents." When the user mentions any of these, they mean a playbook.
Playbook IDs:
- playbook_base_id: stable ID across all versions of an assistant. Use this for case management.
- playbook_id: ID of a specific version. Use this when triggering a run (you choose which version to test).

Set the following environment variables before using the scripts:
export STUDIO_API_TOKEN="sbs_your_api_key_here"
export STUDIO_PROJECT_ID="your-project-uuid"
API keys are available by request from the Studio Chat team at [email protected].
The typical request looks like: "they complained about example X, investigate and fix it". The skill is built around that flow: reproduce the complaint, understand what drove the behaviour, fix it via in-memory overrides, validate, and only then persist a regression test.
Before touching the assistant: find the actual conversation the customer is complaining about. The data-expert skill is the right tool for this — it knows how to query the conversations API, download a specific conversation, and pull the messages + events + tool calls that happened in production. Get the conversation_id, the playbook version that was active, and the user context that was passed in. Don't try to diagnose from a screenshot or a vague summary — read what actually happened.
To know why the assistant did what it did, you need to understand how Studio Chat assembles the agent at runtime:
| Layer | What it is | How it loads |
|---|---|---|
| Playbook instructions (content) | The base system prompt for the agent. Free-form text — sets persona, tone, hard rules. | Stored on the saved Playbook row; injected into the system prompt at compile time. |
| Skills (casuísticas) | Per-scenario instruction blocks the agent loads on demand via the load_skill tool. Each skill has name, description, and content. Only name + description are surfaced to the LLM during skill discovery — the content is fetched only when load_skill is called. | Stored on the playbook version; the compiler attaches them to the agent. |
| Knowledge bases (kb_ids) | Indexed snippets the agent searches via search_knowledge_base. Returns ranked passages with citation IDs ([[abc12]]) inlined in the assistant's response. | Configured on the playbook; the agent calls search_knowledge_base when it needs grounded info. |
| API tools / toolkits | Real-world actions: Composio integrations (SLACK_SEND_MESSAGE, CAL_POST_NEW_BOOKING_REQUEST, GMAIL_*), custom API tools, custom toolkits. | Registered on the playbook's api_tools / integrations; the wrapper dispatches calls at runtime. |
| Examples | Reference conversations the agent learns from. | Stored on the playbook; injected into the system prompt. |
| Enrichment tools | Run BEFORE the agent's first turn (e.g. fetch user profile, lookup order). Their result lands in context.enrichment. | Configured via enrichment_tool_ids; not mockable through tool_mocks. |
Whatever the assistant did wrong, the root cause lives in one (or a few) of these layers. The triage question is: was the instruction wrong? Was a skill missing or loading the wrong content? Was the KB result irrelevant? Did a tool return unexpected data? Use qa.py chat --verbose to see the full picture of each turn (see Step 4 below).
Before you chat with the assistant for the first time and before you trigger any eval run, you MUST ask the user whether any tools should be mocked. This is not optional. Reasons:
When you ask, enumerate the tools that are mockable for this playbook so the user knows what's available. The list is the playbook's api_tools + the project's configured Composio/custom toolkits + the built-in tools the agent always has (search_knowledge_base, load_skill, list_agents, list_teams, list_kbs). Don't enumerate the events (they aren't tools) — see "What's NOT mockable" below.
Example prompt to the user:
"Before chatting / before running the eval: do you want to mock any tools? The ones available on this assistant are:
- Composio: SLACK_SEND_MESSAGE, CAL_POST_NEW_BOOKING_REQUEST, GMAIL_SEND_EMAIL (whichever the project has)
- Playbook API tools: lookup_order, process_refund, create_ticket (whichever are in api_tools)
- Built-ins: search_knowledge_base, load_skill

To reproduce the complaint, you may want to mock lookup_order with an error payload or search_knowledge_base with a specific snippet. Any particular case?"
If the user says "none, just go ahead", proceed without mocks. But don't skip the question.
# Reproduce the complaint exactly. Override the instructions OR skills if
# you already have a hypothesis about the fix; mock the tools that drove
# the failing behaviour.
python3 scripts/qa.py chat PLAYBOOK_BASE_ID \
--message "Quiero un reembolso del pedido ORD-99999" \
--conversation-id qa_repro_001 \
--tool-mocks-file ./mocks/refund-not-found.json \
--verbose
Read the WHOLE response, not just the message. qa.py chat prints to stderr a structured breakdown:
| Section | What to look at |
|---|---|
| Events | The message the user would see, but also label / note / handoff_agent / priority events the assistant emitted. A [note] is a private note (invisible to the customer but written to the conversation). A [handoff_agent] or [handoff_team] means the assistant gave up. |
| Tool calls | Every tool invoked, with arguments and result. load_skill tells you which skill the agent picked up — if the wrong one fired, the description text on that skill is wrong (the LLM picks skills off name+description, not content). search_knowledge_base shows the query and which articles came back — if the result is irrelevant, the KB content or chunking is the problem. [MOCKED] after a tool name means the response came from your mock, not the real tool. |
| Citations | Which KB articles the agent actually quoted. If the assistant cites the wrong article, either the article content is wrong or the search ranking is. |
| Explanation | Agent's own reasoning summary. Useful for catching subtle path choices ("decided to escalate because user asked twice"). |
Iterate by tweaking --instructions / --skills-file / --tool-mocks-file until the fix works. Nothing is persisted — no version bump, no chatlog pollution.
Before persisting a new test case, validate the case definition itself with qa.py dry-run start. This runs the simulator + judge ONCE against an unsaved EvalCaseCreate payload (with the same playbook_override if you're still iterating on the variant). Catches problems like a too-vague termination that never fires, or an assertion the LLM judge can't grade. Nothing is written to eval_cases or eval_runs.
Once the fix works and the case definition is sound, save it and re-run just that case against the variant (or against the new saved playbook version, if you've promoted the override).
# Save the case
python3 scripts/qa.py cases create PLAYBOOK_BASE_ID --body '{
"name": "refund-order-not-found-handoff",
"scenario": "Customer asks for a refund for an order that the lookup API returns as not-found.",
"termination": "The assistant escalates to a human agent",
"max_turns": 5,
"assertions": [
{"criteria": "The assistant does not fabricate an order status"},
{"type": "handoff"}
],
"tool_mocks": {
"lookup_order": {"match_kind": "any", "error": "Order not found"}
}
}'
# Run only this case (ignores is_enabled — works even on disabled cases
# while you're still iterating)
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID \
--playbook-id VERSION_ID \
--case-ids NEW_CASE_ID
The case carries its own tool_mocks so future runs reproduce the exact failure mode deterministically without you having to specify mocks again.
| Question | Mechanism |
|---|---|
| "What did the assistant actually do in prod?" | data-expert skill — pull the real conversation |
| "Does changing the prompt fix it?" | qa.py chat --instructions ... |
| "Does a different skill fire?" | qa.py chat --skills-file ... |
| "What does the assistant do when this tool returns X?" | qa.py chat --tool-mocks-file ... |
| "Is my new case definition gradable?" | qa.py dry-run start --case ... |
| "Did the fix pass without breaking the rest?" | qa.py runs create (no --case-ids) |
| "Just re-run THIS one case quickly" | qa.py runs create --case-ids ID |
# List test cases
python3 scripts/qa.py cases list PLAYBOOK_BASE_ID
# Create a single test case
python3 scripts/qa.py cases create PLAYBOOK_BASE_ID --body '{...}'
# Create multiple test cases at once
python3 scripts/qa.py cases batch PLAYBOOK_BASE_ID --body '{"cases": [...]}'
# Delete a test case
python3 scripts/qa.py cases delete CASE_ID
# Trigger an eval run (all enabled cases, playbook's default models)
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID --playbook-id VERSION_ID [--context '{}']
# Trigger an eval run against UNSAVED instructions / skills (no version bumped)
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID --playbook-id VERSION_ID \
--instructions-file ./draft-prompt.md \
--skills-file ./draft-skills.json
# Trigger a run on a SUBSET of cases (ignores is_enabled — disabled cases included)
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID --playbook-id VERSION_ID \
--case-ids CASE_ID_1,CASE_ID_2
# Trigger a run with model overrides + parallelism
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID --playbook-id VERSION_ID \
--model openai-direct/gpt-4o-mini \
--simulator-model anthropic/claude-sonnet-4 \
--judge-model openai/gpt-4o \
--concurrency 4
# List eval runs
python3 scripts/qa.py runs list PLAYBOOK_BASE_ID
# Check run status
python3 scripts/qa.py runs status RUN_ID
# Get run results (per-case details)
python3 scripts/qa.py runs results RUN_ID
# Cancel a running eval
python3 scripts/qa.py runs cancel RUN_ID
# Chat with an assistant (simulate a conversation)
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Hello, I need help"
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Follow up" --conversation-id CONV_ID
# Chat with UNSAVED instructions to iterate quickly (no version bumped)
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Hi" \
--instructions "Reply in English. Be very concise." \
--skills-file ./draft-skills.json
# Chat with MOCKED tool responses (stub Slack / KB / API tools — admin only)
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Quiero un reembolso de ORD-99999" \
--tool-mocks-file ./mocks.json
# Dry-run a candidate eval case WITHOUT persisting it (validate the case
# definition before committing it via `cases create`)
python3 scripts/qa.py dry-run start PLAYBOOK_BASE_ID --playbook-id VERSION_ID \
--case '{"name":"poc","scenario":"...","termination":"...","assertions":[{"criteria":"..."}]}'
python3 scripts/qa.py dry-run status DRY_RUN_ID
python3 scripts/qa.py dry-run cancel DRY_RUN_ID
See references/api-reference.md for complete endpoint specs.
A test case defines what to test and how to judge:
| Field | Required | Description |
|-------|----------|-------------|
| name | Yes | Unique name (lowercase with dashes, e.g., refund-happy-path) |
| scenario | Yes | Instructions for the simulated user (what they want, how they behave) |
| termination | Yes | Expected outcome — the simulator checks this after each turn |
| first_message | No | Exact first message. If omitted, the LLM generates one from the scenario |
| max_turns | No | Max conversation turns (1-50, default: 10) |
| assertions | No | Criteria to evaluate after the conversation (LLM-as-judge) |
| assertion_tags | No | Tags to check — verifies the assistant applied specific tags |
For each test case, the system:
first_message or an LLM-generated message based on the scenario

You can pass a user_context dict when triggering a run. This context is forwarded to the assistant agent, simulating a real user with specific attributes:
{
"user_context": {
"email": "[email protected]",
"plan": "premium",
"account_id": "acc-123"
}
}
The assistant sees this context exactly as it would in a real conversation.
By default a run executes every enabled case for the playbook. Two ways to narrow it:
- is_enabled flag (persistent): toggle via the UI or PATCH /eval-cases/{id}. Permanently skips a case across all runs.
- case_ids per-run (ephemeral): pass --case-ids on runs create. Only those cases execute and is_enabled is ignored, so you can pick a single disabled case while iterating without flipping flags on the rest.

# Run just two specific cases (works even if they're disabled)
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID --playbook-id VERSION_ID \
--case-ids c1f8...,c2d4...
Empty --case-ids ⇒ 400. Unknown IDs ⇒ 404. Omit the flag for the historical "all enabled" behaviour.
qa.py dry-run start runs the simulator + judge ONCE against an unsaved EvalCaseCreate payload — same pipeline as a real eval run, no rows written to eval_cases or eval_runs. Use this to:
termination fire? are the assertions gradable?).

State lives in memory for ~30 minutes and is polled by dry-run ID:
DRY_RUN_ID=$(python3 scripts/qa.py dry-run start PLAYBOOK_BASE_ID \
--playbook-id VERSION_ID \
--case '{
"name": "refund-poc",
"scenario": "User asks for a refund for ORD-12345.",
"termination": "The assistant confirms the refund will be processed",
"max_turns": 6,
"assertions": [{"criteria": "The assistant asks for the order number"}]
}' \
--instructions "You are a refund specialist. Always ask for the order number first." \
| jq -r .dry_run_id)
# Poll until completed / failed / cancelled
python3 scripts/qa.py dry-run status $DRY_RUN_ID
# or bail out if it's clearly going wrong
python3 scripts/qa.py dry-run cancel $DRY_RUN_ID
If the dry-run passes and the conversation looks right, persist the case via cases create and move to Step 6 of the QA practice workflow.
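The polling step above can be scripted. The following is a minimal sketch, not part of the skill's scripts: it assumes the status payload is a JSON object with a `status` key whose terminal values are the three named above (completed, failed, cancelled). The helper name `wait_for_dry_run` and the `fetch_status` callable are hypothetical.

```python
import time
from typing import Callable, Dict

# Terminal states named in the text; the "status" key itself is an assumption.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_dry_run(
    fetch_status: Callable[[], Dict],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch_status() until it reports a terminal state or the timeout elapses."""
    waited = 0.0
    while True:
        payload = fetch_status()
        if payload.get("status") in TERMINAL_STATES:
            return payload
        if waited >= timeout_s:
            raise TimeoutError(
                f"dry-run still {payload.get('status')!r} after {waited:.0f}s"
            )
        sleep(interval_s)
        waited += interval_s
```

In practice `fetch_status` could wrap a call to `qa.py dry-run status DRY_RUN_ID` and parse its output, provided the command prints JSON. Keep the timeout well under the roughly 30 minutes the state stays in memory.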
Three independent model knobs control different LLM calls during an eval. --model defaults to the playbook's configured model; the simulator and judge default to the eval-system environment values. All three accept the same OpenRouter-compatible syntax.
| Flag | What it controls | Default |
|---|---|---|
| --model | The assistant LLM (the playbook agent under test) | Playbook's configured model |
| --simulator-model | The LLM that role-plays the user in each case | anthropic/claude-sonnet-4 (env: EVAL_SIMULATOR_MODEL) |
| --judge-model | The LLM that grades text-type assertions. Structured assertions (tool_called, handoff_to_agent, etc.) run deterministic checks and ignore this. | openai/gpt-4o (env: EVAL_EVALUATOR_MODEL) |
Syntax (same for all three flags):
- provider/model_id selects a single model. Examples: openai-direct/gpt-4o-mini, anthropic/claude-sonnet-4.
- primary{timeout}fallback tries the primary first; on timeout (seconds) it falls back. Example: groq/llama-3.3-70b-versatile{8}openai-direct/gpt-4o-mini.
- modelA:50,modelB:50 defines an A/B experiment. Percentages must sum to 100; cases are hash-assigned to a variant by conversation_id.

Bad input ⇒ 422 at the API edge (e.g. percentages that don't sum to 100, or a bare gpt-4o-mini without provider prefix). Empty/whitespace ⇒ field ignored.
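To catch a malformed model string before it reaches the API, a local pre-check along these lines may help. This is a sketch of the three documented forms only, not the server's validator. The function name `parse_model_spec` and the returned dict shape are hypothetical, and it assumes the provider-prefix rule applies to every slug, including A/B variants.

```python
import re
from typing import Dict, List

# primary{timeout}fallback, e.g. groq/llama-3.3-70b-versatile{8}openai-direct/gpt-4o-mini
_FALLBACK = re.compile(
    r"^(?P<primary>[^{}]+)\{(?P<timeout>\d+(?:\.\d+)?)\}(?P<fallback>[^{}]+)$"
)


def _require_provider(slug: str) -> str:
    """Reject bare slugs such as 'gpt-4o-mini' that lack a provider prefix."""
    slug = slug.strip()
    provider, sep, model = slug.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"missing provider prefix: {slug!r}")
    return slug


def parse_model_spec(spec: str) -> Dict:
    """Classify a model string as single, fallback, or A/B split."""
    spec = spec.strip()
    if not spec:
        return {"kind": "ignored"}  # empty/whitespace: field ignored

    match = _FALLBACK.match(spec)
    if match:
        return {
            "kind": "fallback",
            "primary": _require_provider(match["primary"]),
            "timeout_s": float(match["timeout"]),
            "fallback": _require_provider(match["fallback"]),
        }

    if "," in spec:
        variants: List[Dict] = []
        for part in spec.split(","):
            slug, sep, pct = part.rpartition(":")
            if not sep or not pct.strip().isdigit():
                raise ValueError(f"expected model:percent, got {part!r}")
            variants.append({"model": _require_provider(slug), "percent": int(pct)})
        if sum(v["percent"] for v in variants) != 100:
            raise ValueError("A/B percentages must sum to 100")
        return {"kind": "ab", "variants": variants}

    return {"kind": "single", "model": _require_provider(spec)}
```

A `ValueError` here corresponds to input that the text says would be rejected with a 422; the server remains the source of truth for which slugs exist.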
OpenRouter's catalog is strict; invented slugs will 422. These are the slugs actually in use across the Studio Chat stack:
Anthropic — Claude:
| Slug | Notes |
|---|---|
| anthropic/claude-sonnet-4.6 | Newest Sonnet. Default for the assistant in most scenarios. |
| anthropic/claude-sonnet-4.5 | One rev behind 4.6. |
| anthropic/claude-sonnet-4 | Eval-system default for the simulator (EVAL_SIMULATOR_MODEL). |
| anthropic/claude-3.5-sonnet | Stable older Sonnet; cheap baseline for diffs. |
| anthropic/claude-haiku-4.5 | Newest Haiku — fast / cheap. Good for high-volume runs or the simulator when latency matters more than nuance. |
OpenAI — GPT:
| Slug | Notes |
|---|---|
| openai/gpt-5.4 | Newest GPT-5 flagship. |
| openai/gpt-5.4-mini | GPT-5 mini. Supports [reasoning=…] suffix (see below). |
| openai/gpt-5.2-chat | Stable GPT-5 chat variant. |
| openai/gpt-4.1-mini | Solid mid-tier. |
| openai/gpt-4.1-nano | Smallest GPT-4.1 — cheap. |
| openai/gpt-4o | GPT-4o via the OpenRouter pool. Default for the judge (EVAL_EVALUATOR_MODEL). |
| openai/gpt-4o-mini | Cheap judge / assistant. |
| openai-direct/gpt-4o | Same model via the direct OpenAI provider (skips OpenRouter pool — lower latency, different billing). |
| openai-direct/gpt-4o-mini | Direct-provider 4o-mini. |
Google — Gemini:
| Slug | Notes |
|---|---|
| google/gemini-2.5-flash | Newest stable Flash. |
| google/gemini-2.5-flash-lite | Cheaper Flash variant. |
| google/gemini-2.0-flash-001 | Previous Flash generation. |
| google/gemini-3-flash-preview | Gemini 3 Flash preview — may change. |
Gemini caveat: there's a known tool-calling bias in this codebase (docs/gemini-tool-call-bias.md). Prefer Sonnet for the assistant when the playbook leans heavily on tools.
OpenAI reasoning-capable models accept an optional [reasoning=X] suffix. Valid efforts: none, low, medium, high, xhigh. none disables reasoning entirely. Example:
openai/gpt-5.4-mini[reasoning=medium]
openai/gpt-5.2-chat[reasoning=none]
The suffix composes with the other syntactic forms.
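A small helper can separate the suffix from the slug and check the effort against the five values listed above. This is an illustrative sketch; `split_reasoning` is a hypothetical name and the backend may parse the suffix differently.

```python
import re
from typing import Optional, Tuple

# Efforts listed in the text above.
VALID_EFFORTS = ("none", "low", "medium", "high", "xhigh")
_SUFFIX = re.compile(r"^(?P<slug>.+?)\[reasoning=(?P<effort>[^\]]*)\]$")


def split_reasoning(slug: str) -> Tuple[str, Optional[str]]:
    """Return (base_slug, effort); effort is None when no suffix is present."""
    slug = slug.strip()
    match = _SUFFIX.match(slug)
    if not match:
        return slug, None
    effort = match["effort"]
    if effort not in VALID_EFFORTS:
        raise ValueError(f"invalid reasoning effort {effort!r}")
    return match["slug"], effort
```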
--concurrency (1..5, default 1) fans cases out across a server-side thread pool. 1 keeps the sequential walk; higher values are useful when the case suite is large but watch for the simulator/agent provider's rate limits — 429s surface in EvalResult.error_message. Recommended 3–5 for ad-hoc runs.
The QA pain point: every prompt or skill change normally bumps a saved playbook version, which means a version-history littered with throwaway revisions and an approval step in front of every iteration. The --instructions, --skills-file, --examples-file, --kb-ids, and --api-tools flags on chat and runs create let you test in-memory replacements without persisting anything.
| Flag | Replaces… |
|---|---|
| --instructions TEXT / --instructions-file FILE | Main playbook instructions (the free-text content). |
| --skills-file FILE | The full set of skills. Pass [] to disable all skills. |
| --examples-file FILE | Global reference examples. |
| --kb-ids id1,id2 | Knowledge base IDs (pass "" to disable all KBs). |
| --api-tools t1,t2 | API tool IDs (pass "" to disable all tools). |
Rules:
sbs_ / kps_ tokens the skill already uses).

--skills-file accepts two shapes. Pick the one that matches what you want to do.
Drop the saved playbook's skills entirely and use exactly these:
[
{
"name": "refund-flow",
"description": "Handle refund requests with order id verification",
"content": "First ask for the order id. Then check eligibility..."
},
{
"name": "english-only",
"description": "Force English replies",
"content": "Reply only in English regardless of customer language."
}
]
Pass [] to disable all skills.
add / replace / remove)

Keep most of the saved skills and only modify a few. Operators are applied in order: remove → replace → add:
{
"remove": ["legacy-skill-a", "legacy-skill-b"],
"replace": [
{
"name": "refund-flow",
"description": "Refund handling, tightened policy",
"content": "ASK for order id BEFORE confirming any refund..."
}
],
"add": [
{
"name": "english-only",
"description": "Force English replies",
"content": "Reply only in English regardless of customer language."
}
]
}
Strict-validation rules (the BE returns 422 if violated, before any LLM call):
- remove of a name that isn't on the saved playbook → 422.
- replace of a name that isn't on the saved playbook → 422 (use add instead).
- add of a name that already exists (after remove ran) → 422 (use replace instead).
- remove: [X] + add: [{name:X, ...}] of the same name is allowed: after remove drops the saved row, the slot is free for add.
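The operator order and validation rules can be modelled locally to sanity-check a patch file before sending it. This is a sketch of the documented semantics, not the backend code. `apply_skills_patch` is a hypothetical name, `ValueError` stands in for the 422, and it assumes replace is checked against the skill set as it stands after remove has run.

```python
from typing import Dict, List


def apply_skills_patch(saved: List[Dict], patch: Dict) -> List[Dict]:
    """Apply remove -> replace -> add to a list of skill dicts keyed by name."""
    skills = {s["name"]: s for s in saved}  # dicts preserve insertion order

    for name in patch.get("remove", []):
        if name not in skills:
            raise ValueError(f"remove: unknown skill {name!r}")
        del skills[name]

    for skill in patch.get("replace", []):
        if skill["name"] not in skills:
            raise ValueError(f"replace: unknown skill {skill['name']!r} (use add)")
        skills[skill["name"]] = skill

    for skill in patch.get("add", []):
        if skill["name"] in skills:
            raise ValueError(f"add: skill {skill['name']!r} exists (use replace)")
        skills[skill["name"]] = skill

    return list(skills.values())
```

Running a draft patch through a model like this against the saved skill names shows which operator would be rejected and why.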
Both shapes share the same skill object shape (matches SkillOverrideInput on the BE). Note: only the fields the LLM actually reads during skill discovery are accepted here — trigger lives on the saved-version skill row but is never injected into the system prompt, so the override endpoint omits it.
| Field | Required | Notes |
|---|---|---|
| name | yes | Unique identifier (kebab-case recommended). Matched by name in replace/remove. |
| description | yes | Short summary shown to the agent during skill discovery. |
| content | yes | Full instructions. Macros like {{ kb(<id>) }} and {{ tool(<id>) }} are expanded — the referenced KB / API tool must already exist in the project (the override doesn't create them). |
| examples | no | Optional list of reference conversation examples. |
| order | no | Display/listing order. Auto-assigned if omitted. |
# Quick prompt tweak via inline flag
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Hi" \
--instructions "Always reply in English, be terse."
# Full file-based override for chat
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "I want a refund" \
--instructions-file ./draft-prompt.md \
--skills-file ./draft-skills.json \
--examples-file ./draft-examples.json \
--conversation-id qa_iter_001
# Disable all skills to test the bare prompt
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Hello" \
--skills-file <(echo '[]')
# Surgically modify the saved skills (patch shape)
# ./skills-patch.json:
# { "remove": ["old-flow"], "add": [{"name": "english-only", ...}] }
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Hello" \
--skills-file ./skills-patch.json
# Run the full eval suite against an unsaved prompt
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID \
--playbook-id PB_VERSION_ID \
--instructions-file ./candidate-prompt.md
Both endpoints accept a playbook_override object on the request body. The CLI builds it for you, but if you need to call the API directly:
Full-replace form (list-shape skills):
{
"conversation_id": "qa_iter_001",
"user_message": "...",
"playbook_override": {
"content": "...full instructions...",
"skills": [
{"name": "...", "description": "...", "content": "..."}
],
"examples": [],
"kb_ids": ["kb-uuid-1"],
"api_tools": []
}
}
Surgical-patch form (object-shape skills):
{
"conversation_id": "qa_iter_001",
"user_message": "...",
"playbook_override": {
"skills": {
"remove": ["old-flow"],
"replace": [{"name": "refund-flow", "description": "...", "content": "..."}],
"add": [{"name": "english-only", "description": "...", "content": "..."}]
}
}
}
Any subset of these keys is valid — omitted keys keep the saved playbook value.
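When calling the API directly, the important detail is the difference between omitting a key (keep the saved value) and sending an empty list (disable that layer). A minimal sketch of a payload builder that preserves that distinction follows; `build_playbook_override` and the `_UNSET` sentinel are hypothetical helpers, not part of the CLI.

```python
from typing import Any, Dict

# Sentinel meaning "caller did not pass this key", distinct from None or [].
_UNSET: Any = object()


def build_playbook_override(
    content: Any = _UNSET,
    skills: Any = _UNSET,
    examples: Any = _UNSET,
    kb_ids: Any = _UNSET,
    api_tools: Any = _UNSET,
) -> Dict[str, Any]:
    """Include only the keys that were explicitly passed."""
    candidates = {
        "content": content,
        "skills": skills,
        "examples": examples,
        "kb_ids": kb_ids,
        "api_tools": api_tools,
    }
    return {key: value for key, value in candidates.items() if value is not _UNSET}
```

For example, `build_playbook_override(skills=[])` yields a payload that disables all skills while leaving instructions, examples, KBs, and API tools at their saved values.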
tool_mocks)

By default, an eval run or a chat call invokes real tools: it searches the real KB, hits the real API, sends the real Slack message. That's great for end-to-end coverage but bad for QA: it depends on real-world state, can have destructive side effects, and makes it hard to reproduce a customer complaint that involved a specific tool failure.
tool_mocks lets you stub specific tools with canned responses. Same wire shape and semantics in two places:
| Where | Field | When mocks apply |
|---|---|---|
| Per case (saved) | EvalCase.tool_mocks | Every time the case runs in any eval |
| Per chat call (ephemeral) | PlaybookChatRequest.tool_mocks | This chat request only — forced into preview+eval mode (admin only) |
Step 3 of the QA practice workflow makes asking the user about mocks mandatory before any chat or eval run. The rest of this section covers the shape + the rules.
Mockable (anything that dispatches through the agent's toolset wrapper):
- Composio actions: follow the TOOLKIT_ACTION uppercase convention. Examples: SLACK_SEND_MESSAGE, CAL_POST_NEW_BOOKING_REQUEST, CAL_CANCEL_BOOKING_VIA_UID, GMAIL_SEND_EMAIL. The exact list depends on which toolkits are configured in SUPPORTED_TOOLKITS for the project.
- API tools listed in the playbook's api_tools. Names match what the playbook author registered (typically snake_case).
- Custom toolkits registered in CUSTOM_TOOLKIT_REGISTRY.
- Built-in tools: search_knowledge_base, load_skill, list_agents, list_teams, list_kbs.

Tool names in the mock map must match ToolCallTrace.name exactly: uppercase for Composio, snake_case for the rest.
NOT mockable (not dispatched as tool calls):
- Events: message, note, label, priority, handoff_agent, handoff_team. These are items the LLM emits inside its structured output payload, parsed downstream by the chat handler; they never go through the wrapper. To validate them, use the matching structured assertions (HandoffAssertion, PrioritySetAssertion, TagAddedAssertion, PrivateNoteContainsAssertion).
- Enrichment tools (enrichment_tool_ids): run BEFORE the agent's first turn. The wrapper doesn't exist yet at that point in the request lifecycle.

tool_mocks lives on the case body. Each key is a tool name; each value is one rule or an ordered list of rules:
{
"name": "refund-when-eligible",
"scenario": "...",
"termination": "...",
"tool_mocks": {
"lookup_order": {
"match_kind": "any",
"return_value": {"order_id": "ORD-123", "status": "delivered", "days_since": 5}
},
"process_refund": {
"match_kind": "any",
"return_value": {"refund_id": "RFND-999", "amount": 49.99}
}
}
}
A rule fires only when its match_kind matches the call:
| match_kind | Fires when… | Required field |
|---|---|---|
| any | always | — |
| call_ordinal | the Nth call to this tool (1-indexed) | call_ordinal: int |
| args_match | call args ⊇ match_args (subset filter; missing keys are wildcards) | match_args: dict |
return_value or error

{
"tool_mocks": {
"send_email": {"match_kind": "any", "error": "SMTP unavailable"},
"lookup_user": {
"match_kind": "args_match",
"match_args": {"plan": "enterprise"},
"return_value": {"name": "Acme Corp", "tier": "enterprise"}
}
}
}
error raises a tool error inside the run (the assistant sees it the same way it would see a real API failure — useful for testing error handling). return_value is whatever JSON the tool would normally return.
Pass an array to express "specific case → fallback":
{
"tool_mocks": {
"lookup_order": [
{"match_kind": "args_match", "match_args": {"order_id": "ORD-123"}, "return_value": {"status": "delivered"}},
{"match_kind": "args_match", "match_args": {"order_id": "ORD-999"}, "return_value": {"status": "lost"}},
{"match_kind": "any", "error": "Order not found"}
]
}
}
Rules are evaluated top-to-bottom; the first match wins. The catch-all any rule at the end keeps the run safe if the assistant calls with an unexpected argument.
Use call_ordinal to return different values across consecutive calls:
{
"tool_mocks": {
"search_knowledge_base": [
{"match_kind": "call_ordinal", "call_ordinal": 1, "return_value": [{"title": "Refund policy", "snippet": "..."}]},
{"match_kind": "call_ordinal", "call_ordinal": 2, "return_value": [{"title": "Refund timeline", "snippet": "..."}]}
]
}
}
Once you mock a tool, the rule list is complete — every call to that tool during the run must match a rule. If the assistant calls lookup_order a third time and no rule matches, the run fails loudly with no mock matched call #3 for tool lookup_order.
This is intentional: silently falling through to the real implementation would let production state leak into eval runs and make failures impossible to reproduce. Always include an any catch-all (or extra call_ordinal rules) if you don't know exactly how many times the assistant will call a tool.
Tools you DON'T list in tool_mocks are unaffected — they call the real implementation as usual.
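The matching rules described in this section (three match kinds, top-to-bottom evaluation, first match wins, loud failure when nothing matches) can be sketched as follows. This is an illustrative model of the documented behaviour, not the wrapper's actual code. The names `resolve_mock`, `NoMockMatched`, and `MockedToolError` are hypothetical.

```python
from typing import Any, Dict, List, Union

Rule = Dict[str, Any]


class NoMockMatched(Exception):
    """Raised when a mocked tool is called and no rule applies."""


class MockedToolError(Exception):
    """Stands in for the tool error a rule with an 'error' field raises."""


def _matches(rule: Rule, args: Dict[str, Any], ordinal: int) -> bool:
    kind = rule["match_kind"]
    if kind == "any":
        return True
    if kind == "call_ordinal":
        return rule["call_ordinal"] == ordinal  # 1-indexed
    if kind == "args_match":
        # Subset filter: every key in match_args must be present and equal.
        return all(k in args and args[k] == v for k, v in rule["match_args"].items())
    raise ValueError(f"unknown match_kind {kind!r}")


def resolve_mock(
    tool: str, rules: Union[Rule, List[Rule]], args: Dict[str, Any], ordinal: int
) -> Any:
    """Return the first matching rule's return_value, or raise its error."""
    rule_list = rules if isinstance(rules, list) else [rules]
    for rule in rule_list:
        if _matches(rule, args, ordinal):
            if "error" in rule:
                raise MockedToolError(rule["error"])
            return rule.get("return_value")
    raise NoMockMatched(f"no mock matched call #{ordinal} for tool {tool}")
```

Walking a planned conversation through a model like this is a cheap way to confirm that every expected call has a rule, and that the catch-all sits last so it does not shadow the specific rules.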
tool_called and tool_not_called assertions both support args_match for "the assistant called X with these specific args" (or "must not call X with these args"). Combine with mocks to test fine-grained behavior:
{
"tool_mocks": {
"send_email": {"match_kind": "any", "return_value": {"sent": true}}
},
"assertions": [
{"type": "tool_called", "name": "send_email", "args_match": {"to": "[email protected]"}},
{"type": "tool_not_called", "name": "send_email", "args_match": {"to": "[email protected]"}}
]
}
The same tool_mocks shape works on qa.py chat via --tool-mocks-file, letting you stub tool responses for a single ad-hoc chat without authoring a case. Useful in Step 4 of the QA practice workflow when you're still hunting for the fix.
Semantics:
- --instructions / --skills-file.
- --instructions, so they compose cleanly.
- qa.py chat prints [MOCKED] next to the tool name in the Tool calls section so you can tell stubbed responses apart from real ones.

Example mocks.json reproducing a "lookup_order returns not-found, KB has no refund policy" scenario:
{
"lookup_order": {"match_kind": "any", "error": "Order not found"},
"search_knowledge_base": {
"match_kind": "any",
"return_value": []
}
}
Then chat:
python3 scripts/qa.py chat PLAYBOOK_BASE_ID \
--message "Quiero un reembolso del pedido ORD-99999" \
--tool-mocks-file ./mocks.json \
--verbose
Watch the Tool calls section for lookup_order [custom] [MOCKED] and verify the assistant's response handles the error gracefully (e.g. escalates instead of fabricating an order status).
Note: run-level tool_mocks on qa.py runs create is not supported; per-case tool_mocks lives on the case body itself. If you need deterministic mocks across a whole run, put them on each case (or use a --case-ids subset against cases that already carry mocks). The CLI rejects --tool-mocks-file on runs create with a clear error.
user_context)

A case can override the run-level user_context for its own scope. Case keys win over run keys; the special eval_overrides sub-namespace is shallow-merged.
{
"name": "vip-customer-flow",
"user_context": {
"plan": "enterprise",
"eval_overrides": {"agent_name": "PremiumBot"}
}
}
Use this when one case needs a different user identity, plan, or simulated state without affecting the rest of the run.
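One reading of the merge rule (case keys win, eval_overrides merged key by key rather than replaced wholesale) can be sketched like this. The function name `merge_user_context` is hypothetical, and the exact treatment of eval_overrides is an assumption drawn from the phrase "shallow-merged".

```python
from typing import Any, Dict


def merge_user_context(run_ctx: Dict[str, Any], case_ctx: Dict[str, Any]) -> Dict[str, Any]:
    """Case-level keys override run-level keys; eval_overrides merges one level deep."""
    merged = {**run_ctx, **case_ctx}
    run_overrides = run_ctx.get("eval_overrides")
    case_overrides = case_ctx.get("eval_overrides")
    if isinstance(run_overrides, dict) and isinstance(case_overrides, dict):
        # Assumed behaviour: keep run-level override keys the case does not set.
        merged["eval_overrides"] = {**run_overrides, **case_overrides}
    return merged
```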
The scenario tells the simulated user how to behave. Write it from the user's perspective:
Assertions live on EvalCase.assertions as a list of typed objects. There are two families:
- Text (type: "text"): graded by the LLM judge. Use for free-form claims about what the assistant said.
- Structured: deterministic checks against turn.events and turn.tool_calls. No LLM call. Faster, cheaper, and immune to judge flakiness, so prefer these whenever the check fits a structured shape.

The --judge-model flag affects only text assertions. Structured assertions ignore it.
Picking the right type: every time you reach for text to check something the assistant did (called a tool, handed off, set a priority, applied a tag, wrote a private note), there's a structured variant that does it deterministically. Use text only for what the assistant said (tone, content of the message, that it mentioned a specific policy).
text: LLM-as-judge

{"type": "text", "criteria": "The assistant mentions the 30-day refund policy"}
type defaults to text, so the legacy short form still works:
{"criteria": "The assistant asks for the order number before proceeding"}
Write the criteria as a clear, verifiable statement:
tool_called: a specific tool was invoked

{"type": "tool_called", "name": "search_knowledge_base", "min_count": 1}
Optional args_match narrows the match to calls whose arguments dict is a superset of the given keys/values:
{
"type": "tool_called",
"name": "SLACK_SEND_MESSAGE",
"args_match": {"channel": "#alerts"},
"min_count": 1
}
min_count defaults to 1 — bump it when you need to assert "called at least N times" (e.g., retrieved KB info twice during a long flow).
tool_not_called: a tool was never invoked

{"type": "tool_not_called", "name": "process_refund"}
With args_match, narrows to "must never be called with these args" (other calls to the same tool are ignored):
{
"type": "tool_not_called",
"name": "send_email",
"args_match": {"to": "[email protected]"}
}
Useful for negative tests: "the agent must NOT email the CEO."
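The deterministic check behind these two assertions amounts to counting calls that match by name and, optionally, by an argument subset. The sketch below illustrates that logic under an assumed trace shape: a list of dicts with `name` and `args` keys. The real ToolCallTrace fields may differ, and the function names are hypothetical.

```python
from typing import Any, Dict, List, Optional

ToolCall = Dict[str, Any]  # assumed shape: {"name": str, "args": dict}


def count_matching_calls(
    tool_calls: List[ToolCall], name: str, args_match: Optional[Dict[str, Any]] = None
) -> int:
    """Count calls to `name` whose args are a superset of args_match."""
    expected = args_match or {}
    return sum(
        1
        for call in tool_calls
        if call["name"] == name
        and all(k in call["args"] and call["args"][k] == v for k, v in expected.items())
    )


def tool_called(tool_calls, name, args_match=None, min_count: int = 1) -> bool:
    return count_matching_calls(tool_calls, name, args_match) >= min_count


def tool_not_called(tool_calls, name, args_match=None) -> bool:
    # With args_match, calls to the same tool with other args are ignored.
    return count_matching_calls(tool_calls, name, args_match) == 0
```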
tool_call_sequence: tools fired in a specific order
{
"type": "tool_call_sequence",
"names": ["lookup_order", "check_refund_eligibility", "process_refund"],
"strict": false
}
strict: false (default) means the listed tools must appear in order but other tool calls may interleave. strict: true means they must appear contiguously in the exact order — useful when you need to lock down "no extra calls between A and B."
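The difference between the two modes is the difference between an ordered subsequence and a contiguous run. A minimal sketch, assuming the tool calls are available as a plain list of names in firing order (the function names are hypothetical):

```python
def in_order(called, names):
    """Non-strict: names appear in order; other calls may interleave."""
    remaining = iter(called)
    # `name in remaining` consumes the iterator up to the match,
    # so each later name must be found after the previous one.
    return all(name in remaining for name in names)


def contiguous(called, names):
    """Strict: names appear as one unbroken run in exactly this order."""
    n = len(names)
    return any(called[i:i + n] == names for i in range(len(called) - n + 1))


def tool_call_sequence(called, names, strict=False):
    return contiguous(called, names) if strict else in_order(called, names)


# Hypothetical list of tool names in the order they fired.
called = ["lookup_order", "search_knowledge_base", "check_refund_eligibility", "process_refund"]
names = ["lookup_order", "check_refund_eligibility", "process_refund"]
```

Here `tool_call_sequence(called, names)` passes, while `strict=True` fails because `search_knowledge_base` sits between the first two expected tools.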
handoff: some handoff event was emitted (mode-agnostic)
{"type": "handoff"}
Matches either handoff_agent or handoff_team. Standalone accounts only ever see "the agent gave up" with no agent/team distinction — this is the only handoff assertion that makes sense there. Kaption accounts also have the more specific variants below.
handoff_to_agent: handoff to a specific agent (Kaption)
{"type": "handoff_to_agent", "agent_id": 162}
Or "any" for "any specific agent" (i.e., assert that a handoff_agent event fired with some agent_id):
{"type": "handoff_to_agent", "agent_id": "any"}
handoff_to_team: handoff to a specific team (Kaption)
{"type": "handoff_to_team", "team_id": 7}
no_handoff: assert the agent did NOT hand off
{"type": "no_handoff"}
Matches both agent and team handoffs. Useful when the playbook is supposed to resolve the issue end-to-end.
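How the four handoff assertions relate to each other can be sketched as simple predicates over the emitted events. The event type names `handoff_agent` and `handoff_team` come from the text above; the event shape (dicts with `type`, `agent_id`, `team_id`) and the function names are assumptions for illustration.

```python
HANDOFF_TYPES = {"handoff_agent", "handoff_team"}


def handoff(events):
    """Mode-agnostic: any agent or team handoff counts."""
    return any(e["type"] in HANDOFF_TYPES for e in events)


def no_handoff(events):
    return not handoff(events)


def handoff_to_agent(events, agent_id):
    agent_events = [e for e in events if e["type"] == "handoff_agent"]
    if agent_id == "any":
        # "any" only requires that some agent_id is present.
        return any(e.get("agent_id") is not None for e in agent_events)
    return any(e.get("agent_id") == agent_id for e in agent_events)


def handoff_to_team(events, team_id):
    return any(e["type"] == "handoff_team" and e.get("team_id") == team_id for e in events)


# Hypothetical events from one run (assumed shape).
events = [{"type": "label", "value": "billing"}, {"type": "handoff_agent", "agent_id": 162}]
```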
priority_set: priority event with a specific value
{"type": "priority_set", "value": "urgent"}
value must be one of urgent | high | medium | low. Pairs well with playbooks that triage by severity.
tag_added: a label event added a specific tag
{"type": "tag_added", "tag": "billing"}
The agent emits these as `label` events internally, but the user-facing term is "tag", so this assertion uses `tag_added` for consistency with the rest of the eval surface.
There's also a legacy assertion_tags: ["billing", "escalation"] field on the case body that asserts a list of tags in one shot. It still works for backwards compatibility, but new cases should prefer one tag_added assertion per tag — they show up individually in the diff view and per-assertion result rows.
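Moving an existing case off the legacy field is mechanical. The helper below is a hypothetical local sketch (not part of `qa.py`) that expands `assertion_tags` into one `tag_added` assertion per tag and leaves the original dict untouched.

```python
def migrate_assertion_tags(case):
    """Return a copy of the case with assertion_tags expanded into tag_added assertions."""
    migrated = {k: v for k, v in case.items() if k != "assertion_tags"}
    assertions = list(case.get("assertions", []))
    # Skip tags that already have a tag_added assertion.
    existing = {a.get("tag") for a in assertions if a.get("type") == "tag_added"}
    for tag in case.get("assertion_tags", []):
        if tag not in existing:
            assertions.append({"type": "tag_added", "tag": tag})
            existing.add(tag)
    migrated["assertions"] = assertions
    return migrated


legacy_case = {
    "name": "angry-customer-handoff",
    "assertions": [{"criteria": "The assistant escalates to a human agent"}],
    "assertion_tags": ["escalation"],
}
migrated_case = migrate_assertion_tags(legacy_case)
```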
private_note_contains: a private note's content contains a substring
{
"type": "private_note_contains",
"substring": "asignar a Lautaro",
"case_insensitive": true
}
Useful when the playbook is supposed to write specific context into a private note for the next human agent. case_insensitive defaults to true.
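The check is a plain substring match, not a regex or a fuzzy match. A minimal sketch, assuming private notes arrive as events with `type` set to `note` and the text in a `content` field (both field names are assumptions):

```python
def private_note_contains(events, substring, case_insensitive=True):
    """True when any note event's content contains the substring."""
    for event in events:
        if event.get("type") != "note":
            continue
        content = event.get("content", "")
        if case_insensitive:
            if substring.casefold() in content.casefold():
                return True
        elif substring in content:
            return True
    return False


# Hypothetical note event (assumed shape).
note_events = [{"type": "note", "content": "Cliente VIP: Asignar a Lautaro"}]
```

Because the match is literal, keep the substring short and stable (a name, an order number) rather than a full sentence the assistant might phrase differently.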
A single case can mix text and structured assertions freely. Structured assertions run first (deterministic, no LLM cost); text assertions run after.
{
"name": "refund-flow-eligible",
"scenario": "Customer with order ORD-12345 (5 days old) wants a refund.",
"termination": "The assistant confirms the refund will be processed",
"max_turns": 6,
"assertions": [
{"type": "tool_call_sequence", "names": ["lookup_order", "process_refund"]},
{"type": "tool_called", "name": "process_refund", "args_match": {"order_id": "ORD-12345"}},
{"type": "tool_not_called", "name": "human_handoff"},
{"type": "priority_set", "value": "low"},
{"type": "tag_added", "tag": "refund-completed"},
{"criteria": "The assistant confirms the refund amount in the response"}
]
}
A handoff-on-failure case for a standalone account:
{
"name": "refund-flow-not-eligible-handoff",
"scenario": "Customer wants a refund for a 60-day-old order. Policy is 30 days.",
"termination": "The assistant escalates to a human",
"max_turns": 5,
"assertions": [
{"type": "handoff"},
{"type": "tool_not_called", "name": "process_refund"},
{"type": "private_note_contains", "substring": "fuera de política de 30 días"},
{"criteria": "The assistant explains the 30-day policy to the customer"}
]
}
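Before submitting a case, a small local check can catch typos in assertion objects. The sketch below is hypothetical and not part of `qa.py`; the required fields are inferred from the examples in this section, and whether the API performs its own validation is not stated here.

```python
PRIORITIES = {"urgent", "high", "medium", "low"}

# Required fields per assertion type, as they appear in the examples above.
REQUIRED_FIELDS = {
    "text": ["criteria"],
    "tool_called": ["name"],
    "tool_not_called": ["name"],
    "tool_call_sequence": ["names"],
    "handoff": [],
    "handoff_to_agent": ["agent_id"],
    "handoff_to_team": ["team_id"],
    "no_handoff": [],
    "priority_set": ["value"],
    "tag_added": ["tag"],
    "private_note_contains": ["substring"],
}


def lint_assertions(assertions):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for i, assertion in enumerate(assertions):
        kind = assertion.get("type", "text")  # type defaults to text
        if kind not in REQUIRED_FIELDS:
            problems.append(f"assertion {i}: unknown type {kind!r}")
            continue
        for field in REQUIRED_FIELDS[kind]:
            if field not in assertion:
                problems.append(f"assertion {i}: {kind} is missing {field!r}")
        if kind == "priority_set" and "value" in assertion and assertion["value"] not in PRIORITIES:
            problems.append(f"assertion {i}: priority must be one of {sorted(PRIORITIES)}")
    return problems
```

For example, `lint_assertions([{"type": "priority_set", "value": "critical"}])` reports one problem, since `critical` is not among the four accepted priority values.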
| Question | Assertion |
|---|---|
| Did the assistant say X? | text (LLM judge) |
| Did the assistant call tool X? | tool_called |
| Did the assistant call tool X with these args? | tool_called + args_match |
| Did the assistant NOT call tool X? | tool_not_called |
| Did the assistant NOT call tool X with these args? | tool_not_called + args_match |
| Did tools fire in this order? | tool_call_sequence |
| Did the assistant give up? (standalone) | handoff / no_handoff |
| Did the assistant route to agent N? (Kaption) | handoff_to_agent |
| Did the assistant route to team N? (Kaption) | handoff_to_team |
| Did the assistant set conversation priority? | priority_set |
| Did the assistant apply tag X? | tag_added |
| Did the assistant write a private note containing X? | private_note_contains |
Reminder: events (`priority`, `label`, `note`, `handoff_*`) cannot be mocked with `tool_mocks` because they're not tool calls; they're items in the LLM's structured output. But they CAN be asserted with the structured assertions above. The two features are complementary: mocks shape the inputs the agent sees during the run; assertions verify the side effects (events + tool calls) it produced.
python3 scripts/qa.py cases create PLAYBOOK_BASE_ID --body '{
"name": "refund-happy-path",
"scenario": "The customer bought a product 5 days ago and wants a full refund. They have their order number ready (ORD-12345).",
"termination": "The assistant confirms the refund will be processed",
"first_message": "Hi, I want to return a product I bought last week",
"max_turns": 8,
"assertions": [
{"criteria": "The assistant asks for the order number"},
{"criteria": "The assistant confirms the refund amount"},
{"criteria": "The assistant mentions the expected refund timeline"}
]
}'
python3 scripts/qa.py cases batch PLAYBOOK_BASE_ID --body '{
"cases": [
{
"name": "greeting-basic",
"scenario": "A new customer visits for the first time and says hello.",
"termination": "The assistant greets the customer and offers help",
"max_turns": 3,
"assertions": [
{"criteria": "The assistant introduces itself"},
{"criteria": "The assistant asks how it can help"}
]
},
{
"name": "out-of-scope-question",
"scenario": "The customer asks about something completely unrelated to the business, like the weather or sports.",
"termination": "The assistant redirects to relevant topics or escalates",
"max_turns": 5,
"assertions": [
{"criteria": "The assistant does not make up an answer about unrelated topics"},
{"criteria": "The assistant politely redirects the conversation"}
]
},
{
"name": "angry-customer-handoff",
"scenario": "The customer is very angry about a delayed order. They curse and demand to speak to a manager. The order number is ORD-99999.",
"termination": "The assistant escalates to a human agent",
"max_turns": 6,
"assertions": [
{"criteria": "The assistant remains calm and professional"},
{"criteria": "The assistant attempts to help before escalating"},
{"criteria": "The assistant escalates to a human agent"}
],
"assertion_tags": ["escalation"]
},
{
"name": "pricing-inquiry",
"scenario": "The customer wants to know the pricing for the Pro plan and asks about discounts for annual billing.",
"termination": "The assistant provides pricing information",
"max_turns": 5,
"assertions": [
{"criteria": "The assistant provides the correct Pro plan price"},
{"criteria": "The assistant mentions annual billing discount if available"}
]
}
]
}'
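When a batch contains many near-identical cases, generating the body from Python avoids hand-editing a long JSON string. The sketch below only builds the argument list for the `cases batch` command shown above; the helper name and the generated cases are hypothetical, and the script is not invoked here.

```python
import json


def batch_command(playbook_base_id, cases):
    """Build the argv list for `qa.py cases batch` from Python dicts."""
    body = json.dumps({"cases": cases}, ensure_ascii=False)
    return ["python3", "scripts/qa.py", "cases", "batch", playbook_base_id, "--body", body]


# Hypothetical cases generated from a list of plan names.
generated_cases = [
    {
        "name": f"pricing-inquiry-{plan.lower()}",
        "scenario": f"The customer wants to know the pricing for the {plan} plan.",
        "termination": "The assistant provides pricing information",
        "max_turns": 5,
        "assertions": [{"criteria": f"The assistant provides the correct {plan} plan price"}],
    }
    for plan in ["Pro", "Enterprise"]
]
argv = batch_command("PLAYBOOK_BASE_ID", generated_cases)
# To execute: subprocess.run(argv, check=True)
```

Passing the command as a list rather than a shell string sidesteps quoting problems when a scenario contains apostrophes.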
# First, get the playbook version to test
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --list-versions
# Trigger a run against a specific version
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID \
--playbook-id PLAYBOOK_VERSION_ID
# Trigger with user context (simulate a specific user)
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID \
--playbook-id PLAYBOOK_VERSION_ID \
--context '{"email": "[email protected]", "plan": "enterprise"}'
# Poll status until completed (check every 15 seconds)
python3 scripts/qa.py runs status RUN_ID
# Get full results
python3 scripts/qa.py runs results RUN_ID -o eval_results.json
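The polling step can be wrapped in a small loop with a timeout. This is a sketch under assumptions: `fetch_status` is a callable you supply (for example, one that runs `runs status RUN_ID` and extracts the status, whose output format is not shown here), and only `completed` is a status named in this document; `failed` and `error` are assumed terminal states.

```python
import time


def wait_for_run(fetch_status, interval=15, timeout=1800, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status or the timeout elapses."""
    # "completed" comes from the text above; "failed" and "error" are assumptions.
    terminal = {"completed", "failed", "error"}
    waited = 0
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if waited >= timeout:
            raise TimeoutError(f"run still {status!r} after {waited}s")
        sleep(interval)
        waited += interval
```

The `sleep` parameter exists so the loop can be exercised in tests without actually waiting.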
Chat directly with a playbook without creating test cases. Useful for ad-hoc testing. The chat command shows the full picture of what the assistant did: skills loaded, KB articles searched, tool calls made, citations used, events emitted, and agent reasoning.
# Start a new conversation
python3 scripts/qa.py chat PLAYBOOK_BASE_ID \
--message "Hi, I need to cancel my subscription"
# Continue the same conversation (use the conversation_id from the response)
python3 scripts/qa.py chat PLAYBOOK_BASE_ID \
--message "My account email is [email protected]" \
--conversation-id conv_qa_12345
# Chat with user context
python3 scripts/qa.py chat PLAYBOOK_BASE_ID \
--message "Check my order status" \
--context '{"email": "[email protected]", "order_id": "ORD-555"}'
# Verbose mode — show full tool results and raw JSON response
python3 scripts/qa.py chat PLAYBOOK_BASE_ID \
--message "I want a refund" --verbose
The chat command prints a structured breakdown to stderr:
| Section | What it shows |
|---------|---------------|
| Events | Messages (stdout), labels, notes, handoffs, priority changes |
| Tool calls | Every tool the assistant invoked, with arguments and result summaries |
| Citations | KB articles referenced in the response, with source and snippet |
| Explanation | Agent reasoning summary (why it chose a particular path) |
Tool call details (always shown):
- `load_skill [custom]`: which skill was loaded (or not loaded)
- `search_knowledge_base [kb_search]`: query used, number of results, relevance scores, and content snippets

`--verbose` / `-v` expands tool results to 500 chars and dumps the full JSON response.
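import json

# Load the file written by `runs results RUN_ID -o eval_results.json`
with open("eval_results.json") as f:
    results = json.load(f)

total = len(results)
passed = sum(1 for r in results if r["status"] == "passed")
failed = sum(1 for r in results if r["status"] == "failed")
errored = sum(1 for r in results if r["status"] == "error")

# Guard against an empty run so the percentage never divides by zero
rate = passed / total * 100 if total else 0
print(f"Results: {passed}/{total} passed ({rate:.0f}%)")
if failed:
    print(f"  Failed: {failed}")
if errored:
    print(f"  Errors: {errored}")

# List every case that did not pass, with its per-assertion verdicts
if passed < total:
    print("\nFailed cases:")
for r in results:
    if r["status"] != "passed":
        print(f"\n  [{r.get('case_name') or r['case_id']}]: {r['status']}")
        for a in r.get("assertion_results", []):
            status = "PASS" if a["passed"] else "FAIL"
            # criteria may be absent on structured assertions (assumption); fall back to the type
            print(f"    [{status}] {a.get('criteria') or a.get('type', 'assertion')}")
            if not a["passed"]:
                print(f"      {a.get('explanation', '')}")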
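To see which assertions fail most often across a run, the same results can be tallied per assertion. A minimal sketch that relies only on the fields used in the script above (`assertion_results`, `criteria`, `passed`); falling back to a `type` field when `criteria` is missing is an assumption about the result shape.

```python
from collections import defaultdict


def assertion_pass_rates(results):
    """Map each assertion label to (passed, total) across all cases in a run."""
    tally = defaultdict(lambda: [0, 0])
    for result in results:
        for a in result.get("assertion_results", []):
            # Assumed fallback: label structured assertions by their type.
            label = a.get("criteria") or a.get("type", "unknown")
            tally[label][1] += 1
            if a.get("passed"):
                tally[label][0] += 1
    return {label: tuple(counts) for label, counts in tally.items()}


def print_pass_rates(results):
    """Print assertions from lowest to highest pass rate."""
    rates = assertion_pass_rates(results)
    for label, (ok, total) in sorted(rates.items(), key=lambda kv: kv[1][0] / kv[1][1]):
        print(f"{ok}/{total}  {label}")
```

Sorting by pass rate puts the weakest assertions first, which is usually where a playbook change or a reworded criterion is needed.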
import json

# Load results from two different runs (different playbook versions)
with open("results_v3.json") as f:
    v3 = json.load(f)
with open("results_v5.json") as f:
    v5 = json.load(f)

# Build lookup by case name, falling back to the case ID
v3_by_case = {r.get("case_name") or r["case_id"]: r for r in v3}
v5_by_case = {r.get("case_name") or r["case_id"]: r for r in v5}
all_cases = sorted(set(v3_by_case) | set(v5_by_case))

print(f"{'Case':<30} {'v3':>8} {'v5':>8} {'Delta':>10}")
print("-" * 59)
for case in all_cases:
    r3 = v3_by_case.get(case)
    r5 = v5_by_case.get(case)
    # "-" marks a case that does not exist in that run
    s3 = r3["status"] if r3 else "-"
    s5 = r5["status"] if r5 else "-"
    delta = ""
    if s3 == "passed" and s5 != "passed":
        delta = "REGRESSION"
    elif s3 != "passed" and s5 == "passed":
        delta = "FIXED"
    print(f"  {case:<28} {s3:>8} {s5:>8} {delta:>10}")
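The same comparison can act as a gate before promoting a version. The function below is a hypothetical sketch that returns the cases that passed in the older run and did not pass in the newer one; unlike the table above, it skips cases that are missing from the newer run instead of counting them as regressions.

```python
def find_regressions(old_results, new_results):
    """Names of cases that passed in the old run but not in the new one."""
    def by_case(results):
        return {r.get("case_name") or r["case_id"]: r["status"] for r in results}

    old, new = by_case(old_results), by_case(new_results)
    return sorted(
        case
        for case, status in old.items()
        if status == "passed" and case in new and new[case] != "passed"
    )
```

A wrapper script could call `sys.exit(1)` when the returned list is non-empty, so a scheduled job or CI step fails on any regression.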
When setting up QA for an assistant for the first time:
Before deploying a new playbook version:
Periodically run evals to catch drift:
The most common request — full detail in the QA Practice Workflow section at the top. Short form:
- Iterate with `qa.py chat --instructions ... --tool-mocks-file ...`. Read the full response (events, tool calls, citations, explanation), not just the assistant message.
- Persist the regression test: `cases create`, then `runs create --case-ids NEW_CASE_ID` (still with the override / mocks while iterating; drop them once the playbook version is promoted).