Name: quality-engineer
Author: studiochat

Quality Engineer

Create test cases, run evaluations, and simulate conversations to verify AI assistant behavior. All API calls are authenticated automatically via environment variables. The API base URL (https://api.studiochat.io) is hardcoded in the scripts.

Key Terminology

Assistants and playbooks are the same concept. In the API, the term "playbook" is used everywhere — but users refer to them as "assistants," "bots," or "agents." When the user mentions any of these, they mean a playbook.

Playbook IDs:

playbook_base_id — stable ID across all versions of an assistant. Use this for case management.
playbook_id — ID of a specific version. Use this when triggering a run (you choose which version to test).

Setup

Set the following environment variables before using the scripts:

export STUDIO_API_TOKEN="sbs_your_api_key_here"
export STUDIO_PROJECT_ID="your-project-uuid"

API keys are available by request from the Studio Chat team at [email protected].

QA Practice Workflow (read this first)

The typical request looks like: "se quejaron de tal ejemplo X — investigá y arreglá". The skill is built around that flow: reproduce the complaint, understand what drove the behaviour, fix it via in-memory overrides, validate, and only then persist a regression test.

Step 1 — Understand the complaint (cross-skill: data-expert)

Before touching the assistant: find the actual conversation the customer is complaining about. The data-expert skill is the right tool for this — it knows how to query the conversations API, download a specific conversation, and pull the messages + events + tool calls that happened in production. Get the conversation_id, the playbook version that was active, and the user context that was passed in. Don't try to diagnose from a screenshot or a vague summary — read what actually happened.

Step 2 — Build the mental model

To know why the assistant did what it did, you need to understand how Studio Chat assembles the agent at runtime:

| Layer | What it is | How it loads | |---|---|---| | Playbook instructions (content) | The base system prompt for the agent. Free-form text — sets persona, tone, hard rules. | Stored on the saved Playbook row; injected into the system prompt at compile time. | | Skills (casuísticas) | Per-scenario instruction blocks the agent loads on demand via the load_skill tool. Each skill has name, description, and content. Only name + description are surfaced to the LLM during skill discovery — the content is fetched only when load_skill is called. | Stored on the playbook version; the compiler attaches them to the agent. | | Knowledge bases (kb_ids) | Indexed snippets the agent searches via search_knowledge_base. Returns ranked passages with citation IDs ([[abc12]]) inlined in the assistant's response. | Configured on the playbook; the agent calls search_knowledge_base when it needs grounded info. | | API tools / toolkits | Real-world actions: Composio integrations (SLACK_SEND_MESSAGE, CAL_POST_NEW_BOOKING_REQUEST, GMAIL_*), custom API tools, custom toolkits. | Registered on the playbook's api_tools / integrations; the wrapper dispatches calls at runtime. | | Examples | Reference conversations the agent learns from. | Stored on the playbook; injected into the system prompt. | | Enrichment tools | Run BEFORE the agent's first turn (e.g. fetch user profile, lookup order). Their result lands in context.enrichment. | Configured via enrichment_tool_ids; not mockable through tool_mocks. |

Whatever the assistant did wrong, the root cause lives in one (or a few) of these layers. The triage question is: was the instruction wrong? Was a skill missing or loading the wrong content? Was the KB result irrelevant? Did a tool return unexpected data? Use qa.py chat --verbose to see the full picture of each turn (see Step 4 below).

Step 3 — REQUIRED: ask the user what to mock before chatting / before any eval run

Before you chat with the assistant for the first time and before you trigger any eval run, you MUST ask the user whether any tools should be mocked. This is not optional. Reasons:

Reproducibility: real tools depend on real state. A refund flow that worked yesterday may behave differently today because the order was actually refunded. Mocks let you reproduce the exact conditions of the customer's complaint.
Safety: real tools can have side effects (send a Slack message, charge a card, create a calendar event). Mocking the destructive ones during QA prevents collateral damage.
Edge cases: the customer complaint often involves a tool failure ("the assistant said my order was lost"). The only way to reliably reproduce that is to mock the tool with the failing payload.

When you ask, enumerate the tools that are mockable for this playbook so the user knows what's available. The list is the playbook's api_tools + the project's configured Composio/custom toolkits + the built-in tools the agent always has (search_knowledge_base, load_skill, list_agents, list_teams, list_kbs). Don't enumerate the events (they aren't tools) — see "What's NOT mockable" below.

Example prompt to the user:

"Antes de chatear / antes de correr el eval: ¿querés mockear alguna tool? Las disponibles en este assistant son:

Composio: SLACK_SEND_MESSAGE, CAL_POST_NEW_BOOKING_REQUEST, GMAIL_SEND_EMAIL (los que tenga el project)

API tools del playbook: lookup_order, process_refund, create_ticket (los que estén en api_tools)

Built-ins: search_knowledge_base, load_skill

Para reproducir la queja, capaz quieras mockear lookup_order con un payload de error o search_knowledge_base con un snippet específico. ¿Algún caso particular?"

If the user says "ninguna, dale así nomás" — proceed without mocks. But don't skip the question.

Step 4 — Reproduce + iterate via chat with overrides + mocks

# Reproduce the complaint exactly. Override the instructions OR skills if
# you already have a hypothesis about the fix; mock the tools that drove
# the failing behaviour.
python3 scripts/qa.py chat PLAYBOOK_BASE_ID \
  --message "Quiero un reembolso del pedido ORD-99999" \
  --conversation-id qa_repro_001 \
  --tool-mocks-file ./mocks/refund-not-found.json \
  --verbose

Read the WHOLE response, not just the message. qa.py chat prints to stderr a structured breakdown:

| Section | What to look at | |---|---| | Events | The message the user would see, but also label / note / handoff_agent / priority events the assistant emitted. A [note] is a private note (invisible to the customer but written to the conversation). A [handoff_agent] or [handoff_team] means the assistant gave up. | | Tool calls | Every tool invoked, with arguments and result. load_skill tells you which skill the agent picked up — if the wrong one fired, the description text on that skill is wrong (the LLM picks skills off name+description, not content). search_knowledge_base shows the query and which articles came back — if the result is irrelevant, the KB content or chunking is the problem. [MOCKED] after a tool name means the response came from your mock, not the real tool. | | Citations | Which KB articles the agent actually quoted. If the assistant cites the wrong article, either the article content is wrong or the search ranking is. | | Explanation | Agent's own reasoning summary. Useful for catching subtle path choices ("decided to escalate because user asked twice"). |

Iterate by tweaking --instructions / --skills-file / --tool-mocks-file until the fix works. Nothing is persisted — no version bump, no chatlog pollution.

Step 5 — Dry-run a candidate eval case (optional but recommended)

Before persisting a new test case, validate the case definition itself with qa.py dry-run start. This runs the simulator + judge ONCE against an unsaved EvalCaseCreate payload (with the same playbook_override if you're still iterating on the variant). Catches problems like a too-vague termination that never fires, or an assertion the LLM judge can't grade. Nothing is written to eval_cases or eval_runs.

Step 6 — Persist the case + run only it

Once the fix works and the case definition is sound, save it and re-run just that case against the variant (or against the new saved playbook version, if you've promoted the override).

# Save the case
python3 scripts/qa.py cases create PLAYBOOK_BASE_ID --body '{
  "name": "refund-order-not-found-handoff",
  "scenario": "Customer asks for a refund for an order that the lookup API returns as not-found.",
  "termination": "The assistant escalates to a human agent",
  "max_turns": 5,
  "assertions": [
    {"criteria": "The assistant does not fabricate an order status"},
    {"type": "handoff"}
  ],
  "tool_mocks": {
    "lookup_order": {"match_kind": "any", "error": "Order not found"}
  }
}'

# Run only this case (ignores is_enabled — works even on disabled cases
# while you're still iterating)
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID \
  --playbook-id VERSION_ID \
  --case-ids NEW_CASE_ID

The case carries its own tool_mocks so future runs reproduce the exact failure mode deterministically without you having to specify mocks again.

Cheat sheet — which mechanism for which question

| Question | Mechanism | |---|---| | "What did the assistant actually do in prod?" | data-expert skill — pull the real conversation | | "Does changing the prompt fix it?" | qa.py chat --instructions ... | | "Does a different skill fire?" | qa.py chat --skills-file ... | | "What does the assistant do when this tool returns X?" | qa.py chat --tool-mocks-file ... | | "Is my new case definition gradable?" | qa.py dry-run start --case ... | | "Did the fix pass without breaking the rest?" | qa.py runs create (no --case-ids) | | "Just re-run THIS one case quickly" | qa.py runs create --case-ids ID |

Scripts

qa.py — Eval & testing API client

# List test cases
python3 scripts/qa.py cases list PLAYBOOK_BASE_ID

# Create a single test case
python3 scripts/qa.py cases create PLAYBOOK_BASE_ID --body '{...}'

# Create multiple test cases at once
python3 scripts/qa.py cases batch PLAYBOOK_BASE_ID --body '{"cases": [...]}'

# Delete a test case
python3 scripts/qa.py cases delete CASE_ID

# Trigger an eval run (all enabled cases, playbook's default models)
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID --playbook-id VERSION_ID [--context '{}']

# Trigger an eval run against UNSAVED instructions / skills (no version bumped)
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID --playbook-id VERSION_ID \
    --instructions-file ./draft-prompt.md \
    --skills-file ./draft-skills.json

# Trigger a run on a SUBSET of cases (ignores is_enabled — disabled cases included)
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID --playbook-id VERSION_ID \
    --case-ids CASE_ID_1,CASE_ID_2

# Trigger a run with model overrides + parallelism
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID --playbook-id VERSION_ID \
    --model openai-direct/gpt-4o-mini \
    --simulator-model anthropic/claude-sonnet-4 \
    --judge-model openai/gpt-4o \
    --concurrency 4

# List eval runs
python3 scripts/qa.py runs list PLAYBOOK_BASE_ID

# Check run status
python3 scripts/qa.py runs status RUN_ID

# Get run results (per-case details)
python3 scripts/qa.py runs results RUN_ID

# Cancel a running eval
python3 scripts/qa.py runs cancel RUN_ID

# Chat with an assistant (simulate a conversation)
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Hello, I need help"
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Follow up" --conversation-id CONV_ID

# Chat with UNSAVED instructions to iterate quickly (no version bumped)
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Hi" \
    --instructions "Reply in English. Be very concise." \
    --skills-file ./draft-skills.json

# Chat with MOCKED tool responses (stub Slack / KB / API tools — admin only)
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Quiero un reembolso de ORD-99999" \
    --tool-mocks-file ./mocks.json

# Dry-run a candidate eval case WITHOUT persisting it (validate the case
# definition before committing it via `cases create`)
python3 scripts/qa.py dry-run start PLAYBOOK_BASE_ID --playbook-id VERSION_ID \
    --case '{"name":"poc","scenario":"...","termination":"...","assertions":[{"criteria":"..."}]}'
python3 scripts/qa.py dry-run status DRY_RUN_ID
python3 scripts/qa.py dry-run cancel DRY_RUN_ID

Full API Reference

See references/api-reference.md for complete endpoint specs.

How Evaluations Work

The Eval Lifecycle

Create test cases — define scenarios, expected outcomes, and assertions for a playbook
Trigger a run — select a playbook version to test; runs execute asynchronously
Monitor progress — poll run status (pending → running → completed)
Analyze results — per-case pass/fail with LLM-generated explanations

Test Case Anatomy

A test case defines what to test and how to judge:

| Field | Required | Description | |-------|----------|-------------| | name | Yes | Unique name (lowercase with dashes, e.g., refund-happy-path) | | scenario | Yes | Instructions for the simulated user (what they want, how they behave) | | termination | Yes | Expected outcome — the simulator checks this after each turn | | first_message | No | Exact first message. If omitted, the LLM generates one from the scenario | | max_turns | No | Max conversation turns (1-50, default: 10) | | assertions | No | Criteria to evaluate after the conversation (LLM-as-judge) | | assertion_tags | No | Tags to check — verifies the assistant applied specific tags |

How the Simulator Works

For each test case, the system:

Generates a user message — either the exact first_message or an LLM-generated message based on the scenario
Sends it to the assistant — calls the actual playbook agent with the message
Checks termination — an LLM judges whether the expected outcome was reached
Repeats — generates the next user message based on the scenario + conversation so far
Evaluates assertions — after the conversation ends, each assertion is evaluated by an LLM judge
Checks tags — verifies expected tags were applied during the conversation

User Context

You can pass a user_context dict when triggering a run. This context is forwarded to the assistant agent, simulating a real user with specific attributes:

{
  "user_context": {
    "email": "[email protected]",
    "plan": "premium",
    "account_id": "acc-123"
  }
}

The assistant sees this context exactly as it would in a real conversation.

Picking which cases to run

By default a run executes every enabled case for the playbook. Two ways to narrow it:

is_enabled flag (persistent): toggle via the UI or PATCH /eval-cases/{id}. Permanently skips a case across all runs.
case_ids per-run (ephemeral): pass --case-ids on runs create. Only those cases execute and is_enabled is ignored — pick a single disabled case while iterating without flipping flags on the rest.

# Run just two specific cases (works even if they're disabled)
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID --playbook-id VERSION_ID \
  --case-ids c1f8...,c2d4...

Empty --case-ids ⇒ 400. Unknown IDs ⇒ 404. Omit the flag for the historical "all enabled" behaviour.

Dry-running a candidate case (no persistence)

qa.py dry-run start runs the simulator + judge ONCE against an unsaved EvalCaseCreate payload — same pipeline as a real eval run, no rows written to eval_cases or eval_runs. Use this to:

Validate a candidate case definition before persisting (does the simulator generate plausible user turns? does the termination fire? are the assertions gradable?).
Test an instructions/skills change against one specific scenario without bumping a playbook version.

State lives in memory for ~30 minutes and is polled by dry-run ID:

DRY_RUN_ID=$(python3 scripts/qa.py dry-run start PLAYBOOK_BASE_ID \
  --playbook-id VERSION_ID \
  --case '{
    "name": "refund-poc",
    "scenario": "User asks for a refund for ORD-12345.",
    "termination": "The assistant confirms the refund will be processed",
    "max_turns": 6,
    "assertions": [{"criteria": "The assistant asks for the order number"}]
  }' \
  --instructions "You are a refund specialist. Always ask for the order number first." \
  | jq -r .dry_run_id)

# Poll until completed / failed / cancelled
python3 scripts/qa.py dry-run status $DRY_RUN_ID
# or bail out if it's clearly going wrong
python3 scripts/qa.py dry-run cancel $DRY_RUN_ID

If the dry-run passes and the conversation looks right, persist the case via cases create and move to Step 6 of the QA practice workflow.

Model overrides

Three independent model knobs control different LLM calls during an eval. All default to the playbook's configured model (or the eval-system env default for simulator/judge), and all accept the same OpenRouter-compatible syntax.

| Flag | What it controls | Default | |---|---|---| | --model | The assistant LLM (the playbook agent under test) | Playbook's configured model | | --simulator-model | The LLM that role-plays the user in each case | anthropic/claude-sonnet-4 (env: EVAL_SIMULATOR_MODEL) | | --judge-model | The LLM that grades text-type assertions. Structured assertions (tool_called, handoff_to_agent, etc.) run deterministic checks and ignore this. | openai/gpt-4o (env: EVAL_EVALUATOR_MODEL) |

Syntax (same for all three flags):

provider/model_id — single model. Examples: openai-direct/gpt-4o-mini, anthropic/claude-sonnet-4.
primary{timeout}fallback — primary first; on timeout (seconds) fall back. Example: groq/llama-3.3-70b-versatile{8}openai-direct/gpt-4o-mini.
modelA:50,modelB:50 — A/B experiment. Percentages must sum to 100; cases are hash-assigned to a variant by conversation_id.

Bad input ⇒ 422 at the API edge (e.g. percentages that don't sum to 100, or a bare gpt-4o-mini without provider prefix). Empty/whitespace ⇒ field ignored.

Recommended models (use these exact slugs — don't invent new ones)

OpenRouter's catalog is strict; invented slugs will 422. These are the slugs actually in use across the Studio Chat stack:

Anthropic — Claude:

| Slug | Notes | |---|---| | anthropic/claude-sonnet-4.6 | Newest Sonnet. Default for the assistant in most scenarios. | | anthropic/claude-sonnet-4.5 | One rev behind 4.6. | | anthropic/claude-sonnet-4 | Eval-system default for the simulator (EVAL_SIMULATOR_MODEL). | | anthropic/claude-3.5-sonnet | Stable older Sonnet; cheap baseline for diffs. | | anthropic/claude-haiku-4.5 | Newest Haiku — fast / cheap. Good for high-volume runs or the simulator when latency matters more than nuance. |

OpenAI — GPT:

| Slug | Notes | |---|---| | openai/gpt-5.4 | Newest GPT-5 flagship. | | openai/gpt-5.4-mini | GPT-5 mini. Supports [reasoning=…] suffix (see below). | | openai/gpt-5.2-chat | Stable GPT-5 chat variant. | | openai/gpt-4.1-mini | Solid mid-tier. | | openai/gpt-4.1-nano | Smallest GPT-4.1 — cheap. | | openai/gpt-4o | GPT-4o via the OpenRouter pool. Default for the judge (EVAL_EVALUATOR_MODEL). | | openai/gpt-4o-mini | Cheap judge / assistant. | | openai-direct/gpt-4o | Same model via the direct OpenAI provider (skips OpenRouter pool — lower latency, different billing). | | openai-direct/gpt-4o-mini | Direct-provider 4o-mini. |

Google — Gemini:

| Slug | Notes | |---|---| | google/gemini-2.5-flash | Newest stable Flash. | | google/gemini-2.5-flash-lite | Cheaper Flash variant. | | google/gemini-2.0-flash-001 | Previous Flash generation. | | google/gemini-3-flash-preview | Gemini 3 Flash preview — may change. |

Gemini caveat: there's a known tool-calling bias in this codebase (docs/gemini-tool-call-bias.md). Prefer Sonnet for the assistant when the playbook leans heavily on tools.

Reasoning effort suffix (GPT-5 family)

OpenAI reasoning-capable models accept an optional [reasoning=X] suffix. Valid efforts: none, low, medium, high, xhigh. none disables reasoning entirely. Example:

openai/gpt-5.4-mini[reasoning=medium]
openai/gpt-5.2-chat[reasoning=none]

The suffix composes with the other syntactic forms.

Concurrency

--concurrency (1..5, default 1) fans cases out across a server-side thread pool. 1 keeps the sequential walk; higher values are useful when the case suite is large but watch for the simulator/agent provider's rate limits — 429s surface in EvalResult.error_message. Recommended 3–5 for ad-hoc runs.

Iterating without saving: playbook overrides

The QA pain point: every prompt or skill change normally bumps a saved playbook version, which means a version-history littered with throwaway revisions and an approval step in front of every iteration. The --instructions, --skills-file, --examples-file, --kb-ids, and --api-tools flags on chat and runs create let you test in-memory replacements without persisting anything.

When to use

Iterating on the system prompt — try a new tone, a new rule, a new fallback — and see how the assistant responds turn-by-turn.
Validating a draft skill (casuística) end-to-end against the full eval suite before promoting it.
A/B comparing two prompt variants without bumping the active version.
Reproducing a production conversation with a tweaked prompt to confirm the fix.

Override semantics

| Flag | Replaces… | |---|---| | --instructions TEXT / --instructions-file FILE | Main playbook instructions (the free-text content). | | --skills-file FILE | The full set of skills. Pass [] to disable all skills. | | --examples-file FILE | Global reference examples. | | --kb-ids id1,id2 | Knowledge base IDs (pass "" to disable all KBs). | | --api-tools t1,t2 | API tool IDs (pass "" to disable all tools). |

Rules:

Each flag is independent — omit a flag and the saved playbook field stays.
Replace, not merge — lists are swapped wholesale; there's no union.
Conversations are forced into preview + eval mode — overridden runs never count toward production analytics, the sticky-model cache, or chatlogs.
No version is created — the saved playbook is untouched; if you like the result, edit and save it through the normal flow.
Requires admin or API-key authentication (the same sbs_ / kps_ tokens the skill already uses).

Skills file shape

--skills-file accepts two shapes — pick the one that matches what you want to do.

1. Full replace (list of skill objects)

Drop the saved playbook's skills entirely and use exactly these:

[
  {
    "name": "refund-flow",
    "description": "Handle refund requests with order id verification",
    "content": "First ask for the order id. Then check eligibility..."
  },
  {
    "name": "english-only",
    "description": "Force English replies",
    "content": "Reply only in English regardless of customer language."
  }
]

Pass [] to disable all skills.

2. Surgical patch (object with `add` / `replace` / `remove`)

Keep most of the saved skills and only modify a few. Operators are applied in order: remove → replace → add:

{
  "remove": ["legacy-skill-a", "legacy-skill-b"],
  "replace": [
    {
      "name": "refund-flow",
      "description": "Refund handling, tightened policy",
      "content": "ASK for order id BEFORE confirming any refund..."
    }
  ],
  "add": [
    {
      "name": "english-only",
      "description": "Force English replies",
      "content": "Reply only in English regardless of customer language."
    }
  ]
}

Strict-validation rules (the BE returns 422 if violated, before any LLM call):

remove of a name that isn't on the saved playbook → 422.
replace of a name that isn't on the saved playbook → 422 (use add instead).
add of a name that already exists (after remove ran) → 422 (use replace instead).
Duplicate names within a single operator list → 422.

remove: [X] + add: [{name:X, ...}] of the same name is allowed — after remove drops the saved row, the slot is free for add.

Skill object shape

Both shapes share the same skill object shape (matches SkillOverrideInput on the BE). Note: only the fields the LLM actually reads during skill discovery are accepted here — trigger lives on the saved-version skill row but is never injected into the system prompt, so the override endpoint omits it.

| Field | Required | Notes | |---|---|---| | name | yes | Unique identifier (kebab-case recommended). Matched by name in replace/remove. | | description | yes | Short summary shown to the agent during skill discovery. | | content | yes | Full instructions. Macros like {{ kb(<id>) }} and {{ tool(<id>) }} are expanded — the referenced KB / API tool must already exist in the project (the override doesn't create them). | | examples | no | Optional list of reference conversation examples. | | order | no | Display/listing order. Auto-assigned if omitted. |

Examples

# Quick prompt tweak via inline flag
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Hi" \
    --instructions "Always reply in English, be terse."

# Full file-based override for chat
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "I want a refund" \
    --instructions-file ./draft-prompt.md \
    --skills-file ./draft-skills.json \
    --examples-file ./draft-examples.json \
    --conversation-id qa_iter_001

# Disable all skills to test the bare prompt
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Hello" \
    --skills-file <(echo '[]')

# Surgically modify the saved skills (patch shape)
# ./skills-patch.json:
#   { "remove": ["old-flow"], "add": [{"name": "english-only", ...}] }
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Hello" \
    --skills-file ./skills-patch.json

# Run the full eval suite against an unsaved prompt
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID \
    --playbook-id PB_VERSION_ID \
    --instructions-file ./candidate-prompt.md

Wire shape

Both endpoints accept a playbook_override object on the request body. The CLI builds it for you, but if you need to call the API directly:

Full-replace form (list-shape skills):

{
  "conversation_id": "qa_iter_001",
  "user_message": "...",
  "playbook_override": {
    "content": "...full instructions...",
    "skills": [
      {"name": "...", "description": "...", "content": "..."}
    ],
    "examples": [],
    "kb_ids": ["kb-uuid-1"],
    "api_tools": []
  }
}

Surgical-patch form (object-shape skills):

{
  "conversation_id": "qa_iter_001",
  "user_message": "...",
  "playbook_override": {
    "skills": {
      "remove": ["old-flow"],
      "replace": [{"name": "refund-flow", "description": "...", "content": "..."}],
      "add": [{"name": "english-only", "description": "...", "content": "..."}]
    }
  }
}

Any subset of these keys is valid — omitted keys keep the saved playbook value.

Mocking Tools (`tool_mocks`)

By default, an eval run or a chat call invokes real tools — searches the real KB, hits the real API, sends the real Slack message. That's great for end-to-end coverage but bad for QA: it depends on real-world state, can have destructive side effects, and makes it hard to reproduce a customer complaint that involved a specific tool failure.

tool_mocks lets you stub specific tools with canned responses. Same wire shape and semantics in two places:

| Where | Field | When mocks apply | |---|---|---| | Per case (saved) | EvalCase.tool_mocks | Every time the case runs in any eval | | Per chat call (ephemeral) | PlaybookChatRequest.tool_mocks | This chat request only — forced into preview+eval mode (admin only) |

Step 3 of the QA practice workflow makes asking the user about mocks mandatory before any chat or eval run. The rest of this section covers the shape + the rules.

What can and cannot be mocked

Mockable (anything that dispatches through the agent's toolset wrapper):

Composio integrations: tool names follow the TOOLKIT_ACTION uppercase convention. Examples: SLACK_SEND_MESSAGE, CAL_POST_NEW_BOOKING_REQUEST, CAL_CANCEL_BOOKING_VIA_UID, GMAIL_SEND_EMAIL. The exact list depends on which toolkits are configured in SUPPORTED_TOOLKITS for the project.
Custom API tools: the entries in the playbook's api_tools. Names match what the playbook author registered (typically snake_case).
Custom toolkit tools: same as Composio but registered via the in-house CUSTOM_TOOLKIT_REGISTRY.
Built-in tools: search_knowledge_base, load_skill, list_agents, list_teams, list_kbs.

Tool names in the mock map must match ToolCallTrace.name exactly — uppercase for Composio, snake_case for the rest.

NOT mockable (not dispatched as tool calls):

Agent events: message, note, label, priority, handoff_agent, handoff_team. These are items the LLM emits inside its structured output payload, parsed downstream by the chat handler — they never go through the wrapper. To validate them, use the matching structured assertions (HandoffAssertion, PrioritySetAssertion, TagAddedAssertion, PrivateNoteContainsAssertion).
Enrichment tools (enrichment_tool_ids): run BEFORE the agent's first turn. The wrapper doesn't exist yet at that point in the request lifecycle.

Shape

tool_mocks lives on the case body. Each key is a tool name; each value is one rule or an ordered list of rules:

{
  "name": "refund-when-eligible",
  "scenario": "...",
  "termination": "...",
  "tool_mocks": {
    "lookup_order": {
      "match_kind": "any",
      "return_value": {"order_id": "ORD-123", "status": "delivered", "days_since": 5}
    },
    "process_refund": {
      "match_kind": "any",
      "return_value": {"refund_id": "RFND-999", "amount": 49.99}
    }
  }
}

Match modes

A rule fires only when its match_kind matches the call:

| match_kind | Fires when… | Required field | |---|---|---| | any | always | — | | call_ordinal | the Nth call to this tool (1-indexed) | call_ordinal: int | | args_match | call args ⊇ match_args (subset filter; missing keys are wildcards) | match_args: dict |

Payload: exactly one of `return_value` or `error`

{
  "tool_mocks": {
    "send_email": {"match_kind": "any", "error": "SMTP unavailable"},
    "lookup_user": {
      "match_kind": "args_match",
      "match_args": {"plan": "enterprise"},
      "return_value": {"name": "Acme Corp", "tier": "enterprise"}
    }
  }
}

error raises a tool error inside the run (the assistant sees it the same way it would see a real API failure — useful for testing error handling). return_value is whatever JSON the tool would normally return.

Multiple rules per tool, first-match-wins

Pass an array to express "specific case → fallback":

{
  "tool_mocks": {
    "lookup_order": [
      {"match_kind": "args_match", "match_args": {"order_id": "ORD-123"}, "return_value": {"status": "delivered"}},
      {"match_kind": "args_match", "match_args": {"order_id": "ORD-999"}, "return_value": {"status": "lost"}},
      {"match_kind": "any", "error": "Order not found"}
    ]
  }
}

Rules are evaluated top-to-bottom; the first match wins. The catch-all any rule at the end keeps the run safe if the assistant calls with an unexpected argument.

Multi-call sequences

Use call_ordinal to return different values across consecutive calls:

{
  "tool_mocks": {
    "search_kb": [
      {"match_kind": "call_ordinal", "call_ordinal": 1, "return_value": [{"title": "Refund policy", "snippet": "..."}]},
      {"match_kind": "call_ordinal", "call_ordinal": 2, "return_value": [{"title": "Refund timeline", "snippet": "..."}]}
    ]
  }
}

Important: mocks are exhaustive per tool

Once you mock a tool, the rule list is complete — every call to that tool during the run must match a rule. If the assistant calls lookup_order a third time and no rule matches, the run fails loudly with no mock matched call #3 for tool lookup_order.

This is intentional: silently falling through to the real implementation would let production state leak into eval runs and make failures impossible to reproduce. Always include an any catch-all (or extra call_ordinal rules) if you don't know exactly how many times the assistant will call a tool.

Tools you DON'T list in tool_mocks are unaffected — they call the real implementation as usual.

Combining mocks with tool assertions

tool_called and tool_not_called assertions both support args_match for "the assistant called X with these specific args" (or "must not call X with these args"). Combine with mocks to test fine-grained behavior:

{
  "tool_mocks": {
    "send_email": {"match_kind": "any", "return_value": {"sent": true}}
  },
  "assertions": [
    {"type": "tool_called", "name": "send_email", "args_match": {"to": "[email protected]"}},
    {"type": "tool_not_called", "name": "send_email", "args_match": {"to": "[email protected]"}}
  ]
}

Mocking during chat (no persisted case)

The same tool_mocks shape works on qa.py chat via --tool-mocks-file, letting you stub tool responses for a single ad-hoc chat without authoring a case. Useful in Step 4 of the QA practice workflow when you're still hunting for the fix.

Semantics:

Admin / API-key only — same gate as --instructions / --skills-file.
Forces preview + eval mode — the conversation is excluded from chatlogs, sticky-model assignment, and production analytics. Same semantics as --instructions, so they compose cleanly.
Mocked calls are flagged in the output: qa.py chat prints [MOCKED] next to the tool name in the Tool calls section so you can tell stubbed responses apart from real ones.

Example mocks.json reproducing a "lookup_order returns not-found, KB has no refund policy" scenario:

{
  "lookup_order": {"match_kind": "any", "error": "Order not found"},
  "search_knowledge_base": {
    "match_kind": "any",
    "return_value": []
  }
}

Then chat:

python3 scripts/qa.py chat PLAYBOOK_BASE_ID \
  --message "Quiero un reembolso del pedido ORD-99999" \
  --tool-mocks-file ./mocks.json \
  --verbose

Watch the Tool calls section for lookup_order [custom] [MOCKED] and verify the assistant's response handles the error gracefully (e.g. escalates instead of fabricating an order status).

Note: run-level tool_mocks on qa.py runs create is not supported — per-case tool_mocks lives on the case body itself. If you need deterministic mocks across a whole run, put them on each case (or use a --case-ids subset against cases that already carry mocks). The CLI rejects --tool-mocks-file on runs create with a clear error.

Per-Case User Context (`user_context`)

A case can override the run-level user_context for its own scope. Case keys win over run keys; the special eval_overrides sub-namespace is shallow-merged.

{
  "name": "vip-customer-flow",
  "user_context": {
    "plan": "enterprise",
    "eval_overrides": {"agent_name": "PremiumBot"}
  }
}

Use this when one case needs a different user identity, plan, or simulated state without affecting the rest of the run.

Writing Good Test Cases

Scenario Guidelines

The scenario tells the simulated user how to behave. Write it from the user's perspective:

Describe what the user wants, not what the assistant should do
Include constraints: "the user is impatient," "the user doesn't have their order number"
Keep it focused — one scenario per case

Assertion types — pick the right tool for each check

Assertions live on EvalCase.assertions as a list of typed objects. There are two families:

Text (type: "text") — graded by the LLM judge. Use for free-form claims about what the assistant said.
Structured (everything else) — graded deterministically by walking turn.events and turn.tool_calls. No LLM call. Faster, cheaper, and immune to judge flakiness — prefer these whenever the check fits a structured shape.

The --judge-model flag affects only text assertions. Structured assertions ignore it.

Picking the right type: every time you reach for text to check something the assistant did (called a tool, handed off, set a priority, applied a tag, wrote a private note), there's a structured variant that does it deterministically. Use text only for what the assistant said (tone, content of the message, that it mentioned a specific policy).

`text` — LLM-as-judge

{"type": "text", "criteria": "The assistant mentions the 30-day refund policy"}

type defaults to text, so the legacy short form still works:

{"criteria": "The assistant asks for the order number before proceeding"}

Write the criteria as a clear, verifiable statement:

✅ "The assistant mentions the 30-day refund policy"
✅ "The assistant asks for the order number before proceeding"
❌ "The assistant is helpful" (too vague)
❌ "Response time is under 2 seconds" (not evaluable from text)

`tool_called` — a specific tool was invoked

{"type": "tool_called", "name": "search_knowledge_base", "min_count": 1}

Optional args_match narrows the match to calls whose arguments dict is a superset of the given keys/values:

{
  "type": "tool_called",
  "name": "SLACK_SEND_MESSAGE",
  "args_match": {"channel": "#alerts"},
  "min_count": 1
}

min_count defaults to 1 — bump it when you need to assert "called at least N times" (e.g., retrieved KB info twice during a long flow).

`tool_not_called` — a tool was never invoked

{"type": "tool_not_called", "name": "process_refund"}

With args_match, narrows to "must never be called with these args" (other calls to the same tool are ignored):

{
  "type": "tool_not_called",
  "name": "send_email",
  "args_match": {"to": "[email protected]"}
}

Useful for negative tests: "the agent must NOT email the CEO."

`tool_call_sequence` — tools fired in a specific order

{
  "type": "tool_call_sequence",
  "names": ["lookup_order", "check_refund_eligibility", "process_refund"],
  "strict": false
}

strict: false (default) means the listed tools must appear in order but other tool calls may interleave. strict: true means they must appear contiguously in the exact order — useful when you need to lock down "no extra calls between A and B."

`handoff` — some handoff event was emitted (mode-agnostic)

{"type": "handoff"}

Matches either handoff_agent or handoff_team. Standalone accounts only ever see "the agent gave up" with no agent/team distinction — this is the only handoff assertion that makes sense there. Kaption accounts also have the more specific variants below.

`handoff_to_agent` — handoff to a specific agent (Kaption)

{"type": "handoff_to_agent", "agent_id": 162}

Or "any" for "any specific agent" (i.e., assert that a handoff_agent event fired with some agent_id):

{"type": "handoff_to_agent", "agent_id": "any"}

`handoff_to_team` — handoff to a specific team (Kaption)

{"type": "handoff_to_team", "team_id": 7}

`no_handoff` — assert the agent did NOT hand off

{"type": "no_handoff"}

Matches both agent and team handoffs. Useful when the playbook is supposed to resolve the issue end-to-end.

`priority_set` — `priority` event with a specific value

{"type": "priority_set", "value": "urgent"}

value must be one of urgent | high | medium | low. Pairs well with playbooks that triage by severity.

`tag_added` — a `label` event added a specific tag

{"type": "tag_added", "tag": "billing"}

The agent emits these as label events internally, but the user-facing term is "tag" — this assertion uses tag_added for consistency with the rest of the eval surface.

There's also a legacy assertion_tags: ["billing", "escalation"] field on the case body that asserts a list of tags in one shot. It still works for backwards compatibility, but new cases should prefer one tag_added assertion per tag — they show up individually in the diff view and per-assertion result rows.

`private_note_contains` — a private note's content contains a substring

{
  "type": "private_note_contains",
  "substring": "asignar a Lautaro",
  "case_insensitive": true
}

Useful when the playbook is supposed to write specific context into a private note for the next human agent. case_insensitive defaults to true.

Combined examples

A single case can mix text and structured assertions freely. Structured assertions run first (deterministic, no LLM cost); text assertions run after.

{
  "name": "refund-flow-eligible",
  "scenario": "Customer with order ORD-12345 (5 days old) wants a refund.",
  "termination": "The assistant confirms the refund will be processed",
  "max_turns": 6,
  "assertions": [
    {"type": "tool_call_sequence", "names": ["lookup_order", "process_refund"]},
    {"type": "tool_called", "name": "process_refund", "args_match": {"order_id": "ORD-12345"}},
    {"type": "tool_not_called", "name": "human_handoff"},
    {"type": "priority_set", "value": "low"},
    {"type": "tag_added", "tag": "refund-completed"},
    {"criteria": "The assistant confirms the refund amount in the response"}
  ]
}

A handoff-on-failure case for a standalone account:

{
  "name": "refund-flow-not-eligible-handoff",
  "scenario": "Customer wants a refund for a 60-day-old order. Policy is 30 days.",
  "termination": "The assistant escalates to a human",
  "max_turns": 5,
  "assertions": [
    {"type": "handoff"},
    {"type": "tool_not_called", "name": "process_refund"},
    {"type": "private_note_contains", "substring": "fuera de política de 30 días"},
    {"criteria": "The assistant explains the 30-day policy to the customer"}
  ]
}

Picking the right assertion — cheat sheet

| Question | Assertion | |---|---| | Did the assistant say X? | text (LLM judge) | | Did the assistant call tool X? | tool_called | | Did the assistant call tool X with these args? | tool_called + args_match | | Did the assistant NOT call tool X? | tool_not_called | | Did the assistant NOT call tool X with these args? | tool_not_called + args_match | | Did tools fire in this order? | tool_call_sequence | | Did the assistant give up? (standalone) | handoff / no_handoff | | Did the assistant route to agent N? (Kaption) | handoff_to_agent | | Did the assistant route to team N? (Kaption) | handoff_to_team | | Did the assistant set conversation priority? | priority_set | | Did the assistant apply tag X? | tag_added | | Did the assistant write a private note containing X? | private_note_contains |

Reminder: events (priority, label, note, handoff_*) cannot be mocked with tool_mocks because they're not tool calls — they're items in the LLM's structured output. But they CAN be asserted with the structured assertions above. The two features are complementary: mocks shape the inputs the agent sees during the run; assertions verify the side effects (events + tool calls) it produced.

Examples

Example: Create a Single Test Case

python3 scripts/qa.py cases create PLAYBOOK_BASE_ID --body '{
  "name": "refund-happy-path",
  "scenario": "The customer bought a product 5 days ago and wants a full refund. They have their order number ready (ORD-12345).",
  "termination": "The assistant confirms the refund will be processed",
  "first_message": "Hi, I want to return a product I bought last week",
  "max_turns": 8,
  "assertions": [
    {"criteria": "The assistant asks for the order number"},
    {"criteria": "The assistant confirms the refund amount"},
    {"criteria": "The assistant mentions the expected refund timeline"}
  ]
}'

Example: Batch Create Test Cases

python3 scripts/qa.py cases batch PLAYBOOK_BASE_ID --body '{
  "cases": [
    {
      "name": "greeting-basic",
      "scenario": "A new customer visits for the first time and says hello.",
      "termination": "The assistant greets the customer and offers help",
      "max_turns": 3,
      "assertions": [
        {"criteria": "The assistant introduces itself"},
        {"criteria": "The assistant asks how it can help"}
      ]
    },
    {
      "name": "out-of-scope-question",
      "scenario": "The customer asks about something completely unrelated to the business, like the weather or sports.",
      "termination": "The assistant redirects to relevant topics or escalates",
      "max_turns": 5,
      "assertions": [
        {"criteria": "The assistant does not make up an answer about unrelated topics"},
        {"criteria": "The assistant politely redirects the conversation"}
      ]
    },
    {
      "name": "angry-customer-handoff",
      "scenario": "The customer is very angry about a delayed order. They curse and demand to speak to a manager. The order number is ORD-99999.",
      "termination": "The assistant escalates to a human agent",
      "max_turns": 6,
      "assertions": [
        {"criteria": "The assistant remains calm and professional"},
        {"criteria": "The assistant attempts to help before escalating"},
        {"criteria": "The assistant escalates to a human agent"}
      ],
      "assertion_tags": ["escalation"]
    },
    {
      "name": "pricing-inquiry",
      "scenario": "The customer wants to know the pricing for the Pro plan and asks about discounts for annual billing.",
      "termination": "The assistant provides pricing information",
      "max_turns": 5,
      "assertions": [
        {"criteria": "The assistant provides the correct Pro plan price"},
        {"criteria": "The assistant mentions annual billing discount if available"}
      ]
    }
  ]
}'

Example: Run Evaluation

# First, get the playbook version to test
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --list-versions

# Trigger a run against a specific version
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID \
  --playbook-id PLAYBOOK_VERSION_ID

# Trigger with user context (simulate a specific user)
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID \
  --playbook-id PLAYBOOK_VERSION_ID \
  --context '{"email": "[email protected]", "plan": "enterprise"}'

# Poll status until completed (check every 15 seconds)
python3 scripts/qa.py runs status RUN_ID

# Get full results
python3 scripts/qa.py runs results RUN_ID -o eval_results.json

Example: Simulate a Conversation

Chat directly with a playbook without creating test cases. Useful for ad-hoc testing. The chat command shows the full picture of what the assistant did: skills loaded, KB articles searched, tool calls made, citations used, events emitted, and agent reasoning.

# Start a new conversation
python3 scripts/qa.py chat PLAYBOOK_BASE_ID \
  --message "Hi, I need to cancel my subscription"

# Continue the same conversation (use the conversation_id from the response)
python3 scripts/qa.py chat PLAYBOOK_BASE_ID \
  --message "My account email is [email protected]" \
  --conversation-id conv_qa_12345

# Chat with user context
python3 scripts/qa.py chat PLAYBOOK_BASE_ID \
  --message "Check my order status" \
  --context '{"email": "[email protected]", "order_id": "ORD-555"}'

# Verbose mode — show full tool results and raw JSON response
python3 scripts/qa.py chat PLAYBOOK_BASE_ID \
  --message "I want a refund" --verbose

Chat Output Breakdown

The chat command prints a structured breakdown to stderr:

| Section | What it shows | |---------|---------------| | Events | Messages (stdout), labels, notes, handoffs, priority changes | | Tool calls | Every tool the assistant invoked, with arguments and result summaries | | Citations | KB articles referenced in the response, with source and snippet | | Explanation | Agent reasoning summary (why it chose a particular path) |

Tool call details (always shown):

load_skill [custom] — which skill was loaded (or not loaded)
search_knowledge_base [kb_search] — query used, number of results, relevance scores, and content snippets
Custom API tools — name, arguments, and result preview

--verbose / -v expands tool results to 500 chars and dumps the full JSON response.

Example: Analyze Results

import json

with open("eval_results.json") as f:
    results = json.load(f)

total = len(results)
passed = sum(1 for r in results if r["status"] == "passed")
failed = sum(1 for r in results if r["status"] == "failed")
errored = sum(1 for r in results if r["status"] == "error")

print(f"Results: {passed}/{total} passed ({passed/total*100:.0f}%)")
if failed:
    print(f"  Failed: {failed}")
if errored:
    print(f"  Errors: {errored}")

print("\nFailed cases:")
for r in results:
    if r["status"] != "passed":
        print(f"\n  [{r.get('case_name', r['case_id'])}] — {r['status']}")
        for a in r.get("assertion_results", []):
            status = "PASS" if a["passed"] else "FAIL"
            print(f"    [{status}] {a['criteria']}")
            if not a["passed"]:
                print(f"           {a['explanation']}")

Example: Compare Two Versions

import json

# Load results from two different runs (different playbook versions)
with open("results_v3.json") as f:
    v3 = json.load(f)
with open("results_v5.json") as f:
    v5 = json.load(f)

# Build lookup by case name
v3_by_case = {r.get("case_name", r["case_id"]): r for r in v3}
v5_by_case = {r.get("case_name", r["case_id"]): r for r in v5}

all_cases = sorted(set(v3_by_case) | set(v5_by_case))

print(f"{'Case':<30} {'v3':>8} {'v5':>8} {'Delta':>8}")
print("-" * 56)
for case in all_cases:
    r3 = v3_by_case.get(case)
    r5 = v5_by_case.get(case)
    s3 = r3["status"] if r3 else "—"
    s5 = r5["status"] if r5 else "—"
    delta = ""
    if s3 == "passed" and s5 != "passed":
        delta = "REGRESSION"
    elif s3 != "passed" and s5 == "passed":
        delta = "FIXED"
    print(f"  {case:<28} {s3:>8} {s5:>8} {delta:>8}")

Workflows

1. Initial QA Setup

When setting up QA for an assistant for the first time:

List playbooks — identify the assistant and its base_id
Design test cases — cover happy paths, edge cases, error handling
Batch create — create all cases in one API call
Run first eval — establish a baseline
Review results — identify gaps in the assistant's behavior

2. Pre-Deploy Validation

Before deploying a new playbook version:

Get the new version ID — from the playbook version history
Run eval against the new version
Compare with baseline — check for regressions
If pass rate drops — investigate failing cases before deploying

3. Ongoing Monitoring

Periodically run evals to catch drift:

Trigger run against the active version
Check pass rate — compare with historical runs
Investigate new failures — read the conversation + assertion explanations
Update test cases — add new cases for issues found in production (use data-expert skill to find problematic conversations)

4. Triage a customer complaint

The most common request — full detail in the QA Practice Workflow section at the top. Short form:

data-expert → pull the offending conversation, identify the playbook version + user_context that was active
Build mental model — which layer (instructions / skill / KB / tool) drove the bad behaviour
REQUIRED: ask the user what to mock — enumerate available tools, get confirmation
Reproduce + iterate: qa.py chat --instructions ... --tool-mocks-file .... Read the full response (events, tool calls, citations, explanation), not just the assistant message.
Dry-run the candidate case to validate the case definition itself
Persist case + run only it: cases create then runs create --case-ids NEW_CASE_ID (still with the override / mocks while iterating; drop them once the playbook version is promoted).

Quality Engineer

Key Terminology

Playbook IDs:

playbook_base_id — stable ID across all versions of an assistant. Use this for case management.
playbook_id — ID of a specific version. Use this when triggering a run (you choose which version to test).

Setup

Set the following environment variables before using the scripts:

export STUDIO_API_TOKEN="sbs_your_api_key_here"
export STUDIO_PROJECT_ID="your-project-uuid"

API keys are available by request from the Studio Chat team at [email protected].

QA Practice Workflow (read this first)

Step 1 — Understand the complaint (cross-skill: data-expert)

Step 2 — Build the mental model

To know why the assistant did what it did, you need to understand how Studio Chat assembles the agent at runtime:

Step 3 — REQUIRED: ask the user what to mock before chatting / before any eval run

Before you chat with the assistant for the first time and before you trigger any eval run, you MUST ask the user whether any tools should be mocked. This is not optional. Reasons:

Reproducibility: real tools depend on real state. A refund flow that worked yesterday may behave differently today because the order was actually refunded. Mocks let you reproduce the exact conditions of the customer's complaint.
Safety: real tools can have side effects (send a Slack message, charge a card, create a calendar event). Mocking the destructive ones during QA prevents collateral damage.
Edge cases: the customer complaint often involves a tool failure ("the assistant said my order was lost"). The only way to reliably reproduce that is to mock the tool with the failing payload.

Example prompt to the user:

"Antes de chatear / antes de correr el eval: ¿querés mockear alguna tool? Las disponibles en este assistant son:

Composio: SLACK_SEND_MESSAGE, CAL_POST_NEW_BOOKING_REQUEST, GMAIL_SEND_EMAIL (los que tenga el project)

API tools del playbook: lookup_order, process_refund, create_ticket (los que estén en api_tools)

Built-ins: search_knowledge_base, load_skill

Para reproducir la queja, capaz quieras mockear lookup_order con un payload de error o search_knowledge_base con un snippet específico. ¿Algún caso particular?"

If the user says "ninguna, dale así nomás" — proceed without mocks. But don't skip the question.

Step 4 — Reproduce + iterate via chat with overrides + mocks

# Reproduce the complaint exactly. Override the instructions OR skills if
# you already have a hypothesis about the fix; mock the tools that drove
# the failing behaviour.
python3 scripts/qa.py chat PLAYBOOK_BASE_ID \
  --message "Quiero un reembolso del pedido ORD-99999" \
  --conversation-id qa_repro_001 \
  --tool-mocks-file ./mocks/refund-not-found.json \
  --verbose

Read the WHOLE response, not just the message. qa.py chat prints to stderr a structured breakdown:

Iterate by tweaking --instructions / --skills-file / --tool-mocks-file until the fix works. Nothing is persisted — no version bump, no chatlog pollution.

Step 5 — Dry-run a candidate eval case (optional but recommended)

Step 6 — Persist the case + run only it

Once the fix works and the case definition is sound, save it and re-run just that case against the variant (or against the new saved playbook version, if you've promoted the override).

# Save the case
python3 scripts/qa.py cases create PLAYBOOK_BASE_ID --body '{
  "name": "refund-order-not-found-handoff",
  "scenario": "Customer asks for a refund for an order that the lookup API returns as not-found.",
  "termination": "The assistant escalates to a human agent",
  "max_turns": 5,
  "assertions": [
    {"criteria": "The assistant does not fabricate an order status"},
    {"type": "handoff"}
  ],
  "tool_mocks": {
    "lookup_order": {"match_kind": "any", "error": "Order not found"}
  }
}'

# Run only this case (ignores is_enabled — works even on disabled cases
# while you're still iterating)
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID \
  --playbook-id VERSION_ID \
  --case-ids NEW_CASE_ID

The case carries its own tool_mocks so future runs reproduce the exact failure mode deterministically without you having to specify mocks again.

Cheat sheet — which mechanism for which question

Scripts

qa.py — Eval & testing API client

# List test cases
python3 scripts/qa.py cases list PLAYBOOK_BASE_ID

# Create a single test case
python3 scripts/qa.py cases create PLAYBOOK_BASE_ID --body '{...}'

# Create multiple test cases at once
python3 scripts/qa.py cases batch PLAYBOOK_BASE_ID --body '{"cases": [...]}'

# Delete a test case
python3 scripts/qa.py cases delete CASE_ID

# Trigger an eval run (all enabled cases, playbook's default models)
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID --playbook-id VERSION_ID [--context '{}']

# Trigger an eval run against UNSAVED instructions / skills (no version bumped)
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID --playbook-id VERSION_ID \
    --instructions-file ./draft-prompt.md \
    --skills-file ./draft-skills.json

# Trigger a run on a SUBSET of cases (ignores is_enabled — disabled cases included)
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID --playbook-id VERSION_ID \
    --case-ids CASE_ID_1,CASE_ID_2

# Trigger a run with model overrides + parallelism
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID --playbook-id VERSION_ID \
    --model openai-direct/gpt-4o-mini \
    --simulator-model anthropic/claude-sonnet-4 \
    --judge-model openai/gpt-4o \
    --concurrency 4

# List eval runs
python3 scripts/qa.py runs list PLAYBOOK_BASE_ID

# Check run status
python3 scripts/qa.py runs status RUN_ID

# Get run results (per-case details)
python3 scripts/qa.py runs results RUN_ID

# Cancel a running eval
python3 scripts/qa.py runs cancel RUN_ID

# Chat with an assistant (simulate a conversation)
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Hello, I need help"
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Follow up" --conversation-id CONV_ID

# Chat with UNSAVED instructions to iterate quickly (no version bumped)
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Hi" \
    --instructions "Reply in English. Be very concise." \
    --skills-file ./draft-skills.json

# Chat with MOCKED tool responses (stub Slack / KB / API tools — admin only)
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Quiero un reembolso de ORD-99999" \
    --tool-mocks-file ./mocks.json

# Dry-run a candidate eval case WITHOUT persisting it (validate the case
# definition before committing it via `cases create`)
python3 scripts/qa.py dry-run start PLAYBOOK_BASE_ID --playbook-id VERSION_ID \
    --case '{"name":"poc","scenario":"...","termination":"...","assertions":[{"criteria":"..."}]}'
python3 scripts/qa.py dry-run status DRY_RUN_ID
python3 scripts/qa.py dry-run cancel DRY_RUN_ID

Full API Reference

See references/api-reference.md for complete endpoint specs.

How Evaluations Work

The Eval Lifecycle

Create test cases — define scenarios, expected outcomes, and assertions for a playbook
Trigger a run — select a playbook version to test; runs execute asynchronously
Monitor progress — poll run status (pending → running → completed)
Analyze results — per-case pass/fail with LLM-generated explanations

Test Case Anatomy

A test case defines what to test and how to judge:

How the Simulator Works

For each test case, the system:

Generates a user message — either the exact first_message or an LLM-generated message based on the scenario
Sends it to the assistant — calls the actual playbook agent with the message
Checks termination — an LLM judges whether the expected outcome was reached
Repeats — generates the next user message based on the scenario + conversation so far
Evaluates assertions — after the conversation ends, each assertion is evaluated by an LLM judge
Checks tags — verifies expected tags were applied during the conversation

User Context

You can pass a user_context dict when triggering a run. This context is forwarded to the assistant agent, simulating a real user with specific attributes:

{
  "user_context": {
    "email": "[email protected]",
    "plan": "premium",
    "account_id": "acc-123"
  }
}

The assistant sees this context exactly as it would in a real conversation.

Picking which cases to run

By default a run executes every enabled case for the playbook. Two ways to narrow it:

is_enabled flag (persistent): toggle via the UI or PATCH /eval-cases/{id}. Permanently skips a case across all runs.
case_ids per-run (ephemeral): pass --case-ids on runs create. Only those cases execute and is_enabled is ignored — pick a single disabled case while iterating without flipping flags on the rest.

# Run just two specific cases (works even if they're disabled)
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID --playbook-id VERSION_ID \
  --case-ids c1f8...,c2d4...

Empty --case-ids ⇒ 400. Unknown IDs ⇒ 404. Omit the flag for the historical "all enabled" behaviour.

Dry-running a candidate case (no persistence)

qa.py dry-run start runs the simulator + judge ONCE against an unsaved EvalCaseCreate payload — same pipeline as a real eval run, no rows written to eval_cases or eval_runs. Use this to:

Validate a candidate case definition before persisting (does the simulator generate plausible user turns? does the termination fire? are the assertions gradable?).
Test an instructions/skills change against one specific scenario without bumping a playbook version.

State lives in memory for ~30 minutes and is polled by dry-run ID:

DRY_RUN_ID=$(python3 scripts/qa.py dry-run start PLAYBOOK_BASE_ID \
  --playbook-id VERSION_ID \
  --case '{
    "name": "refund-poc",
    "scenario": "User asks for a refund for ORD-12345.",
    "termination": "The assistant confirms the refund will be processed",
    "max_turns": 6,
    "assertions": [{"criteria": "The assistant asks for the order number"}]
  }' \
  --instructions "You are a refund specialist. Always ask for the order number first." \
  | jq -r .dry_run_id)

# Poll until completed / failed / cancelled
python3 scripts/qa.py dry-run status $DRY_RUN_ID
# or bail out if it's clearly going wrong
python3 scripts/qa.py dry-run cancel $DRY_RUN_ID

If the dry-run passes and the conversation looks right, persist the case via cases create and move to Step 6 of the QA practice workflow.

Model overrides

Syntax (same for all three flags):

provider/model_id — single model. Examples: openai-direct/gpt-4o-mini, anthropic/claude-sonnet-4.
primary{timeout}fallback — primary first; on timeout (seconds) fall back. Example: groq/llama-3.3-70b-versatile{8}openai-direct/gpt-4o-mini.
modelA:50,modelB:50 — A/B experiment. Percentages must sum to 100; cases are hash-assigned to a variant by conversation_id.

Bad input ⇒ 422 at the API edge (e.g. percentages that don't sum to 100, or a bare gpt-4o-mini without provider prefix). Empty/whitespace ⇒ field ignored.

Recommended models (use these exact slugs — don't invent new ones)

OpenRouter's catalog is strict; invented slugs will 422. These are the slugs actually in use across the Studio Chat stack:

Anthropic — Claude:

OpenAI — GPT:

Google — Gemini:

Gemini caveat: there's a known tool-calling bias in this codebase (docs/gemini-tool-call-bias.md). Prefer Sonnet for the assistant when the playbook leans heavily on tools.

Reasoning effort suffix (GPT-5 family)

OpenAI reasoning-capable models accept an optional [reasoning=X] suffix. Valid efforts: none, low, medium, high, xhigh. none disables reasoning entirely. Example:

openai/gpt-5.4-mini[reasoning=medium]
openai/gpt-5.2-chat[reasoning=none]

The suffix composes with the other syntactic forms.

Concurrency

Iterating without saving: playbook overrides

When to use

Iterating on the system prompt — try a new tone, a new rule, a new fallback — and see how the assistant responds turn-by-turn.
Validating a draft skill (casuística) end-to-end against the full eval suite before promoting it.
A/B comparing two prompt variants without bumping the active version.
Reproducing a production conversation with a tweaked prompt to confirm the fix.

Override semantics

Rules:

Each flag is independent — omit a flag and the saved playbook field stays.
Replace, not merge — lists are swapped wholesale; there's no union.
Conversations are forced into preview + eval mode — overridden runs never count toward production analytics, the sticky-model cache, or chatlogs.
No version is created — the saved playbook is untouched; if you like the result, edit and save it through the normal flow.
Requires admin or API-key authentication (the same sbs_ / kps_ tokens the skill already uses).

Skills file shape

--skills-file accepts two shapes — pick the one that matches what you want to do.

1. Full replace (list of skill objects)

Drop the saved playbook's skills entirely and use exactly these:

[
  {
    "name": "refund-flow",
    "description": "Handle refund requests with order id verification",
    "content": "First ask for the order id. Then check eligibility..."
  },
  {
    "name": "english-only",
    "description": "Force English replies",
    "content": "Reply only in English regardless of customer language."
  }
]

Pass [] to disable all skills.

2. Surgical patch (object with `add` / `replace` / `remove`)

Keep most of the saved skills and only modify a few. Operators are applied in order: remove → replace → add:

{
  "remove": ["legacy-skill-a", "legacy-skill-b"],
  "replace": [
    {
      "name": "refund-flow",
      "description": "Refund handling, tightened policy",
      "content": "ASK for order id BEFORE confirming any refund..."
    }
  ],
  "add": [
    {
      "name": "english-only",
      "description": "Force English replies",
      "content": "Reply only in English regardless of customer language."
    }
  ]
}

Strict-validation rules (the BE returns 422 if violated, before any LLM call):

remove of a name that isn't on the saved playbook → 422.
replace of a name that isn't on the saved playbook → 422 (use add instead).
add of a name that already exists (after remove ran) → 422 (use replace instead).
Duplicate names within a single operator list → 422.

remove: [X] + add: [{name:X, ...}] of the same name is allowed — after remove drops the saved row, the slot is free for add.

Skill object shape

Examples

# Quick prompt tweak via inline flag
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Hi" \
    --instructions "Always reply in English, be terse."

# Full file-based override for chat
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "I want a refund" \
    --instructions-file ./draft-prompt.md \
    --skills-file ./draft-skills.json \
    --examples-file ./draft-examples.json \
    --conversation-id qa_iter_001

# Disable all skills to test the bare prompt
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Hello" \
    --skills-file <(echo '[]')

# Surgically modify the saved skills (patch shape)
# ./skills-patch.json:
#   { "remove": ["old-flow"], "add": [{"name": "english-only", ...}] }
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --message "Hello" \
    --skills-file ./skills-patch.json

# Run the full eval suite against an unsaved prompt
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID \
    --playbook-id PB_VERSION_ID \
    --instructions-file ./candidate-prompt.md

Wire shape

Both endpoints accept a playbook_override object on the request body. The CLI builds it for you, but if you need to call the API directly:

Full-replace form (list-shape skills):

{
  "conversation_id": "qa_iter_001",
  "user_message": "...",
  "playbook_override": {
    "content": "...full instructions...",
    "skills": [
      {"name": "...", "description": "...", "content": "..."}
    ],
    "examples": [],
    "kb_ids": ["kb-uuid-1"],
    "api_tools": []
  }
}

Surgical-patch form (object-shape skills):

{
  "conversation_id": "qa_iter_001",
  "user_message": "...",
  "playbook_override": {
    "skills": {
      "remove": ["old-flow"],
      "replace": [{"name": "refund-flow", "description": "...", "content": "..."}],
      "add": [{"name": "english-only", "description": "...", "content": "..."}]
    }
  }
}

Any subset of these keys is valid — omitted keys keep the saved playbook value.

Mocking Tools (`tool_mocks`)

tool_mocks lets you stub specific tools with canned responses. Same wire shape and semantics in two places:

Step 3 of the QA practice workflow makes asking the user about mocks mandatory before any chat or eval run. The rest of this section covers the shape + the rules.

What can and cannot be mocked

Mockable (anything that dispatches through the agent's toolset wrapper):

Composio integrations: tool names follow the TOOLKIT_ACTION uppercase convention. Examples: SLACK_SEND_MESSAGE, CAL_POST_NEW_BOOKING_REQUEST, CAL_CANCEL_BOOKING_VIA_UID, GMAIL_SEND_EMAIL. The exact list depends on which toolkits are configured in SUPPORTED_TOOLKITS for the project.
Custom API tools: the entries in the playbook's api_tools. Names match what the playbook author registered (typically snake_case).
Custom toolkit tools: same as Composio but registered via the in-house CUSTOM_TOOLKIT_REGISTRY.
Built-in tools: search_knowledge_base, load_skill, list_agents, list_teams, list_kbs.

Tool names in the mock map must match ToolCallTrace.name exactly — uppercase for Composio, snake_case for the rest.

NOT mockable (not dispatched as tool calls):

Agent events: message, note, label, priority, handoff_agent, handoff_team. These are items the LLM emits inside its structured output payload, parsed downstream by the chat handler — they never go through the wrapper. To validate them, use the matching structured assertions (HandoffAssertion, PrioritySetAssertion, TagAddedAssertion, PrivateNoteContainsAssertion).
Enrichment tools (enrichment_tool_ids): run BEFORE the agent's first turn. The wrapper doesn't exist yet at that point in the request lifecycle.

Shape

tool_mocks lives on the case body. Each key is a tool name; each value is one rule or an ordered list of rules:

{
  "name": "refund-when-eligible",
  "scenario": "...",
  "termination": "...",
  "tool_mocks": {
    "lookup_order": {
      "match_kind": "any",
      "return_value": {"order_id": "ORD-123", "status": "delivered", "days_since": 5}
    },
    "process_refund": {
      "match_kind": "any",
      "return_value": {"refund_id": "RFND-999", "amount": 49.99}
    }
  }
}

Match modes

A rule fires only when its match_kind matches the call:

Payload: exactly one of `return_value` or `error`

{
  "tool_mocks": {
    "send_email": {"match_kind": "any", "error": "SMTP unavailable"},
    "lookup_user": {
      "match_kind": "args_match",
      "match_args": {"plan": "enterprise"},
      "return_value": {"name": "Acme Corp", "tier": "enterprise"}
    }
  }
}

Multiple rules per tool, first-match-wins

Pass an array to express "specific case → fallback":

{
  "tool_mocks": {
    "lookup_order": [
      {"match_kind": "args_match", "match_args": {"order_id": "ORD-123"}, "return_value": {"status": "delivered"}},
      {"match_kind": "args_match", "match_args": {"order_id": "ORD-999"}, "return_value": {"status": "lost"}},
      {"match_kind": "any", "error": "Order not found"}
    ]
  }
}

Rules are evaluated top-to-bottom; the first match wins. The catch-all any rule at the end keeps the run safe if the assistant calls with an unexpected argument.

Multi-call sequences

Use call_ordinal to return different values across consecutive calls:

{
  "tool_mocks": {
    "search_kb": [
      {"match_kind": "call_ordinal", "call_ordinal": 1, "return_value": [{"title": "Refund policy", "snippet": "..."}]},
      {"match_kind": "call_ordinal", "call_ordinal": 2, "return_value": [{"title": "Refund timeline", "snippet": "..."}]}
    ]
  }
}

Important: mocks are exhaustive per tool

Tools you DON'T list in tool_mocks are unaffected — they call the real implementation as usual.

Combining mocks with tool assertions

{
  "tool_mocks": {
    "send_email": {"match_kind": "any", "return_value": {"sent": true}}
  },
  "assertions": [
    {"type": "tool_called", "name": "send_email", "args_match": {"to": "[email protected]"}},
    {"type": "tool_not_called", "name": "send_email", "args_match": {"to": "[email protected]"}}
  ]
}

Mocking during chat (no persisted case)

Semantics:

Admin / API-key only — same gate as --instructions / --skills-file.
Forces preview + eval mode — the conversation is excluded from chatlogs, sticky-model assignment, and production analytics. Same semantics as --instructions, so they compose cleanly.
Mocked calls are flagged in the output: qa.py chat prints [MOCKED] next to the tool name in the Tool calls section so you can tell stubbed responses apart from real ones.

Example mocks.json reproducing a "lookup_order returns not-found, KB has no refund policy" scenario:

{
  "lookup_order": {"match_kind": "any", "error": "Order not found"},
  "search_knowledge_base": {
    "match_kind": "any",
    "return_value": []
  }
}

Then chat:

python3 scripts/qa.py chat PLAYBOOK_BASE_ID \
  --message "Quiero un reembolso del pedido ORD-99999" \
  --tool-mocks-file ./mocks.json \
  --verbose

Watch the Tool calls section for lookup_order [custom] [MOCKED] and verify the assistant's response handles the error gracefully (e.g. escalates instead of fabricating an order status).

Note: run-level tool_mocks on qa.py runs create is not supported — per-case tool_mocks lives on the case body itself. If you need deterministic mocks across a whole run, put them on each case (or use a --case-ids subset against cases that already carry mocks). The CLI rejects --tool-mocks-file on runs create with a clear error.

Per-Case User Context (`user_context`)

A case can override the run-level user_context for its own scope. Case keys win over run keys; the special eval_overrides sub-namespace is shallow-merged.

{
  "name": "vip-customer-flow",
  "user_context": {
    "plan": "enterprise",
    "eval_overrides": {"agent_name": "PremiumBot"}
  }
}

Use this when one case needs a different user identity, plan, or simulated state without affecting the rest of the run.

Writing Good Test Cases

Scenario Guidelines

The scenario tells the simulated user how to behave. Write it from the user's perspective:

Describe what the user wants, not what the assistant should do
Include constraints: "the user is impatient," "the user doesn't have their order number"
Keep it focused — one scenario per case

Assertion types — pick the right tool for each check

Assertions live on EvalCase.assertions as a list of typed objects. There are two families:

Text (type: "text") — graded by the LLM judge. Use for free-form claims about what the assistant said.
Structured (everything else) — graded deterministically by walking turn.events and turn.tool_calls. No LLM call. Faster, cheaper, and immune to judge flakiness — prefer these whenever the check fits a structured shape.

The --judge-model flag affects only text assertions. Structured assertions ignore it.

Picking the right type: every time you reach for text to check something the assistant did (called a tool, handed off, set a priority, applied a tag, wrote a private note), there's a structured variant that does it deterministically. Use text only for what the assistant said (tone, content of the message, that it mentioned a specific policy).

`text` — LLM-as-judge

{"type": "text", "criteria": "The assistant mentions the 30-day refund policy"}

type defaults to text, so the legacy short form still works:

{"criteria": "The assistant asks for the order number before proceeding"}

Write the criteria as a clear, verifiable statement:

✅ "The assistant mentions the 30-day refund policy"
✅ "The assistant asks for the order number before proceeding"
❌ "The assistant is helpful" (too vague)
❌ "Response time is under 2 seconds" (not evaluable from text)

`tool_called` — a specific tool was invoked

{"type": "tool_called", "name": "search_knowledge_base", "min_count": 1}

Optional args_match narrows the match to calls whose arguments dict is a superset of the given keys/values:

{
  "type": "tool_called",
  "name": "SLACK_SEND_MESSAGE",
  "args_match": {"channel": "#alerts"},
  "min_count": 1
}

min_count defaults to 1 — bump it when you need to assert "called at least N times" (e.g., retrieved KB info twice during a long flow).

`tool_not_called` — a tool was never invoked

{"type": "tool_not_called", "name": "process_refund"}

With args_match, narrows to "must never be called with these args" (other calls to the same tool are ignored):

{
  "type": "tool_not_called",
  "name": "send_email",
  "args_match": {"to": "[email protected]"}
}

Useful for negative tests: "the agent must NOT email the CEO."

`tool_call_sequence` — tools fired in a specific order

{
  "type": "tool_call_sequence",
  "names": ["lookup_order", "check_refund_eligibility", "process_refund"],
  "strict": false
}

`handoff` — some handoff event was emitted (mode-agnostic)

{"type": "handoff"}

`handoff_to_agent` — handoff to a specific agent (Kaption)

{"type": "handoff_to_agent", "agent_id": 162}

Or "any" for "any specific agent" (i.e., assert that a handoff_agent event fired with some agent_id):

{"type": "handoff_to_agent", "agent_id": "any"}

`handoff_to_team` — handoff to a specific team (Kaption)

{"type": "handoff_to_team", "team_id": 7}

`no_handoff` — assert the agent did NOT hand off

{"type": "no_handoff"}

Matches both agent and team handoffs. Useful when the playbook is supposed to resolve the issue end-to-end.

`priority_set` — `priority` event with a specific value

{"type": "priority_set", "value": "urgent"}

value must be one of urgent | high | medium | low. Pairs well with playbooks that triage by severity.

`tag_added` — a `label` event added a specific tag

{"type": "tag_added", "tag": "billing"}

The agent emits these as label events internally, but the user-facing term is "tag" — this assertion uses tag_added for consistency with the rest of the eval surface.

`private_note_contains` — a private note's content contains a substring

{
  "type": "private_note_contains",
  "substring": "asignar a Lautaro",
  "case_insensitive": true
}

Useful when the playbook is supposed to write specific context into a private note for the next human agent. case_insensitive defaults to true.

Combined examples

A single case can mix text and structured assertions freely. Structured assertions run first (deterministic, no LLM cost); text assertions run after.

{
  "name": "refund-flow-eligible",
  "scenario": "Customer with order ORD-12345 (5 days old) wants a refund.",
  "termination": "The assistant confirms the refund will be processed",
  "max_turns": 6,
  "assertions": [
    {"type": "tool_call_sequence", "names": ["lookup_order", "process_refund"]},
    {"type": "tool_called", "name": "process_refund", "args_match": {"order_id": "ORD-12345"}},
    {"type": "tool_not_called", "name": "human_handoff"},
    {"type": "priority_set", "value": "low"},
    {"type": "tag_added", "tag": "refund-completed"},
    {"criteria": "The assistant confirms the refund amount in the response"}
  ]
}

A handoff-on-failure case for a standalone account:

{
  "name": "refund-flow-not-eligible-handoff",
  "scenario": "Customer wants a refund for a 60-day-old order. Policy is 30 days.",
  "termination": "The assistant escalates to a human",
  "max_turns": 5,
  "assertions": [
    {"type": "handoff"},
    {"type": "tool_not_called", "name": "process_refund"},
    {"type": "private_note_contains", "substring": "fuera de política de 30 días"},
    {"criteria": "The assistant explains the 30-day policy to the customer"}
  ]
}

Picking the right assertion — cheat sheet

Reminder: events (priority, label, note, handoff_*) cannot be mocked with tool_mocks because they're not tool calls — they're items in the LLM's structured output. But they CAN be asserted with the structured assertions above. The two features are complementary: mocks shape the inputs the agent sees during the run; assertions verify the side effects (events + tool calls) it produced.

Examples

Example: Create a Single Test Case

python3 scripts/qa.py cases create PLAYBOOK_BASE_ID --body '{
  "name": "refund-happy-path",
  "scenario": "The customer bought a product 5 days ago and wants a full refund. They have their order number ready (ORD-12345).",
  "termination": "The assistant confirms the refund will be processed",
  "first_message": "Hi, I want to return a product I bought last week",
  "max_turns": 8,
  "assertions": [
    {"criteria": "The assistant asks for the order number"},
    {"criteria": "The assistant confirms the refund amount"},
    {"criteria": "The assistant mentions the expected refund timeline"}
  ]
}'

Example: Batch Create Test Cases

python3 scripts/qa.py cases batch PLAYBOOK_BASE_ID --body '{
  "cases": [
    {
      "name": "greeting-basic",
      "scenario": "A new customer visits for the first time and says hello.",
      "termination": "The assistant greets the customer and offers help",
      "max_turns": 3,
      "assertions": [
        {"criteria": "The assistant introduces itself"},
        {"criteria": "The assistant asks how it can help"}
      ]
    },
    {
      "name": "out-of-scope-question",
      "scenario": "The customer asks about something completely unrelated to the business, like the weather or sports.",
      "termination": "The assistant redirects to relevant topics or escalates",
      "max_turns": 5,
      "assertions": [
        {"criteria": "The assistant does not make up an answer about unrelated topics"},
        {"criteria": "The assistant politely redirects the conversation"}
      ]
    },
    {
      "name": "angry-customer-handoff",
      "scenario": "The customer is very angry about a delayed order. They curse and demand to speak to a manager. The order number is ORD-99999.",
      "termination": "The assistant escalates to a human agent",
      "max_turns": 6,
      "assertions": [
        {"criteria": "The assistant remains calm and professional"},
        {"criteria": "The assistant attempts to help before escalating"},
        {"criteria": "The assistant escalates to a human agent"}
      ],
      "assertion_tags": ["escalation"]
    },
    {
      "name": "pricing-inquiry",
      "scenario": "The customer wants to know the pricing for the Pro plan and asks about discounts for annual billing.",
      "termination": "The assistant provides pricing information",
      "max_turns": 5,
      "assertions": [
        {"criteria": "The assistant provides the correct Pro plan price"},
        {"criteria": "The assistant mentions annual billing discount if available"}
      ]
    }
  ]
}'

Example: Run Evaluation

# First, get the playbook version to test
python3 scripts/qa.py chat PLAYBOOK_BASE_ID --list-versions

# Trigger a run against a specific version
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID \
  --playbook-id PLAYBOOK_VERSION_ID

# Trigger with user context (simulate a specific user)
python3 scripts/qa.py runs create PLAYBOOK_BASE_ID \
  --playbook-id PLAYBOOK_VERSION_ID \
  --context '{"email": "[email protected]", "plan": "enterprise"}'

# Poll status until completed (check every 15 seconds)
python3 scripts/qa.py runs status RUN_ID

# Get full results
python3 scripts/qa.py runs results RUN_ID -o eval_results.json

Example: Simulate a Conversation

# Start a new conversation
python3 scripts/qa.py chat PLAYBOOK_BASE_ID \
  --message "Hi, I need to cancel my subscription"

# Continue the same conversation (use the conversation_id from the response)
python3 scripts/qa.py chat PLAYBOOK_BASE_ID \
  --message "My account email is [email protected]" \
  --conversation-id conv_qa_12345

# Chat with user context
python3 scripts/qa.py chat PLAYBOOK_BASE_ID \
  --message "Check my order status" \
  --context '{"email": "[email protected]", "order_id": "ORD-555"}'

# Verbose mode — show full tool results and raw JSON response
python3 scripts/qa.py chat PLAYBOOK_BASE_ID \
  --message "I want a refund" --verbose

Chat Output Breakdown

The chat command prints a structured breakdown to stderr:

Tool call details (always shown):

load_skill [custom] — which skill was loaded (or not loaded)
search_knowledge_base [kb_search] — query used, number of results, relevance scores, and content snippets
Custom API tools — name, arguments, and result preview

--verbose / -v expands tool results to 500 chars and dumps the full JSON response.

Example: Analyze Results

import json

with open("eval_results.json") as f:
    results = json.load(f)

total = len(results)
passed = sum(1 for r in results if r["status"] == "passed")
failed = sum(1 for r in results if r["status"] == "failed")
errored = sum(1 for r in results if r["status"] == "error")

print(f"Results: {passed}/{total} passed ({passed/total*100:.0f}%)")
if failed:
    print(f"  Failed: {failed}")
if errored:
    print(f"  Errors: {errored}")

print("\nFailed cases:")
for r in results:
    if r["status"] != "passed":
        print(f"\n  [{r.get('case_name', r['case_id'])}] — {r['status']}")
        for a in r.get("assertion_results", []):
            status = "PASS" if a["passed"] else "FAIL"
            print(f"    [{status}] {a['criteria']}")
            if not a["passed"]:
                print(f"           {a['explanation']}")

Example: Compare Two Versions

import json

# Load results from two different runs (different playbook versions)
with open("results_v3.json") as f:
    v3 = json.load(f)
with open("results_v5.json") as f:
    v5 = json.load(f)

# Build lookup by case name
v3_by_case = {r.get("case_name", r["case_id"]): r for r in v3}
v5_by_case = {r.get("case_name", r["case_id"]): r for r in v5}

all_cases = sorted(set(v3_by_case) | set(v5_by_case))

print(f"{'Case':<30} {'v3':>8} {'v5':>8} {'Delta':>8}")
print("-" * 56)
for case in all_cases:
    r3 = v3_by_case.get(case)
    r5 = v5_by_case.get(case)
    s3 = r3["status"] if r3 else "—"
    s5 = r5["status"] if r5 else "—"
    delta = ""
    if s3 == "passed" and s5 != "passed":
        delta = "REGRESSION"
    elif s3 != "passed" and s5 == "passed":
        delta = "FIXED"
    print(f"  {case:<28} {s3:>8} {s5:>8} {delta:>8}")

Workflows

1. Initial QA Setup

When setting up QA for an assistant for the first time:

List playbooks — identify the assistant and its base_id
Design test cases — cover happy paths, edge cases, error handling
Batch create — create all cases in one API call
Run first eval — establish a baseline
Review results — identify gaps in the assistant's behavior

2. Pre-Deploy Validation

Before deploying a new playbook version:

Get the new version ID — from the playbook version history
Run eval against the new version
Compare with baseline — check for regressions
If pass rate drops — investigate failing cases before deploying

3. Ongoing Monitoring

Periodically run evals to catch drift:

Trigger run against the active version
Check pass rate — compare with historical runs
Investigate new failures — read the conversation + assertion explanations
Update test cases — add new cases for issues found in production (use data-expert skill to find problematic conversations)

4. Triage a customer complaint

The most common request — full detail in the QA Practice Workflow section at the top. Short form:

data-expert → pull the offending conversation, identify the playbook version + user_context that was active
Build mental model — which layer (instructions / skill / KB / tool) drove the bad behaviour
REQUIRED: ask the user what to mock — enumerate available tools, get confirmation
Reproduce + iterate: qa.py chat --instructions ... --tool-mocks-file .... Read the full response (events, tool calls, citations, explanation), not just the assistant message.
Dry-run the candidate case to validate the case definition itself
Persist case + run only it: cases create then runs create --case-ids NEW_CASE_ID (still with the override / mocks while iterating; drop them once the playbook version is promoted).

Adoption

studiochat/quality-engineer

$ install --global

Security Scan Results

SKILL.md

Quality Engineer

Key Terminology

Setup

QA Practice Workflow (read this first)

Step 1 — Understand the complaint (cross-skill: data-expert)

Step 2 — Build the mental model

Step 3 — REQUIRED: ask the user what to mock before chatting / before any eval run

Step 4 — Reproduce + iterate via chat with overrides + mocks

Step 5 — Dry-run a candidate eval case (optional but recommended)

Step 6 — Persist the case + run only it

Cheat sheet — which mechanism for which question

Scripts

qa.py — Eval & testing API client

Full API Reference

How Evaluations Work

The Eval Lifecycle

Test Case Anatomy

How the Simulator Works

User Context

Picking which cases to run

Dry-running a candidate case (no persistence)

Model overrides

Recommended models (use these exact slugs — don't invent new ones)

Reasoning effort suffix (GPT-5 family)

Concurrency

Iterating without saving: playbook overrides

When to use

Override semantics

Skills file shape

1. Full replace (list of skill objects)

2. Surgical patch (object with add / replace / remove)

Skill object shape

Examples

Wire shape

Mocking Tools (tool_mocks)

What can and cannot be mocked

Shape

Match modes

Payload: exactly one of return_value or error

Multiple rules per tool, first-match-wins

Multi-call sequences

Important: mocks are exhaustive per tool

Combining mocks with tool assertions

Mocking during chat (no persisted case)

Per-Case User Context (user_context)

Writing Good Test Cases

Scenario Guidelines

Assertion types — pick the right tool for each check

text — LLM-as-judge

tool_called — a specific tool was invoked

tool_not_called — a tool was never invoked

tool_call_sequence — tools fired in a specific order

handoff — some handoff event was emitted (mode-agnostic)

handoff_to_agent — handoff to a specific agent (Kaption)

handoff_to_team — handoff to a specific team (Kaption)

no_handoff — assert the agent did NOT hand off

priority_set — priority event with a specific value

tag_added — a label event added a specific tag

private_note_contains — a private note's content contains a substring

Combined examples

Picking the right assertion — cheat sheet

Examples

Example: Create a Single Test Case

Example: Batch Create Test Cases

Example: Run Evaluation

Example: Simulate a Conversation

Chat Output Breakdown

Example: Analyze Results

Example: Compare Two Versions

Workflows

1. Initial QA Setup

2. Pre-Deploy Validation

3. Ongoing Monitoring

4. Triage a customer complaint

Related Skills

2. Surgical patch (object with `add` / `replace` / `remove`)

Mocking Tools (`tool_mocks`)

Payload: exactly one of `return_value` or `error`

Per-Case User Context (`user_context`)

`text` — LLM-as-judge

`tool_called` — a specific tool was invoked

`tool_not_called` — a tool was never invoked

`tool_call_sequence` — tools fired in a specific order

`handoff` — some handoff event was emitted (mode-agnostic)

`handoff_to_agent` — handoff to a specific agent (Kaption)

`handoff_to_team` — handoff to a specific team (Kaption)

`no_handoff` — assert the agent did NOT hand off

`priority_set` — `priority` event with a specific value

`tag_added` — a `label` event added a specific tag

`private_note_contains` — a private note's content contains a substring

2. Surgical patch (object with `add` / `replace` / `remove`)

Mocking Tools (`tool_mocks`)

Payload: exactly one of `return_value` or `error`

Per-Case User Context (`user_context`)

`text` — LLM-as-judge

`tool_called` — a specific tool was invoked

`tool_not_called` — a tool was never invoked

`tool_call_sequence` — tools fired in a specific order

`handoff` — some handoff event was emitted (mode-agnostic)

`handoff_to_agent` — handoff to a specific agent (Kaption)

`handoff_to_team` — handoff to a specific team (Kaption)

`no_handoff` — assert the agent did NOT hand off

`priority_set` — `priority` event with a specific value