skills/continuous-improvement/SKILL.md
The proactive loop for shipping a behaviour change to an assistant — adding or changing a policy in the instructions or casuísticas (skills). Use when the request is "necesitamos que el asistente haga X", "agregá esta política", "cambiá el tono", "sumá una casuística para Y", or when a trend surfaced by data (trending topic, recurring handoffs, a monitor/alert) justifies an improvement. Drives the loop: clarify the policy → decide WHERE it lives (base instruction vs casuística vs KB vs example) → minimal draft → validate with in-memory overrides → ship via approvals → eval coverage. This is the feature/change-request counterpart to the quality-engineer skill (which is the reactive bug-report loop from a conversation ID).
Ship a behaviour change to an assistant: add or change a policy in the instructions or
the casuísticas (skills), validate it without polluting version history, push it through
the approval gate, and close the loop with eval coverage. All API calls are authenticated
automatically via environment variables. The API base URL (https://api.studiochat.io) is
hardcoded in the scripts.
This skill is the proactive counterpart to quality-engineer:
| | Entry point | Shape |
|---|---|---|
| quality-engineer | a conversation_id where the assistant misbehaved | reactive — bug report → root cause → fix |
| continuous-improvement | a desired policy/behaviour (a request, or a trend in the data) | proactive — change request → ship a feature |
It owns almost no new mechanics: the building blocks live in builder
(how to write a casuística, edit instructions, create a KB / example block / tool) and the
validation lives in quality-engineer (in-memory overrides,
dry-run, mocks, eval cases). What's unique here is the decision and iteration process —
above all Step 2: where does the policy live?
Assistants and playbooks are the same concept. In the API, the term "playbook" is used everywhere — but users refer to them as "assistants," "bots," or "agents." When the user mentions any of these, they mean a playbook.
Instructions vs casuísticas (skills):

- Instructions (the `content` on the playbook) — the base system prompt, injected into every conversation the assistant handles.
- Casuísticas (skills) are loaded on demand through `load_skill`, and only when the conversation matches the skill's description (the trigger). Their content is invisible to the LLM until that happens.

Playbook IDs:

- `playbook_base_id` — stable ID across all versions of an assistant. Use this for skill/version management.
- `playbook_id` — ID of a specific version.

The scripts read their credentials from these environment variables:

    export STUDIO_API_TOKEN="sbs_your_api_key_here"
    export STUDIO_PROJECT_ID="your-project-uuid"

API keys are available by request from the Studio Chat team.
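A minimal sketch of how a helper script might pick these variables up. Only the two variable names and the base URL come from the text above; the `load_config` helper and its fail-fast behaviour are illustrative assumptions, not the contents of the shipped scripts.

```python
import os

API_BASE_URL = "https://api.studiochat.io"  # base URL stated in this document

REQUIRED_VARS = ("STUDIO_API_TOKEN", "STUDIO_PROJECT_ID")


def load_config(env=None):
    """Return the credentials from the environment, failing fast if one is missing.

    Hypothetical helper: the real scripts may read these values differently.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "base_url": API_BASE_URL,
        "token": env["STUDIO_API_TOKEN"],
        "project_id": env["STUDIO_PROJECT_ID"],
    }
```

Failing before the first request keeps a missing token from surfacing later as a confusing authentication error.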
The loop has two entry points that converge on the same flow:
Policy / trend → Placement decision → Minimal draft → Validated change → Approved version → Eval coverage
There are four user checkpoints where you MUST stop: clarifying the policy (Step 1, if ambiguous), feedback on the draft (Step 3), the mock question (Step 4), and waiting for the human to approve each change (Step 5) — plus the full-suite offer at the end (Step 6). Don't barrel through them.
End this step with explicit agreement on what the assistant must / must not do, and the edge cases. Concretely:
Don't move forward on a vague ask ("que sea más amable"): pin it to observable behaviour
("que salude por el nombre cuando viene en el user_context, y que no use más de un emoji
por mensaje").
Everything the assistant reads at runtime lives in one of a few layers. Putting a policy in the wrong one is the most common mistake — a global rule buried in a casuística silently never fires; a one-off scenario stuffed into the base instructions bloats every conversation and invites regressions.
The deciding question is scope: does the policy affect EVERY conversation, or only the conversations that match a trigger?
| Scope | Goes in | Why | Examples |
|---|---|---|---|
| Global — applies to every conversation the assistant handles | Base instruction (content) | It's always in the prompt, so it always applies | tono · usar / no usar emojis · política de tagging global · formato de respuesta · idioma · reglas de marca |
| Segmented — only when a specific situation occurs (fired by the casuística's description/trigger) | Casuística (skill) — new or edit an existing one | It loads only for the conversations that match the trigger | derivación (solo en conversaciones con derivación) · robo de tarjeta (solo cuando hablan de robo de tarjeta) · reembolsos · un flujo de producto puntual |
The test is literally: "¿esto tiene que pasar en TODAS las conversaciones, o solo en las que disparan el trigger de la descripción?" All-conversations → instruction. Triggered-segment → casuística.
Once you've decided it's a casuística:

- If an existing casuística already covers that segment (its `description` already fires on those conversations) → edit it: extend its content. Don't create a second skill that competes for the same trigger.
- If none does → create a new one. Its `description` is what makes it load — write it to capture exactly the situation that should trigger it (the LLM picks skills off name + description, never the content). A casuística with a vague description never loads, no matter how good its content is.

Some policies belong in a different layer entirely — see builder for the mechanics:
| The change is really… | Layer | Note |
|---|---|---|
| Factual info that changes over time (precios, horarios, una política que se actualiza seguido) | Knowledge base (or a correction note on a KB item) | Don't hard-code volatile facts into the prompt — put them where they can be searched and updated without a version bump. |
| Tone / style shown by example | Example block via {{ examples: ID }} | Never paste sample conversations into instruction/skill prose — always a block. |
| A real-world action (mandar un Slack, crear un booking) | API tool / toolkit action | Wire it with the macro; the object must exist first. |
| A global tagging policy | Base instruction, with each tag value in single backticks | The whitelist is parsed from backticked tag values — see builder's Tagging section. |
State your placement decision and the reason before drafting — it's the call most worth getting agreement on.
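The placement rules in the two tables above can be condensed into a small decision function. This is a sketch for reasoning about the choice, not part of any tool: the `kind` labels and the `choose_layer` name are made up for the illustration.

```python
def choose_layer(kind, applies_to_every_conversation=False):
    """Map a policy to the layer it belongs in, following the tables above.

    `kind` is a hypothetical label used only in this sketch:
    "behaviour", "volatile_fact", "style_example", "action" or "tagging".
    """
    special_layers = {
        "volatile_fact": "knowledge base",
        "style_example": "example block",
        "action": "API tool / toolkit action",
        "tagging": "base instruction",
    }
    if kind in special_layers:
        return special_layers[kind]
    if kind != "behaviour":
        raise ValueError(f"unknown policy kind: {kind!r}")
    # The scope test: every conversation -> instruction, triggered segment -> casuística.
    return "base instruction" if applies_to_every_conversation else "casuística"
```

For instance, "máximo un emoji por mensaje" is `choose_layer("behaviour", True)`, while a refund rule is `choose_layer("behaviour", False)`.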
Write the smallest change that implements the policy, and present it before iterating:
Keep the diff tight: broad rewrites are how regressions sneak in. Then wait for feedback — the user often knows constraints invisible in the data (business rules, upcoming changes, tone).
Iterate the same way QA does — never save a throwaway playbook version to test. Everything
runs through the quality-engineer override flow: qa.py chat
and qa.py runs create with --instructions / --instructions-file / --skills-file (full
replace or surgical add/replace/remove patch). Nothing is persisted, no version is bumped,
no approval is generated yet.
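As a rough illustration of the override flow, the helper below assembles an argument list for a `qa.py chat` run. Only `chat`, `--verbose`, `--instructions-file` and `--skills-file` are taken from the text; the path to `qa.py`, the `python3` launcher and whatever arguments select the playbook are placeholders the caller must supply, so check the quality-engineer skill for the real invocation.

```python
def override_chat_command(qa_script, instructions_file=None, skills_file=None, extra_args=()):
    """Build an argv for a `qa.py chat` run with in-memory overrides.

    `qa_script` (path to qa.py) and `extra_args` (e.g. whatever identifies the
    playbook) are assumptions of this sketch. With no override files the
    command is simply the baseline run.
    """
    cmd = ["python3", qa_script, "chat", "--verbose"]
    if instructions_file is not None:
        cmd += ["--instructions-file", str(instructions_file)]
    if skills_file is not None:
        cmd += ["--skills-file", str(skills_file)]
    return cmd + list(extra_args)
```

Running the baseline and the overridden command side by side makes it easy to show that the draft, and not something else, changed the behaviour.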
Before chatting / before any eval run, ask the user what to mock (this is required, same as QA Step 6) — enumerate the playbook's tools so they can pin down deterministic conditions.
Then:

- For a casuística, confirm it actually loads: look for the `load_skill` call in the `qa.py chat --verbose` output. If it doesn't fire, the description is the problem, not the content (Step 2).
- If you draft an eval case along the way, `qa.py dry-run start` validates a case is gradable before you persist it.

Iterate `--instructions` / `--skills-file` / mocks until the new behaviour is reliable and you've spot-checked that adjacent scenarios still behave (a quick regression pass over whatever shares the layer you touched).
This is the "CI" of continuous-improvement — and the one place it differs from QA's pure hand-off. When the change is validated, push it via builder:
- Instruction change → `PATCH /playbooks/BASE_ID/latest` (fetch the latest first, apply on top).
- Casuística change → the skill endpoints (`POST`/`PATCH …/skills/...`).

Every instruction or skill modification you push generates an approval — one per change — that a human must approve before it goes live. Sandbox (`sbs_`) callers get a 202 with `{"approval_id": "...", "status": "pending", "message": "Request queued for admin approval."}` instead of an immediate write. So:
Push one change per logical edit so each approval is reviewable on its own (don't bundle a tone change and a new refund casuística into one opaque diff).
Describe every approval right after the 202 — the reviewer reads your text, not the payload:

    python3 scripts/api.py "/approvals/APPROVAL_ID/description" -X PATCH --body '{
      "description": "WHAT changes and WHY, plus the before → after in plain language."
    }'

Include the policy being added/changed, what motivated it (the user ask, the trend, the conversation), and the observable before → after. The description can only be edited while the approval is still pending; once it has been reviewed, the call returns 409.
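One way to keep approval descriptions consistent is to assemble the `scripts/api.py` call shown above from its parts. The endpoint and flags are the ones in this document; the what/why/before/after split and the `describe_approval_command` name are this sketch's own convention.

```python
import json


def describe_approval_command(approval_id, what, why, before, after):
    """Build the argv for the `scripts/api.py` description call.

    Hypothetical helper: it only composes the command, it does not run it.
    """
    description = f"{what} Why: {why} Before: {before} After: {after}"
    body = json.dumps({"description": description}, ensure_ascii=False)
    return [
        "python3", "scripts/api.py",
        f"/approvals/{approval_id}/description",
        "-X", "PATCH",
        "--body", body,
    ]
```

Passing the result to `subprocess.run(cmd, check=True)` as a list sidesteps shell quoting problems when the description contains quotes or accents.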
Confirm each change with the user before pushing it (builder confirms every write anyway), then wait for the human to approve the queued change(s) and for the new version to go live. Get the new version ID.
Don't author eval runs against the change assuming it'll be approved — wait for it to be live.
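The queued-versus-applied distinction described in this step can be sketched as a small response classifier. The 202 payload shape is the one quoted above; treating every other 2xx status as an immediate write is an assumption of this sketch, as is the function name.

```python
import json


def interpret_write_response(status_code, body_text):
    """Classify the response to an instruction or skill write.

    Returns (status, approval_id) from the payload when the change was queued
    (HTTP 202), or ("applied", None) for any other 2xx response.
    """
    if status_code == 202:
        payload = json.loads(body_text)
        return payload["status"], payload["approval_id"]
    if 200 <= status_code < 300:
        return "applied", None
    raise RuntimeError(f"write failed with HTTP {status_code}: {body_text}")
```

Keeping the returned `approval_id` is what lets you describe the approval straight away and later check that it went live before authoring eval runs.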
Close the loop with quality-engineer, scaling coverage to the scope from Step 3:
- Add an eval case for the new behaviour: the expected `termination`, structured assertions for what must (not) happen, and the same `tool_mocks` you used to validate so the case is deterministic.
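To make the three ingredients of an eval case concrete, here is a sketch with a completeness check. The dictionary layout is hypothetical (the real schema lives in quality-engineer); only the `termination` and `tool_mocks` names and the refund scenario values are borrowed from this document.

```python
REQUIRED_PARTS = ("termination", "assertions", "tool_mocks")


def missing_parts(case):
    """List which of the three ingredients named above are absent or empty."""
    return [part for part in REQUIRED_PARTS if not case.get(part)]


# Hypothetical case layout; the values mirror the refund example in this document.
refund_case = {
    "termination": "handoff",
    "assertions": [
        {"type": "tool_not_called", "tool": "process_refund"},
        {"type": "text", "value": "explica la política de 30 días"},
    ],
    "tool_mocks": {"lookup_order": {"order_age_days": 60}},
}
```

A case that comes back with a non-empty `missing_parts` list is likely to be flaky or ungradable, which is worth catching before it is persisted.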
| Question | Mechanism |
|---|---|
| "Is this a real, grounded need?" | data-expert — pull the trend / conversations behind it |
| "Does this apply to every conversation, or just a segment?" | The Step 2 scope test → instruction vs casuística |
| "Is there already a casuística for this segment?" | builder — list skills; edit it instead of adding a rival |
| "Will the new casuística even load?" | qa.py chat --verbose → watch the load_skill call (it's the description that fires it) |
| "Does the draft work without breaking neighbours?" | qa.py chat / runs create with --instructions / --skills-file overrides — no save |
| "Is my new eval case gradable?" | qa.py dry-run start |
| "How does the change ship?" | Push via builder → approval per change → human approves → version goes live |
| "Did the change land without regressions?" | qa.py runs create against the new version — offer the full suite |
Changing instructions or a casuística means assembling building blocks, not transcribing prose. The assistant gets its power from the macros and behaviours you wire into the text; a policy written as plain paragraphs when it should fire a tool, search a KB, or show an example is a half-built change. As you draft (Step 3), read the policy for the signals below and OFFER to wire the right block — the user frequently won't know to ask for it, so it's on you to propose it.
These are the four inline macros the compiler understands. The object has to exist before you reference it (no save-time validation — a dangling macro silently degrades to literal text), so offering one often spawns a quick sub-task in builder to create it.
| When the policy implies… | Offer to wire | Macro | Built via (builder) |
|---|---|---|---|
| "answer from our info", "look this up", "según nuestra política de X" | a KB search | {{ kb(KB_ID) }} | Knowledge Bases |
| "here's how a good reply sounds", a tone/style to imitate, a sample turn | an inline example | {{ examples: BLOCK_ID }} | Example Blocks — the only way to add examples; never paste sample turns into prose |
| "notify the team", "avisá por Slack", "escalá al canal de guardia" | a toolkit action | {{ custom_tool: short_name }} | Toolkit Actions — the toolkit must be connected by the user first; you can't connect it |
| "check the order", "call our API", "consultá el estado de X" | a custom API tool | {{ tool(TOOL_ID) }} | API Tools |
These aren't macros — you wire them by telling the assistant when to do it inside the instruction/casuística:
| When the policy implies… | Wire by instructing… |
|---|---|
| "derivá a un humano", "escalá", "que lo tome una persona" | a handoff — describe the exact trigger condition |
| "etiquetá como X", a tagging rule | a tag — name the value in single backticks so it's whitelisted (see builder's Tagging) |
| "marcá urgente", triage by severity | a priority set |
| "dejá nota para el próximo agente" | a private note |
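Since tag values are whitelisted from single-backticked text, it can help to preview which values a draft instruction would expose. The regex below is only an approximation written for this illustration; it is not the platform's parser, and it will also pick up backticked words that are not meant as tags.

```python
import re

# A value wrapped in exactly one backtick on each side (double backticks are skipped).
SINGLE_BACKTICK = re.compile(r"(?<!`)`([^`\n]+)`(?!`)")


def backticked_values(instruction_text):
    """Return the distinct single-backticked values found in the text, sorted."""
    return sorted(set(SINGLE_BACKTICK.findall(instruction_text)))
```

Comparing this list with the tags the policy actually intends is a cheap way to spot a missing or misspelled value before pushing the change.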
Don't wait to be told. When the policy type commonly pairs with a block, ask:
> User: "Necesito una casuística de emergencias."
>
> You (Step 1/3): "Dale. Una pregunta: cuando se dispara una emergencia, ¿querés que el asistente mande un Slack al equipo de guardia, además de responderle al cliente? Si sí, lo dejo cableado en la casuística."
If yes → check the Slack toolkit is connected (GET /custom-toolkits; if not, ask the user to
connect it — you can't), create the tool configuration in builder, and embed
{{ custom_tool: short_name }} in the casuística content at the step that handles the emergency —
next to the instruction describing when and what to send. The casuística is now a wired flow
("detectá la emergencia → respondé con calma → {{ custom_tool: notify_guardia_x1y2z }} → derivá"),
not just a paragraph of advice.
Same reflex for the others: a "explicá la política de reembolsos" casuística should probably
{{ kb(...) }} the refund policy rather than hard-code it; a "manejá reclamos con este tono"
casuística should carry an {{ examples: ... }} block, not a described tone.
Validate the wiring, not just the words (Step 4). After embedding a macro, run `qa.py chat --verbose` and confirm it actually fired — the `search_knowledge_base` / tool call shows up in the output, and `[MOCKED]` tells you a stub answered. A macro that points at a missing object just vanishes at runtime; the verbose run is how you catch it before shipping.
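Because nothing validates macros at save time, a static pre-check can complement the verbose run. The sketch below scans a draft for the four macro forms listed earlier and reports references to objects you have not confirmed exist. The patterns' whitespace tolerance, the `find_dangling_macros` name and the `known_ids` structure are assumptions of this illustration.

```python
import re

# One pattern per macro form listed in this document.
MACRO_PATTERNS = {
    "kb": re.compile(r"\{\{\s*kb\(\s*([^)\s]+)\s*\)\s*\}\}"),
    "tool": re.compile(r"\{\{\s*tool\(\s*([^)\s]+)\s*\)\s*\}\}"),
    "examples": re.compile(r"\{\{\s*examples:\s*([^\s}]+)\s*\}\}"),
    "custom_tool": re.compile(r"\{\{\s*custom_tool:\s*([^\s}]+)\s*\}\}"),
}


def find_dangling_macros(text, known_ids):
    """Return (macro_kind, object_id) pairs whose object is not in `known_ids`.

    `known_ids` maps each macro kind to the set of IDs known to exist,
    collected beforehand by whatever means you have. Hypothetical helper.
    """
    dangling = []
    for kind, pattern in MACRO_PATTERNS.items():
        for object_id in pattern.findall(text):
            if object_id not in known_ids.get(kind, set()):
                dangling.append((kind, object_id))
    return dangling
```

An empty result does not prove the macro fires at runtime, so the verbose chat run remains the real check.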
- Placement (Step 2): segmented, so a casuística. A `reembolsos` skill already exists and its description fires on refund talk → edit it, don't add a rival.
- Draft (Step 3): add to the `reembolsos` content: "Antes de confirmar un reembolso, verificá la ventana de 30 días. Fuera de ventana: explicá la política y derivá; nunca prometas el reembolso." Regression risk: contained to the refund segment.
- Validate (Step 4): agree the mocks with the user (`lookup_order` with a 60-day-old order). Baseline reproduces the over-promise; with the `--skills-file` patch it explains + hands off. Confirm the `reembolsos` skill loaded in the verbose output.
- Ship (Step 5): `PATCH …/skills/reembolsos` via builder → 202 approval queued → human approves → new version live, grab the version ID.
- Eval (Step 6): a case asserting `handoff` + `tool_not_called process_refund` + text "explica la política de 30 días". Run it against the new version; offer the full suite.

Contrast with a global change — "de ahora en más, máximo un emoji por mensaje": that's every conversation → base instruction, pushed via `PATCH /playbooks/BASE_ID/latest`, and it warrants eval cases across several flows because its blast radius is the whole assistant.
This skill orchestrates three others (builder, quality-engineer, and data-expert); reach for them rather than re-deriving.