skills/scorable-integration/SKILL.md
Integrate Scorable LLM-as-a-Judge evaluators into applications with LLM interactions. Use when users want to add evaluation, guardrails, or quality monitoring to their LLM-powered applications. Also use when users mention Scorable, judges, LLM evaluation, or safeguarding applications.
npx skillsauth add root-signals/scorable-skills scorable-integration
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
These instructions guide you through creating LLM evaluation judges with Scorable and integrating them into your codebase. Scorable is a tool for creating LLM-as-a-Judge evaluators that safeguard applications. A judge is the Scorable term for a group of evaluations across different metrics (Helpfulness, Policy Adherence, etc.).
You are responsible for completing Scorable setup and integration end-to-end in as few turns as possible.
Your role is to:
Note: These instructions work for both creating new judges from scratch and integrating existing judges. If the user provides a judge ID, you can skip the judge creation step (Step 3) and proceed directly to integration (Step 4).
Before performing any analysis or technical steps, pause and clearly brief the user on what is about to happen. Explain that you will:
Examine the codebase to understand:
If multiple LLM interactions exist, help the user prioritize. Recommend starting with the most critical one first.
First, install the Scorable CLI:
curl -sSL https://scorable.ai/cli/install.sh | sh
Or with npm:
npm install -g @root-signals/scorable-cli
Or run without installing via npx:
npx @root-signals/scorable-cli judge list
Then ask the user which authentication option they prefer:
Direct them to: https://scorable.ai/api-key-setup to create an API key, then set it via the CLI:
scorable auth set-key
# paste the key when prompted
# or alternatively:
scorable auth set-key <your-api-key>
Security: Instruct the user to store the key in environment variables or the project's secret management. Use an existing .env file if one is available, or ask the user to save the key as an environment variable. Do not ask the user to paste the key into this session.
Get a free demo key (no registration required):
scorable auth demo-key
Warn the user appropriately that:
Also remember the api_token field: it is used only as a URL parameter in the judge URL and in no other context.
If they have an account: https://scorable.ai/settings/api-keys
Set via CLI:
scorable auth set-key <your-api-key>
Or use an environment variable:
export SCORABLE_API_KEY="sk-your-api-key"
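Application code can then read the key from the environment instead of hard-coding it. The following is a minimal Python sketch, assuming the SCORABLE_API_KEY variable name from the export above; the helper name load_scorable_api_key is hypothetical and is not part of any Scorable SDK.

```python
import os


def load_scorable_api_key(env=None):
    """Return the Scorable API key from the environment, or raise if it is unset.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    The function name is illustrative, not a Scorable API.
    """
    env = os.environ if env is None else env
    key = env.get("SCORABLE_API_KEY", "").strip()
    if not key:
        # Fail early with an actionable message rather than sending
        # unauthenticated requests later on.
        raise RuntimeError(
            "SCORABLE_API_KEY is not set; export it or add it to the project's .env file"
        )
    return key
```

Failing at startup keeps a missing key from surfacing later as an opaque authentication error in the middle of an evaluation call.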
Projects: Resources you create land in the org's default project automatically. Don't pass --project-id unless the user explicitly asks to scope a specific project.
Note: If the user has already provided a judge ID (e.g., in their message), you can skip this step and proceed directly to Step 4 (Integration).
Note: After the user has authenticated, take control back and run the following commands yourself.
Generate a judge using the Scorable CLI with a detailed intent string.
Example:
scorable judge generate \
--intent "An email automation system that creates summary emails using an LLM based on database query results and user input. Evaluate the LLM output for: accuracy in summarizing data, appropriate tone for the audience, inclusion of all key information from queries, proper formatting, and absence of hallucinations. The system is used for customer-facing communications." \
--visibility private \
--reasoning-effort medium
Use --visibility public if using a temporary API key.
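When the command is run from a script rather than typed by hand, a long intent string is easier to handle as an argument list than as a quoted shell line. The sketch below assembles such a list in Python using only the flags that appear in this guide. The function name build_generate_command is hypothetical, and which flags the CLI accepts together is an assumption to verify against the CLI's own help output.

```python
import json
import shlex


def build_generate_command(intent, visibility="private", reasoning_effort="medium",
                           stage=None, judge_id=None, extra_contexts=None):
    """Assemble argv for `scorable judge generate` from flags shown in this guide.

    Hypothetical helper: flag combinations are an assumption, not confirmed CLI behavior.
    """
    args = ["scorable", "judge", "generate", "--intent", intent]
    if visibility is not None:
        args += ["--visibility", visibility]
    if reasoning_effort is not None:
        args += ["--reasoning-effort", reasoning_effort]
    if stage is not None:
        args += ["--stage", stage]
    if judge_id is not None:
        args += ["--judge-id", judge_id]
    if extra_contexts is not None:
        # The guide passes extra contexts as a JSON object string.
        args += ["--extra-contexts", json.dumps(extra_contexts)]
    return args


command = build_generate_command("Evaluate summary emails for accuracy and tone.")
# shlex.join renders the list as a copy-pasteable, correctly quoted shell line.
print(shlex.join(command))
```

Passing the list directly to subprocess.run (without shell=True) sidesteps shell quoting of the intent text altogether.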
Optional fields:
enable_context_aware_evaluators: Set to true if the application uses RAG and the relevant document chunks can be extracted and passed to the evaluation (hallucinations, context drift, etc.).
Note that this can take up to 2 minutes to complete.
The CLI may return:
1. missing_context_from_system_goal - Additional context needed:
→ Ask the user for these details (if not evident from the code base), then re-run with the additional context:
scorable judge generate \
--intent "..." \
--judge-id <existing-judge-id> \
--extra-contexts '{"target_audience":"Enterprise customers"}'
2. multiple_stages - Judge detected multiple evaluation points:
{
"error_code": "multiple_stages",
"stages": ["Stage 1", "Stage 2", "Stage 3"]
}
→ Ask the user which stage to focus on, or if they have a custom stage name. Each judge evaluates one stage. You can create additional judges later for other stages. Re-run with --stage "<stage name>".
3. Success - Judge created:
{
"judge_id": "abc123...",
"evaluator_details": [...]
}
→ Proceed to integration.
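The three outcomes above can be sketched as a small dispatcher. This Python example assumes the CLI prints a single JSON object and that the missing-context case reports its code in the same error_code field shown for multiple_stages; both are assumptions, and the function name classify_generate_result is hypothetical.

```python
import json


def classify_generate_result(raw_output):
    """Map the JSON printed by `scorable judge generate` to a next action.

    Assumes one JSON object on stdout and an `error_code` field for both
    non-success cases; adjust if the real output differs.
    """
    result = json.loads(raw_output)
    error_code = result.get("error_code")
    if error_code == "missing_context_from_system_goal":
        # Ask the user for the missing details, then re-run with --extra-contexts.
        return "needs_context", result
    if error_code == "multiple_stages":
        # Ask the user which stage to evaluate, then re-run with --stage.
        return "choose_stage", result.get("stages", [])
    if "judge_id" in result:
        # Judge created: continue to integration with this ID.
        return "created", result["judge_id"]
    # Anything else is unexpected; surface it rather than guessing.
    return "unknown", result
```

Returning an explicit "unknown" outcome keeps unexpected output visible instead of silently treating it as success.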
Add code to evaluate LLM outputs at the appropriate execution point(s). If the codebase is using a framework, check if there are integration instructions in Scorable docs (using curl is enough): https://docs.scorable.ai/llms.txt
Choose the appropriate integration guide based on the codebase language:
response parameter: The text you want to evaluate (required)
request parameter: The input that prompted the response (optional but recommended)
If a multi-turn conversation is detected, use the multi-turn format to evaluate the entire conversation flow. This may also include tool calls. Confirm with the user whether multi-turn evaluation would suit their needs. See the language-specific guides for details.
If the application uses RAG, you MUST pass a contexts parameter to the judge run. Optional parameters (user_id, tags, expected_output) and the exact result shape are documented in the language-specific reference files linked above.
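The parameter rules above can be checked before any call is made. The following Python sketch only builds and validates the inputs. The parameter names response, request, and contexts come from this guide, while the function name build_judge_inputs and the uses_rag flag are hypothetical. How the resulting dictionary is handed to the SDK or API is left to the language-specific guides.

```python
def build_judge_inputs(response, request=None, contexts=None, uses_rag=False):
    """Validate and collect the inputs for a judge run.

    Hypothetical helper: it enforces the rules stated in this guide but does
    not call Scorable itself.
    """
    if not response:
        raise ValueError("response is required: it is the text being evaluated")
    if uses_rag and not contexts:
        raise ValueError("RAG applications must pass contexts to the judge run")
    inputs = {"response": response}
    if request is not None:
        # Optional but recommended: the input that prompted the response.
        inputs["request"] = request
    if contexts:
        # Retrieved document chunks, one string per chunk.
        inputs["contexts"] = list(contexts)
    return inputs
```

Validating in one place makes it harder to ship a RAG integration that silently omits its retrieved chunks.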
After integration:
https://scorable.ai/judge/{judge_id}