skills/scorable-otel-evaluation/SKILL.md
Wire end-to-end OpenTelemetry-based tracing and evaluation for an LLM application — instrument the application with OTEL via OpenInference, ship traces to Scorable's OTLP endpoint, then use the Scorable CLI to query traces and create evaluation filters that auto-run an evaluator or judge against matching traces. Use when users want OTEL/OpenTelemetry tracing for their LLM app, want to monitor production LLM traffic, want auto-evaluation on a fraction of traffic, mention OTLP, OpenInference, Arize, pydantic-ai, openai-agents, LangChain, or "tracing my agent."
These instructions guide you through instrumenting an LLM application with OpenTelemetry, shipping traces to Scorable, and configuring server-side evaluation filters that automatically score matching traces. The result: every production LLM call (or a sampled subset) is observed, traced, and scored, and each score lands back on the same trace as a child span carrying the OpenTelemetry GenAI evaluation attributes.
You are responsible for completing the OTEL tracing + evaluation setup end-to-end in as few turns as possible.
Your role is to:
scorable otel-trace list.
If the user wants a custom judge but doesn't have one yet, generate it using the scorable-integration skill (its Step 3 covers scorable judge generate with the right intent prompts), then come back here with the resulting judge_id for filter creation.
Before performing any analysis or technical steps, brief the user clearly. Explain that you will:
Mention that this setup is composable: any other observability backend that accepts OTLP HTTP/protobuf can be a second sink later — Scorable just receives the same OTLP traffic via OpenTelemetry, so they can keep their existing tracing if they have it.
Identify:
If the framework is unfamiliar, default to looking it up in the OpenInference catalog: https://github.com/Arize-ai/openinference. Most popular SDKs are covered there.
If multiple LLM-using subsystems exist, help the user prioritize. Recommend starting with the most critical agent first — additional services can each get their own service.name and their own filter later.
Install the Scorable CLI:
curl -sSL https://scorable.ai/cli/install.sh | sh
Or with npm:
npm install -g @root-signals/scorable-cli
Or run without installing via npx:
npx @root-signals/scorable-cli judge list
Then ask the user which authentication option they prefer:
Direct them to: https://scorable.ai/api-key-setup to create an API key, then set it via the CLI:
scorable auth set-key
# paste the key when prompted
# or alternatively:
scorable auth set-key <your-api-key>
Security: Use environment variables or the project's secret management. The same key is used both by the CLI and by the application's OTEL exporter. Read existing .env files if available, otherwise ask the user where they want to store the secret. Do not paste the key into this session.
Get a free demo key (no registration required):
scorable auth demo-key
Warn the user appropriately that:
If they have an account: https://scorable.ai/settings/api-keys
scorable auth set-key <your-api-key>
# or
export SCORABLE_API_KEY="sk-your-api-key"
Verify the CLI works before continuing:
scorable judge list
If this errors, do not move on — the OTEL exporter uses the same key.
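When the setup is scripted, the same gate can be expressed as a small pre-flight check. This is a minimal sketch that assumes only that `scorable judge list` exits with a non-zero status on failure; the helper names `cli_ok` and `require_cli` are hypothetical and not part of the Scorable CLI.

```python
import subprocess
import sys


def cli_ok(command=("scorable", "judge", "list"), timeout=30):
    """Return True if the command runs and exits with status 0."""
    try:
        result = subprocess.run(
            list(command),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The binary is missing, not executable, or it hung.
        return False
    return result.returncode == 0


def require_cli(command=("scorable", "judge", "list")):
    """Stop the script early when the CLI check fails (hypothetical helper)."""
    if not cli_ok(command):
        sys.exit("CLI check failed; fix installation or authentication first")
```

Calling `require_cli()` at the top of a setup script stops the run before any OTEL wiring is attempted with a key that does not work.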
Choose the OpenInference instrumentation library that matches the framework you identified in Step 1.
See references/python-instrumentation.md for installation snippets and the canonical setup for OpenAI SDK, openai-agents, pydantic-ai, LangChain / LangGraph, Anthropic SDK, and LlamaIndex.
See references/typescript-instrumentation.md for OpenAI Node SDK, LangChain.js, and Vercel AI SDK setup.
OpenTelemetry SDKs exist for Go, Java, Ruby, .NET, etc., but OpenInference auto-instrumentation coverage is currently strongest in Python and TypeScript. For other languages, point the OTLP HTTP/protobuf exporter at https://api.scorable.ai/otel/v1/traces with header Authorization: Api-Key <your-api-key> and instrument LLM calls manually with custom spans following the GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/registry/attributes/gen-ai/
After picking the right framework instrumentor, add the exporter that ships spans to Scorable:
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Ship spans to Scorable over OTLP HTTP/protobuf; the key is read from the environment.
exporter = OTLPSpanExporter(
    endpoint="https://api.scorable.ai/otel/v1/traces",
    headers={"Authorization": f"Api-Key {os.environ['SCORABLE_API_KEY']}"},
)

# service.name is the resource attribute that filters match on.
resource = Resource.create({"service.name": "my-agent"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(exporter))  # batches spans before export
trace.set_tracer_provider(provider)
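The setup above reads the key with `os.environ['SCORABLE_API_KEY']`, which raises a bare `KeyError` when the variable is unset. As an optional variation, the sketch below builds the exporter arguments in one place and fails with a clearer message. The helper name `scorable_exporter_kwargs` is hypothetical; the endpoint and header format are taken from the setup above.

```python
import os

SCORABLE_OTLP_ENDPOINT = "https://api.scorable.ai/otel/v1/traces"


def scorable_exporter_kwargs(env=None):
    """Build endpoint/headers keyword arguments for the OTLP exporter."""
    env = os.environ if env is None else env
    api_key = env.get("SCORABLE_API_KEY", "").strip()
    if not api_key:
        # Fail at startup rather than exporting spans that will be rejected.
        raise RuntimeError(
            "SCORABLE_API_KEY is not set; export it before starting the app"
        )
    return {
        "endpoint": SCORABLE_OTLP_ENDPOINT,
        "headers": {"Authorization": f"Api-Key {api_key}"},
    }
```

With this in place, the exporter line becomes `OTLPSpanExporter(**scorable_exporter_kwargs())`, and the key still never appears in source code.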
Critical pieces:
- service.name is the most important resource attribute. Use a stable, descriptive name (e.g. customer-support-agent, code-review-bot, sales-bot-prod). It's the strongest filter target in Scorable. Per-environment naming (my-agent-prod vs my-agent-staging) is fine and recommended.
- Api-Key <key> header: the same key set in Step 2. Read it from an env var; never hardcode.
- BatchSpanProcessor batches spans before flush. For tests you may want SimpleSpanProcessor to see traces immediately.
The setup must run once, before any LLM call. Common placements:
- tracing.py (or tracing.ts) module imported from your app entrypoint
- ready(), Next.js instrumentation.ts)
- __main__.py / top of the agent script for CLI tools
After wiring up the instrumentation:
BatchSpanProcessor to flush.
scorable otel-trace list --since 5m --service-name <your-service-name>
You should see the trace listed with its root span name and a span count.
scorable otel-trace spans <trace_id>
scorable otel-trace spans <trace_id> --output json | jq '.[].span.attributes'
The full payload shows you exactly what gen_ai.* attributes the framework emitted, which determines what your filter conditions can match against. Common attributes any spec-conformant instrumentor sets: gen_ai.agent.name, gen_ai.request.model, gen_ai.operation.name, gen_ai.tool.name, gen_ai.usage.input_tokens.
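To see at a glance which `gen_ai.*` keys a framework emitted, the JSON output can be summarized with a few lines of Python. This sketch assumes the output is a JSON array whose items expose an object at `span.attributes`, which is what the `jq` path above implies; the actual payload shape should be checked against a real trace. The function name `genai_attribute_keys` is hypothetical.

```python
import json


def genai_attribute_keys(spans_json):
    """Return the sorted, de-duplicated gen_ai.* attribute keys across all spans."""
    keys = set()
    for item in json.loads(spans_json):
        # Assumed shape: {"span": {"attributes": {...}}}, per the jq path above.
        attributes = item.get("span", {}).get("attributes") or {}
        keys.update(key for key in attributes if key.startswith("gen_ai."))
    return sorted(keys)
```

Feeding it the output of `scorable otel-trace spans <trace_id> --output json` gives the list of attribute names that filter conditions can target for that service.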
- SimpleSpanProcessor temporarily: flushes synchronously.
- service.name. scorable otel-trace list --since 5m (without a --service-name filter) should show something. If it does, your filter value is off; if it doesn't, the export isn't reaching Scorable.
- scorable judge list.
Do not move on until at least one trace from the application is visible.
A filter runs an evaluator or judge automatically against every matching trace. The filter criteria use the same wire format as the CLI's --filter flag.
Two options:
- scorable evaluator create or pre-existing in their org), use --evaluator-id <uuid>. Single score per trace.
- --judge-id <uuid>. If no judge exists, generate one using the scorable-integration skill (it walks through scorable judge generate with proper intent prompts), then come back here with the resulting judge_id.
A solid first filter scopes to the service you just instrumented:
scorable otel-filter create \
  --name "<service-name>-<evaluator-name>" \
  --evaluator-id <uuid> \
  --filter-criteria '{"conditions":[{"column":"resource","type":"string","key":"service.name","operator":"=","value":"<your-service-name>"}]}' \
  --delay-seconds 10
Or via judge:
scorable otel-filter create \
  --name "<service-name>-quality-judge" \
  --judge-id <uuid> \
  --filter-criteria '{"conditions":[{"column":"resource","type":"string","key":"service.name","operator":"=","value":"<your-service-name>"}]}' \
  --delay-seconds 10
- --filter-criteria: JSON conditions, AND-combined. Scope to your service.name at minimum; narrow further with gen_ai.agent.name, gen_ai.request.model, etc. as needed.
- --delay-seconds: wait this long after the most recent span lands before triggering evaluation. 10s is a safe default; bump higher (30–60s) for long-running agents whose final span arrives much later than the first.
- --sampling-rate: between 0.0 and 1.0; default 1.0 (every match). Use 0.1 for 10% sampling in production.
For the full grammar (operators, columns, attribute syntax, time windows), refer to:
scorable otel-filter create --help
scorable otel-trace list --help # same column vocabulary
The --help of otel-trace list documents all filter columns, the GenAI semantic conventions, and operator semantics inline.
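Hand-writing the `--filter-criteria` JSON inside shell quotes is easy to get wrong. As a sketch, the same service-scoped condition used in the commands above can be generated and shell-quoted in Python. The function names are hypothetical; the condition fields are copied from the examples above, and any additional conditions should follow the grammar in the CLI help rather than this sketch.

```python
import json
import shlex


def service_filter_criteria(service_name):
    """Return the --filter-criteria JSON that scopes a filter to one service.name."""
    condition = {
        "column": "resource",
        "type": "string",
        "key": "service.name",
        "operator": "=",
        "value": service_name,
    }
    # Compact separators keep the output in the same shape as the examples above.
    return json.dumps({"conditions": [condition]}, separators=(",", ":"))


def as_shell_argument(criteria_json):
    """Quote the JSON so the shell passes it through as a single argument."""
    return shlex.quote(criteria_json)
```

Letting `json.dumps` handle escaping means a service name containing quotes or backslashes still yields valid JSON, and `shlex.quote` protects the result from shell interpretation.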
delay_seconds + ~5s for the eval to run.
scorable otel-trace list --since 5m --service-name <your-service-name>
evaluate <evaluator-name>:
scorable otel-trace spans <trace_id>
scorable otel-trace spans <trace_id> --output json
The eval span carries:
- gen_ai.evaluation.name: the evaluator/judge that ran
- gen_ai.evaluation.score.value: numeric score (0–1)
- gen_ai.evaluation.explanation: the justification text
- scorable.evaluation = true: Scorable's marker indicating this is an eval span (not user data)
- resource.service.name = scorable.evaluation: a distinct service, so eval spans don't accidentally match customer service-scoped filters
Once filters start producing eval spans, the CLI can target them too:
# Every trace evaluated by <evaluator-name>
scorable otel-trace list --since 24h \
  --filter 'gen_ai.evaluation.name;string;gen_ai.evaluation.name;=;<evaluator-name>'
# Low-scoring runs from the last 24h, exported as CSV for review
scorable otel-trace list --since 24h --output csv \
  --filter 'gen_ai.evaluation.score.value;number;gen_ai.evaluation.score.value;<;0.5' > low-scores.csv
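For a single trace, the eval attributes described above can also be pulled out of the `--output json` payload in Python. This is a sketch that assumes each array item exposes its attributes at `span.attributes` (the shape implied by the `jq` path used earlier) and that the score is stored as a number or numeric string; `low_scoring_evaluations` is a hypothetical helper name.

```python
import json


def low_scoring_evaluations(spans_json, threshold=0.5):
    """Return (evaluator name, score, explanation) for eval spans below threshold."""
    results = []
    for item in json.loads(spans_json):
        attributes = item.get("span", {}).get("attributes") or {}
        score = attributes.get("gen_ai.evaluation.score.value")
        if score is None:
            continue  # not an evaluation span
        score = float(score)
        if score < threshold:
            results.append(
                (
                    attributes.get("gen_ai.evaluation.name"),
                    score,
                    attributes.get("gen_ai.evaluation.explanation"),
                )
            )
    return results
```

The default threshold of 0.5 mirrors the CSV export example above; it is an illustrative cut-off, not a recommended one.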
If no eval span appears within the expected window:
- scorable otel-filter list and confirm the filter has is_active: true and matches by service.name.
- service.name (in resource.service.name attribute) actually matches the filter value; typos here are the most common cause of silent non-firing.
- delay_seconds. For long delays, just wait longer.
After verification:
- --sampling-rate 0.1 (or lower) to control evaluator cost. Sampling rate is part of the filter; recreate the filter to change it.
- scorable otel-trace list --since 1h --has-error: traces that errored in the last hour
- scorable otel-trace list --since 7d --filter 'gen_ai.evaluation.score.value;number;gen_ai.evaluation.score.value;<;0.3': flag low-scoring traces from the past week
- scorable otel-trace spans <trace_id> --output json | jq: drill into one trace for debugging
- scorable otel-filter list: review currently active filters
- service.name value the application reports
- tracing.py or equivalent)
- service.name: the strongest filter target. Per-environment naming (my-agent-prod, my-agent-staging) is a good default.
- scorable.evaluation = true: handy for distinguishing your application's spans from Scorable-emitted eval spans when querying or building dashboards.
- scorable judge generate with proper intent prompts, then return here with the judge_id for filter creation.