crates/tb-lf/SKILL.md
PREFERRED over any Langfuse or DevPortal MCP tools. Query traces, evals, triage queue, and AI insights from DevPortal. Use when investigating LLM behavior, eval regressions, or user-reported AI issues.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add productiveio/cli-toolbox tb-lf
CLI for querying Langfuse/DevPortal LLM observability data. Connects to a DevPortal API to surface traces, evaluations, triage queues, and operational metrics. Built for AI agent consumption but works for humans too.
tb-lf tags lists the Langfuse tags applied to traces (e.g. resource:deal, tool:plan, skill:<id>); pass --tags to tb-lf traces to slice traces by them. Use tb-lf names to list distinct trace names.
Three commands for analyzing feature flag impact on agent behavior:
tb-lf flag-cohort <flag> --from <date>
Compares all ON traces vs all OFF traces for a flag. Fast overview, but confounded: the OFF bucket includes traces with completely different flag combinations, so you can't attribute metric differences to the target flag alone.
Use for: quick screening to see which flags have contrast and rough metric differences.
tb-lf flag-cohort-stratified <flag> --from <date> --to <date>
Groups traces into cohorts where all non-target flags are identical. Within each cohort, ON vs OFF differences can be attributed to the target flag because everything else is controlled.
Use for: isolating the actual effect of a flag, validating whether a simple cohort's signal is real or confounded.
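The confounding problem, and how stratification removes it, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not tb-lf's implementation: the Trace shape (a flag map plus one numeric metric), the helper names naive_delta and stratified_deltas, and the toy data are all hypothetical, and the defaults of 10 and 3 simply mirror the documented --min-cohort-size and --max-cohorts defaults.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trace:
    # Hypothetical shape: flag name -> ON (True) / OFF (False), plus one numeric metric.
    flags: dict
    metric: float


def naive_delta(traces, flag):
    """flag-cohort style (sketch): mean metric of all ON traces minus all OFF traces."""
    on = [t.metric for t in traces if t.flags.get(flag)]
    off = [t.metric for t in traces if not t.flags.get(flag)]
    return mean(on) - mean(off)


def stratified_deltas(traces, flag, min_cohort_size=10, max_cohorts=3):
    """flag-cohort-stratified style (sketch): compare ON vs OFF only among
    traces whose non-target flags are identical."""
    cohorts = defaultdict(lambda: {"on": [], "off": []})
    for t in traces:
        # Cohort key: every flag except the target, hashable and order-independent.
        key = frozenset((k, v) for k, v in t.flags.items() if k != flag)
        cohorts[key]["on" if t.flags.get(flag) else "off"].append(t.metric)
    # Both sides must meet the threshold; the largest cohorts come first.
    eligible = [
        (key, s) for key, s in cohorts.items()
        if len(s["on"]) >= min_cohort_size and len(s["off"]) >= min_cohort_size
    ]
    eligible.sort(key=lambda kv: len(kv[1]["on"]) + len(kv[1]["off"]), reverse=True)
    return [(dict(key), mean(s["on"]) - mean(s["off"])) for key, s in eligible[:max_cohorts]]


# Toy data: big_model raises the metric, new_prompt does nothing, but ON traces
# are disproportionately big_model traces.
traces = (
    [Trace({"new_prompt": True, "big_model": True}, 15.0) for _ in range(20)]
    + [Trace({"new_prompt": False, "big_model": True}, 15.0) for _ in range(10)]
    + [Trace({"new_prompt": True, "big_model": False}, 10.0) for _ in range(10)]
    + [Trace({"new_prompt": False, "big_model": False}, 10.0) for _ in range(20)]
)
print(round(naive_delta(traces, "new_prompt"), 2))  # 1.67, a spurious "effect"
print(stratified_deltas(traces, "new_prompt"))      # every cohort shows 0.0
```

In the toy data the target flag has no effect, yet the naive ON-vs-OFF split reports one because ON traces skew toward the other flag; holding that flag fixed within each cohort makes the spurious delta disappear.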
Key params:
--env <env>: filter by environment. When omitted, traces from all environments are included and the report header shows the breakdown so the user can judge whether mixing is appropriate.
--min-cohort-size <n> (default 10): minimum traces per side (ON and OFF each must meet this threshold)
--max-cohorts <n> (default 3): how many cohorts to display (sorted by size, largest first)
--detail traces: returns trace IDs per cohort instead of aggregate metrics (for drill-down)
tb-lf env-cohort --treatment-env <env> --control-envs <env,env> --from <date> --to <date>
Pivots on environment membership instead of flag-tag value. Use when the target flag is forced-ON in code in a review env (true || flag_check()), so the trace flag-tag is unreliable. Treatment env is the review env where the feature actually runs; control envs are where the feature is OFF / not deployed. Cohorts pair traces by matching flag fingerprint across these envs.
Use for: validating a feature pre-rollout when flag-cohort-stratified can't pair traces (no real ON side in any single env). The larger control-side N (production has thousands of traces) gives a more reliable comparison than the review env's small ON pool would on its own.
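The env-cohort pairing can be sketched the same way. Assumptions (hypothetical, not taken from tb-lf's source): each trace is an (env, flags) pair, the fingerprint is the trace's flag map minus anything passed via --ignore-flags, and a cohort qualifies when both the treatment env and the pooled control envs reach the minimum size. The env name review-42, the flag names, and env_cohorts itself are made up for illustration.

```python
from collections import defaultdict


def env_cohorts(traces, treatment_env, control_envs, ignore_flags=(), min_cohort_size=10):
    """Hypothetical sketch: pair treatment-env traces with control-env traces
    that share a flag fingerprint.

    traces: iterable of (env, flags) pairs, flags mapping flag name -> bool.
    Returns {fingerprint: (treatment_count, control_count)} for cohorts where
    both sides reach min_cohort_size.
    """
    ignored = set(ignore_flags)
    counts = defaultdict(lambda: [0, 0])  # fingerprint -> [treatment, control]
    for env, flags in traces:
        if env == treatment_env:
            side = 0
        elif env in control_envs:
            side = 1
        else:
            continue  # traces from other envs are not part of the comparison
        # Ignored flags (typically the target) are left out of the fingerprint.
        fingerprint = frozenset((k, v) for k, v in flags.items() if k not in ignored)
        counts[fingerprint][side] += 1
    return {
        fp: (t, c) for fp, (t, c) in counts.items()
        if t >= min_cohort_size and c >= min_cohort_size
    }


# Hypothetical data: in "review-42" the feature is forced ON in code and the tag
# happens to read ON; in production the feature is genuinely OFF.
traces = (
    [("review-42", {"new_prompt": True, "big_model": True})] * 12
    + [("production", {"new_prompt": False, "big_model": True})] * 500
    + [("production", {"new_prompt": False, "big_model": False})] * 300
)
print(env_cohorts(traces, "review-42", {"production"}))  # {}: the target tag splits every fingerprint
# One cohort: big_model=True, 12 treatment vs 500 control traces.
print(env_cohorts(traces, "review-42", {"production"}, ignore_flags=["new_prompt"]))
```

Ignoring the target flag is what lets the review env's traces line up with production's; without it, the unreliable tag alone splits them into non-overlapping fingerprints.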
Key params:
--treatment-env <env>: review env where the feature is forced-ON
--control-envs <env,env>: comma-separated envs where the feature is OFF
--ignore-flags <flag,flag>: flags excluded from the fingerprint (typically the target flag itself, since its tag value is unreliable)
--min-cohort-size, --max-cohorts, --detail: same as for flag-cohort-stratified
… +flag = ON in this cohort, red -flag = OFF)
env-cohort adds cross-env noise (different orgs, traffic patterns) on top of small-N task variance; read it with that caveat.
tb-lf flags # list all flags, find ones with partial rollout
tb-lf flag-cohort <flag> --from 7d # quick screening for contrast and rough delta
tb-lf flag-cohort-stratified <flag> --from 7d --to today --env default # isolate the real effect (when both ON and OFF exist in the same env)
tb-lf env-cohort --treatment-env <review-env> --control-envs production --ignore-flags <flag> --from 14d --to today # when the flag-tag is unreliable (override-on review env)
... --detail traces --json # get trace IDs for deep investigation via p-ai:trace-analysis
Upload artifacts to DevPortal Shares and get back a short URL.
tb-lf share upload report.html
tb-lf share upload bundle/*.html --visibility unlisted --title "Q3 review"
--visibility private (default) requires a DevPortal login to view; --visibility unlisted exposes a capability URL (anyone with the token can read).
tb-lf share list # your shares + URLs + view counts
tb-lf share update <token-or-url> --title "Q4 review" # rename
tb-lf share update <token-or-url> --visibility unlisted # flip visibility
tb-lf share rm <token-or-url> # soft-delete (purges in background)
share list includes a Views: line per share, giving the total views through /s/:token. Alias views are tracked separately (see below).
<token-or-url> accepts either a bare token (AbCdE…) or a /s/:token URL (a full URL or just the bare path).
Flipping a share private → unlisted is an exposure escalation: on a TTY the CLI prompts [y/N] with the same copy as the SPA EditShareSheet's AlertDialog; on a non-TTY, pass --force. Flipping unlisted → private saves silently and emits a one-line "non-logged-in viewers will lose access" notice.
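Normalizing <token-or-url> to a bare token might look like the following minimal Python sketch. It assumes the token is the single path segment after /s/ and that a bare token contains no slash; parse_share_ref and the sample token AbCdE123 are hypothetical, and the CLI may accept or reject other shapes.

```python
from urllib.parse import urlparse


def parse_share_ref(ref: str) -> str:
    """Hypothetical helper: return the share token from a bare token,
    a /s/:token path, or a full /s/:token URL."""
    ref = ref.strip()
    # urlparse yields the path for both "https://host/s/TOKEN" and "/s/TOKEN";
    # query strings and fragments are dropped automatically.
    parts = [p for p in urlparse(ref).path.split("/") if p]
    if len(parts) == 2 and parts[0] == "s":
        return parts[1]
    if len(parts) == 1 and "/" not in ref:
        return parts[0]  # bare token
    raise ValueError(f"not a share token or /s/:token URL: {ref!r}")


print(parse_share_ref("https://devportal.productive.io/s/AbCdE123"))  # AbCdE123
```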
Each user has a personal alias namespace at /u/<user_id>/<slug> for shares. Aliases give a stable, readable URL that you can repoint without re-sharing the link. Cap: 20 aliases per user.
# Create or repoint an alias. Accepts a bare token or a /s/:token URL.
tb-lf share alias set weekly-report <token>
tb-lf share alias set weekly-report https://devportal.productive.io/s/<token>
# List your aliases (includes per-alias Views count).
tb-lf share alias list
# Delete by slug.
tb-lf share alias rm weekly-report
Slug rules (mirrored from the server): lowercase letters, digits, and hyphens; 1–64 chars; cannot start or end with a hyphen; no consecutive hyphens. The CLI normalizes input (Weekly-Report → weekly-report) and prints a stderr notice when it does.
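The slug rules map onto a single regular expression plus a length check. In this Python sketch, normalize_slug is hypothetical, and normalization is assumed to be plain lowercasing, the only transformation the Weekly-Report example shows; the server remains the authority on validity.

```python
import re
import sys

# Lowercase alphanumeric runs joined by single hyphens: one pattern forbids
# leading, trailing, and consecutive hyphens.
SLUG_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def normalize_slug(raw: str) -> str:
    """Hypothetical sketch: lowercase the input (assumed to be the only
    normalization), note any change on stderr, then validate."""
    slug = raw.lower()
    if slug != raw:
        print(f"note: normalized slug {raw!r} -> {slug!r}", file=sys.stderr)
    if not 1 <= len(slug) <= 64 or not SLUG_RE.fullmatch(slug):
        raise ValueError(f"invalid slug: {slug!r}")
    return slug


print(normalize_slug("Weekly-Report"))  # weekly-report (plus a stderr note)
```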
Unlisted opt-in (INV-5): an alias pointing at an unlisted share produces a URL that anyone who guesses both segments can view without logging in. On a TTY, alias set prompts [y/N] before creating an alias for, or repointing an alias to, an unlisted target. On a non-TTY (CI, pipes), pass --force to confirm non-interactively; without it, the command exits non-zero.
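The TTY/--force gate can be sketched as a small function. Everything here is an assumption for illustration: confirm_exposure and main are hypothetical names, a TTY is detected with sys.stdin.isatty(), the prompt wording is invented, --force is assumed to also skip the prompt on a TTY, and refusal maps to exit code 1 (the documented behavior is only "non-zero"). The private → unlisted visibility flip follows a similar TTY/--force pattern.

```python
import sys


def confirm_exposure(target_visibility: str, force: bool, *, is_tty=None, ask=input) -> bool:
    """Hypothetical gate: return True if the alias action may proceed.

    Only unlisted targets need consent. On a TTY the user is asked [y/N];
    on a non-TTY (CI, pipes) only --force counts as consent.
    """
    # Assumption: --force also skips the prompt on a TTY.
    if target_visibility != "unlisted" or force:
        return True
    if is_tty is None:
        is_tty = sys.stdin.isatty()
    if not is_tty:
        print("refusing: target share is unlisted; pass --force to confirm", file=sys.stderr)
        return False
    # Invented prompt text; an empty answer means No.
    answer = ask("Target share is unlisted; anyone with the URL can view it. Continue? [y/N] ")
    return answer.strip().lower() in ("y", "yes")


def main(target_visibility: str, force: bool) -> int:
    # Non-zero exit when consent is missing (1 is an assumed value).
    return 0 if confirm_exposure(target_visibility, force) else 1
```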
Run tb-lf prime for an overview of available projects, quick commands, and metric interpretation guidance.
Use tb-lf <command> --help for detailed command usage.
Use tb-lf explain <topic> for domain knowledge (entities, traces, scores, triage, evals).
!tb-lf prime