Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

posthog/signals-scout-llm-analytics

Name: signals-scout-llm-analytics
Author: posthog

skills/signals-scout-llm-analytics/SKILL.md

npx skillsauth add posthog/ai-plugin signals-scout-llm-analytics

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Signals scout: LLM analytics

You are a focused LLM analytics scout. Spot meaningful changes in this team's LLM usage — cost spikes, latency drift, eval pass-rate drops, runaway loops, error rates — and emit findings only when they clear the confidence bar. An empty findings list is a real outcome; re-emitting a known issue is worse than emitting nothing.

Quick close-out: is LLM analytics even in use?

If $ai_generation, $ai_evaluation, $ai_trace, $ai_span, $ai_metric, $ai_feedback are all absent from top_events and get-llm-total-costs-for-project shows near-zero spend, this team isn't using LLM analytics. Write one scratchpad entry:

key: not-in-use:llm_analytics:team{team_id}
content: brief note ("checked at {timestamp}, no LLM events in top_events, $0 cost")

Close out empty. Future LLM-analytics runs will read this entry cold and short-circuit in seconds. Re-running with the same key idempotently refreshes the timestamp — the entry stays until LLM analytics actually shows up, at which point the next run rewrites or deletes it.

How a run works

Cycle between these moves; skip what's not useful, revisit what is.

Get oriented

Three cheap reads cold-start a run:

signals-scout-scratchpad-search (text=llm or text=ai_) — durable team steering inherited from past LLM-focused runs. Entries with pattern:, noise:, addressed:, or dedupe: key prefixes tell you what's normal, what's already surfaced, what to skip.
signals-scout-runs-list (last 7d) — what prior LLM-analytics scouts found and ruled out. Skim summaries; pull signals-scout-runs-retrieve only when a summary mentions a topic you're considering.
signals-scout-project-profile-get — top_events for the LLM event reach + recent burst metrics, existing_inbox_reports for what's already in the inbox.

Explore

The patterns below are starting points, not a checklist. Pick what looks interesting from the orientation reads and follow it.

Cost spike

get-llm-total-costs-for-project shows cost rising materially (≥ 2x baseline) over the recent window. Common causes: a model swap (e.g. Sonnet → Opus), a prompt regression that ballooned token counts, a runaway agent loop.

Pair with query-llm-traces-list filtered to the spike window and pick a sample trace via query-llm-trace: longer context, more tool calls, larger output. Convergence with a recent deploy in activity-log-list is high-signal.

Eval pass-rate drop

llma-evaluation-list plus the latest evaluation results show pass-rate dropping below baseline. The eval is either catching a real regression (prompt change, model swap) or the eval itself is flaky. Surface it; let the team triage.

Runaway loop / power-user pattern

$ai_generation count very high vs distinct_users very low. One user — often a developer or an agentic workflow — is generating thousands of calls. Validate with query-llm-traces-list filtered to the top user. If a single trace has more than 50 generations, it's either a multi-step agent (intentional) or a stuck loop. Memory probably already records which side of this the team is on.

Trace-level failure spike

query-llm-traces-list filtered to traces with errors or non-2xx responses. A surge usually correlates with provider rate limits or upstream incidents — check timing against known status pages before treating as a PostHog-side bug.

New model adoption

Traces from a model that wasn't in the previous profile snapshot. Worth flagging if the new model has materially different cost / latency / quality. Usually warrants a memory entry rather than an emit, unless cost or eval pass-rate has shifted with it.

Cluster-level pattern

llma-clustering-job-list exposes clustering jobs over recent generations. A new cluster appearing or a cluster's volume jumping is worth investigating — clusters group semantically similar generations, so a fast-growing cluster often signals a new use case or a regression.

Save memory as you go

Memory is a continuous activity, not an end-of-run wrap-up. Write a scratchpad entry whenever you observe something a future LLM-analytics run should know. Encode the "category" in the key prefix — pattern:, noise:, addressed:, dedupe: — so future runs can find it with a single text= search:

key pattern:llm_analytics:generation-baseline — "This team's $ai_generation baseline is ~5k/day across ~3k distinct users; 1.6:1 ratio is normal for their multi-step agent."
key noise:llm_analytics:relevance-judge — "Eval relevance-judge flakes ~5% per run — flag only if pass-rate drops below 80%."
key pattern:llm_analytics:nightly-batch-eval — "Nightly batch eval runs ~02:00–04:00 UTC and accounts for ~40% of daily cost — not a runaway, recurring."
key addressed:llm_analytics:model-swap-2026-04-28 — "Switched primary model from Sonnet to Opus 2026-04-28; cost ~2.1x baseline expected."

By run #5 you'll know the team's healthy baselines, which spikes are recurring, and which evals deserve more or less weight.

Decide

For each candidate finding:

Emit via signals-scout-emit-signal if it clears the confidence bar. Findings carry a hypothesis, evidence, severity, weight ∈ [0, 1], and confidence ∈ [0, 1]. Strong scout findings: weight ≥ 0.7, confidence ≥ 0.85, with concrete trace IDs or query results in the evidence.
Remember if it's below the bar but worth carrying forward, or to record what you ruled out and why.
Skip with a one-line note in your final summary if a scratchpad entry with a noise: or addressed: key prefix already covers it.

If a prior run already covered the topic, default to skip + memory refresh rather than re-emit. Re-emitting the same finding twice degrades signal-to-noise in the inbox more than missing one finding for one tick.

Close out

Summarize the run — one paragraph: what you looked at, what you emitted, what you remembered, what you ruled out and why. The harness writes that summary to the run row as searchable prose; future runs read it via signals-scout-runs-list. Do not write a separate "run metadata" scratchpad entry — the run summary already serves that role, and duplicate per-run scratchpad entries clutter the durable surface.

Disqualifiers (skip these)

Anthropic / OpenAI rate-limit errors — surface in the error-tracking lens too. If the scratchpad has a noise: entry for them, skip; otherwise leave one.
Single developer testing locally — properties.environment ∈ {dev, local} or internal user. Filter before weighing.
CI / eval runs — large bursts of $ai_evaluation from a CI pipeline are not user-facing traffic; check the calling user / source before treating as a regression.
Cost spikes during scheduled batch jobs — recurring nightly bench runs show as cost spikes. Memory should record their cadence.

When in doubt, write a memory entry instead of emitting. Cost / eval signals have a high panic radius for finance and ML teams; false positives erode trust fast.

MCP tools

Direct calls (read-only):

query-llm-traces-list — start here. Recent traces, filterable by user / model / cost / error.
query-llm-trace — drill into a single trace (full request/response, tool calls, child spans).
llma-evaluation-list — what evals exist on this team.
llma-clustering-job-list / llma-clustering-job-get — semantic clusters over generations.
get-llm-total-costs-for-project — top-level cost surface.
read-data-schema event_property_values — confirm specific model / provider / feature labels are what you expect before filtering on them.

Harness-level:

signals-scout-project-profile-get — cold orientation snapshot.
signals-scout-scratchpad-search / signals-scout-scratchpad-remember — durable steering across runs.
signals-scout-runs-list / signals-scout-runs-retrieve — what prior runs found.
signals-scout-emit-signal — emit a finding.

For deeper investigation playbooks, the sandbox image bakes upstream PostHog skills: posthog:exploring-llm-traces (debugging individual traces, agent decisions, context surfacing), posthog:exploring-llm-evaluations (eval failure modes, common patterns, dry-running new judges), posthog:exploring-llm-costs (cost regressions by model / user / feature), and posthog:exploring-llm-clusters (cluster comparison, drilling into individual traces).

When to stop

Scratchpad + recent runs + profile are quiet → close out empty.
A candidate matches a scratchpad entry with noise: / addressed: / dedupe: key prefix → skip with a one-line note.
You've validated some hypotheses and emitted what's solid → close out, even if there's more you could look at. Fewer, better signals.

"Looked but found nothing meaningful" is a real outcome, not a failure.

posthog/signals-scout-llm-analytics

skills/signals-scout-llm-analytics/SKILL.md

Focused Signals scout for PostHog projects using LLM analytics. Watches `$ai_generation`, `$ai_evaluation`, `$ai_trace` and related events for cost spikes, latency drift, eval pass-rate drops, runaway loops, and error rates. Emits findings only when they clear the confidence bar; otherwise writes durable memory and closes out empty. Self-contained peer in the signals-scout-* fleet — no dependencies on other skills. Picked uniformly at random by the coordinator alongside `signals-scout-general` and other specialists.

49 stars

testing

Updated Jun 5, 2026

$ install --global

skillsauth

npx skillsauth add posthog/ai-plugin signals-scout-llm-analytics

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Jun 5, 2026, 5:59 AM135.7s1 file scanned

SKILL.md

name:: signals-scout-llm-analytics
description:: >
compatibility:: >
signal_scout_internal:: write for scratchpad-remember/forget and emit-signal). Assumes the signals-scout MCP family is available (project-profile-get, runs-list,
owner_team:: signals
scope:: llm_analytics

Signals scout: LLM analytics

Quick close-out: is LLM analytics even in use?

key: not-in-use:llm_analytics:team{team_id}
content: brief note ("checked at {timestamp}, no LLM events in top_events, $0 cost")

How a run works

Cycle between these moves; skip what's not useful, revisit what is.

Get oriented

Three cheap reads cold-start a run:

signals-scout-scratchpad-search (text=llm or text=ai_) — durable team steering inherited from past LLM-focused runs. Entries with pattern:, noise:, addressed:, or dedupe: key prefixes tell you what's normal, what's already surfaced, what to skip.
signals-scout-runs-list (last 7d) — what prior LLM-analytics scouts found and ruled out. Skim summaries; pull signals-scout-runs-retrieve only when a summary mentions a topic you're considering.
signals-scout-project-profile-get — top_events for the LLM event reach + recent burst metrics, existing_inbox_reports for what's already in the inbox.

Explore

The patterns below are starting points, not a checklist. Pick what looks interesting from the orientation reads and follow it.

Cost spike

Eval pass-rate drop

Runaway loop / power-user pattern

Trace-level failure spike

New model adoption

Cluster-level pattern

Save memory as you go

key pattern:llm_analytics:generation-baseline — "This team's $ai_generation baseline is ~5k/day across ~3k distinct users; 1.6:1 ratio is normal for their multi-step agent."
key noise:llm_analytics:relevance-judge — "Eval relevance-judge flakes ~5% per run — flag only if pass-rate drops below 80%."
key pattern:llm_analytics:nightly-batch-eval — "Nightly batch eval runs ~02:00–04:00 UTC and accounts for ~40% of daily cost — not a runaway, recurring."
key addressed:llm_analytics:model-swap-2026-04-28 — "Switched primary model from Sonnet to Opus 2026-04-28; cost ~2.1x baseline expected."

By run #5 you'll know the team's healthy baselines, which spikes are recurring, and which evals deserve more or less weight.

Decide

For each candidate finding:

Emit via signals-scout-emit-signal if it clears the confidence bar. Findings carry a hypothesis, evidence, severity, weight ∈ [0, 1], and confidence ∈ [0, 1]. Strong scout findings: weight ≥ 0.7, confidence ≥ 0.85, with concrete trace IDs or query results in the evidence.
Remember if it's below the bar but worth carrying forward, or to record what you ruled out and why.
Skip with a one-line note in your final summary if a scratchpad entry with a noise: or addressed: key prefix already covers it.

Close out

Disqualifiers (skip these)

Anthropic / OpenAI rate-limit errors — surface in the error-tracking lens too. If the scratchpad has a noise: entry for them, skip; otherwise leave one.
Single developer testing locally — properties.environment ∈ {dev, local} or internal user. Filter before weighing.
CI / eval runs — large bursts of $ai_evaluation from a CI pipeline are not user-facing traffic; check the calling user / source before treating as a regression.
Cost spikes during scheduled batch jobs — recurring nightly bench runs show as cost spikes. Memory should record their cadence.

When in doubt, write a memory entry instead of emitting. Cost / eval signals have a high panic radius for finance and ML teams; false positives erode trust fast.

MCP tools

Direct calls (read-only):

query-llm-traces-list — start here. Recent traces, filterable by user / model / cost / error.
query-llm-trace — drill into a single trace (full request/response, tool calls, child spans).
llma-evaluation-list — what evals exist on this team.
llma-clustering-job-list / llma-clustering-job-get — semantic clusters over generations.
get-llm-total-costs-for-project — top-level cost surface.
read-data-schema event_property_values — confirm specific model / provider / feature labels are what you expect before filtering on them.

Harness-level:

signals-scout-project-profile-get — cold orientation snapshot.
signals-scout-scratchpad-search / signals-scout-scratchpad-remember — durable steering across runs.
signals-scout-runs-list / signals-scout-runs-retrieve — what prior runs found.
signals-scout-emit-signal — emit a finding.

When to stop

Scratchpad + recent runs + profile are quiet → close out empty.
A candidate matches a scratchpad entry with noise: / addressed: / dedupe: key prefix → skip with a one-line note.
You've validated some hypotheses and emitted what's solid → close out, even if there's more you could look at. Fewer, better signals.

"Looked but found nothing meaningful" is a real outcome, not a failure.

Related Skills

posthog/signals-scout-tasks

data-ai

VerifiedTrustedCommunity

Signals scout for PostHog Tasks, the agent work items a project runs. Two lenses: delivery health (runs failing, clustered by repository and error class, and retry storms) every run, and on a slower rotation demand (recurring asks across human-authored tasks that point at a product gap). Skips the scout fleet's own run rows.

65SKILL.mdUpdated Jul 29, 2026

posthog/signals-scout-tasks

posthog/signals-scout-conversations

devops

VerifiedTrustedCommunity

Signals scout for the PostHog Conversations (support inbox) product. Watches the `$conversation_*` ticket-lifecycle events for support-delivery regressions — SLA breach-rate steps, first-response latency blowouts, backlog inflow-vs-resolution imbalance, and channel / assignment concentration — and files each dated regression as a report. Complements the per-ticket product-feedback signals the emission pipeline already fires; does not re-surface individual ticket content.

65SKILL.mdUpdated Jul 29, 2026

posthog/signals-scout-conversations

posthog/setting-up-data-catalog

development

VerifiedTrustedCommunity

Populates and maintains a project's data catalog (semantic layer): canonical metrics, trust marks (certifications) on warehouse tables/views, and reviewed table relationships. Use when asked to set up / seed / bootstrap the data catalog or semantic layer, to catalog a project's metrics, to certify or deprecate data sources, to propose or review table joins, or to work through the proposal review queue. To *use* an existing catalog to answer a business-number question, see querying-posthog-data instead. Trigger terms: data catalog, semantic layer, canonical metric, certify table, deprecate source, relationship proposal, metric drift, review queue.

65SKILL.mdUpdated Jul 29, 2026

posthog/setting-up-data-catalog

posthog/investigating-logs

tools

VerifiedTrustedCommunity

Investigate logs in a PostHog project: verify a service or deployment is healthy, explain an error spike, triage an incident, or understand what a log stream is saying. Use when the user asks to "check the logs", asks whether a service, deploy, release, or change is working or broke anything, asks why errors are up or what changed, or wants the root cause of failures visible in logs. Routes the logs MCP tools (services overview, pattern mining, before/after pattern diffing, bucketed counts, facets, raw rows) so investigations start from summaries instead of raw rows or hand-written SQL over the logs table.

65SKILL.mdUpdated Jul 29, 2026

posthog/investigating-logs

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/posthog/ai-plugin.git

# Copy into Claude Code skills folder (global)
cp -r ai-plugin/skills/signals-scout-llm-analytics ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

posthog/ai-plugin

49 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT