skills/diagnosing-endpoint-performance/SKILL.md
Diagnose why a PostHog endpoint is slow or expensive and propose a concrete fix — bump the cache TTL, enable materialisation, restructure variables, or rewrite the query. Use when the user says "this endpoint is slow", "my endpoint times out", "we're hitting the cost cap on this one", or asks "should I materialise this?". Focuses on a single named endpoint, not a project-wide audit.
npx skillsauth add posthog/ai-plugin diagnosing-endpoint-performanceInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
This skill walks through a specific endpoint that is slow, expensive, or unreliable, and produces
a concrete recommendation. It is the deep-dive counterpart to auditing-endpoints (which finds
candidates).
auditing-endpoints as a failing materialisation or expensive callerIf the question is project-wide ("what should I clean up?"), use auditing-endpoints first.
| Tool | Purpose |
| ----------------------------------- | ---------------------------------------------------------------------------------------------- |
| endpoint-get | Full endpoint config: query, current version, data_freshness_seconds, materialisation status |
| endpoint-versions | History of every version (query + materialisation state); which version is current |
| endpoint-materialization-status | Whether materialisation is eligible, current state, last run, last error |
| endpoints-materialization-preview | What the materialised query would look like, plus the rejection reason if ineligible |
| endpoints-last-execution-times | When was it last called (endpoint-level sanity-check that it is in active use) |
| execute-sql | Query query_log for endpoint-level call frequency and per-call duration/bytes |
When deciding what to recommend, walk these in order — the first one that applies is the cheapest fix.
Fetch the endpoint and look at data_freshness_seconds (it sets both the cache TTL and, when
materialised, the refresh cadence). If the user's traffic
calls the same parameters repeatedly within that window, every call after the first is a cache
hit and effectively free.
data_freshness_seconds is an
enum: 900, 1800, 3600, 21600, 43200, 86400, 604800 — there is no sub-15-minute value.)The shape of the variables matters here: if every call passes different user_id or date_from
values, the cache has many distinct keys and a higher TTL helps less. If almost every call uses
the same handful of parameter combinations, the cache helps a lot.
Materialisation pre-computes the query into a saved view that's refreshed on a schedule. Reads become near-instant — at the cost of staleness equal to the refresh interval, plus storage and compute for the materialisation itself.
Call endpoints-materialization-preview. The response tells you:
When materialisation is enabled, callers must pass all materialised variables — calls without them are rejected (security: prevents returning unfiltered data). Pair the recommendation with a note about which variables become required.
If the endpoint isn't eligible for materialisation, the rejection reason from
endpoints-materialization-preview is usually the lead:
JOIN plus a variable filter is rejected for
materialisation, because applying the variable changes the joined row cardinality and silently
produces wrong results (e.g. LEFT JOIN non-matches lose the variable column). Restructure so the
variable filters a single table — push the filter into a subquery/CTE that's then joined, rather
than filtering across the join. This is the most common "looks fine but won't materialise" trap.date_from, lookback_days).* / non-deterministic functions → narrow the columns selected, replace
now() / today() with a variable when possible.Check endpoint-versions to see whether the query was recently changed. Often the regression
came from a specific commit and reverting that version is faster than rewriting.
Only the latest version runs by default; older versions run only when a caller pins ?version=N.
So the version to tune is almost always the current one — unless a pinned older version is the
culprit. Call endpoint-versions and read each version's last_executed_at to see which versions
have been hit recently; a materialised version with a null or long-stale last_executed_at is a
candidate to unmaterialise or delete rather than tune (confirm first — that signal only counts
API-key runs and can be sparse).
For endpoint-level call frequency and per-call cost, query query_log with execute-sql — it
carries query_duration_ms, read_rows, and read_bytes, handy for confirming how heavy the
endpoint's calls actually are:
SELECT count() AS calls, max(query_start_time) AS last_called, avg(query_duration_ms) AS avg_ms
FROM query_log
WHERE name = '<endpoint_name>' AND endpoint LIKE '%/endpoints/%' AND is_personal_api_key_request
/api/projects/{team_id}/endpoints/{name}/run.endpoint-get to fetch full config. Note: data_freshness_seconds, is_materialized,
current version, query kind.endpoint-materialization-status to see materialisation health (failed? eligible?).endpoint-versions to see the query history and which version is current (the one callers
hit unless they pin ?version=N).User: "monthly_active_users is timing out for some callers"
Agent steps:
- endpoint-get monthly_active_users → HogQL endpoint, data_freshness_seconds=900,
is_materialized=false, current version v4
- endpoint-materialization-status monthly_active_users → not materialised; eligible: true
- endpoint-versions monthly_active_users
→ v4 is current; v1–v3 are prior query revisions (callers hit v4 unless pinned)
- endpoints-materialization-preview → eligible, clean transform with one range pair
- "Three things stand out:
1. Cache TTL is 900 seconds (the 15-min floor), but the data is
aggregating a month — that aggressive a refresh probably isn't
necessary. Bumping to 1 hour (3600s) would dramatically reduce
ClickHouse load.
2. It's eligible for materialisation with a clean transform. If
callers can tolerate 5-15 minute staleness, materialisation
would make reads near-instant.
3. v4 is the current version, so callers hit it by default — the
tuning target is clear.
Quickest win: bump TTL to 3600s. Bigger win: materialise.
Which trade-off matches your needs? If you need real-time freshness,
neither helps and we'd need to rewrite the query — likely narrowing
the aggregation window."
endpoint-get's last_executed_at is endpoint-level recency;
endpoint-versions gives each version's own last_executed_at; query_log (via execute-sql)
gives endpoint-level call frequency and per-call cost. All count only personal-API-key calls, and
per-version recency can be sparse — confirm with the user before calling a version dead.agent-feedback.tools
Focused Signals scout for PostHog projects with web traffic. Watches the acquisition and site-health layer the web analytics product reports on: per-channel session volume diverging from the site's own rhythm (an acquisition source silently collapsing or surging), attribution breakage (paid/campaign traffic reclassifying into Direct or Unknown when tagging breaks), landing pages that break (bounce-rate steps, 404 spikes, entry-path cliffs), and page-performance regressions (web vitals p75 steps). Emits findings only when they clear the confidence bar; otherwise writes durable memory and closes out empty. Self-contained peer in the signals-scout-* fleet.
tools
Focused Signals scout for PostHog projects using session replay. Watches two promises the replay product makes: that sessions are actually being recorded (capture integrity — recording volume vanishing while site traffic doesn't), and that the friction evidence inside recordings gets seen (rage-click / dead-click clusters concentrating on a page or element, error-after-interaction cohorts, recurring replay vision themes nobody aggregates). Emits findings only when they clear the confidence bar; otherwise writes durable memory and closes out empty. Self-contained peer in the signals-scout-* fleet.
tools
Focused Signals scout for PostHog setup health. Reads the project's active health issues — the deterministic findings of PostHog's own health checks (no live events, outdated SDKs, missing reverse proxy, absent web vitals, ingestion warnings, failing data-warehouse models, and more) — and decides which are genuinely worth surfacing. Unlike a one-signal-per-issue push, it bundles kind-clusters into a single finding, weights by real blast radius (cross-referencing actual event volume and reach), and prioritizes issues an agent can resolve via the MCP. Emits only above the confidence bar; otherwise writes durable memory and closes out empty. Self-contained peer in the signals-scout-* fleet — no dependencies on other skills.
tools
Focused Signals scout for PostHog projects using feature flags. Watches the flag roster and the `$feature_flag_called` evaluation stream for contradictions between a flag's configured state and its real traffic: evaluation cliffs on healthy flags, ghost flags (code calling keys that no longer exist), response-distribution shifts with no corresponding flag edit, and flag debt (stale, fully-rolled-out, or dead flags still burning evaluations). Emits findings only when they clear the confidence bar; otherwise writes durable memory and closes out empty. Self-contained peer in the signals-scout-* fleet — no dependencies on other skills.