altinity-profiler-clickhouse/SKILL.md
Profile a ClickHouse cluster via MCP and emit a per-cluster "analyst" Skill the user can save in claude.ai. Activate when the user asks to "profile this ClickHouse", "generate an analyst skill", "build a schema guide", "map the data in this cluster", or regenerate an existing cluster-analyst Skill after schema changes. Works against any ClickHouse with read-only SELECT/SHOW/DESCRIBE access via an `execute_query` MCP tool (e.g. the Altinity MCP server). Outputs a 5-file markdown bundle plus a README.
npx skillsauth add altinity/skills altinity-profiler-clickhouseInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Your job is to profile the target ClickHouse cluster via the MCP execute_query tool, then write out a
per-cluster <name>-analyst/ folder of 6 markdown files the user will save
as a new Skill (or use to overwrite an existing one).
The artifact is a knowledge base about the cluster — schema, engine
semantics, tenancy model, ingestion graph, observed access patterns,
declared vs. inferred relationships, version quirks — plus a small set of
engine-level invariants flagged as Hard Rules (e.g. "FINAL on AggMT is
the read idiom", "filter on (categoryid, brandid) first"). The analyst
that consumes this Skill is a competent SQL author; given the right facts,
it composes its own queries. Do not write canned SQL recipes for question
classes. Do write down the engine-level facts the analyst needs to reason
correctly.
A useful test: if a fact disappears from the artifact, would a competent analyst fail to write a correct query? If yes, the fact belongs. If no (it's just one of N plausible SQL shapes for the same question), it doesn't.
The profiler emits an artifact saved as a Skill and consumed later. Bad information compounds. Therefore: every fact in the artifact must either be verified by a cluster round-trip at profile time, or marked as inferred / unverified. No exceptions. Three claim types, three verification mechanics:
Existence claims — table X exists; column Y exists on table X; MV
Z feeds table X. Verified by system.tables / system.columns /
system.tables.create_table_query. Cheap. Hard precondition: a
table or column name does not appear in the artifact unless it was
just queried back from system. A fact that fails existence
verification is dropped, not demoted.
Behavior claims — this predicate prunes the primary index; FINAL
on this table amplifies reads ~Nx; this skip index is used; this
query shape is cheap on the captured tenant. Verified by EXPLAIN indexes=1 and a representative SELECT … SETTINGS log_queries=1
followed by a system.query_log lookup keyed on a unique comment
token. Behavior claims include the captured numbers (read_rows,
granule prune ratio, latency) inline. A behavior claim with no
captured measurement must be demoted to "inferred from schema".
Relationship claims — table X joins to table Y on key K. Almost
always inferred on a CH cluster (no real FKs). Profiler verifies (a)
columns exist on both sides, (b) cardinality on a real tenant slice
(matched, distinct_left/right, avg_fanout), and records the
probe artifact alongside the claim. Every relationship claim is
marked [inferred] regardless of probe outcome — the probe answers
"does this join behave 1:1 on the captured tenant", not "is this an
FK". The analyst is told to re-probe on its own tenant slice.
Demote, do not hard-fail. If a behavior or relationship claim cannot be
verified (parser quirk, no representative tenant available, MCP
timeout), keep the claim in the artifact under an explicit
inferred: / unverified: <reason> marker. The exception is existence
claims for non-existent tables/columns: those are dropped — a
non-existent referent in the KB is never useful.
A folder <cluster-name>-analyst/ containing:
SKILL.md — always-loaded entry; at-a-glance, top-15 tables, join map,
engine idioms, staleness. ~2-3k words.
catalog.md — per-table cards for ~15-25 analyst-hot tables. ~2-4k words.
patterns.md — query-writing priors per hot fact (knowledge, not recipes).
~1-2k words.
pipeline.md — demoted infra tables, ingestion shape, service users. ~1-2k words.
glossary.md — naming, users, CH quirks, term decoder. ~0.8-1.5k words.
README.md — meta: source, decisions, limitations, how to regenerate,
verification log (what was checked vs. inferred).
Target: ~5k–10k words total (user may pick "concise" or "full" via questionnaire). The user saves this folder as a Skill (claude.ai Settings → Capabilities, or their agent's skills directory).
OPTIMIZE, ALTER, ATTACH, etc.SELECT, SHOW, DESCRIBE, EXPLAIN. Never
CREATE, ALTER, DROP, INSERT, TRUNCATE, OPTIMIZE, or any
statement that writes. If the MCP tool allows writes, refuse to use
that path for this Skill.query_log
must pass through normalizeQuery() AND have /* ... */ comments
stripped via replaceRegexpAll(query, '/\\*.*?\\*/', '') before any
of its text enters the artifact. Never emit literal values, emails,
IPs, user IDs, or comment metadata. Use <tenant>, <from>, <to>,
<id> placeholders in example SQL.system.dictionaries), if a column doesn't exist in this CH version
(e.g. as_select on CH < 22), or if the query log has no business
traffic — say so explicitly. Do not invent data or infer beyond what
the observations support.Dim.X as
ReplicatedReplacingMergeTree is NOT a Dictionary. The engine column
in system.tables is authoritative; DB naming is a hint only.engine_full, never from name
patterns. _local / _d / dist_ / no-suffix are all cluster-specific
conventions. Parse Distributed('<cluster>','<db>','<target>',<shard>)
in engine_full to map pairs.archetypes/<primary>.md after Phase 1.5) is REQUIRED in
patterns.md — it tells the analyst the cluster's overall shape
before any per-fact card. Engine traps in the artifact come from
the primary archetype's module first, then any secondaries' trap
lists merged in.inferred from schema; relationship → mark [inferred]
and attach the cardinality probe). Hallucinated identifiers
(table/column names that don't exist) are the most damaging failure
mode — they survive review wrapped in plausible-looking SQL, then
break at consumer query time. The existence check is cheap; run it
for every name.verified 2026-04-25; read_rows=734,350; query_id=…). Reviewers can
tell what was actually tested. The analyst, at read time, sees the
same evidence and can judge whether today's tenant matches the
captured shape. Use inferred: for schema-only inference and
unverified: <reason> when a planned verification could not run.Code: 43, etc.), no
finalizeAggregation / *Merge internals, no archetype-trap
explanations beyond a one-sentence cluster-application (full trap
prose lives in the profiler's archetypes/<id>.md modules and is
NOT copied into per-cluster bundles). Each rule, probe, and
identifier names one source-of-truth file in the bundle; other
files cross-reference rather than restate. Files other than
SKILL.md are loaded lazily — restating a rule across files inflates
context whenever the analyst question pulls multiple files.<cluster>-analyst Skill
(e.g. after a major schema change).Phase 0 · Connect and detect shape → CH version, query_log shape,
Distributed naming
Phase 1 · Discovery (cheap, 5 queries) → cluster topology, DB roster,
engine mix, qlog span
Phase 1.5 · Archetype detect → assign primary archetype + 0..N
secondaries from Phase 1d/1e
counts; load
`archetypes/<primary>.md` for
phase emphasis
Phase 2 · Questionnaire (1 round) → DB scope, window, sandboxes,
cluster name, verification tenant
Phase 3 · Catalog → per-table metadata for business DBs
Phase 4 · Relations → dependency graph + Dictionaries
Phase 5 · Query mining → top-50 hot, co-occurrence, hot
columns, representative queries.
Phase 5d.1 form-mining passes
archetypes to the helper for
corpus-override application.
Phase 6 · Demotion → split analyst-hot vs pipeline.
Pareto helper takes archetype
for shape-aware thresholds.
Phase 7 · Classification → Dim/Fact/Mart/Staging/Other
(confident-only)
Phase 7.5 · Verification → existence checks for every named
identifier; EXPLAIN + log_queries
probe for every behavior claim;
cardinality probe for every
inferred relationship. Captures
numbers that get embedded in the
artifact at phase 8.
Phase 8 · Synthesis → write the 6 output files, leading
with the primary archetype's
signature card; merge engine
traps from primary + secondaries;
each claim either carries a
verification record or is marked
`inferred` / `unverified`.
Detailed SQL for each phase is in pipeline.md (read when you reach phase
execution). Output templates are in templates.md (read at phase 8).
Edge-case gotchas are in edge-cases.md (read when you hit one).
Archetype modules are in archetypes/ — load archetypes/README.md
once at Phase 1.5 to evaluate the routing rules, then load the
selected primary's full module (and any secondaries' Engine traps +
Corpus-override list) at Phase 8.
Work sequentially through phases. Don't skip ahead. After phase 1, run phase 1.5 (archetype routing) immediately — it has no SQL of its own and uses Phase 1d/1e counts. Mention the assigned primary archetype to the user as part of the phase-1.5 progress note (it shapes the artifact). Then pause for phase 2 (questionnaire); after phase 7, pause to confirm the classification summary before synthesizing. Between phases, emit short user-facing progress notes — the user wants to see what you found.
At the end of each phase, emit one concise line to the user:
Phase 0: CH <version>, <N>-shard cluster, query_log is <raw|pre-aggregated>.Phase 1: <N> business DBs, <M> tables, top engines: <list>. Query log <range>, <rows> rows.Phase 1.5: archetype <primary> (rule <K>: <one-line reason>); secondaries <list-or-none>.Phase 2: ... (asking questions)Phase 3: cataloged <N> tables.Phase 4: <N> dependencies, <M> dictionaries.Phase 5: top-50 Pareto mined; <K> multi-table queries; <J> tables with hot-column signal.Phase 6: <X> analyst-hot, <Y> demoted to pipeline.Phase 7: <A> Fact, <B> Dim, <C> Mart, <D> Staging, <E> Other.Phase 8: drafting SKILL.md ... then catalog, patterns, pipeline, glossary, README.Keep these to one line per phase. User watches progress; don't bury them in SQL output dumps.
Use the MCP execute_query tool (or equivalent read-only ClickHouse SQL
tool). On transient errors (network reset, SSL drop, timeout):
Never try to open a different connection path. The MCP tool is the only allowed channel.
If the user says "regenerate" or "update the <name>-analyst Skill" and such a Skill is loaded in the current session:
SKILL.md frontmatter to get the prior
cluster_fingerprint and generated_at.If no prior Skill is loaded, regeneration is indistinguishable from first-run.
You need a kebab-case cluster name for the output folder. Pick via this fallback chain:
SELECT cluster, count() FROM system.clusters GROUP BY cluster —
pick the distinctive one, skipping default, all-*, test_*,
parallel_replicas, replicated, clickhouse, local, prod
(these are generic macros, not business identifiers).engine_full of
Kafka engines (e.g., prod-noncde-kafka.razorpay.com → razorpay).hockeystack,
divinity).Normalize: lowercase, ASCII, kebab-case. Strip _cluster, _replicated,
_sharded suffixes. Append -analyst. If the name collides with a
Skill already loaded in the session, append -2, -3, etc.
Before any mining, detect three axes of variation. See pipeline.md
(Phase 0) for the exact SQL:
as_select column vs create_table_query regex
fallback).any(...) / count() / sum(query_duration_ms), the log is
pre-aggregated (audit corpus). If columns are flat (tables,
columns, query_kind), it's raw (live cluster).engine_full of Distributed engines
to discover the local-naming pattern and shard keys.Record these in a "shape profile" you'll reference throughout. Emit them to the user in the phase-0 progress note.
After phase 1, present a compact preview and ask ≤5 targeted questions. Sample preview:
Phase 1 summary:
- ClickHouse <version>, cluster <cluster-name>, <topology>
- <N> business DBs: <top-5 by table count>
- <M> user sandboxes candidates (heuristic): <list-if-any>
- Query log: <from>..<to>, <rows> rows
- Engine mix: <top-5 by count>
Then ask (combine as few questions as possible):
system, information_schema, _temporary_and_external_tables."
This answer feeds Phase 3's database IN (:included) filter and
Phase 5a's workload-shape window — narrowing here saves the most
downstream work. Show the table-count column so the user can see
where the volume actually lives.<list> as service accounts?"<candidate>-analyst. Override?"read_rows, granule prune, and join fanout
numbers. Which tenant slice should I use?" — pick a tenant with
non-trivial data volume in the last 7 days. State that the probes
are read-only SELECT queries with LIMIT clauses; no modification.If the user declines a verification tenant, behavior and relationship
claims will all be marked inferred from schema rather than carrying
captured numbers — proceed but warn that the artifact's evidence layer
will be thinner.
Do not ask blind. Every question must reference concrete findings from phase 1. If phase 1 is entirely unambiguous on a given dimension, skip that question and state your assumption in the phase-2 progress note.
A good artifact:
system.tables/system.columns (no hallucinations);
every behavior claim carries a captured measurement (read_rows,
granules_prune_pct, query_id, verified_at) or an inferred from schema marker; every JOIN is [inferred] with its cardinality probe.system.dictionaries,
dependencies_* sparsely populated, etc.).A bad artifact:
system.dictionaries was
empty.spatialrss.authors when the real table is
mstuserinformation_new — the worst failure mode, surviving to
consumer-query time as silent errors), asserts a JOIN without an
[inferred] marker, or makes a behavior claim with neither a captured
measurement nor an inferred from schema marker.patterns.md with "patterns inferred from schema; no
observed workload" caveat prominently.DESCRIBE system.query_log fails: the cluster may have query_log
disabled. Switch to catalog-only; note in README.md.Write files to a folder named <cluster>-analyst/ in a location the user
has write access to. Ask in phase 2 if unsure (default: prompt the user).
The user is expected to zip or copy the folder and install it as a new
Skill — in claude.ai via Settings → Capabilities, or by placing it in
their agent's skills directory (e.g. Claude Code / Codex).
End with a short summary:
Wrote <cluster>-analyst/ (6 files, <N> words total). To save as a
Skill: compress the folder and add it to your agent — claude.ai
Settings → Capabilities → Add Skill, or your skills directory.
If the user already had a matching analyst Skill loaded, add:
This artifact is a full replacement for the previously loaded
<cluster>-analyst Skill (generated <prior_date>). Re-upload to update.
Load these on demand. Do not load all at once.
pipeline.md (in this Skill) — full SQL recipes for each of the
all phases (0–8). Load at phase start. ~9-11k words.templates.md (in this Skill) — section templates for each of the
6 output files. Load at phase 8. ~10-12k words.edge-cases.md (in this Skill) — gotcha library (~27 known edge
cases). Load when you hit a suspected edge case, or scan once at
pipeline start if the cluster shape feels unusual. ~5-6k words.archetypes/README.md — index + first-match-wins decision rules
for archetype routing. Load at Phase 1.5. ~1.5k words.archetypes/<primary>.md — the selected archetype's full module
(signature card, phase emphasis, engine traps, corpus-override
list). Load after Phase 1.5 assigns the primary, and re-consult at
Phase 5d.1 (passing --archetypes to synthesize_conventions.py)
and at Phase 8 (rendering the signature card and engine-trap
sections). ~1-2k words each.archetypes/<secondary>.md — any secondary archetypes' Engine
traps + Corpus-override list sections only. Load at Phase 8 when
hybrid signals fired in Phase 1.5.backticks for identifiers.If anything in a phase is unclear to you — if the SQL result is unexpected, if an engine you've never seen appears, if the user's questionnaire answer contradicts discovery — stop and check with the user rather than guess. The artifact will be saved and consumed later; bad information compounds.
tools
Read-only ClickHouse security audit expert for live or exported systems. Use when assessing ClickHouse security posture, reviewing users, roles, grants, settings profiles, row policies, table functions, external sources, table engines, executable UDFs, audit logs, named collections, password hash hygiene, SQL SECURITY DEFINER, impersonation, TLS/network exposure, Keeper/interserver security, encryption at rest, backups, the HTTP interface surface, cluster security, or version-specific ClickHouse security behavior. Diagnoses from SQL/system tables, supplied configuration files, query logs, access metadata, and ClickHouse/Altinity documentation.
tools
Diagnose and resolve ClickHouse grant and authentication errors, especially after upgrades. Use when queries fail with ACCESS_DENIED/NOT_ENOUGH_PRIVILEGES, AUTHENTICATION_FAILED/WRONG_PASSWORD/REQUIRED_PASSWORD, or ON CLUSTER privilege errors; when system.* or INFORMATION_SCHEMA access is denied; or when grant behavior changes after version upgrades.
tools
Diagnose ClickHouse disk usage, compression efficiency, part sizes, and storage bottlenecks. Use for disk space issues and slow IO.
tools
Analyze ClickHouse table structure, partitioning, ORDER BY keys, materialized views, and identify schema design anti-patterns. Use for table design issues and optimization.