plugins/rai/skills/rai-health/SKILL.md
Guides diagnosis of RAI engine performance, failed transactions, CDC/data-stream health, and CDC engine management. Use when a reasoner is slow or queuing, a transaction or batch has failed, a CDC stream is suspended or quarantined, or CDC engine sizing/recovery is needed.
npx skillsauth add RelationalAI/rai-agent-skills rai-healthInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
What: A process skill for diagnosing RAI operational health across four domains: reasoner performance (memory/CPU/demand), failed transactions, CDC / data-stream health, and CDC engine management. Each domain has its own step with decision tables and remediation actions.
When to use:
get_transaction_problems, get_own_transaction_problems,
or get_load_errors must be calledresume_cdc is neededalter_cdc_engine_size) or force-deletionWhen NOT to use:
rai-pyrel-codingrai-setuprai-prescriptive-solver-managementOverview (process steps):
Navigation: Steps 1–3 cover reasoner health only. For CDC/stream issues go directly to Step 5. For transaction failures go directly to Step 4. For CDC engine sizing or force-delete go directly to Step 6.
OBSERVABILITY_PREVIEW)| View | Key Column | Healthy Signal |
|------|-----------|----------------|
| logic_reasoner__memory_utilization | MEMORY_UTILIZATION (0.0–1.0) | < 0.80 on most runs |
| logic_reasoner__cpu_utilization | CPU_UTILIZATION (0.0–1.0) | < 0.85 sustained; < 0.95 peak |
| logic_reasoner__demand | DEMAND (0.0+) | ≤ 1.0 (> 1.0 = queuing) |
Quickest health check — all three metrics joined, last hour:
SELECT
m.REASONER_NAME,
m.TIMESTAMP,
m.MEMORY_UTILIZATION,
c.CPU_UTILIZATION,
d.DEMAND,
d.REASONER_CAPACITY
FROM relationalai.observability_preview.logic_reasoner__memory_utilization m
JOIN relationalai.observability_preview.logic_reasoner__cpu_utilization c
ON m.REASONER_ID = c.REASONER_ID AND m.TIMESTAMP = c.TIMESTAMP
JOIN relationalai.observability_preview.logic_reasoner__demand d
ON m.REASONER_ID = d.REASONER_ID AND m.TIMESTAMP = d.TIMESTAMP
WHERE m.TIMESTAMP >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
ORDER BY m.TIMESTAMP DESC;
Always include a time-range filter. Querying without
WHERE timestamp >= ...scans the entire Event Table and incurs high Snowflake compute costs.
Before querying metrics, confirm the events view is registered and data is flowing.
CALL relationalai.app.CHECK_EVENTS_VIEW_STATUS();
| Status | Meaning | Action |
|--------|---------|--------|
| Events view active | Healthy, events flowing | None |
| No events view registered | Setup not done | Follow setup in references/setup-guide.md |
| ERROR | Configuration broken | Fix per error message reported |
Run
CHECK_EVENTS_VIEW_STATUS()whenever observability views return unexpected or empty results — it diagnoses most configuration issues automatically.
SELECT REASONER_NAME,
AVG(MEMORY_UTILIZATION) AS avg_mem,
MAX(MEMORY_UTILIZATION) AS peak_mem
FROM relationalai.observability_preview.logic_reasoner__memory_utilization
WHERE timestamp >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
GROUP BY REASONER_NAME;
SELECT REASONER_NAME,
AVG(CPU_UTILIZATION) AS avg_cpu,
MAX(CPU_UTILIZATION) AS peak_cpu
FROM relationalai.observability_preview.logic_reasoner__cpu_utilization
WHERE timestamp >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
GROUP BY REASONER_NAME;
SELECT REASONER_NAME,
AVG(DEMAND) AS avg_demand,
MAX(DEMAND) AS peak_demand,
REASONER_CAPACITY
FROM relationalai.observability_preview.logic_reasoner__demand
WHERE timestamp >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
GROUP BY REASONER_NAME, REASONER_CAPACITY;
Interpret patterns, not isolated spikes. Utilization is naturally spiky. The key question is: does almost every workload run exceed the threshold? Isolated peaks are normal; consistent exceedance across runs is the signal to act.
Signals: MEMORY_UTILIZATION < 0.80, CPU_UTILIZATION < 0.85, DEMAND ≤ 1.0 on most runs.
If transactions are still failing despite healthy metrics, the problem is not resource-related — go to Step 4 to diagnose the transaction directly.
Signals (any of):
MEMORY_UTILIZATION > 0.80 on most workload runsCPU_UTILIZATION consistently > 0.95CPU_UTILIZATION = 1.0 AND MEMORY_UTILIZATION = 1.0 AND DEMAND > 1.0 (all hard limits hit)Action: Upgrade reasoner size. No in-place resize exists — delete and recreate.
rai reasoners:suspend --type Logic --name <name>
rai reasoners:delete --type Logic --name <name>
rai reasoners:create --type Logic --name <name> --size <larger-size>
Signals: CPU_UTILIZATION consistently 0.85–0.95 (below critical but limited headroom for bursts).
Action: Schedule a resize during a low-traffic window before the next traffic spike. No immediate action required.
Signals: DEMAND consistently > 1.0 (more jobs than available queue slots).
Action:
Signals: CPU_UTILIZATION < 0.30 AND MEMORY_UTILIZATION never exceeds 0.30 across workload runs.
Action: Downgrade to a smaller reasoner — you are paying for unused capacity.
rai reasoners:suspend --type Logic --name <name>
rai reasoners:delete --type Logic --name <name>
rai reasoners:create --type Logic --name <name> --size <smaller-size>
Signals: DEMAND = 0 for extended periods.
Action: Suspend the reasoner or reduce its auto_suspend threshold to stop billing for idle time.
rai reasoners:suspend --type Logic --name <name>
Use these procedures when a transaction appears stuck, aborted, or when load errors are reported.
CALL relationalai.api.get_transaction('<transaction_id>');
Returns the full transaction record including status, owner, start/end timestamps, and error detail.
-- Problems for any transaction (requires admin-level role)
CALL relationalai.api.get_transaction_problems('<transaction_id>');
-- Problems for transactions you own (end-user role)
CALL relationalai.api.get_own_transaction_problems('<transaction_id>');
| Procedure | Accessible by | Returns |
|-----------|--------------|---------|
| get_transaction_problems | Admin roles | All transactions |
| get_own_transaction_problems | Any role | Only caller-owned transactions |
CALL relationalai.api.get_load_errors('<transaction_id>');
Returns row-level load errors associated with a transaction: source object, error message, and affected row count.
WARNING — Owner-restriction pitfall: If
get_transaction_problemsreturns HTTP 400, check the transaction owner before assuming a permissions misconfiguration:CALL relationalai.api.get_transaction('<transaction_id>');Then use the table below to interpret the result.
| get_transaction result | Meaning | Next step |
|--------------------------|---------|-----------|
| owner = cdc.scheduler@erp | Expected behavior — CDC-owned transactions are not visible to end-user roles by design | Use SELECT * FROM relationalai.api.cdc_status or an admin role |
| owner = any other identity; called get_transaction_problems without an admin role | Permission issue — get_transaction_problems requires an admin role | Grant the admin role, or switch to get_own_transaction_problems if you own the transaction |
| owner = any other identity; caller has an admin role | Genuine API failure — not a permissions problem | Open a support ticket with the transaction ID and full error response |
| get_transaction itself returns 400 | Invalid transaction ID, or insufficient role to read any transactions | Verify the transaction ID; if correct, confirm read access to relationalai.api |
When get_transaction returns an unexpected column or state code, or when inspecting load errors per row, see transaction-debug.md.
WARNING — Auto-quarantine gotcha: A stream that has been in
SUSPENDEDstate for approximately one month will be automatically promoted toQUARANTINED— without creating any rows indata_stream_errors. The absence of error rows does not mean the stream is healthy. Always confirm stream status fromcdc_statusordata_stream_batchesbefore treating an empty errors result as a clean bill of health.
SELECT * FROM relationalai.api.cdc_status;
Key columns: stream_name, stream_status, engine_name, engine_status. Use the stream_name
values from this output as '<stream_name>' in the queries below.
SELECT stream_name, batch_id, status, error_message, created_at
FROM relationalai.api.data_stream_batches
WHERE stream_name = '<stream_name>'
AND created_at >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY created_at DESC;
Use a 7-day window rather than 24 hours — a quarantined or long-suspended stream may have had no batches for days, and a 24-hour filter returns empty output indistinguishable from a healthy-but-idle stream.
SELECT *
FROM relationalai.api.data_stream_errors
WHERE stream_name = '<stream_name>'
ORDER BY created_at DESC
LIMIT 50;
For auto-quarantined streams this may return empty — that is expected. Use
data_stream_batches status as the authoritative source.
| Status | Meaning | Action |
|--------|---------|--------|
| ACTIVE | Healthy, batches flowing | None |
| SUSPENDED | Paused; no new batches | Call resume_cdc() — see below |
| QUARANTINED | Permanently paused; data integrity issue | Follow quarantine recovery flow |
| FAILED | Batch or load error | Check data_stream_errors and get_load_errors |
CALL relationalai.app.resume_cdc('<stream_name>');
See cdc-recovery.md — Quarantine Recovery Runbook for the full step-by-step recovery checklist, schema reference, and official docs link.
A freshly created or freshly resumed Predictive reasoner can sit in STATUS=PROVISIONING for many minutes while the in-pod data index hydrates from CDC streams — and gnn.fit() running against it appears to hang in Step 1 (dataset prep) before reaching "Training job submitted." This is the same per-table CDC stream-sync compounding that affects unwarmed Logic reasoners on first model query, surfacing on the Predictive side as a silent prep-step stall.
Reasoner status — is the pod still hydrating or actually ready?
CALL RELATIONALAI.API.GET_REASONER('predictive', '<reasoner_name>');
STATUS=PROVISIONING → wait (1–3 min typical for warm pools; longer on first-time Snowflake-stream attach). STATUS=READY plus a hang → continue to (2) and (3).
CDC stream health — is a stream the predictive reasoner depends on suspended or quarantined?
SELECT * FROM relationalai.api.cdc_status;
engine_status != READY or stream_status not RUNNING → see § Step 5 — Diagnose CDC / Data Stream Health for the quarantine-recovery / resume_cdc runbook. Predictive jobs cannot proceed until the upstream streams are healthy.
Transaction problems — if a specific transaction id appeared in client logs (often the Failed to pull data into index: transaction was aborted wrapper), pull its problems:
CALL RELATIONALAI.API.GET_OWN_TRANSACTION_PROBLEMS('<transaction_id>');
-- or with admin role: GET_TRANSACTION_PROBLEMS('<id>')
See § Step 4 — Diagnose a Failed Transaction for the schema reference and owner-restriction pitfall.
If CDC is healthy and the reasoner has been PROVISIONING for more than ~5 minutes with no client-side progress, treat it the same as a stuck worker — SUSPEND_REASONER + RESUME_REASONER_ASYNC (see § Predictive train jobs stuck QUEUED below). Persistent failure: rebuild on a fresh GPU reasoner via DELETE_REASONER + CREATE_REASONER_ASYNC('predictive', ..., 'GPU_NV_S', OBJECT_CONSTRUCT()).
A predictive train job submitted via gnn.fit() can sit in STATE='QUEUED' in RELATIONALAI.API.JOBS indefinitely while CALL RELATIONALAI.API.GET_REASONER('predictive', '<name>') still reports STATUS='READY'. The SDK only checks reasoner-pod status before submitting — the in-pod worker queue can be out of sync with that status, and the SDK has no way to detect it (relationalai_gnns/core/connector.py::_check_engine_availability).
Long-running predictive jobs are usually fine — distinguish stuck from slow before suspending anything. Use the same ladder as rai-predictive-training § "Training appears stuck":
CALL RELATIONALAI.API.GET_REASONER('predictive', '<name>') → STATUS=READY?client.jobs.list("Predictive", name="<name>") → is there a RUNNING train job (with rising AGE_MIN), or a QUEUED one going stale?SHOW EXPERIMENTS IN SCHEMA <exp_db>.<exp_schema> → did a new experiment row append within ~60s of the RUNNING train job?A QUEUED train job that won't advance while the reasoner reports READY is the worker-queue desync this section addresses. Genuine long runs progress through (1) READY → (2) RUNNING → (3) new experiment row.
Suspend then resume the predictive reasoner to force a worker recycle, kill any stuck client, then re-instantiate GNN(...) and resubmit (gnn.fit() is idempotent — see rai-predictive-training § gnn.fit() is idempotent). Use only the supported RELATIONALAI.API.* surface — do not invoke RELATIONALAI.EXPERIMENTAL.* procedures as a workaround.
-- 1. Confirm a stuck train job
SELECT ID, STATE, DATEDIFF('minute', CREATED_ON, CURRENT_TIMESTAMP()) AS AGE_MIN
FROM RELATIONALAI.API.JOBS
WHERE STATE IN ('QUEUED','RUNNING')
AND PAYLOAD LIKE '%"job_type": "train"%'
ORDER BY CREATED_ON ASC;
-- 2. Recycle the worker
CALL RELATIONALAI.API.SUSPEND_REASONER('predictive', '<reasoner_name>');
CALL RELATIONALAI.API.RESUME_REASONER_ASYNC('predictive', '<reasoner_name>');
-- 3. Wait for STATUS=READY, kill the stuck client, then resubmit (re-instantiate GNN(...))
CALL RELATIONALAI.API.GET_REASONER('predictive', '<reasoner_name>');
If recycling doesn't unstick the worker, rebuild on a fresh GPU predictive reasoner — the documented path:
CALL RELATIONALAI.API.DELETE_REASONER('predictive', '<reasoner_name>');
CALL RELATIONALAI.API.CREATE_REASONER_ASYNC('predictive', '<reasoner_name>', 'GPU_NV_S', OBJECT_CONSTRUCT());
-- Poll until STATUS=READY:
CALL RELATIONALAI.API.GET_REASONER('predictive', '<reasoner_name>');
See rai-predictive-training § Worker not ready to accept jobs for the matching client-side symptom and § Stalled train job: SDK polls without a timeout for stalled-job forensics.
CDC engine ≠ reasoner engine. The CDC pipeline runs on a dedicated managed engine distinct from Logic reasoner engines.
alter_cdc_engine_sizetargets only the CDC engine; the CLI commands (rai reasoners:create/delete) target Logic reasoners only. Do not apply the Step 3 CLI commands to the CDC engine.
SELECT * FROM relationalai.api.cdc_status;
Key columns: engine_name, engine_size, engine_status, stream_name, stream_status.
CALL relationalai.app.alter_cdc_engine_size('<size>');
| Size | Use When |
|------|----------|
| HIGHMEM_X64_S | Small CDC load; default starting point |
| HIGHMEM_X64_M | Moderate load or frequent quarantine timeouts |
| HIGHMEM_X64_L | High-volume streams (see 395019 pitfall in Common Pitfalls) |
| HIGHMEM_X64_SL | Largest available; high-volume with large memory requirement (see 395019 pitfall) |
The CDC engine is suspended and recreated during a resize — expect brief CDC downtime.
Use when the CDC engine is stuck in a non-deletable state:
CALL relationalai.api.delete_engine('CDC_MANAGED_ENGINE', TRUE);
The second argument TRUE enables force deletion. RAI will recreate the CDC engine automatically
on the next CDC trigger. Confirm recovery with SELECT * FROM relationalai.api.cdc_status.
WARNING — If
delete_enginereturns "engine not found" butcdc_statusstill shows the engine as suspended: this is a control-plane / data-plane desync — the engine record exists in RAI's metadata but the underlying Snowflake engine is gone. No self-serve command resolves this state. Run the following and retain the output for support:SELECT * FROM relationalai.api.cdc_status;Then open a support ticket with that output. Do not attempt
alter_cdc_engine_sizein this state — it will fail or create a duplicate record.
Two application roles control who can configure and who can read observability data:
| Role | Capabilities | Grant To |
|------|-------------|----------|
| observability_admin | Register/unregister events view; call CHECK_EVENTS_VIEW_STATUS() | Small trusted ops group |
| observability_viewer | Read-only on all observability views | Engineering and operations users |
GRANT APPLICATION ROLE relationalai.observability_viewer TO ROLE <your_role>;
GRANT APPLICATION ROLE relationalai.observability_admin TO ROLE <your_role>;
Observability views are non-materialized — every query scans the Snowflake Event Table in real time. No extra storage cost, but Snowflake compute credits are consumed on every query.
Cost scales with: event volume × time range × query complexity.
| Rule | Detail |
|------|--------|
| Always filter by time | WHERE timestamp >= DATEADD(hour, -24, ...) — never query without bounds |
| Monitor query costs | SELECT query_id, total_elapsed_time, credits_used_cloud_services FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY WHERE query_text ILIKE '%observability%' |
| Prefer hourly/daily aggregations for dashboards | Avoid raw per-minute scans in scheduled jobs |
| Mistake | Cause | Fix |
|---------|-------|-----|
| No data in metric views | Events view not registered or setup incomplete | Run CHECK_EVENTS_VIEW_STATUS(); complete setup if needed |
| Registration fails: "CHANGE_TRACKING" error | Change tracking not enabled on view or event table | Add CHANGE_TRACKING = TRUE to view definition and underlying table |
| Registration fails: "A view is already registered" | Prior view still bound | Call UNREGISTER_EVENTS_VIEW() first, then re-register |
| Queries are slow or expensive | Missing time-range filter scans full event table | Always add WHERE timestamp >= DATEADD(...) |
| DEMAND > 1.0 but CPU is low | Job routing: one reasoner saturated, others idle | Redistribute jobs across reasoner instances |
| Isolated spikes look alarming | Normal: spiky workloads cause transient peaks | Focus on pattern across runs, not individual data points |
| observability_viewer cannot see views | Role not granted | Run GRANT APPLICATION ROLE ... TO ROLE ... as admin |
| Engine create fails: Failed to parse 'service spec' as YAML (Snowflake error 395019) | ENGINE_CONFIG_OVERRIDE is set and the size is HIGHMEM_X64_L or HIGHMEM_X64_SL | Use HIGHMEM_X64_S or HIGHMEM_X64_M until the platform fix is deployed |
| Reference | Description | File |
|-----------|-------------|------|
| Setup guide | Full 6-step observability setup | setup-guide.md |
| Metric schemas | All metric view schemas | metric-schemas.md |
| Transaction debug | get_transaction / get_load_errors API reference and owner-restriction details | transaction-debug.md |
| CDC recovery | data_stream_errors, batches schema, quarantine recovery, resume_cdc | cdc-recovery.md |
data-ai
Configure and train graph neural network (GNN) models, generate predictions, evaluate results, and manage trained models. Use when ready to train, generate predictions, evaluate, or manage models; for concepts, data loading, edges, and feature configuration, see `rai-predictive-modeling`.
development
Build graph neural network (GNN) models — concepts, Snowflake data loading, task relationships, graph edges, and PropertyTransformer features. Use for node classification, regression, and link prediction tasks; for training, predictions, and evaluation, see `rai-predictive-training`.
development
Setup and configuration for RelationalAI — first-time install walkthrough and all raiconfig.yaml tuning. Use when installing RAI, connecting to Snowflake, or editing raiconfig.yaml. Not for writing PyRel model code (see rai-pyrel-coding) or solver usage and diagnostics (see rai-prescriptive-solver-management).
testing
Converts natural language business rules into PyRel derived properties — validation, classification, derivation, alerting, and reconciliation. Use whenever a task assigns each entity a new tier, segment, score, or flag, or derives a new property; author it here as a derived property, then query it with rai-querying.