Name: rai-health
Author: RelationalAI

Summary

What: A process skill for diagnosing RAI operational health across four domains: reasoner performance (memory/CPU/demand), failed transactions, CDC / data-stream health, and CDC engine management. Reasoner performance is covered by Steps 1–3; each of the other domains has its own step. Every step comes with decision tables and remediation actions.

When to use:

Reasoner is slow, stuck, or queuing; need to check memory, CPU, or demand metrics
Observability views need setup, role grants, or dashboard/alerting work
A transaction was aborted; get_transaction_problems, get_own_transaction_problems, or get_load_errors must be called
Batch processing has failed and load errors need inspection
A CDC task is suspended, a data stream is quarantined, or resume_cdc is needed
CDC engine needs resizing (alter_cdc_engine_size) or force-deletion

When NOT to use:

Writing PyRel models or query logic → see rai-pyrel-coding
Configuring authentication or initial RAI setup → see rai-setup
Managing solver optimization problems → see rai-prescriptive-solver-management

Overview (process steps):

1. Verify observability is set up (events view registered and healthy)
2. Query the three metric views: memory, CPU, demand
3. Apply threshold-based decision rules and prescribe the exact remediation action
4. Diagnose a failed transaction (get_transaction, get_load_errors, owner-restriction pitfall)
5. Diagnose CDC / data stream health (errors, batches, quarantine recovery, resume_cdc)
6. Manage the CDC engine (alter_cdc_engine_size, force delete, cdc_status)

Navigation: Steps 1–3 cover reasoner health only. For CDC/stream issues go directly to Step 5. For transaction failures go directly to Step 4. For CDC engine sizing or force-delete go directly to Step 6.

Quick Reference

The Three Metric Views (all in `OBSERVABILITY_PREVIEW`)

| View | Key Column | Healthy Signal |
|------|-----------|----------------|
| logic_reasoner__memory_utilization | MEMORY_UTILIZATION (0.0–1.0) | < 0.80 on most runs |
| logic_reasoner__cpu_utilization | CPU_UTILIZATION (0.0–1.0) | < 0.85 sustained; < 0.95 peak |
| logic_reasoner__demand | DEMAND (0.0+) | ≤ 1.0 (> 1.0 = queuing) |

Quickest health check — all three metrics joined, last hour:

SELECT
  m.REASONER_NAME,
  m.TIMESTAMP,
  m.MEMORY_UTILIZATION,
  c.CPU_UTILIZATION,
  d.DEMAND,
  d.REASONER_CAPACITY
FROM relationalai.observability_preview.logic_reasoner__memory_utilization m
JOIN relationalai.observability_preview.logic_reasoner__cpu_utilization c
  ON m.REASONER_ID = c.REASONER_ID AND m.TIMESTAMP = c.TIMESTAMP
JOIN relationalai.observability_preview.logic_reasoner__demand d
  ON m.REASONER_ID = d.REASONER_ID AND m.TIMESTAMP = d.TIMESTAMP
WHERE m.TIMESTAMP >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
ORDER BY m.TIMESTAMP DESC;

Always include a time-range filter. Querying without WHERE timestamp >= ... scans the entire Event Table and incurs high Snowflake compute costs.
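One way to make the time-range rule hard to forget is to build metric queries through a small helper that refuses to produce an unbounded scan. The sketch below is illustrative only: `bounded_metric_query` and the `_VIEW_NAMES` mapping are hypothetical names, not part of any RAI SDK, and the helper only returns SQL text without executing it.

```python
# Hypothetical helper, not part of any RAI SDK: it only builds SQL text.
_VIEW_NAMES = {
    "memory": "logic_reasoner__memory_utilization",
    "cpu": "logic_reasoner__cpu_utilization",
    "demand": "logic_reasoner__demand",
}


def bounded_metric_query(metric: str, hours: int) -> str:
    """Return a SELECT on one metric view that always carries a time bound."""
    if metric not in _VIEW_NAMES:
        raise ValueError(f"unknown metric: {metric!r}")
    if not isinstance(hours, int) or hours <= 0:
        raise ValueError("hours must be a positive integer")
    view = f"relationalai.observability_preview.{_VIEW_NAMES[metric]}"
    return (
        f"SELECT * FROM {view} "
        f"WHERE timestamp >= DATEADD(hour, -{hours}, CURRENT_TIMESTAMP())"
    )


if __name__ == "__main__":
    print(bounded_metric_query("cpu", 24))
```

Because the window is a required argument, a caller cannot accidentally scan the full Event Table.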

Step 1 — Verify Observability Is Active

Before querying metrics, confirm the events view is registered and data is flowing.

CALL relationalai.app.CHECK_EVENTS_VIEW_STATUS();

| Status | Meaning | Action |
|--------|---------|--------|
| Events view active | Healthy, events flowing | None |
| No events view registered | Setup not done | Follow setup in references/setup-guide.md |
| ERROR | Configuration broken | Fix per error message reported |

Run CHECK_EVENTS_VIEW_STATUS() whenever observability views return unexpected or empty results — it diagnoses most configuration issues automatically.

Step 2 — Query Each Metric

Memory Utilization

SELECT REASONER_NAME,
       AVG(MEMORY_UTILIZATION) AS avg_mem,
       MAX(MEMORY_UTILIZATION) AS peak_mem
FROM relationalai.observability_preview.logic_reasoner__memory_utilization
WHERE timestamp >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
GROUP BY REASONER_NAME;

CPU Utilization

SELECT REASONER_NAME,
       AVG(CPU_UTILIZATION) AS avg_cpu,
       MAX(CPU_UTILIZATION) AS peak_cpu
FROM relationalai.observability_preview.logic_reasoner__cpu_utilization
WHERE timestamp >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
GROUP BY REASONER_NAME;

Demand (Queue Pressure)

SELECT REASONER_NAME,
       AVG(DEMAND)   AS avg_demand,
       MAX(DEMAND)   AS peak_demand,
       REASONER_CAPACITY
FROM relationalai.observability_preview.logic_reasoner__demand
WHERE timestamp >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
GROUP BY REASONER_NAME, REASONER_CAPACITY;

Interpret patterns, not isolated spikes. Utilization is naturally spiky. The key question is: does almost every workload run exceed the threshold? Isolated peaks are normal; consistent exceedance across runs is the signal to act.
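The "pattern, not spike" rule can be sketched as a check over per-run peak values: count how many runs exceed the threshold and act only when that share is high. This is a minimal sketch; the function name and the 0.8 default for "almost every run" are assumptions chosen for illustration, not values from RAI documentation.

```python
from typing import Sequence


def exceeds_consistently(
    run_peaks: Sequence[float],
    threshold: float,
    min_fraction: float = 0.8,  # assumed meaning of "almost every run"
) -> bool:
    """True when at least min_fraction of the runs peak above threshold."""
    if not run_peaks:
        return False  # no data is not evidence of overload
    over = sum(1 for peak in run_peaks if peak > threshold)
    return over / len(run_peaks) >= min_fraction


if __name__ == "__main__":
    # One isolated spike among otherwise calm runs: not a signal.
    print(exceeds_consistently([0.50, 0.60, 0.99, 0.55], threshold=0.80))
    # Five of six runs above the threshold: a signal to act.
    print(exceeds_consistently([0.90, 0.85, 0.95, 0.92, 0.70, 0.91], threshold=0.80))
```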

Step 3 — Health Verdicts and Actions

OK — HEALTHY — No Action

Signals: MEMORY_UTILIZATION < 0.80, CPU_UTILIZATION < 0.85, DEMAND ≤ 1.0 on most runs.

If transactions are still failing despite healthy metrics, the problem is not resource-related — go to Step 4 to diagnose the transaction directly.

CRITICAL — OVERLOADED — Upgrade to Larger Reasoner (Immediate)

Signals (any of):

MEMORY_UTILIZATION > 0.80 on most workload runs
CPU_UTILIZATION consistently > 0.95
CPU_UTILIZATION = 1.0 AND MEMORY_UTILIZATION = 1.0 AND DEMAND > 1.0 (all hard limits hit)

Action: Upgrade reasoner size. No in-place resize exists — delete and recreate.

rai reasoners:suspend --type Logic --name <name>
rai reasoners:delete  --type Logic --name <name>
rai reasoners:create  --type Logic --name <name> --size <larger-size>

ELEVATED — PLAN TO RESIZE — Proactive Warning

Signals: CPU_UTILIZATION consistently 0.85–0.95 (below critical but limited headroom for bursts).

Action: Schedule a resize during a low-traffic window before the next traffic spike. No immediate action required.

WARNING — QUEUING — Review Job Volume / Split Across Reasoners

Signals: DEMAND consistently > 1.0 (more jobs than available queue slots).

Action:

Investigate root cause: is a batch job or burst flooding the queue?
If higher concurrency is genuinely needed, route different job types to separate reasoner instances rather than upsizing.
Upsizing is not the default fix for queuing; queuing is a demand and routing problem.

INFORMATIONAL — UNDERUTILIZED — Downsize to Save Cost

Signals: CPU_UTILIZATION < 0.30 AND MEMORY_UTILIZATION never exceeds 0.30 across workload runs.

Action: Downgrade to a smaller reasoner — you are paying for unused capacity.

rai reasoners:suspend --type Logic --name <name>
rai reasoners:delete  --type Logic --name <name>
rai reasoners:create  --type Logic --name <name> --size <smaller-size>

NOMINAL — IDLE — Suspend or Lower Auto-Suspend Threshold

Signals: DEMAND = 0 for extended periods.

Action: Suspend the reasoner or reduce its auto_suspend threshold to stop billing for idle time.

rai reasoners:suspend --type Logic --name <name>
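
The Step 3 thresholds can be collected into one function, which is convenient for alerting scripts. The sketch below is a hedged illustration: `ReasonerSummary` and `verdicts` are hypothetical names, the inputs are assumed to be values already judged "consistent" across runs, and returning several verdicts at once is a design choice of this sketch, since the text does not define a precedence between them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReasonerSummary:
    typical_mem: float     # memory utilization seen on most runs (0.0-1.0)
    peak_mem: float        # highest memory utilization across runs
    typical_cpu: float     # CPU utilization seen consistently (0.0-1.0)
    typical_demand: float  # demand seen consistently (0.0+)


def verdicts(s: ReasonerSummary) -> List[str]:
    """Map a summary onto the Step 3 verdict names (thresholds from the text)."""
    found: List[str] = []
    # The "all hard limits hit" case is already covered by these two checks.
    if s.typical_mem > 0.80 or s.typical_cpu > 0.95:
        found.append("OVERLOADED")
    elif 0.85 <= s.typical_cpu <= 0.95:
        found.append("PLAN_TO_RESIZE")
    if s.typical_demand > 1.0:
        found.append("QUEUING")
    if s.typical_demand == 0:
        found.append("IDLE")
    if s.typical_cpu < 0.30 and s.peak_mem <= 0.30:
        found.append("UNDERUTILIZED")
    # Assumption: a boundary value such as memory == 0.80 counts as healthy;
    # the thresholds in the text leave that exact value unspecified.
    return found or ["HEALTHY"]


if __name__ == "__main__":
    print(verdicts(ReasonerSummary(0.5, 0.6, 0.9, 1.5)))
```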

Step 4 — Diagnose a Failed Transaction

Use these procedures when a transaction appears stuck, aborted, or when load errors are reported.

Fetch a Transaction by ID

CALL relationalai.api.get_transaction('<transaction_id>');

Returns the full transaction record including status, owner, start/end timestamps, and error detail.

Get Transaction Problems

-- Problems for any transaction (requires admin-level role)
CALL relationalai.api.get_transaction_problems('<transaction_id>');

-- Problems for transactions you own (end-user role)
CALL relationalai.api.get_own_transaction_problems('<transaction_id>');

| Procedure | Accessible by | Returns |
|-----------|--------------|---------|
| get_transaction_problems | Admin roles | All transactions |
| get_own_transaction_problems | Any role | Only caller-owned transactions |

Get Load Errors

CALL relationalai.api.get_load_errors('<transaction_id>');

Returns row-level load errors associated with a transaction: source object, error message, and affected row count.

WARNING — Owner-restriction pitfall: If get_transaction_problems returns HTTP 400, check the transaction owner before assuming a permissions misconfiguration:
CALL relationalai.api.get_transaction('<transaction_id>');
Then use the table below to interpret the result.

| get_transaction result | Meaning | Next step |
|--------------------------|---------|-----------|
| owner = cdc.scheduler@erp | Expected behavior: CDC-owned transactions are not visible to end-user roles by design | Use SELECT * FROM relationalai.api.cdc_status or an admin role |
| owner = any other identity; called get_transaction_problems without an admin role | Permission issue: get_transaction_problems requires an admin role | Grant the admin role, or switch to get_own_transaction_problems if you own the transaction |
| owner = any other identity; caller has an admin role | Genuine API failure, not a permissions problem | Open a support ticket with the transaction ID and full error response |
| get_transaction itself returns 400 | Invalid transaction ID, or insufficient role to read any transactions | Verify the transaction ID; if correct, confirm read access to relationalai.api |
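
The owner-restriction table can be read as a short decision function. The sketch below assumes the caller has already run get_transaction and extracted the owner; `interpret_http_400` is a hypothetical name, and passing `None` for the owner is this sketch's way of saying that get_transaction itself returned 400.

```python
from typing import Optional

CDC_OWNER = "cdc.scheduler@erp"  # owner identity named in the table


def interpret_http_400(owner: Optional[str], caller_is_admin: bool) -> str:
    """Suggest the next step after get_transaction_problems returns HTTP 400.

    owner is None when get_transaction itself returned 400.
    """
    if owner is None:
        return "verify the transaction ID, then confirm read access to relationalai.api"
    if owner == CDC_OWNER:
        return "expected for CDC-owned transactions: use cdc_status or an admin role"
    if not caller_is_admin:
        return "permission issue: grant the admin role or use get_own_transaction_problems"
    return "genuine API failure: open a support ticket with the ID and full error response"


if __name__ == "__main__":
    print(interpret_http_400("cdc.scheduler@erp", caller_is_admin=False))
```

Checking the CDC owner before the role check mirrors the order of the table: a CDC-owned transaction is not a misconfiguration, whatever role the caller holds.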

When get_transaction returns an unexpected column or state code, or when inspecting load errors per row, see transaction-debug.md.

Step 5 — Diagnose CDC / Data Stream Health

WARNING — Auto-quarantine gotcha: A stream that has been in SUSPENDED state for approximately one month will be automatically promoted to QUARANTINED — without creating any rows in data_stream_errors. The absence of error rows does not mean the stream is healthy. Always confirm stream status from cdc_status or data_stream_batches before treating an empty errors result as a clean bill of health.

Find Your Streams (Start Here)

SELECT * FROM relationalai.api.cdc_status;

Key columns: stream_name, stream_status, engine_name, engine_status. Use the stream_name values from this output as '<stream_name>' in the queries below.

Check Batch-Level Status

SELECT stream_name, batch_id, status, error_message, created_at
FROM relationalai.api.data_stream_batches
WHERE stream_name = '<stream_name>'
  AND created_at >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY created_at DESC;

Use a 7-day window rather than 24 hours — a quarantined or long-suspended stream may have had no batches for days, and a 24-hour filter returns empty output indistinguishable from a healthy-but-idle stream.

Check Stream-Level Errors

SELECT *
FROM relationalai.api.data_stream_errors
WHERE stream_name = '<stream_name>'
ORDER BY created_at DESC
LIMIT 50;

For auto-quarantined streams this may return empty — that is expected. Use data_stream_batches status as the authoritative source.

Stream State Verdicts

| Status | Meaning | Action |
|--------|---------|--------|
| ACTIVE | Healthy, batches flowing | None |
| SUSPENDED | Paused; no new batches | Call resume_cdc() (see below) |
| QUARANTINED | Permanently paused; data integrity issue | Follow quarantine recovery flow |
| FAILED | Batch or load error | Check data_stream_errors and get_load_errors |
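
The verdict table and the auto-quarantine gotcha combine into one rule: the stream status decides the action, and an empty errors result never upgrades a non-ACTIVE stream to healthy. A minimal sketch, with hypothetical function names and the status values taken from the table above:

```python
_ACTIONS = {
    "ACTIVE": "none",
    "SUSPENDED": "call resume_cdc",
    "QUARANTINED": "follow quarantine recovery flow",
    "FAILED": "check data_stream_errors and get_load_errors",
}


def stream_action(status: str) -> str:
    """Return the action listed for a stream status."""
    key = status.upper()
    if key not in _ACTIONS:
        raise ValueError(f"unknown stream status: {status!r}")
    return _ACTIONS[key]


def looks_healthy(status: str, error_row_count: int) -> bool:
    """Healthy only when the status is ACTIVE and no error rows exist.

    An auto-quarantined stream can have zero rows in data_stream_errors,
    so the error count alone is never sufficient.
    """
    return status.upper() == "ACTIVE" and error_row_count == 0


if __name__ == "__main__":
    print(looks_healthy("QUARANTINED", 0), stream_action("QUARANTINED"))
```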

Resume a Suspended Stream

CALL relationalai.app.resume_cdc('<stream_name>');

Quarantine Recovery

See cdc-recovery.md — Quarantine Recovery Runbook for the full step-by-step recovery checklist, schema reference, and official docs link.

Predictive reasoner stuck in data-index init

A freshly created or freshly resumed Predictive reasoner can sit in STATUS=PROVISIONING for many minutes while the in-pod data index hydrates from CDC streams — and gnn.fit() running against it appears to hang in Step 1 (dataset prep) before reaching "Training job submitted." This is the same per-table CDC stream-sync compounding that affects unwarmed Logic reasoners on first model query, surfacing on the Predictive side as a silent prep-step stall.

Diagnose (in order)

1. Reasoner status: is the pod still hydrating or actually ready?
   ```
   CALL RELATIONALAI.API.GET_REASONER('predictive', '<reasoner_name>');
   ```
   STATUS=PROVISIONING → wait (1–3 min typical for warm pools; longer on first-time Snowflake-stream attach). STATUS=READY plus a hang → continue to (2) and (3).
2. CDC stream health: is a stream the predictive reasoner depends on suspended or quarantined?
   ```
   SELECT * FROM relationalai.api.cdc_status;
   ```
   engine_status != READY or stream_status not RUNNING → see § Step 5 — Diagnose CDC / Data Stream Health for the quarantine-recovery / resume_cdc runbook. Predictive jobs cannot proceed until the upstream streams are healthy.
3. Transaction problems: if a specific transaction id appeared in client logs (often the Failed to pull data into index: transaction was aborted wrapper), pull its problems:
   ```
   CALL RELATIONALAI.API.GET_OWN_TRANSACTION_PROBLEMS('<transaction_id>');
   -- or with admin role: GET_TRANSACTION_PROBLEMS('<id>')
   ```
   See § Step 4 — Diagnose a Failed Transaction for the schema reference and owner-restriction pitfall.

Recovery

If CDC is healthy and the reasoner has been PROVISIONING for more than ~5 minutes with no client-side progress, treat it the same as a stuck worker — SUSPEND_REASONER + RESUME_REASONER_ASYNC (see § Predictive train jobs stuck QUEUED below). Persistent failure: rebuild on a fresh GPU reasoner via DELETE_REASONER + CREATE_REASONER_ASYNC('predictive', ..., 'GPU_NV_S', OBJECT_CONSTRUCT()).
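
The "more than ~5 minutes" rule amounts to polling with a deadline. The sketch below assumes a caller-supplied `get_status` callable that wraps the GET_REASONER call and returns the STATUS string; that callable, the function name, and the 15-second interval are assumptions for illustration. The clock and sleep functions are injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def wait_until_ready(
    get_status: Callable[[], str],   # hypothetical wrapper around GET_REASONER
    timeout_s: float = 300.0,        # "~5 minutes" from the text
    interval_s: float = 15.0,        # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until STATUS is READY; return False once the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        if get_status() == "READY":
            return True
        if clock() >= deadline:
            return False  # caller should treat the reasoner as stuck
        sleep(interval_s)


if __name__ == "__main__":
    statuses = iter(["PROVISIONING", "PROVISIONING", "READY"])
    print(wait_until_ready(lambda: next(statuses), interval_s=0.0))
```

A `False` result is the point at which the suspend-and-resume recovery described above becomes reasonable.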

Predictive train jobs stuck QUEUED

A predictive train job submitted via gnn.fit() can sit in STATE='QUEUED' in RELATIONALAI.API.JOBS indefinitely while CALL RELATIONALAI.API.GET_REASONER('predictive', '<name>') still reports STATUS='READY'. The SDK only checks reasoner-pod status before submitting — the in-pod worker queue can be out of sync with that status, and the SDK has no way to detect it (relationalai_gnns/core/connector.py::_check_engine_availability).

Diagnostic ladder

Long-running predictive jobs are usually fine — distinguish stuck from slow before suspending anything. Use the same ladder as rai-predictive-training § "Training appears stuck":

1. CALL RELATIONALAI.API.GET_REASONER('predictive', '<name>') → STATUS=READY?
2. client.jobs.list("Predictive", name="<name>") → is there a RUNNING train job (with rising AGE_MIN), or a QUEUED one going stale?
3. SHOW EXPERIMENTS IN SCHEMA <exp_db>.<exp_schema> → did a new experiment row append within ~60s of the RUNNING train job?

A QUEUED train job that won't advance while the reasoner reports READY is the worker-queue desync this section addresses. Genuine long runs progress through (1) READY → (2) RUNNING → (3) new experiment row.
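
The ladder can be sketched as a function over the three observations it collects. The names below are hypothetical, and `stale_after_min` is an assumed cutoff for "going stale", since the text does not give a number; tune it to your normal queue times.

```python
from typing import Optional


def diagnose_train_job(
    reasoner_status: str,
    job_state: Optional[str],      # "QUEUED", "RUNNING", or None if no job found
    age_min: float,                # AGE_MIN of the train job
    new_experiment_row: bool,      # result of the SHOW EXPERIMENTS check
    stale_after_min: float = 10.0, # assumed staleness cutoff, not from the docs
) -> str:
    """Separate a worker-queue desync from a slow but healthy run."""
    if reasoner_status != "READY":
        return "reasoner not ready: wait or check provisioning"
    if job_state == "QUEUED":
        if age_min >= stale_after_min:
            return "stuck: QUEUED while reasoner is READY (worker-queue desync)"
        return "queued: keep watching"
    if job_state == "RUNNING":
        if new_experiment_row:
            return "progressing: genuine long run"
        return "running: no experiment row yet"
    return "no active train job"


if __name__ == "__main__":
    print(diagnose_train_job("READY", "QUEUED", age_min=45, new_experiment_row=False))
```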

Recovery

Suspend then resume the predictive reasoner to force a worker recycle, kill any stuck client, then re-instantiate GNN(...) and resubmit (gnn.fit() is idempotent — see rai-predictive-training § gnn.fit() is idempotent). Use only the supported RELATIONALAI.API.* surface — do not invoke RELATIONALAI.EXPERIMENTAL.* procedures as a workaround.

-- 1. Confirm a stuck train job
SELECT ID, STATE, DATEDIFF('minute', CREATED_ON, CURRENT_TIMESTAMP()) AS AGE_MIN
FROM RELATIONALAI.API.JOBS
WHERE STATE IN ('QUEUED','RUNNING')
  AND PAYLOAD LIKE '%"job_type": "train"%'
ORDER BY CREATED_ON ASC;

-- 2. Recycle the worker
CALL RELATIONALAI.API.SUSPEND_REASONER('predictive', '<reasoner_name>');
CALL RELATIONALAI.API.RESUME_REASONER_ASYNC('predictive', '<reasoner_name>');

-- 3. Wait for STATUS=READY, kill the stuck client, then resubmit (re-instantiate GNN(...))
CALL RELATIONALAI.API.GET_REASONER('predictive', '<reasoner_name>');

If recycling doesn't unstick the worker, rebuild on a fresh GPU predictive reasoner — the documented path:

CALL RELATIONALAI.API.DELETE_REASONER('predictive', '<reasoner_name>');
CALL RELATIONALAI.API.CREATE_REASONER_ASYNC('predictive', '<reasoner_name>', 'GPU_NV_S', OBJECT_CONSTRUCT());
-- Poll until STATUS=READY:
CALL RELATIONALAI.API.GET_REASONER('predictive', '<reasoner_name>');

See rai-predictive-training § Worker not ready to accept jobs for the matching client-side symptom and § Stalled train job: SDK polls without a timeout for stalled-job forensics.

Step 6 — CDC Engine Management

CDC engine ≠ reasoner engine. The CDC pipeline runs on a dedicated managed engine distinct from Logic reasoner engines. alter_cdc_engine_size targets only the CDC engine; the CLI commands (rai reasoners:create/delete) target Logic reasoners only. Do not apply the Step 3 CLI commands to the CDC engine.

Check Current CDC Status

SELECT * FROM relationalai.api.cdc_status;

Key columns: engine_name, engine_size, engine_status, stream_name, stream_status.

Resize the CDC Engine

CALL relationalai.app.alter_cdc_engine_size('<size>');

| Size | Use When |
|------|----------|
| HIGHMEM_X64_S | Small CDC load; default starting point |
| HIGHMEM_X64_M | Moderate load or frequent quarantine timeouts |
| HIGHMEM_X64_L | High-volume streams (see 395019 pitfall in Common Pitfalls) |
| HIGHMEM_X64_SL | Largest available; high-volume with large memory requirement (see 395019 pitfall) |

The CDC engine is suspended and recreated during a resize — expect brief CDC downtime.
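
Before calling alter_cdc_engine_size, a script can check the requested size against the size table and the 395019 entry in Common Pitfalls. This is a sketch under stated assumptions: `check_cdc_size` is a hypothetical helper, and whether ENGINE_CONFIG_OVERRIDE is set must be supplied by the caller, because this document does not describe a way to query it.

```python
CDC_SIZES = ("HIGHMEM_X64_S", "HIGHMEM_X64_M", "HIGHMEM_X64_L", "HIGHMEM_X64_SL")
# Sizes the Common Pitfalls table ties to Snowflake error 395019.
RISKY_WITH_OVERRIDE = frozenset({"HIGHMEM_X64_L", "HIGHMEM_X64_SL"})


def check_cdc_size(size: str, engine_config_override_set: bool) -> str:
    """Return "ok" or a warning string for a requested CDC engine size."""
    if size not in CDC_SIZES:
        raise ValueError(f"unknown CDC engine size: {size!r}")
    if engine_config_override_set and size in RISKY_WITH_OVERRIDE:
        return "avoid: may fail with error 395019; use HIGHMEM_X64_S or HIGHMEM_X64_M"
    return "ok"


if __name__ == "__main__":
    print(check_cdc_size("HIGHMEM_X64_L", engine_config_override_set=True))
```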

Force-Delete a Stale CDC Engine

Use when the CDC engine is stuck in a non-deletable state:

CALL relationalai.api.delete_engine('CDC_MANAGED_ENGINE', TRUE);

The second argument TRUE enables force deletion. RAI will recreate the CDC engine automatically on the next CDC trigger. Confirm recovery with SELECT * FROM relationalai.api.cdc_status.

WARNING — If delete_engine returns "engine not found" but cdc_status still shows the engine as suspended: this is a control-plane / data-plane desync — the engine record exists in RAI's metadata but the underlying Snowflake engine is gone. No self-serve command resolves this state. Run the following and retain the output for support:
SELECT * FROM relationalai.api.cdc_status;
Then open a support ticket with that output. Do not attempt alter_cdc_engine_size in this state — it will fail or create a duplicate record.

Access Control

Two application roles control who can configure and who can read observability data:

| Role | Capabilities | Grant To |
|------|-------------|----------|
| observability_admin | Register/unregister events view; call CHECK_EVENTS_VIEW_STATUS() | Small trusted ops group |
| observability_viewer | Read-only on all observability views | Engineering and operations users |

GRANT APPLICATION ROLE relationalai.observability_viewer TO ROLE <your_role>;
GRANT APPLICATION ROLE relationalai.observability_admin  TO ROLE <your_role>;

Cost Guardrails

Observability views are non-materialized — every query scans the Snowflake Event Table in real time. No extra storage cost, but Snowflake compute credits are consumed on every query.

Cost scales with: event volume × time range × query complexity.

| Rule | Detail |
|------|--------|
| Always filter by time | WHERE timestamp >= DATEADD(hour, -24, ...); never query without bounds |
| Monitor query costs | SELECT query_id, total_elapsed_time, credits_used_cloud_services FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY WHERE query_text ILIKE '%observability%' |
| Prefer hourly/daily aggregations for dashboards | Avoid raw per-minute scans in scheduled jobs |

Common Pitfalls

| Mistake | Cause | Fix |
|---------|-------|-----|
| No data in metric views | Events view not registered or setup incomplete | Run CHECK_EVENTS_VIEW_STATUS(); complete setup if needed |
| Registration fails: "CHANGE_TRACKING" error | Change tracking not enabled on view or event table | Add CHANGE_TRACKING = TRUE to view definition and underlying table |
| Registration fails: "A view is already registered" | Prior view still bound | Call UNREGISTER_EVENTS_VIEW() first, then re-register |
| Queries are slow or expensive | Missing time-range filter scans full event table | Always add WHERE timestamp >= DATEADD(...) |
| DEMAND > 1.0 but CPU is low | Job routing: one reasoner saturated, others idle | Redistribute jobs across reasoner instances |
| Isolated spikes look alarming | Normal: spiky workloads cause transient peaks | Focus on pattern across runs, not individual data points |
| observability_viewer cannot see views | Role not granted | Run GRANT APPLICATION ROLE ... TO ROLE ... as admin |
| Engine create fails: Failed to parse 'service spec' as YAML (Snowflake error 395019) | ENGINE_CONFIG_OVERRIDE is set and the size is HIGHMEM_X64_L or HIGHMEM_X64_SL | Use HIGHMEM_X64_S or HIGHMEM_X64_M until the platform fix is deployed |

Reference files

| Reference | Description | File |
|-----------|-------------|------|
| Setup guide | Full 6-step observability setup | setup-guide.md |
| Metric schemas | All metric view schemas | metric-schemas.md |
| Transaction debug | get_transaction / get_load_errors API reference and owner-restriction details | transaction-debug.md |
| CDC recovery | data_stream_errors, batches schema, quarantine recovery, resume_cdc | cdc-recovery.md |

Summary

When to use:

Reasoner is slow, stuck, or queuing; need to check memory, CPU, or demand metrics
Observability views need setup, role grants, or dashboard/alerting work
A transaction was aborted; get_transaction_problems, get_own_transaction_problems, or get_load_errors must be called
Batch processing has failed and load errors need inspection
A CDC task is suspended, a data stream is quarantined, or resume_cdc is needed
CDC engine needs resizing (alter_cdc_engine_size) or force-deletion

When NOT to use:

Writing PyRel models or query logic → see rai-pyrel-coding
Configuring authentication or initial RAI setup → see rai-setup
Managing solver optimization problems → see rai-prescriptive-solver-management

Overview (process steps):

Verify observability is set up (events view registered and healthy)
Query the three metric views: memory, CPU, demand
Apply threshold-based decision rules and prescribe the exact remediation action
Diagnose a failed transaction (get_transaction, get_load_errors, owner-restriction pitfall)
Diagnose CDC / data stream health (errors, batches, quarantine recovery, resume_cdc)
Manage the CDC engine (alter_cdc_engine_size, force delete, cdc_status)

Navigation: Steps 1–3 cover reasoner health only. For CDC/stream issues go directly to Step 5. For transaction failures go directly to Step 4. For CDC engine sizing or force-delete go directly to Step 6.

Quick Reference

The Three Metric Views (all in `OBSERVABILITY_PREVIEW`)

Quickest health check — all three metrics joined, last hour:

SELECT
  m.REASONER_NAME,
  m.TIMESTAMP,
  m.MEMORY_UTILIZATION,
  c.CPU_UTILIZATION,
  d.DEMAND,
  d.REASONER_CAPACITY
FROM relationalai.observability_preview.logic_reasoner__memory_utilization m
JOIN relationalai.observability_preview.logic_reasoner__cpu_utilization c
  ON m.REASONER_ID = c.REASONER_ID AND m.TIMESTAMP = c.TIMESTAMP
JOIN relationalai.observability_preview.logic_reasoner__demand d
  ON m.REASONER_ID = d.REASONER_ID AND m.TIMESTAMP = d.TIMESTAMP
WHERE m.TIMESTAMP >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
ORDER BY m.TIMESTAMP DESC;

Always include a time-range filter. Querying without WHERE timestamp >= ... scans the entire Event Table and incurs high Snowflake compute costs.

Step 1 — Verify Observability Is Active

Before querying metrics, confirm the events view is registered and data is flowing.

CALL relationalai.app.CHECK_EVENTS_VIEW_STATUS();

Run CHECK_EVENTS_VIEW_STATUS() whenever observability views return unexpected or empty results — it diagnoses most configuration issues automatically.

Step 2 — Query Each Metric

Memory Utilization

SELECT REASONER_NAME,
       AVG(MEMORY_UTILIZATION) AS avg_mem,
       MAX(MEMORY_UTILIZATION) AS peak_mem
FROM relationalai.observability_preview.logic_reasoner__memory_utilization
WHERE timestamp >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
GROUP BY REASONER_NAME;

CPU Utilization

SELECT REASONER_NAME,
       AVG(CPU_UTILIZATION) AS avg_cpu,
       MAX(CPU_UTILIZATION) AS peak_cpu
FROM relationalai.observability_preview.logic_reasoner__cpu_utilization
WHERE timestamp >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
GROUP BY REASONER_NAME;

Demand (Queue Pressure)

SELECT REASONER_NAME,
       AVG(DEMAND)   AS avg_demand,
       MAX(DEMAND)   AS peak_demand,
       REASONER_CAPACITY
FROM relationalai.observability_preview.logic_reasoner__demand
WHERE timestamp >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
GROUP BY REASONER_NAME, REASONER_CAPACITY;

Interpret patterns, not isolated spikes. Utilization is naturally spiky. The key question is: does almost every workload run exceed the threshold? Isolated peaks are normal; consistent exceedance across runs is the signal to act.

Step 3 — Health Verdicts and Actions

OK — HEALTHY — No Action

Signals: MEMORY_UTILIZATION < 0.80, CPU_UTILIZATION < 0.85, DEMAND ≤ 1.0 on most runs.

If transactions are still failing despite healthy metrics, the problem is not resource-related — go to Step 4 to diagnose the transaction directly.

CRITICAL — OVERLOADED — Upgrade to Larger Reasoner (Immediate)

Signals (any of):

MEMORY_UTILIZATION > 0.80 on most workload runs
CPU_UTILIZATION consistently > 0.95
CPU_UTILIZATION = 1.0 AND MEMORY_UTILIZATION = 1.0 AND DEMAND > 1.0 (all hard limits hit)

Action: Upgrade reasoner size. No in-place resize exists — delete and recreate.

rai reasoners:suspend --type Logic --name <name>
rai reasoners:delete  --type Logic --name <name>
rai reasoners:create  --type Logic --name <name> --size <larger-size>

ELEVATED — PLAN TO RESIZE — Proactive Warning

Signals: CPU_UTILIZATION consistently 0.85–0.95 (below critical but limited headroom for bursts).

Action: Schedule a resize during a low-traffic window before the next traffic spike. No immediate action required.

WARNING — QUEUING — Review Job Volume / Split Across Reasoners

Signals: DEMAND consistently > 1.0 (more jobs than available queue slots).

Action:

Investigate root cause: is a batch job or burst flooding the queue?
If higher concurrency is genuinely needed, route different job types to separate reasoner instances (send different jobs to different reasoners — do not simply upsize).
Upsizing is not the default fix for queuing — it is a demand and routing problem.

INFORMATIONAL — UNDERUTILIZED — Downsize to Save Cost

Signals: CPU_UTILIZATION < 0.30 AND MEMORY_UTILIZATION never exceeds 0.30 across workload runs.

Action: Downgrade to a smaller reasoner — you are paying for unused capacity.

rai reasoners:suspend --type Logic --name <name>
rai reasoners:delete  --type Logic --name <name>
rai reasoners:create  --type Logic --name <name> --size <smaller-size>

NOMINAL — IDLE — Suspend or Lower Auto-Suspend Threshold

Signals: DEMAND = 0 for extended periods.

Action: Suspend the reasoner or reduce its auto_suspend threshold to stop billing for idle time.

rai reasoners:suspend --type Logic --name <name>

Step 4 — Diagnose a Failed Transaction

Use these procedures when a transaction appears stuck, aborted, or when load errors are reported.

Fetch a Transaction by ID

CALL relationalai.api.get_transaction('<transaction_id>');

Returns the full transaction record including status, owner, start/end timestamps, and error detail.

Get Transaction Problems

-- Problems for any transaction (requires admin-level role)
CALL relationalai.api.get_transaction_problems('<transaction_id>');

-- Problems for transactions you own (end-user role)
CALL relationalai.api.get_own_transaction_problems('<transaction_id>');

Get Load Errors

CALL relationalai.api.get_load_errors('<transaction_id>');

Returns row-level load errors associated with a transaction: source object, error message, and affected row count.

WARNING — Owner-restriction pitfall: If get_transaction_problems returns HTTP 400, check the transaction owner before assuming a permissions misconfiguration:
CALL relationalai.api.get_transaction('<transaction_id>');
Then use the table below to interpret the result.

When get_transaction returns an unexpected column or state code, or when inspecting load errors per row, see transaction-debug.md.

Step 5 — Diagnose CDC / Data Stream Health

WARNING — Auto-quarantine gotcha: A stream that has been in SUSPENDED state for approximately one month will be automatically promoted to QUARANTINED — without creating any rows in data_stream_errors. The absence of error rows does not mean the stream is healthy. Always confirm stream status from cdc_status or data_stream_batches before treating an empty errors result as a clean bill of health.

Find Your Streams (Start Here)

SELECT * FROM relationalai.api.cdc_status;

Key columns: stream_name, stream_status, engine_name, engine_status. Use the stream_name values from this output as '<stream_name>' in the queries below.

Check Batch-Level Status

SELECT stream_name, batch_id, status, error_message, created_at
FROM relationalai.api.data_stream_batches
WHERE stream_name = '<stream_name>'
  AND created_at >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY created_at DESC;

Use a 7-day window rather than 24 hours — a quarantined or long-suspended stream may have had no batches for days, and a 24-hour filter returns empty output indistinguishable from a healthy-but-idle stream.

Check Stream-Level Errors

SELECT *
FROM relationalai.api.data_stream_errors
WHERE stream_name = '<stream_name>'
ORDER BY created_at DESC
LIMIT 50;

For auto-quarantined streams this may return empty — that is expected. Use data_stream_batches status as the authoritative source.

Stream State Verdicts

Resume a Suspended Stream

CALL relationalai.app.resume_cdc('<stream_name>');

Quarantine Recovery

See cdc-recovery.md — Quarantine Recovery Runbook for the full step-by-step recovery checklist, schema reference, and official docs link.

Predictive reasoner stuck in data-index init

Diagnose (in order)

Reasoner status — is the pod still hydrating or actually ready?
```
CALL RELATIONALAI.API.GET_REASONER('predictive', '<reasoner_name>');
```
STATUS=PROVISIONING → wait (1–3 min typical for warm pools; longer on first-time Snowflake-stream attach). STATUS=READY plus a hang → continue to (2) and (3).
CDC stream health — is a stream the predictive reasoner depends on suspended or quarantined?
```
SELECT * FROM relationalai.api.cdc_status;
```
engine_status != READY or stream_status not RUNNING → see § Step 5 — Diagnose CDC / Data Stream Health for the quarantine-recovery / resume_cdc runbook. Predictive jobs cannot proceed until the upstream streams are healthy.
Transaction problems — if a specific transaction id appeared in client logs (often the Failed to pull data into index: transaction was aborted wrapper), pull its problems:
```
CALL RELATIONALAI.API.GET_OWN_TRANSACTION_PROBLEMS('<transaction_id>');
-- or with admin role: GET_TRANSACTION_PROBLEMS('<id>')
```
See § Step 4 — Diagnose a Failed Transaction for the schema reference and owner-restriction pitfall.

Recovery

Predictive train jobs stuck QUEUED

Diagnostic ladder

Long-running predictive jobs are usually fine — distinguish stuck from slow before suspending anything. Use the same ladder as rai-predictive-training § "Training appears stuck":

CALL RELATIONALAI.API.GET_REASONER('predictive', '<name>') → STATUS=READY?
client.jobs.list("Predictive", name="<name>") → is there a RUNNING train job (with rising AGE_MIN), or a QUEUED one going stale?
SHOW EXPERIMENTS IN SCHEMA <exp_db>.<exp_schema> → did a new experiment row append within ~60s of the RUNNING train job?

Recovery

-- 1. Confirm a stuck train job
SELECT ID, STATE, DATEDIFF('minute', CREATED_ON, CURRENT_TIMESTAMP()) AS AGE_MIN
FROM RELATIONALAI.API.JOBS
WHERE STATE IN ('QUEUED','RUNNING')
  AND PAYLOAD LIKE '%"job_type": "train"%'
ORDER BY CREATED_ON ASC;

-- 2. Recycle the worker
CALL RELATIONALAI.API.SUSPEND_REASONER('predictive', '<reasoner_name>');
CALL RELATIONALAI.API.RESUME_REASONER_ASYNC('predictive', '<reasoner_name>');

-- 3. Wait for STATUS=READY, kill the stuck client, then resubmit (re-instantiate GNN(...))
CALL RELATIONALAI.API.GET_REASONER('predictive', '<reasoner_name>');

If recycling doesn't unstick the worker, rebuild on a fresh GPU predictive reasoner — the documented path:

CALL RELATIONALAI.API.DELETE_REASONER('predictive', '<reasoner_name>');
CALL RELATIONALAI.API.CREATE_REASONER_ASYNC('predictive', '<reasoner_name>', 'GPU_NV_S', OBJECT_CONSTRUCT());
-- Poll until STATUS=READY:
CALL RELATIONALAI.API.GET_REASONER('predictive', '<reasoner_name>');
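Polling for STATUS=READY is easier to reason about with an explicit timeout. The following is a minimal sketch, assuming a caller-supplied `get_status` callable that runs the GET_REASONER call and returns the STATUS value as a string; how that value is extracted from the result is not covered here, and the function name, default timeout, and interval are hypothetical.

```python
import time
from typing import Callable


def wait_until_ready(
    get_status: Callable[[], str],
    timeout_s: float = 900.0,     # assumed default, tune to your environment
    interval_s: float = 15.0,     # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_status() until it returns READY or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if get_status().upper() == "READY":
            return True
        if clock() >= deadline:
            return False          # give up instead of polling forever
        sleep(interval_s)
```

Returning False on timeout lets the caller decide whether to escalate, rather than hanging the way a client that polls without a timeout would.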

See rai-predictive-training § Worker not ready to accept jobs for the matching client-side symptom and § Stalled train job: SDK polls without a timeout for stalled-job forensics.

Step 6 — CDC Engine Management

CDC engine ≠ reasoner engine. The CDC pipeline runs on a dedicated managed engine distinct from Logic reasoner engines. alter_cdc_engine_size targets only the CDC engine; the CLI commands (rai reasoners:create/delete) target Logic reasoners only. Do not apply the Step 3 CLI commands to the CDC engine.

Check Current CDC Status

SELECT * FROM relationalai.api.cdc_status;

Key columns: engine_name, engine_size, engine_status, stream_name, stream_status.
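Once the cdc_status rows are fetched, the health check described earlier (engine_status should be READY and stream_status should be RUNNING) can be applied mechanically. This is a sketch under assumptions: rows are plain dictionaries keyed by the column names listed above (in either case), and `unhealthy_cdc_rows` is a hypothetical helper, not an RAI API.

```python
from typing import Dict, Iterable, List, Mapping


def unhealthy_cdc_rows(rows: Iterable[Mapping[str, object]]) -> List[Dict[str, object]]:
    """Return one entry per cdc_status row that is not READY / RUNNING."""
    flagged = []
    for raw in rows:
        # Snowflake connectors often return upper-case column names; normalize.
        row = {key.lower(): value for key, value in raw.items()}
        reasons = []
        if str(row.get("engine_status", "")).upper() != "READY":
            reasons.append("engine_status != READY")
        if str(row.get("stream_status", "")).upper() != "RUNNING":
            reasons.append("stream_status != RUNNING")
        if reasons:
            flagged.append({"stream_name": row.get("stream_name"), "reasons": reasons})
    return flagged
```

An empty result means every row passed both checks; anything else points back to the Step 5 runbook.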

Resize the CDC Engine

CALL relationalai.app.alter_cdc_engine_size('<size>');

The CDC engine is suspended and recreated during a resize — expect brief CDC downtime.

Force-Delete a Stale CDC Engine

Use when the CDC engine is stuck in a non-deletable state:

CALL relationalai.api.delete_engine('CDC_MANAGED_ENGINE', TRUE);

The second argument TRUE enables force deletion. RAI will recreate the CDC engine automatically on the next CDC trigger. Confirm recovery with SELECT * FROM relationalai.api.cdc_status.

WARNING — If delete_engine returns "engine not found" but cdc_status still shows the engine as suspended: this is a control-plane / data-plane desync — the engine record exists in RAI's metadata but the underlying Snowflake engine is gone. No self-serve command resolves this state. Run the following and retain the output for support:
SELECT * FROM relationalai.api.cdc_status;
Then open a support ticket with that output. Do not attempt alter_cdc_engine_size in this state — it will fail or create a duplicate record.
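The desync condition in the warning combines two observations, which the sketch below checks together. It assumes the delete_engine error text and the cdc_status rows have already been captured by the caller; `is_cdc_desync` is a hypothetical helper, and the literal status value SUSPENDED is an assumption about how cdc_status reports a suspended engine.

```python
from typing import Iterable, Mapping


def is_cdc_desync(delete_message: str, cdc_rows: Iterable[Mapping[str, object]]) -> bool:
    """True when delete_engine says 'engine not found' but cdc_status still lists a suspended engine."""
    not_found = "engine not found" in delete_message.lower()
    still_suspended = any(
        # Normalize column-name case before reading engine_status.
        str({k.lower(): v for k, v in row.items()}.get("engine_status", "")).upper() == "SUSPENDED"
        for row in cdc_rows
    )
    return not_found and still_suspended
```

A True result corresponds to the support-ticket path above; it is not a cue to retry the resize or the delete.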

Access Control

Two application roles control access to observability data: observability_viewer for reading it and observability_admin for configuring it. Grant whichever applies:

GRANT APPLICATION ROLE relationalai.observability_viewer TO ROLE <your_role>;
GRANT APPLICATION ROLE relationalai.observability_admin  TO ROLE <your_role>;

Cost Guardrails

Observability views are non-materialized: every query scans the Snowflake Event Table in real time. There is no extra storage cost, but every query consumes Snowflake compute credits.

Cost scales with: event volume × time range × query complexity.

SKILL.md

Summary

Quick Reference

The Three Metric Views (all in OBSERVABILITY_PREVIEW)

Step 1 — Verify Observability Is Active

Step 2 — Query Each Metric

Memory Utilization

CPU Utilization

Demand (Queue Pressure)

Step 3 — Health Verdicts and Actions

OK — HEALTHY — No Action

CRITICAL — OVERLOADED — Upgrade to Larger Reasoner (Immediate)

ELEVATED — PLAN TO RESIZE — Proactive Warning

WARNING — QUEUING — Review Job Volume / Split Across Reasoners

INFORMATIONAL — UNDERUTILIZED — Downsize to Save Cost

NOMINAL — IDLE — Suspend or Lower Auto-Suspend Threshold

Step 4 — Diagnose a Failed Transaction

Fetch a Transaction by ID

Get Transaction Problems

Get Load Errors

Step 5 — Diagnose CDC / Data Stream Health

Find Your Streams (Start Here)

Check Batch-Level Status

Check Stream-Level Errors

Stream State Verdicts

Resume a Suspended Stream

Quarantine Recovery

Predictive reasoner stuck in data-index init

Diagnose (in order)

Recovery

Predictive train jobs stuck QUEUED

Diagnostic ladder

Recovery

Step 6 — CDC Engine Management

Check Current CDC Status

Resize the CDC Engine

Force-Delete a Stale CDC Engine

Access Control

Cost Guardrails

Common Pitfalls

Reference files

Related Skills

RelationalAI/rai-predictive-training

RelationalAI/rai-predictive-modeling

RelationalAI/rai-setup

RelationalAI/rai-rules-authoring
