plugins/rai/skills/rai-build-starter-ontology/SKILL.md
Walks through building a first RAI ontology from Snowflake tables or local data samples. Use when creating a new RAI model, starting a proof of concept, or onboarding a new dataset.
npx skillsauth add RelationalAI/rai-agent-skills rai-build-starter-ontologyInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
What: Workflow for building a minimal working RAI ontology from raw data.
When to use:
When NOT to use:
rai-ontology-designrai-ontology-designrai-pyrel-codingrai-queryingOverview:
inspect.schema() summary so the user sees what actually registeredRequired dependency: rai-ontology-design is the authority for all design decisions. This skill is the workflow. Use rai-pyrel-coding for syntax and data-loading.md for data binding patterns.
Start from the business domain — what concepts exist, what questions must the model answer — then find data mappings. Domain-first modeling produces better models than table-to-concept mapping.
Write and maintain a document to help people onboard and understand the decisionmaking process. Document your findings and decisions for each step.
Interaction mode: Before starting, ask the user which mode they prefer:
The user knows their domain better than the data does — guided mode lets them steer you through subtleties that aren't obvious from the schema alone.
Before looking at any data, define:
Keep the first version to ~10-15 must-have properties.
| Goal | In scope | Out of scope | |---|---|---| | Identify delayed orders | Orders, shipments, delay timestamps | Returns, carrier contracts, inventory |
Data loading decision: Use model.Table() for Snowflake-backed data (any size, production-ready). Use model.data() for prototyping with DataFrames only (≤ hundreds of rows) or inline scenario data not in Snowflake.
For connection setup, see rai-setup. Run discovery queries via the snow CLI or a Snowpark session:
from relationalai.config import SnowflakeConnection, create_config
from snowflake import snowpark
session: snowpark.Session = create_config().get_session(SnowflakeConnection)
session.sql("""
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE,
NUMERIC_PRECISION, NUMERIC_SCALE
FROM <database>.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = '<schema>'
ORDER BY TABLE_NAME, ORDINAL_POSITION
""").show()
Keep this result available throughout the workflow. The
DATA_TYPEandNUMERIC_SCALEcolumns are the authoritative source for deriving RAI property types in Step 6. When declaring each property, look up its source column in this result and use the type mapping table (Step 5) to pick the correct RAI type. Do not infer types from column names, CSV samples, or example ontologies.
model.data() is for prototyping only (hundreds of rows). For production, load into Snowflake and use model.Table().
Identify data shape first. Source tables typically arrive in one of two shapes, and the distinction drives whether downstream steps derive metrics or bind them directly:
(entity_i, entity_j, value) matrices). Metrics arrive ready-to-bind; do not re-aggregate already-aggregated values.Two tables in the same schema can be different shapes — classify each independently. For long-form pairwise data specifically, see the pairwise value matrix example in examples.md.
Analyze source data per rai-ontology-design § Design Decision Sequence, step 1 (Analyze sources). Note:
_ID suffixes (likely PKs), columns matching other tables' PKs (likely FKs)IS_/HAS_ prefixes (boolean flags), repeated string categories (enums)Basic EDA:
-- Row count and PK uniqueness
SELECT COUNT(*), COUNT(DISTINCT <pk_col>) FROM <database>.<schema>.<table>;
-- FK cardinality
SELECT COUNT(DISTINCT <fk_col>), COUNT(*) FROM <database>.<schema>.<table>;
-- Null rate
SELECT COUNT(*) AS total, COUNT(<col>) AS non_null,
ROUND(1 - COUNT(<col>) / COUNT(*), 2) AS null_rate
FROM <database>.<schema>.<table>;
-- Value distribution for enum/category columns
SELECT <col>, COUNT(*) AS cnt FROM <database>.<schema>.<table>
GROUP BY <col> ORDER BY cnt DESC LIMIT 20;
-- Subtype discovery: look for TYPE/CATEGORY/CLASS columns with few values
-- that partition entities into fundamentally different kinds
-- e.g., BUSINESS_TYPE with values 'Supplier','Customer' → subtypes of Business
SELECT <type_col>, COUNT(*) AS cnt,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM <database>.<schema>.<table>
GROUP BY <type_col> ORDER BY cnt DESC;
-- Relationship multiplicity: determine 1:1, 1:N, or M:N
-- Run for each FK to understand the join shape
SELECT
ROUND(COUNT(*) * 1.0 / NULLIF(COUNT(DISTINCT <fk_col>), 0), 2) AS avg_rows_per_fk,
MAX(cnt) AS max_rows_per_fk
FROM (
SELECT <fk_col>, COUNT(*) AS cnt
FROM <database>.<schema>.<table>
GROUP BY <fk_col>
);
For graph/network models — trace the topology before modeling:
When your data describes a network (nodes connected by edges via FK chains), trace the full path before Step 3:
-- What node types connect to what? Reveals tiers and layers.
SELECT source.TYPE AS from_type, dest.TYPE AS to_type, COUNT(*) AS edges
FROM <edge_table> e
JOIN <node_table> source ON e.SOURCE_ID = source.ID
JOIN <node_table> dest ON e.DEST_ID = dest.ID
GROUP BY from_type, to_type ORDER BY edges DESC;
-- Check for NULL FKs that create invisible network gaps
SELECT COUNT(*) AS broken_edges FROM <edge_table> WHERE DEST_ID IS NULL OR SOURCE_ID IS NULL;
This reveals hidden layers, missing connections, and whether you need intermediate concepts to bridge tiers.
Work domain-first: brainstorm business entities, then map to tables. Each concept must map to an authoritative source with a clear identity. Refer to rai-ontology-design § Concept Design Principles.
Check for subtypes: If Step 2 EDA found TYPE/CATEGORY columns that partition a concept into fundamentally different kinds (e.g., a BUSINESS_TYPE column with values Supplier and Customer), model these as subtypes using extends. Use subtypes when each kind carries its own properties or relationships — not for every enum column.
Validate:
Identity hygiene. Prefer minimal identity (1-2 natural-key fields). Oversized identity is a signal: usually it means a surrogate ID was missed, or the entity is really a measurement that belongs as a Relationship payload on a smaller concept — both cases inflate the model and complicate downstream queries. If the only natural identity you can find is >3 fields, pause and ask: (a) does a surrogate ID exist in the source table that you missed?, (b) is this really a domain concept, or a measurement / derived row that belongs as a Relationship payload on a smaller concept? When the answer to both is no — keep the compound identity, but flag the concept as a candidate for scope cuts in Step 1.
Choose Property or Relationship per link based on the Step 2 multiplicity results:
Order.customer = model.Property(f"{Order} placed by {Customer:customer}")).{Entity} is <flag>, unary).Anti-bundle rule. Each scalar attribute should be its own Property. If you find yourself packing more than two unrelated scalars into a single Relationship, split them into individual Property declarations. Bundles become un-queryable: you can't filter or aggregate on a single field. Reserve multi-field Relationships for genuinely co-occurring data (e.g., a date range {Promotion} active from {DateTime:start} to {DateTime:end}).
Same-type-slot disambiguation. When a Relationship has two slots of the same concept type (e.g., {Location} ... {Location}), the slots are roles and must be distinguished. Either (a) split into two Property declarations with role short_names, or (b) label both slots: f"{Lane} from {Location:origin} to {Location:destination}". Without this, model.define(...) will silently bind both slots to whichever column you list first, collapsing source and destination into the same entity. Verify with a Step 7c query that origin ≠ destination on at least one row.
See rai-ontology-design § Property vs. Relationship and rai-pyrel-coding § Properties and Relationships for the full decision rule.
Validate the proposed design against source data before coding:
| Check | What to confirm | |-------|-----------------| | Identity columns | Exist in source and uniquely identify rows | | Associations (Property/Relationship) | Valid FK or join key exists, multiplicity determined (1:1 / N:1 → Property; 1:N / M:N → Relationship) | | Properties | Source column exists with compatible data type | | Concept grounding | Every concept maps to an authoritative source | | Orthogonality | No two concepts represent the same entity set |
Verify that every property's RAI type matches its Snowflake source column. Use this mapping:
| Snowflake type | RAI type | Notes |
|---|---|---|
| VARCHAR, TEXT, STRING | String | |
| NUMBER with scale > 0 (e.g., NUMBER(18,2)) | Float | Check NUMERIC_SCALE in INFORMATION_SCHEMA |
| NUMBER, INT, INTEGER (scale = 0) | Integer | |
| FLOAT, DOUBLE, REAL | Float | |
| DATE | Date | |
| TIMESTAMP_NTZ, TIMESTAMP_LTZ, TIMESTAMP | DateTime | |
| BOOLEAN | Boolean property, or unary Relationship for flag-style patterns | |
Run this query to pull actual column types for comparison:
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, NUMERIC_PRECISION, NUMERIC_SCALE
FROM <database>.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = '<schema>'
AND TABLE_NAME IN ('<table1>', '<table2>')
ORDER BY TABLE_NAME, ORDINAL_POSITION;
Always run this query before writing property declarations. A single type mismatch between any RAI property and its Snowflake column causes a TyperError that blocks ALL queries on the entire model — not just the mismatched property — with no indication of which property failed.
Common mismatches to check:
signup_date, created_at) may be stored as TEXT in Snowflake, especially when loaded from CSV. Check the actual type; use String if the column is TEXT.ship_date, scheduled_arrival_date) may be stored as TIMESTAMP_NTZ (or _LTZ/_TZ), not DATE. Use DateTime, not Date, when INFORMATION_SCHEMA.COLUMNS.DATA_TYPE returns a TIMESTAMP variant. A Date declaration against a TIMESTAMP_NTZ column raises TyperError at first query, blocking ALL queries on the model.NUMBER(38,0) maps to Integer, but NUMBER(18,4) maps to Float. Always check NUMERIC_SCALE; if scale > 0, use Float. If a NUMBER(38,0) column holds values that are conceptually continuous (e.g., CAPACITY_MW, COST_MILLION), flag this to the user — the DDL may need FLOAT or NUMBER(18,2) instead.Boolean property or unary Relationship(f"... is <flag>"). Both work; prefer Relationship when the flag semantics read naturally (e.g., "Order is urgent").CREDIT_CARD_NUMBER or PHONE may be NUMBER, not TEXT. Let the schema dictate the type, not the name.GATE: Do not proceed until Step 5 type validation is complete.
Use the type mapping table and INFORMATION_SCHEMA.COLUMNS output from Step 5 to derive every property type.
Follow conventions in rai-pyrel-coding and rai-ontology-design. Put everything in a single file named after the domain:
<domain>.py # e.g., supply_chain.py, fraud.py, inventory.py
A single file is easiest to iterate on. When the model grows (multiple reasoners, derived layers, shared computed logic), split into a package — see rai-ontology-design § Layering Principles.
If you do split into a package: Never name the directory
model/if yourModelvariable is also calledmodel. Python resolvesimport model.xxxto the directory, shadowing the variable. Use a domain-specific name (e.g.,sc_model/,fraud_model/).
Add validation queries to the bottom of your <domain>.py file to confirm data loaded correctly before answering scoped questions. Fix any import errors or empty results before considering the starter ontology complete. For query syntax, see rai-querying.
Note: Import aggregates with
from relationalai.semantics.std import aggregates.
7a — Spot-check property types against schema before running any queries. For each BOOLEAN, DATE, and NUMBER column in the Step 2 schema output, confirm the corresponding property declaration uses the correct RAI type. These three are the most common sources of silent type mismatches (TyperError at query time with no indication of which property failed). Fix any mismatch before proceeding.
7b — Count instances per concept to confirm data binding loaded rows:
from relationalai.semantics.std import aggregates
df = model.select(aggregates.count(MyConcept).alias("count")).to_df()
print(df)
# Expect: count matches source table row count. Zero means data binding failed.
Note:
aggregates.count(C)on a concept with zero instances returns an empty DataFrame (no rows), not a DataFrame withcount=0— the underlying relation is empty so the aggregation has no rows to reduce over. For chained workflows where a placeholder concept is populated by a downstream step, useinspect.schema()membership to verify the concept is declared rather thancount()to verify data.
7c — Verify relationships to confirm FK joins resolved:
df = model.select(
ConceptA.id.alias("a_id"),
ConceptB.id.alias("b_id"),
).where(
ConceptA.my_relationship(ConceptB)
).to_df()
print(f"Linked pairs: {len(df)}")
# Expect: non-empty results. Empty means the FK join didn't match.
7d — Answer scoped questions from Step 1:
df = model.select(
MyConcept.id.alias("id"),
MyConcept.some_property.alias("value"),
).where(
MyConcept.some_property > threshold
).to_df()
print(df)
# Each scoped question from Step 1 should be answerable with a query like this.
7e — Report what actually registered. After the model runs cleanly, emit an inspect.schema() summary for the user. This is the canonical "here's what got built" artifact — distinct from "here's what I intended to build."
from relationalai.semantics import inspect
schema = inspect.schema(model)
# Full dump for small models
print(schema)
# Targeted per-concept inspection for larger models
for concept_name in scoped_concepts:
c = schema[concept_name]
idents = ", ".join(f"{f.name}:{f.type_name}" for f in c.identify_by)
print(f"{concept_name} [id: {idents}], extends={c.extends}")
for prop in c.properties:
print(f" .{prop.name}: {prop.type_name}")
for rel in c.relationships:
print(f" ~{rel.name}: {rel.reading}")
# Data sources — tables and inline data (schema.tables includes both model.Table() and model.data())
print(f"Tables: {[t.name for t in schema.tables]}")
This is the trust-building step. inspect.schema() enriches the summary with table-backed type information — for properties created via Concept.new(table.to_schema()), it infers and reports concrete types (Integer, String, Date) from the backing table even when the frontend model still types them as Any. The engine produces correctly-typed output regardless (it reads the backing data), but the inspect summary gives you a human-readable view of what the data actually carries. See rai-querying/references/inspect-module.md.
Mapped vs mappable diff. Step 7e shows what's in model. To also surface columns that exist in the backing source but aren't mapped yet (the MODEL_GAP candidates for enrichment), diff inspect.schema() against the source table columns. See rai-discovery § Computing the classification from inspect.schema() for the set-difference pattern.
| Mistake | Cause | Fix |
|---------|-------|-----|
| Schema-driven names (CUST_TABLE, ORD_AMT) | Copying column/table names from schema | Use business domain names (Customer, amount) |
| Boolean columns as String | Guessing types from column names or example ontologies instead of Snowflake schema | Always derive types from INFORMATION_SCHEMA.COLUMNS DATA_TYPE; BOOLEAN → Boolean property or unary Relationship |
| Boolean columns as Boolean property | Using a scalar property when the column represents a semantic flag | Prefer unary Relationship(f"... is active") for flag-style booleans |
| Modeling every column | No scoping step | Only model columns relevant to scoped questions |
| model.data() for large datasets | Treating CSV/DataFrame as production-ready | Prototyping only (≤ hundreds of rows). Use model.Table() |
| Wrong Snowflake table path | Incorrect database, schema, or table name | Verify with SHOW TABLES IN SCHEMA <db>.<schema> |
| "Object does not exist" on valid table | Snowflake role lacks access | Check SELECT CURRENT_ROLE() and SHOW GRANTS ON <object> |
| Model won't load or sync | Engine not running or misconfigured connection | See rai-setup and rai-health |
| SnowflakeChangeTrackingNotEnabledException on first query | model.Table() requires change tracking on each source table | Set data.ensure_change_tracking: true in raiconfig.yaml (see rai-setup/references/raiconfig-yaml.md; requires OWNERSHIP on tables), or run ALTER TABLE <db>.<schema>.<table> SET CHANGE_TRACKING = TRUE on each source table. Note: there is no Model(ensure_change_tracking=...) constructor kwarg |
| First model query hangs 5–10 min with only an Initializing data index spinner | Combined cold-start: per-table CDC stream creation + initial sync (one stream per model.Table()) and, if reasoners.logic.name is unset or doesn't match an existing READY reasoner, logic-engine provisioning | Before the first query: (1) set reasoners.logic.name in raiconfig.yaml to an existing READY engine — see rai-setup/references/raiconfig-yaml.md and engine-management.md (warm reasoners eliminate logic cold-start); (2) pre-enable CHANGE_TRACKING = TRUE on every source table (or set data.ensure_change_tracking: true). Initial stream sync is unavoidable but completes faster when these run in parallel |
| Stale data source cleanup takes minutes on re-run | Re-using a model name after changing data bindings triggers stale source removal | Use a fresh model name during rapid iteration, or wait for cleanup to complete |
| Skipping scope | Starting from tables instead of questions | Complete Step 1 before Step 2 |
| Reference | Description | File | |-----------|-------------|------| | Starter ontology examples | Build patterns: Snowflake tables, CSV, derived concepts, junction concepts, self-referential hierarchies, pairwise matrices, portable source paths | examples.md |
After your starter ontology validates and queries correctly, the typical next steps are:
Build starter ontology → Discover problems (rai-discovery) → Enrich model (rai-ontology-design) → Formulate (rai-prescriptive-problem-formulation or rai-graph-analysis)
| Step | Skill | What it does |
|------|-------|-------------|
| 1. Discover what problems your model can answer | rai-discovery | Surfaces questions by reasoner type (prescriptive, graph, rules, predictive) and assesses data readiness |
| 2. Enrich the model for a selected problem | rai-ontology-design § Enrichment Workflow | Adds properties and relationships from unmapped source columns needed by the problem |
| 3. Identify and fill model gaps | rai-ontology-design § Model Gap Identification | Classifies gaps as READY / MODEL_GAP / DATA_GAP and prescribes fixes |
| 4. Add advanced patterns (as needed) | rai-ontology-design § reference files | Subtypes, enums, time-indexed properties, hierarchies, cross-product concepts, model composition |
data-ai
Configure and train graph neural network (GNN) models, generate predictions, evaluate results, and manage trained models. Use when ready to train, generate predictions, evaluate, or manage models; for concepts, data loading, edges, and feature configuration, see `rai-predictive-modeling`.
development
Build graph neural network (GNN) models — concepts, Snowflake data loading, task relationships, graph edges, and PropertyTransformer features. Use for node classification, regression, and link prediction tasks; for training, predictions, and evaluation, see `rai-predictive-training`.
development
Setup and configuration for RelationalAI — first-time install walkthrough and all raiconfig.yaml tuning. Use when installing RAI, connecting to Snowflake, or editing raiconfig.yaml. Not for writing PyRel model code (see rai-pyrel-coding) or solver usage and diagnostics (see rai-prescriptive-solver-management).
testing
Converts natural language business rules into PyRel derived properties — validation, classification, derivation, alerting, and reconciliation. Use whenever a task assigns each entity a new tier, segment, score, or flag, or derives a new property; author it here as a derived property, then query it with rai-querying.