plugins/rai/skills/rai-predictive-modeling/SKILL.md
Build graph neural network (GNN) models — concepts, Snowflake data loading, task relationships, graph edges, and PropertyTransformer features. Use for node classification, regression, and link prediction tasks; for training, predictions, and evaluation, see `rai-predictive-training`.
Early access. The RAI predictive reasoner (GNN) is in early access — APIs, engine requirements, and behavior may change. Confirm the latest surface with the RelationalAI team before production use.
What: Data modeling workflow for GNN pipelines -- from imports through graph construction and feature configuration.
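When to use: node classification, regression, and link prediction tasks.
When NOT to use: for training, predictions, and evaluation, see `rai-predictive-training`; for graph algorithms such as centrality and community detection, see `rai-graph-analysis`.
Overview: 6 steps: imports -> concepts -> populate -> task relationships -> graph -> features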
GNN training writes experiment artifacts to a Snowflake schema. Create a database and schema you own, then grant the RELATIONALAI native app the four required privileges:
CREATE DATABASE IF NOT EXISTS <YOUR_DB>;
CREATE SCHEMA IF NOT EXISTS <YOUR_DB>.<YOUR_SCHEMA>;
GRANT USAGE ON DATABASE <YOUR_DB> TO APPLICATION RELATIONALAI;
GRANT USAGE ON SCHEMA <YOUR_DB>.<YOUR_SCHEMA> TO APPLICATION RELATIONALAI;
GRANT CREATE EXPERIMENT ON SCHEMA <YOUR_DB>.<YOUR_SCHEMA> TO APPLICATION RELATIONALAI;
GRANT CREATE MODEL ON SCHEMA <YOUR_DB>.<YOUR_SCHEMA> TO APPLICATION RELATIONALAI;
All four grants are required. Then pass the same database and schema to the GNN constructor:
gnn = GNN(
exp_database="<YOUR_DB>",
exp_schema="<YOUR_SCHEMA>",
...
)
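The setup statements above differ only in the database and schema names, so they can be generated rather than hand-edited. The following is a minimal sketch, not part of the skill itself: the helper name `experiment_grant_statements` and the example names `ANALYTICS_DB` / `GNN_EXPERIMENTS` are hypothetical, and the helper does no identifier quoting or validation.

```python
def experiment_grant_statements(database: str, schema: str, app: str = "RELATIONALAI") -> list[str]:
    """Return the setup statements from the section above for one database/schema pair."""
    fq_schema = f"{database}.{schema}"
    return [
        f"CREATE DATABASE IF NOT EXISTS {database};",
        f"CREATE SCHEMA IF NOT EXISTS {fq_schema};",
        # The four grants the native app needs.
        f"GRANT USAGE ON DATABASE {database} TO APPLICATION {app};",
        f"GRANT USAGE ON SCHEMA {fq_schema} TO APPLICATION {app};",
        f"GRANT CREATE EXPERIMENT ON SCHEMA {fq_schema} TO APPLICATION {app};",
        f"GRANT CREATE MODEL ON SCHEMA {fq_schema} TO APPLICATION {app};",
    ]


# Hypothetical names, for illustration only.
statements = experiment_grant_statements("ANALYTICS_DB", "GNN_EXPERIMENTS")
print("\n".join(statements))
```

The same two names are then the values to pass as `exp_database` and `exp_schema`.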
relationalai package version
The predictive submodule (`relationalai.semantics.reasoners.predictive`) is not in every published `relationalai` release: `from relationalai.semantics.reasoners.predictive import GNN` raises `ModuleNotFoundError` on releases that pre-date it. Pin a release that ships the submodule (or install from the development branch when iterating against unreleased changes).
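One way to detect a release without the submodule before any GNN code runs is to look the module up with the standard library. This is a sketch using `importlib.util.find_spec`; the helper name `module_available` is hypothetical, and the check only reports whether the module can be located, not whether its contents match what this skill expects.

```python
import importlib.util

PREDICTIVE_MODULE = "relationalai.semantics.reasoners.predictive"


def module_available(dotted_name: str) -> bool:
    """True if the module can be located, without importing the module itself."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. `relationalai`) is not installed.
        return False


if module_available(PREDICTIVE_MODULE):
    print("predictive submodule found")
else:
    print("predictive submodule missing: pin a relationalai release that ships it")
```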
A GNN workflow runs against two distinct reasoner engines that must both be READY:
| Reasoner | Handles | Why it matters here |
|----------|---------|---------------------|
| Logic | model.data() / Table().to_schema() ingest, all PyRel queries (including select(...) over Source.predictions), data exports back to Snowflake | The data pipeline that feeds the GNN and reads predictions back is Logic-engine work |
| Predictive | gnn.fit() training, gnn.predictions() inference, experiment + model-registry writes | Where the actual GNN training and inference happen |
When training "hangs" or queries are slow, the first question is which engine — they have separate sizes, separate STATUS, separate auto-suspend timers. rai-health § Predictive train jobs stuck QUEUED covers the Predictive side; the Logic-engine ladder lives in rai-health Steps 1–3.
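The "which engine" question can be turned into a small pre-flight check. The sketch below assumes you have already collected a status string per reasoner type by whatever means your environment provides. The helper name is hypothetical, and any status value other than `READY` (such as `SUSPENDED` here) is a placeholder rather than a documented status.

```python
def engines_not_ready(statuses: dict[str, str]) -> list[str]:
    """Return the required reasoner types whose status is anything other than READY."""
    required = ("logic", "predictive")
    return [
        name
        for name in required
        if statuses.get(name, "MISSING").upper() != "READY"
    ]


# Hypothetical status snapshot; "SUSPENDED" is a placeholder value.
snapshot = {"logic": "READY", "predictive": "SUSPENDED"}
blocked = engines_not_ready(snapshot)
print(blocked)
```

An empty list means both engines are usable; a non-empty list tells you which side to investigate first.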
Use a GPU compute type for the Predictive reasoner. The canonical provisioning shape:
CALL RELATIONALAI.API.CREATE_REASONER_ASYNC(
'predictive',
'<reasoner_name>',
'GPU_NV_S',
OBJECT_CONSTRUCT() -- {} — accept all defaults; or pass auto_suspend_mins, settings, …
);
-- Poll until STATUS=READY (1–3 minutes typical):
CALL RELATIONALAI.API.GET_REASONER('predictive', '<reasoner_name>');
GPU_NV_S is faster per epoch on the GNN training job and is the recommended default for predictive workloads. HIGHMEM_X64_S / _M / _L are also valid sizes for the predictive reasoner, but GPU is the path the platform team recommends; pick it unless you have a specific reason not to.
The rai reasoners:create --type Predictive --size GPU_NV_S CLI form may report an allow-list error (Allowed sizes: HIGHMEM_X64_S, HIGHMEM_X64_M, HIGHMEM_X64_L) on older client versions — the validation list (relationalai/services/reasoners/constants.py::REASONER_SIZES_AWS) trails the backend's AWSEngineSize Literal in config_reasoners_fields.py. The SQL CREATE_REASONER_ASYNC call above is the canonical fall-through; both reach the same backend.
Confirm current sizing options with the RelationalAI team — pool capacity and recommendations evolve.
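The "poll until STATUS=READY" step in the provisioning call above can be sketched as a loop with a timeout. Here `fetch_status` is a hypothetical callable that you would implement to run the `GET_REASONER` call and extract the status. This document does not specify the shape of that result, so the extraction is deliberately left out, and the default timeout and interval are arbitrary choices.

```python
import time
from typing import Callable


def wait_until_ready(
    fetch_status: Callable[[], str],
    timeout_s: float = 300.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll fetch_status() until it returns READY or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if fetch_status().upper() == "READY":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Demo with a fake status source that becomes READY on the third poll.
_responses = iter(["PENDING", "PENDING", "READY"])
ready = wait_until_ready(lambda: next(_responses), interval_s=0.0)
print(ready)
```

Returning `False` on timeout, rather than raising, leaves the caller free to decide whether a slow start is fatal.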
# Imports
from relationalai.semantics import Model, select, define, Integer, String, Any
from relationalai.semantics.reasoners.graph import Graph
from relationalai.semantics.reasoners.predictive import PropertyTransformer
model = Model("<model_name>")
Concept, Table, Relationship = model.Concept, model.Table, model.Relationship
| Pattern | Code |
|---------|------|
| Single PK | User = Concept("User", identify_by={"user_id": Integer}) |
| Composite PK | Class = Concept("Class", identify_by={"courseid": Integer, "year": Integer}) |
| No PK (e.g. task table) | TrainTable = Concept("TrainTable") |
# Graph init
gnn_graph = Graph(model, directed=True, weighted=False)
Edge = gnn_graph.Edge
# PropertyTransformer
pt = PropertyTransformer(
category=[User.locale, User.gender],
continuous=[User.birthyear],
datetime=[User.joinedAt, Event.start_time],
time_col=[Event.start_time],
)
from relationalai.semantics import Model, select, define, Integer, String, Any
from relationalai.semantics.reasoners.graph import Graph
from relationalai.semantics.reasoners.predictive import PropertyTransformer
model = Model("<model_name>")
Concept, Table, Relationship = model.Concept, model.Table, model.Relationship
Additional type imports as needed: Date, DateTime, Float.
User-input boundary: the only things you need from the user are the 3 inputs in `references/auto-discovery.md` -- source table FQNs, task table FQNs, and the experiment tracking database and schema. Auto-derive PKs, FKs, columns, types, edges, task type, and timestamp candidates from Snowflake schema introspection. Use the in-skill `get_table_schema(table_name, database, schema)` helper in `references/auto-discovery.md` as the default schema source before any manual SQL fallback. Don't ask the user for column-level details.
Two concept categories show up in a GNN pipeline, distinguished by their role in the graph:
| Category | Role |
|----------|------|
| Graph (node) | Source, target, or other node entities the GNN reasons over -- can carry features and time_col |
| Task table | Holds train/val/test split rows, joined to a graph concept by FK -- not used in edges; not a feature source |
identify_by is not required by the GNN pipeline. Pass it when you want to declare an explicit primary key for a graph concept (matches a Snowflake column); omit it for task tables and for graph concepts where you don't need an explicit PK.
If you have an existing ontology from `rai-build-starter-ontology`, create a new `Model` for the GNN pipeline.
The identify_by key names must exist as columns in the Snowflake table. Column-name matching is case-insensitive in both identify_by keys and property accesses -- a Snowflake column FOO_BAR can be referenced as Concept.foo_bar, Concept.FOO_BAR, or any other casing. Spelling still has to match exactly. Check INFORMATION_SCHEMA.COLUMNS or run DESCRIBE TABLE to confirm the columns before writing identify_by or property accesses.
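The matching rule above (case-insensitive, but spelling must match) can be checked before writing any concept. This is a plain-Python sketch that assumes you already have the column names, for example copied from `DESCRIBE TABLE` output; the helper name and the sample column list are hypothetical.

```python
def missing_identify_by_keys(identify_by_keys, table_columns):
    """Return identify_by keys with no case-insensitive match among table_columns."""
    known = {col.casefold() for col in table_columns}
    return [key for key in identify_by_keys if key.casefold() not in known]


# Hypothetical column list, e.g. copied from DESCRIBE TABLE output.
columns = ["USER_ID", "LOCALE", "GENDER", "BIRTHYEAR"]

print(missing_identify_by_keys(["user_id"], columns))  # [] (casing differs, spelling matches)
print(missing_identify_by_keys(["userid"], columns))   # ['userid'] (spelling differs)
```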
User = Concept("User", identify_by={"user_id": Integer})
Event = Concept("Event", identify_by={"event_id": Integer})
Task table concepts have no identify_by:
train_table_concept = Concept("TrainTable")
val_table_concept = Concept("ValidationTable")
test_table_concept = Concept("TestTable")
define(Customer.new(Table("DB.SCHEMA.CUSTOMERS").to_schema()))
define(train_table_concept.new(Table("DB.TASKS.TRAIN").to_schema()))
The GNN pipeline expects pre-existing train/val/test split tables in Snowflake. Each split table must contain: a join key column matching a source concept PK, a label/target column (train/val only), and optionally a timestamp column.
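The split-table requirements above lend themselves to a quick column check per split. The sketch below works on plain lists of column names and makes no Snowflake calls. The helper name, the argument names, and the sample columns are hypothetical, and it checks only for the presence of columns, not their types.

```python
def split_table_problems(split, columns, join_key, label=None, timestamp=None):
    """List problems with one split table's columns.

    split: "train", "val", or "test"; columns: column names of the split table.
    """
    present = {c.casefold() for c in columns}
    problems = []
    if join_key.casefold() not in present:
        problems.append(f"{split}: missing join key {join_key}")
    # Only train and val carry the label/target column.
    if split in ("train", "val"):
        if label is None or label.casefold() not in present:
            problems.append(f"{split}: missing label/target column")
    # The timestamp is optional; check it only when one is expected.
    if timestamp is not None and timestamp.casefold() not in present:
        problems.append(f"{split}: missing timestamp column {timestamp}")
    return problems


train_ok = split_table_problems("train", ["USER_ID", "LABEL", "TS"], "user_id", "label", "ts")
val_bad = split_table_problems("val", ["USER_ID", "TS"], "user_id", "label", "ts")
test_ok = split_table_problems("test", ["USER_ID", "TS"], "user_id", timestamp="ts")
print(train_ok, val_bad, test_ok)
```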
PropertyTransformer and the task-table pattern also work with concepts populated from local data via model.data(df) -- not just Table(...).to_schema(). Useful when some concept data lives in local CSVs (e.g. optimizer parameters) while the graph comes from Snowflake.
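Relationships encode the task structure using a template string with up to three parts, in this order: the source concept, an optional timestamp (`at {Any:ts}`), and the label/target (the `has` clause, which the test Relationship omits).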
For per-task-type Relationship Arity Rules and full code examples, see references/task-relationships.md.
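The mistakes table later in this document gives the positional order for populating a task Relationship as source, [timestamp], [label/target], with the label/target omitted for test data. A simplified sketch of that ordering rule follows. The helper is hypothetical and does not replace the per-task-type arity rules in `references/task-relationships.md`.

```python
def expected_positional_args(split: str, has_timestamp: bool) -> list[str]:
    """Ordered slots for a define(...) call that populates a task Relationship."""
    slots = ["source"]
    if has_timestamp:
        slots.append("timestamp")
    if split in ("train", "val"):
        slots.append("label/target")  # test data must not carry the answer
    return slots


print(expected_positional_args("train", has_timestamp=True))
print(expected_positional_args("test", has_timestamp=True))
```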
gnn_graph = Graph(model, directed=True, weighted=False)
Edge = gnn_graph.Edge
define(Edge.new(src=Interaction, dst=User)).where(
Interaction.user_id == User.user_id)
Self-referential edges need `.ref()` for the destination:
PostRef = Post.ref()
define(Edge.new(src=Post, dst=PostRef)).where(
PostRef.parent_id == Post.id)
PeopleRef = People.ref()
define(Edge.new(src=People, dst=PeopleRef)).where(
People.Id == Related.person1,
PeopleRef.Id == Related.person2,
)
BB1Edge = Concept("BB1Edge", extends=[Edge])
BB2Edge = Concept("BB2Edge", extends=[Edge])
Bref = B.ref()
define(BB1Edge.new(src=B, dst=Bref)).where(B.field1 == Bref.id)
define(BB2Edge.new(src=B, dst=Bref)).where(B.field2 == Bref.id)
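Because the mistakes table warns against topologies made only of same-entity edges, it can help to list the (source, destination) concept pairs of the edges you have defined and check that at least one pair crosses concept types. This is a conservative plain-Python heuristic, not a RAI API. It flags any all-self-edge topology, which is broader than the temporal-lag case the warning describes, and the concept names are taken from the examples above purely for illustration.

```python
def has_heterogeneous_edge(edge_types):
    """True if at least one (src, dst) pair connects two different concept types."""
    return any(src != dst for src, dst in edge_types)


# Concept-name pairs standing in for the Edge definitions above.
mixed = [("Interaction", "User"), ("Post", "Post")]
self_only = [("Post", "Post"), ("B", "B")]

print(has_heterogeneous_edge(mixed))      # True
print(has_heterogeneous_edge(self_only))  # False
```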
The PropertyTransformer annotates concept fields with their semantic types for the GNN.
pt = PropertyTransformer(
category=[User.locale, User.gender, Event.city, Event.state, Event.country],
datetime=[User.joinedAt, Event.start_time],
continuous=[User.birthyear],
time_col=[Event.start_time],
)
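In the example above, `Event.start_time` appears in both `datetime` and `time_col`, and the mistakes table requires exactly that. A minimal pre-flight sketch of the rule follows. It uses plain strings in place of the real property objects, and the helper name is hypothetical.

```python
def time_cols_missing_from_datetime(datetime_fields, time_col_fields):
    """Return time_col entries that are not also listed under datetime."""
    listed = set(datetime_fields)
    return [field for field in time_col_fields if field not in listed]


# Strings stand in for property objects such as Event.start_time.
datetime_fields = ["User.joinedAt", "Event.start_time"]

print(time_cols_missing_from_datetime(datetime_fields, ["Event.start_time"]))  # []
print(time_cols_missing_from_datetime(datetime_fields, ["Event.end_time"]))    # ['Event.end_time']
```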
| Data type | Annotation |
|-----------|-----------|
| Boolean flags, enum/status codes | category |
| Ages, prices, ratings | continuous |
| Free-form text, names, descriptions | text |
| Dates, timestamps | datetime |
| Explicit integer values (not IDs) | integer |
The integer parameter is a distinct type from continuous -- use it for whole-number counts or ordinal values where float precision is not meaningful (e.g. review counts, position ranks):
pt = PropertyTransformer(
integer=[Review.num_votes, Standing.position],
continuous=[Review.rating, Result.points],
...
)
- Exclude fields with `drop`, e.g. `drop=[Study.nct_id, Outcome.id, Outcome.nct_id, ...]`.
- Be selective with `text` fields. Text embedding is expensive and too many text fields dilute signal. Begin with 3-5 key text fields, add more only if metrics improve.
- Use `category` for discrete location/status fields. Fields like city, state, country have limited cardinality.
- Use `continuous` for numeric measurements. Counts, scores, percentages.

Centrality, community labels, and other graph-algorithm outputs from `rai-graph-analysis` can feed the GNN as features once they're materialized as concept properties. Compute the metric on a separate `Graph` instance (the algorithm graph -- often a different topology from the GNN graph), bind the result, then include it in the PropertyTransformer:
# Algorithm graph (often a different topology from the GNN graph)
algo_graph = Graph(model, directed=False)
define(algo_graph.Edge.new(src=Source, dst=SourceRef)).where(...)
# Bind metric output as a Concept property
Source.pagerank = model.Property(f"{Source} has {Float:pagerank}")
model.define(Source.pagerank(graph_algo_result))
# Include as a continuous (or category) feature
pt = PropertyTransformer(
continuous=[Source.pagerank, ...],
...
)
Two-graph setups are common (the GNN graph and the algorithm graph have different shapes); name them distinctly to avoid confusion.
PropertyTransformer is optional -- omitting it auto-infers all field types. For production, explicit annotation is recommended. Use drop to exclude fields or entire concepts: drop=[Interaction, Item.internal_code].
For the full feature type reference including drop patterns, see references/property-transformer-types.md.
| Mistake | Cause | Fix |
|---------|-------|-----|
| Concept name is plural (e.g. "Customers") | Naming convention | Use singular names: Concept("Customer") |
| Task table concept has identify_by | Task tables don't need primary keys | Use plain Concept("TrainTable") with no identify_by |
| Snowflake table name not fully qualified | Missing database or schema prefix | Use "DATABASE.SCHEMA.TABLE" format |
| Test Relationship includes label/target | Test data should not contain the answer | Omit the "has" clause: f"{Source}" or f"{Source} at {Any:ts}" |
| Positional args in define(Train(...)) don't match template | Template and population call must align | Match the order: source, [timestamp], [label/target] |
| Self-referential edge without .ref() | Same concept on both sides creates ambiguity | Use PostRef = Post.ref() for the destination |
| time_col fields not in datetime list | Both lists must include the field | Add time columns to both datetime=[...] and time_col=[...] |
| Task table concept used in edge definition | Only graph concepts participate in edges | Edges connect domain entities, not task tables |
| Missing type import | e.g. using Date without importing it | Add missing types to the import line |
| Column name has spaces or special characters | Python identifier rules prevent Concept.weight(kg) | Use getattr(People, "weight(kg)") to reference the field |
| identify_by key or property access doesn't match Snowflake column name | Typo or wrong column — matching is case-insensitive, but the column name must exist | Check INFORMATION_SCHEMA.COLUMNS / run DESCRIBE TABLE for the exact spelling |
| Train/Val/Test Relationships have different schemas | Test omits the label but also changes concept or timestamp structure | Train, Val, and Test must share the same concept and timestamp structure — only the label/target is omitted in Test |
| Link prediction join key or target column is VARIANT in task table | Task table stores target IDs as an array instead of one row per pair | Run DESCRIBE TABLE on all three split tables before writing task relationships; see references/task-relationships.md § Link Prediction — Task Table Format Requirements (VARIANT check) for the joined-vs-non-joined branch and the LATERAL FLATTEN recipe |
| Graph topology is only same-entity temporal-lag edges (no heterogeneous edges across concept types) | The GNN's value is message-passing across different entity kinds; lag-only chains carry no signal beyond what shifted/lag features in a tabular model already capture | Add heterogeneous edges across concept types, or re-scope as time-series/tabular regression outside RAI (optionally loaded as pre_computed for downstream reasoners) |
| model.data(df) raises KeyError: 0 | Type inference does a label-based lookup df[col][0], which fails when the DataFrame index doesn't include 0 (typical after df.iloc[N:M] or df.sample() slicing for train/val/test splits) | Call .reset_index(drop=True) on every sliced split before model.data(): m.data(val_df.reset_index(drop=True)) |
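The `KeyError: 0` row above comes down to label-based versus position-based lookup. The following is a pure-Python analogy only: a dict keyed by index label stands in for a column of a sliced DataFrame, and rebuilding the dict with fresh keys stands in for `.reset_index(drop=True)`. No pandas or RAI calls are involved.

```python
# A dict keyed by index label stands in for a column after slicing rows 5..7.
sliced = {5: "a", 6: "b", 7: "c"}

try:
    first = sliced[0]          # label-based lookup: there is no label 0
except KeyError as exc:
    print(f"KeyError: {exc}")  # KeyError: 0

# Analogue of reset_index(drop=True): relabel rows 0..n-1, keeping their order.
relabelled = dict(enumerate(sliced.values()))
print(relabelled[0])           # a
```

The first row is still present after slicing; only its label has changed, which is why relabelling fixes the lookup.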
| Pattern | Description | File |
|---------|-------------|------|
| Node classification | Binary classification data model | examples/node_classification_snowflake.py |
| Link prediction | Repeated link prediction data model | examples/link_prediction_snowflake.py |
| Regression | Regression-with-time data model | examples/regression_snowflake.py |
| Reference | Description | File |
|-----------|-------------|------|
| Task relationships | Relationship template patterns for all task types with code examples | references/task-relationships.md |
| PropertyTransformer types | Full feature type reference, drop patterns, and guidelines | references/property-transformer-types.md |
| Auto-discovery | SQL templates for discovering PKs, FKs, edges, and task structure | references/auto-discovery.md |