Name: spark-authoring-cli
Author: microsoft

Update Check — ONCE PER SESSION (mandatory)

The first time this skill is used in a session, run the check-updates skill before proceeding.

GitHub Copilot CLI / VS Code: invoke the check-updates skill.

Claude Code / Cowork / Cursor / Windsurf / Codex: compare local vs remote package.json version.

Skip if the check was already performed earlier in this session.

CRITICAL NOTES

To find a workspace's details (including its ID) from its name: list all workspaces, then filter the result with JMESPath.

To find an item's details (including its ID) from the workspace ID, item type, and item name: list all items of that type in that workspace, then filter the result with JMESPath.
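As a rough illustration of the list-then-filter approach, the Python sketch below builds a JMESPath expression that could be passed to az rest via --query, and shows an equivalent filter over an already-parsed list response. It assumes the list response keeps its entries under value with displayName, id, and type fields; the helper names and sample IDs are hypothetical.

```python
from typing import Optional


def jmespath_id_by_name(display_name: str, item_type: Optional[str] = None) -> str:
    """Build a JMESPath query that selects the id of the first matching entry.

    Assumption: display_name contains no single quotes (no escaping is done).
    """
    condition = f"displayName=='{display_name}'"
    if item_type is not None:
        condition += f" && type=='{item_type}'"
    return f"value[?{condition}].id | [0]"


def find_id_by_name(
    response: dict, display_name: str, item_type: Optional[str] = None
) -> Optional[str]:
    """Pure-Python equivalent of the query above for a parsed list response."""
    for entry in response.get("value", []):
        if entry.get("displayName") != display_name:
            continue
        if item_type is not None and entry.get("type") != item_type:
            continue
        return entry.get("id")
    return None


# Hypothetical list response; real IDs are GUIDs returned by the API.
sample = {
    "value": [
        {"id": "ws-1", "displayName": "Sandbox", "type": "Workspace"},
        {"id": "ws-2", "displayName": "DataEng-Dev", "type": "Workspace"},
    ]
}
print(jmespath_id_by_name("DataEng-Dev"))
print(find_id_by_name(sample, "DataEng-Dev"))
```

The query string is what does the filtering on the CLI side; the Python function is only useful when the full list has already been fetched and parsed.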

Spark Authoring — CLI Skill

This skill covers two complementary areas: (1) managing Fabric Spark artifacts via REST APIs (workspaces, lakehouses, notebooks, jobs, pipelines) and (2) writing code inside Fabric Notebook cells (PySpark, Scala, SparkR, SQL with correct lakehouse access, notebookutils, and Spark configuration). For notebook code authoring fundamentals and shared modules, MUST see SPARK-NOTEBOOK-AUTHORING-CORE.md.

| Task | Reference | Notes |
|---|---|---|
| RULES — Read these first, follow them always | SKILL.md § RULES | MUST read — 4 rules for this skill |
| Finding Workspaces and Items in Fabric | COMMON-CLI.md § Finding Workspaces and Items in Fabric | Mandatory — READ link first [needed for finding workspace id by its name or item id by its name, item type, and workspace id] |
| Fabric Topology & Key Concepts | COMMON-CORE.md § Fabric Topology & Key Concepts | |
| Environment URLs | COMMON-CORE.md § Environment URLs | |
| Authentication & Token Acquisition | COMMON-CORE.md § Authentication & Token Acquisition | Wrong audience = 401; read before any auth issue |
| Core Control-Plane REST APIs | COMMON-CORE.md § Core Control-Plane REST APIs | |
| Pagination | COMMON-CORE.md § Pagination | |
| Long-Running Operations (LRO) | COMMON-CORE.md § Long-Running Operations (LRO) | |
| Rate Limiting & Throttling | COMMON-CORE.md § Rate Limiting & Throttling | |
| OneLake Data Access | COMMON-CORE.md § OneLake Data Access | Requires storage.azure.com token, not Fabric token |
| Definition Envelope | ITEM-DEFINITIONS-CORE.md § Definition Envelope | Definition payload structure |
| Per-Item-Type Definitions | ITEM-DEFINITIONS-CORE.md § Per-Item-Type Definitions | Support matrix, decoded content, part paths — REST specs, CLI recipes |
| Job Execution | COMMON-CORE.md § Job Execution | |
| Capacity Management | COMMON-CORE.md § Capacity Management | |
| Gotchas & Troubleshooting | COMMON-CORE.md § Gotchas & Troubleshooting | |
| Best Practices | COMMON-CORE.md § Best Practices | |
| Tool Selection Rationale | COMMON-CLI.md § Tool Selection Rationale | |
| Authentication Recipes | COMMON-CLI.md § Authentication Recipes | az login flows and token acquisition |
| Fabric Control-Plane API via az rest | COMMON-CLI.md § Fabric Control-Plane API via az rest | Always pass --resource https://api.fabric.microsoft.com or az rest fails |
| Pagination Pattern | COMMON-CLI.md § Pagination Pattern | |
| Long-Running Operations (LRO) Pattern | COMMON-CLI.md § Long-Running Operations (LRO) Pattern | |
| OneLake Data Access via curl | COMMON-CLI.md § OneLake Data Access via curl | Use curl not az rest (different token audience) |
| SQL / TDS Data-Plane Access | COMMON-CLI.md § SQL / TDS Data-Plane Access | |
| Job Execution (CLI) | COMMON-CLI.md § Job Execution | |
| Job Scheduling | COMMON-CLI.md § Job Scheduling | URL is /jobs/{jobType}/schedules; endDateTime required |
| OneLake Shortcuts | COMMON-CLI.md § OneLake Shortcuts | |
| Capacity Management (CLI) | COMMON-CLI.md § Capacity Management | |
| Composite Recipes | COMMON-CLI.md § Composite Recipes | |
| Gotchas & Troubleshooting (CLI-Specific) | COMMON-CLI.md § Gotchas & Troubleshooting (CLI-Specific) | az rest audience, shell escaping, token expiry |
| Quick Reference: az rest Template | COMMON-CLI.md § Quick Reference: az rest Template | |
| Quick Reference: Token Audience / CLI Tool Matrix | COMMON-CLI.md § Quick Reference: Token Audience ↔ CLI Tool Matrix | Which --resource + tool for each service |
| Relationship to SPARK-CONSUMPTION-CORE.md | SPARK-AUTHORING-CORE.md § Relationship to SPARK-CONSUMPTION-CORE.md | |
| Data Engineering Authoring Capability Matrix | SPARK-AUTHORING-CORE.md § Data Engineering Authoring Capability Matrix | |
| Lakehouse Management | SPARK-AUTHORING-CORE.md § Lakehouse Management | |
| Notebook Management | SPARK-AUTHORING-CORE.md § Notebook Management | |
| Notebook Execution & Job Management | SPARK-AUTHORING-CORE.md § Notebook Execution & Job Management | |
| CI/CD & Automation Patterns | SPARK-AUTHORING-CORE.md § CI/CD & Automation Patterns | |
| Infrastructure-as-Code | SPARK-AUTHORING-CORE.md § Infrastructure-as-Code | |
| Performance Optimization & Resource Management | SPARK-AUTHORING-CORE.md § Performance Optimization & Resource Management | |
| Authoring Gotchas and Troubleshooting | SPARK-AUTHORING-CORE.md § Authoring Gotchas and Troubleshooting | |
| Quick Reference: Authoring Decision Guide | SPARK-AUTHORING-CORE.md § Quick Reference: Authoring Decision Guide | |
| Recommended Patterns (Data Engineering) | data-engineering-patterns.md § Recommended patterns | |
| Data Ingestion Principles | data-engineering-patterns.md § Data Ingestion Principles | |
| Transformation Patterns | data-engineering-patterns.md § Transformation Patterns | |
| Delta Lake Best Practices | data-engineering-patterns.md § Delta Lake Best Practices | |
| Quality Assurance Strategies | data-engineering-patterns.md § Quality Assurance Strategies | |
| Recommended Patterns (Development Workflow) | development-workflow.md § Recommended patterns | |
| Notebook Lifecycle | development-workflow.md § Notebook Lifecycle | |
| Parameterization Patterns | development-workflow.md § Parameterization Patterns | |
| Variable Library (notebook + pipeline usage) | development-workflow.md § Method 4: Variable Library | getLibrary() + dot notation in notebooks; libraryVariables + @pipeline().libraryVariables in pipelines |
| Variable Library Definition | ITEM-DEFINITIONS-CORE.md § VariableLibrary | Definition parts, decoded content, types, pipeline mappings, gotchas |
| Local Testing Strategy | development-workflow.md § Local Testing Strategy | |
| Debugging Patterns | development-workflow.md § Debugging Patterns | |
| Recommended Patterns (Infrastructure) | infrastructure-orchestration.md § Recommended patterns | |
| Materialized Lake View patterns | materialized-lake-view-patterns.md § Recommended patterns | Spark Lakehouse authoring guidance for MLV design (when to use MLVs, layering patterns) |
| MLV incremental refresh patterns | mlv-incremental-refresh-patterns.md § IR-friendly syntax guide | Use for refresh-readiness review and safe non-breaking rewrites |
| MLV schedule & job management | mlv-operations-cli | Route here when user asks to schedule, trigger, monitor, or cancel MLV refreshes (not authoring) |
| Workspace Provisioning Principles | infrastructure-orchestration.md § Workspace Provisioning Principles | |
| Lakehouse Configuration Guidance | infrastructure-orchestration.md § Lakehouse Configuration Guidance | |
| Pipeline Design Patterns | infrastructure-orchestration.md § Pipeline Design Patterns | |
| CI/CD Integration Strategy | infrastructure-orchestration.md § CI/CD Integration Strategy | |
| Notebook API — Which Endpoint to Use | notebook-api-operations.md § Quick Decision | Start here for remote notebook edits — getDefinition vs updateDefinition |
| Notebook Modification Workflow | notebook-api-operations.md § Workflow | Five-step flow: retrieve, decode, modify, encode, upload |
| Notebook API Error Reference | notebook-api-operations.md § Error Reference | 411, 400 (updateMetadata), 401, 403 explained |
| Notebook API Gotchas | notebook-api-operations.md § Gotchas | /result suffix, empty body, \n per-line rule, format=ipynb |
| Default Lakehouse Binding | notebook-api-operations.md § Default Lakehouse Binding | .ipynb metadata vs .py # METADATA block; discover IDs dynamically |
| Public URL Data Ingestion | notebook-api-operations.md § Public URL Data Ingestion | Use real source URL, stage into Files/, then read with Spark |
| getDefinition (read notebook content) | notebook-api-operations.md § Step 1 — Retrieve Notebook Content | LRO flow, ?format=ipynb, empty body (--body '{}') requirement |
| Decode Base64 Notebook Payload | notebook-api-operations.md § Step 2 — Decode the Notebook Content | Extract payload, base64 decode, ipynb JSON structure |
| Modify Notebook Cells | notebook-api-operations.md § Step 3 — Modify the Notebook Content | Find cell, insert/replace lines, \n per-line rule |
| updateDefinition (write notebook content) | notebook-api-operations.md § Step 4 — Re-encode and Upload | Re-encode, upload, LRO poll, updateMetadata flag pitfall |
| Verify Notebook Update (Optional) | notebook-api-operations.md § Step 5 — Verify the Update | Skip unless you suspect a silent failure — Succeeded from updateDefinition is sufficient (see Rule 2) |
| Notebook API End-to-End Script | notebook-api-operations.md § Complete End-to-End Script | Full bash: get → decode → modify → encode → update → verify |
| Quick Start Examples | SKILL.md § Quick Start Examples | Minimal examples for common operations |
| — Notebook Code Authoring (shared modules) — | | |
| Notebook Authoring Core | SPARK-NOTEBOOK-AUTHORING-CORE.md | READ FIRST for notebook code tasks — fundamentals, code gen approach, module index |

Must/Prefer/Avoid

MUST DO

Check for recent jobs BEFORE creating new notebook runs — Query job instances from last 5 minutes; if recent job exists, monitor it instead of creating duplicate
Capture job instance ID immediately after POST — Store job ID before any other operations to enable proper monitoring
Verify workspace capacity assignment before operations — Workspace must have capacity assigned and active
When user provides a public data URL, follow the Public URL Data Ingestion policy — keep detailed behavior in the linked resource section to avoid drift/duplication
Format notebook cells correctly — Each line in cell source array MUST end with \n to prevent code merging
Use correct Lakehouse Livy session body format — Send a FLAT JSON with name, driverMemory, driverCores, executorMemory, executorCores. Do NOT wrap in {"payload": ...} or send only {"kind": "pyspark"} — that causes HTTP 500. Use valid memory values (28g, 56g, 112g, 224g). See Create Lakehouse Livy Session example below and SPARK-CONSUMPTION-CORE.md.
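The cell-formatting requirement above can be sketched in Python as follows. This is a minimal illustration, not the skill's own tooling: the helper names are hypothetical, and the cell dictionary follows the standard ipynb (nbformat 4) code-cell layout.

```python
from typing import List


def to_cell_source(code: str) -> List[str]:
    """Split code into a source array where every entry ends with a newline."""
    # splitlines() drops the original line endings, so "\n" is re-added uniformly.
    return [line + "\n" for line in code.splitlines()]


def make_code_cell(code: str) -> dict:
    """Wrap the source array in a standard ipynb code cell."""
    return {
        "cell_type": "code",
        "metadata": {},
        "outputs": [],
        "execution_count": None,
        "source": to_cell_source(code),
    }


cell = make_code_cell('df = spark.read.table("bronze.orders_raw")\ndisplay(df)')
print(cell["source"])
```

Joining the source entries back together yields the original code with a trailing newline, so adjacent lines cannot merge when the notebook is rendered.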

PREFER

Poll job status with proper intervals — 10-30 seconds between polls; timeout after reasonable duration (e.g., 30 minutes)
Check job history when POST response is unreadable — If POST returns "No Content" or unreadable response, query recent jobs (last 1 minute) before retrying
Use Starter Pool for development — Development/testing workloads should use useStarterPool: true
Use Workspace Pool for production — Production workloads need consistent performance with useWorkspacePool: true
Enable lakehouse schemas during creation — Set creationPayload.enableSchemas: true for better table organization
Implement idempotency checks — Prevent duplicate operations by checking existing state first
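The polling guidance above might look like the following Python sketch. The status names treated as terminal (Completed, Failed, Cancelled, Deduped) are an assumption about the job instance status values, and get_status is a hypothetical callable that would wrap the GET request for the job instance; the sleep and clock parameters exist only so the loop can be exercised without waiting.

```python
import time
from typing import Callable

# Assumed terminal job-instance states; confirm against the Jobs API reference.
TERMINAL_STATES = {"Completed", "Failed", "Cancelled", "Deduped"}


def poll_until_done(
    get_status: Callable[[], str],
    interval_s: float = 15.0,      # within the suggested 10-30 second range
    timeout_s: float = 1800.0,     # 30 minutes
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll with GET only until a terminal state is seen or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        # Stop before sleeping if the next poll would land past the deadline.
        if clock() + interval_s > deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s} seconds")
        sleep(interval_s)
```

The function never submits anything; it only reads status, which keeps it consistent with the rule against retrying the POST once a job instance ID is known.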

AVOID

Never retry POST with same parameters — If you have a job ID, only use GET to check status; don't create duplicate job instances
Don't skip capacity verification — Operations will fail if workspace capacity is paused or unassigned
Avoid immediate POST retries on failures — Check for existing/active jobs first to prevent duplicates
Don't create new runs if monitoring existing job — One job at a time; wait for completion before submitting new runs
Don't hardcode workspace/lakehouse IDs — Discover dynamically via item listing or catalog search APIs
Own %%sql notebook cells here, do not defer to sqldw-consumption-cli — a request to "write a %%sql cell" (or any notebook magic cell) is notebook-cell authoring even when the cell queries a lakehouse table. Only route to sqldw-consumption-cli when the user wants a plain T-SQL query executed against a SQL endpoint, not a notebook cell.
Own "run/execute the notebook named X" here — running a Fabric notebook by name is notebook execution via the Jobs API (RunNotebook), which belongs to this skill; do not defer to spark-consumption-cli (that skill is only for ad-hoc Livy session code execution).
Do NOT use Lakehouse Livy sessions to run a Fabric notebook — Lakehouse Livy sessions (the public Livy API) are for ad-hoc interactive Spark code execution. To run a notebook as a job, use the Jobs API (RunNotebook) which creates a Notebook Spark session internally. See SPARK-AUTHORING-CORE.md § Notebook Execution & Job Management
Do NOT schedule MLV refreshes from notebooks — If the user asks to "schedule MLV refresh", route to mlv-operations-cli which uses the REST API. Notebook-based REFRESH MATERIALIZED LAKE VIEW ... FULL is for one-time manual refresh only, not recurring schedules.

RULES — Read these first, follow them always

Rule 1 — Validate prerequisites before operations. Verify workspace has capacity assigned (see COMMON-CORE.md Create Workspace and Capacity Management) and resource IDs exist before attempting operations.

Rule 2 — Trust updateDefinition success. A Succeeded poll result from updateDefinition is sufficient confirmation that content and lakehouse bindings persisted. Do NOT call getDefinition after every upload — it is an async LRO that adds significant latency. Only use getDefinition for its intended purpose: reading current notebook content before making modifications.

Rule 3 — Prevent duplicate jobs and monitor execution properly. Before submitting a new notebook run, ALWAYS check for recent job instances first (last 5 minutes). If a recent job exists, monitor it instead of creating a duplicate. After submission, capture the job instance ID immediately and poll its status; never retry the POST. See SPARK-AUTHORING-CORE.md Job Monitoring for patterns.
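A minimal sketch of the duplicate check in Rule 3, written in Python over an already-fetched list of job instances. It assumes each instance carries id, status, and a UTC startTimeUtc timestamp without an offset suffix; the function names and sample records are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def parse_utc(ts: str) -> datetime:
    """Parse a UTC timestamp such as 2024-01-01T11:58:00.1234567 or ...00Z."""
    ts = ts.rstrip("Z")
    if "." in ts:
        head, frac = ts.split(".", 1)
        # Trim or pad the fraction to microseconds so fromisoformat accepts it.
        ts = f"{head}.{frac[:6].ljust(6, '0')}"
    return datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)


def find_recent_jobs(
    instances: List[dict],
    now: datetime,
    window: timedelta = timedelta(minutes=5),
) -> List[dict]:
    """Return instances started within the window, newest first."""
    cutoff = now - window
    recent = [
        job for job in instances
        if job.get("startTimeUtc") and parse_utc(job["startTimeUtc"]) >= cutoff
    ]
    return sorted(recent, key=lambda j: parse_utc(j["startTimeUtc"]), reverse=True)


# Hypothetical job instances for illustration only.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
jobs = [
    {"id": "job-old", "status": "Completed", "startTimeUtc": "2024-01-01T11:50:00Z"},
    {"id": "job-new", "status": "InProgress", "startTimeUtc": "2024-01-01T11:58:00.1234567"},
]
recent = find_recent_jobs(jobs, now)
print([job["id"] for job in recent])
```

If the returned list is non-empty, the first entry is the instance to monitor; a new run would be submitted only when the list is empty.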

Rule 4 — For notebook code authoring, MUST follow SPARK-NOTEBOOK-AUTHORING-CORE.md. When writing code inside notebook cells, MUST read SPARK-NOTEBOOK-AUTHORING-CORE.md first — it defines the code generation approach, rules, and a Module Index linking to detailed guides (lakehouse paths, connections, context, orchestration, etc.). Use the Spark-specific resources in this skill (data-engineering-patterns.md, development-workflow.md) for Spark-only implementation details. When the task is about Materialized Lake Views, read materialized-lake-view-patterns.md for authoring/design guidance and mlv-incremental-refresh-patterns.md for refresh-readiness analysis.

Notebook Codegen Quick Rules

Quick reference for common notebook-authoring tasks. The shared common/notebook-authoring/ core (see Rule 4 / SPARK-NOTEBOOK-AUTHORING-CORE.md) is authoritative; if these ever differ, follow the common core.

| User asks for | Required output pattern |
|---|---|
| %%sql / cross-lakehouse query cell | Return a Fabric notebook %%sql cell. Include the named workspace/lakehouse/schema/table in the code or explanatory note. This is notebook authoring, not interactive Spark consumption. |
| Pipeline context detection | Use notebookutils.runtime.context; include isForPipeline = context["isForPipeline"] and read context["currentWorkspaceId"]. |
| Built-in notebook resource files with Spark | Use notebookutils.nbResPath, builtin/, and the file: prefix: spark.read.json(f"file:{notebookutils.nbResPath}/builtin/config.json"). Spark and notebookutils.fs require file: for resource-folder local paths. |
| Fabric connections | Use notebookutils.connections.getCredential("{connectionId}") directly and show a compact code sample. Do not web-search; keep the answer under the tool/turn budget. |

Minimal snippets

# Pipeline context
context = notebookutils.runtime.context
isForPipeline = context["isForPipeline"]
workspace_id = context["currentWorkspaceId"]
print(f"isForPipeline={isForPipeline}, workspace_id={workspace_id}")

# Spark read from notebook built-in resources
config_df = spark.read.option("multiline", "true").json(
    f"file:{notebookutils.nbResPath}/builtin/config.json"
)
display(config_df)

# PostgreSQL through a Fabric connection (fill in connectionId and host/db)
import psycopg2

credential = notebookutils.connections.getCredential("{connectionId}")
conn = psycopg2.connect(
    host="<postgres-host>",
    database="<database>",
    user=credential["username"],
    password=credential["password"],
)

Quick Start Examples

For detailed patterns, authentication, and comprehensive API usage, see:

COMMON-CORE.md — Fabric REST API patterns, authentication, item discovery
COMMON-CLI.md — az rest usage, environment detection, token acquisition
SPARK-AUTHORING-CORE.md — Notebook deployment, lakehouse creation, job execution

Below are minimal quick-start examples. Always reference the COMMON-* files for production use.

Create Workspace & Lakehouse

# See COMMON-CORE.md Environment URLs and SPARK-AUTHORING-CORE.md for full patterns
cat > /tmp/body.json << 'EOF'
{"displayName": "DataEng-Dev"}
EOF
workspace_id=$(az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces" \
  --body @/tmp/body.json --query "id" --output tsv)

cat > /tmp/body.json << 'EOF'
{"displayName": "DevLakehouse", "type": "Lakehouse", "creationPayload": {"enableSchemas": true}}
EOF
lakehouse_id=$(az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces/$workspace_id/items" \
  --body @/tmp/body.json --query "id" --output tsv)
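For callers that drive the CLI from a script, the same two calls can be assembled in Python. The sketch below only builds the argument list, using the flags that appear in the shell example above; the helper name az_rest_args is hypothetical, and running the command (for example with subprocess.run) is left to the caller.

```python
from typing import List, Optional

FABRIC_RESOURCE = "https://api.fabric.microsoft.com"


def az_rest_args(
    method: str,
    path: str,
    body_file: Optional[str] = None,
    query: Optional[str] = None,
) -> List[str]:
    """Assemble an az rest command line that always carries --resource."""
    args = [
        "az", "rest",
        "--method", method,
        "--resource", FABRIC_RESOURCE,
        "--url", f"{FABRIC_RESOURCE}/v1/{path.lstrip('/')}",
    ]
    if body_file is not None:
        # The @ prefix tells az to read the request body from a file.
        args += ["--body", f"@{body_file}"]
    if query is not None:
        args += ["--query", query, "--output", "tsv"]
    return args


create_workspace = az_rest_args("post", "workspaces", "/tmp/body.json", "id")
print(" ".join(create_workspace))
```

Passing the arguments as a list rather than a single string sidesteps shell quoting issues when the JMESPath query contains quotes.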

Organize Lakehouse Tables with Schemas

# See SPARK-AUTHORING-CORE.md Lakehouse Schema Organization for table organization patterns
# Create schemas for medallion architecture
spark.sql("CREATE SCHEMA IF NOT EXISTS bronze")
spark.sql("CREATE SCHEMA IF NOT EXISTS silver")
spark.sql("CREATE SCHEMA IF NOT EXISTS gold")

Create and Refresh a Materialized Lake View (MLV)

-- See resources/materialized-lake-view-patterns.md for design guidance
-- and resources/mlv-incremental-refresh-patterns.md for refresh-readiness review.

-- Bronze/Silver/Gold schemas in a Lakehouse with schemas enabled
CREATE SCHEMA IF NOT EXISTS bronze;
CREATE SCHEMA IF NOT EXISTS silver;
CREATE SCHEMA IF NOT EXISTS gold;

-- A simple Silver MLV with data quality constraints
--
-- Prerequisite for incremental refresh: enable Change Data Feed (CDF) on every
-- source table the MLV reads from. Without CDF, optimal refresh can only choose
-- between no refresh (sources unchanged) and full refresh — never incremental.
-- See resources/mlv-incremental-refresh-patterns.md.
ALTER TABLE bronze.orders_raw SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

CREATE OR REPLACE MATERIALIZED LAKE VIEW silver.orders_clean
(
    CONSTRAINT valid_order_id CHECK (order_id IS NOT NULL) ON MISMATCH DROP
)
AS
SELECT
  order_id,
  customer_id,
  CAST(order_ts AS TIMESTAMP) AS order_ts,
  amount
FROM bronze.orders_raw;

-- Routine refresh is handled by the lakehouse Materialized lake views → Manage
-- schedule/lineage view; don't orchestrate from notebooks. The SQL form below is
-- documented only for forcing a one-time FULL recompute (troubleshooting / after
-- a correction). There is no documented SQL form for triggering incremental refresh.
REFRESH MATERIALIZED LAKE VIEW silver.orders_clean FULL;

Create Lakehouse Livy Session

# See SPARK-CONSUMPTION-CORE.md for Lakehouse Livy session configuration and management
# IMPORTANT: Body MUST be flat JSON with memory/cores — do NOT wrap in {"payload": ...}
cat > /tmp/body.json << 'EOF'
{"name": "dev-session", "driverMemory": "56g", "driverCores": 8, "executorMemory": "56g", "executorCores": 8, "conf": {"spark.dynamicAllocation.enabled": "true", "spark.fabric.pool.name": "Starter Pool"}}
EOF
az rest --method post --resource "https://api.fabric.microsoft.com" \
  --url "https://api.fabric.microsoft.com/v1/workspaces/$workspace_id/lakehouses/$lakehouse_id/livyapi/versions/2023-12-01/sessions" \
  --body @/tmp/body.json

Lakehouse Livy Session Body — Common Mistakes

❌ {"payload": {"kind": "pyspark"}} → HTTP 500 (wrong wrapper, missing required fields)

❌ {"kind": "pyspark"} → HTTP 500 (missing driverMemory, executorMemory, etc.)

✅ Flat JSON with name, driverMemory, driverCores, executorMemory, executorCores (and optionally conf with Starter Pool)
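The mistakes listed above can be caught before sending the request. The following Python sketch is a hypothetical pre-flight check: the required field names and the memory values come from this document, while the function name and message wording are illustrative.

```python
from typing import List

REQUIRED_FIELDS = ("name", "driverMemory", "driverCores", "executorMemory", "executorCores")
VALID_MEMORY = {"28g", "56g", "112g", "224g"}


def livy_body_problems(body: dict) -> List[str]:
    """Return a list of problems found in a Lakehouse Livy session body."""
    problems = []
    if "payload" in body:
        problems.append('body is wrapped in "payload"; send flat JSON instead')
    for field in REQUIRED_FIELDS:
        if field not in body:
            problems.append(f"missing required field: {field}")
    for field in ("driverMemory", "executorMemory"):
        if field in body and body[field] not in VALID_MEMORY:
            problems.append(f"{field} must be one of {sorted(VALID_MEMORY)}")
    return problems


good_body = {
    "name": "dev-session",
    "driverMemory": "56g",
    "driverCores": 8,
    "executorMemory": "56g",
    "executorCores": 8,
}
print(livy_body_problems(good_body))
print(livy_body_problems({"kind": "pyspark"}))
```

An empty list means the body has the flat shape described above; it does not guarantee the service will accept the session.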

Spark Performance Configs

For detailed workload-specific configurations, see data-engineering-patterns.md Delta Lake Best Practices.

Quick reference:

# Write-heavy (Bronze): Disable V-Order, enable autoCompact
# Balanced (Silver): Enable V-Order, adaptive execution  
# Read-heavy (Gold): Vectorized reads, optimal parallelism
# See data-engineering-patterns.md for complete config tables

Focus: Essential CLI patterns for Spark/data engineering development and notebook code authoring, with intelligent routing to specialized resources. For comprehensive patterns, always reference COMMON-* files and resource documents.


