skills/hdinsight-migration/SKILL.md
Port Azure HDInsight Spark clusters and Hive workloads to Microsoft Fabric. Removes legacy HiveContext and standalone SparkContext constructors, replacing them with the pre-instantiated SparkSession. Converts WASB and ABFS storage paths to OneLake abfss URLs via Shortcuts. Transforms Hive DDL (STORED AS ORC, external tables) to Delta Lake schemas inside Fabric Lakehouse. Maps Oozie workflow actions — spark, hive, shell, sqoop, coordinator — to Fabric Pipeline activities and schedule triggers. Introduces notebookutils for file and credential operations previously handled via subprocess or HDFS client calls. Use when the user wants to: (1) retire an HDInsight cluster and move to Fabric, (2) convert WASB paths or Hive DDL, (3) replace Oozie coordinators with Fabric Pipelines. Triggers: "migrate from hdinsight", "hdi to fabric", "hivecontext sparksession fabric", "wasb to onelake", "hive ddl to delta", "oozie to fabric pipelines", "hive metastore lakehouse", "hdinsight spark migration".
Update Check: ONCE PER SESSION (mandatory)
The first time this skill is used in a session, run the `check-updates` skill before proceeding.
- GitHub Copilot CLI / VS Code: invoke the `check-updates` skill.
- Claude Code / Cowork / Cursor / Windsurf / Codex: compare local vs remote package.json version.
- Skip if the check was already performed earlier in this session.
CRITICAL NOTES
- To find workspace details (including its ID) from a workspace name: list all workspaces, then use JMESPath filtering
- To find item details (including its ID) from workspace ID, item type, and item name: list all items of that type in that workspace, then use JMESPath filtering
- HDInsight has no `mssparkutils` or `dbutils` equivalent; `notebookutils` is a net-new capability being introduced
- `HiveContext` and `SQLContext` are legacy Spark 1.x/2.x APIs; Fabric uses the Spark 3.x `SparkSession` exclusively
- `wasb://` paths are deprecated and require a Storage Account key or SAS; replace them with OneLake shortcuts
Read these companion documents before executing migration tasks:
- `az rest`, `az login`, token acquisition, Fabric REST via CLI

For notebook and Lakehouse creation, see spark-authoring-cli. For Fabric Warehouse DDL/DML authoring, see sqldw-authoring-cli.

| Topic | Reference |
|---|---|
| Migration Workload Map | § Migration Workload Map |
| SparkSession & Context API Changes | § SparkSession API Changes |
| WASB / ABFS → OneLake Path Migration | path-migration.md |
| Hive DDL → Delta Lake / Lakehouse Schemas | hive-to-delta.md |
| Oozie → Fabric Pipelines | § Oozie → Fabric Pipelines |
| Introducing notebookutils | § Introducing notebookutils |
| Before/After Code Patterns | code-patterns.md |
| Spark Configuration Differences | § Spark Configuration Differences |
| Must / Prefer / Avoid | § Must / Prefer / Avoid |
| Authentication & Token Acquisition | COMMON-CORE.md § Authentication |
| Lakehouse Management | SPARK-AUTHORING-CORE.md § Lakehouse Management |

Migration Workload Map

| HDInsight Component | Fabric Target | Notes |
|---|---|---|
| Spark cluster (notebooks, scripts) | Fabric Spark (Lakehouse / Notebooks / SJD) | No persistent cluster — Starter Pool or Custom Pool provides on-demand Spark |
| Hive / HiveServer2 | Lakehouse SQL Endpoint + Lakehouse schemas | Delta Lake replaces Hive metastore; schemas provide namespace equivalent |
| HBase | Fabric Warehouse or Azure Cosmos DB (separate from Fabric) | HBase has no direct Fabric equivalent — assess workload access patterns |
| Oozie workflows | Fabric Data Pipelines | Map Oozie actions to Fabric activities; see § Oozie → Fabric Pipelines |
| YARN Resource Manager | Fabric Spark monitoring (Spark UI, Monitoring Hub) | No YARN — Fabric manages compute automatically |
| Ambari | Fabric Monitoring Hub + Admin Portal | Cluster health, capacity, and job monitoring |
| WASB / ABFS storage | OneLake Shortcuts → `abfss://<workspace>@onelake.dfs.fabric.microsoft.com/` | See path-migration.md |
| Ranger policies | Fabric workspace roles + OneLake data access roles | Map Ranger row/column filters to Lakehouse row-level security |
| Livy REST server | Fabric Livy API | Compatible endpoint — see SPARK-AUTHORING-CORE.md |
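
The WASB / ABFS row above implies a mechanical path rewrite once a shortcut exists. A minimal sketch of that rewrite follows; the function name `to_onelake_path` and its parameters are hypothetical, and it assumes the OneLake Shortcut sits under `Files/` and points at the root of the legacy container. Workspace and Lakehouse names are assumed to be URL-safe.

```python
from urllib.parse import urlparse

ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"
LEGACY_SCHEMES = {"wasb", "wasbs", "abfs", "abfss"}


def to_onelake_path(legacy_url: str, workspace: str, lakehouse: str, shortcut: str) -> str:
    """Rewrite a legacy storage URL to a OneLake abfss URL.

    Assumption: `shortcut` is a OneLake Shortcut under Files/ that targets
    the root of the legacy container, so the path inside the container
    carries over unchanged.
    """
    parsed = urlparse(legacy_url)
    if parsed.scheme not in LEGACY_SCHEMES:
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    relative = parsed.path.lstrip("/")  # path inside the container
    return (
        f"abfss://{workspace}@{ONELAKE_HOST}/"
        f"{lakehouse}.Lakehouse/Files/{shortcut}/{relative}"
    )


print(to_onelake_path(
    "wasb://data@acct.blob.core.windows.net/orders/2024/", "ws", "lh", "raw"
))
```

The account and container names are discarded on purpose: after migration the shortcut, not the URL, carries the binding to the source storage.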

SparkSession API Changes

HDInsight Spark clusters often use legacy Spark 1.x / 2.x API styles. Replace all of these with the unified SparkSession:

| Legacy HDInsight Pattern | Fabric Spark 3.x Replacement |
|---|---|
| from pyspark import SparkContext; sc = SparkContext() | Not needed — sc = spark.sparkContext (pre-instantiated) |
| from pyspark.sql import HiveContext; hc = HiveContext(sc) | Not needed — spark session has Hive-compatible SQL support via Delta schemas |
| from pyspark.sql import SQLContext; sqlc = SQLContext(sc) | Not needed — use spark.sql(...) directly |
| SparkSession.builder.enableHiveSupport().getOrCreate() | Not needed in Fabric — spark is pre-built and available |
| sc.textFile("wasb://<container>@<account>.blob.core.windows.net/path") | spark.read.text("abfss://<workspace>@onelake.dfs.fabric.microsoft.com/lh.Lakehouse/Files/path") |
| sqlContext.sql("CREATE TABLE ... STORED AS ORC") | See hive-to-delta.md for Delta DDL equivalent |

In Fabric notebooks, `spark` (SparkSession) and `sc` (SparkContext) are pre-instantiated; do not call `SparkContext()` or `SparkSession.builder...getOrCreate()` at the top of migrated notebooks.
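
Before editing a notebook by hand, it can help to list where the legacy patterns from the table occur. The following is a rough sketch of such a scan, not part of the skill itself; `LEGACY_PATTERNS` and `find_legacy_usage` are hypothetical names, and the regex matching is deliberately naive (it ignores anything after a `#` and does not parse Python).

```python
import re

# Constructor calls and path schemes the table above marks for removal.
LEGACY_PATTERNS = {
    "HiveContext": re.compile(r"\bHiveContext\s*\("),
    "SQLContext": re.compile(r"\bSQLContext\s*\("),
    "SparkContext": re.compile(r"\bSparkContext\s*\("),
    "getOrCreate": re.compile(r"\.getOrCreate\s*\("),
    "wasb path": re.compile(r"wasbs?://"),
}


def find_legacy_usage(source: str) -> list[tuple[int, str]]:
    """Return (line number, pattern label) pairs for each legacy hit."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # naive: drop trailing comments
        for label, pattern in LEGACY_PATTERNS.items():
            if pattern.search(code):
                findings.append((lineno, label))
    return findings


sample = (
    "from pyspark.sql import HiveContext\n"
    "hc = HiveContext(sc)\n"
    "rdd = sc.textFile('wasb://c@a.blob.core.windows.net/p')\n"
)
for lineno, label in find_legacy_usage(sample):
    print(f"line {lineno}: {label}")
```

Import statements are left unflagged because they are harmless on their own; only constructor calls conflict with the pre-instantiated session.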

Oozie → Fabric Pipelines

Map Oozie workflow actions to Fabric Data Pipeline activities:

| Oozie Action Type | Fabric Pipeline Activity | Notes |
|---|---|---|
| <spark> action | Notebook activity or Spark Job Definition activity | Pass parameters via notebook cell parameters or SJD arguments |
| <hive> action | Script activity (SQL) against Lakehouse SQL Endpoint | Convert HiveQL to Spark SQL or Delta SQL |
| <shell> action | Azure Function activity or Web activity | Shell scripts must be refactored; no direct shell execution in Fabric Pipelines |
| <java> action | Azure Batch activity (external) or refactor to PySpark | Java MapReduce jobs must be rewritten |
| <sqoop> action | Copy Data activity (Fabric Data Factory connector) | Sqoop import/export maps to Fabric Copy Data with JDBC source/sink |
| <coordinator> (time-based schedule) | Pipeline schedule trigger | Set recurrence in pipeline trigger; supports cron-like expressions |
| <coordinator> (data-triggered) | Storage Event trigger | Trigger on OneLake file arrival |

Delegate to `spark-authoring-cli` for notebook and SJD creation after mapping pipeline activities.
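
As a planning aid, the action mapping above can be applied to an existing `workflow.xml` to produce a checklist of target activities. This is a sketch under assumptions: the dictionary simply restates the table, `plan_activities` is a hypothetical helper, and only top-level `<action>` nodes are inspected (forks, decisions, and coordinators are not handled).

```python
import xml.etree.ElementTree as ET

# Restates the mapping table; values are descriptive labels, not API names.
OOZIE_TO_FABRIC = {
    "spark": "Notebook or Spark Job Definition activity",
    "hive": "Script activity (SQL) against Lakehouse SQL Endpoint",
    "shell": "Azure Function or Web activity",
    "java": "Azure Batch activity (external) or PySpark refactor",
    "sqoop": "Copy Data activity",
}


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree adds to qualified tags."""
    return tag.rsplit("}", 1)[-1]


def plan_activities(workflow_xml: str) -> list[tuple[str, str, str]]:
    """Return (action name, Oozie action type, suggested Fabric activity)."""
    root = ET.fromstring(workflow_xml)
    plan = []
    for node in root:
        if local_name(node.tag) != "action":
            continue
        for child in node:
            kind = local_name(child.tag)
            if kind in ("ok", "error"):  # transitions, not work
                continue
            target = OOZIE_TO_FABRIC.get(kind, "unmapped: review manually")
            plan.append((node.get("name"), kind, target))
    return plan
```

Action types missing from the dictionary are reported as unmapped rather than silently dropped, so the plan stays complete even for workflows that use custom actions.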

Introducing notebookutils

HDInsight Spark had no built-in utility framework equivalent to `mssparkutils` or `dbutils`. When migrating to Fabric, introduce `notebookutils` for common operations:

| Operation | Old HDInsight Approach | notebookutils Equivalent |
|---|---|---|
| List files | dbutils (N/A) / HDFS CLI | notebookutils.fs.ls("abfss://...") |
| Copy file | HDFS API / shutil | notebookutils.fs.cp(src, dest) |
| Read secret | Azure Key Vault REST call | notebookutils.credentials.getSecret(keyVaultUrl, secretName) |
| Get notebook context | Not available | notebookutils.runtime.context — returns workspace ID, notebook ID, etc. |
| Run child notebook | Not available | notebookutils.notebook.run("notebook_name", timeout, {"param": "value"}) |
| Exit notebook with value | sys.exit() | notebookutils.notebook.exit("value") |
| Mount storage | WASB config in spark-defaults.conf | OneLake Shortcut (no runtime mount needed) |
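
To show how an HDFS CLI habit such as `hdfs dfs -du` might translate, here is a small sketch built on `notebookutils.fs.ls`. It assumes the entries returned by `ls` expose `path`, `size`, and `isDir` attributes; `total_size_bytes`, `FakeFs`, and `_entry` are hypothetical names, and the fake exists only so the logic can be exercised outside a Fabric notebook.

```python
from types import SimpleNamespace


def total_size_bytes(fs, path: str) -> int:
    """Recursively sum file sizes under `path` using an ls()-style API.

    In a Fabric notebook, pass `notebookutils.fs` as `fs`.
    """
    total = 0
    for entry in fs.ls(path):
        if entry.isDir:
            total += total_size_bytes(fs, entry.path)
        else:
            total += entry.size
    return total


def _entry(path, size=0, is_dir=False):
    # Mimics the attributes assumed on notebookutils file entries.
    return SimpleNamespace(path=path, size=size, isDir=is_dir)


class FakeFs:
    """Offline stand-in for notebookutils.fs, for local testing only."""

    def __init__(self, tree):
        self.tree = tree

    def ls(self, path):
        return self.tree.get(path, [])


demo_fs = FakeFs({
    "Files/raw": [_entry("Files/raw/a.parquet", 100), _entry("Files/raw/2024", is_dir=True)],
    "Files/raw/2024": [_entry("Files/raw/2024/b.parquet", 250)],
})
print(total_size_bytes(demo_fs, "Files/raw"))  # 350
```

Passing the file system object in as a parameter keeps the function testable locally; inside Fabric the call would be `total_size_bytes(notebookutils.fs, "Files/raw")`.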

Spark Configuration Differences

| HDInsight Concept | Fabric Spark Equivalent | Migration Action |
|---|---|---|
| spark-defaults.conf (cluster-wide) | Fabric Spark Workspace Settings + Environment item | Move config properties to Environment or use %%configure in notebooks |
| %%configure magic | %%configure magic — identical | No change needed |
| YARN queue / resource allocation | Fabric Spark pool node size and autoscale settings | Map queue SLAs to Custom Pool configuration |
| Ambari service configs (HDFS, YARN tuning) | Not applicable — Fabric manages infrastructure | Remove; focus on application-level Spark configs |
| HDI Spark version (e.g., Spark 2.4) | Fabric Runtime 1.3 = Spark 3.5 (latest) | Test for deprecated API removals (e.g., HiveContext, RDD-style ML) |
| Conda environment / bootstrap.sh | Fabric Environment item with custom libraries | Recreate conda/pip dependencies in a Fabric Environment |
| hive-site.xml (metastore connection) | Not needed — Delta Lake IS the metastore in Fabric | Remove metastore config; use Lakehouse schemas for namespace organization |
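
The first row of the table suggests carrying application-level properties from `spark-defaults.conf` into `%%configure`. A rough sketch of that conversion follows. The helper name `conf_to_configure_cell` and the list of skipped prefixes are assumptions made for illustration: YARN, master, and storage-key properties are dropped because Fabric manages them, but the exact set to exclude should be reviewed per workload. Only whitespace-separated `key value` lines are handled.

```python
import json

# Assumed infrastructure-level prefixes that have no meaning in Fabric.
INFRA_PREFIXES = ("spark.yarn.", "spark.master", "spark.hadoop.fs.azure.")


def conf_to_configure_cell(conf_text: str) -> str:
    """Build the text of a %%configure cell from spark-defaults.conf content."""
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # key without a value
        key, value = parts
        if key.startswith(INFRA_PREFIXES):
            continue  # cluster plumbing, not application config
        settings[key] = value.strip()
    return "%%configure\n" + json.dumps({"conf": settings}, indent=2)


sample_conf = """
# cluster defaults
spark.master                   yarn
spark.yarn.queue               etl
spark.sql.shuffle.partitions   200
spark.executor.memory          8g
"""
print(conf_to_configure_cell(sample_conf))
```

Properties such as executor memory may be constrained by the pool's node size in Fabric, so the generated cell is a starting point to review rather than a guaranteed equivalent.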

Must / Prefer / Avoid

Must:
- Replace `wasb://` / `wasbs://` paths with OneLake `abfss://` paths or OneLake Shortcuts; `wasb://` requires storage account keys, which are not the Fabric-preferred auth model
- Remove `HiveContext`, `SQLContext`, and standalone `SparkContext()`; use the pre-instantiated `spark` session in Fabric notebooks
- Convert Hive DDL (`STORED AS ORC`, `LOCATION`, `TBLPROPERTIES`) to Delta Lake DDL; see hive-to-delta.md

Prefer:
- `notebookutils` for file system operations, secret retrieval, and child notebook orchestration where HDInsight used custom scripts or direct API calls
- Fabric Environment items over `bootstrap.sh`, conda envs, and runtime `%pip install` patterns for production workloads

Avoid:
- Calling `SparkContext()` or `HiveContext()` constructors in Fabric notebooks; they conflict with the pre-instantiated `spark` session and will raise errors
- Carrying over `hive-site.xml` or external Hive metastore configuration; Fabric's Delta Lake-backed Lakehouse IS the metastore
- Using `%sh` magic for file system operations in production notebooks; use `notebookutils.fs.*` for portability and OneLake token-based auth

See code-patterns.md for full before/after examples. Key quick references:

Legacy context → Fabric pre-instantiated session

```python
# HDInsight (remove entirely)
from pyspark.sql import HiveContext
hc = HiveContext(sc)

# Fabric: use the pre-instantiated spark session directly
df = spark.sql("SELECT * FROM sales.fact_orders")
```

WASB path → OneLake path (after shortcut creation)

```python
# HDInsight
df = spark.read.parquet("wasb://<container>@<account>.blob.core.windows.net/orders/")

# Fabric (relative path resolves against the notebook's default Lakehouse)
df = spark.read.parquet("Files/raw/orders/")
```

Hive DDL → Delta DDL

```sql
-- HDInsight
CREATE TABLE sales_db.fact_orders (...) STORED AS ORC LOCATION 'wasb://...';

-- Fabric
CREATE SCHEMA IF NOT EXISTS sales_db;
CREATE TABLE sales_db.fact_orders (...) USING DELTA;
```
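
For simple statements, the DDL rewrite above can be approximated mechanically. The sketch below is a regex-based illustration, not the method used by hive-to-delta.md; `hive_ddl_to_delta` is a hypothetical helper that handles only the three clauses named in this document plus the `EXTERNAL` keyword. Statements with `PARTITIONED BY`, `ROW FORMAT`, or nested parentheses inside `TBLPROPERTIES` need manual review.

```python
import re

# Hive-only clauses to strip; each pattern includes its leading whitespace.
_HIVE_CLAUSES = [
    re.compile(r"\s+STORED\s+AS\s+\w+", re.IGNORECASE),
    re.compile(r"\s+LOCATION\s+'[^']*'", re.IGNORECASE),
    re.compile(r"\s+TBLPROPERTIES\s*\([^)]*\)", re.IGNORECASE),
]
_EXTERNAL = re.compile(r"\bCREATE\s+EXTERNAL\s+TABLE\b", re.IGNORECASE)


def hive_ddl_to_delta(ddl: str) -> str:
    """Rewrite a simple Hive CREATE TABLE statement as Delta DDL."""
    stmt = ddl.strip().rstrip(";")
    stmt = _EXTERNAL.sub("CREATE TABLE", stmt)
    for clause in _HIVE_CLAUSES:
        stmt = clause.sub("", stmt)
    return stmt.rstrip() + " USING DELTA;"


hive_stmt = (
    "CREATE EXTERNAL TABLE sales_db.fact_orders (id INT, amount DOUBLE) "
    "STORED AS ORC LOCATION 'wasb://c@a.blob.core.windows.net/orders' "
    "TBLPROPERTIES ('orc.compress'='ZLIB');"
)
print(hive_ddl_to_delta(hive_stmt))
```

Dropping `LOCATION` changes only the table definition; the existing ORC data still has to be loaded into the new Delta table as a separate step.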