skills/data-engineering/data-lakehouse/SKILL.md
Design and implement data lakehouse architectures for scalable big data storage and analytics.
npx skillsauth add alphaonedev/openclaw-graph data-lakehouse
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables the design and implementation of data lakehouse architectures, combining data lakes with warehouse features for scalable big data storage and analytics. Use it to manage petabyte-scale data with ACID transactions, schema evolution, and optimized query performance on platforms like Delta Lake or Iceberg.
Use the OpenClaw CLI or API for this skill. Authentication requires setting $DATA_LAKEHOUSE_API_KEY as an environment variable.
CLI Command: Initialize a lakehouse project:
openclaw skill data-lakehouse init --project my-lakehouse --storage s3://my-bucket --engine delta
This creates a basic configuration file with S3 bucket and Delta Lake engine.
API Endpoint: Create a table via POST request:
curl -H "Authorization: Bearer $DATA_LAKEHOUSE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"table_name": "sales_data", "format": "parquet", "partition_by": ["date"]}' \
  https://api.openclaw.ai/data-lakehouse/tables
Response includes table metadata for immediate use.
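For scripts, the same request could be assembled in Python. The following is a minimal sketch using only the standard library; the endpoint and payload fields are taken from the curl example above, while the function name `build_create_table_request` and the `Content-Type` header are assumptions for illustration. The response shape is not documented here, so the sketch stops at building the request.

```python
import json
import os
import urllib.request

API_URL = "https://api.openclaw.ai/data-lakehouse/tables"  # endpoint from the curl example


def build_create_table_request(table_name, partition_by, api_key=None):
    """Build (but do not send) the POST request that creates a table."""
    # Fall back to the environment variable this skill uses for authentication.
    api_key = api_key or os.environ.get("DATA_LAKEHOUSE_API_KEY")
    if not api_key:
        raise RuntimeError("DATA_LAKEHOUSE_API_KEY is not set")
    payload = {
        "table_name": table_name,
        "format": "parquet",
        "partition_by": list(partition_by),
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumption: the API expects JSON
        },
        method="POST",
    )
```

Sending is left to the caller, for example with `urllib.request.urlopen(request)`; failing early when the key is missing avoids an unauthenticated call.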
Code Snippet: In Python, integrate with Spark:
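from pyspark.sql import SparkSession

# Reuse an existing session or start a new one.
spark = SparkSession.builder.appName("lakehouse").getOrCreate()
# Read the current contents of the Delta table.
df = spark.read.format("delta").load("s3://my-bucket/sales_data")
# Append the rows in df to the same table.
df.write.format("delta").mode("append").save("s3://my-bucket/sales_data")
This appends the rows in df to a Delta table. In a real pipeline df would hold new records rather than the table's own contents, since appending a table to itself duplicates every row. Ensure Spark is configured with AWS credentials.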
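The Spark snippet assumes a session that already understands the `delta` format. A rough sketch of how such a session might be configured is shown below; the two `spark.sql.*` settings are the ones Delta Lake documents for enabling its SQL extension and catalog, while the helper names `delta_spark_conf` and `build_session` are hypothetical. The Delta Lake jars still need to be on the classpath, which this sketch does not handle.

```python
def delta_spark_conf(app_name="lakehouse"):
    """Return the Spark settings that enable Delta Lake's SQL extension and catalog."""
    return {
        "spark.app.name": app_name,
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def build_session(app_name="lakehouse"):
    """Create a SparkSession with the settings above (needs pyspark and the Delta jars)."""
    # Imported lazily so delta_spark_conf() can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in delta_spark_conf(app_name).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Keeping the settings in a plain dictionary makes them easy to inspect or reuse in a `spark-submit` invocation.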
Config Format: Use JSON for lakehouse configs, e.g.:
{
  "storage": "s3://my-bucket",
  "engine": "iceberg",
  "auth": {"key": "$DATA_LAKEHOUSE_API_KEY"}
}
Load this via CLI: openclaw skill data-lakehouse apply --config path/to/config.json.
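A config file in this shape can also be generated from a script before it is applied. The sketch below is illustrative only: `make_config` and `write_config` are hypothetical helpers, and the checks (an `s3://` storage URI, an engine of `delta` or `iceberg`) are assumptions drawn from the examples in this document, not rules the CLI is known to enforce.

```python
import json

ALLOWED_ENGINES = {"delta", "iceberg"}  # assumption: the two engines named in this document


def make_config(storage, engine, key_ref="$DATA_LAKEHOUSE_API_KEY"):
    """Return a config dict in the shape shown above, after basic sanity checks."""
    if not storage.startswith("s3://"):
        raise ValueError(f"expected an s3:// URI, got {storage!r}")
    if engine not in ALLOWED_ENGINES:
        raise ValueError(f"unsupported engine {engine!r}")
    # The key is stored as a reference to the environment variable, not the secret itself.
    return {"storage": storage, "engine": engine, "auth": {"key": key_ref}}


def write_config(path, config):
    """Write the config as indented JSON so it can be passed to the apply command."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2)
```

For example, `write_config("config.json", make_config("s3://my-bucket", "iceberg"))` produces a file equivalent to the JSON above.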
Pass --aws-region us-west-2 in CLI commands. For Spark, ensure Delta Lake dependencies (for example the delta-spark package) are in your environment.
Pass $DATA_LAKEHOUSE_API_KEY in chained API calls.
Check that $DATA_LAKEHOUSE_API_KEY is set; use os.environ.get('DATA_LAKEHOUSE_API_KEY') in scripts.
Wrap reads in a try/except block:
try:
    spark.read.format("delta").load("s3://my-bucket/data")
except Exception as e:
    print(f"Error: {e}. Check bucket permissions and retry.")
Retry transient errors like network issues with exponential backoff in API calls.
Use --verbose for detailed logs, e.g., openclaw skill data-lakehouse init --verbose. Validate configs with openclaw skill data-lakehouse validate --file config.json.
Example 1: Building a sales analytics lakehouse:
First, initialize: openclaw skill data-lakehouse init --project sales-ana --storage s3://sales-data. Then ingest data: use the API to create a table, and append via Spark as shown above. Finally, query with: spark.sql("SELECT * FROM sales_data WHERE date > '2023-01-01'").
Example 2: Optimizing an existing lakehouse for IoT data:
Run: openclaw skill data-lakehouse optimize --table iot_metrics --add-index date. This adds an index; verify with a query snippet: df = spark.read.format("iceberg").load("s3://iot-bucket/metrics").filter(col("date") > current_date()), where col and current_date are imported from pyspark.sql.functions. Monitor performance post-optimization.
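The error-handling guidance in this skill recommends retrying transient failures with exponential backoff. One way that might look is sketched below; the helper name `retry_with_backoff`, its default delays, and the choice of `ConnectionError` and `TimeoutError` as retryable are all assumptions, not part of the OpenClaw API.

```python
import random
import time


def retry_with_backoff(func, retries=4, base_delay=0.5, max_delay=8.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(); on a retryable error, wait with a doubling delay and try again."""
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # A little random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

Any API call or Spark read can be wrapped in a zero-argument function and passed in, for example `retry_with_backoff(lambda: spark.read.format("delta").load("s3://my-bucket/data"))`, with `retry_on` widened to whichever exceptions are genuinely transient in that setting.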