skills/bauplanlabs/creating-bauplan-pipelines/SKILL.md
Creates bauplan data pipeline projects with SQL and Python models. Use when starting a new pipeline, defining DAG transformations, writing models, or setting up bauplan project structure from scratch.
This skill guides you through creating a new bauplan data pipeline project from scratch, including the project configuration and SQL/Python transformation models.
NEVER run pipelines on the main branch. Always use a development branch.
Branch naming convention: <username>.<branch_name> (e.g., john.feature-pipeline). Get your username with bauplan info. See Workflow Checklist for exact commands.
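The naming convention and the main-branch rule above can be captured in a small guard. This is a minimal sketch, not part of bauplan: dev_branch_name and assert_not_main are hypothetical helper names, and the username is assumed to have been read from bauplan info beforehand.

```python
def dev_branch_name(username: str, branch_name: str) -> str:
    """Build a branch name following the <username>.<branch_name> convention."""
    if not username or not branch_name:
        raise ValueError("both username and branch_name are required")
    return f"{username}.{branch_name}"


def assert_not_main(branch: str) -> None:
    """Refuse to proceed when the target branch is main."""
    if branch == "main":
        raise RuntimeError("Never run pipelines on main; use a development branch.")


# Hypothetical values taken from the example in the text
branch = dev_branch_name("john", "feature-pipeline")
assert_not_main(branch)
print(branch)  # john.feature-pipeline
```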
Before creating the pipeline, verify that:
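- [...] main)
- [...] bauplan)

A bauplan pipeline is a DAG of functions (models). Key rules:
- [...] bauplan.Model() references
- SQL models: the output table is named after the file (trips.sql → trips)
- Python models: the output table is named after the function (def clean_trips() → clean_trips)
- Expectations: Data quality functions that take tables as input and return a boolean.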
[lakehouse: taxi_fhvhv] ──→ [trips.sql] ──→ [clean_trips] ──→ [daily_summary]
                                                ↑
[lakehouse: taxi_zones] ────────────────────────┘
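The dependency structure in the diagram can be written down as a plain mapping from each node to the tables it reads. The sketch below uses Python's standard graphlib module only to illustrate a valid execution order; it makes no claim about how bauplan itself schedules models.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes it reads from (its parents).
dag = {
    "taxi_fhvhv": set(),                     # source table in the lakehouse
    "taxi_zones": set(),                     # source table in the lakehouse
    "trips": {"taxi_fhvhv"},                 # SQL model (trips.sql)
    "clean_trips": {"trips", "taxi_zones"},  # Python model, two inputs
    "daily_summary": {"clean_trips"},        # Python model, one input
}

# static_order() yields every node after all of its parents.
order = list(TopologicalSorter(dag).static_order())
print(order)
```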
In this example:
- taxi_fhvhv and taxi_zones are source tables (already in lakehouse)
- trips.sql reads from taxi_fhvhv (SQL model, first node)
- clean_trips takes trips and taxi_zones as inputs (Python model, multiple inputs)
- daily_summary takes clean_trips as input (Python model, single input)

Before writing a pipeline, you MUST gather the following information from the user:
- [...] bauplan table get
- Materialization strategy: REPLACE (default) or APPEND?
- Strict mode: whether to run with the --strict flag, which fails on issues like output column mismatches during dry-run, allowing immediate error detection and correction.

If the user hasn't provided this information, ask before proceeding with implementation.
When strict mode is enabled, append --strict to all bauplan run commands:
# Without strict mode (default)
bauplan run --dry-run
bauplan run
# With strict mode enabled
bauplan run --dry-run --strict
bauplan run --strict
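When the commands above are issued from a script, the two flags can be toggled in one place. The following is a minimal sketch: bauplan_run_command is a hypothetical helper that only builds the argument list, using the bauplan run, --dry-run, and --strict spellings shown above, and does not execute anything.

```python
def bauplan_run_command(dry_run: bool = False, strict: bool = False) -> list[str]:
    """Build the argument list for `bauplan run` with optional flags."""
    cmd = ["bauplan", "run"]
    if dry_run:
        cmd.append("--dry-run")
    if strict:
        cmd.append("--strict")
    return cmd


# With strict mode enabled, both the dry run and the real run carry --strict.
print(" ".join(bauplan_run_command(dry_run=True, strict=True)))
print(" ".join(bauplan_run_command(strict=True)))
```

The resulting list can be handed to subprocess.run once the active branch has been confirmed to be a development branch.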
Benefits of strict mode:
A bauplan project is a folder containing:
my-project/
    bauplan_project.yml    # Required: project configuration
    model.sql              # Optional: a single SQL model, one per file
    models.py              # Optional: Python models (one file can hold several models, or they can be split across files)
    expectations.py        # Optional: data quality tests (if any)
Every project is a separate folder which requires this configuration file:
project:
    id: <unique-uuid>       # Generate a unique UUID
    name: <project_name>    # Descriptive name for the project
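One way to produce this file with a fresh UUID is a few lines of standard-library Python. This is a sketch under assumptions: write_project_file is a hypothetical helper, the folder and project names are placeholders, and the file is written to a temporary directory purely for demonstration.

```python
import tempfile
import uuid
from pathlib import Path


def write_project_file(project_dir: Path, name: str) -> Path:
    """Create the project folder and write bauplan_project.yml with a new UUID."""
    project_dir.mkdir(parents=True, exist_ok=True)
    content = f"project:\n    id: {uuid.uuid4()}\n    name: {name}\n"
    path = project_dir / "bauplan_project.yml"
    path.write_text(content)
    return path


# Placeholder folder and project name, written to a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    created = write_project_file(Path(tmp) / "my-project", "my_project")
    project_text = created.read_text()

print(project_text)
```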
IMPORTANT: SQL models should be LIMITED to first nodes in the pipeline graph only.
This ensures consistency and allows for better control over transformations, output schema validation, and documentation.
SQL models are .sql files where:
Use SQL models only when reading from existing lakehouse tables:
-- trips.sql
-- First node: reads from taxi_fhvhv table in the lakehouse
SELECT
    pickup_datetime,
    PULocationID,
    trip_miles
FROM taxi_fhvhv
WHERE pickup_datetime >= '2022-12-01'
Output table: trips (from filename)
Input table: taxi_fhvhv (from FROM clause, exists in lakehouse)
Python models use decorators to define transformations. They should be used for all pipeline nodes except first nodes reading from the lakehouse.
- @bauplan.model() - Registers function as a model
- @bauplan.model(columns=[...]) - Specify expected output columns for validation (optional but recommended)
- @bauplan.model(materialization_strategy='REPLACE') - Persist output to lakehouse
- @bauplan.python('3.11', pip={'pandas': '1.5.3'}) - Specify Python version and packages

IMPORTANT: whenever possible, specify the columns parameter in @bauplan.model() to define the expected output schema. This enables automatic validation of your model's output.
First, check the schema of your source tables to understand input columns. Then specify the output columns based on your transformation:
# If input has columns: [id, name, age, city]
# And transformation drops 'city' column
# Then output columns should be: [id, name, age]
@bauplan.model(columns=['id', 'name', 'age'])
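The reasoning in the comments above (input columns minus dropped columns) can be checked mechanically before it is written into the decorator. This is a plain-Python sketch with hypothetical helper names. It is not bauplan's validation logic, and comparing columns in order is an assumption made here for illustration.

```python
def expected_output_columns(input_columns, dropped=()):
    """Return the input columns, in their original order, minus the dropped ones."""
    dropped_set = set(dropped)
    return [c for c in input_columns if c not in dropped_set]


def check_output_columns(declared, actual):
    """Raise if the actual output columns differ from the declared ones."""
    if list(declared) != list(actual):
        raise ValueError(
            f"column mismatch: declared {list(declared)}, got {list(actual)}"
        )


declared = expected_output_columns(["id", "name", "age", "city"], dropped=["city"])
check_output_columns(declared, ["id", "name", "age"])  # passes silently
print(declared)  # ['id', 'name', 'age']
```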
IMPORTANT: Every Python model should have a docstring that describes the transformation and shows the output table structure as an ASCII table. If the table is too wide, show only the key columns; if cell values are too long, truncate them.
import bauplan

@bauplan.model(columns=['id', 'name', 'age'])
@bauplan.python('3.11')
def clean_users(data=bauplan.Model('raw_users')):
    """
    Cleans user data by removing invalid entries and dropping the city column.

    | id  | name    | age |
    |-----|---------|-----|
    | 1   | Alice   | 30  |
    | 2   | Bob     | 25  |
    """
    # transformation logic
    # Drop 'city' so the output matches the declared columns
    return data.drop_columns(['city'])
columns and filter

IMPORTANT: whenever possible, use the columns and filter parameters in bauplan.Model() to restrict the data read. This enables I/O pushdown, dramatically reducing the amount of data transferred and improving performance. Do not read columns you don't need.
bauplan.Model(
    'table_name',
    columns=['col1', 'col2', 'col3'],  # Only read these columns
    filter="date >= '2022-01-01'"      # Pre-filter at storage level
)
Whenever possible, specify:
- columns: List only the columns your model actually needs
- filter: SQL-like filter expression to restrict rows at the storage level, if appropriate

import bauplan

@bauplan.model(
    columns=['pickup_datetime', 'PULocationID', 'trip_miles'],
    materialization_strategy='REPLACE'
)
@bauplan.python('3.11', pip={'polars': '1.15.0'})
def clean_trips(
    # Use columns and filter for I/O pushdown
    data=bauplan.Model(
        'trips',
        columns=['pickup_datetime', 'PULocationID', 'trip_miles'],
        filter="trip_miles > 0"
    )
):
    """
    Filters trips to include only those with positive mileage.

    | pickup_datetime     | PULocationID | trip_miles |
    |---------------------|--------------|------------|
    | 2022-12-01 08:00:00 | 123          | 5.2        |
    """
    import polars as pl

    df = pl.from_arrow(data)
    df = df.filter(pl.col('trip_miles') > 0.0)
    return df.to_arrow()
Models can take multiple tables as input - just add more bauplan.Model() parameters:
import bauplan

@bauplan.model()
@bauplan.python('3.11')
def model_with_joins(
    table_a=bauplan.Model('source_a', columns=['id', 'value']),
    table_b=bauplan.Model('source_b', columns=['id', 'name'])
):
    # Join, transform, return Arrow table
    return table_a.join(table_b, 'id', 'id')
See examples.md for complete multi-input examples with Polars.
Copy this checklist and track your progress:
Pipeline Creation Progress:
- [ ] Step 1: Get username → bauplan info
- [ ] Step 2: Checkout main → bauplan branch checkout main
- [ ] Step 3: Create dev branch → bauplan branch create <username>.<branch_name>
- [ ] Step 4: Checkout dev branch → bauplan branch checkout <username>.<branch_name>
- [ ] Step 5: Verify source tables → bauplan table get <namespace>.<table_name> (optional data preview: bauplan query "SELECT * FROM <namespace>.<table_name> LIMIT 3")
- [ ] Step 6: Create project folder with bauplan_project.yml
- [ ] Step 7: Write SQL model(s) / Python model(s) for transformations respecting the guidelines
- [ ] Step 8: Verify materialization decorators (see Materialization Checklist below)
- [ ] Step 9: Dry run → bauplan run --dry-run [--strict if strict mode]
- [ ] Step 10: Run pipeline → bauplan run [--strict if strict mode]
CRITICAL: Never run on the main branch. Steps 2-4 ensure you're on a development branch.
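Steps 2 to 5 of the checklist can be generated as a list of commands so that nothing is skipped or reordered. The sketch below only builds the command lines from the checklist and prints them. branch_setup_commands is a hypothetical helper, and the username, branch name, and my_namespace.taxi_fhvhv table reference are placeholder values.

```python
def branch_setup_commands(
    username: str, branch_name: str, tables: list[str]
) -> list[list[str]]:
    """Return the CLI commands for checklist steps 2-5, in order."""
    branch = f"{username}.{branch_name}"
    commands = [
        ["bauplan", "branch", "checkout", "main"],   # Step 2
        ["bauplan", "branch", "create", branch],     # Step 3
        ["bauplan", "branch", "checkout", branch],   # Step 4
    ]
    # Step 5: one schema check per source table
    commands += [["bauplan", "table", "get", table] for table in tables]
    return commands


# Placeholder inputs; the namespace is hypothetical
setup = branch_setup_commands("john", "feature-pipeline", ["my_namespace.taxi_fhvhv"])
for cmd in setup:
    print(" ".join(cmd))
```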
After writing models, verify that each model has the correct materialization_strategy based on user requirements:
| Model Type | No Materialization (intermediate) | Materialized Output |
|------------|-----------------------------------|---------------------|
| Python | @bauplan.model() (no strategy) | @bauplan.model(materialization_strategy='REPLACE') or 'APPEND' |
| SQL | No comment needed | Add comment: -- bauplan: materialization_strategy=REPLACE or APPEND |
Verify for each model:
- [...] materialization_strategy specified
- [...] materialization_strategy='REPLACE' (default) or 'APPEND'
- [...] materialization_strategy='APPEND' is set

Example Python decorator for materialized output:
@bauplan.model(materialization_strategy='REPLACE', columns=['col1', 'col2'])
Example SQL comment for materialized output:
-- bauplan: materialization_strategy=REPLACE
SELECT * FROM source_table
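For SQL models, the checklist item can be verified by scanning the file text for the comment shown above. This is a minimal sketch: sql_materialization_strategy is a hypothetical helper, and the whitespace and case tolerance of the regular expression are assumptions rather than a description of how bauplan parses the comment.

```python
import re

# Matches a line such as: -- bauplan: materialization_strategy=REPLACE
MATERIALIZATION_COMMENT = re.compile(
    r"^\s*--\s*bauplan:\s*materialization_strategy\s*=\s*(\w+)\s*$",
    re.MULTILINE,
)


def sql_materialization_strategy(sql_text: str):
    """Return 'REPLACE' or 'APPEND' if the comment is present, else None."""
    match = MATERIALIZATION_COMMENT.search(sql_text)
    if match is None:
        return None  # intermediate model: no materialization comment
    strategy = match.group(1).upper()
    if strategy not in {"REPLACE", "APPEND"}:
        raise ValueError(f"unknown materialization strategy: {strategy}")
    return strategy


materialized = "-- bauplan: materialization_strategy=REPLACE\nSELECT * FROM source_table\n"
intermediate = "SELECT * FROM source_table\n"
print(sql_materialization_strategy(materialized))  # REPLACE
print(sql_materialization_strategy(intermediate))  # None
```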
See examples.md for: