skills/astronomer/tracing-upstream-lineage/SKILL.md
Trace upstream data lineage. Use when the user asks where data comes from, what feeds a table, upstream dependencies, data sources, or needs to understand data origins.
Trace the origins of data - answer "Where does this data come from?"
Determine what we're tracing:
Tables are typically populated by Airflow DAGs. Find the connection:
Search DAGs by name: Use af dags list and look for DAG names matching the table name
    load_customers -> customers table
    etl_daily_orders -> orders table
Explore DAG source code: Use af dags source <dag_id> to read the DAG definition
Check DAG tasks: Use af tasks list <dag_id> to see what operations the DAG performs
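The name-matching step above can be sketched in Python. This is a minimal illustration, not part of the af CLI: `candidate_dags` is a hypothetical helper, and the list of DAG ids is assumed to have been collected already (for example by reading the output of af dags list by hand).

```python
import re


def candidate_dags(table_name: str, dag_ids: list[str]) -> list[str]:
    """Return DAG ids whose name contains every token of the table name.

    Hypothetical helper for illustration; it only compares strings.
    """
    # Drop any schema prefix: "analytics.orders_daily" -> "orders_daily"
    base = table_name.split(".")[-1].lower()
    tokens = [t for t in re.split(r"[^a-z0-9]+", base) if t]

    matches = []
    for dag_id in dag_ids:
        dag_tokens = set(re.split(r"[^a-z0-9]+", dag_id.lower()))
        if all(t in dag_tokens for t in tokens):
            matches.append(dag_id)
    return matches


# Assumed sample of DAG ids, using names from this document.
dag_ids = ["load_customers", "etl_daily_orders", "ingest_orders"]
print(candidate_dags("customers", dag_ids))  # ['load_customers']
print(candidate_dags("orders", dag_ids))     # ['etl_daily_orders', 'ingest_orders']
```

A name match is only a hint. Confirm it by reading the DAG source, since a DAG can write to a table whose name it does not share.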
If you're running on Astro, the Lineage tab in the Astro UI provides visual lineage exploration across DAGs and datasets. Use it to quickly trace upstream dependencies without manually searching DAG source code.
If you're not on Astro, use DAG source code and task logs to trace lineage (there is no built-in cross-DAG lineage UI).
From the DAG code, identify source tables and systems:
SQL Sources (look for FROM clauses):
-- In DAG code:
SELECT * FROM source_schema.source_table  -- <- This is an upstream source
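As a rough sketch of the FROM-clause scan described above, the following Python pulls table names out of a SQL string with a regular expression. It is a heuristic under stated assumptions: `upstream_tables` is a hypothetical helper, the SQL is assumed to be plain text already extracted from the DAG source, and the pattern will also pick up CTE names while missing templated or dynamically built table names.

```python
import re

# Matches the identifier that follows FROM or JOIN, e.g. "schema.table".
_TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def upstream_tables(sql: str) -> list[str]:
    """Return table names referenced after FROM/JOIN, in order, without duplicates."""
    seen: list[str] = []
    for name in _TABLE_REF.findall(sql):
        if name not in seen:
            seen.append(name)
    return seen


# Sample query using table names from this document.
sql = """
SELECT o.id, c.name
FROM raw.orders AS o
JOIN dim.customers AS c ON c.id = o.customer_id
"""
print(upstream_tables(sql))  # ['raw.orders', 'dim.customers']
```

For anything beyond a quick scan, a real SQL parser is a safer choice than a regular expression.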
External Sources (look for connection references):
S3Operator -> S3 bucket source
PostgresOperator -> Postgres database source
SalesforceOperator -> Salesforce API source
HttpOperator -> REST API source
File Sources:
Recursively trace each source:
TARGET: analytics.orders_daily
    ^
    +-- DAG: etl_daily_orders
            ^
            +-- SOURCE: raw.orders (table)
            |       ^
            |       +-- DAG: ingest_orders
            |               ^
            |               +-- SOURCE: Salesforce API (external)
            |
            +-- SOURCE: dim.customers (table)
                    ^
                    +-- DAG: load_customers
                            ^
                            +-- SOURCE: PostgreSQL (external DB)
For each upstream source:
af dags stats
When tracing a specific column:
source.col AS target_col
COALESCE(a.col, b.col) AS target_col
SUM(detail.amount) AS total_amount
One-line answer: "This table is populated by DAG X from sources Y and Z"
[Salesforce] --> [raw.opportunities] --> [stg.opportunities] --> [fct.sales]
                          |                       |
                  DAG: ingest_sfdc      DAG: transform_sales
| Source | Type | Connection | Freshness | Owner |
|--------|------|------------|-----------|-------|
| raw.orders | Table | Internal | 2h ago | data-team |
| Salesforce | API | salesforce_conn | Real-time | sales-ops |
Describe how data flows and transforms:
raw.orders via Salesforce API sync
transform_orders cleans and dedupes into stg.orders
build_order_facts joins with dimensions into fct.orders