claude/skills/bigquery/SKILL.md
Use bigquery CLI (instead of `bq`) for all Google BigQuery and GCP data warehouse operations including SQL query execution, data ingestion (streaming insert, bulk load, JSONL/CSV/Parquet), data extraction/export, dataset/table management, cost estimation with dry-run, authentication with gcloud, data pipelines, ETL workflows, and MCP server integration for AI-assisted querying. Modern TypeScript/Bun implementation replacing the Python `bq` CLI with instant startup (~10ms vs ~500ms), automatic cost awareness with confirmation prompts, and native streaming support (JSONL). Handles both small-scale streaming inserts (<1000 rows) and large-scale bulk loading (>10MB files) from Cloud Storage.
npx skillsauth add lanej/dotfiles bigqueryInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
The CLI is the full interface. MCP tools are a subset.
Default to the bigquery CLI for any operation not listed in the MCP tools section below. Use MCP when an agent needs structured tool call semantics.
Uses Google Cloud SDK (gcloud) ADC — no separate BigQuery auth required.
bigquery auth check # verify status and scopes
| Command | Default | Notes |
|---------|---------|-------|
| query | JSON | includes metadata (bytes, cost, cacheHit) |
| datasets list | text | human-readable |
| tables list | text | human-readable |
All commands accept --format json|text|jsonl.
# Basic (automatic dry-run + cost confirmation)
bigquery query "SELECT * FROM dataset.table LIMIT 10"
# Skip cost confirmation (required for automation)
bigquery query --yes "SELECT COUNT(*) FROM dataset.table"
# JSONL output (one object per line — for pipelines/DuckDB)
bigquery query --jsonl "SELECT * FROM dataset.table"
# Stream large results
bigquery query "SELECT * FROM large_table" --stream > results.jsonl
# From file / stdin
bigquery query --file query.sql
cat query.sql | bigquery query --stdin --yes
# Dry run (cost estimate, $0 to run)
bigquery dry-run "SELECT * FROM large_table WHERE date >= '2025-01-01'"
Cost Awareness: auto dry-run before every query; prompts if cost > threshold (default: 1 GB); response includes bytesProcessed, estimatedCostUSD, cacheHit.
Key flags: --yes, --format, --jsonl, --stream, --cost-threshold, --max-results, --project
Important: --max-results limits returned rows, NOT bytes scanned (no cost reduction). Use LIMIT in SQL instead.
bigquery datasets list [--project PROJECT]
bigquery datasets describe PROJECT.DATASET
bigquery datasets create PROJECT.DATASET [--description "..." --location US --default-ttl 30]
bigquery datasets update PROJECT.DATASET --description "..."
bigquery datasets delete PROJECT.DATASET [--force --yes]
# Read
bigquery tables list PROJECT.DATASET [--max-results 20]
bigquery tables describe PROJECT.DATASET.TABLE
# Write
bigquery tables create TABLE --schema '[{"name":"id","type":"STRING"}]'
bigquery tables update TABLE --description "..."
# Insert (streaming, <1000 rows)
bigquery tables insert TABLE file.jsonl --format jsonl
cat data.jsonl | bigquery tables insert TABLE - --format jsonl
# Load (bulk, >10MB — from Cloud Storage)
bigquery tables load TABLE gs://bucket/data.csv --format csv --write-disposition WRITE_TRUNCATE
# Copy / extract / delete
bigquery tables copy SOURCE DEST [--write-disposition WRITE_TRUNCATE]
bigquery tables extract TABLE gs://bucket/output.csv --format csv
bigquery tables delete TABLE
tables load gotcha: does NOT accept --yes; --write-disposition requires uppercase (WRITE_APPEND, not append).
bigquery jobs list [--state-filter running|done --all-users]
bigquery jobs get JOB_ID
bigquery jobs cancel JOB_ID
# query via stdin
result = subprocess.run(
["bigquery", "query", "--format", "jsonl", "--yes", "--stdin"],
input=sql, capture_output=True, text=True, check=True
)
# bulk load (no --yes for tables load)
subprocess.run(
["bigquery", "tables", "load", "--format", "json",
"--write-disposition", "WRITE_APPEND", table, tmpfile],
check=True
)
bigquery mcp stdio # read-only (default)
bigquery mcp stdio --enable-writes # with write tools
bigquery mcp http --port 3000 # HTTP mode
In Claude Code the full identifiers are mcp__bigquery__*.
Read tools (always available):
| Tool | Description |
|------|-------------|
| query | Execute SQL with auto dry-run and cost guard |
| dry_run | Estimate cost without executing ($0) |
| list_datasets | List datasets; returns bigquery://project.dataset URIs |
| list_tables | List tables; returns bigquery://project.dataset.table URIs |
| describe_table | Full schema, row count, metadata |
Write tools (require --enable-writes):
| Tool | Description |
|------|-------------|
| create_dataset | Create a new dataset |
| create_table | Create table with schema, partitioning, clustering |
| insert_rows | Streaming insert (<1000 rows) |
| load_data | Load from GCS URI (bulk) |
| execute_ddl | Run DDL/DML (CREATE, INSERT, UPDATE, etc.) |
Tables are exposed as resources via bigquery://{project.dataset.table}. Fetching a resource returns the same output as describe_table — schema, row count, metadata — without consuming a tool call slot.
bigquery://easypost-platform.DORA.ai_attribution
list_datasets → discover datasets (returns URI list)list_tables → explore tables (returns URI list)bigquery://project.dataset.table → get schemadry_run → estimate costquery → execute{
"mcpServers": {
"bigquery": {
"command": "bigquery",
"args": ["mcp", "stdio"]
}
}
}
table_raw — append-only write target; never UPDATE/DELETE in-placetable (VIEW) — read surface; deduplicates via ROW_NUMBER() OVER (PARTITION BY <key> ORDER BY loaded_at DESC) = 1loaded_at TIMESTAMP REQUIRED; set to CURRENT_TIMESTAMP() at write timeloaded_attables load silently ignores local file pathsbigquery tables load requires a Cloud Storage URI (gs://...). Passing a local path returns exit code 0 and prints nothing — the table is not written and downstream reads return 0 rows. Use tables insert for local files.
bigquery tables insert --format jsonl --yes <project.dataset.table> <local-file.jsonl>
bigquery insert (without tables) does not exist — returns "unknown command". File must be JSONL. --yes skips the cost confirmation prompt.
After bigquery tables create, metadata takes up to ~50 seconds to propagate. Calling tables insert immediately returns a "Not found: Table" error. Wrap the insert in a retry loop:
for attempt in range(10):
result = subprocess.run([...tables insert...], capture_output=True, text=True)
if result.returncode == 0:
break
if "Not found" in result.stderr:
time.sleep(5)
continue
raise RuntimeError(result.stderr)
else:
raise RuntimeError("table not available after 50s")
After a streaming insert, DELETE FROM table WHERE TRUE and TRUNCATE TABLE fail:
UPDATE or DELETE statement over table would affect rows in the streaming buffer, which is not supported
This window lasts up to 90 minutes. For WRITE_TRUNCATE semantics (replace all rows on each run), drop and recreate:
subprocess.run(["bigquery", "tables", "delete", table], capture_output=True)
_ensure_table(table) # recreate with schema
# propagation delay applies after recreate — use retry loop from (C)
DATE columns in JSONL output return {"value": "YYYY-MM-DD"} dicts, not strings. TIMESTAMP columns return epoch float strings ("1.760730316667E9"), not ISO strings.
def _parse_date(v) -> str:
if isinstance(v, dict): return str(v["value"])[:10]
return str(v)[:10]
def _parse_ts(v) -> datetime:
try: return datetime.fromisoformat(str(v))
except ValueError: return datetime.fromtimestamp(float(v), tz=timezone.utc)
Numeric columns arrive as strings. Cast before aggregation:
df["col"] = pd.to_numeric(df["col"], errors="coerce").fillna(0).astype("int64")
carrier_accounts join fan-out: ep-core-data.raw.carrier_accounts has ~77K duplicate IDs. Always deduplicate with SELECT DISTINCT id, ... before joining. Validate: COUNT() after join ≈ COUNT() before join.
CTE/column alias collision: BQ resolves ORDER BY col to the CTE (a STRUCT) if a CTE and column alias share the same name → "Ordering by expressions of type STRUCT is not allowed." Rename the column alias.
Pre-resolve dimension lookups: shipments (4.2TB) is not clustered by user_id. Joining to dim.users to filter by SF account scans the full date partition (77GB+). Instead: resolve dim IDs in a cheap query first, then use IN clause on the fact table — no join needed.
Salesforce created_date is the Polytomic sync timestamp (Feb 2025+), not the SF creation date. Use close_date or SF-native date fields for historical queries.
salesforce.tasks.created_date is INT64 nanoseconds — convert with TIMESTAMP_MILLIS(CAST(created_date / 1000000 AS INT64)).
UNNEST literal size limit: When building a SQL query that passes a large set of keys via UNNEST([key1, key2, ...]) as a literal array in the SQL string (e.g., 50k+ values), the bigquery CLI subprocess fails with a JSON parse error like "Expecting ',' delimiter: line 1 column 1517". The SQL itself may be syntactically valid but the CLI hits an input/output limit.
Fix: Don't embed large key sets in SQL literals. Instead, fetch all rows from the target table (with a REGEXP_CONTAINS or similar pre-filter if possible) and do key matching client-side in Python. This is the same approach used in extract_phab.py's _build_revision_lookup — fetch all revisions, parse ticket refs in Python, intersect with the target key set.
bigquery auth check
bigquery query --yes "SELECT ..."
bigquery dry-run "SELECT ..."
bigquery datasets list
bigquery tables list PROJECT.DATASET
bigquery tables describe TABLE
bigquery tables insert TABLE file.jsonl --format jsonl
bigquery tables load TABLE gs://bucket/file.csv --write-disposition WRITE_TRUNCATE
bigquery mcp stdio
data-ai
Delegate research and context-gathering tasks to a sub-agent to protect the primary context window. Use when the user asks to "research X", "look into X", "find out about X", "gather context on X", or any investigative framing where answering requires 2+ searches or multiple sources. Also use proactively before starting substantive work when prior context is unknown. Never run research inline — always delegate.
documentation
--- name: qmd-math description: Math notation conventions for Quarto/EPQ documents rendered via lualatex. Use when: writing or adding a formula, equation, or mathematical expression to a .qmd file; asked about display math, inline math, or LaTeX notation in a QMD/Quarto context; defining a where-clause or variable definitions for an equation; converting prose variable descriptions into structured math notation; fixing math that renders badly in a PDF; using \lvert, \begin{aligned}, \tfrac, \text
development
Trim a prose document (README, design doc, blog post, notes) for readability by cutting redundancy, filler, and dead weight in the author's own words. Invoke with /trim [file path], or /trim alone to be prompted for a file. Not for source code, data files, or summarization.
business
Query and analyze Josh Lane's org headcount from the staffing DuckDB at ~/workspace/areas/staffing/staffing.duckdb. Use when asked about headcount counts, org structure, direct reports, team breakdown, hiring/attrition trends, international employees, salary/pay grade distribution, offboarding lag, or any question about people in Josh's org. Triggers on questions about how many people, who reports to whom, headcount by team/country/level, who joined or left, org size, staffing, headcount trend.