skills/tier-1-foundation/semantic-layer-audit/SKILL.md
Audit and maintain a data semantic layer for AI agents. Scans BigQuery datasets, GCP APIs, secrets, and service accounts to keep your data catalog current.
npx skillsauth add pbc-os/agent-skills-public semantic-layer-auditInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Why this matters: AI agents can only use data they know about. This skill maintains a living data catalog so you never lose track of available datasets, APIs, or credentials.
gcloud auth login)GCP_PROJECT_ID set to your project# Set your project
export GCP_PROJECT_ID="your-project-id"
# Verify access
gcloud config set project $GCP_PROJECT_ID
bq ls --project_id=$GCP_PROJECT_ID
# Run full infrastructure audit
python3 scripts/audit_infrastructure.py > audit-results.json
# Review and update your semantic layer doc
# (see templates/SEMANTIC-LAYER-TEMPLATE.md)
| Trigger | Action | |---------|--------| | "What data do we have?" | Full audit | | "Audit data sources" | Full audit | | After creating views | Document with full schema + usage guidance | | After new integrations | Infrastructure scan | | Weekly maintenance | Cron: catch schema changes | | Before complex analysis | Verify sources exist |
Most data catalogs are just lists of table names. This leads to:
Every table needs these fields:
| Field | Required | Example |
|-------|----------|---------|
| Name | ✅ | orders |
| Description | ✅ | "All completed orders" |
| Key Fields | ✅ | order_id, created_at, total |
| Granularity | ✅ | Per-order |
| Source | ✅ | "Synced from Shopify API" |
Views require MORE documentation because users need to know when to use them vs raw tables:
#### `daily_sales_summary` (VIEW)
**Purpose:** Pre-aggregated daily sales metrics. Faster than aggregating raw orders.
**When to use:**
- Daily/weekly/monthly revenue trends
- High-level reporting dashboards
- Quick "how did yesterday go?" questions
**When NOT to use:**
- Need individual order details → use `orders`
- Need customer-level data → use `orders` joined with `customers`
| Column | Type | Description |
|--------|------|-------------|
| `date` | DATE | Sales date |
| `total_revenue` | FLOAT64 | Sum of order totals (dollars) |
| `order_count` | INT64 | Number of orders |
| `avg_order_value` | FLOAT64 | Revenue / orders |
**Example:**
\`\`\`sql
-- Last 30 days revenue trend
SELECT date, total_revenue, order_count
FROM `project.dataset.daily_sales_summary`
WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
ORDER BY date
\`\`\`
When multiple tables/views could answer similar questions, add a decision table:
### Choosing the Right Sales Data Source
| Question | Use This | Why |
|----------|----------|-----|
| "Total revenue last month?" | `daily_sales_summary` | Pre-aggregated, fast |
| "Revenue by product category?" | `order_items` | Has product details |
| "Specific customer's orders?" | `orders` | Has customer_id |
| "Hour-by-hour sales today?" | `orders` + aggregate | Summary is daily only |
Document when underlying systems changed (critical for historical analysis):
### 🚨 SYSTEM TRANSITION
| Entity | Old System | New System | Cutover Date |
|--------|------------|------------|--------------|
| Orders | Legacy POS | Shopify | Jan 2024 |
**Implications:**
- YoY comparisons spanning the cutover need both systems
- Pre-Jan-2024 data in `legacy_orders`, post in `orders`
- Some fields don't exist in legacy (e.g., `discount_code`)
When different sources show different numbers, explain WHY:
### 💡 Why Shopify Revenue ≠ Accounting Revenue
**Observation:** Shopify shows $100k, QuickBooks shows $94k for same month.
**Explanation:**
- Shopify = Gross merchandise value (what customers paid)
- QuickBooks = Net revenue (after refunds, chargebacks, payment fees)
**Neither is wrong.** Use Shopify for customer-facing metrics, QuickBooks for P&L.
python3 scripts/audit_infrastructure.py
The script scans:
Output is JSON for easy processing.
Search your agent's memory for data access patterns not yet documented:
# Find BigQuery queries in recent sessions
memory_search "BigQuery query SELECT FROM"
# Find new views or tables created
memory_search "CREATE VIEW CREATE TABLE"
# Find API usage patterns
memory_search "API endpoint curl"
Look for:
Compare audit results against your SEMANTIC-LAYER.md:
Completeness:
Quality (the important part):
Update your data catalog with:
Log ad-hoc discoveries for future audits:
## Discovery Log
| Date | Discovery | Source | Added |
|------|-----------|--------|-------|
| 2024-01-15 | Tips not in revenue | Debug session | ✅ |
| 2024-01-18 | New product_prices view | Created for analysis | ✅ |
Maintain a routing table at the TOP of your semantic layer doc:
| Data Need | Source | Notes |
|-----------|--------|-------|
| **Daily Sales** | `daily_sales_summary` | Preferred - pre-aggregated |
| **Order Details** | `orders` | Full order records |
| **Product Prices** | `product_prices` (VIEW) | Use this! |
| **Legacy Orders** | `legacy_orders` | Pre-2024 only |
Use "Preferred" or "Use this!" to guide agents to the right source.
When a table/view is superseded:
In the table listing:
| `old_table` | ⚠️ **Use `new_view` instead** - Legacy | ... |
Add a deprecation section:
## ⚠️ Deprecated Sources
| Old Source | Use Instead | Reason |
|------------|-------------|--------|
| `order_line_items` | `product_sales` view | Better schema, faster |
| `sales_raw` | `orders` | Renamed, same data |
| Variable | Required | Description |
|----------|----------|-------------|
| GCP_PROJECT_ID | Yes | Your GCP project ID |
| SEMANTIC_LAYER_PATH | No | Path to your doc (default: docs/SEMANTIC-LAYER.md) |
Edit scripts/audit_infrastructure.py to:
SKIP_DATASETS)| File | Purpose |
|------|---------|
| scripts/audit_infrastructure.py | Infrastructure scanner |
| templates/SEMANTIC-LAYER-TEMPLATE.md | Starter template |
Weekly audit to catch schema changes:
0 6 * * 0
Text: "Run semantic layer audit - check for new BigQuery tables, views, API changes, and undocumented data patterns"
After each audit:
{
"project_id": "your-project",
"audit_time": "2024-01-15T10:30:00",
"bigquery": {
"datasets": [
{
"dataset_id": "sales_data",
"tables": [
{"table_id": "orders", "type": "TABLE", "row_count": 50000},
{"table_id": "daily_summary", "type": "VIEW"}
]
}
]
},
"secrets": ["api-key-service-a", "oauth-token-service-b"],
"apis": ["bigquery.googleapis.com", "secretmanager.googleapis.com"]
}
tools
Generate and iteratively refine USPTO-style patent figure drawings from provisional patent application markdown files, using nano-banana for v1 generation and targeted single-fix edits for v2+ iteration.
data-ai
Weekly revenue / sales forecasting for small businesses with multiple locations or product lines. Blends recent trend + seasonal baseline + YoY growth with per-entity holiday multipliers and week-of-month adjustments. Ships autoresearch-compatible eval and parameters so you can tune it on your own historical data.
data-ai
Analyze email, calendar, and file patterns to discover repeatable workflows that AI agents can automate.
testing
Automated daily digest for small business owners. Combines email triage, calendar agenda, open tasks, and business KPIs into a single morning briefing. Composable — works with whatever data sources are available. Urgent emails require body inspection and explicit escalation signals — never classified from sender/timing metadata alone.