skills/dewey/SKILL.md
Use when "query Dewey Data", "deweydata.io", "SafeGraph places/patterns/spend", "Advan foot traffic", "POI / points of interest", "mobility data", "dataplor", "Veraset", "PassBy", "crypto/Bitcoin ATM locations", or any pull from the Dewey Data academic marketplace (UVA/NYU Platform Subscription) via the deweypy/deweydatapy client, DuckDB, or the Dewey MCP server.
npx skillsauth add edwinhu/workflows dewey

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Dewey Data is an academic data marketplace: one institutional Platform Subscription unlocks a catalog of ~250 datasets from ~40 providers (foot traffic, POI, mobility, consumer transactions, real estate, labor). UVA Library and NYU both hold the institutional subscription; SafeGraph and most providers are free under it.
Dewey is not a SQL warehouse like WRDS. Data is delivered as partitioned Parquet/CSV.gz files downloaded via an API key. You discover datasets, read metadata, sample, filter (by date partition + columns), then download. Think "S3 of presigned Parquet links," not "PostgreSQL."
| | WRDS | Dewey |
|---|---|---|
| Data | Finance/accounting | POI, foot traffic, mobility, consumer, real estate |
| Access | PostgreSQL / SAS on the grid | File download (Parquet/CSV.gz) via API key |
| Query engine | server-side SQL | DuckDB over the files (local or remote presigned URLs) |
| Licensing | per-vendor, negotiated | one platform subscription unlocks the catalog |
| AI access | none | MCP server (api.deweydata.io/mcp) |
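Because delivery is a set of partitioned files rather than a queryable table, scoping a pull comes down to choosing file records by partition key before anything is downloaded. The sketch below illustrates that idea only. It assumes each file record is a dict with the hypothetical keys `partition_key` (an ISO date string), `file_size_bytes`, and `link`; check the real field names against what your client returns.

```python
from datetime import date


def select_files(files, after, before):
    """Keep file records whose partition date falls within [after, before]."""
    lo, hi = date.fromisoformat(after), date.fromisoformat(before)
    return [f for f in files if lo <= date.fromisoformat(f["partition_key"]) <= hi]


def total_gb(files):
    """Sum the (assumed) file_size_bytes field and express it in GB."""
    return sum(f["file_size_bytes"] for f in files) / 1e9


# Hypothetical listing; keys, sizes, and URLs are illustrative, not real Dewey output.
listing = [
    {"partition_key": "2023-01-01", "file_size_bytes": 2_000_000_000,
     "link": "https://example.invalid/a.parquet"},
    {"partition_key": "2023-02-01", "file_size_bytes": 3_000_000_000,
     "link": "https://example.invalid/b.parquet"},
    {"partition_key": "2024-01-01", "file_size_bytes": 5_000_000_000,
     "link": "https://example.invalid/c.parquet"},
]

window = select_files(listing, "2023-01-01", "2023-12-31")
print(len(window), "files,", total_gb(window), "GB in the study window")
```

Checking the count and size of the selected files first is a cheap way to confirm the date window is right before committing bandwidth.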
<EXTREMELY-IMPORTANT>
Never write apikey = "your_api_key" and run it: it will 401 and waste a round trip. Read the key from the DEWEY_API_KEY env var or a gitignored file (~/.config/dewey/apikey).

Guessing or hardcoding the key is NOT HELPFUL: every call 401s, and a committed key is a security incident the user must rotate.
</EXTREMELY-IMPORTANT>
Each product (dataset) has its own product path / project ID (prj_…), obtained from the dataset page: Get Data → (Skip filtering) → Connect to API / Bulk API → API URL. One API key, many product paths. If you don't have the product path, discover it via the MCP server (search_datasets) rather than guessing.
Before downloading ANY Dewey dataset, you MUST:
1. Read the metadata: get_meta (deweydatapy) / get_download_info (MCP) to learn partition columns, date range, file count, total size
2. Sample (read_sample / MCP sample_dataset) and INSPECT the schema before committing to a full pull
3. Filter by date (partition_key_after/before) AND columns; for selective pulls use DuckDB COPY TO over the presigned URLs, never download the whole catalog

This is not negotiable. Skipping the sample-and-filter step is NOT HELPFUL: Dewey datasets are routinely hundreds of GB to multiple TB; an unfiltered pull burns hours of bandwidth and disk for data you'll immediately throw away.

- Use DuckDB COPY TO with a WHERE clause on the remote parquet to pull only the rows/columns you need.
- Column names are not guaranteed (naics_code vs NAICS_CODE; opened_on may not exist at all). A full pull against guessed columns is the exact incompetence the sample step exists to prevent: run read_sample(nrows=100) BEFORE the full pull.
- Scope partition_key_after/before to the study window.
- Presigned links expire after 24h (a limit for download_files0). For large multi-day pulls use download_files1 (page-by-page, refreshes links); a long job on download_files0 dies mid-pull.
- A guessed prj_ product path 404s or returns someone else's data. Get the path from Connect to API or MCP search_datasets; hardcoding a guessed path is an unverified claim presented as fact.

- Calling download_files* without first calling get_meta + read_sample → STOP. Meta + sample first.
- No start_date/end_date / partition filter → STOP. Scope the date range.
- Use COPY TO … (FORMAT PARQUET, PARTITION_BY …) to persist a filtered subset to disk.
- Writing apikey="your_api_key" or any guessed key → STOP. Ask the user; read from env/file.

| Need | Method | Reference |
|------|--------|-----------|
| Discover/search datasets, check schema, sample — from inside Claude | MCP server (api.deweydata.io/mcp) | references/mcp.md |
| Scripted Python bulk download | deweypy (recommended) or deweydatapy (legacy, product_path API) | references/deweypy-client.md |
| Selective pull — specific columns/rows from huge datasets | DuckDB over presigned URLs (read_parquet($urls) + COPY TO) | references/duckdb.md |
| R workflow | deweyr (download_dewey()) | references/deweypy-client.md |
| One-off, dataset < 2.0 GB | UI CSV download (platform → project) | references/access-options.md |
| Analyze data already on disk | DuckDB / pandas / polars over *.parquet or *.csv.gz | references/access-options.md |
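The selective-pull row above (read_parquet over presigned URLs plus COPY TO) can be sketched as a small statement builder. This is an illustration, not the skill's own helper: `build_copy_sql` is a hypothetical name, the URL is a placeholder, and the WHERE text is interpolated as-is, so it must come from trusted code. The builder refuses to produce an unscoped statement, mirroring the filter-by-date-and-columns rule.

```python
def build_copy_sql(urls, columns, where, out_dir, partition_by=None):
    """Return a DuckDB COPY statement that persists a filtered subset as Parquet."""
    if not urls:
        raise ValueError("no presigned URLs supplied")
    if not columns:
        raise ValueError("name the columns explicitly; do not SELECT *")
    if not where or not where.strip():
        raise ValueError("a WHERE clause is required to scope the pull")

    def quote(text):
        # Single-quote a SQL string literal, doubling embedded quotes.
        return "'" + text.replace("'", "''") + "'"

    url_list = ", ".join(quote(u) for u in urls)
    options = ["FORMAT PARQUET"]
    if partition_by:
        options.append(f"PARTITION_BY ({', '.join(partition_by)})")
    return (
        f"COPY (SELECT {', '.join(columns)} "
        f"FROM read_parquet([{url_list}]) WHERE {where}) "
        f"TO {quote(out_dir)} ({', '.join(options)})"
    )


# Placeholder URL and output directory; real links come from the file listing.
sql = build_copy_sql(
    urls=["https://example.invalid/part-0.parquet"],
    columns=["PLACEKEY", "LOCATION_NAME", "REGION"],
    where="NAICS_CODE = '522320' AND ISO_COUNTRY_CODE = 'US'",
    out_dir="btm_us",
    partition_by=["REGION"],
)
print(sql)
```

The resulting string can be passed to a DuckDB connection's execute method. Reading https URLs relies on DuckDB's httpfs extension being available in your installation.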
Get the key once from app.deweydata.io → Connections → Add Connection → API Key. Store it out of source control:
mkdir -p ~/.config/dewey && echo 'YOUR_KEY' > ~/.config/dewey/apikey && chmod 600 ~/.config/dewey/apikey
# or: export DEWEY_API_KEY=... (add to .envrc, which should be gitignored)
import os, pathlib
apikey = os.environ.get("DEWEY_API_KEY") or pathlib.Path("~/.config/dewey/apikey").expanduser().read_text().strip()
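The one-liner above raises a bare FileNotFoundError when neither source exists and will happily return a placeholder string. A slightly more defensive variant is sketched below; `load_dewey_key` and the `PLACEHOLDERS` set are hypothetical names introduced here for illustration, not part of any Dewey client.

```python
import os
import pathlib

# Values that should never be sent as a key (assumed list; extend as needed).
PLACEHOLDERS = {"", "your_api_key", "YOUR_KEY"}


def load_dewey_key(env=None, key_file="~/.config/dewey/apikey"):
    """Return the key from DEWEY_API_KEY or the key file; never a placeholder."""
    env = os.environ if env is None else env
    key = (env.get("DEWEY_API_KEY") or "").strip()
    if not key:
        path = pathlib.Path(key_file).expanduser()
        if path.is_file():
            key = path.read_text().strip()
    if key in PLACEHOLDERS:
        raise RuntimeError(
            "No usable Dewey API key found; set DEWEY_API_KEY or create the key file."
        )
    return key


# Demonstration with an explicit environment mapping instead of the real one.
print(load_dewey_key(env={"DEWEY_API_KEY": " abc123 "}))  # prints abc123
```

Failing loudly before the first request avoids the wasted 401 round trip and gives the user a clear prompt to supply the key.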
Institutional login (to browse the catalog / create the key) is via UVA NetBadge (use your UVA email) or NYU SSO. The Platform Subscription is what makes SafeGraph etc. free — see references/datasets.md.
| Provider | Dataset(s) | What it is |
|----------|------------|------------|
| SafeGraph | Global Places (POI), Geometry, Spend, Patterns | POI master, building footprints, card spend, foot-traffic visit patterns |
| Advan Research | Monthly/Weekly Patterns, Home Panel | Foot traffic aggregated to place & census-block |
| dataplor | POI | Global POI, strong emerging-markets coverage |
| Veraset | Movement | Device-level mobility (institutional license only) |
| PassBy | Foot Traffic | Per-POI foot-traffic analytics |
| Consumer Edge / PDI | Spend / transactions | Card & product-level purchasing |
| LinkUp | Job postings | Labor-market activity |
| ATTOM / Dwellsy / RentHub | Real estate | Property records, rentals |
Full catalog (all ~250 datasets): references/catalog.md — every dataset grouped by category with time coverage, row count, size, and download access (machine-readable: references/catalog.csv). Featured-dataset detail + discovery workflow: references/datasets.md.
Core POI schema — columns are UPPERCASE, NAICS_CODE is a string, BRANDS is a JSON-array string (extract with json_extract_string(BRANDS,'$[0].safegraph_brand_name')). Always sample before filtering.
| Column | Meaning |
|--------|---------|
| PLACEKEY | Stable unique POI id (join key across SafeGraph products) |
| LOCATION_NAME | POI name |
| BRANDS | JSON array: [{"safegraph_brand_name":"…"}] — not plain text |
| STREET_ADDRESS,CITY,REGION,POSTAL_CODE,ISO_COUNTRY_CODE | Address (REGION=US state) |
| LATITUDE,LONGITUDE | Coordinates |
| NAICS_CODE,NAICS_CODE_2022 | 6-digit NAICS (string) |
| TOP_CATEGORY,SUB_CATEGORY | Category labels |
| OPENED_ON,CLOSED_ON,TRACKING_CLOSED_SINCE | Open/close dates (exist but sparsely populated — NULL for BTMs) |
Resolved empirically: crypto/Bitcoin ATMs do exist as standalone POIs under NAICS_CODE='522320'; all major operators are present. But OPENED_ON/CLOSED_ON are NULL for BTMs in the current release → it's a cross-section, not a time series. Full details, the 7 BTM operators, and the worked example: references/safegraph-places.md and examples/btm_safegraph_pull.py.
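The two schema quirks above (NAICS_CODE is a string, BRANDS is a JSON-array string) are easy to get wrong once rows are in Python. The following sketch runs on made-up rows: the placekeys and the brand name "Example BTM Operator" are hypothetical stand-ins, and the real operator brands are listed in references/safegraph-places.md.

```python
import json

BTM_NAICS = "522320"  # compared as a string, matching the column type


def brand_names(brands_json):
    """Return the brand names stored in a BRANDS JSON-array string."""
    if not brands_json:
        return []
    return [
        b["safegraph_brand_name"]
        for b in json.loads(brands_json)
        if b.get("safegraph_brand_name")
    ]


def btm_rows(rows, operators):
    """Keep rows with the BTM NAICS code and at least one listed operator brand."""
    return [
        r for r in rows
        if r.get("NAICS_CODE") == BTM_NAICS
        and operators.intersection(brand_names(r.get("BRANDS")))
    ]


# Hypothetical sample rows; real data should come from a sample pull first.
rows = [
    {"PLACEKEY": "aaa", "NAICS_CODE": "522320",
     "BRANDS": '[{"safegraph_brand_name": "Example BTM Operator"}]'},
    {"PLACEKEY": "bbb", "NAICS_CODE": "722511", "BRANDS": "[]"},
    {"PLACEKEY": "ccc", "NAICS_CODE": "522320", "BRANDS": None},
    {"PLACEKEY": "ddd", "NAICS_CODE": 522320,  # int, so it does not match
     "BRANDS": '[{"safegraph_brand_name": "Example BTM Operator"}]'},
]

hits = btm_rows(rows, operators={"Example BTM Operator"})
print([r["PLACEKEY"] for r in hits])
```

Row "ddd" is dropped because its NAICS code was read as an integer, which is the kind of silent mismatch that sampling the schema first is meant to catch.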
- references/access-options.md: all download methods (UI, deweypy, deweydatapy, DuckDB, MCP, R), 24h link expiry, partitioning, reading data on disk
- references/deweypy-client.md: deweypy (modern CLI + auth/download) and deweydatapy (get_meta, get_file_list, read_sample, download_files0/1) function reference; deweyr for R
- references/duckdb.md: selective remote-Parquet pulls, COPY TO … PARTITION_BY pattern, querying downloaded files
- references/mcp.md: Dewey MCP server URL, JSON config, the 9 tools, discovery → schema → sample workflow
- references/datasets.md: featured-dataset catalog, UVA NetBadge / NYU institutional access, discovery workflow
- references/catalog.md + catalog.csv: full enumerated catalog (~250 datasets / 39 partners) by category, with coverage / rows / column count / size / access
- references/schemas.json: full column schemas for all ~250 datasets (keyed by slug → columns[] with name/type/description; 11,264 columns). Look up a dataset's columns here before pulling, instead of a live get_dataset_schema call
- references/linkage.md: cross-dataset join-key map (placekey, ticker, cusip/cik, domain, person id, lat/long, fips, zip…), showing which datasets combine and on what spine
- references/safegraph-places.md: Global Places schema, NAICS 522320, BTM operator brands, opened_on/closed_on, the Bitcoin-ATM worked example
- examples/btm_safegraph_pull.py: acceptance test that filters SafeGraph Global Places to the 7 BTM operator brands + NAICS 522320, verifies standalone-POI / open-close coverage, and exports the US subset to ~/projects/batm/