skills/60-regisely-superpapers/skills/data-collection/SKILL.md
Use when collecting data for a research project, downloading time series, building a dataset, accessing economic or social data APIs, or scraping data from a non-API source. Handles source discovery, respectful collection, local caching, and manifest documentation.
npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research data-collection
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill guides data collection from the research question to a versionable artifact in data/raw/. It is field-agnostic and open-ended about sources — the references/common-sources.md file is a starting point, not a boundary. For any research question, the skill uses web search to find appropriate sources beyond the common list.
Identify data needs from the research question. Variables, units (country, firm, individual, pixel), frequency, period, geography, and any necessary keys for merging across sources.
Find appropriate sources. Start with references/common-sources.md. If the user's needs are not covered there, search the web for the relevant source. Never invent a URL or API endpoint from memory.
Prefer APIs over scraping. APIs are versioned, documented, and legal. Scraping is the last resort when no API is available.
When scraping is necessary, be respectful:
robots.txt
Save raw data in data/raw/ in a versionable format. Parquet is preferred for tabular data; CSV is acceptable for small datasets. Never edit raw files by hand.
Document every dataset in data/manifest.md following the format from replication-driven-research: name, source, URL or API endpoint, collection date, variables used, frequency, period, license or usage notes.
Cache locally. Check data/raw/ before fetching. Only invoke the network if the file is missing or the user has explicitly requested a refresh.
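The caching and manifest steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the skill's own implementation: `fetch_cached`, `manifest_entry`, and the Markdown layout of the entry are hypothetical names and formatting chosen for the example, and the real manifest format is whatever replication-driven-research defines.

```python
from datetime import date
from pathlib import Path
from typing import Callable, Optional, Sequence


def fetch_cached(path: Path, fetch: Callable[[], bytes], refresh: bool = False) -> bool:
    """Write fetch() to `path` only if the file is missing or refresh is requested.

    Returns True when `fetch` was called, False on a cache hit.
    `fetch` is any zero-argument callable that returns the raw bytes
    (for example, a wrapper around an API request).
    """
    if path.exists() and not refresh:
        return False  # cache hit: the raw file is left untouched
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(fetch())
    return True


def manifest_entry(
    name: str,
    source: str,
    endpoint: str,
    variables: Sequence[str],
    frequency: str,
    period: str,
    license_notes: str,
    collected: Optional[date] = None,
) -> str:
    """Render one dataset entry as Markdown text (assumed layout)."""
    collected = collected or date.today()
    lines = [
        f"## {name}",
        f"- Source: {source}",
        f"- URL or API endpoint: {endpoint}",
        f"- Collection date: {collected.isoformat()}",
        f"- Variables used: {', '.join(variables)}",
        f"- Frequency: {frequency}",
        f"- Period: {period}",
        f"- License or usage notes: {license_notes}",
    ]
    return "\n".join(lines) + "\n"
```

Passing the fetcher as a callable keeps the cache check separate from any particular API client, so the same helper works for an API call or a scraper. A collection script would append the returned entry to data/manifest.md only when `fetch_cached` returns True.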
Follow this order when looking for a data source:
references/common-sources.md for known sources in the relevant domain.
"<topic>" open data API or "<topic>" dataset download.
robots.txt prohibits it
data/raw/
data/raw/ in a versionable format (parquet preferred)
code/, not an interactive session
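Where scraping cannot be avoided, the robots.txt check mentioned above can be sketched with Python's standard `urllib.robotparser`. The function name, the user agent string, and the example.org URLs are placeholders for illustration. The rules are passed in as lines so the sketch runs without network access.

```python
from typing import Iterable
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_lines: Iterable[str], user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules that were already downloaded
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, standing in for a site's real robots.txt.
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]

# For a live site, the parser can load the file itself:
#   parser = RobotFileParser()
#   parser.set_url("https://example.org/robots.txt")
#   parser.read()
```

With `EXAMPLE_RULES`, a URL under /private/ is refused and any other path is allowed. A collection script would call this before each request and skip the URL when it returns False.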