skills/15-Felpix-Studios-social-science-research/skills/data-finder/SKILL.md
Find and assess datasets for a research question. Dispatches Explorer agents to search across data source categories, then Explorer-Critic to stress-test each candidate. Produces a ranked list with feasibility grades. Make sure to use this skill whenever the user wants to identify or evaluate data sources — not to search for papers or run analysis. Triggers include: "find data", "what data should I use", "find a dataset for this", "where can I get data on X", "assess datasets", "what datasets exist for", "help me find data", "is there data on this", "what are my data options", "I need data for this project", or any request to locate empirical data sources for a research question.
Find and assess datasets for your research question. Two Explorer agents search in parallel across data source categories; an Explorer-Critic then stress-tests each candidate against the research design.
Input: `$ARGUMENTS`, which is either a topic or `from spec` to read the research question from `quality_reports/`.
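Find the most recent `quality_reports/project_spec_*.md` or `quality_reports/specs/*.md` and extract the research question, the empirical strategy, and the variables needed (treatment, outcome, controls, time period, geography, unit).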
Read references/domain-profile.md if it exists — extract the Common Datasets section (domain-specific datasets to check first).
If no research spec exists, extract the variables and strategy from $ARGUMENTS directly. If the request is vague, ask: "What are the treatment and outcome variables, and what empirical strategy did you have in mind?"
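The spec lookup described above can be sketched in a few lines. This is a minimal illustration, not part of the skill itself: the function name `find_latest_spec` is hypothetical, and it assumes "most recent" means the latest file modification time, which the skill does not specify.

```python
from pathlib import Path
from typing import Optional

# Glob patterns taken from the skill text, relative to quality_reports/.
SPEC_PATTERNS = ("project_spec_*.md", "specs/*.md")


def find_latest_spec(reports_dir: str = "quality_reports") -> Optional[Path]:
    """Return the most recently modified spec file, or None if none exist.

    Assumption: "most recent" is judged by modification time (st_mtime).
    """
    root = Path(reports_dir)
    candidates = [
        path
        for pattern in SPEC_PATTERNS
        for path in root.glob(pattern)
        if path.is_file()
    ]
    if not candidates:
        return None  # caller falls back to parsing $ARGUMENTS directly
    return max(candidates, key=lambda path: path.stat().st_mtime)
```

A `None` result corresponds to the "no research spec exists" branch, where the variables and strategy come from `$ARGUMENTS` instead.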
Split the source categories between two Explorer agents to parallelize the search.
Explorer A — Institutional Data:
Task prompt: "You are an Explorer agent. Research question: [question].
Empirical strategy: [strategy].
Variables needed — Treatment: [X], Outcome: [Y], Controls: [list],
Time period: [period], Geography: [geo], Unit: [unit].
Domain datasets (check first): [list from domain-profile if available].
Your source categories to search:
1. Public microdata (CPS, ACS, NHIS, MEPS, SIPP, QWI)
2. Administrative data (Medicare/Medicaid, IRS, SSA, vital statistics, court records)
3. Survey panels (PSID, HRS, Add Health, NLSY97/79, BHPS/UKHLS)
For each dataset found, produce the full Explorer report format.
Follow the Explorer agent instructions."
Explorer B — Broader and Alternative Sources:
Task prompt: "You are an Explorer agent. Research question: [question].
Empirical strategy: [strategy].
Variables needed — Treatment: [X], Outcome: [Y], Controls: [list],
Time period: [period], Geography: [geo], Unit: [unit].
Domain datasets (check first): [list from domain-profile if available].
Your source categories to search:
1. International data (World Bank, OECD, Eurostat, IMF, IPUMS International)
2. Novel/alternative (satellite, web scraping, proprietary, RCT registries)
3. Any field-specific datasets not covered by Explorer A
For each dataset found, produce the full Explorer report format.
Follow the Explorer agent instructions."
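Both Explorer prompts share the same header and differ only in their source categories, so they can be generated from one template. The sketch below assumes a hypothetical `ResearchSpec` container and `build_explorer_prompt` helper; neither is defined by the skill, and how the resulting prompts are dispatched to agents is left out because it depends on the host tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResearchSpec:
    """Hypothetical container for the fields the prompts need."""
    question: str
    strategy: str
    treatment: str
    outcome: str
    controls: List[str]
    period: str
    geography: str
    unit: str
    domain_datasets: List[str] = field(default_factory=list)


# Category lists copied from the two task prompts above.
EXPLORER_A_CATEGORIES = [
    "Public microdata (CPS, ACS, NHIS, MEPS, SIPP, QWI)",
    "Administrative data (Medicare/Medicaid, IRS, SSA, vital statistics, court records)",
    "Survey panels (PSID, HRS, Add Health, NLSY97/79, BHPS/UKHLS)",
]
EXPLORER_B_CATEGORIES = [
    "International data (World Bank, OECD, Eurostat, IMF, IPUMS International)",
    "Novel/alternative (satellite, web scraping, proprietary, RCT registries)",
    "Any field-specific datasets not covered by Explorer A",
]


def build_explorer_prompt(spec: ResearchSpec, categories: List[str]) -> str:
    """Fill the shared Explorer template with one agent's categories."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(categories, start=1))
    domain = ", ".join(spec.domain_datasets) or "none available"
    return (
        f"You are an Explorer agent. Research question: {spec.question}.\n"
        f"Empirical strategy: {spec.strategy}.\n"
        f"Variables needed: Treatment: {spec.treatment}, "
        f"Outcome: {spec.outcome}, Controls: {', '.join(spec.controls)},\n"
        f"Time period: {spec.period}, Geography: {spec.geography}, "
        f"Unit: {spec.unit}.\n"
        f"Domain datasets (check first): {domain}.\n"
        f"Your source categories to search:\n{numbered}\n"
        "For each dataset found, produce the full Explorer report format.\n"
        "Follow the Explorer agent instructions."
    )
```

Calling `build_explorer_prompt(spec, EXPLORER_A_CATEGORIES)` and `build_explorer_prompt(spec, EXPLORER_B_CATEGORIES)` yields the two prompts, which keeps the shared header in sync if a field is added later.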
After both Explorer agents complete, dispatch the Explorer-Critic with the full combined dataset list.
Task prompt: "You are an Explorer-Critic agent. Research question: [question].
Empirical strategy: [strategy].
Variables needed — Treatment: [X], Outcome: [Y], Controls: [list],
Time period: [period], Geography: [geo], Unit: [unit].
Here is the combined dataset list from the Explorer agents:
[paste all Explorer findings]
Apply the 5-point critique to each dataset:
1. Measurement validity
2. Sample selection
3. External validity
4. Identification compatibility
5. Known issues
Produce adjusted feasibility grades and deal-breaker flags.
Follow the Explorer-Critic agent instructions."
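Once the Explorer-Critic returns adjusted grades and deal-breaker flags, the datasets need to be sorted into report sections. The sketch below shows one way to do that, under assumptions the skill does not state: grades are single uppercase letters, anything below C is rejected, and a deal-breaker flag sends a dataset to the rejection table regardless of grade. `CritiquedDataset` and `bucket_by_grade` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CritiquedDataset:
    """Hypothetical record of one Explorer-Critic verdict."""
    name: str
    grade: str          # adjusted feasibility grade, assumed "A", "B", "C", ...
    deal_breaker: bool  # True if the critic flagged a deal-breaker


def bucket_by_grade(datasets: List[CritiquedDataset]) -> Dict[str, List[CritiquedDataset]]:
    """Group datasets into the three report sections, best grade first."""
    buckets: Dict[str, List[CritiquedDataset]] = {
        "top_candidates": [],          # Grade A-B
        "accessible_with_effort": [],  # Grade C
        "rejected": [],                # rejection table
    }
    # sorted() is stable, so ties keep the Explorers' original order.
    for ds in sorted(datasets, key=lambda d: d.grade):
        if ds.deal_breaker or ds.grade not in ("A", "B", "C"):
            # Assumption: a deal-breaker overrides an otherwise good grade.
            buckets["rejected"].append(ds)
        elif ds.grade in ("A", "B"):
            buckets["top_candidates"].append(ds)
        else:
            buckets["accessible_with_effort"].append(ds)
    return buckets
```

Keeping the deal-breaker flag on each rejected record makes it straightforward to fill the YES/NO column of the rejection table.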
After the Explorer-Critic completes, compile the final ranked report:
Save to quality_reports/data_exploration_[sanitized_topic].md:
# Data Exploration: [Topic]
**Date:** [YYYY-MM-DD]
**Research question:** [one sentence]
**Empirical strategy:** [method]
**Variables sought:** Treatment = [X], Outcome = [Y], Controls = [list]
---
## Top Candidates (Grade A–B)
### 1. [Dataset Name] — Grade: A/B
**Provider:** [Name] | **Access:** [Public/Restricted/etc.] | **URL:** [link]
**Coverage:** [time period] | [geography] | [unit of observation] | N ≈ [size]
**Key Variables:**
- Treatment proxy: [variable]
- Outcome: [variable]
- Controls available: [list]
**Explorer-Critic Assessment:**
- Measurement validity: [1-2 sentences]
- Sample selection: [1-2 sentences]
- External validity: [1-2 sentences]
- Identification compatibility: [focused on the proposed strategy]
- Known issues: [specific documented problems]
**Bottom line:** [1-2 sentences — viable and under what conditions]
---
[Repeat for all A and B grade datasets]
---
## Accessible With Effort (Grade C)
[Brief summaries — name, access path, main limitation, why C not B]
---
## Rejection Table
| Dataset | Reason for Rejection | Deal-breaker? |
|---------|---------------------|---------------|
| [Name] | [Explorer-Critic finding] | YES/NO |
---
## Recommended Path Forward
1. **Best dataset:** [Name] — [one sentence why]
2. **Fallback if [best] unavailable:** [Name] — [why it's second choice]
3. **Access steps for [best]:** [specific actions needed — download link, application URL, IRB requirements]
---
## Next Steps
- **`/data-analysis [dataset]`** — begin analysis with the recommended dataset
- **`/lit-review [topic]`** — check if papers in the literature use these datasets (helps validate choice)
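The report is saved to `quality_reports/data_exploration_[sanitized_topic].md`, but the skill does not define how the topic is sanitized. A minimal sketch, assuming lowercase ASCII with runs of other characters collapsed to a single underscore (the helper names and the `untitled` fallback are hypothetical):

```python
import re
from pathlib import Path


def sanitize_topic(topic: str) -> str:
    """Turn a free-text topic into a filename-safe slug.

    Assumed rule: lowercase, and every run of characters outside
    a-z and 0-9 becomes one underscore.
    """
    slug = re.sub(r"[^a-z0-9]+", "_", topic.lower()).strip("_")
    return slug or "untitled"  # assumed fallback when nothing survives


def report_path(topic: str, reports_dir: str = "quality_reports") -> Path:
    """Build the output path for the data exploration report."""
    return Path(reports_dir) / f"data_exploration_{sanitize_topic(topic)}.md"
```

For example, `report_path("Minimum Wage & Teen Employment")` resolves to `quality_reports/data_exploration_minimum_wage_teen_employment.md`.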