skills/data-and-science/research/scientific-skills/llm-web-data-collection/SKILL.md
Automate web-scale data collection for research datasets using a human-in-the-loop LLM framework. This skill implements the methodology from "LLM-Based Web Data Collection for Research Dataset Creation" (EMNLP 2025 Findings). It automatically formulates search queries, navigates web pages, extracts structured data, and performs quality control while mitigating search bias and LLM hallucinations.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add lunartech-x/superpowers llm-web-data-collection
This skill provides a human-in-the-loop framework for automating web-scale data collection using Large Language Models. Manual data collection is time-consuming and error-prone; the framework automates search query formulation, web page navigation, structured data extraction, and quality control.
Key Innovation: Human-in-the-loop design allows researchers to inspect and adjust decisions at each stage, ensuring alignment with research objectives while mitigating LLM hallucinations and search engine bias.
Use this skill when:
Define Target Dataset:
```python
dataset_spec = {
    "name": "Clinical Trial Sites",
    "description": "Collect information about clinical trial sites including location, specialties, and contact information",
    "fields": [
        {"name": "site_name", "type": "string", "required": True},
        {"name": "location", "type": "string", "required": True},
        {"name": "specialties", "type": "list", "required": False},
        {"name": "contact_email", "type": "email", "required": False},
        {"name": "phone", "type": "phone", "required": False},
        {"name": "website", "type": "url", "required": False}
    ],
    "constraints": [
        "Only include active sites",
        "Focus on US-based facilities",
        "Prefer academic medical centers"
    ]
}
```
Human Review Point: dataset spec review (approve or modify the schema).
LLM-Based Query Generation:
```python
import json


def generate_search_queries(dataset_spec, llm):
    """
    Use LLM to generate diverse search queries
    from dataset description
    """
    prompt = f"""
    Given this dataset specification:
    {json.dumps(dataset_spec, indent=2)}

    Generate 20 diverse search engine queries that would help
    find web pages containing this information.

    Consider:
    - Different phrasings of the same concept
    - Specific vs general queries
    - Including and excluding certain terms
    - Different source types (directories, databases, articles)

    Return as JSON list of queries.
    """
    # `llm` is any client exposing generate(); `parse_json` is a helper
    # that turns the model's reply into a Python list
    response = llm.generate(prompt)
    queries = parse_json(response)
    return queries
```
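The `parse_json` helper called above is not defined in this skill. A minimal sketch follows, assuming the model may wrap its JSON in a Markdown code fence or surround it with prose; the function body and the `FENCE` constant are illustrative assumptions, not part of the published method.

```python
import json
import re

# Built programmatically so this example contains no literal code fence
FENCE = "`" * 3


def parse_json(response: str):
    """Parse JSON from an LLM reply, tolerating code fences and stray prose.

    Hypothetical helper: the skill calls parse_json() but does not define it.
    """
    text = response.strip()
    # Unwrap a Markdown code fence (optionally tagged "json") if present
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the first position where a JSON array or object decodes
    decoder = json.JSONDecoder()
    for match in re.finditer(r"[\[{]", text):
        try:
            value, _ = decoder.raw_decode(text, match.start())
            return value
        except json.JSONDecodeError:
            continue
    raise ValueError("no JSON value found in LLM response")
```

Raising on an unparseable reply, rather than returning an empty list, keeps a failed generation visible at the query review checkpoint.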
Query Diversification:
```python
def diversify_queries(initial_queries, llm):
    """
    Expand queries to reduce search engine bias:
    - Add synonyms
    - Vary query structure
    - Include different geographic modifiers
    - Add temporal modifiers if relevant
    """
    diversified = []
    for query in initial_queries:
        variations = llm.generate_variations(query)
        diversified.extend(variations)

    # Remove duplicates and near-duplicates
    return deduplicate(diversified)
```
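The `deduplicate` step is left undefined above. One way to sketch it, assuming near-duplicates are queries whose word sets mostly overlap, is Jaccard similarity over lowercase tokens. The 0.8 threshold and the `_tokens` helper are assumptions chosen for illustration.

```python
import re


def _tokens(query: str) -> frozenset:
    """Lowercase a query and split it into a set of alphanumeric tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", query.lower()))


def deduplicate(queries, threshold=0.8):
    """Drop exact and near-duplicate queries, keeping first occurrences.

    Two queries count as near-duplicates when the Jaccard similarity of
    their token sets is at least `threshold` (an assumed default).
    """
    kept, kept_tokens = [], []
    for query in queries:
        tokens = _tokens(query)
        if not tokens:
            continue  # skip empty or punctuation-only queries
        is_dup = any(
            len(tokens & seen) / len(tokens | seen) >= threshold
            for seen in kept_tokens
        )
        if not is_dup:
            kept.append(query)
            kept_tokens.append(tokens)
    return kept
```

Token sets ignore word order, so reordered queries collapse into one. That may be too aggressive when the search engine treats word order as meaningful, in which case a higher threshold or an order-aware comparison would be a better fit.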
Human Review Point: query review (add or remove queries).
Search Execution:
```python
def execute_search(queries, search_engine="google"):
    """
    Execute search queries and collect URLs
    """
    # `search_api` is a placeholder for whichever search client is configured
    all_results = []
    for query in queries:
        results = search_api.search(
            query,
            num_results=50,
            engine=search_engine
        )
        # Record which query surfaced each result
        for result in results:
            result['source_query'] = query
        all_results.extend(results)
    return deduplicate_urls(all_results)
```
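`deduplicate_urls` is likewise undefined. The sketch below assumes each result is a dict with a `"url"` key (the key name is an assumption; only `source_query` appears in the code above) and treats URLs as equal when they differ only in host case, a leading `www.`, a trailing slash, or a fragment.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different forms compare equal."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment; keep the query string because it can select content
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def deduplicate_urls(results):
    """Keep one result per normalized URL, merging the queries that found it.

    Assumes each result is a dict with "url" and "source_query" keys.
    """
    merged = {}
    for result in results:
        key = normalize_url(result["url"])
        if key not in merged:
            merged[key] = {**result, "source_queries": [result["source_query"]]}
        elif result["source_query"] not in merged[key]["source_queries"]:
            merged[key]["source_queries"].append(result["source_query"])
    return list(merged.values())
```

Keeping every query that surfaced a page, rather than only the first, preserves provenance and shows which queries overlap.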
Page Relevance Scoring:
```python
def score_page_relevance(url, page_content, dataset_spec, llm):
    """
    Use LLM to assess page relevance to dataset spec
    """
    prompt = f"""
    Dataset objective: {dataset_spec['description']}
    Page URL: {url}
    Page content (first 5000 chars): {page_content[:5000]}

    Score this page's relevance (0-10) and explain:
    1. Does it contain relevant data points?
    2. Is the data structured or extractable?
    3. Is this a primary source or aggregator?

    Return JSON: {{"score": X, "reasoning": "...", "data_fields_present": [...]}}
    """
    return llm.generate(prompt)
```
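The scorer returns the model's raw reply, so the score still has to be parsed and acted on. The sketch below assumes the reply is a bare JSON object in the shape requested by the prompt. `parse_relevance`, `triage_pages`, and the thresholds 7 and 4 are hypothetical names and values, not taken from the skill.

```python
import json


def parse_relevance(response: str) -> dict:
    """Parse the scorer's JSON reply and validate the 0-10 score.

    Returns score=None when the reply is unusable, so the page can be
    routed to a human instead of being silently kept or dropped.
    """
    unusable = {"score": None, "reasoning": "", "data_fields_present": []}
    try:
        data = json.loads(response)
        score = float(data["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return {**unusable, "reasoning": "unparseable response"}
    if not 0 <= score <= 10:
        return {**unusable, "reasoning": "score out of range"}
    return {
        "score": score,
        "reasoning": str(data.get("reasoning", "")),
        "data_fields_present": list(data.get("data_fields_present") or []),
    }


def triage_pages(scored_pages, keep_at=7.0, drop_below=4.0):
    """Split (url, score) pairs into keep / review / drop buckets.

    The thresholds are assumed defaults, meant to be adjusted at review.
    """
    buckets = {"keep": [], "review": [], "drop": []}
    for url, score in scored_pages:
        if score is None:
            buckets["review"].append(url)
        elif score >= keep_at:
            buckets["keep"].append(url)
        elif score < drop_below:
            buckets["drop"].append(url)
        else:
            buckets["review"].append(url)
    return buckets
```

Sending mid-range and unparseable scores to a review bucket leaves the borderline decisions with the researcher.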
Human Review Point: page relevance review (adjust scoring criteria).
Schema-Guided Extraction:
```python
import json


def extract_data(page_content, dataset_spec, llm):
    """
    Extract structured data according to schema
    """
    prompt = f"""
    Extract the following fields from this page content:

    Fields to extract:
    {json.dumps(dataset_spec['fields'], indent=2)}

    Page content:
    {page_content}

    Rules:
    - Only extract explicitly stated information
    - Mark uncertain extractions with confidence score
    - Return null for missing required fields
    - Flag potential hallucination risks

    Return JSON matching the schema.
    """
    extracted = llm.generate(prompt)
    # `validate_extraction` checks the reply against the schema
    return validate_extraction(extracted, dataset_spec)
```
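`validate_extraction` is not shown in the skill. The sketch below assumes the model's reply has already been parsed into a dict and that the function returns a cleaned record plus a list of issues; both the return shape and the per-type checks are assumptions. The type names match those used in the example `dataset_spec`.

```python
import re
from urllib.parse import urlsplit

# Deliberately loose pattern: something@something.tld
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def _is_valid(value, field_type):
    """Loose per-type checks; unknown types are accepted unchecked."""
    if field_type == "string":
        return isinstance(value, str) and bool(value.strip())
    if field_type == "list":
        return isinstance(value, list)
    if field_type == "email":
        return isinstance(value, str) and bool(EMAIL_RE.match(value))
    if field_type == "url":
        if not isinstance(value, str):
            return False
        try:
            parts = urlsplit(value)
        except ValueError:
            return False
        return parts.scheme in ("http", "https") and bool(parts.netloc)
    if field_type == "phone":
        # 7 to 15 digits once formatting characters are removed
        return isinstance(value, str) and 7 <= len(re.sub(r"\D", "", value)) <= 15
    return True


def validate_extraction(extracted, dataset_spec):
    """Return (record, issues) for a parsed extraction dict."""
    record, issues = {}, []
    for field in dataset_spec["fields"]:
        name = field["name"]
        value = extracted.get(name)
        if value is None:
            record[name] = None
            if field.get("required"):
                issues.append(f"missing required field: {name}")
        elif _is_valid(value, field["type"]):
            record[name] = value
        else:
            record[name] = None
            issues.append(f"invalid {field['type']} for field: {name}")
    return record, issues
```

Invalid values are nulled and reported rather than repaired, so a malformed extraction reaches the reviewer unchanged.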
Hallucination Mitigation:
```python
def verify_extraction(extracted_data, page_content, llm):
    """
    Verify extracted data against source to prevent hallucination
    """
    verification_results = []
    for field, value in extracted_data.items():
        # Check if value appears verbatim or closely in source
        if not find_in_source(value, page_content):
            # Use LLM to verify derivation
            prompt = f"""
            Verify this extraction:
            Field: {field}
            Extracted value: {value}
            Source text: {page_content}

            Is this value:
            1. Directly stated in source
            2. Reasonably derived from source
            3. Possibly hallucinated

            Return confidence score and evidence.
            """
            verification = llm.generate(prompt)
            verification_results.append(verification)
    return flag_low_confidence(verification_results)
```
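The `find_in_source` check decides which values skip the LLM verification call, so its matching rule matters. A minimal sketch follows, assuming "verbatim or closely" means equal after ignoring case, punctuation, and whitespace differences. This interpretation and the helper `_normalize` are assumptions; the example values are fictional.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()


def find_in_source(value, page_content) -> bool:
    """Return True when `value` appears in the page, ignoring formatting.

    Lists must have every item present; None and empty values never match.
    """
    if value is None:
        return False
    if isinstance(value, (list, tuple)):
        return bool(value) and all(find_in_source(v, page_content) for v in value)
    needle = _normalize(str(value))
    if not needle:
        return False
    # Pad with spaces so matches start and end on token boundaries
    return f" {needle} " in f" {_normalize(page_content)} "


page = "Example Medical Center, Springfield\nPhone: (555) 010-0199"
print(find_in_source("Example Medical Center Springfield", page))
print(find_in_source("555-010-0199", page))
print(find_in_source("Other Clinic", page))
```

The rule is intentionally strict: a paraphrased or inferred value fails the check and falls through to the LLM verification prompt instead of being accepted silently.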
Human Review Point: extraction review (correct extractions).
Cross-Validation:
```python
def cross_validate(dataset, external_sources):
    """
    Validate extracted data against known sources
    """
    validation_results = []
    for record in dataset:
        # Check against external databases/APIs
        external_match = lookup_external(record, external_sources)
        if external_match:
            agreement = compute_agreement(record, external_match)
            validation_results.append({
                'record': record,
                'external_match': external_match,
                'agreement': agreement
            })
    return validation_results
```
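How `compute_agreement` scores a match is not specified. One plausible sketch, assuming both records are flat dicts, is the fraction of shared non-empty fields whose values are equal after case and whitespace normalization. The optional `fields` parameter and the `None` return for incomparable records are assumptions.

```python
def compute_agreement(record, external_match, fields=None):
    """Fraction of comparable fields on which the two records agree.

    A field is comparable when both records hold a non-empty value for it.
    Returns None when no field can be compared.
    """
    def norm(value):
        if isinstance(value, (list, tuple, set)):
            return sorted(norm(item) for item in value)
        return " ".join(str(value).lower().split())

    names = fields or (record.keys() & external_match.keys())
    compared = matched = 0
    for name in names:
        a, b = record.get(name), external_match.get(name)
        if a in (None, "", []) or b in (None, "", []):
            continue  # nothing to compare on this field
        compared += 1
        if norm(a) == norm(b):
            matched += 1
    return matched / compared if compared else None
```

Returning `None` instead of 0.0 when nothing overlaps keeps "no evidence" distinct from "full disagreement" in the validation results.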
Consistency Checks:
```python
def check_consistency(dataset):
    """
    Check for internal consistency:
    - Duplicate detection
    - Conflicting values
    - Outlier detection
    - Format validation
    """
    issues = []

    # Duplicate detection
    duplicates = find_duplicates(dataset)
    issues.extend(duplicates)

    # Value consistency (same entity, different values)
    conflicts = find_conflicts(dataset)
    issues.extend(conflicts)

    # Outlier detection
    outliers = detect_outliers(dataset)
    issues.extend(outliers)

    return issues
```
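The duplicate and conflict checks can be sketched as a grouping problem. The version below assumes records are dicts and that one field identifies the entity; `site_name` is used as the default key because it appears in the example spec. The issue dict layout is an assumption.

```python
from collections import defaultdict


def _key(record, key_field="site_name"):
    """Normalize the entity key: lowercase with collapsed whitespace."""
    return " ".join(str(record.get(key_field) or "").lower().split())


def find_duplicates(dataset, key_field="site_name"):
    """Report each record whose entity key was already seen earlier."""
    seen, issues = {}, []
    for index, record in enumerate(dataset):
        key = _key(record, key_field)
        if not key:
            continue
        if key in seen:
            issues.append({"type": "duplicate", "entity": key,
                           "records": [seen[key], index]})
        else:
            seen[key] = index
    return issues


def find_conflicts(dataset, key_field="site_name"):
    """Report fields where records for the same entity disagree."""
    groups = defaultdict(list)
    for index, record in enumerate(dataset):
        groups[_key(record, key_field)].append((index, record))
    issues = []
    for key, members in groups.items():
        if not key or len(members) < 2:
            continue
        names = sorted({f for _, r in members for f in r if f != key_field})
        for name in names:
            # repr() makes unhashable values such as lists comparable
            values = {repr(r[name]) for _, r in members if r.get(name) is not None}
            if len(values) > 1:
                issues.append({"type": "conflict", "entity": key, "field": name,
                               "records": [i for i, _ in members],
                               "values": sorted(values)})
    return issues
```

Each issue carries record indices, so a reviewer can jump straight to the rows that need resolving.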
Human Review Point: quality review (resolve conflicts).
Generate Research-Ready Output:
```python
import json

import pandas as pd


def export_dataset(dataset, output_format="csv"):
    """
    Export in standard research formats
    """
    # CSV for tabular data
    if output_format == "csv":
        df = pd.DataFrame(dataset)
        df.to_csv("dataset.csv", index=False)
    # JSON for nested data
    elif output_format == "json":
        with open("dataset.json", "w") as f:
            json.dump(dataset, f, indent=2)
    else:
        raise ValueError(f"Unsupported output format: {output_format}")

    # Generate data dictionary
    generate_data_dictionary(dataset)

    # Generate provenance log
    generate_provenance_log(dataset)
```
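The contents of the data dictionary are not specified. As a rough sketch, assuming the dataset is a list of flat dicts, each field could be summarized by its observed Python types, its completeness, and one example value. `build_data_dictionary` is a hypothetical name for the computation that `generate_data_dictionary` would then write to disk.

```python
def build_data_dictionary(dataset):
    """Summarize each field: observed types, completeness, one example."""
    fields = {}
    for record in dataset:
        for name, value in record.items():
            entry = fields.setdefault(
                name, {"types": set(), "non_null": 0, "example": None}
            )
            if value is not None:
                entry["types"].add(type(value).__name__)
                entry["non_null"] += 1
                if entry["example"] is None:
                    entry["example"] = value
    total = len(dataset)
    return [
        {
            "field": name,
            "types": sorted(entry["types"]),
            # Share of all records with a non-null value for this field
            "completeness": entry["non_null"] / total if total else 0.0,
            "example": entry["example"],
        }
        for name, entry in fields.items()
    ]
```

The resulting rows can be passed to `pd.DataFrame` and saved next to the dataset, which also gives the completeness figures a place in the documentation.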
Documentation:
```markdown
# Dataset Documentation

## Collection Methodology
- Queries used: [list]
- Sources searched: [list]
- Date range: [dates]

## Quality Metrics
- Total records: X
- Verified records: Y%
- Human-reviewed: Z%

## Limitations
- Search engine bias mitigation: [description]
- Known gaps: [description]

## Provenance
- Each record includes source URL
- Extraction confidence scores included
```
| Phase | Checkpoint | Decision |
|-------|------------|----------|
| 1 | Dataset spec review | Approve/modify schema |
| 2 | Query review | Add/remove queries |
| 3 | Page relevance | Adjust scoring criteria |
| 4 | Extraction review | Correct extractions |
| 5 | Quality review | Resolve conflicts |
| 6 | Final approval | Approve dataset |
```bash
# LLM
pip install openai  # or anthropic, google-generativeai

# Web
pip install requests beautifulsoup4 selenium

# Data
pip install pandas

# Search APIs (optional)
pip install googlesearch-python
```