<!--
This file is the original documentation of an open-source Skill, collected for learning and research reference only.
Collected and curated by CoPaper.AI | https://copaper.ai
Source repository: https://github.com/kthorn/research-superpower
Project name: research-superpower
License: MIT License
Date collected: 2026-04-02
Notice: Copyright of this file belongs to the original author. It is collected here to give empirical social science researchers a central reference for AI Agent Skills. If it infringes any rights, please contact us for removal.
-->
---
name: Building Paper Screening Rubrics
description
Core principle: Build screening rubrics collaboratively through brainstorming → test → refine → automate → review → iterate.
Good rubrics come from understanding edge cases upfront and testing on real papers before bulk screening.
Use this skill when:
When NOT to use:
Ask domain-agnostic questions to understand what makes papers relevant:
Core Concepts:
Data Types & Artifacts:
Paper Types:
Relationships & Context:
Edge Cases:
Document responses in screening-criteria.json
Based on brainstorming, propose scoring logic:
Scoring (0-10):
Keywords Match (0-3 pts):
- Core term 1: +1 pt
- Core term 2 OR synonym: +1 pt
- Related term: +1 pt
Data Type Match (0-4 pts):
- Measurement type (IC50, Ki, EC50, etc.): +2 pts
- Dataset/code available: +1 pt
- Methods described: +1 pt
Specificity (0-3 pts):
- Primary research: +3 pts
- Methods paper: +2 pts
- Review: +1 pt
Special Rules:
- If mentions exclusion term: score = 0
Threshold: ≥7 = relevant, 5-6 = possibly relevant, <5 = not relevant
Present to user and ask: "Does this logic match your expectations?"
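The scoring logic above can be sketched in Python. This is a minimal illustration, not the skill's own implementation: the `RUBRIC` values, the marker lists, and the helper names `mentions`, `score_abstract`, and `classify` are all hypothetical, and synonym handling is left out for brevity.

```python
import re

# Hypothetical rubric values; replace with terms from your own brainstorming.
RUBRIC = {
    "core_terms": ["bedaquiline", "tuberculosis"],
    "related_terms": ["drug resistance"],
    "exclusion_terms": ["veterinary"],
    "measurements": ["IC50", "Ki", "EC50", "MIC"],
    "dataset_markers": ["data available", "github"],
    "method_markers": ["protocol", "assay"],
    "paper_type_points": {"primary_research": 3, "methods": 2, "review": 1},
}

def mentions(text, terms):
    """True if any term occurs in text as a whole word or phrase (case-insensitive)."""
    return any(re.search(r"\b" + re.escape(t) + r"\b", text, re.IGNORECASE)
               for t in terms)

def score_abstract(abstract, paper_type, rubric=RUBRIC):
    # Special rule: any exclusion term forces the score to 0.
    if mentions(abstract, rubric["exclusion_terms"]):
        return 0
    # Keywords (0-3 pts): one point per core term (first two), one for a related term.
    keywords = sum(mentions(abstract, [t]) for t in rubric["core_terms"][:2])
    keywords += mentions(abstract, rubric["related_terms"])
    # Data type (0-4 pts): measurement +2, dataset/code +1, methods +1.
    data = 2 * mentions(abstract, rubric["measurements"])
    data += mentions(abstract, rubric["dataset_markers"])
    data += mentions(abstract, rubric["method_markers"])
    # Specificity (0-3 pts) depends on the paper type.
    return keywords + data + rubric["paper_type_points"].get(paper_type, 0)

def classify(score):
    """Map a score to the three threshold bands."""
    if score >= 7:
        return "relevant"
    return "possibly relevant" if score >= 5 else "not relevant"
```

Whole-word matching is used so that a short term such as "Ki" does not fire inside longer words such as "kinase".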
Save initial rubric to screening-criteria.json:
{
  "version": "1.0.0",
  "created": "2025-10-11T15:30:00Z",
  "keywords": {
    "core_terms": ["term1", "term2"],
    "synonyms": {"term1": ["alt1", "alt2"]},
    "related_terms": ["related1", "related2"],
    "exclusion_terms": ["exclude1", "exclude2"]
  },
  "data_types": {
    "measurements": ["IC50", "Ki", "MIC"],
    "datasets": ["GEO:", "SRA:", "PDB:"],
    "methods": ["protocol", "synthesis", "assay"]
  },
  "scoring": {
    "keywords_max": 3,
    "data_type_max": 4,
    "specificity_max": 3,
    "relevance_threshold": 7
  },
  "special_rules": [
    {
      "name": "scaffold_analogs",
      "condition": "mentions target scaffold AND (analog OR derivative)",
      "action": "add 3 points"
    }
  ]
}
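Before using a saved rubric, it may help to run a quick sanity check on it. The sketch below assumes the field names shown in the JSON above; the function `validate_rubric` and the specific checks are illustrative choices, not part of the skill.

```python
import json

def validate_rubric(rubric):
    """Return a list of problems found in a rubric dict (empty list means OK)."""
    problems = []
    for key in ("version", "keywords", "data_types", "scoring"):
        if key not in rubric:
            problems.append(f"missing top-level key: {key}")
    scoring = rubric.get("scoring", {})
    # The three category maxima should add up to the 0-10 scale.
    total = sum(scoring.get(k, 0)
                for k in ("keywords_max", "data_type_max", "specificity_max"))
    if total != 10:
        problems.append(f"category maxima sum to {total}, expected 10")
    threshold = scoring.get("relevance_threshold")
    if threshold is None or not 0 <= threshold <= total:
        problems.append("relevance_threshold must lie between 0 and the maximum score")
    return problems

# A trimmed copy of the rubric, parsed the same way the saved file would be.
rubric = json.loads("""
{
  "version": "1.0.0",
  "keywords": {"core_terms": ["term1", "term2"]},
  "data_types": {"measurements": ["IC50", "Ki", "MIC"]},
  "scoring": {"keywords_max": 3, "data_type_max": 4,
              "specificity_max": 3, "relevance_threshold": 7}
}
""")
print(validate_rubric(rubric))
```

In a real session the dict would come from `json.load` on screening-criteria.json rather than an inline string.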
Do a quick PubMed search to get candidate papers:
# Search for 20 papers using initial keywords
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=YOUR_QUERY&retmax=20&retmode=json"
Fetch abstracts for first 10-15 papers:
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=PMID1,PMID2,...&retmode=xml&rettype=abstract"
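The same two calls can be prepared and parsed from Python. This is a sketch under assumptions: `esearch_url` and `parse_abstracts` are hypothetical helper names, only the standard library is used, and the parser reads the `PubmedArticle`, `MedlineCitation`, `PMID`, `ArticleTitle`, and `AbstractText` elements of the efetch XML response. No network request is made here; the URL could be fetched with `urllib.request.urlopen` or any HTTP client.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(query, retmax=20):
    """Build the esearch URL used in the curl example, with the query URL-encoded."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

def parse_abstracts(xml_text):
    """Extract pmid, title, and abstract text from an efetch XML response."""
    papers = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        title_node = citation.find("Article/ArticleTitle")
        # Structured abstracts have several AbstractText elements; join them.
        parts = ["".join(node.itertext()).strip()
                 for node in citation.iter("AbstractText")]
        papers.append({
            "pmid": citation.findtext("PMID"),
            "title": "".join(title_node.itertext()).strip()
                     if title_node is not None else "",
            "abstract": " ".join(p for p in parts if p),
        })
    return papers
```

Using `itertext()` keeps text that sits inside inline markup such as italics, which plain `.text` would truncate.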
Present abstracts to user one at a time:
Paper 1/10:
Title: [Title]
PMID: [12345678]
DOI: [10.1234/example]
Abstract:
[Full abstract text]
Is this paper RELEVANT to your research question? (y/n/maybe)
Record user judgments in test-set.json:
{
  "test_papers": [
    {
      "pmid": "12345678",
      "doi": "10.1234/example",
      "title": "Paper title",
      "abstract": "Full abstract text...",
      "user_judgment": "relevant",
      "timestamp": "2025-10-11T15:45:00Z"
    }
  ]
}
Continue until you have 5-10 papers with clear judgments.
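Recording each answer could look like the sketch below. The mapping from y/n/maybe to stored labels is an assumption (the source shows only the "relevant" label), as are the helper names `record_judgment` and `clear_judgments`.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Assumed mapping from the y/n/maybe prompt to stored labels; "maybe" is kept
# as its own label so unclear papers can be left out of accuracy checks.
JUDGMENT_LABELS = {"y": "relevant", "n": "not_relevant", "maybe": "maybe"}

def record_judgment(path, paper, answer):
    """Append one judged paper to test-set.json, creating the file if needed."""
    path = Path(path)
    data = json.loads(path.read_text()) if path.exists() else {"test_papers": []}
    data["test_papers"].append({
        "pmid": paper["pmid"],
        "doi": paper.get("doi"),
        "title": paper["title"],
        "abstract": paper["abstract"],
        "user_judgment": JUDGMENT_LABELS[answer.strip().lower()],
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    })
    path.write_text(json.dumps(data, indent=2))
    return data

def clear_judgments(data):
    """Papers with an unambiguous relevant / not_relevant judgment."""
    return [p for p in data["test_papers"] if p["user_judgment"] != "maybe"]
```

Counting only the papers returned by `clear_judgments` gives a simple way to tell when the 5-10 paper target has been reached.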
Apply rubric to each test paper:
threshold = rubric['scoring']['relevance_threshold']  # 7 in the initial rubric
for paper in test_papers:
    score = calculate_score(paper['abstract'], rubric)
    predicted_status = "relevant" if score >= threshold else "not_relevant"
    paper['predicted_score'] = score
    paper['predicted_status'] = predicted_status
Calculate accuracy:
correct = sum(1 for p in test_papers
              if p['predicted_status'] == p['user_judgment'])
accuracy = correct / len(test_papers)
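Accuracy alone does not say which papers to inspect. The sketch below, using a hypothetical `evaluate` helper, also lists false negatives and false positives. It assumes every paper carries a binary judgment ("relevant" or "not_relevant"), so any "maybe" papers should be filtered out first.

```python
def evaluate(test_papers):
    """Compare predicted_status with user_judgment and list the disagreements."""
    false_negatives = [p["pmid"] for p in test_papers
                       if p["user_judgment"] == "relevant"
                       and p["predicted_status"] != "relevant"]
    false_positives = [p["pmid"] for p in test_papers
                       if p["user_judgment"] != "relevant"
                       and p["predicted_status"] == "relevant"]
    errors = len(false_negatives) + len(false_positives)
    accuracy = (len(test_papers) - errors) / len(test_papers) if test_papers else 0.0
    return {
        "accuracy": accuracy,
        "false_negatives": false_negatives,
        "false_positives": false_positives,
    }
```

The two lists map directly onto the FALSE NEGATIVE and FALSE POSITIVE entries of a classification report.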
Present classification report:
RUBRIC TEST RESULTS (5 papers):
✓ PMID 12345678: Score 9 → relevant (user: relevant) ✓
✗ PMID 23456789: Score 4 → not_relevant (user: relevant) ← FALSE NEGATIVE
✓ PMID 34567890: Score 8 → relevant (user: relevant) ✓
✓ PMID 45678901: Score 3 → not_relevant (user: not_relevant) ✓
✗ PMID 56789012: Score 7 → relevant (user: not_relevant) ← FALSE POSITIVE
Accuracy: 60% (3/5 correct)
Target: ≥80%
--- FALSE NEGATIVE: PMID 23456789 ---
Title: "Novel analogs of compound X with improved potency"
Score breakdown:
- Keywords: 1 pt (matched "compound X")
- Data type: 2 pts (mentioned IC50 values)
- Specificity: 1 pt (primary research)
- Total: 4 pts → not_relevant
Why missed: Paper discusses "analogs" but didn't trigger scaffold_analogs rule
Abstract excerpt: "We synthesized 12 analogs of compound X..."
--- FALSE POSITIVE: PMID 56789012 ---
Title: "Review of kinase inhibitors"
Score breakdown:
- Keywords: 2 pts
- Data type: 3 pts
- Specificity: 2 pts (review, not primary)
- Total: 7 pts → relevant
Why wrong: Review paper, user wants primary research only
Ask user for adjustments:
Current accuracy: 60% (below 80% threshold)
Suggestions to improve rubric:
1. Strengthen scaffold_analogs rule - should "synthesized N analogs" always trigger?
2. Lower points for review papers (currently 2 pts, maybe 0 pts?)
3. Add more synonym terms for core concepts?
What would you like to adjust?
Update screening-criteria.json based on feedback
Example update:
{
  "special_rules": [
    {
      "name": "scaffold_analogs",
      "condition": "mentions target scaffold AND (analog OR derivative OR synthesized)",
      "action": "add 3 points"
    }
  ],
  "paper_types": {
    "primary_research": 3,
    "methods": 2,
    "review": 0 // Changed from 1
  }
}
Re-score test papers with updated rubric
Show new results:
UPDATED RUBRIC TEST RESULTS (5 papers):
✓ PMID 12345678: Score 9 → relevant (user: relevant) ✓
✓ PMID 23456789: Score 7 → relevant (user: relevant) ✓ (FIXED!)
✓ PMID 34567890: Score 8 → relevant (user: relevant) ✓
✓ PMID 45678901: Score 3 → not_relevant (user: not_relevant) ✓
✓ PMID 56789012: Score 5 → not_relevant (user: not_relevant) ✓ (FIXED!)
Accuracy: 100% (5/5 correct) ✓
Target: ≥80% ✓
Rubric is ready for bulk screening!
If accuracy ≥80%: Proceed to bulk screening
If <80%: Continue iterating
Once rubric validated on test set:
Cache each abstract in abstracts-cache.json, keyed by DOI:
{
  "10.1234/example": {
    "pmid": "12345678",
    "title": "Paper title",
    "abstract": "Full abstract text...",
    "fetched": "2025-10-11T16:00:00Z"
  }
}
Record each result in papers-reviewed.json:
{
  "10.1234/example": {
    "pmid": "12345678",
    "status": "relevant",
    "score": 9,
    "source": "pubmed_search",
    "timestamp": "2025-10-11T16:00:00Z",
    "rubric_version": "1.0.0"
  }
}
Screened 127 papers using validated rubric:
- Highly relevant (≥8): 12 papers
- Relevant (7): 18 papers
- Possibly relevant (5-6): 23 papers
- Not relevant (<5): 74 papers
All abstracts cached for re-screening.
Results saved to papers-reviewed.json.
Review offline and provide feedback if any misclassifications found.
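A summary like the one above can be derived from the saved results. This sketch assumes each entry in papers-reviewed.json has a numeric `score` field; the band names and the helpers `bucket` and `summarize` are hypothetical.

```python
from collections import Counter

def bucket(score):
    """Map a score to the summary bands used in the screening report."""
    if score >= 8:
        return "highly_relevant"
    if score == 7:
        return "relevant"
    if score >= 5:
        return "possibly_relevant"
    return "not_relevant"

def summarize(reviewed):
    """Count papers per band; `reviewed` maps DOI -> entry with a 'score'."""
    return Counter(bucket(entry["score"]) for entry in reviewed.values())

reviewed = {
    "10.1234/a": {"score": 9},
    "10.1234/b": {"score": 7},
    "10.1234/c": {"score": 5},
    "10.1234/d": {"score": 2},
}
for band, count in summarize(reviewed).items():
    print(f"- {band}: {count} papers")
```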
User reviews papers offline, identifies issues:
User: "I reviewed the results. Three papers were misclassified:
- PMID 23456789 scored 4 but is actually relevant (discusses scaffold analogs)
- PMID 34567890 scored 8 but not relevant (wrong target)
- PMID 45678901 scored 6 but is highly relevant (has key dataset)
Can we update the rubric?"
Update rubric based on feedback:
Re-screening workflow:
# Load all abstracts from abstracts-cache.json
# Apply updated rubric to each
# Generate change report
RUBRIC UPDATE: v1.0.0 → v1.1.0
Changes:
- Added "derivative" to scaffold_analogs rule
- Increased dataset bonus from +1 to +2 pts
Re-screening 127 cached papers...
Status changes:
not_relevant → relevant: 3 papers
- PMID 23456789 (score 4→7)
- PMID 45678901 (score 6→8)
relevant → not_relevant: 1 paper
- PMID 34567890 (score 8→6)
Updated papers-reviewed.json with new scores.
New summary:
- Highly relevant: 13 papers (+1)
- Relevant: 19 papers (+1)
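The re-screening workflow can be sketched as a single pass over the cache. The function `rescreen`, its parameters, and the tuple layout of the change report are assumptions for illustration; `score_fn` stands in for whatever scoring function implements the updated rubric.

```python
def rescreen(cache, reviewed, score_fn, new_version, threshold=7):
    """Re-score cached abstracts and return the status changes.

    cache:    DOI -> {"pmid", "abstract", ...} (abstracts-cache.json)
    reviewed: DOI -> {"status", "score", ...}  (papers-reviewed.json), updated in place
    score_fn: callable taking an abstract and returning a score (assumed)
    """
    changes = []
    for doi, entry in cache.items():
        old = reviewed.get(doi, {})
        score = score_fn(entry["abstract"])
        status = "relevant" if score >= threshold else "not_relevant"
        # Report only papers that were screened before and whose status flipped.
        if old.get("status") not in (None, status):
            changes.append((doi, old["status"], status, old.get("score"), score))
        reviewed[doi] = {**old, "pmid": entry["pmid"], "status": status,
                         "score": score, "rubric_version": new_version}
    return changes
```

Because the abstracts come from the cache, no PubMed requests are needed when the rubric changes.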
research-sessions/YYYY-MM-DD-topic/
├── screening-criteria.json # Rubric definition (weights, rules, version)
├── test-set.json # Ground truth papers used for validation
├── abstracts-cache.json # Full abstracts for all screened papers
├── papers-reviewed.json # Simple tracking: DOI, score, status
└── rubric-changelog.md # History of rubric changes and why
Before evaluating-paper-relevance:
When creating helper scripts:
During answering-research-questions:
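def score_paper(abstract, paper_type, keywords, data_types, special_rules):
    """Score one paper; the helper functions are defined elsewhere in the script."""
    score = 0
    score += count_keyword_matches(abstract, keywords)      # 0-3 pts
    score += count_data_type_matches(abstract, data_types)  # 0-4 pts
    score += specificity_score(paper_type)                  # 0-3 pts
    # Apply every special rule whose condition matches the abstract
    for rule in special_rules:
        if matches_special_rule(abstract, rule):
            score += rule['bonus_points']
    return score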
Medicinal chemistry:
{
  "special_rules": [
    {
      "name": "scaffold_analogs",
      "keywords": ["target_scaffold", "analog|derivative|series"],
      "bonus": 3
    },
    {
      "name": "sar_data",
      "keywords": ["IC50|Ki|MIC", "structure-activity|SAR"],
      "bonus": 2
    }
  ]
}
Genomics:
{
  "special_rules": [
    {
      "name": "public_data",
      "keywords": ["GEO:|SRA:|ENA:", "accession"],
      "bonus": 3
    },
    {
      "name": "differential_expression",
      "keywords": ["DEG|differentially expressed", "RNA-seq|microarray"],
      "bonus": 2
    }
  ]
}
Computational methods:
{
  "special_rules": [
    {
      "name": "code_available",
      "keywords": ["github|gitlab|bitbucket", "code available|software"],
      "bonus": 3
    },
    {
      "name": "benchmark",
      "keywords": ["benchmark|comparison", "performance|accuracy"],
      "bonus": 2
    }
  ]
}
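One way to evaluate rules in this format is sketched below. It assumes that each entry in `keywords` is a regular-expression alternation and that a rule fires only when every entry matches somewhere in the abstract. That AND-across-entries reading is an interpretation of the examples, and the helper names are hypothetical.

```python
import re

def rule_matches(abstract, rule):
    """True when every pattern in rule['keywords'] matches the abstract."""
    return all(re.search(pattern, abstract, re.IGNORECASE)
               for pattern in rule["keywords"])

def bonus_points(abstract, rules):
    """Sum the bonuses of all special rules that fire for this abstract."""
    return sum(rule["bonus"] for rule in rules if rule_matches(abstract, rule))

genomics_rules = [
    {"name": "public_data",
     "keywords": ["GEO:|SRA:|ENA:", "accession"], "bonus": 3},
    {"name": "differential_expression",
     "keywords": ["DEG|differentially expressed", "RNA-seq|microarray"], "bonus": 2},
]
```

Case-insensitive substring matching is permissive: a short pattern such as "DEG" also matches inside "degradation". Wrapping patterns in `\b` word boundaries, where appropriate, is one way to tighten this.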
- Skipping test-driven validation: Bulk screen without testing rubric → Many misclassifications, wasted time
- Not caching abstracts: Re-fetch from PubMed when rescreening → Slow, hits rate limits
- No ground truth testing: Can't measure rubric accuracy → Don't know if it's working
- Too few test papers: Test on 2-3 papers → Rubric overfits, doesn't generalize
- Too complex rubric: Boolean logic with 10+ rules → Hard to debug, update, explain
- Not documenting changes: Update rubric without tracking why → Can't reproduce, learn from mistakes
- Setting threshold too high: Require 95% accuracy → Iterate forever, diminishing returns
- Not saving rubric with session: Rubric lost when session ends → Can't reproduce screening
Rubric is ready when:
| Task | File | Contents |
|------|------|----------|
| Define rubric | screening-criteria.json | Keywords, weights, special rules |
| Test rubric | test-set.json | Ground truth papers with user judgments |
| Cache abstracts | abstracts-cache.json | Full abstract text for re-screening |
| Track results | papers-reviewed.json | DOI, score, status, rubric version |
| Document changes | rubric-changelog.md | History of updates and rationale |
Agent: "I'll help you build a screening rubric. Let's start with some questions about what makes papers relevant..."
[Brainstorming phase - ask 5 question categories]
User: "I'm looking for papers on tuberculosis drug resistance with MIC data for bedaquiline and clofazimine. Papers about analogs are also relevant."
Agent: "Got it. Here's the initial rubric:
- Keywords: tuberculosis, drug resistance, bedaquiline, clofazimine (0-3 pts)
- Data: MIC values, IC50, in vitro activity (0-4 pts)
- Specificity: Primary research (0-3 pts)
- Special rule: If mentions target drugs + 'analog/derivative' → +3 bonus pts
- Threshold: ≥7 = relevant
Does this look right?"
User: "Yes, let's test it."
Agent: "I'll search PubMed for 20 papers and show you abstracts to classify..."
[Shows 10 abstracts, user classifies them]
Agent: "Testing rubric on your classifications...
Accuracy: 70% (7/10 correct)
- 2 false negatives (missed relevant papers)
- 1 false positive (marked review as relevant)
False negative example: Paper discusses 'derivatives of bedaquiline' but scored only 5 pts because it didn't explicitly say 'analog'.
Suggestion: Add 'derivative' and 'compound' to special rule trigger terms?"
User: "Yes, and lower review paper points."
Agent: "Updated rubric. Re-testing...
Accuracy: 90% (9/10 correct) ✓
Ready to run bulk screening on 127 papers?"
User: "Yes, go ahead."
Agent: [Screens 127 papers, saves results]
"Done! Found 18 relevant papers. All abstracts cached for re-screening.
Review offline and let me know if any misclassifications."
[Later...]
User: "Found 2 misclassifications. Can we update the rubric?"
Agent: "Sure! What did I miss?"
User: "PMID 12345 discusses pretomanid combination therapy with bedaquiline - should be relevant."
Agent: "Got it. Adding 'combination therapy' as related term with +2 bonus pts.
Re-screening all 127 cached papers...
Status changes: 3 papers now relevant (including PMID 12345).
Updated papers-reviewed.json."
After building rubric: