skills/43-wentorai-research-plugins/skills/literature/fulltext/institutional-repository-guide/SKILL.md
Access papers from institutional and subject repositories at scale
Institutional repositories (IRs) are university-run digital archives that store and provide open access to their researchers' scholarly output — dissertations, journal articles, conference papers, datasets, and technical reports. Subject repositories like arXiv, bioRxiv, SSRN, and RePEc serve similar functions for specific disciplines. Together, they form a distributed network of open scholarship that complements commercial databases.
This guide covers how to discover, access, and systematically harvest content from institutional and subject repositories for literature reviews, meta-analyses, and research data collection.
Institutional Repositories (IR):
- Run by universities to archive their researchers' output
- Examples: DSpace, EPrints, Fedora-based systems
- Discovery: OpenDOAR directory (v2.sherpa.ac.uk/opendoar)
Subject Repositories:
- Discipline-specific archives
- arXiv (physics, CS, math), bioRxiv, SSRN, RePEc, EarthArXiv
Aggregators:
- Harvest from many repositories into a single search interface
- BASE (Bielefeld Academic Search Engine)
- CORE (core.ac.uk, 200M+ open access articles)
- OpenAIRE (European research output)
OpenDOAR (Directory of Open Access Repositories) is the primary registry for finding institutional repositories:
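```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def search_opendoar(subject: Optional[str] = None, country: Optional[str] = None,
                    api_key: Optional[str] = None) -> list:
    """
    Search the OpenDOAR registry for institutional repositories.

    Args:
        subject: Filter by subject area (e.g., "Biology", "Computer Science")
        country: ISO country code (e.g., "US", "GB", "CN")
        api_key: Assumption: the v2 API expects a registered key in the
            `api-key` parameter; verify against the current OpenDOAR docs.
    """
    base_url = "https://v2.sherpa.ac.uk/cgi/retrieve"
    params = {"item-type": "repository", "format": "Json"}
    if api_key:
        params["api-key"] = api_key

    # Collect all filters into one JSON array so a second filter does not
    # overwrite the first. The [value, field] pairs follow this guide; the
    # documented filter syntax may differ, so check it before relying on it.
    filters = []
    if subject:
        filters.append([subject, "subject"])
    if country:
        filters.append([country, "country"])
    if filters:
        params["filter"] = json.dumps(filters)

    # urlencode escapes the brackets, quotes, and spaces in the filter value.
    url = f"{base_url}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url) as response:
        data = json.loads(response.read())

    repositories = []
    for item in data.get("items", []):
        meta = item.get("repository_metadata", {})
        names = meta.get("name") or [{}]  # guard against a missing or empty name list
        repositories.append({
            "name": names[0].get("name", ""),
            "url": meta.get("url", ""),
            "oai_url": meta.get("oai_url", ""),
            "software": (meta.get("software") or {}).get("name", ""),
            "type": meta.get("type", ""),
        })
    return repositories
```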
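Once the registry search returns, a natural next step is to keep only the repositories that advertise an OAI-PMH endpoint. The following is a minimal sketch, assuming the list-of-dicts shape returned by `search_opendoar`; `harvestable_repositories` is a hypothetical helper name, and the sample entries are made up for illustration.

```python
from collections import defaultdict


def harvestable_repositories(repositories: list) -> dict:
    """Group repositories that expose an OAI-PMH URL by platform software."""
    by_software = defaultdict(list)
    for repo in repositories:
        oai_url = (repo.get("oai_url") or "").strip()
        if not oai_url:
            continue  # no harvesting endpoint advertised, skip it
        software = repo.get("software") or "unknown"
        by_software[software].append({"name": repo.get("name", ""), "oai_url": oai_url})
    return dict(by_software)


# Made-up sample entries in the shape produced by search_opendoar
sample = [
    {"name": "Example University Repository",
     "oai_url": "https://repo.example.edu/oai/request", "software": "dspace"},
    {"name": "Example EPrints Archive",
     "oai_url": "https://eprints.example.ac.uk/cgi/oai2", "software": "eprints"},
    {"name": "Archive Without OAI", "oai_url": "", "software": "dspace"},
]
grouped = harvestable_repositories(sample)
```

Grouping by software is a convenience here: platform-specific quirks tend to show up during harvesting, so it can help to test one repository per platform first.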
Most institutional repositories support OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting), the standard protocol for metadata exchange:
```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from typing import Optional

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def harvest_repository(base_url: str, metadata_prefix: str = "oai_dc",
                       set_spec: Optional[str] = None,
                       from_date: Optional[str] = None) -> list:
    """
    Harvest metadata records from a repository's OAI-PMH endpoint.

    Args:
        base_url: The OAI-PMH base URL
        metadata_prefix: Metadata format (oai_dc, datacite, mets)
        set_spec: Optional set/collection to restrict harvesting
        from_date: Harvest only records whose datestamp (creation, modification,
            or deletion) is on or after this date (YYYY-MM-DD)
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date

    records = []
    while params:
        # urlencode escapes set names and resumption tokens safely.
        url = f"{base_url}?{urllib.parse.urlencode(params)}"
        with urllib.request.urlopen(url) as response:
            root = ET.parse(response).getroot()

        for record in root.findall(".//oai:record", OAI_NS):
            header = record.find("oai:header", OAI_NS)
            if header is None:
                continue
            records.append({
                "identifier": header.findtext("oai:identifier", default="", namespaces=OAI_NS),
                "datestamp": header.findtext("oai:datestamp", default="", namespaces=OAI_NS),
            })

        # A non-empty resumptionToken means more pages remain. Follow-up
        # requests carry only the verb and the token.
        token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS).strip()
        params = {"verb": "ListRecords", "resumptionToken": token} if token else None
        if params:
            time.sleep(1)  # pause between pages to stay within typical rate limits
    return records
```
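The harvester above keeps only each record's header. To pull descriptive fields out of an `oai_dc` record, something like the following sketch could be used. It assumes the standard OAI-PMH and Dublin Core namespaces; `parse_dc_record` is a hypothetical helper name, and the sample record is invented for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_dc_record(record: ET.Element) -> dict:
    """Extract header fields and selected Dublin Core elements from one <record>."""
    header = record.find("oai:header", NS)
    parsed = {
        "identifier": header.findtext("oai:identifier", default="", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", default="", namespaces=NS),
        # Deleted records carry status="deleted" on the header and no metadata.
        "deleted": header.get("status") == "deleted",
    }
    # Dublin Core elements are repeatable, so each field is stored as a list.
    for field in ("title", "creator", "date", "identifier", "subject"):
        parsed[f"dc_{field}"] = [
            el.text.strip() for el in record.findall(f".//dc:{field}", NS) if el.text
        ]
    return parsed


# Invented sample record for illustration
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:repo.example.edu:1234/5678</identifier>
    <datestamp>2024-01-15</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An Example Thesis</dc:title>
      <dc:creator>Doe, Jane</dc:creator>
      <dc:identifier>https://doi.org/10.1234/example</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>"""

parsed = parse_dc_record(ET.fromstring(SAMPLE_RECORD))
```

In `oai_dc`, DOIs usually appear as one of several `dc:identifier` values, so downstream code would need to pick out the DOI-shaped entry rather than assume a dedicated field.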
| Verb | Purpose |
|------|---------|
| Identify | Get repository name, admin email, policies |
| ListSets | List available collections/sets |
| ListMetadataFormats | List supported metadata schemas |
| ListIdentifiers | Lightweight listing of record headers |
| ListRecords | Full metadata records with pagination |
| GetRecord | Retrieve a single record by identifier |
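Each verb in the table is passed as a query parameter on the repository's base URL. A minimal sketch of building such requests, assuming a hypothetical helper `build_oai_url` and a placeholder endpoint at `repo.example.edu`:

```python
from typing import Optional
from urllib.parse import urlencode

OAI_VERBS = {
    "Identify", "ListSets", "ListMetadataFormats",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def build_oai_url(base_url: str, verb: str, args: Optional[dict] = None) -> str:
    """Build an OAI-PMH request URL for one of the six protocol verbs."""
    if verb not in OAI_VERBS:
        raise ValueError(f"Unknown OAI-PMH verb: {verb}")
    query = {"verb": verb}
    # Arguments are passed as a dict because "from" is a reserved word in Python.
    query.update({k: v for k, v in (args or {}).items() if v is not None})
    return f"{base_url}?{urlencode(query)}"


BASE = "https://repo.example.edu/oai/request"  # placeholder endpoint

identify_url = build_oai_url(BASE, "Identify")
incremental_url = build_oai_url(
    BASE, "ListRecords", {"metadataPrefix": "oai_dc", "from": "2024-01-01"}
)
record_url = build_oai_url(
    BASE, "GetRecord",
    {"identifier": "oai:repo.example.edu:1234/5678", "metadataPrefix": "oai_dc"},
)
```

Sending the `Identify` URL first is a cheap way to confirm the endpoint is alive before starting a long `ListRecords` harvest.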
DSpace is the most widely deployed open-source repository platform (used by ~40% of repositories worldwide):
- OAI-PMH endpoint: `{base-url}/oai/request`
- REST API: `{base-url}/server/api`
EPrints is popular in the UK and Europe:
- OAI-PMH endpoint: `{base-url}/cgi/oai2`
- Export: `{base-url}/cgi/export/{id}/{format}`
Fedora-based systems are used by larger institutions with complex digital collections.
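The path patterns above can be turned into a first guess at a repository's OAI-PMH endpoint. This is only a sketch: it assumes `/oai/request` is the DSpace default and `/cgi/oai2` the EPrints default, `candidate_oai_url` is a hypothetical helper, and individual repositories may mount OAI-PMH elsewhere, so the result should still be verified with an `Identify` request.

```python
from typing import Optional

# Default OAI-PMH paths by platform (assumed defaults, not guaranteed per site)
DEFAULT_OAI_PATHS = {
    "dspace": "/oai/request",
    "eprints": "/cgi/oai2",
}


def candidate_oai_url(base_url: str, software: str) -> Optional[str]:
    """Return a likely OAI-PMH endpoint for a known platform, else None."""
    path = DEFAULT_OAI_PATHS.get(software.strip().lower())
    if path is None:
        return None  # unknown platform: look the endpoint up in OpenDOAR instead
    return base_url.rstrip("/") + path


dspace_guess = candidate_oai_url("https://repo.example.edu/", "DSpace")
eprints_guess = candidate_oai_url("https://eprints.example.ac.uk", "eprints")
unknown_guess = candidate_oai_url("https://archive.example.org", "custom")
```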
1. Identify target repositories
- Use OpenDOAR to find IRs by subject or country
- List subject repositories relevant to your discipline
2. Test endpoints
- Send Identify request to verify the endpoint is active
- Check ListMetadataFormats for available schemas
3. Harvest incrementally
- Use "from" parameter to harvest only new records
- Store last harvest date for each repository
- Respect rate limits (typically 1 request per second)
4. Deduplicate
- Match records by DOI when available
- Use title + author fuzzy matching for records without DOIs
- Flag duplicates rather than deleting (keep provenance)
5. Store and index
- Save metadata in structured format (JSON, SQLite, CSV)
- Build a local search index for efficient retrieval
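The deduplication step in the workflow above can be sketched with the standard library alone. This is an illustrative approach rather than a prescribed one: the record keys (`id`, `doi`, `title`, `first_author`), the helper names, and the 0.9 similarity threshold are all assumptions, and `difflib.SequenceMatcher` stands in for whatever fuzzy matcher a real pipeline would use. Duplicates are flagged through a `duplicate_of` field instead of being deleted, so provenance is preserved.

```python
import re
from difflib import SequenceMatcher
from typing import Optional


def normalize_doi(doi: Optional[str]) -> Optional[str]:
    """Lowercase a DOI and strip common URL or 'doi:' prefixes."""
    if not doi:
        return None
    doi = re.sub(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", "", doi.strip().lower())
    return doi or None


def match_key(record: dict) -> str:
    """Build a normalized 'title author' string for fuzzy comparison."""
    text = f'{record.get("title", "")} {record.get("first_author", "")}'.lower()
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text).split())


def flag_duplicates(records: list, threshold: float = 0.9) -> list:
    """Return copies of the records with a 'duplicate_of' id (or None) added."""
    seen_dois = {}   # normalized DOI -> id of the first record carrying it
    kept = []        # (id, match key) of records not flagged as duplicates
    flagged = []
    for rec in records:
        rec = dict(rec, duplicate_of=None)
        doi = normalize_doi(rec.get("doi"))
        key = match_key(rec)
        if doi and doi in seen_dois:
            rec["duplicate_of"] = seen_dois[doi]
        elif not doi:
            # No DOI: fall back to title + author similarity (O(n^2) overall).
            for other_id, other_key in kept:
                if SequenceMatcher(None, key, other_key).ratio() >= threshold:
                    rec["duplicate_of"] = other_id
                    break
        if rec["duplicate_of"] is None:
            if doi:
                seen_dois[doi] = rec["id"]
            kept.append((rec["id"], key))
        flagged.append(rec)
    return flagged


# Invented sample records
sample_records = [
    {"id": "a", "doi": "10.1234/example", "title": "Open Access in Practice", "first_author": "Doe"},
    {"id": "b", "doi": "https://doi.org/10.1234/EXAMPLE", "title": "Open access in practice (preprint)", "first_author": "Doe"},
    {"id": "c", "doi": None, "title": "Open Access in Practice.", "first_author": "Doe"},
    {"id": "d", "doi": None, "title": "Harvesting Metadata at Scale", "first_author": "Roe"},
]
result = flag_duplicates(sample_records)
duplicate_map = {rec["id"]: rec["duplicate_of"] for rec in result}
```

The pairwise comparison is quadratic, which is fine for a few thousand records; larger harvests would likely need blocking (for example, comparing only records that share a publication year) before fuzzy matching.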
Respect robots.txt and repository rate limits.
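One way to check robots.txt rules and pick a request delay is Python's standard `urllib.robotparser`. The sketch below parses an invented robots.txt from a string so it runs offline; in practice the file would be fetched from the repository's own `/robots.txt`. The user agent name `my-harvester`, the helper names, and the 1-second fallback delay are assumptions.

```python
from urllib.robotparser import RobotFileParser


def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def request_delay(parser: RobotFileParser, user_agent: str, default: float = 1.0) -> float:
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


# Invented robots.txt content for illustration
SAMPLE_ROBOTS = """User-agent: *
Crawl-delay: 5
Disallow: /admin/
"""

USER_AGENT = "my-harvester"  # hypothetical harvester name
robots = load_robots(SAMPLE_ROBOTS)

oai_allowed = robots.can_fetch(USER_AGENT, "https://repo.example.edu/oai/request?verb=Identify")
admin_allowed = robots.can_fetch(USER_AGENT, "https://repo.example.edu/admin/stats")
delay = request_delay(robots, USER_AGENT)
```

A harvester would then skip any URL for which `can_fetch` returns `False` and sleep for `delay` seconds between requests.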