skills/literature/fulltext/institutional-repository-guide/SKILL.md
Access papers from institutional and subject repositories at scale
Institutional repositories (IRs) are university-run digital archives that store and provide open access to their researchers' scholarly output — dissertations, journal articles, conference papers, datasets, and technical reports. Subject repositories like arXiv, bioRxiv, SSRN, and RePEc serve similar functions for specific disciplines. Together, they form a distributed network of open scholarship that complements commercial databases.
This guide covers how to discover, access, and systematically harvest content from institutional and subject repositories for literature reviews, meta-analyses, and research data collection.
Institutional Repositories (IR):
- Run by universities to archive their researchers' output
- Examples: DSpace, EPrints, Fedora-based systems
- Discovery: OpenDOAR directory (v2.sherpa.ac.uk/opendoar)
Subject Repositories:
- Discipline-specific archives
- arXiv (physics, CS, math), bioRxiv, SSRN, RePEc, EarthArXiv
Aggregators:
- Harvest from many repositories into a single search interface
- BASE (Bielefeld Academic Search Engine)
- CORE (core.ac.uk, 200M+ open access articles)
- OpenAIRE (European research output)
OpenDOAR (Directory of Open Access Repositories) is the primary registry for finding institutional repositories:
```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def search_opendoar(subject: Optional[str] = None,
                    country: Optional[str] = None) -> list:
    """
    Search the OpenDOAR registry for institutional repositories.

    Args:
        subject: Filter by subject area (e.g., "Biology", "Computer Science")
        country: ISO country code (e.g., "US", "GB", "CN")
    """
    base_url = "https://v2.sherpa.ac.uk/cgi/retrieve"
    query = {"item-type": "repository", "format": "Json"}

    # Collect all filters into one JSON-encoded "filter" parameter rather
    # than repeating the parameter once per filter.
    # NOTE: verify the filter field names and syntax against the current
    # Sherpa API documentation; the service may also require an API key.
    filters = []
    if subject:
        filters.append([subject, "subject"])
    if country:
        filters.append([country, "country"])
    if filters:
        query["filter"] = json.dumps(filters, separators=(",", ":"))

    # urlencode escapes the quotes, brackets, and spaces in the filter value.
    url = f"{base_url}?{urllib.parse.urlencode(query)}"
    with urllib.request.urlopen(url, timeout=30) as response:
        data = json.loads(response.read())

    repositories = []
    for item in data.get("items", []):
        meta = item.get("repository_metadata", {})
        names = meta.get("name") or [{}]  # guard against an empty name list
        repositories.append({
            "name": names[0].get("name", ""),
            "url": meta.get("url", ""),
            "oai_url": meta.get("oai_url", ""),
            "software": meta.get("software", {}).get("name", ""),
            "type": meta.get("type", ""),
        })
    return repositories
```
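Not every registry entry exposes an OAI-PMH endpoint, so it can help to filter the results before harvesting. The following is a minimal sketch that assumes the list-of-dicts shape returned by `search_opendoar` (keys `name`, `url`, `oai_url`, `software`, `type`). The helper names `harvestable` and `software_counts` and the sample entries are hypothetical, not part of any OpenDOAR client.

```python
from collections import Counter


def harvestable(repositories: list) -> list:
    """Keep only entries with a non-empty http(s) OAI-PMH base URL."""
    keep = []
    for repo in repositories:
        oai_url = (repo.get("oai_url") or "").strip()
        if oai_url.startswith(("http://", "https://")):
            keep.append({**repo, "oai_url": oai_url})
    return keep


def software_counts(repositories: list) -> Counter:
    """Tally repositories by platform name; blank names count as 'unknown'."""
    return Counter(repo.get("software") or "unknown" for repo in repositories)


# Illustrative entries only; real results would come from search_opendoar().
sample = [
    {"name": "Repo A", "url": "https://a.example.edu",
     "oai_url": "https://a.example.edu/oai/request",
     "software": "DSpace", "type": "institutional"},
    {"name": "Repo B", "url": "https://b.example.edu",
     "oai_url": "", "software": "EPrints", "type": "institutional"},
    {"name": "Repo C", "url": "https://c.example.edu",
     "oai_url": " https://c.example.edu/cgi/oai2 ",
     "software": "", "type": "institutional"},
]

targets = harvestable(sample)
print([repo["name"] for repo in targets])
print(software_counts(targets))
```

Entries without an endpoint are dropped rather than guessed at, which keeps the later harvesting step from failing on empty URLs.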
Most institutional repositories support OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting), the standard protocol for metadata exchange:
```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from typing import Optional

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def harvest_repository(base_url: str, metadata_prefix: str = "oai_dc",
                       set_spec: Optional[str] = None,
                       from_date: Optional[str] = None) -> list:
    """
    Harvest metadata records from a repository's OAI-PMH endpoint.

    Args:
        base_url: The OAI-PMH base URL
        metadata_prefix: Metadata format (oai_dc, datacite, mets)
        set_spec: Optional set/collection to restrict harvesting
        from_date: Harvest only records with a datestamp on or after
            this date (YYYY-MM-DD)
    """
    query = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        query["set"] = set_spec
    if from_date:
        query["from"] = from_date

    records = []
    while query:
        # urlencode escapes any reserved characters in set names and tokens.
        url = f"{base_url}?{urllib.parse.urlencode(query)}"
        with urllib.request.urlopen(url, timeout=60) as response:
            root = ET.parse(response).getroot()

        for record in root.findall(".//oai:record", OAI_NS):
            header = record.find("oai:header", OAI_NS)
            records.append({
                "identifier": header.findtext("oai:identifier", default="",
                                              namespaces=OAI_NS),
                "datestamp": header.findtext("oai:datestamp", default="",
                                             namespaces=OAI_NS),
            })

        # A non-empty resumptionToken means more pages follow. Follow-up
        # requests carry only the verb and the token.
        token = root.findtext(".//oai:resumptionToken", default="",
                              namespaces=OAI_NS).strip()
        query = {"verb": "ListRecords", "resumptionToken": token} if token else None
    return records
```
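The function above keeps only the identifier and datestamp from each header. For a literature review you usually also want the descriptive fields. The sketch below shows one way to flatten an `oai_dc` record into a dictionary, assuming the standard OAI-PMH, `oai_dc`, and Dublin Core namespaces. The function name `parse_dc_record` and the sample XML are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_dc_record(record: ET.Element) -> dict:
    """Turn one <record> element into a flat dict of Dublin Core fields."""
    header = record.find("oai:header", NS)
    out = {
        "identifier": header.findtext("oai:identifier", default="", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", default="", namespaces=NS),
        # Deleted records carry status="deleted" and have no <metadata> element.
        "deleted": header.get("status") == "deleted",
    }
    # Dublin Core elements are repeatable, so every field is stored as a list.
    for field in ("title", "creator", "date", "identifier"):
        elements = record.findall(f".//dc:{field}", NS)
        out[f"dc_{field}"] = [e.text.strip() for e in elements if e.text]
    return out


# Hand-written sample response; real input comes from a ListRecords request.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repo.example.edu:123</identifier>
        <datestamp>2024-01-15</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Thesis</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:identifier>https://doi.org/10.1234/example</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:repo.example.edu:124</identifier>
        <datestamp>2024-01-16</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""

root = ET.fromstring(SAMPLE)
parsed = [parse_dc_record(r) for r in root.findall(".//oai:record", NS)]
for row in parsed:
    print(row["identifier"], row["deleted"], row["dc_title"])
```

Keeping the `deleted` flag lets an incremental harvest remove withdrawn items from a local copy instead of silently retaining them.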
| Verb | Purpose |
|------|---------|
| Identify | Get repository name, admin email, policies |
| ListSets | List available collections/sets |
| ListMetadataFormats | List supported metadata schemas |
| ListIdentifiers | Lightweight listing of record headers |
| ListRecords | Full metadata records with pagination |
| GetRecord | Retrieve a single record by identifier |
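Each verb in the table is sent as a `verb` query parameter on the repository's base URL. As a rough sketch, the helper below builds request URLs and checks the arguments that the OAI-PMH 2.0 specification marks as required for each verb. The name `build_oai_url`, the `from_` keyword convention, and the example base URL are assumptions made for illustration.

```python
import urllib.parse

# Required arguments per verb, following the OAI-PMH 2.0 specification.
REQUIRED_ARGS = {
    "Identify": (),
    "ListSets": (),
    "ListMetadataFormats": (),
    "ListIdentifiers": ("metadataPrefix",),
    "ListRecords": ("metadataPrefix",),
    "GetRecord": ("identifier", "metadataPrefix"),
}


def build_oai_url(base_url: str, verb: str, **args: str) -> str:
    """Build an OAI-PMH request URL, validating the verb and its arguments."""
    if verb not in REQUIRED_ARGS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # A resumptionToken is exclusive: it stands in for all other arguments.
    if "resumptionToken" not in args:
        missing = [name for name in REQUIRED_ARGS[verb] if name not in args]
        if missing:
            raise ValueError(f"{verb} requires: {', '.join(missing)}")
    query = {"verb": verb}
    for key, value in args.items():
        # "from" is a Python keyword, so callers pass it as from_.
        query["from" if key == "from_" else key] = value
    return f"{base_url}?{urllib.parse.urlencode(query)}"


BASE = "https://repo.example.edu/oai/request"  # hypothetical endpoint
print(build_oai_url(BASE, "Identify"))
print(build_oai_url(BASE, "ListIdentifiers", metadataPrefix="oai_dc",
                    from_="2024-01-01"))
print(build_oai_url(BASE, "GetRecord", identifier="oai:repo.example.edu:123",
                    metadataPrefix="oai_dc"))
```

Validating arguments locally avoids a round trip that would only return a `badArgument` error from the server.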
DSpace is the most widely deployed open-source repository platform (used by ~40% of repositories worldwide):
- OAI-PMH endpoint: {base-url}/oai/request
- REST API: {base-url}/server/api
EPrints is popular in the UK and Europe:
- OAI-PMH endpoint: {base-url}/cgi/oai2
- Export: {base-url}/cgi/export/{id}/{format}
Fedora-based systems are used by larger institutions with complex digital collections.
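When a registry entry names the platform but not the OAI-PMH URL, the conventional paths above give a reasonable first guess. This is a small sketch, assuming `/oai/request` for DSpace and `/cgi/oai2` for EPrints. Individual sites can and do relocate these paths, so treat the result as a candidate to test with an `Identify` request. The function name `guess_oai_endpoint` is hypothetical.

```python
from typing import Optional

# Conventional OAI-PMH paths by platform; individual sites may differ.
OAI_PATHS = {
    "dspace": "/oai/request",
    "eprints": "/cgi/oai2",
}


def guess_oai_endpoint(base_url: str, software: str) -> Optional[str]:
    """Return a candidate OAI-PMH URL, or None for platforms not listed."""
    path = OAI_PATHS.get(software.strip().lower())
    if path is None:
        return None
    return base_url.rstrip("/") + path


print(guess_oai_endpoint("https://repo.example.edu/", "DSpace"))
print(guess_oai_endpoint("https://eprints.example.ac.uk", "EPrints"))
print(guess_oai_endpoint("https://collections.example.edu", "Fedora"))
```

Returning `None` for unlisted platforms is deliberate: Fedora-based deployments vary too much for a single default path to be safe.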
1. Identify target repositories
- Use OpenDOAR to find IRs by subject or country
- List subject repositories relevant to your discipline
2. Test endpoints
- Send Identify request to verify the endpoint is active
- Check ListMetadataFormats for available schemas
3. Harvest incrementally
- Use "from" parameter to harvest only new records
- Store last harvest date for each repository
- Respect rate limits (typically 1 request per second)
4. Deduplicate
- Match records by DOI when available
- Use title + author fuzzy matching for records without DOIs
- Flag duplicates rather than deleting (keep provenance)
5. Store and index
- Save metadata in structured format (JSON, SQLite, CSV)
- Build a local search index for efficient retrieval
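The deduplication step (step 4) can be sketched with the standard library alone. The example below assumes each record is a dict with hypothetical `doi`, `title`, and `first_author` keys. It compares DOIs when both records have one, and otherwise falls back to fuzzy title matching with `difflib` plus an exact first-author match. The 0.92 threshold is an arbitrary starting point, not a recommended value.

```python
import difflib
import re


def normalize_doi(doi) -> str:
    """Lowercase a DOI and strip common resolver or 'doi:' prefixes."""
    doi = (doi or "").strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)


def normalize_text(text) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", (text or "").lower()).split())


def flag_duplicates(records: list, threshold: float = 0.92) -> list:
    """Set rec['duplicate_of'] to the index of the first matching record.

    Records are flagged, never removed, so provenance is preserved.
    """
    seen = []  # (index, doi, title, author) for each non-duplicate record
    for i, rec in enumerate(records):
        doi = normalize_doi(rec.get("doi"))
        title = normalize_text(rec.get("title"))
        author = normalize_text(rec.get("first_author"))
        rec["duplicate_of"] = None
        for j, other_doi, other_title, other_author in seen:
            if doi and other_doi:
                match = doi == other_doi
            else:
                ratio = difflib.SequenceMatcher(None, title, other_title).ratio()
                match = bool(title) and ratio >= threshold and author == other_author
            if match:
                rec["duplicate_of"] = j
                break
        if rec["duplicate_of"] is None:
            seen.append((i, doi, title, author))
    return records


# Illustrative records only.
records = [
    {"title": "Open Access in Institutional Repositories",
     "first_author": "Doe", "doi": "10.1234/abc"},
    {"title": "Open access in institutional repositories.",
     "first_author": "doe", "doi": "https://doi.org/10.1234/ABC"},
    {"title": "Open Access in Institutional Repositories",
     "first_author": "Doe", "doi": None},
    {"title": "Harvesting Metadata with OAI-PMH",
     "first_author": "Roe", "doi": None},
]
flag_duplicates(records)
print([rec["duplicate_of"] for rec in records])  # [None, 0, 0, None]
```

The pairwise comparison is quadratic in the number of unique records, so for large harvests it is worth grouping candidates first (for example by publication year or first author) before running the fuzzy match.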
Respect robots.txt and repository rate limits.
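One way to honor both constraints is to check robots.txt rules with the standard-library `urllib.robotparser` and to pause between requests. The sketch below parses robots.txt text that has already been downloaded, so it runs offline. The `Throttle` class, the `my-harvester` user agent, and the sample rules are assumptions for illustration.

```python
import time
import urllib.robotparser


def load_robots(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt content that has already been fetched."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = None

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = time.monotonic() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay:
            time.sleep(delay)
        self._last = time.monotonic()
        return delay


# Hand-written sample rules; a real harvester would fetch {site}/robots.txt.
ROBOTS_TXT = """User-agent: *
Disallow: /bitstream/
Crawl-delay: 2
"""

robots = load_robots(ROBOTS_TXT)
agent = "my-harvester"  # hypothetical user agent string
print(robots.can_fetch(agent, "https://repo.example.edu/oai/request"))
print(robots.can_fetch(agent, "https://repo.example.edu/bitstream/1/file.pdf"))

# Prefer the site's own Crawl-delay; otherwise fall back to one request per second.
throttle = Throttle(min_interval=robots.crawl_delay(agent) or 1.0)
print(throttle.min_interval)
```

Calling `throttle.wait()` immediately before each request keeps the harvester at or below the configured rate, regardless of how long parsing takes between requests.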