prompts/skills/ia/SKILL.md
Interact with Internet Archive (archive.org) - upload files, download items, and search the archive using the ia CLI tool. Use when working with archive.org, archiving content, or retrieving historical data.
This skill enables interaction with the Internet Archive (archive.org) using the ia command-line tool from the internetarchive Python package.
An item is the fundamental unit on archive.org - a logical grouping of related files sharing common metadata. An item can be a book, a song, an album, a dataset, a movie, an image or set of images, etc. Each item has a unique identifier across the entire archive.
Every item contains:
- <identifier>_meta.xml - item-level metadata
- <identifier>_files.xml - file-level metadata

Items must belong to a collection.
| Constraint | Recommended | Hard Limit |
|------------|-------------|------------|
| Item total size | Under 100GB | ~1TB |
| Files per item | Under 10,000 | 250,000 (performance degrades >10,000) |
| Single file size | Under 50GB | 500-700GB |
| Daily upload | Under 1,000 files | 5,000 files (zips count as 1) |
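The recommended limits can be checked before an upload is attempted. The following is a minimal sketch, not part of the ia tool: the function and constant names are hypothetical, and it assumes decimal gigabytes because the table does not say whether GB means 10^9 or 2^30 bytes.

```python
# Recommended limits copied from the table above.
# Assumption: "GB" is treated as 10**9 bytes; the table does not specify.
GB = 1000 ** 3

RECOMMENDED_MAX_ITEM_BYTES = 100 * GB   # "Under 100GB" per item
RECOMMENDED_MAX_FILES = 10_000          # "Under 10,000" files per item
RECOMMENDED_MAX_FILE_BYTES = 50 * GB    # "Under 50GB" per file


def check_item_plan(file_sizes):
    """Return a list of warnings for a planned item.

    file_sizes is a list of file sizes in bytes (hypothetical input shape).
    An empty list means the plan stays within the recommended limits.
    """
    warnings = []

    if len(file_sizes) >= RECOMMENDED_MAX_FILES:
        warnings.append(
            f"{len(file_sizes)} files; recommended is under {RECOMMENDED_MAX_FILES}"
        )

    total = sum(file_sizes)
    if total >= RECOMMENDED_MAX_ITEM_BYTES:
        warnings.append(f"item total is {total} bytes; recommended is under 100GB")

    oversized = [size for size in file_sizes if size >= RECOMMENDED_MAX_FILE_BYTES]
    if oversized:
        warnings.append(f"{len(oversized)} file(s) are 50GB or larger")

    return warnings
```

A plan that produces warnings is still uploadable as long as it stays under the hard limits; splitting it across several items is usually the simpler fix.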
Permanent URL patterns:
- https://archive.org/details/<identifier>
- https://archive.org/download/<identifier>
- https://archive.org/download/<identifier>/<filename>
- https://archive.org/history/<identifier>

Warning: Never link to server-specific URLs like ia802304.us.archive.org - these break when items migrate between servers. Always use the canonical archive.org URLs above.
For more details, see: https://archive.org/developers/items.html
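When links are generated in scripts, building them from the identifier avoids copying a server-specific URL by accident. A small sketch follows; the helper name is hypothetical, and percent-encoding the filename is an assumption on my part for names containing spaces.

```python
from urllib.parse import quote

BASE = "https://archive.org"


def item_urls(identifier, filename=None):
    """Build the canonical URL patterns listed above for one item."""
    urls = {
        "details": f"{BASE}/details/{identifier}",
        "download": f"{BASE}/download/{identifier}",
        "history": f"{BASE}/history/{identifier}",
    }
    if filename is not None:
        # Assumption: filenames are percent-encoded; "/" is kept so that
        # files inside subdirectories still resolve.
        urls["file"] = f"{BASE}/download/{identifier}/{quote(filename)}"
    return urls


print(item_urls("TripDown1905")["details"])
```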
When you upload files to the Internet Archive, the system automatically generates derivative files - converted versions in different formats and resolutions. For example:
Derivatives make content accessible across different devices and bandwidths. You can identify derivatives in ia list output - they have an original field pointing to their source file.
To skip derivative generation during upload, use --no-derive:
ia upload my-item file.mp4 --metadata="mediatype:movies" --no-derive
For the complete list of source formats and their generated derivatives, see: https://archive.org/help/derivatives.php
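The "original field" rule can also be applied to file records in a script. This sketch assumes each record is a dict with a name key and, for derivatives only, an original key; the sample records are made up for illustration and do not come from a real item.

```python
def split_by_origin(files):
    """Split file names into (non_derivatives, derivatives).

    A record counts as a derivative when it carries an "original" field,
    as described above. Everything else lands in the first list.
    """
    non_derivatives, derivatives = [], []
    for record in files:
        if "original" in record:
            derivatives.append(record["name"])
        else:
            non_derivatives.append(record["name"])
    return non_derivatives, derivatives


# Hypothetical file records for illustration only.
sample = [
    {"name": "film.mp4"},
    {"name": "film.ogv", "original": "film.mp4"},
    {"name": "film.gif", "original": "film.mp4"},
]
print(split_by_origin(sample))
```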
Internet Archive items use XML-based metadata. Key points:
- identifier, mediatype
- title, description, creator, date, subject, collection, language

Identifier requirements:
For the complete metadata schema reference, see: https://archive.org/developers/metadata-schema
Collections group related items together. Key points:
To request a collection, contact Internet Archive with:
Public upload collections (anyone can upload to):
- opensource_movies, opensource_audio, opensource_media - general media
- community_texts, community_video, community_audio - community contributions

Other collections restrict uploads to designated uploaders only.
Before using any ia commands, check if the tool is installed:
ia --version
If the ia command is not found, install it using uv:
uv tool install internetarchive
Alternative installation methods:
- pipx install internetarchive
- pip install internetarchive

After installation, verify it works with ia --version.
These options work with all ia commands:
| Option | Description |
|--------|-------------|
| -h, --help | Show help message |
| -v, --version | Display version |
| -c FILE, --config-file | Path to config file |
| -l, --log | Enable logging |
| -d, --debug | Enable debug output |
Check if ia is configured:
ia configure --whoami
If not configured (shows error or empty), the user needs to set up credentials:
- ia configure and follow prompts
- ~/.config/ia.ini

| Option | Description |
|--------|-------------|
| --whoami | Print current authenticated user |
| --show | Print current config as JSON |
| --check | Validate IA-S3 keys (exit 0 if valid, 1 otherwise) |
# Show current config
ia configure --show
# Validate keys (useful in scripts)
ia configure --check && echo "Keys valid"
Alternative to config file:
export IA_ACCESS_KEY_ID="your-access-key"
export IA_SECRET_ACCESS_KEY="your-secret-key"
Note: Configuration is required for uploads and metadata modifications. Searching and downloading public items works without authentication.
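A script can check for the environment variables before attempting an upload. This is a sketch with a hypothetical helper name. It only inspects the environment, so it will report missing variables even when a valid config file exists; ia configure --check remains the authoritative test.

```python
import os

# Variable names as given in the export lines above.
REQUIRED_VARS = ("IA_ACCESS_KEY_ID", "IA_SECRET_ACCESS_KEY")


def missing_credentials(environ=None):
    """Return the names of credential variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_credentials()
if missing:
    print("Not set in environment:", ", ".join(missing))
```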
All requests to the Internet Archive must include a proper User-Agent string that clearly identifies the source of the request. This applies to every request made via any tool - the ia CLI, Python library, direct API calls, curl, or any other HTTP client. This is critical for AI agents, bots, and automated tools.
The ia CLI automatically includes a default User-Agent with your access key:
internetarchive/5.7.2 (Linux x86_64; N; en; ACCESS_KEY) Python/3.11.0
When using Claude Code or other AI/LLM agents, you must append a custom suffix that includes:
The --user-agent-suffix CLI option and user_agent_suffix config setting require internetarchive version 5.7.2 or newer. The default User-Agent (including access key) is always sent - your suffix is appended to it.
CLI:
ia --user-agent-suffix "Claude Code/1.0.0 (claude-sonnet-4-20250514)" download my-item
INI file (~/.config/internetarchive/ia.ini):
[general]
user_agent_suffix = Claude Code/1.0.0 (claude-sonnet-4-20250514)
Python API:
from internetarchive import get_session
session = get_session(config={
'general': {'user_agent_suffix': 'Claude Code/1.0.0 (claude-sonnet-4-20250514)'}
})
The resulting User-Agent will look like:
internetarchive/5.7.2 (Linux x86_64; N; en; ACCESS_KEY) Python/3.11.0 Claude Code/1.0.0 (claude-sonnet-4-20250514)
This helps the Internet Archive track usage patterns, troubleshoot issues, and maintain service quality. Always be specific - include version numbers, model identifiers, and enough detail to distinguish your tool from others.
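To keep the suffix consistent across invocations, it can be assembled in one place. The sketch below follows the shape of the example above, tool/version followed by the model in parentheses; the function name and the three-part split are assumptions, not an Internet Archive requirement.

```python
def user_agent_suffix(tool, version, model):
    """Format a suffix like 'Claude Code/1.0.0 (claude-sonnet-4-20250514)'."""
    parts = {"tool": tool, "version": version, "model": model}
    for label, value in parts.items():
        if not value or not value.strip():
            raise ValueError(f"{label} must be a non-empty string")
    return f"{tool.strip()}/{version.strip()} ({model.strip()})"


suffix = user_agent_suffix("Claude Code", "1.0.0", "claude-sonnet-4-20250514")
print(suffix)
```

The resulting string can be passed to --user-agent-suffix or stored as user_agent_suffix in the config file.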
Search the Internet Archive catalog:
ia search '<query>'
| Parameter | Description |
|-----------|-------------|
| --itemlist | Output identifiers only, one per line |
| -n, --num-found | Print only the count of results |
| -s, --sort | Sort results: --sort='field desc' or --sort='field asc' |
| -f, --field | Return specific metadata fields (repeatable) |
| -F, --fts | Full-text search (search within text content, not just metadata) |
| --parameters | Raw query parameters: --parameters="page=N&rows=N" |
# Get result count only
ia search 'collection:nasa' -n
# Sort by date descending
ia search 'mediatype:texts' --sort='date desc'
# Return specific fields
ia search 'collection:nasa' --field=identifier --field=title
Common sort fields for use with --sort:
| Field | Description |
|-------|-------------|
| date | Content date |
| publicdate | When item was published to archive.org |
| addeddate | When added to archive |
| updatedate | Last updated |
| title / titleSorter | Alphabetical by title |
| creator / creatorSorter | Alphabetical by creator |
| downloads | Total downloads |
| week | Downloads this week |
| month | Downloads this month |
| num_reviews | Number of reviews |
| num_favorites | Number of favorites |
| item_size | Total item size |
| files_count | Number of files |
Use asc or desc suffix:
ia search 'mediatype:audio' --sort='downloads desc'
ia search 'collection:books' --sort='publicdate asc'
ia search 'creator:NASA' --sort='title asc'
The Internet Archive uses Apache Lucene query syntax. By default, the operator is AND (all terms must be present).
| Operator | Description |
|----------|-------------|
| AND | All terms must be present (default) |
| OR | Any of the terms can be present |
| NOT | Exclude documents with term (requires at least one positive term) |
| ( ) | Group clauses to form subqueries |
Use field:value syntax to search specific metadata fields:
| Query | Description |
|-------|-------------|
| 'title:"search text"' | By title |
| 'creator:"Author Name"' | By creator/author |
| 'subject:"topic"' | Search by subject |
| 'description:"text"' | By description |
| 'collection:name' | Items in a collection |
| 'mediatype:texts' | By media type (texts, movies, audio, software, image, data) |
| 'contributor:smithsonian' | By contributor |
| 'language:eng' | By language code |
| 'format:pdf' | Items containing specific file format |
| 'isbn:9780123456789' | By ISBN |
| 'licenseurl:http*by-nc*' | By Creative Commons license |
Search values between bounds using square brackets (inclusive) or curly braces (exclusive):
| Syntax | Description |
|--------|-------------|
| [1000 TO 2000] | Inclusive range (includes bounds) |
| {1000 TO 2000} | Exclusive range (excludes bounds) |
| [1000 TO null] | Open-ended range (1000 or greater) |
| [null TO 2000] | Open-ended range (2000 or less) |
Searchable date fields: addeddate, createdate, date, indexdate, publicdate, reviewdate, updatedate, oai_updatedate
| Query | Description |
|-------|-------------|
| 'date:[2020-01-01 TO 2024-12-31]' | Date range |
| 'publicdate:[2024-01-01 TO 2024-06-30]' | By publication date |
| 'indexdate:[2024-01-01T00:00:00Z TO 2024-12-31T23:59:59Z]' | With timestamp |
| 'date:2024*' | Wildcard for year (non-range) |
Append ~ for approximate spelling matches:
ia search 'title:buttonwood~'
# Boost fuzzy matches with weights
ia search '(title:buttonwood~)^150 OR (subject:buttonwood~)^100'
Find items where a field doesn't exist:
ia search 'collection:microfiche AND NOT _exists_:creator'
Search by uploader's user item, screen name, or email:
ia search '_uploader_useritem:@username'
ia search '_uploader_screenname:"Display Name"'
ia search 'uploader:<email>'
Beyond standard metadata, you can search by:
- downloads - download count
- item_size - total item size in bytes
- files_count - number of files
- collection_size - size of collection
- item_count - items in collection

ia search 'collection:opensource AND downloads:[1000 TO null]'
ia search 'mediatype:movies AND item_size:[1000000000 TO null]'
# AND is implicit between terms
ia search 'collection:nasa mediatype:image'
# Explicit operators
ia search 'collection:nasa AND mediatype:image'
ia search 'mediatype:texts OR mediatype:audio'
ia search 'collection:opensource NOT mediatype:software'
# Grouped subqueries
ia search '(mediatype:texts OR mediatype:audio) AND creator:"Mark Twain"'
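When queries are assembled from user input, small helpers keep quoting and grouping consistent. This is a sketch only: the helper names are hypothetical, and it covers just the field, range, and boolean forms shown above, with no escaping of Lucene special characters beyond double quotes.

```python
def term(field, value):
    """Return field:value, quoting values that contain whitespace."""
    text = str(value)
    if any(ch.isspace() for ch in text):
        text = '"' + text.replace('"', '\\"') + '"'
    return f"{field}:{text}"


def between(field, lower, upper):
    """Inclusive range, e.g. date:[2020-01-01 TO 2024-12-31]."""
    return f"{field}:[{lower} TO {upper}]"


def at_least(field, lower):
    """Open-ended range using the null upper bound shown above."""
    return f"{field}:[{lower} TO null]"


def all_of(*clauses):
    """Join clauses with AND."""
    return " AND ".join(clauses)


def any_of(*clauses):
    """Join clauses with OR and group them in parentheses."""
    return "(" + " OR ".join(clauses) + ")"


query = all_of(
    any_of(term("mediatype", "texts"), term("mediatype", "audio")),
    term("creator", "Mark Twain"),
)
print(query)
```

The printed string is what would be passed, single-quoted, as the argument to ia search.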
Use the -F (or --fts) flag to search within the actual text content of items rather than just metadata. This is particularly powerful for searching text collections like books, documents, and OCR'd materials.
Basic full-text search:
ia search -F 'collection:collection_name "search phrase"'
How it works:
Full-text search syntax:
- "complete phrase"
- collection:name AND "text to find"

# Search NASA images
ia search 'collection:nasa mediatype:image' --parameters="rows=10"
# Search public domain books
ia search 'subject:"public domain" mediatype:texts'
# Get just identifiers
ia search 'creator:"Mark Twain"' --itemlist
# Full-text search within a text collection
ia search -F 'collection:books "climate change"'
# Full-text search for a specific quote in public domain texts
ia search -F '"to be or not to be" mediatype:texts'
# Full-text search with collection filter and pagination
ia search -F 'collection:usgovernmentdocuments "artificial intelligence"' --parameters="rows=20"
Download files from an Internet Archive item:
ia download <identifier>
| Parameter | Description |
|-----------|-------------|
| --glob="*.ext" | Download only matching files (use \| for multiple: '*.mp4\|*.webm') |
| --exclude="*pattern*" | Exclude files matching pattern |
| --format="FORMAT" | Download specific derivative format |
| --source=SOURCE | Filter by source: original, derivative, metadata |
| --exclude-source=SOURCE | Exclude by source type |
| --destdir=path | Download to specific directory |
| --no-directories | Flatten directory structure |
| -s, --stdout | Write file to stdout (for piping) |
| --dry-run | Show what would be downloaded |
| --checksum | Skip files that already exist with correct checksum |
| --on-the-fly | Download on-the-fly files (generated derivatives) |
| --search="QUERY" | Download from search results |
| --itemlist=FILE | Download items listed in file |
Use --source and --exclude-source to filter by file origin:
# Download only original files (skip all derivatives)
ia download my-item --source=original
# Download originals and metadata, skip derivatives
ia download my-item --exclude-source=derivative
# Download only metadata files
ia download my-item --source=metadata
# Download all files from an item
ia download TripDown1905
# Download specific files by name
ia download TripDown1905 file1.mp4 file2.ogv
# Download only MP4 files
ia download TripDown1905 --glob="*.mp4"
# Download MP4s but exclude low-quality versions
ia download TripDown1905 --glob="*.mp4" --exclude="*512kb*"
# Download specific format
ia download TripDown1905 --format='512Kb MPEG4'
# Download to specific directory
ia download TripDown1905 --destdir=./downloads
# Download from search results
ia download --search 'collection:opensource_movies' --glob="*.mp4"
# Download items from a list file
ia search 'collection:glasgowschoolofart' --itemlist > itemlist.txt
ia download --itemlist itemlist.txt
# Preview what will be downloaded
ia download my_item --dry-run
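When ia download is driven from a script, building the argument list explicitly avoids shell-quoting problems with glob patterns. The sketch below uses only flags from the table above; the function name is hypothetical and it does not run anything by itself.

```python
import shlex


def download_command(identifier, glob=None, exclude=None, destdir=None, dry_run=False):
    """Build an argv list for ia download using flags documented above."""
    argv = ["ia", "download", identifier]
    if glob:
        argv.append(f"--glob={glob}")
    if exclude:
        argv.append(f"--exclude={exclude}")
    if destdir:
        argv.append(f"--destdir={destdir}")
    if dry_run:
        argv.append("--dry-run")
    return argv


argv = download_command("TripDown1905", glob="*.mp4", exclude="*512kb*", dry_run=True)
# shlex.join renders the list as a copy-pasteable shell command.
print(shlex.join(argv))
```

With ia installed, the list can be handed to subprocess.run(argv, check=True); because no shell is involved, the glob patterns reach ia unexpanded.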
Upload files to the Internet Archive (requires authentication):
ia upload <identifier> file1 file2 --metadata="mediatype:value"
The mediatype field is required. Common values:
- texts - Books, documents, PDFs
- movies - Video files
- audio - Music, podcasts, sound
- software - Programs, games
- image - Photos, graphics
- data - Datasets, archives

| Parameter | Description |
|-----------|-------------|
| --metadata="key:value" | Set metadata (repeatable) |
| --header="key:value" | Set HTTP header |
| --checksum | Skip files already uploaded |
| -v, --verify | Verify data wasn't corrupted after upload |
| --no-derive | Skip derivative processing |
| --retries=N | Number of retry attempts |
| --remote-name=NAME | Set remote filename (for stdin uploads) |
| --keep-directories | Preserve directory structure in remote filename |
| -o, --open-after-upload | Open item in browser after upload |
| --file-metadata=FILE | File-level metadata from JSONL file |
| --spreadsheet=FILE | Bulk upload from CSV spreadsheet |
--metadata="title:My Document Title"
--metadata="creator:Author Name"
--metadata="description:A description of the content"
--metadata="subject:topic1;topic2"
--metadata="collection:community_texts"
--metadata="date:2024-01-15"
--metadata="language:eng"
# Upload a PDF document
ia upload my-document-2024 document.pdf \
--metadata="mediatype:texts" \
--metadata="title:My Document" \
--metadata="creator:John Doe"
# Upload multiple files
ia upload my-archive file1.pdf file2.pdf file3.pdf \
--metadata="mediatype:texts" \
--metadata="title:Document Collection"
# Upload with retries, skipping files already uploaded (--checksum)
ia upload my-item large-file.zip \
--metadata="mediatype:data" \
--checksum \
--retries=10
# Upload from stdin
cat data.gz | ia upload my-item - \
--remote-name=data.gz \
--metadata="mediatype:data"
# Bulk upload using spreadsheet
ia upload --spreadsheet=metadata.csv
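The repeated --metadata flags in the examples above can be generated from a dictionary. This is a sketch with a hypothetical function name; it chooses to reject a missing mediatype up front rather than rely on a default, which is a design choice on my part and not behavior of the ia tool.

```python
def upload_command(identifier, files, metadata):
    """Build an argv list for ia upload with one --metadata flag per field."""
    if not files:
        raise ValueError("at least one file is required")
    if not metadata.get("mediatype"):
        # Stricter than the CLI: fail early instead of uploading
        # with an unintended mediatype.
        raise ValueError("metadata must include mediatype")

    argv = ["ia", "upload", identifier, *files]
    for key, value in metadata.items():
        argv.append(f"--metadata={key}:{value}")
    return argv


argv = upload_command(
    "my-document-2024",
    ["document.pdf"],
    {"mediatype": "texts", "title": "My Document", "creator": "John Doe"},
)
print(argv)
```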
Notes:
- data mediatype by default if not specified

Upload to test_collection for validation - items are automatically removed after ~30 days:
ia upload my-test-item file.pdf \
--metadata="mediatype:texts" \
--metadata="collection:test_collection"
ia metadata <identifier> --exists

To set a custom thumbnail for an item, upload an image named <identifier>_itemimage.jpg:
ia upload my-item my-item_itemimage.jpg
To make files streamable but not downloadable, add the item to the stream_only collection:
ia metadata <identifier> --append-list="collection:stream_only"
View and modify item metadata:
# View metadata (JSON output)
ia metadata <identifier>
# Extract specific field with jq
ia metadata <identifier> | jq '.metadata.date'
# List file formats contained in an item
ia metadata <identifier> --formats
# Modify metadata (set or replace)
ia metadata <identifier> --modify="title:New Title"
ia metadata <identifier> --modify="foo:bar" --modify="baz:value"
# Remove a metadata field
ia metadata <identifier> --modify="fieldname:REMOVE_TAG"
# Append value to existing field
ia metadata <identifier> --append="title:Subtitle Here"
# Append to list field (e.g., subjects)
ia metadata <identifier> --append-list="subject:new topic"
# Remove specific value from list field
ia metadata <identifier> --remove="subject:old topic"
# Modify file-level metadata
ia metadata <identifier> --target="files/foo.txt" --modify="title:My File"
# Bulk updates from spreadsheet
ia metadata --spreadsheet=metadata.csv
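Where jq is not available, the JSON printed by ia metadata can be read with the standard library. The sketch mirrors the jq '.metadata.date' example above; the helper name and the sample document are hypothetical, and only the top-level metadata key is assumed.

```python
import json


def metadata_field(raw_json, field, default=None):
    """Return one field from the "metadata" object of ia metadata output."""
    document = json.loads(raw_json)
    return document.get("metadata", {}).get(field, default)


# Hypothetical, heavily trimmed output for illustration only.
sample = '{"metadata": {"identifier": "my-item", "date": "2024-01-15"}}'
print(metadata_field(sample, "date"))
```

In practice raw_json would be the captured stdout of ia metadata <identifier>.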
List files in an Internet Archive item:
ia list <identifier>
Shows all files with details (name, size, format).
| Parameter | Description |
|-----------|-------------|
| --columns=name,size | Specify columns to show |
| --glob="*.pdf" | Filter by pattern |
| -l, --location | Print full URLs for each file |
| -a, --all | List all available file information |
| -v, --verbose | Print column headers |
# List with full URLs
ia list my-item --location
# List all file info with headers
ia list my-item --all --verbose
# List specific columns
ia list my-item --columns=name,size,format
Check status of catalog tasks (uploads, derives, etc.):
# Check tasks for a specific item
ia tasks <identifier>
# Check all your tasks
ia tasks
To make an item dark (hidden from public access) or undark it:
# Dark an item (requires comment)
ia tasks <identifier> --cmd=make_dark.php --comment="Reason for darking"
# Undark an item
ia tasks <identifier> --cmd=make_undark.php --comment="Reason for undarking"
For batch processing many items, use GNU Parallel to run ia commands concurrently.
# macOS
brew install parallel
# Debian/Ubuntu
apt install parallel
Pipe item identifiers to parallel, using {} as placeholder:
# Fetch metadata for many items
cat itemlist.txt | parallel 'ia metadata {}'
# Download multiple items
cat itemlist.txt | parallel 'ia download {}'
For reliable bulk operations, use job logging to track progress and handle failures:
# Step 1: Create item list
ia search 'collection:myproject' --itemlist > itemlist.txt
# Step 2: Run with job logging
cat itemlist.txt | parallel --joblog job.log 'ia download {}'
# Step 3: Check for failures
echo $? # 0 = all succeeded
# Step 4: Retry only failed jobs
parallel --retry-failed --joblog job.log
The --joblog file tracks each command's exit status, which is what lets --retry-failed rerun only the jobs that failed.
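The job log can also be inspected directly to see which commands failed. This sketch assumes the tab-separated layout GNU Parallel writes, with a header row containing Exitval and Command columns; the function name and the sample log are hypothetical.

```python
def failed_commands(joblog_text):
    """Return the commands whose Exitval column is not 0."""
    lines = joblog_text.splitlines()
    if not lines:
        return []

    header = lines[0].split("\t")
    exit_col = header.index("Exitval")
    cmd_col = header.index("Command")

    failed = []
    for line in lines[1:]:
        if not line.strip():
            continue
        # Limit the split so tabs inside the command itself are preserved.
        fields = line.split("\t", len(header) - 1)
        if fields[exit_col] != "0":
            failed.append(fields[cmd_col])
    return failed


# Hypothetical job log for illustration only.
sample_log = (
    "Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n"
    "1\t:\t1700000000.000\t2.100\t0\t0\t0\t0\tia download item-a\n"
    "2\t:\t1700000001.000\t0.500\t0\t0\t1\t0\tia download item-b\n"
)
print(failed_commands(sample_log))
```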
Always preview before bulk execution:
cat itemlist.txt | parallel --dry-run 'ia download {}'
Control concurrency to avoid overwhelming the server:
# Limit to 4 concurrent jobs
cat itemlist.txt | parallel -j4 'ia download {}'
# Add delay between jobs
cat itemlist.txt | parallel --delay 1 'ia download {}'
See: https://archive.org/developers/internetarchive/parallel.html
- ia configure first
- ia metadata <id>
- --checksum for large uploads to enable resume
- --dry-run to preview operations
- language metadata for proper OCR processing on texts

| Error | Solution |
|-------|----------|
| "not configured" | Run ia configure or set environment variables |
| "identifier exists" | Choose a different identifier |
| "permission denied" | Check credentials at https://archive.org/account/s3.php |
| "network error" | Retry the operation; check internet connection |
| "item not found" | Verify the identifier spelling |
| "429 Too Many Requests" | Rate limited; wait for the interval given in the Retry-After header, then retry |
| Item not appearing in search | Usually appears within minutes; check ia tasks <identifier> for pending jobs |
| Derive task failed | Check filename characters, file format, language metadata |
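For the 429 case, a client can derive its wait time from the Retry-After header. The sketch below is an assumption-laden helper, not part of ia: it handles only the integer-seconds form of the header, falls back to a default for anything else (including the HTTP-date form), and caps the wait at a hypothetical maximum.

```python
def retry_delay(headers, default=60.0, maximum=3600.0):
    """Return how many seconds to wait before retrying a 429 response.

    headers is any mapping of response header names to values.
    default and maximum are illustrative values, not documented limits.
    """
    value = None
    for name, header_value in headers.items():
        if name.lower() == "retry-after":
            value = header_value

    if value is None or not str(value).strip().isdigit():
        return default
    return min(float(str(value).strip()), maximum)


print(retry_delay({"Retry-After": "30"}))
```

A caller would sleep for the returned number of seconds and then repeat the request, ideally with a cap on the number of attempts.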
# Search
ia search 'query'
ia search 'query' --itemlist
# Download
ia download <identifier>
ia download <identifier> --glob="*.pdf"
# Upload (requires auth)
ia upload <identifier> files --metadata="mediatype:texts"
# Metadata
ia metadata <identifier>
ia metadata <identifier> --modify="title:New Title"
# List files
ia list <identifier>
# Tasks
ia tasks <identifier>
# Config
ia configure
ia configure --whoami
# Install
uv tool install internetarchive
For programmatic access beyond the CLI, see the full developer documentation: https://archive.org/developers
| API | Description |
|-----|-------------|
| Items | Understanding item structure and access |
| Metadata Schema | Complete metadata field reference |
| Metadata Read | Retrieve item metadata via API |
| Metadata Write | Modify item metadata via API |
| IAS3 | S3-compatible API for uploads |
| Tasks | Task queue management |

| API | Description |
|-----|-------------|
| Changes | Track item modifications across the archive |
| Views | Access viewing and download statistics |
| Reviews | Manage item reviews |
| Simple Lists | Create item relationships and lists |
| OCR Service | Text recognition service |
| PDF Service | PDF generation and processing |
For Python integration: internetarchive library
A community-maintained TypeScript port is available: internetarchive-ts (docs)
Note: This is a work in progress and not officially maintained by the Internet Archive.