plugins/media-curator/skills/archive-acquisition/SKILL.md
Patterns for acquiring content from Internet Archive and archival sources
npx skillsauth add jmagly/aiwg Archive AcquisitionInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Specialized patterns and techniques for acquiring media content from Internet Archive (archive.org) and other archival sources. Focuses on bulk downloads, quality selection, format filtering, and collection management.
Internet Archive hosts massive collections of audio, video, text, and software. Key characteristics:
Find items matching specific criteria:
# Search for items by keyword and mediatype
search_archive() {
local query="$1"
local mediatype="$2" # audio, movies, texts, etc.
local rows="${3:-100}" # Results per page
curl -s "https://archive.org/advancedsearch.php" \
-d "q=$query AND mediatype:$mediatype" \
-d "fl[]=identifier,title,creator,date,format" \
-d "rows=$rows" \
-d "output=json" \
| jq '.response.docs[]'
}
# Example: Search for Grateful Dead concerts
search_archive "grateful dead" "audio" 50
{
"identifier": "gd1977-05-08.sbd.miller.97065.flac16",
"title": "Grateful Dead Live at Barton Hall on 1977-05-08",
"creator": "Grateful Dead",
"date": "1977-05-08",
"format": ["Flac", "VBR MP3", "Ogg Vorbis", "Metadata"]
}
List all items in a collection:
# Get items from a specific collection
get_collection_items() {
local collection="$1"
local rows="${2:-1000}"
curl -s "https://archive.org/advancedsearch.php" \
-d "q=collection:$collection" \
-d "fl[]=identifier,title,format" \
-d "rows=$rows" \
-d "output=json" \
| jq -r '.response.docs[] | .identifier'
}
# Example: Get all items from GratefulDead collection
get_collection_items "GratefulDead" 500
# Find items by date range
search_by_date() {
local creator="$1"
local start_date="$2" # YYYY-MM-DD
local end_date="$3"
curl -s "https://archive.org/advancedsearch.php" \
-d "q=creator:\"$creator\" AND date:[$start_date TO $end_date]" \
-d "fl[]=identifier,title,date" \
-d "rows=100" \
-d "output=json" \
| jq '.response.docs[]'
}
# Find items with specific format available
search_by_format() {
local query="$1"
local format="$2" # FLAC, MP3, etc.
curl -s "https://archive.org/advancedsearch.php" \
-d "q=$query AND format:$format" \
-d "fl[]=identifier,title,format" \
-d "rows=100" \
-d "output=json" \
| jq '.response.docs[]'
}
# Example: Find all FLAC recordings of specific artist
search_by_format "pink floyd" "Flac"
# Retrieve full metadata for an item
get_item_metadata() {
local identifier="$1"
curl -s "https://archive.org/metadata/$identifier" \
| jq '.'
}
# Extract specific metadata fields
get_item_files() {
local identifier="$1"
curl -s "https://archive.org/metadata/$identifier" \
| jq -r '.files[] | select(.format != "Metadata") | "\(.name)\t\(.format)\t\(.size)"'
}
# Example output:
# gd77-05-08d1t01.flac Flac 45123456
# gd77-05-08d1t01.mp3 VBR MP3 12345678
# Get only FLAC files from an item
get_flac_files() {
local identifier="$1"
curl -s "https://archive.org/metadata/$identifier" \
| jq -r '.files[] | select(.format == "Flac") | .name'
}
# Get best quality audio files (prefer FLAC, fallback to 320K MP3)
get_best_audio_files() {
local identifier="$1"
# Try FLAC first
local flac_files=$(curl -s "https://archive.org/metadata/$identifier" \
| jq -r '.files[] | select(.format == "Flac") | .name')
if [[ -n "$flac_files" ]]; then
echo "$flac_files"
return 0
fi
# Fallback to highest bitrate MP3
curl -s "https://archive.org/metadata/$identifier" \
| jq -r '.files[] | select(.format == "VBR MP3" or .format == "320Kbps MP3") | .name'
}
# Download all files from a single item
download_item() {
local identifier="$1"
local output_dir="$2"
local format_filter="$3" # Optional: FLAC, MP3, etc.
mkdir -p "$output_dir"
# Get file list
local files
if [[ -n "$format_filter" ]]; then
files=$(curl -s "https://archive.org/metadata/$identifier" \
| jq -r ".files[] | select(.format == \"$format_filter\") | .name")
else
files=$(curl -s "https://archive.org/metadata/$identifier" \
| jq -r '.files[] | select(.format != "Metadata") | .name')
fi
# Download each file
while IFS= read -r file; do
echo "Downloading: $file"
wget -q --show-progress \
"https://archive.org/download/$identifier/$file" \
-P "$output_dir"
done <<< "$files"
}
# Example: Download all FLAC files from an item
download_item "gd1977-05-08.sbd.miller.97065.flac16" "/mnt/archive/grateful-dead/1977-05-08" "Flac"
# Download entire collection with format filtering
download_collection() {
local collection="$1"
local output_base="$2"
local format="$3" # FLAC, MP3, etc.
local max_items="${4:-100}"
# Get collection items
local items=$(get_collection_items "$collection" "$max_items")
local count=0
while IFS= read -r identifier; do
count=$((count + 1))
echo "[$count/$max_items] Downloading: $identifier"
# Create item directory
local item_dir="$output_base/$identifier"
mkdir -p "$item_dir"
# Download with format filter
download_item "$identifier" "$item_dir" "$format"
# Download metadata
curl -s "https://archive.org/metadata/$identifier" \
> "$item_dir/.curator/metadata.json"
# Rate limiting (be nice to archive.org)
sleep 2
done <<< "$items"
}
# Example: Download first 50 FLAC items from collection
download_collection "GratefulDead" "/mnt/archive/grateful-dead" "Flac" 50
# Download collection items in parallel (with concurrency limit)
download_collection_parallel() {
local collection="$1"
local output_base="$2"
local format="$3"
local max_concurrent="${4:-3}"
# Get collection items
local items=$(get_collection_items "$collection" 1000)
# Create job queue
local queue_file="/tmp/curator-archive-queue-$$.txt"
echo "$items" > "$queue_file"
# Process queue with concurrency limit
cat "$queue_file" | xargs -P "$max_concurrent" -I {} bash -c "
identifier={}
item_dir=\"$output_base/\$identifier\"
mkdir -p \"\$item_dir\"
echo \"Downloading: \$identifier\"
# Download best quality audio
files=\$(curl -s \"https://archive.org/metadata/\$identifier\" \
| jq -r '.files[] | select(.format == \"$format\") | .name')
while IFS= read -r file; do
wget -q --show-progress \
\"https://archive.org/download/\$identifier/\$file\" \
-P \"\$item_dir\"
done <<< \"\$files\"
# Save metadata
curl -s \"https://archive.org/metadata/\$identifier\" \
> \"\$item_dir/.curator/metadata.json\"
sleep 2 # Rate limiting
"
rm "$queue_file"
}
# Use wget's recursive mode for efficient bulk download
download_with_wget_recursive() {
local identifier="$1"
local output_dir="$2"
local format_pattern="$3" # e.g., "*.flac" or "*.mp3"
mkdir -p "$output_dir"
wget --recursive \
--no-parent \
--no-directories \
--accept "$format_pattern" \
--directory-prefix="$output_dir" \
--wait=2 \
--random-wait \
"https://archive.org/download/$identifier/"
}
# Example: Download all FLAC files from item
download_with_wget_recursive "gd1977-05-08.sbd.miller.97065.flac16" \
"/mnt/archive/gd-1977-05-08" \
"*.flac"
For audio archival, prefer formats in this order:
# Select best available format for an item
select_best_format() {
local identifier="$1"
# Get available formats
local formats=$(curl -s "https://archive.org/metadata/$identifier" \
| jq -r '.files[] | .format' | sort -u)
# Check in preference order
if echo "$formats" | grep -q "Flac"; then
echo "Flac"
elif echo "$formats" | grep -q "320Kbps MP3"; then
echo "320Kbps MP3"
elif echo "$formats" | grep -q "VBR MP3"; then
echo "VBR MP3"
elif echo "$formats" | grep -q "Ogg Vorbis"; then
echo "Ogg Vorbis"
else
echo "$(echo "$formats" | grep -i mp3 | head -1)"
fi
}
# Download using best available format
download_best_quality() {
local identifier="$1"
local output_dir="$2"
local best_format=$(select_best_format "$identifier")
echo "Selected format: $best_format"
download_item "$identifier" "$output_dir" "$best_format"
}
# Estimate download size before acquiring
estimate_download_size() {
local identifier="$1"
local format="$2"
local total_size=$(curl -s "https://archive.org/metadata/$identifier" \
| jq "[.files[] | select(.format == \"$format\") | .size | tonumber] | add")
# Convert to human-readable
local size_mb=$((total_size / 1048576))
local size_gb=$((size_mb / 1024))
if [[ $size_gb -gt 0 ]]; then
echo "${size_gb}GB"
else
echo "${size_mb}MB"
fi
}
# Check if size is acceptable before downloading
download_if_acceptable() {
local identifier="$1"
local output_dir="$2"
local format="$3"
local max_size_gb="${4:-5}" # Default 5GB limit
local estimated_size=$(estimate_download_size "$identifier" "$format")
local size_value=$(echo "$estimated_size" | sed 's/[^0-9]//g')
local size_unit=$(echo "$estimated_size" | sed 's/[0-9]//g')
if [[ "$size_unit" == "GB" && $size_value -gt $max_size_gb ]]; then
echo "SKIPPED: $identifier ($estimated_size exceeds ${max_size_gb}GB limit)"
return 1
fi
echo "Downloading: $identifier ($estimated_size)"
download_item "$identifier" "$output_dir" "$format"
}
# Convert Archive.org metadata to curator format
extract_archive_metadata() {
local identifier="$1"
local output_file="$2"
curl -s "https://archive.org/metadata/$identifier" | jq '{
source: "archive.org",
identifier: .metadata.identifier,
title: .metadata.title,
creator: .metadata.creator,
date: .metadata.date,
venue: .metadata.venue,
coverage: .metadata.coverage,
description: .metadata.description,
collection: .metadata.collection,
downloads: .item_size.downloads,
avg_rating: .reviews.info.avg_rating,
num_reviews: .reviews.info.num_reviews,
files: [
.files[] | {
name: .name,
format: .format,
size: .size,
md5: .md5,
sha1: .sha1
}
]
}' > "$output_file"
}
# Example usage
extract_archive_metadata "gd1977-05-08.sbd.miller.97065.flac16" \
".curator/metadata/gd-1977-05-08.json"
# Merge Archive.org metadata with existing curator metadata
merge_metadata() {
local curator_metadata="$1"
local archive_metadata="$2"
local output_file="$3"
jq -s '.[0] * .[1]' "$curator_metadata" "$archive_metadata" > "$output_file"
}
Archive.org provides MD5 and SHA1 checksums for all files:
# Verify downloaded files against Archive.org checksums
verify_archive_downloads() {
local identifier="$1"
local download_dir="$2"
# Get checksums from metadata
local checksums=$(curl -s "https://archive.org/metadata/$identifier" \
| jq -r '.files[] | "\(.md5) \(.name)"')
# Verify each file
local failed=0
while IFS= read -r line; do
local expected_md5="${line%% *}"
local filename="${line#* }"
local filepath="$download_dir/$filename"
if [[ ! -f "$filepath" ]]; then
echo "MISSING: $filename"
failed=$((failed + 1))
continue
fi
local actual_md5=$(md5sum "$filepath" | awk '{print $1}')
if [[ "$actual_md5" != "$expected_md5" ]]; then
echo "CHECKSUM FAILED: $filename"
echo " Expected: $expected_md5"
echo " Actual: $actual_md5"
failed=$((failed + 1))
else
echo "VERIFIED: $filename"
fi
done <<< "$checksums"
if [[ $failed -gt 0 ]]; then
echo "Verification failed for $failed files"
return 1
fi
echo "All files verified successfully"
return 0
}
# Verify audio file integrity after download
verify_audio_integrity() {
local file="$1"
# Check file is not zero-byte
if [[ ! -s "$file" ]]; then
echo "FAILED: Zero-byte file"
return 1
fi
# Verify with ffmpeg
if command -v ffmpeg >/dev/null; then
if ffmpeg -v error -i "$file" -f null - 2>&1 | grep -q "error"; then
echo "FAILED: Corrupted audio file"
return 1
fi
fi
# FLAC specific: verify with flac tool
if [[ "$file" == *.flac ]] && command -v flac >/dev/null; then
if ! flac -t "$file" 2>&1 | grep -q "ok"; then
echo "FAILED: FLAC verification failed"
return 1
fi
fi
echo "VERIFIED: $file"
return 0
}
# Verify all files in directory
verify_directory_integrity() {
local dir="$1"
local total=0
local verified=0
local failed=0
find "$dir" -type f \( -name "*.flac" -o -name "*.mp3" -o -name "*.ogg" \) | while read -r file; do
total=$((total + 1))
if verify_audio_integrity "$file"; then
verified=$((verified + 1))
else
failed=$((failed + 1))
fi
done
echo "Integrity check complete: $verified/$total verified, $failed failed"
}
Archive.org requests responsible use of their bandwidth:
# Implement polite download behavior
polite_download() {
local identifier="$1"
local output_dir="$2"
local format="$3"
# Add delays between requests
local wait_time=2 # seconds
# Get file list
local files=$(curl -s "https://archive.org/metadata/$identifier" \
| jq -r ".files[] | select(.format == \"$format\") | .name")
while IFS= read -r file; do
echo "Downloading: $file"
wget --quiet \
--show-progress \
--wait="$wait_time" \
--random-wait \
--user-agent="AIWG-MediaCurator/1.0 (research/personal use)" \
"https://archive.org/download/$identifier/$file" \
-P "$output_dir"
done <<< "$files"
}
wget -c to resume interrupted downloads# Create searchable index of downloaded items
build_collection_index() {
local collection_dir="$1"
local index_file="$2"
echo '{"items": []}' > "$index_file"
find "$collection_dir" -name "metadata.json" -type f | while read -r metadata_file; do
local item_dir=$(dirname "$metadata_file")
# Extract key metadata fields
local item_data=$(jq '{
identifier: .metadata.identifier,
title: .metadata.title,
creator: .metadata.creator,
date: .metadata.date,
local_path: "'"$item_dir"'",
file_count: (.files | length),
total_size: ([.files[] | .size | tonumber] | add)
}' "$metadata_file")
# Append to index
jq ".items += [$item_data]" "$index_file" > "$index_file.tmp"
mv "$index_file.tmp" "$index_file"
done
echo "Index built: $index_file"
}
# Search local collection
search_local_collection() {
local index_file="$1"
local query="$2"
jq -r ".items[] | select(.title | test(\"$query\"; \"i\")) | \"\(.date) - \(.creator) - \(.title) (\(.local_path))\"" \
"$index_file"
}
# Find and remove duplicate items in collection
deduplicate_collection() {
local collection_dir="$1"
# Build checksum database
local checksum_file="/tmp/curator-checksums-$$.txt"
find "$collection_dir" -type f \( -name "*.flac" -o -name "*.mp3" \) | while read -r file; do
local checksum=$(md5sum "$file" | awk '{print $1}')
echo "$checksum $file" >> "$checksum_file"
done
# Find duplicates
sort "$checksum_file" | uniq -d -w 32 | while read -r line; do
local checksum="${line%% *}"
echo "Duplicate found (checksum: $checksum):"
grep "^$checksum" "$checksum_file" | while read -r dup_line; do
local dup_file="${dup_line#* }"
echo " - $dup_file"
done
# Offer to delete all but first
local files=($(grep "^$checksum" "$checksum_file" | cut -d' ' -f3-))
local keep="${files[0]}"
echo "Keep: $keep"
for ((i=1; i<${#files[@]}; i++)); do
echo "Delete: ${files[$i]}"
rm "${files[$i]}"
done
done
rm "$checksum_file"
}
# Download only new items added to collection since last sync
sync_collection_incremental() {
local collection="$1"
local output_base="$2"
local format="$3"
local sync_state_file="$output_base/.curator/sync-state.json"
# Get last sync timestamp
local last_sync
if [[ -f "$sync_state_file" ]]; then
last_sync=$(jq -r '.last_sync' "$sync_state_file")
else
last_sync="1970-01-01"
fi
# Search for items added/updated since last sync
local new_items=$(curl -s "https://archive.org/advancedsearch.php" \
-d "q=collection:$collection AND publicdate:[$last_sync TO null]" \
-d "fl[]=identifier" \
-d "rows=1000" \
-d "output=json" \
| jq -r '.response.docs[] | .identifier')
# Download new items
local count=0
while IFS= read -r identifier; do
count=$((count + 1))
echo "Downloading new item [$count]: $identifier"
download_item "$identifier" "$output_base/$identifier" "$format"
sleep 2
done <<< "$new_items"
# Update sync state
echo "{\"last_sync\": \"$(date -Iseconds)\", \"items_added\": $count}" > "$sync_state_file"
echo "Sync complete: $count new items added"
}
# Download items matching complex filter criteria
download_filtered_batch() {
local query="$1"
local output_base="$2"
local format="$3"
local min_rating="${4:-4.0}"
local min_year="${5:-1960}"
# Search with filters
local items=$(curl -s "https://archive.org/advancedsearch.php" \
-d "q=$query AND avg_rating:[$min_rating TO 5.0] AND year:[$min_year TO 2026]" \
-d "fl[]=identifier,avg_rating,year" \
-d "rows=500" \
-d "output=json")
# Download each item
echo "$items" | jq -r '.response.docs[] | .identifier' | while read -r identifier; do
local rating=$(echo "$items" | jq -r ".response.docs[] | select(.identifier == \"$identifier\") | .avg_rating")
local year=$(echo "$items" | jq -r ".response.docs[] | select(.identifier == \"$identifier\") | .year")
echo "Downloading: $identifier (rating: $rating, year: $year)"
download_item "$identifier" "$output_base/$identifier" "$format"
sleep 2
done
}
# Generate source plan for /acquire command
generate_archive_source_plan() {
local collection="$1"
local output_file="$2"
local format="${3:-Flac}"
# Get collection items
local items=$(get_collection_items "$collection" 100)
# Build YAML plan
cat > "$output_file" <<EOF
source_plan:
id: archive-$(date +%Y%m%d-%H%M%S)
source: archive.org
collection: $collection
format: $format
created_at: $(date -Iseconds)
sources:
EOF
while IFS= read -r identifier; do
# Get item metadata
local metadata=$(curl -s "https://archive.org/metadata/$identifier")
local title=$(echo "$metadata" | jq -r '.metadata.title')
local creator=$(echo "$metadata" | jq -r '.metadata.creator')
local date=$(echo "$metadata" | jq -r '.metadata.date')
cat >> "$output_file" <<EOF
- identifier: $identifier
title: "$title"
creator: "$creator"
date: "$date"
url: "https://archive.org/details/$identifier"
format: $format
EOF
done <<< "$items"
echo "Source plan generated: $output_file"
}
# Usage with /acquire
# generate_archive_source_plan "GratefulDead" ".curator/sources/gd-plan.yaml" "Flac"
# /acquire --plan .curator/sources/gd-plan.yaml
| Issue | Cause | Solution |
|-------|-------|----------|
| Download fails immediately | Invalid identifier | Verify identifier on archive.org website |
| Slow download speeds | Network throttling | Use --limit-rate with wget |
| Incomplete files | Connection timeout | Use wget -c to resume |
| Missing files | Format not available | Check available formats with metadata API |
| Checksum mismatch | Corrupted download | Delete and re-download |
| Rate limiting | Too many requests | Increase wait time between downloads |
# Enable verbose output for troubleshooting
export CURATOR_DEBUG=1
download_item_debug() {
local identifier="$1"
local output_dir="$2"
if [[ "$CURATOR_DEBUG" == "1" ]]; then
set -x # Enable bash tracing
fi
# Get metadata with verbose curl
curl -v "https://archive.org/metadata/$identifier" 2>&1 | tee "$output_dir/.curator/metadata-debug.log"
# Download with verbose wget
wget --verbose \
--debug \
"https://archive.org/download/$identifier/" \
2>&1 | tee "$output_dir/.curator/download-debug.log"
set +x
}
# Optimal concurrent downloads based on connection speed
determine_optimal_parallelism() {
local speed_mbps="$1"
if [[ $speed_mbps -lt 10 ]]; then
echo 2 # Slow connection
elif [[ $speed_mbps -lt 50 ]]; then
echo 3 # Medium connection
elif [[ $speed_mbps -lt 100 ]]; then
echo 5 # Fast connection
else
echo 8 # Very fast connection
fi
}
# Test connection speed
test_connection_speed() {
# Download test file and measure speed
local test_file="https://archive.org/download/test_item/test.mp3"
local start_time=$(date +%s)
wget -q -O /tmp/speed-test.tmp "$test_file"
local end_time=$(date +%s)
local duration=$((end_time - start_time))
local size=$(stat -c%s /tmp/speed-test.tmp 2>/dev/null || stat -f%z /tmp/speed-test.tmp)
local speed_mbps=$(( (size * 8) / (duration * 1000000) ))
rm /tmp/speed-test.tmp
echo "$speed_mbps"
}
# Cache metadata to reduce API calls
cache_metadata() {
local identifier="$1"
local cache_dir=".curator/cache/metadata"
local cache_file="$cache_dir/$identifier.json"
mkdir -p "$cache_dir"
# Check cache age
if [[ -f "$cache_file" ]]; then
local cache_age=$(( $(date +%s) - $(stat -c%Y "$cache_file" 2>/dev/null || stat -f%m "$cache_file") ))
# Cache valid for 7 days
if [[ $cache_age -lt 604800 ]]; then
cat "$cache_file"
return 0
fi
fi
# Fetch and cache
curl -s "https://archive.org/metadata/$identifier" | tee "$cache_file"
}
Archive Acquisition skill provides:
/acquire commandUse this skill whenever working with Internet Archive content in the Media Curator framework.
data-ai
Report which research-corpus radar sidecars are overdue for refresh. Computes staleness (days since last refresh vs the cadence window) for every radar, sorted most-overdue-first. Runs via `aiwg corpus radar-status`.
data-ai
Aggregate research-corpus radar sidecars into a corpus or per-cluster freshness report — totals, overdue count, per-cluster / per-GRADE / per-trajectory breakdowns, an overdue table, and per-radar rationale snippets. Runs via `aiwg corpus radar-report`.
testing
Scaffold radar/freshness sidecars for research-corpus REFs. Pulls title/authors from the citation sidecar and GRADE from the analysis doc, defaults the refresh cadence from GRADE and the cluster from a corpus-local map, and stamps documentation/radar/REF-XXX-radar.md. Runs via `aiwg corpus radar-init`.
data-ai
Compute an entity's publication trajectory — per-year paper counts, topic drift, hot-streak detection (≥3 consecutive A-grade years), and career phase. Runs via `aiwg corpus profile-temporal`.