Archive Acquisition Skill

Specialized patterns and techniques for acquiring media content from Internet Archive (archive.org) and other archival sources. Focuses on bulk downloads, quality selection, format filtering, and collection management.

Internet Archive Overview

Internet Archive hosts massive collections of audio, video, text, and software. Key characteristics:

Open access: Most content freely downloadable
Multiple formats: Same item often available in FLAC, MP3, OGG, etc.
Collections: Items grouped by uploader, topic, or curator
Metadata: Rich JSON metadata for every item
API access: Full programmatic access via REST API

Discovery Patterns

Search API

Find items matching specific criteria:

# Search for items by keyword and mediatype
search_archive() {
  local query="$1"
  local mediatype="$2"  # audio, movies, texts, etc.
  local rows="${3:-100}"  # Results per page

  curl -s "https://archive.org/advancedsearch.php" \
    -d "q=$query AND mediatype:$mediatype" \
    -d "fl[]=identifier,title,creator,date,format" \
    -d "rows=$rows" \
    -d "output=json" \
    | jq '.response.docs[]'
}

# Example: Search for Grateful Dead concerts
search_archive "grateful dead" "audio" 50

Response Format

{
  "identifier": "gd1977-05-08.sbd.miller.97065.flac16",
  "title": "Grateful Dead Live at Barton Hall on 1977-05-08",
  "creator": "Grateful Dead",
  "date": "1977-05-08",
  "format": ["Flac", "VBR MP3", "Ogg Vorbis", "Metadata"]
}

Collection Browsing

List all items in a collection:

# Get items from a specific collection
get_collection_items() {
  local collection="$1"
  local rows="${2:-1000}"

  curl -s "https://archive.org/advancedsearch.php" \
    -d "q=collection:$collection" \
    -d "fl[]=identifier,title,format" \
    -d "rows=$rows" \
    -d "output=json" \
    | jq -r '.response.docs[] | .identifier'
}

# Example: Get all items from GratefulDead collection
get_collection_items "GratefulDead" 500

Advanced Query Patterns

# Find items by date range
search_by_date() {
  local creator="$1"
  local start_date="$2"  # YYYY-MM-DD
  local end_date="$3"

  curl -s "https://archive.org/advancedsearch.php" \
    -d "q=creator:\"$creator\" AND date:[$start_date TO $end_date]" \
    -d "fl[]=identifier,title,date" \
    -d "rows=100" \
    -d "output=json" \
    | jq '.response.docs[]'
}

# Find items with specific format available
search_by_format() {
  local query="$1"
  local format="$2"  # FLAC, MP3, etc.

  curl -s "https://archive.org/advancedsearch.php" \
    -d "q=$query AND format:$format" \
    -d "fl[]=identifier,title,format" \
    -d "rows=100" \
    -d "output=json" \
    | jq '.response.docs[]'
}

# Example: Find all FLAC recordings of specific artist
search_by_format "pink floyd" "Flac"

Item Metadata Retrieval

Get Item Details

# Retrieve full metadata for an item
get_item_metadata() {
  local identifier="$1"

  curl -s "https://archive.org/metadata/$identifier" \
    | jq '.'
}

# Extract specific metadata fields
get_item_files() {
  local identifier="$1"

  curl -s "https://archive.org/metadata/$identifier" \
    | jq -r '.files[] | select(.format != "Metadata") | "\(.name)\t\(.format)\t\(.size)"'
}

# Example output:
# gd77-05-08d1t01.flac    Flac    45123456
# gd77-05-08d1t01.mp3     VBR MP3 12345678

Filter by Format

# Get only FLAC files from an item
get_flac_files() {
  local identifier="$1"

  curl -s "https://archive.org/metadata/$identifier" \
    | jq -r '.files[] | select(.format == "Flac") | .name'
}

# Get best quality audio files (prefer FLAC, fallback to 320K MP3)
get_best_audio_files() {
  local identifier="$1"

  # Try FLAC first
  local flac_files=$(curl -s "https://archive.org/metadata/$identifier" \
    | jq -r '.files[] | select(.format == "Flac") | .name')

  if [[ -n "$flac_files" ]]; then
    echo "$flac_files"
    return 0
  fi

  # Fallback to highest bitrate MP3
  curl -s "https://archive.org/metadata/$identifier" \
    | jq -r '.files[] | select(.format == "VBR MP3" or .format == "320Kbps MP3") | .name'
}

Download Patterns

Single Item Download

# Download all files from a single item
download_item() {
  local identifier="$1"
  local output_dir="$2"
  local format_filter="$3"  # Optional: FLAC, MP3, etc.

  mkdir -p "$output_dir"

  # Get file list
  local files
  if [[ -n "$format_filter" ]]; then
    files=$(curl -s "https://archive.org/metadata/$identifier" \
      | jq -r ".files[] | select(.format == \"$format_filter\") | .name")
  else
    files=$(curl -s "https://archive.org/metadata/$identifier" \
      | jq -r '.files[] | select(.format != "Metadata") | .name')
  fi

  # Download each file
  while IFS= read -r file; do
    echo "Downloading: $file"
    wget -q --show-progress \
      "https://archive.org/download/$identifier/$file" \
      -P "$output_dir"
  done <<< "$files"
}

# Example: Download all FLAC files from an item
download_item "gd1977-05-08.sbd.miller.97065.flac16" "/mnt/archive/grateful-dead/1977-05-08" "Flac"

Recursive Collection Download

# Download entire collection with format filtering
download_collection() {
  local collection="$1"
  local output_base="$2"
  local format="$3"  # FLAC, MP3, etc.
  local max_items="${4:-100}"

  # Get collection items
  local items=$(get_collection_items "$collection" "$max_items")

  local count=0
  while IFS= read -r identifier; do
    count=$((count + 1))
    echo "[$count/$max_items] Downloading: $identifier"

    # Create item directory
    local item_dir="$output_base/$identifier"
    mkdir -p "$item_dir"

    # Download with format filter
    download_item "$identifier" "$item_dir" "$format"

    # Download metadata
    curl -s "https://archive.org/metadata/$identifier" \
      > "$item_dir/.curator/metadata.json"

    # Rate limiting (be nice to archive.org)
    sleep 2
  done <<< "$items"
}

# Example: Download first 50 FLAC items from collection
download_collection "GratefulDead" "/mnt/archive/grateful-dead" "Flac" 50

Parallel Collection Download

# Download collection items in parallel (with concurrency limit)
download_collection_parallel() {
  local collection="$1"
  local output_base="$2"
  local format="$3"
  local max_concurrent="${4:-3}"

  # Get collection items
  local items=$(get_collection_items "$collection" 1000)

  # Create job queue
  local queue_file="/tmp/curator-archive-queue-$$.txt"
  echo "$items" > "$queue_file"

  # Process queue with concurrency limit
  cat "$queue_file" | xargs -P "$max_concurrent" -I {} bash -c "
    identifier={}
    item_dir=\"$output_base/\$identifier\"
    mkdir -p \"\$item_dir\"

    echo \"Downloading: \$identifier\"

    # Download best quality audio
    files=\$(curl -s \"https://archive.org/metadata/\$identifier\" \
      | jq -r '.files[] | select(.format == \"$format\") | .name')

    while IFS= read -r file; do
      wget -q --show-progress \
        \"https://archive.org/download/\$identifier/\$file\" \
        -P \"\$item_dir\"
    done <<< \"\$files\"

    # Save metadata
    curl -s \"https://archive.org/metadata/\$identifier\" \
      > \"\$item_dir/.curator/metadata.json\"

    sleep 2  # Rate limiting
  "

  rm "$queue_file"
}

wget Recursive Download

# Use wget's recursive mode for efficient bulk download
download_with_wget_recursive() {
  local identifier="$1"
  local output_dir="$2"
  local format_pattern="$3"  # e.g., "*.flac" or "*.mp3"

  mkdir -p "$output_dir"

  wget --recursive \
       --no-parent \
       --no-directories \
       --accept "$format_pattern" \
       --directory-prefix="$output_dir" \
       --wait=2 \
       --random-wait \
       "https://archive.org/download/$identifier/"
}

# Example: Download all FLAC files from item
download_with_wget_recursive "gd1977-05-08.sbd.miller.97065.flac16" \
  "/mnt/archive/gd-1977-05-08" \
  "*.flac"

Quality Selection Strategy

Format Preference Hierarchy

For audio archival, prefer formats in this order:

FLAC - Lossless, widely supported
320Kbps MP3 - High quality lossy, universal compatibility
VBR MP3 - Variable bitrate, good quality/size ratio
Ogg Vorbis - Good quality, open format
128Kbps MP3 - Acceptable for spoken word, not music

Automatic Quality Selection

# Select best available format for an item
select_best_format() {
  local identifier="$1"

  # Get available formats
  local formats=$(curl -s "https://archive.org/metadata/$identifier" \
    | jq -r '.files[] | .format' | sort -u)

  # Check in preference order
  if echo "$formats" | grep -q "Flac"; then
    echo "Flac"
  elif echo "$formats" | grep -q "320Kbps MP3"; then
    echo "320Kbps MP3"
  elif echo "$formats" | grep -q "VBR MP3"; then
    echo "VBR MP3"
  elif echo "$formats" | grep -q "Ogg Vorbis"; then
    echo "Ogg Vorbis"
  else
    echo "$(echo "$formats" | grep -i mp3 | head -1)"
  fi
}

# Download using best available format
download_best_quality() {
  local identifier="$1"
  local output_dir="$2"

  local best_format=$(select_best_format "$identifier")

  echo "Selected format: $best_format"
  download_item "$identifier" "$output_dir" "$best_format"
}

Size vs Quality Trade-offs

# Estimate download size before acquiring
estimate_download_size() {
  local identifier="$1"
  local format="$2"

  local total_size=$(curl -s "https://archive.org/metadata/$identifier" \
    | jq "[.files[] | select(.format == \"$format\") | .size | tonumber] | add")

  # Convert to human-readable
  local size_mb=$((total_size / 1048576))
  local size_gb=$((size_mb / 1024))

  if [[ $size_gb -gt 0 ]]; then
    echo "${size_gb}GB"
  else
    echo "${size_mb}MB"
  fi
}

# Check if size is acceptable before downloading
download_if_acceptable() {
  local identifier="$1"
  local output_dir="$2"
  local format="$3"
  local max_size_gb="${4:-5}"  # Default 5GB limit

  local estimated_size=$(estimate_download_size "$identifier" "$format")
  local size_value=$(echo "$estimated_size" | sed 's/[^0-9]//g')
  local size_unit=$(echo "$estimated_size" | sed 's/[0-9]//g')

  if [[ "$size_unit" == "GB" && $size_value -gt $max_size_gb ]]; then
    echo "SKIPPED: $identifier ($estimated_size exceeds ${max_size_gb}GB limit)"
    return 1
  fi

  echo "Downloading: $identifier ($estimated_size)"
  download_item "$identifier" "$output_dir" "$format"
}

Metadata Extraction and Storage

Extract Archive Metadata

# Convert Archive.org metadata to curator format
extract_archive_metadata() {
  local identifier="$1"
  local output_file="$2"

  curl -s "https://archive.org/metadata/$identifier" | jq '{
    source: "archive.org",
    identifier: .metadata.identifier,
    title: .metadata.title,
    creator: .metadata.creator,
    date: .metadata.date,
    venue: .metadata.venue,
    coverage: .metadata.coverage,
    description: .metadata.description,
    collection: .metadata.collection,
    downloads: .item_size.downloads,
    avg_rating: .reviews.info.avg_rating,
    num_reviews: .reviews.info.num_reviews,
    files: [
      .files[] | {
        name: .name,
        format: .format,
        size: .size,
        md5: .md5,
        sha1: .sha1
      }
    ]
  }' > "$output_file"
}

# Example usage
extract_archive_metadata "gd1977-05-08.sbd.miller.97065.flac16" \
  ".curator/metadata/gd-1977-05-08.json"

Merge with Curator Metadata

# Merge Archive.org metadata with existing curator metadata
merge_metadata() {
  local curator_metadata="$1"
  local archive_metadata="$2"
  local output_file="$3"

  jq -s '.[0] * .[1]' "$curator_metadata" "$archive_metadata" > "$output_file"
}

Verification and Quality Assurance

Checksum Verification

Archive.org provides MD5 and SHA1 checksums for all files:

# Verify downloaded files against Archive.org checksums
verify_archive_downloads() {
  local identifier="$1"
  local download_dir="$2"

  # Get checksums from metadata
  local checksums=$(curl -s "https://archive.org/metadata/$identifier" \
    | jq -r '.files[] | "\(.md5)  \(.name)"')

  # Verify each file
  local failed=0
  while IFS= read -r line; do
    local expected_md5="${line%%  *}"
    local filename="${line#*  }"
    local filepath="$download_dir/$filename"

    if [[ ! -f "$filepath" ]]; then
      echo "MISSING: $filename"
      failed=$((failed + 1))
      continue
    fi

    local actual_md5=$(md5sum "$filepath" | awk '{print $1}')

    if [[ "$actual_md5" != "$expected_md5" ]]; then
      echo "CHECKSUM FAILED: $filename"
      echo "  Expected: $expected_md5"
      echo "  Actual:   $actual_md5"
      failed=$((failed + 1))
    else
      echo "VERIFIED: $filename"
    fi
  done <<< "$checksums"

  if [[ $failed -gt 0 ]]; then
    echo "Verification failed for $failed files"
    return 1
  fi

  echo "All files verified successfully"
  return 0
}

File Integrity Check

# Verify audio file integrity after download
verify_audio_integrity() {
  local file="$1"

  # Check file is not zero-byte
  if [[ ! -s "$file" ]]; then
    echo "FAILED: Zero-byte file"
    return 1
  fi

  # Verify with ffmpeg
  if command -v ffmpeg >/dev/null; then
    if ffmpeg -v error -i "$file" -f null - 2>&1 | grep -q "error"; then
      echo "FAILED: Corrupted audio file"
      return 1
    fi
  fi

  # FLAC specific: verify with flac tool
  if [[ "$file" == *.flac ]] && command -v flac >/dev/null; then
    if ! flac -t "$file" 2>&1 | grep -q "ok"; then
      echo "FAILED: FLAC verification failed"
      return 1
    fi
  fi

  echo "VERIFIED: $file"
  return 0
}

# Verify all files in directory
verify_directory_integrity() {
  local dir="$1"

  local total=0
  local verified=0
  local failed=0

  find "$dir" -type f \( -name "*.flac" -o -name "*.mp3" -o -name "*.ogg" \) | while read -r file; do
    total=$((total + 1))

    if verify_audio_integrity "$file"; then
      verified=$((verified + 1))
    else
      failed=$((failed + 1))
    fi
  done

  echo "Integrity check complete: $verified/$total verified, $failed failed"
}

Rate Limiting and Politeness

Archive.org requests responsible use of their bandwidth:

# Implement polite download behavior
polite_download() {
  local identifier="$1"
  local output_dir="$2"
  local format="$3"

  # Add delays between requests
  local wait_time=2  # seconds

  # Get file list
  local files=$(curl -s "https://archive.org/metadata/$identifier" \
    | jq -r ".files[] | select(.format == \"$format\") | .name")

  while IFS= read -r file; do
    echo "Downloading: $file"

    wget --quiet \
         --show-progress \
         --wait="$wait_time" \
         --random-wait \
         --user-agent="AIWG-MediaCurator/1.0 (research/personal use)" \
         "https://archive.org/download/$identifier/$file" \
         -P "$output_dir"

  done <<< "$files"
}

Bulk Download Best Practices

Rate limit: Wait 2-5 seconds between requests
Random wait: Add jitter to avoid patterns
User agent: Identify your tool
Off-peak hours: Download during US night hours
Resume support: Use wget -c to resume interrupted downloads
Error handling: Retry failed downloads after delay

Collection Management

Build Local Collection Index

# Create searchable index of downloaded items
build_collection_index() {
  local collection_dir="$1"
  local index_file="$2"

  echo '{"items": []}' > "$index_file"

  find "$collection_dir" -name "metadata.json" -type f | while read -r metadata_file; do
    local item_dir=$(dirname "$metadata_file")

    # Extract key metadata fields
    local item_data=$(jq '{
      identifier: .metadata.identifier,
      title: .metadata.title,
      creator: .metadata.creator,
      date: .metadata.date,
      local_path: "'"$item_dir"'",
      file_count: (.files | length),
      total_size: ([.files[] | .size | tonumber] | add)
    }' "$metadata_file")

    # Append to index
    jq ".items += [$item_data]" "$index_file" > "$index_file.tmp"
    mv "$index_file.tmp" "$index_file"
  done

  echo "Index built: $index_file"
}

# Search local collection
search_local_collection() {
  local index_file="$1"
  local query="$2"

  jq -r ".items[] | select(.title | test(\"$query\"; \"i\")) | \"\(.date) - \(.creator) - \(.title) (\(.local_path))\"" \
    "$index_file"
}

Deduplicate Collection

# Find and remove duplicate items in collection
deduplicate_collection() {
  local collection_dir="$1"

  # Build checksum database
  local checksum_file="/tmp/curator-checksums-$$.txt"

  find "$collection_dir" -type f \( -name "*.flac" -o -name "*.mp3" \) | while read -r file; do
    local checksum=$(md5sum "$file" | awk '{print $1}')
    echo "$checksum  $file" >> "$checksum_file"
  done

  # Find duplicates
  sort "$checksum_file" | uniq -d -w 32 | while read -r line; do
    local checksum="${line%%  *}"

    echo "Duplicate found (checksum: $checksum):"

    grep "^$checksum" "$checksum_file" | while read -r dup_line; do
      local dup_file="${dup_line#*  }"
      echo "  - $dup_file"
    done

    # Offer to delete all but first
    local files=($(grep "^$checksum" "$checksum_file" | cut -d' ' -f3-))
    local keep="${files[0]}"

    echo "Keep: $keep"

    for ((i=1; i<${#files[@]}; i++)); do
      echo "Delete: ${files[$i]}"
      rm "${files[$i]}"
    done
  done

  rm "$checksum_file"
}

Advanced Patterns

Incremental Collection Sync

# Download only new items added to collection since last sync
sync_collection_incremental() {
  local collection="$1"
  local output_base="$2"
  local format="$3"
  local sync_state_file="$output_base/.curator/sync-state.json"

  # Get last sync timestamp
  local last_sync
  if [[ -f "$sync_state_file" ]]; then
    last_sync=$(jq -r '.last_sync' "$sync_state_file")
  else
    last_sync="1970-01-01"
  fi

  # Search for items added/updated since last sync
  local new_items=$(curl -s "https://archive.org/advancedsearch.php" \
    -d "q=collection:$collection AND publicdate:[$last_sync TO null]" \
    -d "fl[]=identifier" \
    -d "rows=1000" \
    -d "output=json" \
    | jq -r '.response.docs[] | .identifier')

  # Download new items
  local count=0
  while IFS= read -r identifier; do
    count=$((count + 1))
    echo "Downloading new item [$count]: $identifier"

    download_item "$identifier" "$output_base/$identifier" "$format"
    sleep 2
  done <<< "$new_items"

  # Update sync state
  echo "{\"last_sync\": \"$(date -Iseconds)\", \"items_added\": $count}" > "$sync_state_file"

  echo "Sync complete: $count new items added"
}

Filtered Batch Download

# Download items matching complex filter criteria
download_filtered_batch() {
  local query="$1"
  local output_base="$2"
  local format="$3"
  local min_rating="${4:-4.0}"
  local min_year="${5:-1960}"

  # Search with filters
  local items=$(curl -s "https://archive.org/advancedsearch.php" \
    -d "q=$query AND avg_rating:[$min_rating TO 5.0] AND year:[$min_year TO 2026]" \
    -d "fl[]=identifier,avg_rating,year" \
    -d "rows=500" \
    -d "output=json")

  # Download each item
  echo "$items" | jq -r '.response.docs[] | .identifier' | while read -r identifier; do
    local rating=$(echo "$items" | jq -r ".response.docs[] | select(.identifier == \"$identifier\") | .avg_rating")
    local year=$(echo "$items" | jq -r ".response.docs[] | select(.identifier == \"$identifier\") | .year")

    echo "Downloading: $identifier (rating: $rating, year: $year)"

    download_item "$identifier" "$output_base/$identifier" "$format"
    sleep 2
  done
}

Integration with Media Curator Framework

Source Plan Generation

# Generate source plan for /acquire command
generate_archive_source_plan() {
  local collection="$1"
  local output_file="$2"
  local format="${3:-Flac}"

  # Get collection items
  local items=$(get_collection_items "$collection" 100)

  # Build YAML plan
  cat > "$output_file" <<EOF
source_plan:
  id: archive-$(date +%Y%m%d-%H%M%S)
  source: archive.org
  collection: $collection
  format: $format
  created_at: $(date -Iseconds)

sources:
EOF

  while IFS= read -r identifier; do
    # Get item metadata
    local metadata=$(curl -s "https://archive.org/metadata/$identifier")
    local title=$(echo "$metadata" | jq -r '.metadata.title')
    local creator=$(echo "$metadata" | jq -r '.metadata.creator')
    local date=$(echo "$metadata" | jq -r '.metadata.date')

    cat >> "$output_file" <<EOF
  - identifier: $identifier
    title: "$title"
    creator: "$creator"
    date: "$date"
    url: "https://archive.org/details/$identifier"
    format: $format
EOF

  done <<< "$items"

  echo "Source plan generated: $output_file"
}

# Usage with /acquire
# generate_archive_source_plan "GratefulDead" ".curator/sources/gd-plan.yaml" "Flac"
# /acquire --plan .curator/sources/gd-plan.yaml

Troubleshooting

Common Issues and Solutions

| Issue | Cause | Solution | |-------|-------|----------| | Download fails immediately | Invalid identifier | Verify identifier on archive.org website | | Slow download speeds | Network throttling | Use --limit-rate with wget | | Incomplete files | Connection timeout | Use wget -c to resume | | Missing files | Format not available | Check available formats with metadata API | | Checksum mismatch | Corrupted download | Delete and re-download | | Rate limiting | Too many requests | Increase wait time between downloads |

Debug Mode

# Enable verbose output for troubleshooting
export CURATOR_DEBUG=1

download_item_debug() {
  local identifier="$1"
  local output_dir="$2"

  if [[ "$CURATOR_DEBUG" == "1" ]]; then
    set -x  # Enable bash tracing
  fi

  # Get metadata with verbose curl
  curl -v "https://archive.org/metadata/$identifier" 2>&1 | tee "$output_dir/.curator/metadata-debug.log"

  # Download with verbose wget
  wget --verbose \
       --debug \
       "https://archive.org/download/$identifier/" \
       2>&1 | tee "$output_dir/.curator/download-debug.log"

  set +x
}

Performance Optimization

Parallel Download Tuning

# Optimal concurrent downloads based on connection speed
determine_optimal_parallelism() {
  local speed_mbps="$1"

  if [[ $speed_mbps -lt 10 ]]; then
    echo 2  # Slow connection
  elif [[ $speed_mbps -lt 50 ]]; then
    echo 3  # Medium connection
  elif [[ $speed_mbps -lt 100 ]]; then
    echo 5  # Fast connection
  else
    echo 8  # Very fast connection
  fi
}

# Test connection speed
test_connection_speed() {
  # Download test file and measure speed
  local test_file="https://archive.org/download/test_item/test.mp3"
  local start_time=$(date +%s)

  wget -q -O /tmp/speed-test.tmp "$test_file"

  local end_time=$(date +%s)
  local duration=$((end_time - start_time))
  local size=$(stat -c%s /tmp/speed-test.tmp 2>/dev/null || stat -f%z /tmp/speed-test.tmp)
  local speed_mbps=$(( (size * 8) / (duration * 1000000) ))

  rm /tmp/speed-test.tmp

  echo "$speed_mbps"
}

Caching Strategy

# Cache metadata to reduce API calls
cache_metadata() {
  local identifier="$1"
  local cache_dir=".curator/cache/metadata"
  local cache_file="$cache_dir/$identifier.json"

  mkdir -p "$cache_dir"

  # Check cache age
  if [[ -f "$cache_file" ]]; then
    local cache_age=$(( $(date +%s) - $(stat -c%Y "$cache_file" 2>/dev/null || stat -f%m "$cache_file") ))

    # Cache valid for 7 days
    if [[ $cache_age -lt 604800 ]]; then
      cat "$cache_file"
      return 0
    fi
  fi

  # Fetch and cache
  curl -s "https://archive.org/metadata/$identifier" | tee "$cache_file"
}

Summary

Archive Acquisition skill provides:

Discovery: Search API patterns for finding content
Quality Selection: Automatic best-format selection (prefer FLAC, accept MP3 320K)
Bulk Download: Recursive wget and parallel download patterns
Verification: Checksum validation and integrity checks
Collection Management: Indexing, deduplication, incremental sync
Integration: Source plan generation for /acquire command
Performance: Parallel tuning, caching, rate limiting

Use this skill whenever working with Internet Archive content in the Media Curator framework.

References

@$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/human-authorization.md — Seek explicit authorization before bulk downloads and irreversible archive operations
@$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/research-before-decision.md — Evaluate archive item quality and metadata before committing to download
@$AIWG_ROOT/agentic/code/frameworks/media-curator/skills/acquire/SKILL.md — General acquisition skill that delegates to archive-acquisition for archive.org sources
@$AIWG_ROOT/agentic/code/frameworks/media-curator/skills/integrity-verification/SKILL.md — Verify checksums after archive downloads
@$AIWG_ROOT/agentic/code/frameworks/media-curator/skills/quality-filtering/SKILL.md — Quality scoring used to select best format from archive items

Archive Acquisition Skill

Internet Archive Overview

Internet Archive hosts massive collections of audio, video, text, and software. Key characteristics:

Open access: Most content freely downloadable
Multiple formats: Same item often available in FLAC, MP3, OGG, etc.
Collections: Items grouped by uploader, topic, or curator
Metadata: Rich JSON metadata for every item
API access: Full programmatic access via REST API

Discovery Patterns

Search API

Find items matching specific criteria:

# Search for items by keyword and mediatype
search_archive() {
  local query="$1"
  local mediatype="$2"  # audio, movies, texts, etc.
  local rows="${3:-100}"  # Results per page

  curl -s "https://archive.org/advancedsearch.php" \
    -d "q=$query AND mediatype:$mediatype" \
    -d "fl[]=identifier,title,creator,date,format" \
    -d "rows=$rows" \
    -d "output=json" \
    | jq '.response.docs[]'
}

# Example: Search for Grateful Dead concerts
search_archive "grateful dead" "audio" 50

Response Format

{
  "identifier": "gd1977-05-08.sbd.miller.97065.flac16",
  "title": "Grateful Dead Live at Barton Hall on 1977-05-08",
  "creator": "Grateful Dead",
  "date": "1977-05-08",
  "format": ["Flac", "VBR MP3", "Ogg Vorbis", "Metadata"]
}

Collection Browsing

List all items in a collection:

# Get items from a specific collection
get_collection_items() {
  local collection="$1"
  local rows="${2:-1000}"

  curl -s "https://archive.org/advancedsearch.php" \
    -d "q=collection:$collection" \
    -d "fl[]=identifier,title,format" \
    -d "rows=$rows" \
    -d "output=json" \
    | jq -r '.response.docs[] | .identifier'
}

# Example: Get all items from GratefulDead collection
get_collection_items "GratefulDead" 500

Advanced Query Patterns

# Find items by date range
search_by_date() {
  local creator="$1"
  local start_date="$2"  # YYYY-MM-DD
  local end_date="$3"

  curl -s "https://archive.org/advancedsearch.php" \
    -d "q=creator:\"$creator\" AND date:[$start_date TO $end_date]" \
    -d "fl[]=identifier,title,date" \
    -d "rows=100" \
    -d "output=json" \
    | jq '.response.docs[]'
}

# Find items with specific format available
search_by_format() {
  local query="$1"
  local format="$2"  # FLAC, MP3, etc.

  curl -s "https://archive.org/advancedsearch.php" \
    -d "q=$query AND format:$format" \
    -d "fl[]=identifier,title,format" \
    -d "rows=100" \
    -d "output=json" \
    | jq '.response.docs[]'
}

# Example: Find all FLAC recordings of specific artist
search_by_format "pink floyd" "Flac"

Item Metadata Retrieval

Get Item Details

# Retrieve full metadata for an item
get_item_metadata() {
  local identifier="$1"

  curl -s "https://archive.org/metadata/$identifier" \
    | jq '.'
}

# Extract specific metadata fields
get_item_files() {
  local identifier="$1"

  curl -s "https://archive.org/metadata/$identifier" \
    | jq -r '.files[] | select(.format != "Metadata") | "\(.name)\t\(.format)\t\(.size)"'
}

# Example output:
# gd77-05-08d1t01.flac    Flac    45123456
# gd77-05-08d1t01.mp3     VBR MP3 12345678

Filter by Format

# Get only FLAC files from an item
get_flac_files() {
  local identifier="$1"

  curl -s "https://archive.org/metadata/$identifier" \
    | jq -r '.files[] | select(.format == "Flac") | .name'
}

# Get best quality audio files (prefer FLAC, fallback to 320K MP3)
get_best_audio_files() {
  local identifier="$1"

  # Try FLAC first
  local flac_files=$(curl -s "https://archive.org/metadata/$identifier" \
    | jq -r '.files[] | select(.format == "Flac") | .name')

  if [[ -n "$flac_files" ]]; then
    echo "$flac_files"
    return 0
  fi

  # Fallback to highest bitrate MP3
  curl -s "https://archive.org/metadata/$identifier" \
    | jq -r '.files[] | select(.format == "VBR MP3" or .format == "320Kbps MP3") | .name'
}

Download Patterns

Single Item Download

# Download all files from a single item
download_item() {
  local identifier="$1"
  local output_dir="$2"
  local format_filter="$3"  # Optional: FLAC, MP3, etc.

  mkdir -p "$output_dir"

  # Get file list
  local files
  if [[ -n "$format_filter" ]]; then
    files=$(curl -s "https://archive.org/metadata/$identifier" \
      | jq -r ".files[] | select(.format == \"$format_filter\") | .name")
  else
    files=$(curl -s "https://archive.org/metadata/$identifier" \
      | jq -r '.files[] | select(.format != "Metadata") | .name')
  fi

  # Download each file
  while IFS= read -r file; do
    echo "Downloading: $file"
    wget -q --show-progress \
      "https://archive.org/download/$identifier/$file" \
      -P "$output_dir"
  done <<< "$files"
}

# Example: Download all FLAC files from an item
download_item "gd1977-05-08.sbd.miller.97065.flac16" "/mnt/archive/grateful-dead/1977-05-08" "Flac"

Recursive Collection Download

# Download entire collection with format filtering
download_collection() {
  local collection="$1"
  local output_base="$2"
  local format="$3"  # FLAC, MP3, etc.
  local max_items="${4:-100}"

  # Get collection items
  local items=$(get_collection_items "$collection" "$max_items")

  local count=0
  while IFS= read -r identifier; do
    count=$((count + 1))
    echo "[$count/$max_items] Downloading: $identifier"

    # Create item directory
    local item_dir="$output_base/$identifier"
    mkdir -p "$item_dir"

    # Download with format filter
    download_item "$identifier" "$item_dir" "$format"

    # Download metadata
    curl -s "https://archive.org/metadata/$identifier" \
      > "$item_dir/.curator/metadata.json"

    # Rate limiting (be nice to archive.org)
    sleep 2
  done <<< "$items"
}

# Example: Download first 50 FLAC items from collection
download_collection "GratefulDead" "/mnt/archive/grateful-dead" "Flac" 50

Parallel Collection Download

# Download collection items in parallel (with concurrency limit)
download_collection_parallel() {
  local collection="$1"
  local output_base="$2"
  local format="$3"
  local max_concurrent="${4:-3}"

  # Get collection items
  local items=$(get_collection_items "$collection" 1000)

  # Create job queue
  local queue_file="/tmp/curator-archive-queue-$$.txt"
  echo "$items" > "$queue_file"

  # Process queue with concurrency limit
  cat "$queue_file" | xargs -P "$max_concurrent" -I {} bash -c "
    identifier={}
    item_dir=\"$output_base/\$identifier\"
    mkdir -p \"\$item_dir\"

    echo \"Downloading: \$identifier\"

    # Download best quality audio
    files=\$(curl -s \"https://archive.org/metadata/\$identifier\" \
      | jq -r '.files[] | select(.format == \"$format\") | .name')

    while IFS= read -r file; do
      wget -q --show-progress \
        \"https://archive.org/download/\$identifier/\$file\" \
        -P \"\$item_dir\"
    done <<< \"\$files\"

    # Save metadata
    curl -s \"https://archive.org/metadata/\$identifier\" \
      > \"\$item_dir/.curator/metadata.json\"

    sleep 2  # Rate limiting
  "

  rm "$queue_file"
}

wget Recursive Download

# Use wget's recursive mode for efficient bulk download
download_with_wget_recursive() {
  local identifier="$1"
  local output_dir="$2"
  local format_pattern="$3"  # e.g., "*.flac" or "*.mp3"

  mkdir -p "$output_dir"

  wget --recursive \
       --no-parent \
       --no-directories \
       --accept "$format_pattern" \
       --directory-prefix="$output_dir" \
       --wait=2 \
       --random-wait \
       "https://archive.org/download/$identifier/"
}

# Example: Download all FLAC files from item
download_with_wget_recursive "gd1977-05-08.sbd.miller.97065.flac16" \
  "/mnt/archive/gd-1977-05-08" \
  "*.flac"

Quality Selection Strategy

Format Preference Hierarchy

For audio archival, prefer formats in this order:

FLAC - Lossless, widely supported
320Kbps MP3 - High quality lossy, universal compatibility
VBR MP3 - Variable bitrate, good quality/size ratio
Ogg Vorbis - Good quality, open format
128Kbps MP3 - Acceptable for spoken word, not music

Automatic Quality Selection

# Select best available format for an item
select_best_format() {
  local identifier="$1"

  # Get available formats
  local formats=$(curl -s "https://archive.org/metadata/$identifier" \
    | jq -r '.files[] | .format' | sort -u)

  # Check in preference order
  if echo "$formats" | grep -q "Flac"; then
    echo "Flac"
  elif echo "$formats" | grep -q "320Kbps MP3"; then
    echo "320Kbps MP3"
  elif echo "$formats" | grep -q "VBR MP3"; then
    echo "VBR MP3"
  elif echo "$formats" | grep -q "Ogg Vorbis"; then
    echo "Ogg Vorbis"
  else
    echo "$(echo "$formats" | grep -i mp3 | head -1)"
  fi
}

# Download using best available format
download_best_quality() {
  local identifier="$1"
  local output_dir="$2"

  local best_format=$(select_best_format "$identifier")

  echo "Selected format: $best_format"
  download_item "$identifier" "$output_dir" "$best_format"
}

Size vs Quality Trade-offs

# Estimate download size before acquiring
estimate_download_size() {
  local identifier="$1"
  local format="$2"

  local total_size=$(curl -s "https://archive.org/metadata/$identifier" \
    | jq "[.files[] | select(.format == \"$format\") | .size | tonumber] | add")

  # Convert to human-readable
  local size_mb=$((total_size / 1048576))
  local size_gb=$((size_mb / 1024))

  if [[ $size_gb -gt 0 ]]; then
    echo "${size_gb}GB"
  else
    echo "${size_mb}MB"
  fi
}

# Check if size is acceptable before downloading
download_if_acceptable() {
  local identifier="$1"
  local output_dir="$2"
  local format="$3"
  local max_size_gb="${4:-5}"  # Default 5GB limit

  local estimated_size=$(estimate_download_size "$identifier" "$format")
  local size_value=$(echo "$estimated_size" | sed 's/[^0-9]//g')
  local size_unit=$(echo "$estimated_size" | sed 's/[0-9]//g')

  if [[ "$size_unit" == "GB" && $size_value -gt $max_size_gb ]]; then
    echo "SKIPPED: $identifier ($estimated_size exceeds ${max_size_gb}GB limit)"
    return 1
  fi

  echo "Downloading: $identifier ($estimated_size)"
  download_item "$identifier" "$output_dir" "$format"
}

Metadata Extraction and Storage

Extract Archive Metadata

# Convert Archive.org metadata to curator format
extract_archive_metadata() {
  local identifier="$1"
  local output_file="$2"

  curl -s "https://archive.org/metadata/$identifier" | jq '{
    source: "archive.org",
    identifier: .metadata.identifier,
    title: .metadata.title,
    creator: .metadata.creator,
    date: .metadata.date,
    venue: .metadata.venue,
    coverage: .metadata.coverage,
    description: .metadata.description,
    collection: .metadata.collection,
    downloads: .item_size.downloads,
    avg_rating: .reviews.info.avg_rating,
    num_reviews: .reviews.info.num_reviews,
    files: [
      .files[] | {
        name: .name,
        format: .format,
        size: .size,
        md5: .md5,
        sha1: .sha1
      }
    ]
  }' > "$output_file"
}

# Example usage
extract_archive_metadata "gd1977-05-08.sbd.miller.97065.flac16" \
  ".curator/metadata/gd-1977-05-08.json"

Merge with Curator Metadata

# Merge Archive.org metadata with existing curator metadata
merge_metadata() {
  local curator_metadata="$1"
  local archive_metadata="$2"
  local output_file="$3"

  jq -s '.[0] * .[1]' "$curator_metadata" "$archive_metadata" > "$output_file"
}

Verification and Quality Assurance

Checksum Verification

Archive.org provides MD5 and SHA1 checksums for all files:

# Verify downloaded files against Archive.org checksums
verify_archive_downloads() {
  local identifier="$1"
  local download_dir="$2"

  # Get checksums from metadata
  local checksums=$(curl -s "https://archive.org/metadata/$identifier" \
    | jq -r '.files[] | "\(.md5)  \(.name)"')

  # Verify each file
  local failed=0
  while IFS= read -r line; do
    local expected_md5="${line%%  *}"
    local filename="${line#*  }"
    local filepath="$download_dir/$filename"

    if [[ ! -f "$filepath" ]]; then
      echo "MISSING: $filename"
      failed=$((failed + 1))
      continue
    fi

    local actual_md5=$(md5sum "$filepath" | awk '{print $1}')

    if [[ "$actual_md5" != "$expected_md5" ]]; then
      echo "CHECKSUM FAILED: $filename"
      echo "  Expected: $expected_md5"
      echo "  Actual:   $actual_md5"
      failed=$((failed + 1))
    else
      echo "VERIFIED: $filename"
    fi
  done <<< "$checksums"

  if [[ $failed -gt 0 ]]; then
    echo "Verification failed for $failed files"
    return 1
  fi

  echo "All files verified successfully"
  return 0
}

File Integrity Check

# Verify audio file integrity after download
verify_audio_integrity() {
  local file="$1"

  # Check file is not zero-byte
  if [[ ! -s "$file" ]]; then
    echo "FAILED: Zero-byte file"
    return 1
  fi

  # Verify with ffmpeg
  if command -v ffmpeg >/dev/null; then
    if ffmpeg -v error -i "$file" -f null - 2>&1 | grep -q "error"; then
      echo "FAILED: Corrupted audio file"
      return 1
    fi
  fi

  # FLAC specific: verify with flac tool
  if [[ "$file" == *.flac ]] && command -v flac >/dev/null; then
    if ! flac -t "$file" 2>&1 | grep -q "ok"; then
      echo "FAILED: FLAC verification failed"
      return 1
    fi
  fi

  echo "VERIFIED: $file"
  return 0
}

# Verify all files in directory
verify_directory_integrity() {
  local dir="$1"

  local total=0
  local verified=0
  local failed=0

  find "$dir" -type f \( -name "*.flac" -o -name "*.mp3" -o -name "*.ogg" \) | while read -r file; do
    total=$((total + 1))

    if verify_audio_integrity "$file"; then
      verified=$((verified + 1))
    else
      failed=$((failed + 1))
    fi
  done

  echo "Integrity check complete: $verified/$total verified, $failed failed"
}

Rate Limiting and Politeness

Archive.org requests responsible use of their bandwidth:

# Implement polite download behavior
polite_download() {
  local identifier="$1"
  local output_dir="$2"
  local format="$3"

  # Add delays between requests
  local wait_time=2  # seconds

  # Get file list
  local files=$(curl -s "https://archive.org/metadata/$identifier" \
    | jq -r ".files[] | select(.format == \"$format\") | .name")

  while IFS= read -r file; do
    echo "Downloading: $file"

    wget --quiet \
         --show-progress \
         --wait="$wait_time" \
         --random-wait \
         --user-agent="AIWG-MediaCurator/1.0 (research/personal use)" \
         "https://archive.org/download/$identifier/$file" \
         -P "$output_dir"

  done <<< "$files"
}

Bulk Download Best Practices

Rate limit: Wait 2-5 seconds between requests
Random wait: Add jitter to avoid patterns
User agent: Identify your tool
Off-peak hours: Download during US night hours
Resume support: Use wget -c to resume interrupted downloads
Error handling: Retry failed downloads after delay

Collection Management

Build Local Collection Index

# Create searchable index of downloaded items
build_collection_index() {
  local collection_dir="$1"
  local index_file="$2"

  echo '{"items": []}' > "$index_file"

  find "$collection_dir" -name "metadata.json" -type f | while read -r metadata_file; do
    local item_dir=$(dirname "$metadata_file")

    # Extract key metadata fields
    local item_data=$(jq '{
      identifier: .metadata.identifier,
      title: .metadata.title,
      creator: .metadata.creator,
      date: .metadata.date,
      local_path: "'"$item_dir"'",
      file_count: (.files | length),
      total_size: ([.files[] | .size | tonumber] | add)
    }' "$metadata_file")

    # Append to index
    jq ".items += [$item_data]" "$index_file" > "$index_file.tmp"
    mv "$index_file.tmp" "$index_file"
  done

  echo "Index built: $index_file"
}

# Search local collection
search_local_collection() {
  local index_file="$1"
  local query="$2"

  jq -r ".items[] | select(.title | test(\"$query\"; \"i\")) | \"\(.date) - \(.creator) - \(.title) (\(.local_path))\"" \
    "$index_file"
}

Deduplicate Collection

# Find and remove duplicate items in collection
deduplicate_collection() {
  local collection_dir="$1"

  # Build checksum database
  local checksum_file="/tmp/curator-checksums-$$.txt"

  find "$collection_dir" -type f \( -name "*.flac" -o -name "*.mp3" \) | while read -r file; do
    local checksum=$(md5sum "$file" | awk '{print $1}')
    echo "$checksum  $file" >> "$checksum_file"
  done

  # Find duplicates
  sort "$checksum_file" | uniq -d -w 32 | while read -r line; do
    local checksum="${line%%  *}"

    echo "Duplicate found (checksum: $checksum):"

    grep "^$checksum" "$checksum_file" | while read -r dup_line; do
      local dup_file="${dup_line#*  }"
      echo "  - $dup_file"
    done

    # Offer to delete all but first
    local files=($(grep "^$checksum" "$checksum_file" | cut -d' ' -f3-))
    local keep="${files[0]}"

    echo "Keep: $keep"

    for ((i=1; i<${#files[@]}; i++)); do
      echo "Delete: ${files[$i]}"
      rm "${files[$i]}"
    done
  done

  rm "$checksum_file"
}

Advanced Patterns

Incremental Collection Sync

# Download only new items added to collection since last sync
sync_collection_incremental() {
  local collection="$1"
  local output_base="$2"
  local format="$3"
  local sync_state_file="$output_base/.curator/sync-state.json"

  # Get last sync timestamp
  local last_sync
  if [[ -f "$sync_state_file" ]]; then
    last_sync=$(jq -r '.last_sync' "$sync_state_file")
  else
    last_sync="1970-01-01"
  fi

  # Search for items added/updated since last sync
  local new_items=$(curl -s "https://archive.org/advancedsearch.php" \
    -d "q=collection:$collection AND publicdate:[$last_sync TO null]" \
    -d "fl[]=identifier" \
    -d "rows=1000" \
    -d "output=json" \
    | jq -r '.response.docs[] | .identifier')

  # Download new items
  local count=0
  while IFS= read -r identifier; do
    count=$((count + 1))
    echo "Downloading new item [$count]: $identifier"

    download_item "$identifier" "$output_base/$identifier" "$format"
    sleep 2
  done <<< "$new_items"

  # Update sync state
  echo "{\"last_sync\": \"$(date -Iseconds)\", \"items_added\": $count}" > "$sync_state_file"

  echo "Sync complete: $count new items added"
}

Filtered Batch Download

# Download items matching complex filter criteria
download_filtered_batch() {
  local query="$1"
  local output_base="$2"
  local format="$3"
  local min_rating="${4:-4.0}"
  local min_year="${5:-1960}"

  # Search with filters
  local items=$(curl -s "https://archive.org/advancedsearch.php" \
    -d "q=$query AND avg_rating:[$min_rating TO 5.0] AND year:[$min_year TO 2026]" \
    -d "fl[]=identifier,avg_rating,year" \
    -d "rows=500" \
    -d "output=json")

  # Download each item
  echo "$items" | jq -r '.response.docs[] | .identifier' | while read -r identifier; do
    local rating=$(echo "$items" | jq -r ".response.docs[] | select(.identifier == \"$identifier\") | .avg_rating")
    local year=$(echo "$items" | jq -r ".response.docs[] | select(.identifier == \"$identifier\") | .year")

    echo "Downloading: $identifier (rating: $rating, year: $year)"

    download_item "$identifier" "$output_base/$identifier" "$format"
    sleep 2
  done
}

Integration with Media Curator Framework

Source Plan Generation

# Generate source plan for /acquire command
generate_archive_source_plan() {
  local collection="$1"
  local output_file="$2"
  local format="${3:-Flac}"

  # Get collection items
  local items=$(get_collection_items "$collection" 100)

  # Build YAML plan
  cat > "$output_file" <<EOF
source_plan:
  id: archive-$(date +%Y%m%d-%H%M%S)
  source: archive.org
  collection: $collection
  format: $format
  created_at: $(date -Iseconds)

sources:
EOF

  while IFS= read -r identifier; do
    # Get item metadata
    local metadata=$(curl -s "https://archive.org/metadata/$identifier")
    local title=$(echo "$metadata" | jq -r '.metadata.title')
    local creator=$(echo "$metadata" | jq -r '.metadata.creator')
    local date=$(echo "$metadata" | jq -r '.metadata.date')

    cat >> "$output_file" <<EOF
  - identifier: $identifier
    title: "$title"
    creator: "$creator"
    date: "$date"
    url: "https://archive.org/details/$identifier"
    format: $format
EOF

  done <<< "$items"

  echo "Source plan generated: $output_file"
}

# Usage with /acquire
# generate_archive_source_plan "GratefulDead" ".curator/sources/gd-plan.yaml" "Flac"
# /acquire --plan .curator/sources/gd-plan.yaml

Troubleshooting

Common Issues and Solutions

Debug Mode

# Enable verbose output for troubleshooting
export CURATOR_DEBUG=1

download_item_debug() {
  local identifier="$1"
  local output_dir="$2"

  if [[ "$CURATOR_DEBUG" == "1" ]]; then
    set -x  # Enable bash tracing
  fi

  # Get metadata with verbose curl
  curl -v "https://archive.org/metadata/$identifier" 2>&1 | tee "$output_dir/.curator/metadata-debug.log"

  # Download with verbose wget
  wget --verbose \
       --debug \
       "https://archive.org/download/$identifier/" \
       2>&1 | tee "$output_dir/.curator/download-debug.log"

  set +x
}

Performance Optimization

Parallel Download Tuning

# Optimal concurrent downloads based on connection speed
determine_optimal_parallelism() {
  local speed_mbps="$1"

  if [[ $speed_mbps -lt 10 ]]; then
    echo 2  # Slow connection
  elif [[ $speed_mbps -lt 50 ]]; then
    echo 3  # Medium connection
  elif [[ $speed_mbps -lt 100 ]]; then
    echo 5  # Fast connection
  else
    echo 8  # Very fast connection
  fi
}

# Test connection speed
test_connection_speed() {
  # Download test file and measure speed
  local test_file="https://archive.org/download/test_item/test.mp3"
  local start_time=$(date +%s)

  wget -q -O /tmp/speed-test.tmp "$test_file"

  local end_time=$(date +%s)
  local duration=$((end_time - start_time))
  local size=$(stat -c%s /tmp/speed-test.tmp 2>/dev/null || stat -f%z /tmp/speed-test.tmp)
  local speed_mbps=$(( (size * 8) / (duration * 1000000) ))

  rm /tmp/speed-test.tmp

  echo "$speed_mbps"
}

Caching Strategy

# Cache metadata to reduce API calls
cache_metadata() {
  local identifier="$1"
  local cache_dir=".curator/cache/metadata"
  local cache_file="$cache_dir/$identifier.json"

  mkdir -p "$cache_dir"

  # Check cache age
  if [[ -f "$cache_file" ]]; then
    local cache_age=$(( $(date +%s) - $(stat -c%Y "$cache_file" 2>/dev/null || stat -f%m "$cache_file") ))

    # Cache valid for 7 days
    if [[ $cache_age -lt 604800 ]]; then
      cat "$cache_file"
      return 0
    fi
  fi

  # Fetch and cache
  curl -s "https://archive.org/metadata/$identifier" | tee "$cache_file"
}

Summary

Archive Acquisition skill provides:

Discovery: Search API patterns for finding content
Quality Selection: Automatic best-format selection (prefer FLAC, accept MP3 320K)
Bulk Download: Recursive wget and parallel download patterns
Verification: Checksum validation and integrity checks
Collection Management: Indexing, deduplication, incremental sync
Integration: Source plan generation for /acquire command
Performance: Parallel tuning, caching, rate limiting

Use this skill whenever working with Internet Archive content in the Media Curator framework.

References

@$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/human-authorization.md — Seek explicit authorization before bulk downloads and irreversible archive operations
@$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/research-before-decision.md — Evaluate archive item quality and metadata before committing to download
@$AIWG_ROOT/agentic/code/frameworks/media-curator/skills/acquire/SKILL.md — General acquisition skill that delegates to archive-acquisition for archive.org sources
@$AIWG_ROOT/agentic/code/frameworks/media-curator/skills/integrity-verification/SKILL.md — Verify checksums after archive downloads
@$AIWG_ROOT/agentic/code/frameworks/media-curator/skills/quality-filtering/SKILL.md — Quality scoring used to select best format from archive items

Adoption

jmagly/Archive Acquisition

$ install --global

Security Scan Results

SKILL.md

Archive Acquisition Skill

Internet Archive Overview

Discovery Patterns

Search API

Response Format

Collection Browsing

Advanced Query Patterns

Item Metadata Retrieval

Get Item Details

Filter by Format

Download Patterns

Single Item Download

Recursive Collection Download

Parallel Collection Download

wget Recursive Download

Quality Selection Strategy

Format Preference Hierarchy

Automatic Quality Selection

Size vs Quality Trade-offs

Metadata Extraction and Storage

Extract Archive Metadata

Merge with Curator Metadata

Verification and Quality Assurance

Checksum Verification

File Integrity Check

Rate Limiting and Politeness

Bulk Download Best Practices

Collection Management

Build Local Collection Index

Deduplicate Collection

Advanced Patterns

Incremental Collection Sync

Filtered Batch Download

Integration with Media Curator Framework

Source Plan Generation

Troubleshooting

Common Issues and Solutions

Debug Mode

Performance Optimization

Parallel Download Tuning

Caching Strategy

Summary

References

Related Skills

jmagly/radar-status

jmagly/radar-report

jmagly/radar-init

jmagly/profile-temporal

jmagly/Archive Acquisition

$ install --global

Security Scan Results

SKILL.md

Archive Acquisition Skill

Internet Archive Overview

Discovery Patterns

Search API

Response Format

Collection Browsing

Advanced Query Patterns

Item Metadata Retrieval

Get Item Details

Filter by Format

Download Patterns

Single Item Download

Recursive Collection Download

Parallel Collection Download

wget Recursive Download

Quality Selection Strategy

Format Preference Hierarchy

Automatic Quality Selection

Size vs Quality Trade-offs

Metadata Extraction and Storage

Extract Archive Metadata

Merge with Curator Metadata

Verification and Quality Assurance