research-toolkit/skills/web-archiving/SKILL.md
Web page archiving and retrieval from cached/deleted sources. Use when accessing unavailable pages, preserving web content, creating legal evidence archives, or building redundant archival workflows. Covers Wayback Machine, Archive.today, ArchiveBox, and evidence preservation tools.
npx skillsauth add jamditis/claude-skills-journalism web-archivingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Patterns for accessing inaccessible web pages and preserving web content for journalism, research, and legal purposes.
Try services in this order for maximum coverage:
┌─────────────────────────────────────────────────────────────────┐
│ ARCHIVE RETRIEVAL CASCADE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Wayback Machine (archive.org) │
│ └─ 900B+ pages, historical depth, API access │
│ ↓ not found │
│ 2. Archive.today (archive.is/archive.ph) │
│ └─ On-demand snapshots, paywall bypass │
│ └─ Caveat (2026): FBI subpoenaed registrar in Oct 2025; │
│ Wikipedia deprecated as citation source in Feb 2026 — │
│ prefer Wayback / Perma.cc for legal or citation use │
│ ↓ not found │
│ 3. Memento Time Travel (aggregator) │
│ └─ Searches multiple archives simultaneously │
│ │
│ Retired (do not use): Google Cache (`cache:` operator) was │
│ shut down in Sept 2024; Bing Cache dropdown was removed in │
│ the same year. Both formerly fed this cascade. │
│ │
└─────────────────────────────────────────────────────────────────┘
import requests
from typing import Optional
from datetime import datetime
from urllib.parse import quote, unquote
def check_wayback_availability(url: str) -> Optional[dict]:
"""Check if URL exists in Wayback Machine."""
api_url = "https://archive.org/wayback/available"
try:
response = requests.get(api_url, params={'url': url}, timeout=10)
data = response.json()
if data.get('archived_snapshots', {}).get('closest'):
snapshot = data['archived_snapshots']['closest']
return {
'available': snapshot.get('available', False),
'url': snapshot.get('url'),
'timestamp': snapshot.get('timestamp'),
'status': snapshot.get('status')
}
return None
except Exception as e:
return None
def get_wayback_url(url: str, timestamp: str = None) -> str:
"""Generate Wayback Machine URL for a page.
Returns the canonical raw form (`.../web/<timestamp>/<url>`) per
Wayback's replay-URL convention. If you intend to navigate to the
returned link in a browser AND the target URL has `#` fragments,
encode at the call site with urllib.parse.quote so the browser
doesn't strip the fragment before request dispatch.
Args:
url: Original URL to retrieve
timestamp: Optional YYYYMMDDHHMMSS format, or None for latest
"""
if timestamp:
return f"https://web.archive.org/web/{timestamp}/{url}"
return f"https://web.archive.org/web/{url}"
def save_to_wayback(url: str, s3_keys: Optional[tuple[str, str]] = None) -> Optional[str]:
"""Request Wayback Machine to archive a URL via Save Page Now.
Returns the archived URL if successful.
Anonymous requests are rate-limited at roughly 15/minute. Pass
`s3_keys=(access_key, secret)` from an Internet Archive account
to raise the cap (anonymous → ~50/min with auth) and avoid silent
drops on paywalled / heavily JS-rendered pages.
"""
# quote(unquote(url), ...) normalizes any existing %xx escapes
# first so they don't get double-encoded into %25xx.
save_url = f"https://web.archive.org/save/{quote(unquote(url), safe='')}"
headers = {'User-Agent': 'Mozilla/5.0 (research-archiver)'}
if s3_keys:
headers['Authorization'] = f'LOW {s3_keys[0]}:{s3_keys[1]}'
try:
response = requests.get(save_url, headers=headers, timeout=60)
if response.status_code == 200:
# SPN delivers the canonical archive URL via the final URL
# after redirect-following (or the `Link` header on async
# captures). `response.url` is the reliable common case.
return response.url
return None
except Exception:
return None
def get_all_snapshots(url: str, limit: int = 100) -> list[dict]:
"""Get all archived snapshots of a URL using CDX API.
Returns list of snapshots with timestamps and status codes.
"""
cdx_url = "https://web.archive.org/cdx/search/cdx"
params = {
'url': url,
'output': 'json',
'limit': limit,
'fl': 'timestamp,original,statuscode,digest,length'
}
try:
response = requests.get(cdx_url, params=params, timeout=30)
data = response.json()
if len(data) < 2: # First row is headers
return []
headers = data[0]
snapshots = []
for row in data[1:]:
snapshot = dict(zip(headers, row))
snapshot['wayback_url'] = (
f"https://web.archive.org/web/{snapshot['timestamp']}/{snapshot['original']}"
)
snapshots.append(snapshot)
return snapshots
except Exception:
return []
import re
import requests
from urllib.parse import quote, unquote, urljoin
def save_to_archive_today(url: str) -> Optional[str]:
"""Submit URL to Archive.today for archiving.
Note: Archive.today has rate limiting and CAPTCHA requirements.
This function works for basic archiving but may require
manual intervention for high-volume use.
Operational notes (2026): the FBI subpoenaed archive.today's
registrar in October 2025; Wikipedia stopped accepting it as a
citation source in February 2026 after the site shipped
DDoS-attack code in January 2026. Still useful for capturing
content the Wayback Machine can't render — but treat as
secondary to Wayback / Perma.cc for legal or citation use.
"""
submit_url = "https://archive.today/submit/"
data = {
'url': url,
'anyway': '1' # Archive even if recent snapshot exists
}
try:
response = requests.post(
submit_url,
data=data,
timeout=60,
allow_redirects=False,
headers={'User-Agent': 'Mozilla/5.0 (research-archiver)'},
)
# archive.today returns the snapshot URL in one of two shapes:
# - 30x with Location: https://archive.today/<snapshot_id>
# (Location MAY be relative per RFC 7231)
# - 200 with Refresh: 0;url=https://archive.today/<snapshot_id>
# (Refresh keyword is case-insensitive per HTML spec)
# Following redirects silently can land on /wip/ pages or hide
# the canonical snapshot URL, so handle both headers explicitly.
if response.status_code in (301, 302, 303, 307, 308):
location = response.headers.get('Location')
if location:
return urljoin(response.url, location)
if response.status_code == 200:
refresh = response.headers.get('Refresh', '')
m = re.search(r'\burl\s*=\s*(.+)', refresh, re.IGNORECASE)
if m:
target = m.group(1).strip().strip('\'"')
return urljoin(response.url, target)
return None
except Exception:
return None
def search_archive_today(url: str) -> Optional[str]:
"""Search for existing Archive.today snapshot.
Uses the /newest/<url> lookup which 302s to the most recent
snapshot (or to a CAPTCHA page if rate-limited).
"""
search_url = f"https://archive.ph/newest/{quote(unquote(url), safe='')}"
try:
response = requests.get(
search_url,
timeout=30,
allow_redirects=False,
headers={'User-Agent': 'Mozilla/5.0 (research-archiver)'},
)
if response.status_code in (301, 302, 303, 307, 308):
location = response.headers.get('Location')
if location:
resolved = urljoin(response.url, location)
# Only return if we landed on an archive page, not CAPTCHA
if 'archive.' in resolved and '/newest/' not in resolved:
return resolved
return None
except Exception:
return None
from dataclasses import dataclass
from typing import Optional, List
from concurrent.futures import ThreadPoolExecutor, as_completed
@dataclass
class ArchiveResult:
service: str
url: str
archived_url: Optional[str]
success: bool
error: Optional[str] = None
class MultiArchiver:
"""Archive URLs to multiple services for redundancy."""
def __init__(self):
self.services = [
('wayback', self._save_wayback),
('archive_today', self._save_archive_today),
('perma_cc', self._save_perma), # Requires API key
]
def archive_url(self, url: str, parallel: bool = True) -> List[ArchiveResult]:
"""Archive URL to all services.
Args:
url: URL to archive
parallel: If True, archive to all services simultaneously
"""
results = []
if parallel:
with ThreadPoolExecutor(max_workers=3) as executor:
futures = {
executor.submit(save_func, url): name
for name, save_func in self.services
}
for future in as_completed(futures):
service = futures[future]
try:
archived_url = future.result()
results.append(ArchiveResult(
service=service,
url=url,
archived_url=archived_url,
success=archived_url is not None
))
except Exception as e:
results.append(ArchiveResult(
service=service,
url=url,
archived_url=None,
success=False,
error=str(e)
))
else:
for name, save_func in self.services:
try:
archived_url = save_func(url)
results.append(ArchiveResult(
service=name,
url=url,
archived_url=archived_url,
success=archived_url is not None
))
except Exception as e:
results.append(ArchiveResult(
service=name,
url=url,
archived_url=None,
success=False,
error=str(e)
))
return results
def _save_wayback(self, url: str) -> Optional[str]:
return save_to_wayback(url)
def _save_archive_today(self, url: str) -> Optional[str]:
return save_to_archive_today(url)
def _save_perma(self, url: str) -> Optional[str]:
# Requires Perma.cc API key
# Implementation depends on having API credentials
return None
# Recommended: Docker Compose (v0.8.x ships with Chromium, yt-dlp,
# wget, single-file, and other capture tools preinstalled).
mkdir ~/web-archives && cd ~/web-archives
curl -O 'https://docker-compose.archivebox.io' && mv docker-compose.archivebox.io docker-compose.yml
docker compose run archivebox init --setup
docker compose up -d # start the web UI on :8000
# Pip-based install still works but you'll need to install Chromium /
# yt-dlp / wget / single-file separately for full capture coverage:
# pip install archivebox && archivebox init
# Add URLs to archive (from inside the archive directory)
docker compose run archivebox add "https://example.com/article"
# Add multiple URLs from file
docker compose run archivebox add --depth=0 < urls.txt
# Schedule regular archiving
docker compose run archivebox schedule --every=day --depth=1 "https://example.com/feed.rss"
import subprocess
from pathlib import Path
from typing import List, Optional
class ArchiveBoxManager:
"""Manage local ArchiveBox instance."""
def __init__(self, archive_dir: Path):
self.archive_dir = archive_dir
self._ensure_initialized()
def _ensure_initialized(self):
"""Initialize ArchiveBox if needed."""
if not (self.archive_dir / 'index.sqlite3').exists():
subprocess.run(
['archivebox', 'init'],
cwd=self.archive_dir,
check=True
)
def add_url(self, url: str, depth: int = 0) -> bool:
"""Archive a single URL.
Args:
url: URL to archive
depth: 0 for single page, 1 to follow links one level deep
"""
result = subprocess.run(
['archivebox', 'add', f'--depth={depth}', url],
cwd=self.archive_dir,
capture_output=True,
text=True
)
return result.returncode == 0
def add_urls_from_file(self, filepath: Path) -> bool:
"""Archive URLs from a text file (one per line)."""
with open(filepath) as f:
result = subprocess.run(
['archivebox', 'add', '--depth=0'],
cwd=self.archive_dir,
stdin=f,
capture_output=True
)
return result.returncode == 0
def search(self, query: str) -> List[dict]:
"""Search archived content."""
result = subprocess.run(
['archivebox', 'list', '--filter-type=search', query],
cwd=self.archive_dir,
capture_output=True,
text=True
)
# Parse output...
return []
import hashlib
import sys
import json
import requests
from datetime import datetime, timezone
from dataclasses import dataclass, asdict
from typing import List
@dataclass
class EvidenceRecord:
"""Legally defensible evidence record."""
# Content identification
original_url: str
archived_urls: List[str] # Multiple archive copies
content_hash_sha256: str
# Timestamps
capture_time_utc: str
first_observed: str
# Metadata
page_title: str
captured_by: str
capture_method: str
tool_versions: dict
# Chain of custody
custody_log: List[dict] # Who accessed when
def add_custody_entry(self, accessor: str, action: str, notes: str = ""):
"""Log access to evidence."""
self.custody_log.append({
'timestamp': datetime.now(timezone.utc).isoformat(),
'accessor': accessor,
'action': action,
'notes': notes
})
def to_json(self) -> str:
return json.dumps(asdict(self), indent=2)
@classmethod
def from_capture(cls, url: str, content: bytes, captured_by: str):
"""Create evidence record from captured content."""
now = datetime.now(timezone.utc).isoformat()
py = sys.version_info
return cls(
original_url=url,
archived_urls=[],
content_hash_sha256=hashlib.sha256(content).hexdigest(),
capture_time_utc=now,
first_observed=now,
page_title="",
captured_by=captured_by,
capture_method="automated_capture",
tool_versions={
# Replace 'archiver' with your tool's actual __version__
'archiver': '1.0.0',
'python': f'{py.major}.{py.minor}.{py.micro}',
'requests': requests.__version__,
},
custody_log=[]
)
def capture_as_evidence(url: str, captured_by: str) -> EvidenceRecord:
"""Capture URL with full evidence chain documentation."""
# Capture content
response = requests.get(url)
content = response.content
# Create evidence record
record = EvidenceRecord.from_capture(url, content, captured_by)
record.page_title = extract_title(content)
# Archive to multiple services
archiver = MultiArchiver()
results = archiver.archive_url(url)
for result in results:
if result.success:
record.archived_urls.append(result.archived_url)
# Log initial capture
record.add_custody_entry(
captured_by,
'initial_capture',
f'Captured from {url}, archived to {len(record.archived_urls)} services'
)
return record
import requests
from typing import Optional
class PermaCC:
"""Perma.cc API client for legal-grade archiving.
Requires API key from perma.cc (free for limited use).
Used by US courts and legal professionals.
"""
def __init__(self, api_key: str):
self.api_key = api_key
self.base_url = "https://api.perma.cc/v1"
self.headers = {
'Authorization': f'ApiKey {api_key}',
'Content-Type': 'application/json'
}
def create_archive(self, url: str, folder_id: int = None) -> Optional[dict]:
"""Create a new Perma.cc archive.
Returns dict with guid, creation_timestamp, and captures.
"""
data = {'url': url}
if folder_id:
data['folder'] = folder_id
try:
response = requests.post(
f"{self.base_url}/archives/",
json=data,
headers=self.headers,
timeout=60
)
if response.status_code == 201:
result = response.json()
return {
'guid': result['guid'],
'url': f"https://perma.cc/{result['guid']}",
'creation_timestamp': result['creation_timestamp'],
'title': result.get('title', '')
}
return None
except Exception:
return None
def get_archive(self, guid: str) -> Optional[dict]:
"""Retrieve archive metadata by GUID."""
try:
response = requests.get(
f"{self.base_url}/archives/{guid}/",
headers=self.headers,
timeout=30
)
return response.json() if response.status_code == 200 else None
except Exception:
return None
// Save to Wayback Machine - add as bookmark
javascript:(function(){
window.open('https://web.archive.org/save/' + encodeURIComponent(location.href), '_blank');
})();
// Save to Archive.today
javascript:(function(){
window.open('https://archive.today/?run=1&url=' + encodeURIComponent(location.href), '_blank');
})();
// Check all archives (Memento)
javascript:(function(){
window.open('https://timetravel.mementoweb.org/list/0/' + encodeURIComponent(location.href), '_blank');
})();
// Try multiple archives for dead pages
// Note: Google Cache (webcache.googleusercontent.com) was retired in
// Sept 2024 and is omitted here.
javascript:(function(){
// Encode location.href so any '#' / '?' inside it travels as part
// of the path argument; raw concatenation lets the browser strip
// fragments and re-attach query strings to the outer URL.
var encoded = encodeURIComponent(location.href);
var archives = [
'https://web.archive.org/web/*/' + encoded,
'https://archive.ph/newest/' + encoded,
'https://timetravel.mementoweb.org/list/0/' + encoded
];
archives.forEach(function(a){ window.open(a, '_blank'); });
})();
| Service | Best For | API | Deletions | Max Size | Notes |
|---------|----------|-----|-----------|----------|-------|
| Wayback Machine | Historical research | Yes (free) | On request | Unlimited | Anonymous SPN ~15/min; auth via S3 keys raises cap |
| Archive.today | Paywall bypass, quick saves | Informal | Never | 50MB | FBI subpoena Oct 2025; Wikipedia deprecated as citation source Feb 2026 — avoid for legal/citation use |
| Perma.cc | Legal citations | Yes (free tier) | By creator | Standard pages | Used by US courts; Authorization: ApiKey <key> |
| ArchiveBox | Self-hosted, privacy | Local | Never | Disk space | v0.8 ships Docker Compose with Chromium / yt-dlp / wget |
| Browsertrix Cloud | Interactive / JS-heavy capture | Yes | By creator | Plan-based | Webrecorder.net successor to Conifer; outputs WARC |
| Conifer | Interactive content | Yes | By creator | 5GB free | Older Webrecorder service; Browsertrix Cloud is the active path |
import requests
from enum import Enum
from typing import Optional
from urllib.parse import quote, unquote
class ArchiveError(Enum):
NOT_FOUND = "No archive found"
RATE_LIMITED = "Rate limited by service"
BLOCKED = "URL blocked from archiving"
TIMEOUT = "Request timed out"
SERVICE_DOWN = "Archive service unavailable"
def get_archived_page(url: str) -> tuple[Optional[str], Optional[ArchiveError]]:
"""Try all archive services with proper error handling."""
# 1. Try Wayback Machine first
try:
result = check_wayback_availability(url)
if result and result.get('available'):
return result['url'], None
except requests.Timeout:
pass # Try next service
except Exception:
pass
# 2. Try Archive.today
try:
result = search_archive_today(url)
if result:
return result, None
except Exception:
pass
# 3. Try Memento aggregator
try:
memento_url = f"https://timetravel.mementoweb.org/api/json/0/{quote(unquote(url), safe='')}"
response = requests.get(memento_url, timeout=30)
data = response.json()
if data.get('mementos', {}).get('closest'):
return data['mementos']['closest']['uri'][0], None
except Exception:
pass
return None, ArchiveError.NOT_FOUND
Always archive to at least two services:
def ensure_archived(url: str) -> bool:
"""Ensure URL is archived in at least 2 services."""
archiver = MultiArchiver()
results = archiver.archive_url(url)
successful = [r for r in results if r.success]
return len(successful) >= 2
robots.txt for bulk archivingtesting
Configure install-time cooldowns for npm/bun (minimum release age) and run a sandboxed pre-install scan when the cooldown has to be bypassed. Use when the user asks about supply-chain attacks, npm/bun security, "minimum release age", a "cooldown" for installs, hardening against Shai-Hulud-class worms, or how to safely install a package that was just published. Also use after any recent supply-chain incident in the npm ecosystem.
tools
Generate CLAUDE.md project memory files that transfer institutional knowledge, not obvious information. Use when setting up new journalism projects, onboarding collaborators, or documenting project-specific quirks. Includes templates for editorial tools, event websites, publications, research projects, content pipelines, and digital archives.
development
Use when suggesting APIs for a project, looking for free data sources, building weekend projects that need external data, or when the user needs weather, news, finance, sports, ML, or entertainment data without paid subscriptions
development
Choose the correct CLAUDE.md or LESSONS.md template for journalism projects. Use when starting a new project, setting up documentation, or unsure which template category fits best. Provides decision trees and selection guidance for 6 journalism-focused template types.