skills/minerals-web-ingest/SKILL.md
Ingest and normalize web pages for critical-minerals intelligence, with optional Firecrawl fetching, deduplication manifest, and JSONL export
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add lamm-mit/scienceclaw minerals-web-ingest
Fetch and normalize full-page content from discovered URLs, deduplicate by content hash, and emit records suitable for indexing and analysis.
```bash
# Ingest URLs from monitor output
python3 {baseDir}/scripts/web_ingest.py \
  --input-json monitor_records.json \
  --output-jsonl ingested_records.jsonl \
  --format summary

# Direct URL ingest
python3 {baseDir}/scripts/web_ingest.py \
  --url https://www.energy.gov/articles/example \
  --url https://www.usgs.gov/news/example \
  --format json

# Prefer Firecrawl (if FIRECRAWL_API_KEY is set)
python3 {baseDir}/scripts/web_ingest.py --input-json gov_records.json --prefer-firecrawl
```
| Parameter | Description | Default |
|-----------|-------------|---------|
| --input-json | JSON input file with URL records | - |
| --url | Direct URL input (repeatable) | - |
| --output-jsonl | Optional JSONL output path | - |
| --manifest-path | URL->content hash dedupe manifest | ~/.scienceclaw/minerals_web_ingest_manifest.json |
| --timeout | HTTP timeout seconds | 30 |
| --max-chars | Max chars stored per page | 12000 |
| --prefer-firecrawl | Use Firecrawl first, fallback to requests | false |
| --format | summary, detailed, json | summary |
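The `--manifest-path` and `--max-chars` options suggest a URL-to-hash lookup applied to truncated page text. The following is a minimal sketch of that idea, not the script's actual implementation: the function names, the use of SHA-256, the whitespace normalization, and the flat `{url: hash}` manifest layout are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def content_hash(text: str, max_chars: int = 12000) -> str:
    """Hash whitespace-normalized text truncated to max_chars.

    Assumption: SHA-256 over normalized text; the real script may hash differently.
    """
    normalized = " ".join(text.split())[:max_chars]
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def load_manifest(path: Path) -> dict:
    """Load the manifest, or start empty if the file does not exist yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def is_new_content(url: str, text: str, manifest: dict) -> bool:
    """Return True (and record the hash) if the URL is unseen or its content changed."""
    digest = content_hash(text)
    if manifest.get(url) == digest:
        return False  # same URL, same content: skip re-ingesting
    manifest[url] = digest
    return True


def save_manifest(path: Path, manifest: dict) -> None:
    """Write the manifest back to disk, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8")
```

Under these assumptions, a page whose text differs only in whitespace hashes to the same value, so re-running the ingest on unchanged pages would emit no duplicate records.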
Ingested records include:
url, source, published_at, title, summary, commodity_tags, country_tags, policy_signal, confidence, retrieved_at, source_type, content, content_hash.
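As an illustration of consuming the JSONL export, the sketch below reads records line by line and tallies `commodity_tags`. The field names come from the list above; the helper names are hypothetical, and treating `commodity_tags` as a list of strings (or a single string) is an assumption, since the field types are not specified here.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty line of JSONL input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_commodities(records: Iterable[dict]) -> Counter:
    """Count how often each commodity tag appears across records.

    Assumption: commodity_tags is a list of strings; a bare string is
    treated as a single tag, and a missing field as no tags.
    """
    counts = Counter()
    for record in records:
        tags = record.get("commodity_tags") or []
        if isinstance(tags, str):
            tags = [tags]
        counts.update(tags)
    return counts
```

Because `read_jsonl` accepts any iterable of lines, an open file handle for `ingested_records.jsonl` can be passed to it directly, and the result fed to `count_commodities`.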