skills/corpus-search/SKILL.md
Semantic search over critical minerals PDF corpus — rare earth, lithium, cobalt, nickel supply chain, trade policy, extraction, and materials research via Pinecone
npx skillsauth add lamm-mit/scienceclaw corpus-search
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Semantic search over a local collection of critical minerals PDFs (USGS, UN Comtrade, World Bank, SEC, WTO, Mindat, MinCan reports). Documents are chunked, embedded with llama-text-embed-v2, and stored in a Pinecone index for fast similarity search.
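The chunking step mentioned above can be pictured with a minimal sketch. The skill does not document its chunking strategy, so the character-window approach and the `chunk_size` and `overlap` values here are assumptions for illustration, not the behavior of `ingest_corpus.py`.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows (hypothetical strategy)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(piece)
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.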
python3 {baseDir}/scripts/search_corpus.py --query "rare earth separation techniques"
python3 {baseDir}/scripts/search_corpus.py --query "supply chain risks" --commodity lithium
python3 {baseDir}/scripts/search_corpus.py --query "trade flows" --source Comtrade
python3 {baseDir}/scripts/search_corpus.py --query "cobalt extraction" --rerank --top-k 20
python3 {baseDir}/scripts/search_corpus.py --query "graphite processing" --format json
| Parameter | Description | Default |
|-----------|-------------|---------|
| --query | Semantic search query | Required |
| --commodity | Filter by commodity keyword (e.g., lithium, cobalt, rare earth) | - |
| --source | Filter by source organization (e.g., USGS, Comtrade, SEC) | - |
| --top-k | Number of results to retrieve | 10 |
| --rerank | Enable reranking with pinecone-rerank-v0 | false |
| --format | Output format: summary, detailed, json | summary |
| --index-name | Pinecone index name | scienceclaw-minerals-corpus |
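When calling the search script from another Python program, the parameters in the table can be assembled into an argument list. The sketch below is one possible wrapper: `build_search_command` and its `base_dir` argument are hypothetical names, and only flags listed in the table are used. Flags left at their documented defaults are omitted.

```python
import shlex


def build_search_command(base_dir, query, commodity=None, source=None,
                         rerank=False, top_k=10, fmt="summary", index_name=None):
    """Build an argv list for search_corpus.py (hypothetical helper)."""
    if fmt not in ("summary", "detailed", "json"):
        raise ValueError(f"unsupported format: {fmt}")

    cmd = ["python3", f"{base_dir}/scripts/search_corpus.py", "--query", query]
    if commodity:
        cmd += ["--commodity", commodity]
    if source:
        cmd += ["--source", source]
    if rerank:
        cmd.append("--rerank")
    if top_k != 10:  # 10 is the documented default
        cmd += ["--top-k", str(top_k)]
    if fmt != "summary":  # summary is the documented default
        cmd += ["--format", fmt]
    if index_name:
        cmd += ["--index-name", index_name]
    return cmd


if __name__ == "__main__":
    # "/path/to/skill" stands in for whatever {baseDir} resolves to.
    cmd = build_search_command("/path/to/skill", "cobalt extraction",
                               rerank=True, top_k=20)
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, capture_output=True, text=True, check=True)` avoids shell quoting problems with multi-word queries. The structure of the `--format json` output is not documented here, so inspect it before parsing.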
Before searching, ingest PDFs into the Pinecone index:
# Dry run — list PDFs that would be ingested:
python3 {baseDir}/scripts/ingest_corpus.py --corpus-dir ~/critical-minerals-data/ --dry-run
# Ingest all PDFs:
python3 {baseDir}/scripts/ingest_corpus.py --corpus-dir ~/critical-minerals-data/
# Force re-ingest (ignore manifest):
python3 {baseDir}/scripts/ingest_corpus.py --corpus-dir ~/critical-minerals-data/ --force-reingest
| Parameter | Description | Default |
|-----------|-------------|---------|
| --corpus-dir | Directory containing PDFs | ~/critical-minerals-data/ |
| --index-name | Pinecone index name | scienceclaw-minerals-corpus |
| --force-reingest | Re-ingest all files, ignoring manifest | false |
| --dry-run | List files without ingesting | false |
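The `--force-reingest` flag implies that the ingest script keeps a manifest of files it has already processed. The manifest format is not documented, so the sketch below assumes a hypothetical JSON file that maps each PDF's relative path to its SHA-256 digest. It illustrates the skip logic only and is not the script's implementation.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def select_pdfs(corpus_dir: Path, manifest_path: Path,
                force_reingest: bool = False) -> list[Path]:
    """List PDFs that are new or changed relative to the (assumed) manifest."""
    manifest = {}
    if manifest_path.exists() and not force_reingest:
        manifest = json.loads(manifest_path.read_text())

    pending = []
    for pdf in sorted(corpus_dir.rglob("*.pdf")):  # includes subdirectories
        key = str(pdf.relative_to(corpus_dir))
        # Ingest when the file is unknown or its contents have changed.
        if manifest.get(key) != file_digest(pdf):
            pending.append(pdf)
    return pending


if __name__ == "__main__":
    corpus = Path.home() / "critical-minerals-data"
    # Manifest location is an assumption made for this example.
    for pdf in select_pdfs(corpus, corpus / "manifest.json"):
        print(pdf)
```

Printing the selected files without uploading them is roughly what a `--dry-run` would show. Hashing contents rather than comparing modification times means a file that was touched but not changed is not re-embedded.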
- PINECONE_API_KEY environment variable
- … usgs/, sec/)
- pinecone-rerank-v0 for higher quality results at the cost of latency
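Because both scripts depend on `PINECONE_API_KEY`, a wrapper may want to fail early with a clear message when it is missing. A minimal sketch follows; `require_pinecone_key` is a hypothetical helper and not part of the skill.

```python
import os


def require_pinecone_key(env=None) -> str:
    """Return the Pinecone API key or raise if it is missing or blank."""
    env = os.environ if env is None else env
    key = env.get("PINECONE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "PINECONE_API_KEY is not set; export it before running the scripts"
        )
    return key


if __name__ == "__main__":
    try:
        require_pinecone_key()
        print("PINECONE_API_KEY found")
    except RuntimeError as err:
        print(err)
```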