skills/25-HosungYou-Diverga/skills/i3/SKILL.md
RAG Builder with Parallel Document Processing. Constructs a vector database with local embeddings (zero cost). Handles PDF download, text extraction, chunking, and vector database creation. Absorbs the capabilities of B5 (Parallel Document Processor).
Use when: building RAG, creating vector database, downloading PDFs, embedding documents, batch processing
Triggers: build RAG, create vector database, download PDFs, embed documents, batch PDF processing
diverga_check_prerequisites("i3") → must return approved: true
If not approved → AskUserQuestion for each missing checkpoint (see .claude/references/checkpoint-templates.md)
diverga_mark_checkpoint("SCH_RAG_READINESS", decision, rationale)
Read .research/decision-log.yaml directly to verify prerequisites. Conversation history is a last resort.
- Agent ID: I3
- Category: I - Systematic Review Automation
- Tier: LOW (Haiku)
- Icon: 🗄️⚡
Builds a RAG (Retrieval-Augmented Generation) system from PRISMA-selected papers. Uses free local embeddings and a local ChromaDB instance, so the RAG build stage costs $0. Handles PDF download, text extraction, chunking, and vector database creation.
| Component | Tool | Cost |
|-----------|------|------|
| PDF Download | requests | $0 |
| Text Extraction | PyMuPDF | $0 |
| Embeddings | all-MiniLM-L6-v2 | $0 (local) |
| Vector DB | ChromaDB | $0 (local) |
| Chunking | LangChain | $0 |
Total RAG Building Cost: $0
Required:
- project_path: "string"
Optional:
- chunk_size_tokens: "int (default: 500)"
- chunk_overlap_tokens: "int (default: 100)"
- embedding_model: "string (default: all-MiniLM-L6-v2)"
- delay_between_downloads: "float (default: 2.0)"
- download_timeout: "int (default: 30)"
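The inputs above can be gathered into a single validated object before the build starts. The following is a minimal sketch, not part of the Diverga scripts: the `RagBuildOptions` class, the example project path, and the assumption that the delay and timeout are measured in seconds are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagBuildOptions:
    """Hypothetical container for the I3 inputs; defaults copied from the list above."""

    project_path: str
    chunk_size_tokens: int = 500
    chunk_overlap_tokens: int = 100
    embedding_model: str = "all-MiniLM-L6-v2"
    delay_between_downloads: float = 2.0  # assumed to be seconds
    download_timeout: int = 30  # assumed to be seconds

    def __post_init__(self) -> None:
        # The overlap must be smaller than the chunk, or a sliding-window
        # chunker would never advance.
        if self.chunk_size_tokens <= 0:
            raise ValueError("chunk_size_tokens must be positive")
        if not 0 <= self.chunk_overlap_tokens < self.chunk_size_tokens:
            raise ValueError("chunk_overlap_tokens must be in [0, chunk_size_tokens)")


# Only the required field is supplied; everything else takes its default.
options = RagBuildOptions(project_path="projects/example-review")
```

Rejecting an overlap that is equal to or larger than the chunk size up front avoids a build that fails partway through embedding.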
main_output:
  stage: "rag_build"
  pdf_download:
    total_papers: "int"
    downloaded: "int"
    failed: "int"
    success_rate: "string"
    total_size_mb: "int"
  rag_build:
    total_chunks: "int"
    avg_chunks_per_paper: "float"
    chunk_size_tokens: "int"
    chunk_overlap_tokens: "int"
    embedding_model: "string"
    embedding_dimensions: "int"
    vector_db: "string"
  output_paths:
    pdfs: "string"
    chroma_db: "string"
    rag_config: "string"
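The derived fields in this schema (`failed`, `success_rate`, `avg_chunks_per_paper`) follow from three raw counts. A minimal sketch of that arithmetic, assuming a hypothetical `summarize_build` helper and filling only a subset of the fields; the input numbers are taken from the example build report in this document:

```python
def summarize_build(total_papers: int, downloaded: int, total_chunks: int) -> dict:
    """Hypothetical helper: derive the computed fields of main_output from raw counts."""
    failed = total_papers - downloaded
    # success_rate is a string in the schema, so format it as a percentage.
    success_rate = f"{downloaded / total_papers:.1%}" if total_papers else "0.0%"
    # Average over papers that were actually downloaded and chunked.
    avg_chunks = round(total_chunks / downloaded, 1) if downloaded else 0.0
    return {
        "stage": "rag_build",
        "pdf_download": {
            "total_papers": total_papers,
            "downloaded": downloaded,
            "failed": failed,
            "success_rate": success_rate,
        },
        "rag_build": {
            "total_chunks": total_chunks,
            "avg_chunks_per_paper": avg_chunks,
        },
    }


summary = summarize_build(total_papers=287, downloaded=245, total_chunks=4850)
```

With these inputs the helper reproduces the report's figures: 42 unavailable PDFs, a success rate of 85.4%, and 19.8 chunks per paper.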
Before completing RAG build, I3 SHOULD:
REPORT build status:
RAG Build Complete
PDF Download:
- Total papers: 287
- PDFs downloaded: 245 (85.4%)
- PDFs unavailable: 42
Vector Database:
- Total chunks: 4,850
- Avg chunks/paper: 19.8
- Embedding model: all-MiniLM-L6-v2
- Database: ChromaDB
Storage:
- PDF size: 1.2 GB
- Vector DB size: 450 MB
Ready for research queries?
ASK if user wants to proceed
CONFIRM RAG is ready for queries
# Project path (set to your working directory)
cd "$(pwd)"
# Stage 4: PDF Download
python scripts/04_download_pdfs.py \
  --project {project_path} \
  --delay 2.0 \
  --timeout 30
# Stage 5: RAG Build
python scripts/05_build_rag.py \
  --project {project_path} \
  --chunk-size 1000 \
  --chunk-overlap 200 \
  --embedding-model sentence-transformers/all-MiniLM-L6-v2
Problem: The documentation said "1000 tokens", but the code chunked by 1000 characters.
Fix: Chunk by tokens, counted with tiktoken.
import tiktoken
tokenizer = tiktoken.get_encoding("cl100k_base")
# Settings
chunk_size_tokens = 500 # Actual tokens
chunk_overlap_tokens = 100 # Actual tokens
# Character fallback (if tiktoken unavailable)
chunk_size_chars = 1000
chunk_overlap_chars = 200
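The settings above describe a sliding window measured in tokens. A minimal sketch of that window, assuming a hypothetical `chunk_tokens` helper that works on any list of token IDs; it is not the code in `05_build_rag.py`:

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[int], chunk_size: int = 500, overlap: int = 100
) -> List[List[int]]:
    """Split a token sequence into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window has reached the end, so no chunk is a pure
        # repeat of the previous chunk's tail.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Integers stand in for real token IDs in this demonstration.
demo_chunks = chunk_tokens(list(range(1200)), chunk_size=500, overlap=100)
```

With tiktoken, the token IDs would come from `tokenizer.encode(text)`, and `tokenizer.decode(chunk)` turns each window back into text. The same function covers the character fallback if it is given a list of characters and the character-based sizes.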
| Model | Dimensions | Speed | Quality |
|-------|------------|-------|---------|
| all-MiniLM-L6-v2 (Default) | 384 | Fast | Good |
| all-mpnet-base-v2 | 768 | Medium | Better |
| bge-small-en-v1.5 | 384 | Fast | Good |
| e5-small-v2 | 384 | Fast | Good |
All models run locally at zero cost.
| Source | URL Pattern | Success Rate |
|--------|-------------|--------------|
| Semantic Scholar | openAccessPdf.url | ~40% |
| OpenAlex | open_access.oa_url | ~50% |
| arXiv | arxiv.org/pdf/{id}.pdf | 100% |
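One way to use this table is to try the sources in descending order of expected success. A minimal sketch, assuming a merged metadata record that carries the Semantic Scholar `openAccessPdf` object, the OpenAlex `open_access` object, and a hypothetical `arxiv_id` field; the record shape and the `pick_pdf_url` helper are illustrative, not the logic of `04_download_pdfs.py`:

```python
from typing import Optional


def pick_pdf_url(record: dict) -> Optional[str]:
    """Return the most promising PDF URL for a paper, or None if none is known."""
    # arXiv first: the table lists it as the most reliable source.
    arxiv_id = record.get("arxiv_id")  # hypothetical field name
    if arxiv_id:
        return f"https://arxiv.org/pdf/{arxiv_id}.pdf"

    # OpenAlex next (open_access.oa_url); the object or the URL may be null.
    oa_url = (record.get("open_access") or {}).get("oa_url")
    if oa_url:
        return oa_url

    # Semantic Scholar last (openAccessPdf.url).
    s2_url = (record.get("openAccessPdf") or {}).get("url")
    return s2_url or None


example_url = pick_pdf_url(
    {"arxiv_id": "2101.00001", "open_access": {"oa_url": "https://example.org/a.pdf"}}
)
```

Papers for which this returns `None` would count toward the `failed` total in the build report.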
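import time
from requests.exceptions import Timeout  # assumes downloads go through requests

max_retries = 3
base_delay = 2.0  # seconds

# download_pdf and url are defined elsewhere in the download script
for attempt in range(max_retries):
    try:
        download_pdf(url)
        break  # success, stop retrying
    except Timeout:
        # Exponential backoff: wait 2 s, 4 s, then 8 s
        delay = base_delay * (2 ** attempt)
        time.sleep(delay)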
data/04_rag/
├── chroma_db/
│ ├── chroma.sqlite3 # Metadata store
│ ├── {collection_id}/ # Vector embeddings
│ └── index/ # HNSW index
└── rag_config.json # Configuration
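The source does not specify the contents of `rag_config.json`. A minimal sketch of writing it, assuming the file simply records the `rag_build` settings from the output schema; the `write_rag_config` helper and the field selection are illustrative:

```python
import json
import tempfile
from pathlib import Path


def write_rag_config(rag_dir: Path, config: dict) -> Path:
    """Hypothetical helper: write config as rag_config.json inside rag_dir."""
    rag_dir.mkdir(parents=True, exist_ok=True)
    path = rag_dir / "rag_config.json"
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


# Assumed fields, mirroring the rag_build section of the output schema.
example_config = {
    "chunk_size_tokens": 500,
    "chunk_overlap_tokens": 100,
    "embedding_model": "all-MiniLM-L6-v2",
    "embedding_dimensions": 384,
    "vector_db": "ChromaDB",
}

# Demonstrate a round trip in a temporary directory instead of a real project.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_rag_config(Path(tmp) / "data" / "04_rag", example_config)
    loaded_config = json.loads(config_path.read_text(encoding="utf-8"))
```

Recording the embedding model and its dimensions next to the database makes it possible to check at query time that queries are embedded with the same model that built the index.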
After the build, I3 tests retrieval with the research question:
# Test query
results = vectorstore.similarity_search(
    research_question,
    k=5
)
# Report results
for doc in results:
    print(f"- {doc.metadata['title']} ({doc.metadata['year']})")
    print(f"  Preview: {doc.page_content[:150]}...")
| Keywords (EN) | Keywords (KR) | Action |
|---------------|---------------|--------|
| build RAG, create vector database | RAG 구축, 벡터 DB | Activate I3 |
| download PDFs | PDF 다운로드 | Activate I3 |
| embed documents | 문서 임베딩 | Activate I3 |
| Error | Action |
|-------|--------|
| PDF corrupt | Skip, log to failed list |
| OCR needed | Fall back to pytesseract |
| Memory limit | Process in batches |
| Embedding timeout | Retry with smaller batch |
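The last two rows of the table (batching, and retrying with a smaller batch after a timeout) can be combined in one loop. A minimal sketch, assuming the embedding call is passed in as a function and signals a timeout with Python's built-in `TimeoutError`; the `embed_in_batches` helper and the stand-in embedder are illustrative, not the I3 implementation:

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 64,
    min_batch_size: int = 1,
) -> List[Vector]:
    """Embed texts batch by batch, halving the batch size whenever a batch times out."""
    vectors: List[Vector] = []
    i = 0
    while i < len(texts):
        batch = texts[i:i + batch_size]
        try:
            vectors.extend(embed_fn(batch))
            i += len(batch)  # advance only after the batch succeeded
        except TimeoutError:
            if batch_size <= min_batch_size:
                raise  # cannot shrink further; surface the error
            batch_size = max(min_batch_size, batch_size // 2)
    return vectors


def fake_embed(batch: Sequence[str]) -> List[Vector]:
    """Stand-in embedder for the demo: times out on batches larger than 2 texts."""
    if len(batch) > 2:
        raise TimeoutError("batch too large")
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=8)
```

The reduced batch size is kept for the remaining texts rather than reset, on the assumption that whatever caused the timeout is likely to persist for the rest of the run.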
requires: ["I2-screening-assistant"]
sequential_next: []
parallel_compatible: []