skills/43-wentorai-research-plugins/skills/tools/knowledge-graph/rag-methodology-guide/SKILL.md
RAG architecture for academic knowledge retrieval and synthesis
Install this skill globally with one command: `npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research rag-methodology-guide`. Works with Claude Code, Cursor, and Windsurf.
Design and implement Retrieval-Augmented Generation (RAG) systems for academic research, including document chunking, embedding strategies, retrieval pipelines, and evaluation.
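Retrieval-Augmented Generation (RAG) augments a language model's generation with relevant information retrieved from an external knowledge base. For academic research, this enables answers that are grounded in retrieved sources and carry inline citations. The pipeline has four stages: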
```
Query: "What are the main challenges of protein folding?"
    |
    v
[1. Query Processing]
    |-- Embed query using embedding model
    |-- Optional: Query expansion / HyDE
    |
    v
[2. Retrieval]
    |-- Search vector database for top-k relevant chunks
    |-- Optional: Reranking with cross-encoder
    |
    v
[3. Context Assembly]
    |-- Combine retrieved chunks into a prompt
    |-- Add metadata (source, page, citation)
    |
    v
[4. Generation]
    |-- LLM generates answer grounded in retrieved context
    |-- Include inline citations
    |
    v
Answer with citations
```
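The four stages above can be sketched end to end without any external services. This is a minimal illustration only: the bag-of-words `embed` function stands in for a real embedding model, the chunk texts and `source` labels are hypothetical placeholders (not quotations from any paper), and stage 4 stops at building the prompt rather than calling an LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer_pipeline(query, chunks, top_k=2):
    """Run stages 1-3 and return the prompt that stage 4 would send to an LLM."""
    query_vec = embed(query)  # 1. query processing
    ranked = sorted(
        chunks, key=lambda c: cosine(query_vec, embed(c["text"])), reverse=True
    )
    top = ranked[:top_k]  # 2. retrieval
    context = "\n".join(  # 3. context assembly with numbered sources
        f"[{i}] ({c['source']}): {c['text']}" for i, c in enumerate(top, 1)
    )
    return f"EXCERPTS:\n{context}\n\nQUESTION: {query}"  # 4. input to generation


# Hypothetical placeholder chunks for illustration
toy_chunks = [
    {"text": "Protein folding is hard because the conformational search space is vast.",
     "source": "hypothetical-a"},
    {"text": "Transformers rely on self-attention instead of recurrence.",
     "source": "hypothetical-b"},
    {"text": "Misfolded protein aggregates complicate folding experiments.",
     "source": "hypothetical-c"},
]
pipeline_prompt = answer_pipeline(
    "What are the main challenges of protein folding?", toy_chunks
)
print(pipeline_prompt)
```

The unrelated Transformer chunk is left out of the prompt because it shares almost no vocabulary with the query. A real embedding model would make the same cut on meaning rather than on word overlap.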
| Strategy | Description | Best For |
|----------|-------------|----------|
| Fixed-size | Split every N characters/tokens | Simple, fast, baseline |
| Sentence-based | Split on sentence boundaries | Natural reading units |
| Paragraph-based | Split on paragraph breaks | Coherent semantic units |
| Section-based | Split on document headings | Academic papers |
| Recursive | Hierarchically split (heading > paragraph > sentence) | General purpose |
| Semantic | Split on topic shifts using embeddings | Best quality, slower |
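The first two strategies in the table can be sketched in plain Python to show how they differ. This is an illustrative sketch, not a production splitter: the function names, default sizes, and the regex used to detect sentence ends are assumptions, and lengths are counted in characters rather than tokens.

```python
import re


def fixed_size_chunks(text, size=200, overlap=50):
    """Split every `size` characters, repeating `overlap` characters between chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def sentence_chunks(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole rather than cut.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = (
    "RAG retrieves context. It then generates an answer. "
    "Citations link claims to sources."
)
print(fixed_size_chunks("abcdefghij", size=4, overlap=2))
print(sentence_chunks(sample, max_chars=60))
```

Fixed-size splitting can cut a sentence in half, which is why it is listed as a baseline. The sentence-based variant never breaks inside a sentence, so its chunk lengths vary.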
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Newer LangChain releases expose the same class as
# langchain_text_splitters.RecursiveCharacterTextSplitter.


def chunk_academic_paper(text, chunk_size=1000, chunk_overlap=200):
    """Chunk an academic paper using recursive splitting."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,        # measured in characters (length_function=len)
        chunk_overlap=chunk_overlap,
        separators=[
            "\n## ",   # H2 headings (section breaks)
            "\n### ",  # H3 headings (subsection breaks)
            "\n\n",    # Paragraph breaks
            "\n",      # Line breaks
            ". ",      # Sentence breaks
            " ",       # Word breaks
        ],
        length_function=len
    )
    chunks = splitter.split_text(text)
    return chunks


# Add metadata to each chunk
def create_documents(paper_text, metadata):
    """Create chunks with source metadata for citation tracking."""
    chunks = chunk_academic_paper(paper_text)
    documents = []
    for i, chunk in enumerate(chunks):
        documents.append({
            "text": chunk,
            "metadata": {
                **metadata,
                "chunk_index": i,
                "chunk_total": len(chunks)
            }
        })
    return documents


# Example usage
# extracted_text is assumed to hold the paper's full text as a string,
# produced by an earlier PDF or Markdown extraction step (not shown).
docs = create_documents(
    paper_text=extracted_text,
    metadata={
        "title": "Attention Is All You Need",
        "authors": "Vaswani et al.",
        "year": 2017,
        "doi": "10.48550/arXiv.1706.03762",
        "source_file": "vaswani2017attention.pdf"
    }
)
```
| Model | Dimensions | Quality | Speed | Cost |
|-------|-----------|---------|-------|------|
| OpenAI text-embedding-3-small | 1536 | Good | Fast | $0.02/1M tokens |
| OpenAI text-embedding-3-large | 3072 | Excellent | Fast | $0.13/1M tokens |
| Cohere embed-v3 | 1024 | Excellent | Fast | $0.10/1M tokens |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | Good | Very fast | Free (local) |
| BAAI/bge-large-en-v1.5 | 1024 | Excellent | Medium | Free (local) |
| nomic-embed-text | 768 | Good | Fast | Free (local) |
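The cost and dimension columns translate directly into a budget and a storage estimate. The sketch below works through the arithmetic for a hypothetical corpus of 1,000 papers at about 8,000 tokens each, split into chunks of about 250 tokens. The corpus figures are assumptions for illustration. The storage estimate assumes 4-byte floats and ignores chunk overlap and index overhead.

```python
def embedding_cost_usd(num_tokens, price_per_million_tokens):
    """API cost of embedding `num_tokens` tokens at the listed price."""
    return num_tokens / 1_000_000 * price_per_million_tokens


def raw_vector_size_mb(num_chunks, dimensions, bytes_per_value=4):
    """Size of the raw vectors alone, assuming float32 values."""
    return num_chunks * dimensions * bytes_per_value / 1_000_000


# Hypothetical corpus (assumed numbers, not measurements)
total_tokens = 1_000 * 8_000          # 8 million tokens
num_chunks = total_tokens // 250      # 32,000 chunks

cost_small = embedding_cost_usd(total_tokens, 0.02)   # text-embedding-3-small
cost_large = embedding_cost_usd(total_tokens, 0.13)   # text-embedding-3-large
size_small = raw_vector_size_mb(num_chunks, 1536)
size_minilm = raw_vector_size_mb(num_chunks, 384)

print(f"3-small: ${cost_small:.2f}, vectors ~{size_small:.0f} MB")
print(f"3-large: ${cost_large:.2f}")
print(f"MiniLM (local): $0.00, vectors ~{size_minilm:.0f} MB")
```

At this scale the API cost is small either way, so the dimension count (and with it memory use and search speed) may matter more than the price per token.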
| Database | Type | Scalability | Features |
|----------|------|------------|----------|
| ChromaDB | Embedded | Small-medium | Simple, good for prototyping |
| FAISS | Library | Large | Facebook research, GPU support |
| Pinecone | Cloud | Large | Managed, serverless |
| Weaviate | Self-hosted/Cloud | Large | Hybrid search, filters |
| Qdrant | Self-hosted/Cloud | Large | Rich filtering, payload storage |
| pgvector | PostgreSQL extension | Medium | SQL integration |
```python
import chromadb
from sentence_transformers import SentenceTransformer

# Initialize embedding model (local, free)
embed_model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Initialize ChromaDB with cosine distance for the HNSW index
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="research_papers",
    metadata={"hnsw:space": "cosine"}
)


# Index documents
def index_documents(documents):
    """Add documents to the vector database."""
    texts = [doc["text"] for doc in documents]
    embeddings = embed_model.encode(texts, show_progress_bar=True).tolist()
    metadatas = [doc["metadata"] for doc in documents]
    # Build IDs from the source file and chunk index so that chunks from
    # different papers do not collide (plain "doc_0", "doc_1", ... would).
    ids = [
        f"{meta.get('source_file', 'doc')}_{meta.get('chunk_index', i)}"
        for i, meta in enumerate(metadatas)
    ]
    collection.add(
        documents=texts,
        embeddings=embeddings,
        metadatas=metadatas,
        ids=ids
    )
    print(f"Indexed {len(documents)} chunks")


index_documents(docs)
```
```python
def retrieve(query, top_k=5):
    """Retrieve the most relevant chunks for a query."""
    query_embedding = embed_model.encode([query]).tolist()
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )
    retrieved = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    ):
        retrieved.append({
            "text": doc,
            "metadata": meta,
            "similarity": 1 - dist  # Convert cosine distance to similarity
        })
    return retrieved


# Example
results = retrieve("What are the main components of the Transformer architecture?")
for r in results:
    print(f"[{r['similarity']:.3f}] {r['metadata'].get('title', 'N/A')}")
    print(f"  {r['text'][:150]}...")
```
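The pipeline lists reranking as an optional step after retrieval. A minimal sketch of that step follows, assuming candidates shaped like the output of `retrieve` (dicts with `text` and `metadata`). The `rerank` and `token_overlap` names are hypothetical, and the toy overlap scorer only stands in for a real relevance model. In practice the score would typically come from a cross-encoder that reads the query and chunk together, for example the `CrossEncoder.predict` method in sentence-transformers.

```python
def rerank(query, candidates, score_fn, top_k=3):
    """Re-order retrieved candidates by a (query, text) relevance score."""
    scored = [{**c, "rerank_score": score_fn(query, c["text"])} for c in candidates]
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:top_k]


def token_overlap(query, text):
    """Toy scorer: fraction of query tokens that appear in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens) if query_tokens else 0.0


# Hypothetical placeholder candidates in the shape returned by retrieve()
candidates = [
    {"text": "positional encoding adds order information", "metadata": {"chunk_index": 0}},
    {"text": "the encoder and decoder use multi-head attention", "metadata": {"chunk_index": 1}},
    {"text": "training used the Adam optimizer", "metadata": {"chunk_index": 2}},
]
reranked = rerank("encoder decoder attention", candidates, token_overlap, top_k=2)
for item in reranked:
    print(f"{item['rerank_score']:.2f}  {item['text']}")
```

Retrieving more candidates than needed (for example `top_k * 4`) and keeping only the best few after reranking is a common pattern, since the reranker is slower but more precise than vector search.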
```python
from rank_bm25 import BM25Okapi


def hybrid_retrieve(query, all_documents, top_k=5, alpha=0.7):
    """Combine dense (semantic) and sparse (keyword) retrieval.

    all_documents is the list of all chunk texts that were indexed.
    Returns a list of (index into all_documents, RRF score) pairs.
    """
    # Dense retrieval (vector similarity)
    dense_results = retrieve(query, top_k=top_k * 2)

    # Sparse retrieval (BM25 keyword matching), lowercased so matching is
    # case-insensitive. The index is rebuilt on every call here; cache it
    # when the corpus is large.
    tokenized_corpus = [doc.lower().split() for doc in all_documents]
    bm25 = BM25Okapi(tokenized_corpus)
    bm25_scores = bm25.get_scores(query.lower().split())
    sparse_top_k = bm25_scores.argsort()[-top_k * 2:][::-1]

    # Map chunk text to its corpus position so both rankings share one ID space
    text_to_index = {text: i for i, text in enumerate(all_documents)}

    # Reciprocal Rank Fusion (RRF), weighted by alpha
    rrf_scores = {}
    k = 60  # RRF constant
    for rank, result in enumerate(dense_results):
        doc_id = text_to_index.get(result["text"])
        if doc_id is None:
            continue  # chunk is not part of all_documents
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + alpha / (k + rank + 1)
    for rank, idx in enumerate(sparse_top_k):
        idx = int(idx)  # NumPy integer -> plain int
        rrf_scores[idx] = rrf_scores.get(idx, 0) + (1 - alpha) / (k + rank + 1)

    # Sort by RRF score and return top-k
    sorted_results = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results[:top_k]
```
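The fusion step in `hybrid_retrieve` can be isolated and checked by hand. With 1-based ranks, each document scores the sum over rankings of `weight / (k + rank)`, which matches the `alpha / (k + rank + 1)` form above with 0-based ranks. The sketch below is a standalone version under that reading. The function name and the chunk IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse ranked lists of document IDs; each list is ordered best-first."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_ranking = ["c3", "c1", "c7"]    # hypothetical IDs from vector search
sparse_ranking = ["c1", "c9", "c3"]   # hypothetical IDs from BM25

weighted = reciprocal_rank_fusion([dense_ranking, sparse_ranking], weights=[0.7, 0.3])
unweighted = reciprocal_rank_fusion([dense_ranking, sparse_ranking])

print([doc_id for doc_id, _ in weighted])
print([doc_id for doc_id, _ in unweighted])
```

With `alpha = 0.7` the dense ranking's first choice (`c3`) wins. With equal weights `c1` comes out on top because it ranks high in both lists. Because only ranks enter the formula, the raw cosine and BM25 scores never need to be normalized against each other.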
```python
def generate_answer(query, retrieved_contexts):
    """Generate an answer with inline citations using an LLM."""
    # Build context string with citation markers
    context_parts = []
    for i, ctx in enumerate(retrieved_contexts, 1):
        source = f"{ctx['metadata'].get('authors', 'Unknown')}, {ctx['metadata'].get('year', 'N/A')}"
        context_parts.append(f"[{i}] ({source}): {ctx['text']}")
    context_string = "\n\n".join(context_parts)

    prompt = f"""Based on the following research paper excerpts, answer the question.
Use inline citations like [1], [2] to reference specific sources.
Only use information from the provided excerpts.
If the excerpts do not contain enough information, say so.

EXCERPTS:
{context_string}

QUESTION: {query}

ANSWER (with inline citations):"""

    # Send to LLM (example with OpenAI)
    # response = openai.chat.completions.create(
    #     model="gpt-4",
    #     messages=[{"role": "user", "content": prompt}],
    #     temperature=0.1
    # )
    # return response.choices[0].message.content
    return prompt  # Return prompt for inspection
```
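Since the prompt asks the model for inline citations like [1] and [2], a cheap post-check can confirm that every cited number points at an excerpt that was actually supplied. The sketch below is one possible check, not part of the original workflow. The `check_citations` name and the sample answer are hypothetical. It verifies citation numbers only, not whether the cited excerpt supports the claim.

```python
import re


def check_citations(answer, num_sources):
    """Return cited source numbers and any that fall outside 1..num_sources."""
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    invalid = [n for n in cited if not 1 <= n <= num_sources]
    return {"cited": cited, "invalid": invalid, "has_citations": bool(cited)}


# Hypothetical model output for a prompt that contained three excerpts
sample_answer = "The model relies on self-attention [1] and drops recurrence [2][4]."
report = check_citations(sample_answer, num_sources=3)
print(report)
```

An answer with an out-of-range citation, or with no citations at all, can be flagged for regeneration or manual review before it reaches the reader.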
| Metric | Measures | Tool |
|--------|----------|------|
| Retrieval precision | Are retrieved chunks relevant? | Manual annotation |
| Retrieval recall | Are all relevant chunks retrieved? | Known-relevant set |
| NDCG | Ranking quality of retrieved results | BEIR benchmark |
| Answer correctness | Is the generated answer factually correct? | Human evaluation |
| Faithfulness | Does the answer only use information from retrieved context? | RAGAS framework |
| Answer relevance | Does the answer address the question? | RAGAS framework |
| Context relevance | Are the retrieved contexts relevant to the question? | RAGAS framework |
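The three retrieval metrics in the table can be computed directly once a set of known-relevant chunks exists for a query. The sketch below uses binary relevance (a chunk is either relevant or not). The function names and chunk IDs are hypothetical, and graded relevance would need a gain term in place of the constant 1.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1 / math.log2(pos + 1)
        for pos, d in enumerate(retrieved[:k], start=1)
        if d in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


retrieved_ids = ["c1", "c4", "c2", "c9", "c5"]   # hypothetical system output
relevant_ids = {"c1", "c2", "c3"}                # hypothetical annotated set

p5 = precision_at_k(retrieved_ids, relevant_ids, 5)
r5 = recall_at_k(retrieved_ids, relevant_ids, 5)
n5 = ndcg_at_k(retrieved_ids, relevant_ids, 5)
print(f"P@5={p5:.2f}  R@5={r5:.2f}  NDCG@5={n5:.3f}")
```

Here two of the three relevant chunks are found, so precision is 2/5 and recall is 2/3. NDCG also penalizes the second hit for appearing at position 3 instead of position 2.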
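```python
# Using RAGAS for automated RAG evaluation
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Prepare evaluation dataset
eval_data = {
    "question": ["What is the Transformer architecture?"],
    "answer": ["The Transformer uses self-attention mechanisms..."],
    "contexts": [["The Transformer model architecture eschews recurrence..."]],
    "ground_truth": ["The Transformer is a neural network architecture..."]
}

# evaluate() expects a dataset object rather than a plain dict.
# Column names can differ between RAGAS releases, so check them against
# the installed version. These metrics call an LLM by default, so API
# credentials must be configured before running.
eval_dataset = Dataset.from_dict(eval_data)

result = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision]
)
print(result)
```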