rag-eval/SKILL.md
Iterate on RAG systems with structured evals instead of eyeballing. This skill should be used when the user is tuning a RAG pipeline — changing retrieval prompts, swapping models, adjusting chunking, or debugging poor answers — and wants a cheap, ranked set of experiments with cost tracking and structured feedback on the stack. Also use when the user asks "how do I know if my RAG is working?", "this RAG eval is burning money", or "what should I try next on retrieval?".
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add glebis/claude-skills rag-eval
Replace the "tweak → squint → swap model → burn credits" loop with a single command that runs a grid of eval variants on the user's gold-set, ranks them by a cost-aware score, and returns structured feedback on architecture, stack, and likely-issues. Draws on evidence-based RAG practices and learns from the user's past runs.
Trigger on: "help me test a RAG", "tune my RAG", "my RAG is bad", "compare retrieval prompts", "how do I eval this", "what's the best embedding model for X", "my RAG eval is expensive". Also trigger when the user reports burning OpenRouter / OpenAI credits with no clear signal of improvement.
Collect these from the user before the first sweep. Many are optional with sensible defaults; always confirm the ones that gate cost.
- references/best-practices.md.
- OPENROUTER_API_KEY or OPENAI_API_KEY (read from env).
- .rag-eval/history.jsonl in the repo root.

Follow this order. Refer to references/best-practices.md for the canonical checklist and references/evidence-base.md for the research-backed defaults.
When the user provides a session ID (Claude Code transcript, skill-studio session, or a Fathom meeting), run the deterministic ingest first — no LLM calls. This extracts only the useful signals (models tried, prompt variants, cost events, eval results) as compact JSON, so the rest of the skill works off a tiny structured bundle instead of a long raw transcript.
python scripts/session_ingest.py <session_id> > /tmp/rag-eval-bundle.json
# or with a direct path:
python scripts/session_ingest.py --path /path/to/transcript.jsonl > /tmp/rag-eval-bundle.json
The bundle includes: models_tried, prompts_tried (hashes only), iterations, total_cost_usd, summary_stats. Feed this into Step 1 — do not paste the raw transcript.
Why this matters: transcripts can be 100k+ tokens of noise. The ingest script does regex extraction only, keeping the LLM budget for the actual audit + sweep planning. This is a hard requirement, not an optimization.
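Before handing the bundle to the audit step, it can help to confirm that the fields listed above are present. The following is a minimal sketch: the key names come from the list above, while the helper name `missing_bundle_keys` is hypothetical and the value types inside the bundle are not assumed.

```python
import json

# Field names taken from the bundle description; their value types are not assumed.
EXPECTED_KEYS = {
    "models_tried",
    "prompts_tried",
    "iterations",
    "total_cost_usd",
    "summary_stats",
}


def missing_bundle_keys(bundle: dict) -> list[str]:
    """Return the expected bundle fields that are absent, in sorted order."""
    return sorted(EXPECTED_KEYS - bundle.keys())


def load_bundle(path: str) -> dict:
    """Load the ingest output and fail early if a field is missing."""
    with open(path, encoding="utf-8") as f:
        bundle = json.load(f)
    missing = missing_bundle_keys(bundle)
    if missing:
        raise ValueError(f"bundle is missing fields: {missing}")
    return bundle
```

Calling `load_bundle("/tmp/rag-eval-bundle.json")` after the ingest command would then either return the parsed bundle or stop with a list of the absent fields.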
Read references/best-practices.md and inspect the user's repo + vector-store config. Produce a structured report covering:
Present the report to the user and ask which issues to address first.
Based on the audit, propose 3–8 variants to test. Keep the grid small on the first run (default: 2 prompts × 2 models × 1 retrieval variant = 4 cells). Estimate cost using gold-set size × variants × avg tokens × provider pricing. Present the cost estimate and wait for user confirmation before running.
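The cost estimate described above can be sketched as a small function. This is an illustration under stated assumptions, not the logic inside scripts/eval_sweep.py: the function name, the one-call-per-gold-set-item assumption, and the per-1k-token price table are all hypothetical, and the prices in the usage example are placeholders rather than real provider rates.

```python
from itertools import product


def estimate_sweep_cost(
    gold_set_size: int,
    prompts: list[str],
    models: list[str],
    retrieval_variants: list[str],
    avg_tokens_per_call: float,
    usd_per_1k_tokens: dict[str, float],
) -> tuple[int, float]:
    """Return (number of grid cells, estimated total cost in USD).

    Assumes one model call per gold-set item per cell and a flat
    per-1k-token price per model (both are simplifications).
    """
    cells = list(product(prompts, models, retrieval_variants))
    total = 0.0
    for _prompt, model, _retrieval in cells:
        tokens = gold_set_size * avg_tokens_per_call
        total += tokens / 1000 * usd_per_1k_tokens[model]
    return len(cells), total


# Default first-run grid: 2 prompts x 2 models x 1 retrieval variant = 4 cells.
cells, cost = estimate_sweep_cost(
    gold_set_size=50,
    prompts=["prompt-a", "prompt-b"],
    models=["model-a", "model-b"],  # hypothetical model names
    retrieval_variants=["dense"],
    avg_tokens_per_call=2000,
    usd_per_1k_tokens={"model-a": 0.001, "model-b": 0.01},  # placeholder prices
)
print(f"{cells} cells, estimated ${cost:.2f}")
```

Showing the cell count next to the dollar figure makes it easier for the user to see which axis of the grid drives the cost before confirming the run.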
Use scripts/eval_sweep.py (see the script header for invocation). It reads a config YAML, runs each variant against the gold-set, records per-variant cost and answer quality, and appends to history.jsonl.
Guardrails:
- .rag-eval/ (gitignore it).

After the sweep, rank variants by a cost-aware score: quality × (1 / log(1 + cost)). Present:
Write the full report to .rag-eval/reports/<timestamp>.md.
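The cost-aware score used for ranking, quality × (1 / log(1 + cost)), can be sketched as follows. The formula is undefined when cost is exactly zero, so this sketch clamps cost to a small floor. That clamp, the natural-log base, and the `quality` and `cost_usd` field names are assumptions made for illustration, not behavior confirmed by the skill.

```python
import math


def cost_aware_score(quality: float, cost_usd: float, min_cost: float = 1e-6) -> float:
    """quality * (1 / log(1 + cost)), using the natural log.

    Cost is clamped to `min_cost` because log(1 + 0) = 0 would divide by zero
    (the clamp is an assumption of this sketch).
    """
    return quality / math.log1p(max(cost_usd, min_cost))


def rank_variants(results: list[dict]) -> list[dict]:
    """Sort variant results from best to worst cost-aware score."""
    return sorted(
        results,
        key=lambda r: cost_aware_score(r["quality"], r["cost_usd"]),
        reverse=True,
    )


# Hypothetical results, for illustration only.
results = [
    {"name": "a", "quality": 0.80, "cost_usd": 2.0},
    {"name": "b", "quality": 0.75, "cost_usd": 0.5},
    {"name": "c", "quality": 0.90, "cost_usd": 10.0},
]
for r in rank_variants(results):
    print(r["name"], round(cost_aware_score(r["quality"], r["cost_usd"]), 3))
```

In this example the cheapest variant ranks first despite having the lowest quality, which illustrates how strongly the score rewards low cost when costs fall below a few dollars.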
Before each subsequent run, read history.jsonl and factor in what the user has already tried. Avoid re-testing rejected variants. Surface patterns ("models A, B, C all underperformed on multi-hop queries — next try a reranker").
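Reading the history and skipping variants the user has already rejected might look like the sketch below. The record schema is an assumption: the `variant_id` and `status` fields are hypothetical names, and the actual layout written by scripts/eval_sweep.py may differ.

```python
import json
from pathlib import Path


def load_history(path: str) -> list[dict]:
    """Read a JSON Lines history file; return an empty list if it does not exist."""
    history_path = Path(path)
    if not history_path.exists():
        return []
    runs = []
    with history_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                runs.append(json.loads(line))
    return runs


def filter_untested(candidates: list[str], history: list[dict]) -> list[str]:
    """Drop candidates whose variant_id was marked rejected in a past run.

    `variant_id` and `status` are hypothetical field names.
    """
    rejected = {
        run["variant_id"]
        for run in history
        if run.get("status") == "rejected" and "variant_id" in run
    }
    return [c for c in candidates if c not in rejected]


history = load_history(".rag-eval/history.jsonl")
todo = filter_untested(["prompt-a/model-a", "prompt-b/model-a"], history)
```

Keeping the filter separate from the loader lets the same history list also feed pattern detection across runs.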
- scripts/eval_sweep.py: grid-search runner. Reads eval_config.yaml, writes results to history.jsonl.
- references/best-practices.md: evidence-based RAG checklist the agent uses as an anchor.
- references/evidence-base.md: pointers to recent RAG research and when each technique helps.
- assets/eval_config.template.yaml: starter config to copy into the user's repo.
- assets/gold_set.template.jsonl: 3 example Q&A pairs to show the gold-set format.
- .rag-eval/ in the target repo.
- tavily-search or firecrawl-research to pull current evidence, then synthesize into the audit report.