glebis/rag-eval

Name: rag-eval
Author: glebis

rag-eval/SKILL.md

npx skillsauth add glebis/claude-skills rag-eval

rag-eval

Purpose

Replace the "tweak → squint → swap model → burn credits" loop with a single command that runs a grid of eval variants on the user's gold-set, ranks them by a cost-aware score, and returns structured feedback on architecture, stack, and likely issues. The skill draws on evidence-based RAG practices and learns from the user's past runs.

When to use

Trigger on: "help me test a RAG", "tune my RAG", "my RAG is bad", "compare retrieval prompts", "how do I eval this", "what's the best embedding model for X", "my RAG eval is expensive". Also trigger when the user reports burning OpenRouter / OpenAI credits with no clear signal of improvement.

Prerequisites — gather before running

Collect these from the user before the first sweep. Many are optional with sensible defaults; always confirm the ones that gate cost.

RAG codebase root — path to the repo/module under test.
Gold-set — at least 10 Q&A pairs. If missing, offer to generate a starter gold-set from the user's dataset (LLM-synthesized, human-reviewed). See references/best-practices.md.
Dataset — the corpus the RAG retrieves over.
Budget cap — hard dollar limit per run (default: $2 if the user doesn't specify one). Always confirm before any sweep.
Provider keys — OPENROUTER_API_KEY or OPENAI_API_KEY (read from env).
Vector-store config — collection name, embedding model, chunk size (read from repo; confirm if ambiguous).
Eval history path (optional) — defaults to .rag-eval/history.jsonl in the repo root.
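The prerequisite list above can be captured in a small validation step before any sweep is planned. The following is a minimal sketch, not the skill's actual code: the `Prerequisites` class and its field names are hypothetical, while the thresholds (10 Q&A pairs, $2 default cap, `.rag-eval/history.jsonl`) and the two environment variable names come from the list.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

MIN_GOLD_PAIRS = 10        # minimum gold-set size from the prerequisite list
DEFAULT_BUDGET_USD = 2.0   # default cap when the user gives none


@dataclass
class Prerequisites:
    """Hypothetical container; the field names are illustrative only."""
    repo_root: Path
    gold_set_size: int
    budget_usd: float = DEFAULT_BUDGET_USD
    history_path: Optional[Path] = None

    def resolved_history_path(self) -> Path:
        # Fall back to the documented default under the repo root.
        return self.history_path or self.repo_root / ".rag-eval" / "history.jsonl"

    def problems(self, env=None) -> list:
        """Return human-readable issues to raise with the user before a sweep."""
        env = os.environ if env is None else env
        issues = []
        if self.gold_set_size < MIN_GOLD_PAIRS:
            issues.append(f"gold-set has {self.gold_set_size} pairs, need {MIN_GOLD_PAIRS}")
        if self.budget_usd <= 0:
            issues.append("budget cap must be a positive dollar amount")
        if not (env.get("OPENROUTER_API_KEY") or env.get("OPENAI_API_KEY")):
            issues.append("no provider key found in the environment")
        return issues
```

An empty list from `problems()` means the cost-gating items are in place; the budget still needs explicit user confirmation, which this sketch does not model.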

Workflow

Follow this order. Refer to references/best-practices.md for the canonical checklist and references/evidence-base.md for the research-backed defaults.

Step 0 — (Optional) Ingest a prior iteration session

When the user provides a session ID (Claude Code transcript, skill-studio session, or a Fathom meeting), run the deterministic ingest first — no LLM calls. This extracts only the useful signals (models tried, prompt variants, cost events, eval results) as compact JSON, so the rest of the skill works off a tiny structured bundle instead of a long raw transcript.

python scripts/session_ingest.py <session_id> > /tmp/rag-eval-bundle.json
# or with a direct path:
python scripts/session_ingest.py --path /path/to/transcript.jsonl > /tmp/rag-eval-bundle.json

The bundle includes: models_tried, prompts_tried (hashes only), iterations, total_cost_usd, summary_stats. Feed this into Step 1 — do not paste the raw transcript.

Why this matters: transcripts can be 100k+ tokens of noise. The ingest script does regex extraction only, keeping the LLM budget for the actual audit + sweep planning. This is a hard requirement, not an optimization.
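To illustrate what "regex extraction only" can look like, here is a rough sketch of a deterministic ingest pass. It is not the contents of scripts/session_ingest.py: the transcript field names (`model`, `prompt`, `cost_usd`) and the patterns are assumptions made for the example, and only three of the bundle keys listed above are produced.

```python
import hashlib
import re

# Hypothetical patterns: the real transcript schema is not shown in this document.
MODEL_RE = re.compile(r'"model"\s*:\s*"([^"]+)"')
PROMPT_RE = re.compile(r'"prompt"\s*:\s*"((?:[^"\\]|\\.)*)"')
COST_RE = re.compile(r'"cost_usd"\s*:\s*([0-9]*\.?[0-9]+)')


def ingest(lines):
    """Build a compact bundle from transcript lines without any LLM call."""
    models, prompt_hashes, total = [], [], 0.0
    for line in lines:
        for model in MODEL_RE.findall(line):
            if model not in models:
                models.append(model)
        for prompt in PROMPT_RE.findall(line):
            # Store a short hash only, never the prompt text itself.
            digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
            if digest not in prompt_hashes:
                prompt_hashes.append(digest)
        total += sum(float(c) for c in COST_RE.findall(line))
    return {
        "models_tried": models,
        "prompts_tried": prompt_hashes,
        "total_cost_usd": round(total, 6),
    }
```

Lines that match none of the patterns contribute nothing, which is how the noise in a long transcript is dropped.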

Step 1 — Audit the stack

Read references/best-practices.md and inspect the user's repo + vector-store config. Produce a structured report covering:

Architecture (retrieval type: dense / hybrid / rerank; chunking strategy; prompt structure)
Tech stack (embedding model, LLM, vector store)
Resources (dataset size, gold-set size, prior eval runs)
Risks (known anti-patterns, missing pieces)

Present the report to the user and ask which issues to address first.

Step 2 — Propose a sweep plan

Based on the audit, propose 3–8 variants to test. Keep the grid small on the first run (default: 2 prompts × 2 models × 1 retrieval variant = 4 cells). Estimate cost using gold-set size × variants × avg tokens × provider pricing. Present the cost estimate and wait for user confirmation before running.
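The estimate described above is simple arithmetic, sketched here under stated assumptions: the function names are hypothetical, pricing is expressed per 1,000 tokens, and a single blended token price stands in for separate input and output rates.

```python
def grid_size(n_prompts: int, n_models: int, n_retrieval_variants: int) -> int:
    """Number of cells in the sweep grid."""
    return n_prompts * n_models * n_retrieval_variants


def estimate_sweep_cost(
    gold_set_size: int,
    n_variants: int,
    avg_tokens_per_call: float,
    usd_per_1k_tokens: float,
) -> float:
    """gold-set size x variants x avg tokens x provider pricing."""
    calls = gold_set_size * n_variants
    return calls * avg_tokens_per_call / 1000.0 * usd_per_1k_tokens


if __name__ == "__main__":
    cells = grid_size(2, 2, 1)  # the default first-run grid
    # Token count and price below are placeholders, not real provider rates.
    print(cells, estimate_sweep_cost(10, cells, 1500, 0.002))
```

The result is what gets shown to the user for confirmation; real provider pricing usually differs between input and output tokens, so a fuller estimate would split the two.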

Step 3 — Run the sweep

Use scripts/eval_sweep.py (see the script header for invocation). It reads a config YAML, runs each variant against the gold-set, records per-variant cost and answer quality, and appends to history.jsonl.

Guardrails:

Never exceed the budget cap — halt mid-sweep if reached.
Never mutate the user's repo. Write all artifacts under .rag-eval/ (gitignore it).
If a sweep's estimated cost exceeds the user's cap, stop and confirm with the user before running it.
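One way to enforce the budget guardrail is to check the running total before each cell rather than after it. This is a sketch only, not the logic of scripts/eval_sweep.py: `estimate_cost` and `run_variant` are hypothetical callables supplied by the caller.

```python
def run_with_budget(variants, estimate_cost, run_variant, budget_usd):
    """Run variants in order and halt before the cap would be crossed.

    estimate_cost(variant) -> float            (hypothetical callable)
    run_variant(variant) -> (quality, cost)    (hypothetical callable)
    """
    spent = 0.0
    results = []
    halted = False
    for variant in variants:
        # Check before running so the next cell cannot push spend over the cap.
        if spent + estimate_cost(variant) > budget_usd:
            halted = True
            break
        quality, cost = run_variant(variant)
        spent += cost
        results.append({"variant": variant, "quality": quality, "cost_usd": cost})
    return results, spent, halted
```

The guard is only as good as the per-cell estimate: if a cell costs more than estimated, spend can still overshoot, so a conservative estimate is the safer choice.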

Step 4 — Rank and report

After the sweep, rank variants by a cost-aware score: quality × (1 / log(1 + cost)). Present:

Top 3 variants with quality metrics and cost
What changed vs the previous best
Concrete next experiment to try

Write the full report to .rag-eval/reports/<timestamp>.md.
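The ranking formula above can be sketched as follows. Two assumptions are made here that the text does not state: the logarithm is taken as the natural log, and each result is a dict with `quality` and `cost_usd` keys. Because log(1 + cost) is zero at zero cost, the sketch rejects non-positive costs instead of dividing by zero.

```python
import math


def cost_aware_score(quality: float, cost_usd: float) -> float:
    """quality x (1 / log(1 + cost)), using the natural log (an assumption)."""
    if cost_usd <= 0:
        raise ValueError("score is undefined for zero or negative cost")
    return quality / math.log1p(cost_usd)


def top_variants(results, k=3):
    """Return the k best results, highest cost-aware score first."""
    return sorted(
        results,
        key=lambda r: cost_aware_score(r["quality"], r["cost_usd"]),
        reverse=True,
    )[:k]
```

Note that the score grows without bound as cost approaches zero, so very cheap variants can outrank noticeably better but pricier ones; it may be worth showing raw quality alongside the score in the report.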

Step 5 — Self-improve

Before each subsequent run, read history.jsonl and factor in what the user has already tried. Avoid re-testing rejected variants. Surface patterns ("models A, B, C all underperformed on multi-hop queries — next try a reranker").
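Skipping variants that were already tried can be done with a small filter over the history file. The record layout of history.jsonl is not documented here, so this sketch assumes each line is a JSON object with a `variant` dict; the function names are hypothetical.

```python
import json


def variant_key(variant: dict) -> str:
    """Stable key for a variant, independent of dict key order."""
    return json.dumps(variant, sort_keys=True)


def load_tested(history_lines) -> set:
    """Collect keys of variants already present in the history."""
    seen = set()
    for line in history_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        record = json.loads(line)
        seen.add(variant_key(record["variant"]))  # "variant" field is assumed
    return seen


def untested(candidates, history_lines):
    """Keep only candidate variants that do not appear in the history."""
    seen = load_tested(history_lines)
    return [c for c in candidates if variant_key(c) not in seen]
```

In use, `history_lines` would come from opening the history file and iterating over it; filtering only on exact matches means a variant with any changed parameter is treated as new.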

Reusable resources

scripts/eval_sweep.py — grid-search runner. Reads eval_config.yaml, writes results to history.jsonl.
references/best-practices.md — evidence-based RAG checklist the agent uses as an anchor.
references/evidence-base.md — pointers to recent RAG research and when each technique helps.
assets/eval_config.template.yaml — starter config to copy into the user's repo.
assets/gold_set.template.jsonl — 3 example Q&A pairs to show the gold-set format.

Notes

Cost is the main failure mode. Never run without a confirmed budget. Err on the side of smaller sweeps; users can always run again.
No repo mutation. All outputs go under .rag-eval/ in the target repo.
When uncertain about best practices, do web research. Use tavily-search or firecrawl-research to pull current evidence, then synthesize into the audit report.
Defer to the user. Before changing any file in the target repo, always confirm.

Iterate on RAG systems with structured evals instead of eyeballing. This skill should be used when the user is tuning a RAG pipeline — changing retrieval prompts, swapping models, adjusting chunking, or debugging poor answers — and wants a cheap, ranked set of experiments with cost tracking and structured feedback on the stack. Also use when the user asks "how do I know if my RAG is working?", "this RAG eval is burning money", or "what should I try next on retrieval?".

173 stars

development

Updated May 10, 2026

$ install --global

skillsauth

npx skillsauth add glebis/claude-skills rag-eval

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners in report

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: May 10, 2026, 7:20 AM · 153.7s · 2 files scanned

Install manually

# Clone the repo
git clone https://github.com/glebis/claude-skills.git

# Copy into Claude Code skills folder (global)
cp -r claude-skills/rag-eval ~/.claude/skills/

Repository

glebis/claude-skills

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT