.claude/skills/ai-eval-design-and-iteration/SKILL.md
Develop "quizzes" (evals) to measure model performance on specific tasks. Use these benchmarks to guide fine-tuning, determine product UX patterns, and track performance improvements over time. Use this when launching a new AI feature, switching between model versions, or optimizing for high-stakes accuracy.
npx skillsauth add samarv/Shanon ai-eval-design-and-iterationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
In traditional software, inputs and outputs are defined. In AI, inputs and outputs are fuzzy. Evals (evaluations) are the "unit tests" for AI products. They allow you to move from "vibes-based" development to metric-driven iteration. By building a rigorous "quiz" for your model, you can determine exactly how capable your product is and where it requires human-in-the-loop scaffolding.
Don't start with generic benchmarks (like MMLU). Instead, define the specific "hero" scenarios your product must master.
Create a set of tests to gauge how well the model knows the subject material.
Use the eval scores to guide your development cycle.
The "score" of your eval dictates the product's user interface. Kevin Weil's 60/95/99 Rule:
Example 1: Deep Research Tool
Example 2: Customer Support Agent
documentation
Presentation creation, editing, and analysis. When Claude needs to work with presentations (.pptx files) for: (1) Creating new presentations, (2) Modifying or editing content, (3) Working with layouts, (4) Adding comments or speaker notes, or any other presentation tasks
development
A framework to identify and develop sustainable competitive advantages (Power) based on a company's lifecycle stage. Use this when drafting a product strategy, evaluating business model durability, or distinguishing between "operational excellence" and true competitive moats.
development
```yaml --- name: podcast-launch-and-growth-engine description: A framework for launching and scaling a podcast based on topic validation, ranking momentum, and lean production. Use this skill when starting a new content channel, choosing a niche, or designing a listener acquisition strategy. --- This framework leverages Chris Hutchins' "All the Hacks" methodology to move from an idea to the top 5% of active podcasts through strategic validation, momentum-based launching, and high-efficiency di
development
A high-bar framework for measuring and achieving product-market fit (PMF) before scaling. Use this when validating a new product line, deciding if a beta is ready for a general release, or diagnosing why a product isn't generating organic word-of-mouth growth.