samarv/ai-error-analysis-and-eval-design

Name: ai-error-analysis-and-eval-design
Author: samarv

.claude/skills/ai-error-analysis-and-eval-design/SKILL.md

To build great AI products, you must transition from subjective "vibe checks" to systematic measurement. This process identifies exactly where an LLM is failing and creates a feedback loop for continuous improvement.

Phase 1: Open Coding (The "Benevolent Dictator" Phase)

Before automating, you must manually ground yourself in the data. Appoint one "Benevolent Dictator"—typically the Product Manager or domain expert—to define "good" taste.

1. Sample the Data: Extract 50–100 "traces" (logs of full LLM interactions) from your observability tool (e.g., Braintrust, LangSmith, Phoenix).
2. Note the Upstream Error: Read each trace. If something is wrong, write a brief, informal note (an "Open Code") describing the first thing that went wrong.
   - Rule: Don't overthink it. Use specific language (e.g., "hallucinated virtual tour," "didn't confirm call transfer") rather than just "bad."
3. Stop at Saturation: Continue until you stop learning new ways the system fails (Theoretical Saturation).
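
The sampling and stopping steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original skill: the `Review` record, the `new_failure` flag (the reviewer's own call on whether a trace revealed a failure not seen before), and the 20-trace window are all assumptions chosen for the example.

```python
import random
from dataclasses import dataclass

@dataclass
class Review:
    trace_id: str
    open_code: str     # informal note on the first thing that went wrong ("" if nothing)
    new_failure: bool  # reviewer's judgment: did this trace show a failure not seen before?

def sample_traces(trace_ids, k=100, seed=0):
    """Pick up to k trace ids uniformly at random; the seed makes the sample repeatable."""
    ids = list(trace_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))

def reached_saturation(reviews, window=20):
    """Heuristic stop rule (an assumption): no new failure in the last `window` reviews."""
    if len(reviews) < window:
        return False
    return not any(r.new_failure for r in reviews[-window:])
```

The reading and note-taking stay manual; the code only picks which traces to read and signals when recent reviews have stopped surfacing anything new.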

Phase 2: Axial Coding (Categorization)

Synthesize your mess of notes into actionable categories using an LLM.

1. Export Notes: Put your open codes into a CSV or spreadsheet.
2. Synthesize Failure Modes: Use an LLM (Claude or ChatGPT) to group your notes into 5–7 "Axial Codes" (failure categories).
   - Prompt Pattern: "Analyze these manual notes from AI traces and group them into actionable failure categories (Axial Codes). Each category should represent a specific product problem."
3. Map Back: Use a spreadsheet formula or LLM to categorize every trace into one of these buckets.
4. Prioritize: Create a pivot table to count the frequency of each category. Focus your engineering efforts on the highest-frequency or highest-risk buckets.
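
If the labelled traces live outside a spreadsheet, the pivot-table count from the Prioritize step can be reproduced with the standard library. The sketch below assumes the "Map Back" step produced `(trace_id, axial_code)` pairs; the sample rows are made up for illustration.

```python
from collections import Counter

# Hypothetical output of the "Map Back" step: one axial code per trace.
labelled = [
    ("t1", "Capability Misrepresentation"),
    ("t2", "Handoff Protocol Violation"),
    ("t3", "Capability Misrepresentation"),
    ("t4", "Handoff Protocol Violation"),
    ("t5", "Handoff Protocol Violation"),
]

def rank_failure_modes(pairs):
    """Return (axial_code, count, share_of_total) tuples, most frequent first."""
    counts = Counter(code for _, code in pairs)
    total = sum(counts.values())
    return [(code, n, n / total) for code, n in counts.most_common()]

for code, n, share in rank_failure_modes(labelled):
    print(f"{code}: {n} ({share:.0%})")
```

Frequency is only half of the prioritization rule: a rare but high-risk bucket still needs a manual bump up the list.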

Phase 3: Build the "LLM as Judge"

For complex, subjective failures (like "human handoff quality"), create an automated evaluator.

1. Write the Judge Prompt: Create a separate prompt for an LLM whose only job is to evaluate one specific failure mode.
2. Enforce Binary Scoring: Require the judge to output only True or False.
   - Note: Avoid 1–5 or 1–10 scales. They result in "weasel" metrics (e.g., a score of 3.7) that provide no clear direction for improvement.
3. Define Rules: Include specific criteria from your "Benevolent Dictator" notes.
   - Example: "Output True if the user explicitly asked for a human and the assistant responded with a tool call without acknowledging the request."
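
One way to wire up such a judge is sketched below. The prompt text reuses the example rule above; everything else is an assumption for illustration, in particular `call_llm`, a placeholder for whatever model client you use (any function that takes a prompt string and returns the model's text reply).

```python
from typing import Callable

# The rule is the example criterion from the text; the surrounding wording is illustrative.
JUDGE_PROMPT = """You are evaluating one failure mode: handoff protocol violation.
Output True if the user explicitly asked for a human and the assistant responded
with a tool call without acknowledging the request. Otherwise output False.
Answer with exactly one word: True or False.

Trace:
{trace}
"""

def parse_verdict(raw: str) -> bool:
    """Map the judge's reply to a bool and reject anything that is not True/False."""
    answer = raw.strip().strip(".").lower()
    if answer == "true":
        return True
    if answer == "false":
        return False
    raise ValueError(f"Judge returned a non-binary answer: {raw!r}")

def judge_trace(trace: str, call_llm: Callable[[str], str]) -> bool:
    """Evaluate one trace. `call_llm` is a hypothetical stand-in for your model client."""
    return parse_verdict(call_llm(JUDGE_PROMPT.format(trace=trace)))
```

Raising on a non-binary reply, rather than coercing it, keeps scale-style answers such as "3.7" from leaking into the metric.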

Phase 4: Alignment & Validation

Never ship an eval until you know the judge matches human judgment.

1. Create an Agreement Matrix: Compare the Judge's True/False labels against your manual labels from Phase 1.
2. Review Mismatches: Specifically look at:
   - False Positives: Judge said error, Human said no error.
   - False Negatives: Human said error, Judge said no error.
3. Iterate: Refine the Judge's prompt until its labels agree with the "Benevolent Dictator's" manual labels on at least 80–90% of traces.
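
The agreement matrix is a two-by-two count of judge labels against human labels. A minimal sketch, assuming both are lists of booleans in the same trace order, with True meaning "error" as in Phase 3:

```python
def agreement_matrix(human, judge):
    """Count the four judge-vs-human outcomes and the overall agreement rate."""
    if len(human) != len(judge):
        raise ValueError("label lists must be the same length")
    cells = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for h, j in zip(human, judge):
        if j and h:
            cells["tp"] += 1   # both flagged an error
        elif j and not h:
            cells["fp"] += 1   # judge said error, human said no error
        elif h and not j:
            cells["fn"] += 1   # human said error, judge said no error
        else:
            cells["tn"] += 1   # both said no error
    cells["agreement"] = (cells["tp"] + cells["tn"]) / len(human) if human else 0.0
    return cells

human = [True, True, False, False, False]
judge = [True, False, False, True, False]
print(agreement_matrix(human, judge))
```

When errors are rare, the overall agreement rate can look high even if the judge misses most of them, so the false-positive and false-negative counts are worth reading on their own.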

Examples

Example 1: Real Estate AI Assistant

Context: AI is supposed to book apartment tours.
Open Code: "AI told the user a virtual tour was available when the property only offers in-person tours."
Axial Code: "Capability Misrepresentation."
Judge Logic: "Check the 'Property Context' tool output. If 'virtual_tour' is False, but the LLM response contains 'virtual tour,' output True (Error)."
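
Because this judge logic is a literal comparison, it can also run as a plain code check rather than an LLM call. The sketch below assumes the 'Property Context' tool output is available as a dict with a boolean `virtual_tour` key; treating a missing key as "no virtual tour" is an assumption as well.

```python
def misrepresents_virtual_tour(property_context: dict, response: str) -> bool:
    """Return True (error) if the reply mentions a virtual tour the property lacks."""
    # Assumption: a missing "virtual_tour" key means the property does not offer one.
    offers_virtual = bool(property_context.get("virtual_tour", False))
    mentions_virtual = "virtual tour" in response.lower()
    return mentions_virtual and not offers_virtual

print(misrepresents_virtual_tour(
    {"virtual_tour": False},
    "Great news, a virtual tour is available this week!",
))
```

A substring match follows the stated rule literally, so a reply such as "we don't offer a virtual tour" would also be flagged. Mismatches like that should surface during the Phase 4 review.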

Example 2: Customer Support Handoff

Context: AI should hand off to a human for sensitive issues.
Open Code: "User said they were frustrated with a leak, AI just gave a generic maintenance link."
Axial Code: "Handoff Protocol Violation."
Judge Logic: "Search for sentiment indicating frustration or emergency. If found, did the AI offer a human transfer? If no, output True (Error)."

Common Pitfalls

Likert Scales: Using 1–5 scales makes it impossible to know if a change in score is meaningful. Use binary True/False.
Automating Too Early: Do not let an LLM do the initial "Open Coding." It lacks the product context to know what "janky" looks like for your specific business.
Committee Judging: Don't use a committee to define "good." Appoint one person with the best domain taste to be the final arbiter (The Benevolent Dictator).
Chasing Generic Metrics: Don't rely on generic evals like "hallucination score" or "cosine similarity." They rarely correlate with product-specific success.

A systematic workflow to move AI products beyond "vibe checks" by identifying specific failure modes and building automated LLM judges. Use this when your AI outputs feel "janky," when you need a feedback signal for prompt engineering, or when monitoring production performance at scale.

22 stars

development

Updated Apr 16, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add samarv/Shanon ai-error-analysis-and-eval-design

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

- Trivy: Container and dependency vulnerability scanner. Clean, 95%.
- Semgrep: Static code analysis for vulnerabilities. Clean, 95%.
- mcp-scan (Snyk): Model Context Protocol security validation. Clean, 95%.
- Snyk (dep): Open source security scanning. Skipped, 50%.
- Socket.dev: Supply chain security analysis. Skipped, 50%.
- VirusTotal: Multi-engine malware detection. Skipped, 50%.
- CrowdStrike: Advanced threat intelligence. Skipped, 50%.
- OSV-Scanner: Open Source Vulnerability database check. Skipped, 50%.
- OWASP Dep-Check: Skipped, 50%.

Last scanned: Apr 16, 2026, 3:15 PM. 5.1s. 1 file scanned.

Related Skills

samarv/pptx (documentation)
Presentation creation, editing, and analysis. When Claude needs to work with presentations (.pptx files) for: (1) Creating new presentations, (2) Modifying or editing content, (3) Working with layouts, (4) Adding comments or speaker notes, or any other presentation tasks
Updated Apr 16, 2026

samarv/power-progression-strategy (development)
A framework to identify and develop sustainable competitive advantages (Power) based on a company's lifecycle stage. Use this when drafting a product strategy, evaluating business model durability, or distinguishing between "operational excellence" and true competitive moats.
Updated Apr 16, 2026

samarv/.claude/skills/podcast-launch-and-growth-engine (development)
A framework for launching and scaling a podcast based on topic validation, ranking momentum, and lean production. Use this skill when starting a new content channel, choosing a niche, or designing a listener acquisition strategy. This framework leverages Chris Hutchins' "All the Hacks" methodology to move from an idea to the top 5% of active podcasts through strategic validation, momentum-based launching, and high-efficiency di…
Updated Apr 16, 2026

samarv/pmf-threshold-gating (development)
A high-bar framework for measuring and achieving product-market fit (PMF) before scaling. Use this when validating a new product line, deciding if a beta is ready for a general release, or diagnosing why a product isn't generating organic word-of-mouth growth.
Updated Apr 16, 2026


Install manually

# Clone the repo
git clone https://github.com/samarv/Shanon.git

# Copy into Claude Code skills folder (global)
cp -r Shanon/.claude/skills/ai-error-analysis-and-eval-design ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository: samarv/Shanon

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT