ranbot-ai/agent-evaluation

Name: agent-evaluation
Author: ranbot-ai

skills/agent-evaluation/SKILL.md

npx skillsauth add ranbot-ai/awesome-skills agent-evaluation

Agent Evaluation

Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring. Even top agents achieve less than 50% on real-world benchmarks.

Capabilities

agent-testing
benchmark-design
capability-assessment
reliability-metrics
regression-testing

Prerequisites

Knowledge: testing methodologies, statistical analysis basics, LLM behavior patterns
Recommended skills: autonomous-agents, multi-agent-orchestration
Required skills: testing-fundamentals, llm-fundamentals

Scope

Does not cover: model training evaluation (loss, perplexity), fairness and bias testing, user experience testing
Boundaries: focus is agent capability and reliability; covers functional and behavioral testing

Ecosystem

Primary tools

AgentBench - Multi-environment benchmark for LLM agents (ICLR 2024)
τ-bench (Tau-bench) - Sierra's real-world agent benchmark
ToolEmu - Risky behavior detection for agent tool use
LangSmith - LLM tracing and evaluation platform

Alternatives

Braintrust - LLM evaluation and monitoring. When: Need production monitoring integration
PromptFoo - Prompt testing framework. When: Focus on prompt-level evaluation

Deprecated

Manual testing only

Patterns

Statistical Test Evaluation

Run tests multiple times and analyze result distributions

When to use: Evaluating stochastic agent behavior
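
To see why a handful of runs is rarely enough, it helps to look at how wide a 95% interval on a pass rate is at small sample sizes. The sketch below is illustrative and not part of the skill's own code: `wilsonInterval` is a hypothetical helper that uses the Wilson score interval, a standard choice for binomial proportions that behaves better than the normal approximation when there are few runs or the pass rate is near 0 or 1.

```typescript
// Wilson score interval for a binomial proportion.
// Hypothetical helper for illustration; the name and signature are assumptions.
function wilsonInterval(passes: number, runs: number, z = 1.96): [number, number] {
    if (runs <= 0) throw new Error("runs must be positive");
    const p = passes / runs;
    const z2 = z * z;
    const denom = 1 + z2 / runs;
    // The interval is centred on a value pulled toward 0.5, not on p itself.
    const center = (p + z2 / (2 * runs)) / denom;
    const half = (z / denom) * Math.sqrt((p * (1 - p)) / runs + z2 / (4 * runs * runs));
    return [Math.max(0, center - half), Math.min(1, center + half)];
}

// Same observed 70% pass rate, increasing number of runs.
for (const runs of [10, 50, 200]) {
    const [lo, hi] = wilsonInterval(Math.round(0.7 * runs), runs);
    console.log(`${runs} runs: [${lo.toFixed(2)}, ${hi.toFixed(2)}]`);
}
```

With 7 passes out of 10 runs the interval spans roughly 0.40 to 0.89, so a single batch of ten runs cannot distinguish a mediocre agent from a strong one. The interval narrows as the run count grows.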

interface TestResult {
    testId: string;
    runId: string;
    passed: boolean;
    score: number;  // 0-1 for partial credit
    latencyMs: number;
    tokensUsed: number;
    output: string;
    expectedBehaviors: string[];
    actualBehaviors: string[];
}

interface StatisticalAnalysis {
    passRate: number;
    confidence95: [number, number];
    meanScore: number;
    stdDevScore: number;
    meanLatency: number;
    p95Latency: number;
    behaviorConsistency: number;
}

class StatisticalEvaluator {
private readonly minRuns = 10;
private readonly confidenceLevel = 0.95;

async evaluateAgent(
    agent: Agent,
    testSuite: TestCase[]
): Promise<EvaluationReport> {
    const results: TestResult[] = [];

    // Run each test multiple times
    for (const test of testSuite) {
        for (let run = 0; run < this.minRuns; run++) {
            const result = await this.runTest(agent, test, run);
            results.push(result);
        }
    }

    // Analyze by test
    const byTest = this.groupByTest(results);
    const testAnalyses = new Map<string, StatisticalAnalysis>();

    for (const [testId, testResults] of byTest) {
        testAnalyses.set(testId, this.analyzeResults(testResults));
    }

    // Overall analysis
    const overall = this.analyzeResults(results);

    return {
        overall,
        byTest: testAnalyses,
        concerns: this.identifyConcerns(testAnalyses),
        recommendations: this.generateRecommendations(testAnalyses)
    };
}

private analyzeResults(results: TestResult[]): StatisticalAnalysis {
    const passes = results.filter(r => r.passed);
    const passRate = passes.length / results.length;

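    // Normal-approximation (Wald) confidence interval for the pass rate.
    // Assumes results is non-empty; the interval is rough for small run counts.
    const z = 1.96;  // 95% confidence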
    const se = Math.sqrt((passRate * (1 - passRate)) / results.length);
    const confidence95: [number, number] = [
        Math.max(0, passRate - z * se),
        Math.min(1, passRate + z * se)
    ];

    const scores = results.map(r => r.score);
    const latencies = results.map(r => r.latencyMs);

    return {
        passRate,
        confidence95,
        meanScore: this.mean(scores),
        stdDevScore: this.stdDev(scores),
        meanLatency: this.mean(latencies),
        p95Latency: this.percentile(latencies, 95),
        behaviorConsistency: this.calculateConsistency(results)
    };
}

private calculateConsistency(results: TestResult[]): number {
    // How consistent are the behaviors across runs?
    if (results.length < 2) return 1;

    const behaviorSets = results.map(r => new Set(r.actualBehaviors));
    let consistencySum = 0;
    let comparisons = 0;

    for (let i = 0; i < behaviorSets.length; i++) {
        for (let j = i + 1; j < behaviorSets.length; j++) {
            const intersection = new Set(
                [...behaviorSets[i]].filter(x => behaviorSets[j].has(x))
            );
            const union = new Set([...behaviorSets[i], ...behaviorSets[j]]);
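            // Jaccard similarity; two empty behavior sets count as identical (avoids 0 / 0).
            consistencySum += union.size === 0 ? 1 : intersection.size / union.size;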
            comparisons++;
        }
    }

    return consistencySum / comparisons;
}

private identifyConcerns(analyses: Map<string, StatisticalAnalysis>): Concern[] {
    const concerns: Concern[] = [];

    for (const [testId,
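
The evaluator above calls `this.mean`, `this.stdDev`, and `this.percentile`, but their bodies are not shown. As a minimal sketch, they could look like the standalone functions below. These are hypothetical stand-ins, not the skill's own implementation: the sample (n - 1) standard deviation and the nearest-rank percentile are assumptions, and other conventions are equally defensible.

```typescript
// Hypothetical stand-ins for the helpers the evaluator references.

// Arithmetic mean; returns 0 for an empty array rather than NaN.
function mean(values: number[]): number {
    if (values.length === 0) return 0;
    return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Sample standard deviation (n - 1 denominator); 0 for fewer than two values.
function stdDev(values: number[]): number {
    if (values.length < 2) return 0;
    const m = mean(values);
    const sumSq = values.reduce((acc, v) => acc + (v - m) ** 2, 0);
    return Math.sqrt(sumSq / (values.length - 1));
}

// Nearest-rank percentile: the smallest value with at least p% of the
// data at or below it. Sorts a copy so the caller's array is untouched.
function percentile(values: number[], p: number): number {
    if (values.length === 0) return 0;
    const sorted = [...values].sort((a, b) => a - b);
    const rank = Math.ceil((p / 100) * sorted.length);
    return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}
```

With only ten runs per test, the nearest-rank 95th percentile is simply the slowest run, which is one more reason to raise the run count before reading much into `p95Latency`.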

4 stars

testing

Updated Apr 15, 2026

Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

npx skillsauth add ranbot-ai/awesome-skills agent-evaluation

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy - Container and dependency vulnerability scanner - Clean (95%)
Semgrep - Static code analysis for vulnerabilities - Clean (95%)
mcp-scan (Snyk) - Model Context Protocol security validation - Clean (95%)
Snyk (dep) - Open source security scanning - Skipped (50%)
Socket.dev - Supply chain security analysis - Skipped (50%)
VirusTotal - Multi-engine malware detection - Skipped (50%)
CrowdStrike - Advanced threat intelligence - Skipped (50%)
OSV-Scanner - Open Source Vulnerability database check - Skipped (50%)
OWASP Dep-Check - Skipped (50%)

Last scanned: Apr 15, 2026, 7:20 AM (21.5s, 1 file scanned)

SKILL.md

name: agent-evaluation
description: Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring. Even top agents achieve less than 50% on real-world benchmarks.
category: Security & Systems
source: antigravity
tags: [ai, agent, llm, workflow, design, document, security, vulnerability, rag, cro]
url: https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/agent-evaluation

Install manually

# Clone the repo
git clone https://github.com/ranbot-ai/awesome-skills.git

# Copy into Claude Code skills folder (global); create the folder first if it does not exist
mkdir -p ~/.claude/skills
cp -r awesome-skills/skills/agent-evaluation ~/.claude/skills/

Repository: ranbot-ai/awesome-skills (4 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT