
google-gemini/behavioral-evals

Name: behavioral-evals
Author: google-gemini

.gemini/skills/behavioral-evals/SKILL.md

npx skillsauth add google-gemini/gemini-cli behavioral-evals


Behavioral Evals

Overview

Behavioral evaluations (evals) are tests that validate the agent's decision-making (e.g., tool choice) rather than pure functionality. They are critical for verifying prompt changes, debugging steerability, and preventing regressions.

[!NOTE] Single Source of Truth: For core concepts, policies, running tests, and general best practices, always refer to evals/README.md.

🔄 Workflow Decision Tree

1. Does a prompt/tool change need validation?
   - No -> Use normal integration tests.
   - Yes -> Continue below.
2. Is it UI/Interaction heavy?
   - Yes -> Use appEvalTest (AppRig). See creating.md.
   - No -> Use evalTest (TestRig). See creating.md.
3. Is it a new test?
   - Yes -> Set policy to USUALLY_PASSES.
   - No -> Set policy to ALWAYS_PASSES (locks in the behavior as a regression test).
4. Are you fixing a failure or promoting a test?
   - Fixing -> See fixing.md.
   - Promoting -> See promoting.md.
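
The first three questions of the decision tree can be read as a small pure function. The sketch below is only an illustration of that logic: the names appEvalTest, evalTest, USUALLY_PASSES, and ALWAYS_PASSES come from this guide, while `ChangeInfo`, `Plan`, and `planEval` are hypothetical helpers that do not exist in the repository.

```typescript
// Hypothetical types for illustration; only the harness and policy names
// are taken from the guide.
type Harness = "appEvalTest" | "evalTest";
type Policy = "USUALLY_PASSES" | "ALWAYS_PASSES";

interface ChangeInfo {
  needsBehavioralValidation: boolean; // Does a prompt/tool change need validation?
  uiOrInteractionHeavy: boolean; // Is it UI/Interaction heavy?
  isNewTest: boolean; // Is it a new test?
}

type Plan =
  | { kind: "integration-test" }
  | { kind: "behavioral-eval"; harness: Harness; policy: Policy };

// Walks the decision tree top to bottom and returns the suggested setup.
function planEval(change: ChangeInfo): Plan {
  if (!change.needsBehavioralValidation) {
    return { kind: "integration-test" };
  }
  const harness: Harness = change.uiOrInteractionHeavy ? "appEvalTest" : "evalTest";
  const policy: Policy = change.isNewTest ? "USUALLY_PASSES" : "ALWAYS_PASSES";
  return { kind: "behavioral-eval", harness, policy };
}
```

For example, a new, non-UI test for a prompt change maps to evalTest with USUALLY_PASSES.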

📋 Quick Checklist

1. Setup Workspace

Seed the workspace with the files the scenario needs, using the files object, so that the agent works in a realistic project (e.g., a Node.js project with a package.json).

Details in creating.md
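
As a rough picture of what a seeded workspace might look like, the sketch below assumes the files object is a map from workspace-relative path to file contents. That shape, along with the `WorkspaceFiles` type, the `listSeededPaths` helper, and the example file contents, is an assumption for illustration; creating.md describes the actual option.

```typescript
// Assumed shape: workspace-relative path -> file contents.
// The real `files` option may differ; see creating.md.
type WorkspaceFiles = Record<string, string>;

// A minimal Node.js project: a package.json plus one source file.
const nodeProjectFiles: WorkspaceFiles = {
  "package.json": JSON.stringify(
    { name: "example-app", version: "1.0.0", scripts: { test: "vitest run" } },
    null,
    2,
  ),
  "src/index.js": "export const add = (a, b) => a + b;\n",
};

// Lists the seeded paths in a stable order, handy when logging a fixture.
function listSeededPaths(files: WorkspaceFiles): string[] {
  return Object.keys(files).sort();
}
```

Keeping the fixture as plain data makes it easy to reuse the same scenario across several evals.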

2. Write Assertions

Audit agent decisions using rig.setBreakpoint() (AppRig only) or index verification on rig.readToolLogs().

Details in creating.md
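
One plausible reading of index verification is checking the order of tool calls by their position in the log. The sketch below works on a plain array and assumes each entry has a `toolName` field; that shape, the tool names `read_file` and `edit_file`, and both helper functions are hypothetical, and the real return type of rig.readToolLogs() may differ (see creating.md).

```typescript
// Hypothetical log entry shape, for illustration only.
interface ToolLogEntry {
  toolName: string;
}

// Returns the index of the first call to `toolName`, or -1 if it was never called.
function firstCallIndex(logs: ToolLogEntry[], toolName: string): number {
  return logs.findIndex((entry) => entry.toolName === toolName);
}

// True only if both tools were called and `earlier` first appears before `later`.
function calledBefore(logs: ToolLogEntry[], earlier: string, later: string): boolean {
  const a = firstCallIndex(logs, earlier);
  const b = firstCallIndex(logs, later);
  return a !== -1 && b !== -1 && a < b;
}

// Example log: the agent read a file before editing it.
const sampleLogs: ToolLogEntry[] = [
  { toolName: "read_file" },
  { toolName: "edit_file" },
];
```

In an eval, a check such as `calledBefore(sampleLogs, "read_file", "edit_file")` would sit inside the test's assertion, so the test fails when the agent edits before reading.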

3. Verify

Run single tests locally with Vitest. Confirm stability locally before relying on CI workflows.

See evals/README.md for running commands.
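
Confirming stability usually means running the same test several times and looking at how often it passes. The helper below is a hypothetical sketch of that bookkeeping, not part of the eval framework: `passRate` and its parameters are invented here, the check is synchronous for brevity, and any threshold for comparison should come from evals/README.md or promoting.md rather than from this example.

```typescript
// Runs `attempt` the given number of times and returns the fraction that passed.
// A thrown error and a `false` result both count as a failed run.
function passRate(attempt: () => boolean, runs: number): number {
  if (!Number.isInteger(runs) || runs <= 0) {
    throw new RangeError("runs must be a positive integer");
  }
  let passed = 0;
  for (let i = 0; i < runs; i++) {
    try {
      if (attempt()) passed++;
    } catch {
      // Treat exceptions as failed runs.
    }
  }
  return passed / runs;
}
```

A test that passes on only some local runs is a candidate for further investigation before CI results are trusted.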

📦 Bundled Resources

Detailed procedural guides:

- creating.md: Assertion strategies, Rig selection, Mock MCPs.
- fixing.md: Step-by-step automated investigation, architecture diagnosis guidelines.
- promoting.md: Candidate identification criteria and threshold guidelines.


Guidance for creating, running, fixing, and promoting behavioral evaluations. Use when verifying agent decision logic, debugging failures, debugging prompt steering, or adding workspace regression tests.

100,284 stars

development

Updated Apr 5, 2026


Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each entry below.

- Trivy: Clean, 95%. Container and dependency vulnerability scanner.
- Semgrep: Clean, 95%. Static code analysis for vulnerabilities.
- mcp-scan (Snyk): Clean, 95%. Model Context Protocol security validation.
- Snyk (dep): Skipped, 50%. Open source security scanning.
- Socket.dev: Skipped, 50%. Supply chain security analysis.
- VirusTotal: Skipped, 50%. Multi-engine malware detection.
- CrowdStrike: Skipped, 50%. Advanced threat intelligence.
- OSV-Scanner: Skipped, 50%. Open Source Vulnerability database check.
- OWASP Dep-Check: Skipped, 50%.

Last scanned: Apr 5, 2026, 1:57 PM (16.4s, 7 files scanned)


Related Skills

- google-gemini/pirate-skill (tools): Speak like a pirate.
- google-gemini/skill-creator (tools): Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Gemini CLI's capabilities with specialized knowledge, workflows, or tool integrations.
- google-gemini/greeter (tools): A friendly greeter skill.
- google-gemini/string-reviewer (development): Use this skill when asked to review text and user-facing strings within the codebase. It ensures that these strings follow rules on clarity, usefulness, brevity and style.


Install manually

# Clone the repo
git clone https://github.com/google-gemini/gemini-cli.git

# Create the Claude Code skills folder (global) if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into the Claude Code skills folder
cp -r gemini-cli/.gemini/skills/behavioral-evals ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository: google-gemini/gemini-cli (100,284 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT