.gemini/skills/behavioral-evals/SKILL.md
Guidance for creating, running, fixing, and promoting behavioral evaluations. Use when verifying agent decision logic, debugging failures, debugging prompt steering, or adding workspace regression tests.
npx skillsauth add google-gemini/gemini-cli behavioral-evals
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Behavioral evaluations (evals) are tests that validate the agent's decision-making (e.g., tool choice) rather than pure functionality. They are critical for verifying prompt changes, debugging steerability, and preventing regressions.
[!NOTE] Single Source of Truth: For core concepts, policies, running tests, and general best practices, always refer to evals/README.md.
appEvalTest (AppRig). See creating.md.
evalTest (TestRig). See creating.md.
USUALLY_PASSES.
ALWAYS_PASSES (locks in regression).
Seed the workspace with necessary files using the files object to simulate a realistic scenario (e.g., NodeJS project with package.json).
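As a rough illustration of the seeding step, a files object can be thought of as a map from workspace-relative paths to file contents. The sketch below is self-contained TypeScript. The `SeedFiles` type, the `seedCase` object, and its `name` and `prompt` fields are hypothetical stand-ins rather than the harness's confirmed API, so check creating.md and evals/README.md for the real shape.

```typescript
// Hypothetical shape: workspace-relative path -> file contents.
type SeedFiles = Record<string, string>;

// A minimal Node.js project, enough for the agent to see a realistic workspace.
const files: SeedFiles = {
  "package.json": JSON.stringify(
    { name: "demo-app", version: "1.0.0", scripts: { test: "vitest run" } },
    null,
    2,
  ),
  "src/index.ts": "export const add = (a: number, b: number) => a + b;\n",
};

// Hypothetical eval case descriptor; the real helper's options may differ.
const seedCase = {
  name: "adds a unit test for add()",
  prompt: "Write a unit test for the add function.",
  files,
};
```

Keeping the seeded project small makes it easier to attribute the agent's tool choices to the prompt rather than to noise in the workspace.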
Audit agent decisions using rig.setBreakpoint() (AppRig only) or index verification on rig.readToolLogs().
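Index verification can be sketched as comparing the positions of tool calls in the log. The example below is a self-contained illustration: the `ToolLogEntry` shape, its `toolName` field, the `calledBefore` helper, and the tool names are all assumptions made for the sketch, and the sample array stands in for whatever rig.readToolLogs() actually returns.

```typescript
// Hypothetical log entry; the real entries from rig.readToolLogs() may differ.
interface ToolLogEntry {
  toolName: string;
}

// Returns true when both tools appear in the log and the earliest call to
// `first` precedes the earliest call to `second`.
function calledBefore(
  logs: ToolLogEntry[],
  first: string,
  second: string,
): boolean {
  const firstIdx = logs.findIndex((entry) => entry.toolName === first);
  const secondIdx = logs.findIndex((entry) => entry.toolName === second);
  return firstIdx !== -1 && secondIdx !== -1 && firstIdx < secondIdx;
}

// Stand-in data for illustration only.
const sampleLogs: ToolLogEntry[] = [
  { toolName: "read_file" },
  { toolName: "replace" },
  { toolName: "run_shell_command" },
];

// Did the agent read the file before editing it?
const readThenEdit = calledBefore(sampleLogs, "read_file", "replace");
```

Checking order by index, rather than only checking that a tool was called, lets an eval assert on the decision sequence (for example, reading before editing).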
Run single tests locally with Vitest. Confirm stability locally before relying on CI workflows.
Detailed procedural guides: