skills/agent-benchmark/SKILL.md
Expert in writing test configurations for agent-benchmark, a testing framework for AI agents using MCP servers. Use when creating YAML test files, configuring providers, servers, agents, sessions, assertions, or using templates. Helps write benchmarks for AI coding agents.
npx skillsauth add mykhaliev/agent-benchmark agent-benchmark
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
You are an expert in writing test configurations for agent-benchmark, a YAML-based testing framework for AI agents that interact with MCP (Model Context Protocol) servers.
agent-benchmark tests AI agents by:
A test file is organized into these top-level sections:
providers: # LLM configurations
servers: # MCP server definitions
agents: # Agent configurations (provider + servers)
sessions: # Test sessions containing tests
settings: # Global settings
variables: # Reusable template variables
criteria: # Success rate requirements
providers:
  - name: gpt4
    type: AZURE
    auth_type: entra_id
    model: gpt-4o
    baseUrl: "{{AZURE_OPENAI_ENDPOINT}}"
    version: 2025-01-01-preview
servers:
  - name: filesystem
    type: stdio
    command: npx @modelcontextprotocol/server-filesystem /tmp
agents:
  - name: test-agent
    provider: gpt4
    servers:
      - name: filesystem
    system_prompt: |
      Execute tasks directly without asking for confirmation.
settings:
  verbose: true
  max_iterations: 10
sessions:
  - name: File Operations
    tests:
      - name: Create file
        prompt: "Create a file called test.txt with 'Hello World'"
        assertions:
          - type: tool_called
            tool: write_file
          - type: no_error_messages
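Because agents refer to providers and servers by name, a misspelled name is an easy mistake to make in a test file. The sketch below shows one way to catch it before a run. It is not part of agent-benchmark: `check_config` and `KNOWN_SECTIONS` are hypothetical names, and the code assumes the YAML has already been parsed into a plain Python dict by a YAML parser of your choice.

```python
# Hypothetical helper, not part of agent-benchmark. It checks that the names
# an agent uses refer to entries defined under `providers` and `servers`.

# Top-level sections as listed in this document.
KNOWN_SECTIONS = {
    "providers", "servers", "agents", "sessions",
    "settings", "variables", "criteria",
}


def check_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    # Flag top-level keys outside the documented sections (likely typos).
    for key in config:
        if key not in KNOWN_SECTIONS:
            problems.append(f"unknown top-level section: {key}")

    provider_names = {p["name"] for p in config.get("providers", [])}
    server_names = {s["name"] for s in config.get("servers", [])}

    # Every agent should point at a defined provider and defined servers.
    for agent in config.get("agents", []):
        label = agent.get("name", "<unnamed>")
        provider = agent.get("provider")
        if provider not in provider_names:
            problems.append(f"agent {label}: undefined provider {provider!r}")
        for ref in agent.get("servers", []):
            if ref["name"] not in server_names:
                problems.append(f"agent {label}: undefined server {ref['name']!r}")

    return problems


# A trimmed version of the example configuration, written as the dict a
# YAML parser would produce.
example = {
    "providers": [{"name": "gpt4", "type": "AZURE"}],
    "servers": [{"name": "filesystem", "type": "stdio"}],
    "agents": [
        {"name": "test-agent", "provider": "gpt4", "servers": [{"name": "filesystem"}]}
    ],
}

print(check_config(example))
```

The check only covers name references and section names. It does not validate assertion types or provider fields, which depend on what agent-benchmark itself accepts.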
For detailed configuration options, see: