agent-evaluation/SKILL.md
Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on re...
npx skillsauth add automacoescomerciaisintegradas/skills agent-evaluationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.
You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it
Run tests multiple times and analyze result distributions
Define and test agent behavioral invariants
Actively try to break agent behavior
| Issue | Severity | Solution | |-------|----------|----------| | Agent scores well on benchmarks but fails in production | high | // Bridge benchmark and production evaluation | | Same test passes sometimes, fails other times | high | // Handle flaky tests in LLM agent evaluation | | Agent optimized for metric, not actual task | medium | // Multi-dimensional evaluation to prevent gaming | | Test data accidentally used in training or prompts | critical | // Prevent data leakage in agent evaluation |
Works well with: multi-agent-orchestration, agent-communication, autonomous-agents
This skill is applicable to execute the workflow or actions described in the overview.
development
name: Claude Code System Prompts Mirror slug: claude-code-system-prompts version: 1.0.0 owner: Automações Comerciais Integradas description: Espelho versionado dos prompts de sistema do Claude Code (upstream Piebald-AI), com foco em consulta, estudo e adaptação para engenharia de agentes. language: pt-BR commands: - command: /prompts-index description: Lista categorias e principais arquivos do espelho local de prompts. parameters: - name: categoria type: string r
development
Skill de direção de arte inspirada no visual de /paz-bem.html: editorial premium, tipografia serif/sans, paleta quente (gesso/terracota/carvão), texturas, grid assimétrica e microinterações com GSAP.
development
Padrao de deploy estatico para projetos HTML/CSS/JS puro, com build local por ambiente (dev/prod), publicacao Git sem Actions e promocao de branch entre ambientes.
development
nome: Botão WhatsApp Floating descricao: Skill para gerar e integrar botões flutuantes do WhatsApp com design premium, animação de pulso e link direto para chat. autor: Antigravity comandos: comando: /gerar-botao-whatsapp descricao: Gera o código HTML/CSS completo para um botão flutuante personalizável. parametros: - nome: numero tipo: string descricao: Número do WhatsApp com DDI e DDD (ex: 5541992062238). - nome: mensagem tipo: string descricao: Mensagem inicial pré-preench