skills/liu-2023-agentbench/SKILL.md
Comprehensive benchmark suite for evaluating LLM agents across diverse interactive environments
IF task requires procedural execution (web shopping, DB queries, API calls):
├─ IF <15 steps AND template-based outputs
│ └─ Use code-trained models (CodeLlama) - 3x better at format compliance
└─ IF >15 steps OR requires plan revision
  └─ Use general models (GPT-4) - maintains coherence across turns
IF task requires strategic reasoning (games, puzzles, negotiations):
├─ IF model must generate novel hypotheses
│ └─ Avoid code-trained models - they over-optimize for deterministic paths
└─ IF model must revise plans based on feedback
  └─ Use frontier models only - others lose state after round 5
IF failure budget <10%:
├─ Invalid format/action rate matters more than success rate
└─ Prefer API models (GPT-4: 6% invalid) over open-source models (13.6% invalid)
IF task involves >20 interaction rounds:
└─ Only GPT-4 tier maintains plan-state binding - others enter loops by round 10
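The selection rules above can be sketched as a small routing function. This is a minimal sketch, not a definitive implementation: `TaskProfile`, `pick_model_tier`, and the tier labels are hypothetical names, and two choices are assumptions because the rules do not cover them: a task of exactly 15 steps falls through to the general tier, and the >20-round rule takes priority over the others.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    # Hypothetical task description; field names are assumptions for illustration.
    kind: str                      # "procedural" or "strategic"
    expected_steps: int
    template_outputs: bool = False
    needs_plan_revision: bool = False
    expected_rounds: int = 0


def pick_model_tier(task: TaskProfile) -> str:
    """Return "code-trained", "general", or "frontier" following the rules above."""
    # Assumption: the long-horizon rule overrides everything else.
    if task.expected_rounds > 20:
        return "frontier"
    if task.kind == "strategic":
        # Plan revision needs frontier models; otherwise just avoid code-trained ones.
        return "frontier" if task.needs_plan_revision else "general"
    if task.kind == "procedural":
        short_and_templated = task.expected_steps < 15 and task.template_outputs
        if short_and_templated and not task.needs_plan_revision:
            return "code-trained"
        return "general"
    raise ValueError(f"unknown task kind: {task.kind!r}")
```

For example, `pick_model_tier(TaskProfile("procedural", 8, template_outputs=True))` selects the code-trained tier, while the same task with `expected_rounds=25` is routed to the frontier tier.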
IF agent produces malformed outputs despite clear instructions:
├─ Check Rouge-L similarity in last 3 outputs
│ ├─ High (>0.8): Loop detection failure → Add state tracking
│ └─ Low (<0.5): Executive function gap → Add format validation
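The Rouge-L check above can be approximated without external libraries. The sketch below computes a ROUGE-L F1 score from the longest common subsequence of whitespace tokens and averages it over consecutive pairs of the last three outputs. The tokenization, the pairwise averaging, and the function names are assumptions; the rule above fixes only the 0.8 and 0.5 thresholds and says nothing about the range between them, which is reported here as inconclusive.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using one rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (assumed tokenization)."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def classify_malformed(outputs: List[str], high: float = 0.8, low: float = 0.5) -> str:
    """Map similarity of the last three outputs to the suggested fix."""
    recent = outputs[-3:]
    if len(recent) < 2:
        return "insufficient-history"
    pairs = list(zip(recent, recent[1:]))
    mean = sum(rouge_l_f1(a, b) for a, b in pairs) / len(pairs)
    if mean > high:
        return "add-state-tracking"      # loop detection failure
    if mean < low:
        return "add-format-validation"   # executive function gap
    return "inconclusive"                # range not covered by the rule above
```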
IF agent violates environment rules (impossible actions):
├─ Count rule violations per environment type
│ ├─ Code environments: Missing API constraints → Add action space docs
│ ├─ Game environments: Invalid moves → Add rule reminders each turn
│ └─ Web environments: Element targeting → Add DOM structure context
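Counting violations per environment type and looking up the matching fix could look like the sketch below. The environment keys (`"code"`, `"game"`, `"web"`) and the function name are assumptions; the remediation strings are taken from the tree above.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Remediations copied from the decision tree; the dictionary keys are assumed labels.
REMEDIATION = {
    "code": "Add action space docs",
    "game": "Add rule reminders each turn",
    "web": "Add DOM structure context",
}


def remediation_plan(violations: Iterable[str]) -> List[Tuple[str, int, str]]:
    """Take one environment-type label per rule violation and return
    (environment, count, fix) tuples, most frequent environment first."""
    counts = Counter(violations)
    return [
        (env, n, REMEDIATION.get(env, "No remediation listed"))
        for env, n in counts.most_common()
    ]
```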
IF agent exceeds task limits without completion:
├─ Analyze final 5 rounds for repetition patterns
│ ├─ Repeating same action: Add "what have I tried?" prompt
│ ├─ Repeating same reasoning: Add progress checkpoints
│ └─ Random actions: Escalate to human or abort task
IF environment has complex rule set (>10 constraints):
├─ High verbosity: Include full rules every turn
│ └─ Trade-off: Context bloat but lower invalid action rate
└─ Low verbosity: Rules in system message only
  └─ Trade-off: Cleaner prompt but higher rule violation risk
IF task requires >10 sequential steps:
├─ Include explicit progress tracking: "Step X of Y completed"
└─ Add loop detection: "Have I done this exact action before?"
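One way to wire both prompts into each turn is sketched below. The function names, the prompt layout, and the closing instruction to choose a different action are assumptions; only the two quoted phrases come from the rules above. Actions are compared after serializing with sorted keys so that key order does not hide a repeat.

```python
import json
from typing import Dict, List


def seen_before(action: Dict, history: List[Dict]) -> bool:
    """True if an identical action (ignoring key order) is already in history."""
    key = json.dumps(action, sort_keys=True)
    return any(json.dumps(past, sort_keys=True) == key for past in history)


def build_turn_prompt(observation: str, completed: int, total: int,
                      history: List[Dict]) -> str:
    """Prefix the turn with progress tracking and end with the loop-check question."""
    lines = [f"Step {completed} of {total} completed", f"Observation: {observation}"]
    if history:
        lines.append("Actions tried so far:")
        lines.extend(f"- {json.dumps(a, sort_keys=True)}" for a in history)
    # The follow-up instruction after the question is an assumption of this sketch.
    lines.append("Have I done this exact action before? If so, choose a different one.")
    return "\n".join(lines)
```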
Detection Rule: If model produces structurally valid JSON but semantically invalid actions (e.g., {"action": "click", "element": "nonexistent_button"})
Symptoms:
Diagnosis: Dissociation between linguistic understanding and environmental grounding
Fix: Add action pre-validation layer that checks element existence before execution
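A pre-validation layer of this kind might look like the following sketch. It assumes the environment can supply the set of currently visible element identifiers and that every action carries an `element` field; `prevalidate` and the default `allowed_actions` tuple are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def prevalidate(action: Dict, visible_elements: Iterable[str],
                allowed_actions: Tuple[str, ...] = ("click", "type")) -> Tuple[bool, str]:
    """Reject actions that parse fine but cannot be executed in the current state."""
    name = action.get("action")
    if name not in allowed_actions:
        return False, f"unknown action: {name!r}"
    target = action.get("element")
    # Assumption: every allowed action targets an element that must currently exist.
    if target not in set(visible_elements):
        return False, f"element not found: {target!r}"
    return True, "ok"
```

With `visible_elements=["nav-menu", "search-box"]`, the example action `{"action": "click", "element": "nonexistent_button"}` is rejected before it reaches the environment, and the reason string can be fed back to the model as an observation.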
Detection Rule: If model generates good initial plan but actions don't follow plan by round 5+ OR model repeats plan generation mid-task
Symptoms:
<thought> tags, contradictory actions
Diagnosis: Plan-state binding failure in working memory
Fix: Include plan summary in every prompt; add "current plan step" tracking
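The fix can be sketched as a small tracker that renders the plan, with the current step marked, at the top of every prompt. `PlanTracker`, its status labels, and the prompt layout are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlanTracker:
    # Hypothetical helper: holds the initial plan and a pointer to the current step.
    steps: List[str]
    current: int = 0

    def advance(self) -> None:
        """Mark the current step as done and move to the next one."""
        if self.current < len(self.steps):
            self.current += 1

    def summary(self) -> str:
        """Render the plan with done/current/todo markers."""
        lines = ["Plan:"]
        for i, step in enumerate(self.steps):
            if i < self.current:
                mark = "done"
            elif i == self.current:
                mark = "current"
            else:
                mark = "todo"
            lines.append(f"{i + 1}. [{mark}] {step}")
        return "\n".join(lines)

    def wrap(self, observation: str) -> str:
        """Prepend the plan summary to the turn's observation."""
        return f"{self.summary()}\n\nObservation: {observation}"
```

Because the summary is regenerated from the tracker on every turn, the model never has to recall the plan from earlier context.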
Detection Rule: If Rouge-L ≥0.8 in final 3 rounds AND task incomplete
Symptoms:
Diagnosis: No internal representation of "attempted strategies" or progress monitoring
Fix: External loop detection with mandatory strategy pivot after 3 identical rounds
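An external guard for the "3 identical rounds" case could be as small as the sketch below. It uses exact equality of serialized actions rather than Rouge-L, which is a simplification; `LoopGuard` and the wording of `PIVOT_PROMPT` are assumptions.

```python
import json
from collections import deque
from typing import Dict

# Assumed wording for the forced pivot; adapt to the agent's prompt style.
PIVOT_PROMPT = (
    "You have repeated the same action three times. "
    "List the strategies you have not tried yet and pick a different one."
)


class LoopGuard:
    """Signals when the last `window` actions are all identical."""

    def __init__(self, window: int = 3) -> None:
        self.window = window
        self.recent = deque(maxlen=window)

    def observe(self, action: Dict) -> bool:
        """Record an action; return True when a strategy pivot is mandatory."""
        self.recent.append(json.dumps(action, sort_keys=True))
        return len(self.recent) == self.window and len(set(self.recent)) == 1
```

When `observe` returns `True`, the controller would inject `PIVOT_PROMPT` (or an equivalent) instead of letting the agent continue unprompted.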
Detection Rule: If code-trained model fails strategic tasks with success rate <50% of general model performance
Symptoms:
Diagnosis: Code training bias toward single optimal path
Fix: Use general models for strategic tasks; add explicit exploration prompts for code-trained models
Detection Rule: If model correctly explains requirements when asked but immediately violates them in output
Symptoms:
Diagnosis: Linguistic competence vs. procedural compliance dissociation
Fix: Constrained decoding, output validation layer, or format templates with variable substitution
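Two of the listed fixes, a format template with variable substitution and an output validation layer, can be sketched with the standard library alone. The action schema (`action` and `element` keys) and the function names are assumptions. In the template approach the model supplies only the variable values, so the surrounding JSON structure cannot be malformed.

```python
import json
from string import Template
from typing import Tuple

# Values are inserted already JSON-encoded, so quotes inside them are escaped.
ACTION_TEMPLATE = Template('{"action": $action, "element": $element}')


def render_action(action: str, element: str) -> str:
    """Fill the fixed template with JSON-encoded values."""
    return ACTION_TEMPLATE.substitute(
        action=json.dumps(action), element=json.dumps(element)
    )


def validate_output(raw: str, required: Tuple[str, ...] = ("action", "element")) -> Tuple[bool, str]:
    """Check that free-form model output parses as JSON and has the required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return False, "expected a JSON object"
    missing = [key for key in required if key not in data]
    if missing:
        return False, f"missing keys: {missing}"
    return True, "ok"
```

The validation reason can be returned to the model as feedback for a retry, while the template path avoids the retry entirely for outputs that fit a fixed shape.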
Scenario: Agent must purchase a specific laptop from an e-commerce site. CodeLlama-34b vs GPT-4 comparison.
Turn 1-3: Both models navigate homepage correctly, use search function
Turn 4-6: Product comparison required
Turn 7-12: CodeLlama hits constraint (wrong specs)
Turn 13+: Task completion
Key Insight: Procedural task (web navigation) initially favors CodeLlama, but strategic pivot requirement (spec mismatch → search refinement) causes failure. The task grounding shifted from procedural to strategic mid-execution.
Scenario: Agent stuck in navigation loop on unfamiliar website
Detection Phase:
Round 15: {"action": "click", "element": "nav-menu"}
Round 16: {"action": "click", "element": "nav-menu"}
Round 17: {"action": "click", "element": "nav-menu"}
Rouge-L: 0.92 - LOOP DETECTED
Recovery Steps:
Key Insight: Loop detection must trigger strategy enumeration, not just "try harder." The fix is metacognitive scaffolding, not better reasoning.
Do NOT use this skill for:
question-answering-strategies.md instead
content-generation-patterns.md instead
logical-reasoning-frameworks.md instead
streaming-response-handling.md instead
api-integration-patterns.md instead
Delegate to other skills when:
domain-expertise-routing.md
benchmark-design-principles.md
agent-orchestration-patterns.md
llm-system-monitoring.md
This skill specifically handles: Multi-round interactive decision-making where environmental constraints, plan revision, and failure recovery are primary concerns.