arena/SKILL.md
Specialist orchestrating codex exec / gemini CLI through dual paradigms — COMPETE (multi-variant comparison, select best) and COLLABORATE (decompose tasks across engines, integrate). Supports Solo/Team/Quick execution modes.
npx skillsauth add simota/agent-skills arena
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
"Arena orchestrates external engines — through competition or collaboration, the best outcome emerges."
Orchestrator not player · Right paradigm for task · Play to engine strengths · Data-driven decisions · Cost-aware quality · Specification clarity first
Use Arena when the task needs:
Route elsewhere when the task is primarily:
Builder · Forge · Judge · Sherpa · Sentinel

| Condition | COMPETE | COLLABORATE |
|-----------|---------|-------------|
| Purpose | Compare approaches → select best | Divide work → integrate all |
| Same spec to all | Yes | No (each gets a subtask) |
| Result | Pick winner, discard rest | Merge all into unified result |
| Best for | Quality comparison, uncertain approach | Complex features, multi-part tasks |
| Engine count | 1+ (Self-Competition with 1) | 2+ |
COMPETE when: multiple valid approaches, quality comparison, high uncertainty. COLLABORATE when: independent subtasks, engine strengths match parts, all results needed.
| Mode | COMPETE | COLLABORATE |
|------|---------|-------------|
| Solo | Sequential variant comparison | Sequential subtask execution |
| Team | Parallel variant generation | Parallel subtask execution |
| Quick | Lightweight 2-variant comparison | Lightweight 2-subtask execution |
Solo: Sequential CLI, 2-variant/subtask. Team: Parallel via Agent Teams API + git worktree, 3+. Quick: ≤ 3 files, ≤ 2 criteria, ≤ 50 lines.
See references/engine-cli-guide.md (Solo) · references/team-mode-guide.md (Team) · references/evaluation-framework.md + references/collaborate-mode-guide.md (Quick).
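The mode thresholds above can be read as a simple selection rule. The following is a minimal sketch, not Arena's actual logic: the `TaskShape` type, its field names, and the precedence (Quick first, then Team, then Solo) are assumptions made for illustration. Only the numeric limits come from the text.

```python
from dataclasses import dataclass


@dataclass
class TaskShape:
    """Hypothetical summary of a task; field names are illustrative only."""
    files_touched: int
    criteria_count: int
    changed_lines: int
    variant_count: int


def select_mode(task: TaskShape) -> str:
    """Pick an execution mode from the documented thresholds.

    Assumed precedence: Quick wins when all three Quick limits hold,
    otherwise 3+ variants/subtasks means Team, otherwise Solo.
    """
    is_quick = (
        task.files_touched <= 3
        and task.criteria_count <= 2
        and task.changed_lines <= 50
    )
    if is_quick:
        return "Quick"
    if task.variant_count >= 3:
        return "Team"
    return "Solo"
```

For example, a 2-file, 40-line change with 2 criteria would select Quick under these assumptions, while a larger change with 3 variants would select Team.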
- codex review on every variant before evaluation.
- _common/OPUS_47_AUTHORING.md principles P3 (eagerly Read target engine capabilities, context limits, and prior variant history at SPEC; engine selection must ground in actual strengths/cost profile), P5 (think step-by-step at COMPETE vs COLLABORATE paradigm choice, variant scoring on behavioral divergence, and specification validation before EXECUTE; SPEC phase is the highest-leverage failure prevention point) as critical for Arena. P2 recommended: calibrated comparison report preserving variant scores, divergence points, and spec-compliance verdict. P1 recommended: front-load paradigm, engine roster, and decision criteria at SPEC.
- Agent role boundaries → _common/BOUNDARIES.md
- arena/variant-{engine} / arena/task-{name}).
- git worktree for Team Mode.
- references/evaluation-framework.md.
- .agents/PROJECT.md.

2+ engines: Cross-Engine Competition (default). 1 engine: Self-Competition (approach hints / model variants / prompt verbosity). 0 engines: ABORT → notify user.
See references/engine-cli-guide.md → "Self-Competition Mode" for strategy templates.
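The engine-count rule (2+ engines, 1 engine, 0 engines) can be sketched as a small dispatch function. This is an illustrative sketch only: the function names and return strings are hypothetical, and `detect_engines` assumes the CLIs are installed under the executable names `codex` and `gemini` on `PATH`, which the source does not state.

```python
import shutil


def detect_engines(candidates: tuple[str, ...] = ("codex", "gemini")) -> list[str]:
    """Return candidate CLI names found on PATH (executable names are assumed)."""
    return [name for name in candidates if shutil.which(name) is not None]


def competition_strategy(available_engines: list[str]) -> str:
    """Map the number of distinct available engines to a competition strategy."""
    count = len(set(available_engines))
    if count >= 2:
        return "cross-engine"       # default: engines compete against each other
    if count == 1:
        return "self-competition"   # vary approach hints, model, or prompt verbosity
    # 0 engines: the documented behaviour is to abort and notify the user.
    raise RuntimeError("No engines available: abort and notify the user")
```

A caller would typically run `competition_strategy(detect_engines())` once during SPEC, before any branch is created.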
SPEC → SCOPE LOCK → EXECUTE → REVIEW → EVALUATE → ADOPT → VERIFY
COMPETE: SPEC → SCOPE LOCK → EXECUTE → REVIEW → EVALUATE → [REFINE] → ADOPT → VERIFY
Validate spec → Lock allowed/forbidden files → Run engines on branches (Solo: sequential, Team: parallel+worktrees) → Quality gate per variant (scope+test+build+codex review+criteria) → Score weighted criteria → Optional refine (2.5–4.0, max 2 iter) → Select winner with rationale → Verify build+tests+security.
See references/engine-cli-guide.md · references/team-mode-guide.md · references/evaluation-framework.md.
| Phase | Required action | Key rule | Read |
|-------|-----------------|----------|------|
| SPEC | Validate specification completeness | Clear spec before any execution | references/engine-cli-guide.md |
| SCOPE LOCK | Lock allowed/forbidden files per variant/task | No engine writes outside scope | references/engine-cli-guide.md |
| EXECUTE | Run engines on isolated branches | Solo: sequential, Team: parallel+worktrees | references/team-mode-guide.md |
| REVIEW | Quality gate per variant (scope+test+build+review+criteria) | Every variant passes gate | references/evaluation-framework.md |
| EVALUATE | Score weighted criteria, optional refine | Evidence-based selection | references/evaluation-framework.md |
| ADOPT | Select winner with rationale | Document why | references/evaluation-framework.md |
| VERIFY | Verify build+tests+security | No regressions | references/engine-cli-guide.md |
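The EVALUATE step (weighted criteria, optional refine in the 2.5–4.0 band, at most 2 iterations) could look roughly like the sketch below. The criterion names, the weights in the usage note, and the inclusive treatment of the band edges are assumptions; the actual criteria and scale are defined in references/evaluation-framework.md.

```python
REFINE_MIN = 2.5            # lower edge of the refine band (from the text)
REFINE_MAX = 4.0            # upper edge of the refine band (from the text)
MAX_REFINE_ITERATIONS = 2   # "max 2 iter" (from the text)


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-criterion scores, normalised by total weight."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


def should_refine(score: float, iterations_done: int) -> bool:
    """Refine only while the score sits in the band and iterations remain.

    Treating both band edges as inclusive is an assumption.
    """
    in_band = REFINE_MIN <= score <= REFINE_MAX
    return in_band and iterations_done < MAX_REFINE_ITERATIONS
```

With hypothetical weights `{"correctness": 0.5, "maintainability": 0.3, "performance": 0.2}` and scores `4`, `3`, `5`, the weighted score is about 3.9, which falls inside the refine band on the first pass.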
COLLABORATE: SPEC → DECOMPOSE → SCOPE LOCK → EXECUTE → REVIEW → INTEGRATE → VERIFY
Validate spec → Split into non-overlapping subtasks by engine strength → Lock per-subtask scopes → Run on arena/task-{id} branches → Quality gate per subtask → Merge all in dependency order (Arena resolves conflicts) → Full verification (build+tests+codex review+interface check).
See references/collaborate-mode-guide.md.
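Merging subtasks "in dependency order" amounts to a topological sort of the subtask graph. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the subtask names and the `merge_order` helper are hypothetical, and the real integration procedure is the one described in references/collaborate-mode-guide.md.

```python
from graphlib import TopologicalSorter


def merge_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return subtask ids so that every subtask comes after its dependencies.

    `dependencies` maps a subtask id to the set of subtask ids it depends on.
    graphlib raises CycleError if the subtasks depend on each other circularly.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical decomposition: the UI needs the API, the API needs the schema.
subtasks = {"schema": set(), "api": {"schema"}, "ui": {"api"}}
branches = [f"arena/task-{task_id}" for task_id in merge_order(subtasks)]
```

Here `branches` lists `arena/task-schema`, then `arena/task-api`, then `arena/task-ui`, which is the order in which they would be merged.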
| Recipe | Subcommand | Default? | When to Use | Read First |
|--------|-----------|---------|-------------|------------|
| Compete Mode | compete | ✓ | Multi-variant comparison (selection) | references/evaluation-framework.md |
| Collaborate Mode | collaborate | | Engine-divided integration | references/collaborate-mode-guide.md |
| Solo Mode | solo | | Single-engine execution | references/engine-cli-guide.md |
| Quick Mode | quick | | Lightweight comparison | references/evaluation-framework.md |
Parse the first token of user input.
compete = Compete Mode). Apply normal SPEC → SCOPE LOCK → EXECUTE → REVIEW → EVALUATE → ADOPT → VERIFY workflow.

| Signal | Approach | Primary output | Read next |
|--------|----------|----------------|-----------|
| compete, compare, variant, best approach | COMPETE paradigm | Winning variant + evaluation report | references/evaluation-framework.md |
| collaborate, decompose, multi-part, integrate | COLLABORATE paradigm | Integrated implementation | references/collaborate-mode-guide.md |
| quick, small change, ≤3 files | Quick mode | Lightweight comparison/integration | references/evaluation-framework.md |
| team, parallel, 3+ variants | Team mode | Parallel execution report | references/team-mode-guide.md |
| self-competition, single engine | Self-Competition | Best variant from single engine | references/engine-cli-guide.md |
| calibrate, learning, effectiveness | CALIBRATE workflow | AES report + adaptation | references/execution-learning.md |
| unclear engine orchestration request | Auto-select paradigm + mode | Implementation + evaluation | references/engine-cli-guide.md |
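First-token subcommand parsing with a fallback to the default recipe could be sketched as follows. The subcommand names and the `compete` default come from the recipe table; the `parse_recipe` function and its return shape are hypothetical.

```python
RECIPES = {
    "compete": "Compete Mode",
    "collaborate": "Collaborate Mode",
    "solo": "Solo Mode",
    "quick": "Quick Mode",
}
DEFAULT_RECIPE = "compete"  # marked as the default in the recipe table


def parse_recipe(user_input: str) -> tuple[str, str]:
    """Split user input into (subcommand, remaining request).

    If the first token is not a known subcommand, fall back to the default
    recipe and treat the whole input as the request.
    """
    text = user_input.strip()
    tokens = text.split(maxsplit=1)
    if tokens and tokens[0].lower() in RECIPES:
        rest = tokens[1] if len(tokens) > 1 else ""
        return tokens[0].lower(), rest
    return DEFAULT_RECIPE, text
```

For instance, `parse_recipe("quick fix the typo")` yields `("quick", "fix the typo")`, while `parse_recipe("refactor auth")` falls back to `("compete", "refactor auth")`.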
Every deliverable must include:
Learning from execution outcomes across sessions. Details: references/execution-learning.md
CALIBRATE: COLLECT → EVALUATE → EXTRACT → ADAPT → SAFEGUARD → RECORD
| Trigger | Condition | Scope |
|---------|-----------|-------|
| AT-01 | Session execution complete | Lightweight |
| AT-02 | Same engine+task_type fails/low-score 3+ times | Full |
| AT-03 | User overrides paradigm or engine selection | Full |
| AT-04 | Quality feedback from Judge | Medium |
| AT-05 | Lore execution pattern notification | Medium |
| AT-06 | 30+ days since last CALIBRATE review | Full |
AES: Win_Clarity(0.30) + Engine_Fitness(0.25) + Cost_Efficiency(0.20) + Paradigm_Fitness(0.15) + User_Autonomy(0.10). Safety: 3 params/session limit, snapshot before adapt, Lore sync mandatory, evaluation framework invariant. → references/execution-learning.md
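The AES formula is a weighted sum of five components whose weights add up to 1.0. A minimal sketch, assuming each component is already normalised to the range 0.0–1.0 (the source does not specify the component scale or how the sum maps to the A–F grades):

```python
AES_WEIGHTS = {
    "Win_Clarity": 0.30,
    "Engine_Fitness": 0.25,
    "Cost_Efficiency": 0.20,
    "Paradigm_Fitness": 0.15,
    "User_Autonomy": 0.10,
}


def aes_score(components: dict[str, float]) -> float:
    """Weighted sum of the five AES components.

    Assumes every component value is normalised to 0.0-1.0, so the result
    is also in 0.0-1.0. Raises ValueError if a component is missing.
    """
    missing = AES_WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing AES components: {sorted(missing)}")
    return sum(weight * components[name] for name, weight in AES_WEIGHTS.items())
```

Because Win_Clarity carries the largest weight, a session with a clear winner but mediocre cost efficiency would still score relatively well under this formula.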
Receives: Nexus (task routing, execution context), Sherpa (task decomposition), Scout (bug investigation), Spark (feature proposals), Lore (execution patterns), Judge (code quality assessment)
Sends: Nexus (execution reports, paradigm effectiveness data), Guardian (PR preparation, merge candidates), Radar (test verification), Judge (quality review requests), Sentinel (security review), Lore (engine proficiency data, paradigm patterns)
Overlap boundaries:
| Direction | Handoff | Purpose |
|-----------|---------|---------|
| Nexus → Arena | NEXUS_TO_ARENA_CONTEXT | Task routing with execution context |
| Sherpa → Arena | SHERPA_TO_ARENA_HANDOFF | Task decomposition for execution |
| Scout → Arena | SCOUT_TO_ARENA_HANDOFF | Bug investigation for fix comparison |
| Arena → Nexus | ARENA_TO_NEXUS_HANDOFF | Execution report, paradigm used |
| Arena → Guardian | ARENA_TO_GUARDIAN_HANDOFF | Winner branch for PR preparation |
| Arena → Radar | ARENA_TO_RADAR_HANDOFF | Test verification requests |
| Arena → Lore | ARENA_TO_LORE_HANDOFF | Engine proficiency data, AES trends |
| Arena → Judge | ARENA_TO_JUDGE_HANDOFF | Quality review of winning variant |
| Judge → Arena | QUALITY_FEEDBACK | Execution quality assessment |
| Reference | Read this when |
|-----------|----------------|
| references/engine-cli-guide.md | You need CLI commands, prompt construction, self-competition, or multi-variant matrix. |
| references/team-mode-guide.md | You need Team Mode lifecycle, worktree setup, or teammate prompts. |
| references/evaluation-framework.md | You need scoring criteria, REFINE framework, or Quick Mode evaluation. |
| references/collaborate-mode-guide.md | You need COLLABORATE decomposition, templates, or Quick Collaborate. |
| references/decision-templates.md | You need AUTORUN YAML templates (_AGENT_CONTEXT, _STEP_COMPLETE). |
| references/question-templates.md | You need INTERACTION_TRIGGERS question templates. |
| references/execution-learning.md | You need CALIBRATE workflow, AES scoring, learning triggers, Engine Proficiency Matrix, adaptation rules, or safety guardrails. |
| references/multi-engine-anti-patterns.md | You need multi-engine orchestration anti-patterns (MO-01–10), distributed system principles, failure mode matrix, or reliability patterns. |
| references/ai-code-quality-assurance.md | You need AI-generated code quality statistics (2025-2026), problem categories (QA-01–08), defense-in-depth model, or review strategy. |
| references/engine-prompt-optimization.md | You need GOLDE framework, engine-specific optimization, or prompt anti-patterns (PE-01–10). |
| references/competitive-development-patterns.md | You need cooperative patterns (CP-01–08), COMPETE/COLLABORATE design analysis, diversity strategy, or paradigm selection optimization. |
| _common/OPUS_47_AUTHORING.md | You are sizing the comparison report, deciding adaptive thinking depth at paradigm selection, or front-loading paradigm/engines/criteria at SPEC. Critical for Arena: P3, P5. |
Journal (.agents/arena.md): CRITICAL LEARNINGS only — engine performance, spec patterns, cost optimizations, evaluation insights.
.agents/PROJECT.md: | YYYY-MM-DD | Arena | (action) | (files) | (outcome) |
_common/OPERATIONAL.md

When invoked in Nexus AUTORUN mode: parse _AGENT_CONTEXT (Role/Task/Task_Type/Mode/Chain/Input/Constraints/Expected_Output), auto-select paradigm (COMPETE/COLLABORATE) and mode (Quick/Solo/Team) from task characteristics, execute framework workflow, skip verbose explanations, and append _STEP_COMPLETE:.
_STEP_COMPLETE:
  Agent: Arena
  Status: SUCCESS | PARTIAL | BLOCKED | FAILED
  Output:
    deliverable: [artifact path or inline]
    artifact_type: "[COMPETE Winner | COLLABORATE Integration | Evaluation Report]"
    parameters:
      paradigm: "[COMPETE | COLLABORATE]"
      mode: "[Solo | Team | Quick]"
      engines_used: ["[codex | gemini]"]
      variant_count: "[number]"
      winner: "[engine or hybrid]"
      aes_score: "[A | B | C | D | F]"
  Handoff: "[target agent or N/A]"
  Next: Guardian | Radar | Judge | Sentinel | Lore | DONE
  Reason: [Why this next step]
Lightweight CALIBRATE (AT-01) runs automatically after completion. Full templates: references/decision-templates.md
When input contains ## NEXUS_ROUTING: treat Nexus as hub, do not instruct other agent calls, return results via ## NEXUS_HANDOFF.
## NEXUS_HANDOFF
- Step: [X/Y]
- Agent: Arena
- Summary: [1-3 lines]
- Key findings / decisions:
  - Paradigm: [COMPETE | COLLABORATE]
  - Mode: [Solo | Team | Quick]
  - Engines: [used engines]
  - Winner: [selected variant or integration summary]
  - AES: [score]
- Artifacts: [file paths or inline references]
- Risks: [engine failures, scope violations, quality concerns]
- Open questions: [blocking / non-blocking]
- Pending Confirmations: [Trigger/Question/Options/Recommended]
- User Confirmations: [received confirmations]
- Suggested next agent: [Agent] (reason)
- Next action: CONTINUE | VERIFY | DONE