skills/multi-model-review/skills/multi-model-review/SKILL.md
Query 2–3 AI models in parallel via OpenRouter and synthesize their responses into a unified review. Use when the user says "get a second opinion", "ask GPT", "ask Gemini", "multi-model review", "council review", "validate this", "what does [model] think", or wants cross-model validation of code, architecture, security, writing, math, or documents. Requires OPENROUTER_API_KEY set in the environment.
npx skillsauth add back1ply/LLM-Skills multi-model-review
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Query multiple AI models in parallel via OpenRouter and synthesize their responses.
Uses the bundled scripts/or_review.py helper, which requires python on PATH and the httpx package installed in that Python environment.
OPENROUTER_API_KEY must be set in Claude Code's environment. Add to user settings:
{ "env": { "OPENROUTER_API_KEY": "sk-or-..." } }
Or export before launching: export OPENROUTER_API_KEY=sk-or-...
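A quick pre-flight check can confirm the key is visible before any model is called. This is a minimal sketch: `check_openrouter_key` is a hypothetical helper, not part of the bundled script, and the `sk-or-` prefix test is an assumption based only on the example values above.

```python
import os


def check_openrouter_key(env=None):
    """Return (ok, message) describing whether OPENROUTER_API_KEY looks usable."""
    env = os.environ if env is None else env
    key = env.get("OPENROUTER_API_KEY", "").strip()
    if not key:
        return False, "OPENROUTER_API_KEY is not set"
    # Assumption: keys start with "sk-or-", as in the examples above.
    if not key.startswith("sk-or-"):
        return True, "OPENROUTER_API_KEY is set but does not start with 'sk-or-'"
    return True, "OPENROUTER_API_KEY is set"


if __name__ == "__main__":
    ok, message = check_openrouter_key()
    print(message)
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.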
Match the user's task to the best preset. The --preset flag selects the right models and system prompt automatically — no manual configuration needed.
| Task | --preset | Models | Benchmark backing |
|------|-----------|--------|-------------------|
| Code review | code | anthropic/claude-opus-4.7 + openai/gpt-5.5 | Both top-3 SWE-Bench Verified; different training lineages catch different bugs |
| Security audit | security | google/gemini-3.1-pro-preview + anthropic/claude-opus-4.7 | Gemini 94.3% GPQA Diamond; Anthropic safety-trained #1 coder |
| Architecture | arch | google/gemini-3.1-pro-preview + anthropic/claude-sonnet-4.6 | Both 1M+ context; structured reasoning at lower cost than Opus |
| Writing critique | writing | anthropic/claude-sonnet-4.6 + openai/gpt-5.5 | Claude #1 EQ-Bench Creative (1936 ELO); GPT-5.5 trained for reduced sycophancy |
| Math / science | math | openai/o4-mini-high + google/gemini-3.1-pro-preview | o4-mini 99.5% AIME 2025; Gemini 94.3% GPQA Diamond |
| Long document | docs | google/gemini-3.1-pro-preview + openai/gpt-5.5 | Both 1M+ context; Gemini leads long-context recall |
| Translation | translate | openai/gpt-5.5 + google/gemini-3.1-pro-preview | GPT-5.5 leads FLORES European; Gemini leads CJK + French |
| Creative writing | creative | anthropic/claude-sonnet-4.6 + google/gemini-3.1-pro-preview | Claude #1 EQ-Bench narrative; Gemini #1 Chatbot Arena creative |
| Quick check | quick | openai/gpt-4.1-mini + google/gemini-3.1-flash-lite-preview | Cheap, low latency — good for non-critical checks |
| Free | free | openrouter/free × 2 | Zero cost; each call independently routed to a random free model |

Coding sub-presets for narrower tasks:

| Task | --preset | Models | Benchmark backing |
|------|-----------|--------|-----------|
| UI / Web | code-web | anthropic/claude-opus-4.7 + google/gemini-3.1-pro-preview | Opus #1 WebDev Arena (1570 ELO); best for components, canvas, CSS |
| SQL / data | code-sql | anthropic/claude-opus-4.7 + google/gemini-3.1-pro-preview | Opus #1 BIRD accuracy; Gemini leads BigQuery/Snowflake dialects |
| CLI / systems | code-cli | openai/gpt-5.5 + anthropic/claude-opus-4.7 | GPT-5.5 82.7% Terminal-Bench 2.0 (#1); Opus #2 — best agentic CLI pair |
| Algorithms | code-live | openai/gpt-5.5 + google/gemini-3.1-pro-preview | LiveCodeBench (contamination-proof); different provider lineages |
| Budget coding | budget | moonshotai/kimi-k2.6 + deepseek/deepseek-v4-pro | ~6× cheaper than code preset; comparable review quality |
Default when task is unclear: code.
Cost warning: If the input exceeds ~5,000 tokens, warn the user before proceeding — costs multiply per model.
- `claude-opus-4.7` ($5/M in · $25/M out) + `gpt-5.5` ($5/M in · $30/M out): tier-1 pair; ~$0.20–$1.00/review on large files
- `gemini-3.1-pro-preview` ($2/M in · $12/M out): middle tier; appears in many presets as a cost-efficient partner
- `budget` preset ($0.75/M + $0.44/M in): near-frontier code review at ~6× lower cost
- `quick` (`gpt-4.1-mini` + `gemini-3.1-flash-lite-preview`) and `free`: always safe to run without warning
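The cost warning can be approximated with simple arithmetic. The sketch below is illustrative only: `estimate_cost` and `should_warn` are hypothetical helpers, the prices are copied from the list above, and the estimate assumes every model emits the full `--max-tokens` budget, so it is an upper bound rather than a prediction.

```python
# USD per million tokens as (input, output), copied from the pricing notes above.
PRICES = {
    "anthropic/claude-opus-4.7": (5.0, 25.0),
    "openai/gpt-5.5": (5.0, 30.0),
    "google/gemini-3.1-pro-preview": (2.0, 12.0),
}
WARN_THRESHOLD_TOKENS = 5000          # the "~5,000 tokens" rule of thumb
NO_WARN_PRESETS = {"quick", "free"}   # described as always safe to run


def estimate_cost(models, prompt_tokens, max_output_tokens=2000):
    """Upper-bound cost in USD: each model reads the prompt and writes max tokens."""
    total = 0.0
    for model in models:
        in_price, out_price = PRICES[model]
        total += prompt_tokens * in_price / 1e6
        total += max_output_tokens * out_price / 1e6
    return total


def should_warn(preset, prompt_tokens):
    """True when the user should be warned before the review runs."""
    if preset in NO_WARN_PRESETS:
        return False
    return prompt_tokens > WARN_THRESHOLD_TOKENS


if __name__ == "__main__":
    pair = ["anthropic/claude-opus-4.7", "openai/gpt-5.5"]
    print(f"upper bound: ${estimate_cost(pair, 10_000):.2f}")
```

Because the same prompt is sent to every model, input cost scales with the number of models, which is why the warning matters more for the tier-1 pair than for the cheaper presets.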
Glob("**/multi-model-review/scripts/or_review.py")
If multiple paths are returned, prefer the one not containing /cache/ in its path. Use the first result as <script_path>.
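The path-selection rule can be written down explicitly. This is a sketch of the rule as stated, with `pick_script_path` as a hypothetical name; normalizing backslashes for Windows paths is an added assumption.

```python
def pick_script_path(paths):
    """Choose <script_path> from Glob results, preferring non-cache copies."""
    if not paths:
        return None
    # Assumption: treat backslashes as separators so Windows paths match too.
    non_cache = [p for p in paths if "/cache/" not in p.replace("\\", "/")]
    # Fall back to the raw list when every match lives under a cache directory.
    return (non_cache or paths)[0]


if __name__ == "__main__":
    found = [
        "/home/u/.claude/cache/multi-model-review/scripts/or_review.py",
        "/home/u/.claude/skills/multi-model-review/scripts/or_review.py",
    ]
    print(pick_script_path(found))
```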
File input (preferred — no escaping issues):
python "<script_path>" --preset code --file "<absolute_path_to_file>"
Heredoc / inline content (when the user pastes code directly):
python "<script_path>" --preset code << 'MMREVIEW'
<paste content here>
MMREVIEW
Piped input:
cat "<file_path>" | python "<script_path>" --preset security
Short inline string (only for simple content without special chars):
python "<script_path>" --preset quick --prompt "Review this function: ..."
Use --models to swap models while keeping the preset's system prompt:
python "<script_path>" --preset code --models "openai/gpt-5.5,google/gemini-3.1-pro-preview"
Use --system to replace the system prompt entirely:
python "<script_path>" --models "openai/gpt-5.5,anthropic/claude-opus-4.7" \
--file myfile.py --system "You are a Python 2/3 compatibility expert..."
Use --max-tokens N to limit response length (default 2000). Use --list-presets to see all preset details.
The script prints progress to stderr as each model completes, and outputs JSON to stdout.
The script prints JSON with this structure:
{
  "results": {
    "<model-id>": {
      "content": "...",
      "elapsed_s": 2.14,
      "prompt_tokens": 523,
      "completion_tokens": 412,
      "cost_usd": 0.013
    }
  },
  "meta": {
    "preset": "code",
    "models": ["...", "..."],
    "total_elapsed_s": 2.14,
    "total_prompt_tokens": 1046,
    "total_completion_tokens": 824,
    "total_cost_usd": 0.027
  }
}
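When the script is driven from Python rather than a shell, the call and the JSON handling might look like the sketch below. `run_review` and `contents_by_model` are hypothetical wrappers; only the `--preset` and `--file` flags and the field names shown above come from this document, and using `sys.executable` in place of `python` is an assumption.

```python
import json
import subprocess
import sys


def run_review(script_path, preset, file_path):
    """Run or_review.py on one file and return the parsed JSON payload."""
    cmd = [sys.executable, script_path, "--preset", preset, "--file", file_path]
    # Only stdout is captured, so progress written to stderr stays visible.
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, text=True, check=True)
    return json.loads(proc.stdout)


def contents_by_model(payload):
    """Map each model id to the text of its response."""
    return {
        model_id: entry["content"]
        for model_id, entry in payload["results"].items()
    }
```

Keeping stderr uncaptured matters here because the script reports per-model progress on it; capturing both streams would hide that feedback until the run ends.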
Parse each model's response from results[model-id].content. Apply this protocol:
| Situation | Action |
|-----------|--------|
| Both models agree | High-confidence finding; present as settled |
| One model unique | "Worth considering (flagged by [model] only)" |
| Models conflict | Surface both positions, explain tradeoff, let user decide |
| Security finding from any model | Always escalate regardless of disagreement |
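The agree/solo/escalate rows of this protocol can be illustrated with set operations. This is only a sketch under strong assumptions: findings are taken to be already reduced to short comparable labels (a step that needs judgment and is not automated here), `bucket_findings` is a hypothetical helper, and for three-model runs "agreed" is taken to mean flagged by at least two models. Conflicts are left out because detecting them requires reading the responses.

```python
def bucket_findings(findings_by_model, security_labels=frozenset()):
    """Sort finding labels into agreed, solo, and escalate buckets.

    findings_by_model maps a model id to a set of normalized finding labels.
    security_labels names the labels that count as security findings.
    """
    all_labels = set().union(*findings_by_model.values())
    buckets = {"agreed": [], "solo": [], "escalate": []}
    for label in sorted(all_labels):
        owners = [m for m, found in findings_by_model.items() if label in found]
        if label in security_labels:
            # Security findings are escalated no matter how many models agree.
            buckets["escalate"].append((label, owners))
        elif len(owners) >= 2:
            buckets["agreed"].append(label)
        else:
            buckets["solo"].append((label, owners[0]))
    return buckets


if __name__ == "__main__":
    print(bucket_findings(
        {
            "model-a": {"sql injection", "unused import"},
            "model-b": {"unused import", "missing tests"},
        },
        security_labels={"sql injection"},
    ))
```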
Bias awareness — apply to every synthesis:
- Verbosity bias: A longer response isn't more reliable. Weight specificity and concrete examples over length.
- Solo findings: A finding from only one model is unconfirmed, not dismissed — flag it clearly.
- If a model returns an `ERROR:...` string, note it was unavailable and weight findings from the remaining model(s) accordingly.
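Separating failed models from usable responses before synthesis keeps the error rule easy to apply. The sketch below assumes the `ERROR:...` string arrives in the `content` field of the model's entry, which this document does not state explicitly; `split_by_availability` is a hypothetical helper.

```python
def split_by_availability(results):
    """Split results into (usable, failed) dicts keyed by model id."""
    usable, failed = {}, {}
    for model_id, entry in results.items():
        content = entry.get("content") or ""
        # Assumption: a failed call is reported as content starting with "ERROR:".
        if content.lstrip().startswith("ERROR:"):
            failed[model_id] = content
        else:
            usable[model_id] = content
    return usable, failed


if __name__ == "__main__":
    usable, failed = split_by_availability({
        "model-a": {"content": "Looks fine, but check the loop bound."},
        "model-b": {"content": "ERROR: upstream timeout"},
    })
    print("usable:", list(usable), "unavailable:", list(failed))
```

With one model unavailable, every remaining finding is effectively a solo finding and should be presented as unconfirmed.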
## Multi-Model Review — [Task Type]
Models: [model-a] · [model-b] | Cost: $X.XX | Time: X.Xs
### Agreed findings ✓ high confidence
- [finding]
### Solo findings ? unconfirmed
- [model-a]: [finding]
- [model-b]: [finding]
### Conflicts ↔
- **[topic]:** [model-a] says X / [model-b] says Y — [tradeoff note]
### Verdict
[1–2 sentence synthesis of the most actionable takeaways]
Omit any section that has no entries. If meta.total_cost_usd is null, omit the cost from the header.
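The two omission rules can be sketched as small formatting helpers. `render_header` and `render_sections` are hypothetical names; the field names come from the `meta` object shown earlier, and the two-decimal cost and one-decimal time formats are assumptions read off the `$X.XX` and `X.Xs` placeholders.

```python
def render_header(meta):
    """Build the 'Models: ... | Cost: ... | Time: ...' line, skipping a null cost."""
    parts = ["Models: " + " · ".join(meta["models"])]
    cost = meta.get("total_cost_usd")
    if cost is not None:
        parts.append(f"Cost: ${cost:.2f}")
    parts.append(f"Time: {meta['total_elapsed_s']:.1f}s")
    return " | ".join(parts)


def render_sections(sections):
    """Render (heading, items) pairs as markdown, dropping empty sections."""
    lines = []
    for heading, items in sections:
        if not items:
            continue  # a section with no entries is omitted entirely
        lines.append(f"### {heading}")
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


if __name__ == "__main__":
    meta = {"models": ["model-a", "model-b"], "total_cost_usd": None, "total_elapsed_s": 2.14}
    print(render_header(meta))
    print(render_sections([
        ("Agreed findings ✓ high confidence", ["off-by-one in loop bound"]),
        ("Conflicts ↔", []),
    ]))
```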
# List all presets with their models:
python scripts/or_review.py --list-presets
# File review:
python scripts/or_review.py --preset code --file myfile.py
# Piped input:
cat myfile.py | python scripts/or_review.py --preset security
# Heredoc for inline content:
python scripts/or_review.py --preset arch << 'EOF'
paste architecture description here
EOF
# Custom models + system prompt:
python scripts/or_review.py \
--models "openai/gpt-5.5,google/gemini-3.1-pro-preview" \
--file myfile.py \
--system "You are a Python 2/3 compatibility expert..."
# Budget review with max-tokens cap:
python scripts/or_review.py --preset budget --file myfile.py --max-tokens 1500