skills/model-research/SKILL.md
Research, compare, and select LLM models for AI-powered apps and workflows. Finds latest models per family, verifies availability on target platform, compares pricing/benchmarks/tool-calling, produces ranked recommendations. Use when: "which model", "compare models", "find a model", "model research", "best model for", "cheapest model", "tool calling models", "model selection", "upgrade model", "swap model", "fallback models", "model chain".
npx skillsauth add phrazzld/agent-skills model-researchInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Find the right model for a specific use case. Live data, not training data.
1. NEVER RECOMMEND MODELS FROM TRAINING DATA ALONE.
Every model name, price, and benchmark MUST come from a live source.
2. EVERY MODEL HAS A RELEASE DATE. USE IT.
The current date matters. A model released 3+ months ago may already be
superseded. A model released last week may be unstable. Factor age into
every recommendation.
LLM model landscapes change weekly. Training data is stale within months. A model recommended from memory may be discontinued, superseded, or repriced.
Today's date is always available in system context. Use it.
For every model under consideration, determine its release date and compute age:
| Age | Classification | Implication | |-----|---------------|-------------| | < 2 weeks | Bleeding edge | May have bugs, API instability, sparse benchmarks. Flag risk. | | 2 weeks – 2 months | Fresh | Sweet spot: stabilized but still current-gen. Prefer these. | | 2 – 4 months | Aging | Check if a successor exists in the same family. Often superseded. | | 4+ months | Stale | Almost certainly superseded. Do NOT recommend without verifying it's still the latest in its family. |
In a field moving this fast, "a few months old" IS old. When two models are otherwise comparable, prefer the fresher one — it reflects more recent training data, RLHF tuning, and architecture improvements.
When presenting models, ALWAYS include the release date (or "unknown" if unfindable) and age classification. This lets the user calibrate trust.
Before searching, answer these questions (ask the user if unclear):
Launch ALL of these in parallel:
Platform catalog — fetch the actual model list from the target platform
https://openrouter.ai/api/v1/models (JSON endpoint)Exa search — find recent benchmarks, comparisons, known issues
"[model family] tool calling benchmark 2026""[model family] vs [competitor] [use case]""openrouter [model name] issues OR problems OR reliability"xAI / social pulse — what practitioners are saying
"[model name] production" OR "[model name] tool calling"Codebase — what the project currently uses and why
For EACH model family the user mentions or that appears promising:
Build a comparison table with ONLY verified data:
| Field | Source | |-------|--------| | Model ID | Platform API | | Release date | Platform page, announcement post, or changelog | | Age vs today | Computed from release date and current date | | Pricing | Platform API | | Context window | Platform API | | Tool calling support | Platform docs + benchmark | | Benchmark scores | Recent (<3 month) benchmarks with citations | | Known issues | Exa + xAI search results | | Status | GA / Preview / Deprecated |
Rules:
Produce ranked recommendations for the specific use case:
For each recommendation, state:
development
Lightweight evidence-backed retro and catch-up reports for a current repo, branch, PR, backlog slice, or recent agent session. Use when the user asks for a debrief, catch me up, what changed, why it matters, product implications, end-user implications, developer experience implications, current app state, backlog state, workspace state, alternatives considered, or context rebuild after losing the thread. Trigger: /debrief.
testing
Capture agent-session work records as local JSONL audit evidence. Links a backlog/spec, branch, commits, review verdicts, QA/demo evidence, transcript refs, and shipped ref without storing raw private transcripts. Use when: "trace this work", "write work record", "agent session trace", "journal this delivery", "link transcript evidence". Trigger: /trace, /journal.
data-ai
Turn proven agent-session patterns into first-party Harness Kit skills. Use when: "skillify this conversation", "make this into a skill", "generate a skill from current transcript", "extract reusable workflow". Trigger: /skillify.
testing
Run one targeted, read-only architecture or quality critique through a named lens from the shared rubric. Use when: "critique this module", "run an Ousterhout pass", "lens critique", "architecture critique". Trigger: /critique.