Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

phrazzld/model-research

Name: model-research
Author: phrazzld

skills/model-research/SKILL.md

npx skillsauth add phrazzld/agent-skills model-research

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

/model-research

Find the right model for a specific use case. Live data, not training data.

The Iron Laws

1. NEVER RECOMMEND MODELS FROM TRAINING DATA ALONE.
   Every model name, price, and benchmark MUST come from a live source.

2. EVERY MODEL HAS A RELEASE DATE. USE IT.
   The current date matters. A model released 3+ months ago may already be
   superseded. A model released last week may be unstable. Factor age into
   every recommendation.

LLM model landscapes change weekly. Training data is stale within months. A model recommended from memory may be discontinued, superseded, or repriced.

Delegation Judgment

delegate on judgment per the shared Roster contract: native subagents by default; add cross-model critics, roster providers, or sprite lanes (/sprites) only when they answer a distinct question. See harnesses/shared/AGENTS.md (Roster).

Local lane guidance: Use specialized lanes for platform catalog verification, benchmark/source review, social/reliability pulse, integration/codebase fit, and cost/latency risk. Prefer different model/provider families for independent judgment; the lead owns synthesis and final recommendation.

The Freshness Rule

Today's date is always available in system context. Use it.

For every model under consideration, determine its release date and compute age:

| Age | Classification | Implication | |-----|---------------|-------------| | < 2 weeks | Bleeding edge | May have bugs, API instability, sparse benchmarks. Flag risk. | | 2 weeks – 2 months | Fresh | Sweet spot: stabilized but still current-gen. Prefer these. | | 2 – 4 months | Aging | Check if a successor exists in the same family. Often superseded. | | 4+ months | Stale | Almost certainly superseded. Do NOT recommend without verifying it's still the latest in its family. |

In a field moving this fast, "a few months old" IS old. When two models are otherwise comparable, prefer the fresher one — it reflects more recent training data, RLHF tuning, and architecture improvements.

When presenting models, ALWAYS include the release date (or "unknown" if unfindable) and age classification. This lets the user calibrate trust.

Protocol

Phase 1: Scope the Use Case

Before searching, answer these questions (ask the user if unclear):

Task type: tool calling, classification, generation, code, chat, embedding?
Platform: OpenRouter, Anthropic, OpenAI, Vertex, Bedrock, direct?
Budget: cost ceiling per million tokens? Per request?
Quality floor: what's the minimum acceptable reliability?
Constraints: context window, latency, streaming, multimodal, structured output?
Volume: requests per day/month? (affects rate limit viability)

Phase 2: Live Research (MANDATORY PARALLEL FANOUT)

Launch ALL of these in parallel:

Platform catalog — fetch the actual model list from the target platform
- OpenRouter: https://openrouter.ai/api/v1/models (JSON endpoint)
- Filter by capability (tool calling, structured output, etc.)
- This is the AUTHORITATIVE source for model IDs, pricing, and availability
Exa search — find recent benchmarks, comparisons, known issues
- Search: "[model family] tool calling benchmark 2026"
- Search: "[model family] vs [competitor] [use case]"
- Search: "openrouter [model name] issues OR problems OR reliability"
xAI / social pulse — what practitioners are saying
- Search: "[model name] production" OR "[model name] tool calling"
- Catches reliability issues, rate limit complaints, silent deprecations
Codebase — what the project currently uses and why
- Check existing model config, ADRs, policy files
- Understand current integration shape (AI SDK version, provider setup)

Phase 3: Model Family Audit

For EACH model family the user mentions or that appears promising:

List ALL current models in the family (not just the one you remember)
Identify the latest — check release dates, version numbers
Check supersession — is there a newer model that replaces this one?
Verify availability — is it actually on the target platform right now?
Get exact pricing — from the platform API, not from memory

Phase 4: Head-to-Head Comparison

Build a comparison table with ONLY verified data:

| Field | Source | |-------|--------| | Model ID | Platform API | | Release date | Platform page, announcement post, or changelog | | Age vs today | Computed from release date and current date | | Pricing | Platform API | | Context window | Platform API | | Tool calling support | Platform docs + benchmark | | Benchmark scores | Recent (<3 month) benchmarks with citations | | Known issues | Exa + xAI search results | | Status | GA / Preview / Deprecated |

Rules:

Every cell must have a source. No "probably" or "likely."
If a benchmark score can't be found, write "No data" — don't interpolate.
If pricing is from a blog post, verify against the platform API.
Flag preview/beta models explicitly — they have different reliability profiles.
Release date is mandatory. If you can't find it, write "Unknown (treat as stale)." A model with unknown release date gets the same skepticism as a 4+ month old model.

Phase 5: Recommendation

Produce ranked recommendations for the specific use case:

Primary model: best quality-per-dollar for the task
Fallback chain: 2-3 models ordered by reliability, increasing cost acceptable
Not recommended: models that looked promising but have specific disqualifiers

For each recommendation, state:

Why this model (specific evidence, not vibes)
Risk (preview status, known issues, rate limits)
Cost impact vs current model

Gotchas

Models age fast: A model from 3 months ago is likely superseded. ALWAYS compute age from release date. MiniMax M2.5 → M2.7 happened in ONE month. Gemini generations overlap. If you're recommending something >2 months old, you're probably behind.
Superseded models: ALWAYS check if a newer version exists before recommending. Search "[model family] latest model [current year]" to catch recent releases.
Preview ≠ GA: Preview models can change behavior without notice. Flag them. Preview models are especially likely to have intermittent reliability issues.
"Free tier" rate limits: Many OpenRouter models have free tiers with 1 concurrent request. Not viable for production. Check the paid tier pricing.
Thinking tokens: Models with reasoning (DeepSeek, Qwen) may generate hidden thinking tokens that add latency and cost. Check if thinking can be disabled.
Tool calling ≠ good tool calling: A model may "support" tool calling but fail on complex schemas, parallel tools, or multi-step tool chains. Look for benchmarks specific to tool calling (tau2-bench, PinchBench, Toolathlon), not just general benchmarks.
OpenRouter routing: Same model ID may route to different providers with different latency/reliability. Check if provider can be pinned.
Output token limits: Some cheap models cap output at 4K-8K tokens. Check max output.
Stale benchmark data: A benchmark from 6 months ago may not reflect current model performance, especially for preview models that get updated.
"Latest" in family ≠ best for task: Newer isn't always better for a specific use case. A flash-lite model may be newer but weaker at tool calling than the older flash. Match the model's DESIGN PURPOSE to your task.

Anti-Patterns

Recommending a model you "know" without live verification
Using general benchmarks (MMLU, HumanEval) for tool-calling selection
Ignoring model family versions (recommending M2.5 when M2.7 exists)
Treating all "flash/lite/mini" models as equivalent
Recommending based on pricing alone without tool-calling quality check

Verification

Semantic waiver: "latest model" facts are time-sensitive and require live provider/docs checks. Each run must cite current availability, release date or version evidence, pricing/benchmark source, and the exact platform where the model will be used.

phrazzld/model-research

skills/model-research/SKILL.md

Research, compare, and select LLM models for AI-powered apps and workflows. Finds latest models per family, verifies availability on target platform, compares pricing/benchmarks/tool-calling, produces ranked recommendations. Use when: "which model", "compare models", "find a model", "model research", "best model for", "cheapest model", "tool calling models", "model selection", "upgrade model", "swap model", "fallback models", "model chain". Trigger: /model-research.

13 stars

tools

Updated Jun 10, 2026

$ install --global

skillsauth

npx skillsauth add phrazzld/agent-skills model-research

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Jun 10, 2026, 6:23 AM21.8s1 file scanned

SKILL.md

name:: model-research
description:: |
Use when:: which model", "compare models", "find a model", "model research",
Trigger:: /model-research.
argument-hint:: [use-case description] [--platform openrouter|anthropic|openai|vertex]

/model-research

Find the right model for a specific use case. Live data, not training data.

The Iron Laws

1. NEVER RECOMMEND MODELS FROM TRAINING DATA ALONE.
   Every model name, price, and benchmark MUST come from a live source.

2. EVERY MODEL HAS A RELEASE DATE. USE IT.
   The current date matters. A model released 3+ months ago may already be
   superseded. A model released last week may be unstable. Factor age into
   every recommendation.

LLM model landscapes change weekly. Training data is stale within months. A model recommended from memory may be discontinued, superseded, or repriced.

Delegation Judgment

The Freshness Rule

Today's date is always available in system context. Use it.

For every model under consideration, determine its release date and compute age:

When presenting models, ALWAYS include the release date (or "unknown" if unfindable) and age classification. This lets the user calibrate trust.

Protocol

Phase 1: Scope the Use Case

Before searching, answer these questions (ask the user if unclear):

Task type: tool calling, classification, generation, code, chat, embedding?
Platform: OpenRouter, Anthropic, OpenAI, Vertex, Bedrock, direct?
Budget: cost ceiling per million tokens? Per request?
Quality floor: what's the minimum acceptable reliability?
Constraints: context window, latency, streaming, multimodal, structured output?
Volume: requests per day/month? (affects rate limit viability)

Phase 2: Live Research (MANDATORY PARALLEL FANOUT)

Launch ALL of these in parallel:

Platform catalog — fetch the actual model list from the target platform
- OpenRouter: https://openrouter.ai/api/v1/models (JSON endpoint)
- Filter by capability (tool calling, structured output, etc.)
- This is the AUTHORITATIVE source for model IDs, pricing, and availability
Exa search — find recent benchmarks, comparisons, known issues
- Search: "[model family] tool calling benchmark 2026"
- Search: "[model family] vs [competitor] [use case]"
- Search: "openrouter [model name] issues OR problems OR reliability"
xAI / social pulse — what practitioners are saying
- Search: "[model name] production" OR "[model name] tool calling"
- Catches reliability issues, rate limit complaints, silent deprecations
Codebase — what the project currently uses and why
- Check existing model config, ADRs, policy files
- Understand current integration shape (AI SDK version, provider setup)

Phase 3: Model Family Audit

For EACH model family the user mentions or that appears promising:

List ALL current models in the family (not just the one you remember)
Identify the latest — check release dates, version numbers
Check supersession — is there a newer model that replaces this one?
Verify availability — is it actually on the target platform right now?
Get exact pricing — from the platform API, not from memory

Phase 4: Head-to-Head Comparison

Build a comparison table with ONLY verified data:

Rules:

Every cell must have a source. No "probably" or "likely."
If a benchmark score can't be found, write "No data" — don't interpolate.
If pricing is from a blog post, verify against the platform API.
Flag preview/beta models explicitly — they have different reliability profiles.
Release date is mandatory. If you can't find it, write "Unknown (treat as stale)." A model with unknown release date gets the same skepticism as a 4+ month old model.

Phase 5: Recommendation

Produce ranked recommendations for the specific use case:

Primary model: best quality-per-dollar for the task
Fallback chain: 2-3 models ordered by reliability, increasing cost acceptable
Not recommended: models that looked promising but have specific disqualifiers

For each recommendation, state:

Why this model (specific evidence, not vibes)
Risk (preview status, known issues, rate limits)
Cost impact vs current model

Gotchas

Models age fast: A model from 3 months ago is likely superseded. ALWAYS compute age from release date. MiniMax M2.5 → M2.7 happened in ONE month. Gemini generations overlap. If you're recommending something >2 months old, you're probably behind.
Superseded models: ALWAYS check if a newer version exists before recommending. Search "[model family] latest model [current year]" to catch recent releases.
Preview ≠ GA: Preview models can change behavior without notice. Flag them. Preview models are especially likely to have intermittent reliability issues.
"Free tier" rate limits: Many OpenRouter models have free tiers with 1 concurrent request. Not viable for production. Check the paid tier pricing.
Thinking tokens: Models with reasoning (DeepSeek, Qwen) may generate hidden thinking tokens that add latency and cost. Check if thinking can be disabled.
Tool calling ≠ good tool calling: A model may "support" tool calling but fail on complex schemas, parallel tools, or multi-step tool chains. Look for benchmarks specific to tool calling (tau2-bench, PinchBench, Toolathlon), not just general benchmarks.
OpenRouter routing: Same model ID may route to different providers with different latency/reliability. Check if provider can be pinned.
Output token limits: Some cheap models cap output at 4K-8K tokens. Check max output.
Stale benchmark data: A benchmark from 6 months ago may not reflect current model performance, especially for preview models that get updated.
"Latest" in family ≠ best for task: Newer isn't always better for a specific use case. A flash-lite model may be newer but weaker at tool calling than the older flash. Match the model's DESIGN PURPOSE to your task.

Anti-Patterns

Recommending a model you "know" without live verification
Using general benchmarks (MMLU, HumanEval) for tool-calling selection
Ignoring model family versions (recommending M2.5 when M2.7 exists)
Treating all "flash/lite/mini" models as equivalent
Recommending based on pricing alone without tool-calling quality check

Verification

Related Skills

phrazzld/compound

testing

VerifiedTrustedCommunity

Capture one compounding repo-technical learning while a solved problem is still fresh. Use when: after a bug fix, diagnosis, delivery, review, or incident reveals a reusable pattern worth adding to `docs/solutions/`. Trigger: /compound, /capture-learning, /learning.

13SKILL.mdUpdated Jul 5, 2026

phrazzld/factory-apps

testing

VerifiedTrustedCommunity

Route Misty Step factory application capabilities. Use when choosing, auditing, integrating, or operating Canary, Powder, Landmark, Aesthetic, or Bitterblossom: production observability, incidents, health checks, error logging, backlog/work-card state, release intelligence, UI/UX system adoption, or supervised/unsupervised agent dispatch. Trigger: /factory-apps, /factory-stack.

13SKILL.mdUpdated Jul 4, 2026

phrazzld/factory-apps

phrazzld/skill-eval

testing

VerifiedTrustedCommunity

Prove a skill beats no-skill with a falsifiable A/B eval, or retire it. Design, generate, run, and maintain a skill-specific eval: name the one claim the skill must earn, run it skill-on vs raw same-model, grade blind with objective checks first, return a keep/adapt/cut verdict. Use when: "eval this skill", "does this skill help", "prove the skill beats no skill", "write an eval for", "benchmark a skill", "is this skill worth it", "skill A/B", "skill regression test", "generate skill evals". Trigger: /skill-eval, /eval-skill, /prove-skill.

13SKILL.mdUpdated Jul 2, 2026

phrazzld/skills/harness-engineering/templates/repo-local-skill

tools

VerifiedTrustedCommunity

> Template. Copy to `<target-repo>/.agents/skills/<repo>-<domain>/SKILL.md` > and fill every bracketed placeholder from the live target repo. Delete this > line and every other `> ` guidance line before committing. See > `../../references/repo-local-skill-generation.md` for the full process. --- name: <repo>-<domain> description: | [One paragraph: what this skill verifies/runs/operates for <repo>, stated in terms of the repo's real shape (service/CLI/library/etc.), not generic process. En

13SKILL.mdUpdated Jul 2, 2026

phrazzld/skills/harness-engineering/templates/repo-local-skill

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/phrazzld/agent-skills.git

# Copy into Claude Code skills folder (global)
cp -r agent-skills/skills/model-research ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

phrazzld/agent-skills

13 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT