Multi-Model / Model-Agnostic Platform Skill

You are operating in a multi-model environment. Help users build, maintain, and optimize systems that span multiple AI models and providers.

1. Model Selection Criteria

Choose the right model for each task by evaluating:

Task complexity: Use large frontier models (Claude Sonnet/Opus, GPT-4o, Gemini Ultra) for complex reasoning, multi-step planning, and code generation. Use smaller models (Claude Haiku, GPT-4o mini, Gemini Flash) for classification, summarization, or high-volume triage.
Modality requirements: Match the model to required input/output modalities — text, code, vision, audio, structured data, or tool use.
Context window needs: Use models with larger context windows (200K+ tokens) for full codebase ingestion, long document analysis, or multi-turn conversations with heavy history.
Cost-per-token budget: Estimate token consumption upfront. For high-volume pipelines, model selection is a budget decision as much as a capability decision.
Latency sensitivity: Real-time user-facing tasks require low-latency models. Offline batch pipelines can use larger, slower, cheaper models.

Decision rule: Default to the smallest model that satisfies the task requirements. Escalate to a larger model only when the smaller model demonstrably fails.

Model routing by task complexity (Python example):

MODEL_SONNET = "claude-sonnet-4-6"
MODEL_HAIKU = "claude-haiku-4-5-20251001"

def select_model(text_length: int, item_count: int, force_model: str | None = None) -> str:
    if force_model is not None:
        return force_model
    if text_length >= 10_000 or item_count >= 30:
        return MODEL_SONNET  # Complex task
    return MODEL_HAIKU        # Simple task (3-4x cheaper)

2. Cost-Aware Pipeline Patterns

Immutable cost tracking — track cumulative spend with frozen dataclasses:

from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class CostTracker:
    budget_limit: float = 1.00
    records: tuple = ()

    def add(self, record) -> "CostTracker":
        return CostTracker(budget_limit=self.budget_limit, records=(*self.records, record))

    @property
    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.records)

    @property
    def over_budget(self) -> bool:
        return self.total_cost > self.budget_limit

Narrow retry logic — retry only on transient errors, fail fast on authentication or bad request errors:

from anthropic import APIConnectionError, InternalServerError, RateLimitError

_RETRYABLE = (APIConnectionError, RateLimitError, InternalServerError)

def call_with_retry(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except _RETRYABLE:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
    # AuthenticationError, BadRequestError → raise immediately

Prompt caching — cache long system prompts to avoid resending on every request:

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": user_input},
    ],
}]

Pricing reference (2025-2026):

| Model | Input ($/1M tokens) | Output ($/1M tokens) | |-------|---------------------|----------------------| | Haiku 4.5 | $0.80 | $4.00 | | Sonnet 4.6 | $3.00 | $15.00 | | Opus 4.5 | $15.00 | $75.00 |

3. Prompt Portability Across Models

Write prompts that work reliably across multiple models without per-model rewrites:

Avoid model-specific syntax: Do not rely on proprietary prompt features. Use neutral Markdown for structure.
Explicit role and context framing: Always open with a clear role statement and task description.
Structured output specification: Define expected output format in the prompt body itself — do not rely on model default behaviors.
Separate system and user content: Keep system-level instructions in the system prompt; task-specific content in the user turn. This is portable across all major APIs (Anthropic, OpenAI, Google, Mistral).
Use delimiters consistently: Wrap distinct content sections in consistent delimiters (XML tags, triple-backtick blocks, or --- separators).
Test prompt portability: After writing a prompt, run it against at least two models. If outputs diverge significantly, the prompt is under-specified.

4. Handling Model-Specific Capabilities and Limitations

Tool/function calling schemas: OpenAI, Anthropic, and Google each have different schemas. Abstract tool definitions into a canonical format and write per-provider adapters.
Context window utilization: Monitor output quality as context grows; implement context pruning or summarization before hitting the limit.
Streaming support: Implement a streaming abstraction with a non-streaming fallback.
Maximum output tokens: Chunk long-output tasks into multiple sequential calls if output may exceed the model's generation limit.
Refusal behavior: Each model has different content policies. Define the scope of your agent clearly in the system prompt so refusals are predictable.

5. Fallback Strategies When a Model Fails

Build resilient multi-model pipelines that degrade gracefully:

Define a model priority chain: For every task type, define ranked models: primary → secondary → emergency fallback. Example: claude-sonnet-4 → claude-haiku-3 → gpt-4o-mini.
Trigger conditions for fallback: HTTP 429 (rate limit), HTTP 503 (service unavailable), HTTP 500 (provider error), timeout exceeded, or structured output parse failure.
Retry with exponential backoff before fallback: Distinguish transient errors (retry same model) from persistent failures (escalate). Minimum 3 attempts before switching models.
Output normalization after fallback: Ensure outputs from all fallback models are normalized to the same schema before passing to downstream systems.
Log every fallback event: Record the original model, reason for fallback, fallback model used, and outcome.
Circuit breaker pattern: After N consecutive failures on a primary model within a rolling window, open the circuit breaker and route directly to the fallback for a configurable cooldown period.

6. Context Window Management Across Models

Always measure before sending: Count tokens before constructing the final prompt. Use the target model's tokenizer. Never estimate by character count.
Define a context budget: Allocate the window into segments: system prompt, retrieved context, conversation history, task instructions, output buffer.
History truncation strategy: Truncate oldest turns first. Preserve the system prompt and most recent exchanges. Optionally replace truncated history with a rolling summary.
RAG for large codebases: Do not stuff entire files into context. Retrieve only the most relevant code chunks via embedding search. Limit retrieved context to 30-40% of total window budget.
Cross-model context hand-off: When moving from one model to another, serialize relevant context into a structured summary rather than passing raw history.
Monitor in production: Log prompt_tokens and completion_tokens from every response. Alert when prompt tokens exceed 80% of the model's limit.

7. On-Device Models (Apple FoundationModels / iOS 26+)

For privacy-sensitive apps that must work offline, use Apple's on-device FoundationModels framework.

Availability check (always required):

struct GenerativeView: View {
    private var model = SystemLanguageModel.default

    var body: some View {
        switch model.availability {
        case .available:
            ContentView()
        case .unavailable(.deviceNotEligible):
            Text("Device not eligible for Apple Intelligence")
        case .unavailable(.appleIntelligenceNotEnabled):
            Text("Please enable Apple Intelligence in Settings")
        case .unavailable(.modelNotReady):
            Text("Model is downloading or not ready")
        case .unavailable(let other):
            Text("Model unavailable: \(other)")
        }
    }
}

Basic session:

let session = LanguageModelSession(instructions: """
    You are a cooking assistant. Provide brief, practical suggestions.
    """)
let response = try await session.respond(to: "I have chicken and rice")

Structured output with @Generable:

@Generable(description: "A trip idea")
struct TripIdea {
    var destination: String
    @Guide(description: "Why it's a good choice", .range(10...200))
    var rationale: String
}

let response = try await session.respond(to: "Suggest a trip", generating: TripIdea.self)

Key constraints:

4,096 token limit (instructions + prompt + output combined)
One request per session at a time (isResponding check required)
Access results via .content, not .output
Use @Generable for structured output — stronger guarantees than parsing raw strings
Snapshot streaming (not deltas) for real-time UI via session.streamResponse(...)

8. Consistent Output Format Regardless of Model

Schema-first design: Define the output schema before selecting a model. Every model must produce output conforming to this schema.
Structured output enforcement: Use JSON mode, structured output APIs, or tool-calling response format where available.
Output parsing and validation layer: Implement a validation layer between the model response and your application that parses, validates against schema, and raises errors on malformed output.
Normalization of model-specific quirks: Some models prepend preamble text or wrap JSON in markdown fences. Implement model-specific output normalization handlers.
Fallback parsing: If strict parsing fails, use a lenient fallback parser. Log all fallback parse events.
Version your output schemas: When changing the expected output format, version the schema and maintain backward-compatible parsing.

9. Model Versioning and Upgrades

Pin model versions in production: Never use floating aliases (e.g., gpt-4o-latest, claude-3-sonnet). Always pin to a specific dated version (e.g., claude-sonnet-4-20250514).
Upgrade tracking: Maintain a model version registry mapping task types to pinned versions. Run your test suite before deploying upgrades.
Canary deployments for model upgrades: Roll out model version changes to 1-5% of traffic first. Monitor output quality metrics before increasing rollout.
Regression testing on upgrade: Run your full prompt test suite against the new version before promoting it.
Deprecation monitoring: Subscribe to provider deprecation notices. Alert when a pinned version approaches its deprecation date minus 30 days.
Rollback capability: Maintain the ability to roll back to the previous model version within 5 minutes of detecting a production regression.

10. Latency Considerations

First-token latency vs. total latency: For streaming user-facing responses, TTFT is the metric that drives perceived responsiveness. For batch pipelines, total latency matters more.
Parallel model calls: When a task can be decomposed into independent subtasks, run them in parallel across models simultaneously.
Streaming for user-facing paths: Always stream responses in user-facing applications. Implement streaming-compatible output parsers.
Pre-computation and prompt caching: Use provider-side prompt caching (Anthropic prompt caching, OpenAI cached prompts) for repeated system prompt prefixes. Reduces both cost and latency.
Timeout budgets: Typical production timeouts: 5s for TTFT, 30s for total completion on frontier models. Fail fast and trigger fallback rather than waiting indefinitely.
Latency SLA monitoring: Track p50, p95, and p99 latency per model per task type. Alert on p95 degradation.

11. API Key and Credential Management

Never hardcode API keys: Keys must never appear in source code, configuration files, or log output.
Centralized secrets management: Store all API keys in a dedicated secrets manager (AWS Secrets Manager, HashiCorp Vault, Azure Key Vault, GCP Secret Manager).
Per-environment key isolation: Maintain separate keys for development, staging, and production.
Provider key abstraction: Abstract credentials behind a unified resolver: get_credential("anthropic") — not direct environment variable references. Enables central rotation without code changes.
Key inventory tracking: Maintain an inventory of all API keys: provider, environment, creation date, rotation date, owning team. Review and prune unused keys quarterly.

12. Monitoring and Observability

Structured request/response logging: Log every model call with: timestamp, model ID, task type, prompt token count, completion token count, latency, status code. Do not log raw prompt content without PII scrubbing.
Per-model metrics dashboards: Track request volume, error rate, latency distribution, token usage, and cost per model.
Unified tracing: Use OpenTelemetry distributed tracing spanning multiple model calls within a single user request.
Output quality metrics: Track structured output parse success rate, schema validation pass rate, and fallback parse trigger rate.
Alerting thresholds: Error rate > 1% (critical), p95 latency > 2× baseline (warning), cost > 110% of daily budget (warning), fallback trigger rate > 5% (warning).

Quick Reference: Decision Checklist

When building or reviewing a multi-model system:

[ ] Model selection is documented per task type with rationale
[ ] Prompts are tested against all models in the portfolio
[ ] Model versions are pinned (no floating aliases in production)
[ ] Fallback chains are defined and tested for all critical paths
[ ] Context window budgets are calculated and enforced
[ ] Output schemas are versioned and validated at the parsing layer
[ ] All API keys are stored in a secrets manager, not in code
[ ] Latency SLAs are defined with p95/p99 alerting
[ ] Cost caps are enforced at the routing layer
[ ] Distributed tracing spans all model calls in a request chain
[ ] Regression test suite runs in CI on every prompt or model change

Multi-Model / Model-Agnostic Platform Skill

You are operating in a multi-model environment. Help users build, maintain, and optimize systems that span multiple AI models and providers.

1. Model Selection Criteria

Choose the right model for each task by evaluating:

Task complexity: Use large frontier models (Claude Sonnet/Opus, GPT-4o, Gemini Ultra) for complex reasoning, multi-step planning, and code generation. Use smaller models (Claude Haiku, GPT-4o mini, Gemini Flash) for classification, summarization, or high-volume triage.
Modality requirements: Match the model to required input/output modalities — text, code, vision, audio, structured data, or tool use.
Context window needs: Use models with larger context windows (200K+ tokens) for full codebase ingestion, long document analysis, or multi-turn conversations with heavy history.
Cost-per-token budget: Estimate token consumption upfront. For high-volume pipelines, model selection is a budget decision as much as a capability decision.
Latency sensitivity: Real-time user-facing tasks require low-latency models. Offline batch pipelines can use larger, slower, cheaper models.

Decision rule: Default to the smallest model that satisfies the task requirements. Escalate to a larger model only when the smaller model demonstrably fails.

Model routing by task complexity (Python example):

MODEL_SONNET = "claude-sonnet-4-6"
MODEL_HAIKU = "claude-haiku-4-5-20251001"

def select_model(text_length: int, item_count: int, force_model: str | None = None) -> str:
    if force_model is not None:
        return force_model
    if text_length >= 10_000 or item_count >= 30:
        return MODEL_SONNET  # Complex task
    return MODEL_HAIKU        # Simple task (3-4x cheaper)

2. Cost-Aware Pipeline Patterns

Immutable cost tracking — track cumulative spend with frozen dataclasses:

from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class CostTracker:
    budget_limit: float = 1.00
    records: tuple = ()

    def add(self, record) -> "CostTracker":
        return CostTracker(budget_limit=self.budget_limit, records=(*self.records, record))

    @property
    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.records)

    @property
    def over_budget(self) -> bool:
        return self.total_cost > self.budget_limit

Narrow retry logic — retry only on transient errors, fail fast on authentication or bad request errors:

from anthropic import APIConnectionError, InternalServerError, RateLimitError

_RETRYABLE = (APIConnectionError, RateLimitError, InternalServerError)

def call_with_retry(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except _RETRYABLE:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
    # AuthenticationError, BadRequestError → raise immediately

Prompt caching — cache long system prompts to avoid resending on every request:

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": user_input},
    ],
}]

Pricing reference (2025-2026):

3. Prompt Portability Across Models

Write prompts that work reliably across multiple models without per-model rewrites:

Avoid model-specific syntax: Do not rely on proprietary prompt features. Use neutral Markdown for structure.
Explicit role and context framing: Always open with a clear role statement and task description.
Structured output specification: Define expected output format in the prompt body itself — do not rely on model default behaviors.
Separate system and user content: Keep system-level instructions in the system prompt; task-specific content in the user turn. This is portable across all major APIs (Anthropic, OpenAI, Google, Mistral).
Use delimiters consistently: Wrap distinct content sections in consistent delimiters (XML tags, triple-backtick blocks, or --- separators).
Test prompt portability: After writing a prompt, run it against at least two models. If outputs diverge significantly, the prompt is under-specified.

4. Handling Model-Specific Capabilities and Limitations

Tool/function calling schemas: OpenAI, Anthropic, and Google each have different schemas. Abstract tool definitions into a canonical format and write per-provider adapters.
Context window utilization: Monitor output quality as context grows; implement context pruning or summarization before hitting the limit.
Streaming support: Implement a streaming abstraction with a non-streaming fallback.
Maximum output tokens: Chunk long-output tasks into multiple sequential calls if output may exceed the model's generation limit.
Refusal behavior: Each model has different content policies. Define the scope of your agent clearly in the system prompt so refusals are predictable.

5. Fallback Strategies When a Model Fails

Build resilient multi-model pipelines that degrade gracefully:

Define a model priority chain: For every task type, define ranked models: primary → secondary → emergency fallback. Example: claude-sonnet-4 → claude-haiku-3 → gpt-4o-mini.
Trigger conditions for fallback: HTTP 429 (rate limit), HTTP 503 (service unavailable), HTTP 500 (provider error), timeout exceeded, or structured output parse failure.
Retry with exponential backoff before fallback: Distinguish transient errors (retry same model) from persistent failures (escalate). Minimum 3 attempts before switching models.
Output normalization after fallback: Ensure outputs from all fallback models are normalized to the same schema before passing to downstream systems.
Log every fallback event: Record the original model, reason for fallback, fallback model used, and outcome.
Circuit breaker pattern: After N consecutive failures on a primary model within a rolling window, open the circuit breaker and route directly to the fallback for a configurable cooldown period.

6. Context Window Management Across Models

Always measure before sending: Count tokens before constructing the final prompt. Use the target model's tokenizer. Never estimate by character count.
Define a context budget: Allocate the window into segments: system prompt, retrieved context, conversation history, task instructions, output buffer.
History truncation strategy: Truncate oldest turns first. Preserve the system prompt and most recent exchanges. Optionally replace truncated history with a rolling summary.
RAG for large codebases: Do not stuff entire files into context. Retrieve only the most relevant code chunks via embedding search. Limit retrieved context to 30-40% of total window budget.
Cross-model context hand-off: When moving from one model to another, serialize relevant context into a structured summary rather than passing raw history.
Monitor in production: Log prompt_tokens and completion_tokens from every response. Alert when prompt tokens exceed 80% of the model's limit.

7. On-Device Models (Apple FoundationModels / iOS 26+)

For privacy-sensitive apps that must work offline, use Apple's on-device FoundationModels framework.

Availability check (always required):

struct GenerativeView: View {
    private var model = SystemLanguageModel.default

    var body: some View {
        switch model.availability {
        case .available:
            ContentView()
        case .unavailable(.deviceNotEligible):
            Text("Device not eligible for Apple Intelligence")
        case .unavailable(.appleIntelligenceNotEnabled):
            Text("Please enable Apple Intelligence in Settings")
        case .unavailable(.modelNotReady):
            Text("Model is downloading or not ready")
        case .unavailable(let other):
            Text("Model unavailable: \(other)")
        }
    }
}

Basic session:

let session = LanguageModelSession(instructions: """
    You are a cooking assistant. Provide brief, practical suggestions.
    """)
let response = try await session.respond(to: "I have chicken and rice")

Structured output with @Generable:

@Generable(description: "A trip idea")
struct TripIdea {
    var destination: String
    @Guide(description: "Why it's a good choice", .range(10...200))
    var rationale: String
}

let response = try await session.respond(to: "Suggest a trip", generating: TripIdea.self)

Key constraints:

4,096 token limit (instructions + prompt + output combined)
One request per session at a time (isResponding check required)
Access results via .content, not .output
Use @Generable for structured output — stronger guarantees than parsing raw strings
Snapshot streaming (not deltas) for real-time UI via session.streamResponse(...)

8. Consistent Output Format Regardless of Model

Schema-first design: Define the output schema before selecting a model. Every model must produce output conforming to this schema.
Structured output enforcement: Use JSON mode, structured output APIs, or tool-calling response format where available.
Output parsing and validation layer: Implement a validation layer between the model response and your application that parses, validates against schema, and raises errors on malformed output.
Normalization of model-specific quirks: Some models prepend preamble text or wrap JSON in markdown fences. Implement model-specific output normalization handlers.
Fallback parsing: If strict parsing fails, use a lenient fallback parser. Log all fallback parse events.
Version your output schemas: When changing the expected output format, version the schema and maintain backward-compatible parsing.

9. Model Versioning and Upgrades

Pin model versions in production: Never use floating aliases (e.g., gpt-4o-latest, claude-3-sonnet). Always pin to a specific dated version (e.g., claude-sonnet-4-20250514).
Upgrade tracking: Maintain a model version registry mapping task types to pinned versions. Run your test suite before deploying upgrades.
Canary deployments for model upgrades: Roll out model version changes to 1-5% of traffic first. Monitor output quality metrics before increasing rollout.
Regression testing on upgrade: Run your full prompt test suite against the new version before promoting it.
Deprecation monitoring: Subscribe to provider deprecation notices. Alert when a pinned version approaches its deprecation date minus 30 days.
Rollback capability: Maintain the ability to roll back to the previous model version within 5 minutes of detecting a production regression.

10. Latency Considerations

First-token latency vs. total latency: For streaming user-facing responses, TTFT is the metric that drives perceived responsiveness. For batch pipelines, total latency matters more.
Parallel model calls: When a task can be decomposed into independent subtasks, run them in parallel across models simultaneously.
Streaming for user-facing paths: Always stream responses in user-facing applications. Implement streaming-compatible output parsers.
Pre-computation and prompt caching: Use provider-side prompt caching (Anthropic prompt caching, OpenAI cached prompts) for repeated system prompt prefixes. Reduces both cost and latency.
Timeout budgets: Typical production timeouts: 5s for TTFT, 30s for total completion on frontier models. Fail fast and trigger fallback rather than waiting indefinitely.
Latency SLA monitoring: Track p50, p95, and p99 latency per model per task type. Alert on p95 degradation.

11. API Key and Credential Management

Never hardcode API keys: Keys must never appear in source code, configuration files, or log output.
Centralized secrets management: Store all API keys in a dedicated secrets manager (AWS Secrets Manager, HashiCorp Vault, Azure Key Vault, GCP Secret Manager).
Per-environment key isolation: Maintain separate keys for development, staging, and production.
Provider key abstraction: Abstract credentials behind a unified resolver: get_credential("anthropic") — not direct environment variable references. Enables central rotation without code changes.
Key inventory tracking: Maintain an inventory of all API keys: provider, environment, creation date, rotation date, owning team. Review and prune unused keys quarterly.

12. Monitoring and Observability

Structured request/response logging: Log every model call with: timestamp, model ID, task type, prompt token count, completion token count, latency, status code. Do not log raw prompt content without PII scrubbing.
Per-model metrics dashboards: Track request volume, error rate, latency distribution, token usage, and cost per model.
Unified tracing: Use OpenTelemetry distributed tracing spanning multiple model calls within a single user request.
Output quality metrics: Track structured output parse success rate, schema validation pass rate, and fallback parse trigger rate.
Alerting thresholds: Error rate > 1% (critical), p95 latency > 2× baseline (warning), cost > 110% of daily budget (warning), fallback trigger rate > 5% (warning).

Quick Reference: Decision Checklist

When building or reviewing a multi-model system:

[ ] Model selection is documented per task type with rationale
[ ] Prompts are tested against all models in the portfolio
[ ] Model versions are pinned (no floating aliases in production)
[ ] Fallback chains are defined and tested for all critical paths
[ ] Context window budgets are calculated and enforced
[ ] Output schemas are versioned and validated at the parsing layer
[ ] All API keys are stored in a secrets manager, not in code
[ ] Latency SLAs are defined with p95/p99 alerting
[ ] Cost caps are enforced at the routing layer
[ ] Distributed tracing spans all model calls in a request chain
[ ] Regression test suite runs in CI on every prompt or model change

Adoption

bereniketech/multi-model-model-agnostic-platforms

$ install --global

SKILL.md

Multi-Model / Model-Agnostic Platform Skill

1. Model Selection Criteria

2. Cost-Aware Pipeline Patterns

3. Prompt Portability Across Models

4. Handling Model-Specific Capabilities and Limitations

5. Fallback Strategies When a Model Fails

6. Context Window Management Across Models

7. On-Device Models (Apple FoundationModels / iOS 26+)

8. Consistent Output Format Regardless of Model

9. Model Versioning and Upgrades

10. Latency Considerations

11. API Key and Credential Management

12. Monitoring and Observability

Quick Reference: Decision Checklist

Related Skills

bereniketech/anti-reversing-techniques

bereniketech/active-directory-attacks

bereniketech/zeroize-audit

bereniketech/wcag-audit-patterns

bereniketech/multi-model-model-agnostic-platforms

$ install --global

SKILL.md

Multi-Model / Model-Agnostic Platform Skill

1. Model Selection Criteria

2. Cost-Aware Pipeline Patterns

3. Prompt Portability Across Models

4. Handling Model-Specific Capabilities and Limitations

5. Fallback Strategies When a Model Fails

6. Context Window Management Across Models

7. On-Device Models (Apple FoundationModels / iOS 26+)

8. Consistent Output Format Regardless of Model

9. Model Versioning and Upgrades

10. Latency Considerations

11. API Key and Credential Management

12. Monitoring and Observability

Quick Reference: Decision Checklist

Related Skills

bereniketech/anti-reversing-techniques

bereniketech/active-directory-attacks

bereniketech/zeroize-audit

bereniketech/wcag-audit-patterns