Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

lidge-jun/cost-aware-llm-pipeline

Name: cost-aware-llm-pipeline
Author: lidge-jun

cost-aware-llm-pipeline/SKILL.md

npx skillsauth add lidge-jun/cli-jaw-skills cost-aware-llm-pipeline

Cost-Aware LLM Pipeline

Patterns for controlling LLM API costs while maintaining quality. Combines model routing, budget tracking, retry logic, and prompt caching into a composable pipeline.

When to Activate

Building applications that call LLM APIs (Claude, GPT, etc.)
Processing batches of items with varying complexity
Staying within a budget for API spend
Optimizing cost without sacrificing quality on complex tasks

Core Concepts

1. Model Routing by Task Complexity

Automatically select cheaper models for simple tasks, reserving expensive models for complex ones.

MODEL_SONNET = "claude-sonnet-4-6"
MODEL_HAIKU = "claude-haiku-4-5-20251001"

_SONNET_TEXT_THRESHOLD = 10_000  # chars
_SONNET_ITEM_THRESHOLD = 30     # items

def select_model(
    text_length: int,
    item_count: int,
    force_model: str | None = None,
) -> str:
    """Select model based on task complexity."""
    if force_model is not None:
        return force_model
    if text_length >= _SONNET_TEXT_THRESHOLD or item_count >= _SONNET_ITEM_THRESHOLD:
        return MODEL_SONNET  # Complex task
    return MODEL_HAIKU  # Simple task (~3x cheaper per token, see Pricing Reference)

2. Immutable Cost Tracking

Track cumulative spend with frozen dataclasses. Recording a call returns a new tracker instead of mutating the existing one.

from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class CostRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

@dataclass(frozen=True, slots=True)
class CostTracker:
    budget_limit: float = 1.00
    records: tuple[CostRecord, ...] = ()

    def add(self, record: CostRecord) -> "CostTracker":
        """Return new tracker with added record (never mutates self)."""
        return CostTracker(
            budget_limit=self.budget_limit,
            records=(*self.records, record),
        )

    @property
    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.records)

    @property
    def over_budget(self) -> bool:
        return self.total_cost > self.budget_limit

3. Narrow Retry Logic

Retry only on transient errors. Fail fast on authentication or bad request errors.

import time

from anthropic import (
    APIConnectionError,
    InternalServerError,
    RateLimitError,
)

_RETRYABLE_ERRORS = (APIConnectionError, RateLimitError, InternalServerError)
_MAX_RETRIES = 3

def call_with_retry(func, *, max_retries: int = _MAX_RETRIES):
    """Retry only on transient errors, fail fast on others."""
    for attempt in range(max_retries):
        try:
            return func()
        except _RETRYABLE_ERRORS:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
    # AuthenticationError, BadRequestError etc. → raise immediately
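The retry rule can be exercised without network access. The following is a minimal sketch rather than the skill's own code: `TransientError` and `FatalError` are hypothetical stand-ins for the SDK's retryable and non-retryable exceptions, and the `sleep` parameter is an assumption added so the backoff can be swapped out in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable error (e.g. a rate limit)."""


class FatalError(Exception):
    """Hypothetical stand-in for a non-retryable error (e.g. bad credentials)."""


def retry_transient(
    func: Callable[[], T],
    *,
    max_retries: int = 3,
    retryable: tuple[type[BaseException], ...] = (TransientError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying only on the listed exception types."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return func()
        except retryable:
            if attempt == max_retries - 1:
                raise  # Out of attempts: surface the last transient error
            sleep(2 ** attempt)  # Backoff: 1s, 2s, 4s, ...
    raise AssertionError("unreachable")
```

Any exception type not listed in `retryable` propagates on the first attempt, which is the fail-fast behavior described for authentication and bad-request errors.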

4. Prompt Caching

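Mark long, stable prompt prefixes such as system prompts as cacheable. The prefix is still sent with every request, but repeated requests are billed at the reduced cached-input rate and processed faster.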

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # Cache this
            },
            {
                "type": "text",
                "text": user_input,  # Variable part
            },
        ],
    }
]
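The Composition section calls `build_cached_messages(system_prompt, text)` without defining it. One possible definition, assuming it simply wraps the structure shown above in a function (the name comes from the document, the body is a sketch):

```python
from typing import Any


def build_cached_messages(system_prompt: str, user_input: str) -> list[dict[str, Any]]:
    """Build a one-message list with a cacheable prefix and a variable suffix."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": system_prompt,
                    # Marks the end of the prefix that should be cached
                    "cache_control": {"type": "ephemeral"},
                },
                {"type": "text", "text": user_input},  # Changes per request
            ],
        }
    ]
```

Keeping the stable text first and the variable text last matters here: a cached prefix can only be reused when everything up to the cache marker is identical between requests.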

Composition

Combine all four techniques in a single pipeline function:

def process(text: str, config: Config, tracker: CostTracker) -> tuple[Result, CostTracker]:
    # 1. Route model
    model = select_model(len(text), estimated_items, config.force_model)

    # 2. Check budget
    if tracker.over_budget:
        raise BudgetExceededError(tracker.total_cost, tracker.budget_limit)

    # 3. Call with retry + caching
    response = call_with_retry(lambda: client.messages.create(
        model=model,
        max_tokens=1024,  # Required by the Messages API; 1024 is a placeholder
        messages=build_cached_messages(system_prompt, text),
    ))

    # 4. Track cost (immutable)
    record = CostRecord(model=model, input_tokens=..., output_tokens=..., cost_usd=...)
    tracker = tracker.add(record)

    return parse_result(response), tracker

Pricing Reference (June 2026)

| Model | Input ($/1M tokens) | Output ($/1M tokens) | Relative Cost |
|-------|---------------------|----------------------|---------------|
| Haiku 4.5 | $1.00 | $5.00 | 1x |
| Sonnet 4.6 | $3.00 | $15.00 | ~3x |
| Opus 4.8 | $5.00 | $25.00 | ~5x |

Batch API: 50% discount on all models for async batch processing.

Prompt caching: Up to 90% savings on cached input tokens (24-hour retention).
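The prices in this section can be turned into the `cost_usd` value that `CostRecord` expects. The sketch below is an illustration under stated assumptions: `PRICES_PER_MTOK`, `BATCH_DISCOUNT`, and `estimate_cost_usd` are hypothetical names, only the two model IDs used in this document are listed, and prompt-cache discounts are ignored.

```python
# USD per 1M tokens as (input, output), copied from the pricing reference.
# Only the two model IDs that appear in this document are listed.
PRICES_PER_MTOK: dict[str, tuple[float, float]] = {
    "claude-haiku-4-5-20251001": (1.00, 5.00),
    "claude-sonnet-4-6": (3.00, 15.00),
}

BATCH_DISCOUNT = 0.5  # 50% off for Batch API requests, per the note above


def estimate_cost_usd(
    model: str,
    input_tokens: int,
    output_tokens: int,
    *,
    batch: bool = False,
) -> float:
    """Estimate the cost of one call, ignoring prompt-cache discounts."""
    input_price, output_price = PRICES_PER_MTOK[model]
    cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return cost * BATCH_DISCOUNT if batch else cost
```

For a request with 2,000 input tokens and 500 output tokens, this gives about $0.0045 on Haiku and $0.0135 on Sonnet, which matches the 3x ratio in the table.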

Preferred Practices

Start with the cheapest model and route to expensive models only when complexity thresholds are met
Set explicit budget limits before processing batches — fail early rather than overspend
Log model selection decisions so you can tune thresholds based on real data
Use prompt caching for system prompts over 1024 tokens — saves both cost and latency
Retry only on transient failures (network, rate limit, server error) — let auth/validation errors surface immediately
Keep cost tracking immutable for easier debugging and auditing
Use model name constants or config instead of hardcoded strings throughout the codebase
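Logging routing decisions, as recommended above, can be done with a thin wrapper around whatever routing function is in use. This is a sketch only: `select_and_log` and the logger name `model_routing` are hypothetical, and the routing function is passed in as a parameter so the example does not depend on any particular implementation.

```python
import logging
from typing import Callable

logger = logging.getLogger("model_routing")


def select_and_log(
    selector: Callable[[int, int], str],
    text_length: int,
    item_count: int,
) -> str:
    """Run a routing function and log its inputs and the chosen model."""
    model = selector(text_length, item_count)
    logger.info(
        "model_selected model=%s text_length=%d item_count=%d",
        model,
        text_length,
        item_count,
    )
    return model
```

Because the log line carries both inputs and the outcome, the thresholds can later be tuned by checking how often requests near the boundary were routed to each model.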

When to Use

Any application calling Claude, OpenAI, or similar LLM APIs
Batch processing pipelines where cost adds up quickly
Multi-model architectures that need intelligent routing
Production systems that need budget guardrails

Advanced Patterns (2026)

Model Cascading (Cheap-First)

Start with the cheapest model; escalate only when quality is insufficient:

def cascade_call(prompt: str, quality_threshold: float = 0.8) -> tuple[str, str]:
    """Try cheap model first, escalate if quality is low."""
    result = call_model(MODEL_HAIKU, prompt)
    if evaluate_quality(result) >= quality_threshold:
        return result, MODEL_HAIKU
    return call_model(MODEL_SONNET, prompt), MODEL_SONNET
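The same idea extends to more than two tiers. The sketch below is a generalization, not the skill's code: `cascade` is a hypothetical name, and `call_model` and `evaluate_quality` are passed in as parameters because the document does not define them.

```python
from typing import Callable


def cascade(
    prompt: str,
    models: list[str],
    call_model: Callable[[str, str], str],
    evaluate_quality: Callable[[str], float],
    quality_threshold: float = 0.8,
) -> tuple[str, str]:
    """Try models from cheapest to most expensive; return (result, model used)."""
    if not models:
        raise ValueError("models must not be empty")
    result = ""
    for model in models:
        result = call_model(model, prompt)
        if evaluate_quality(result) >= quality_threshold:
            return result, model
    # No model met the threshold: fall back to the last (most capable) answer
    return result, models[-1]
```

An escalated request pays for every attempt. With the 1x and ~3x relative costs from the pricing reference and similar token counts, a Haiku-then-Sonnet cascade costs about 4x when it escalates, so it is cheaper than always calling Sonnet only if fewer than roughly two thirds of requests escalate.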

Circuit Breaker

Hard-stop when budget is exhausted — never silently overspend:

def check_circuit(tracker: CostTracker) -> None:
    """Raise immediately if budget exceeded. Check BEFORE every call."""
    if tracker.total_cost >= tracker.budget_limit:
        raise BudgetExceededError(
            f"Budget exhausted: ${tracker.total_cost:.2f} / ${tracker.budget_limit:.2f}"
        )
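`BudgetExceededError` is raised in several snippets but never defined. One possible definition is sketched below, assuming the two-argument form used in the Composition section; `ensure_within_budget` is a hypothetical helper that takes plain numbers so the example stands alone.

```python
class BudgetExceededError(RuntimeError):
    """Raised when cumulative spend has reached the configured limit."""

    def __init__(self, spent_usd: float, limit_usd: float) -> None:
        self.spent_usd = spent_usd
        self.limit_usd = limit_usd
        super().__init__(f"Budget exhausted: ${spent_usd:.2f} / ${limit_usd:.2f}")


def ensure_within_budget(spent_usd: float, limit_usd: float) -> None:
    """Raise before a call is made if the budget is already used up."""
    if spent_usd >= limit_usd:
        raise BudgetExceededError(spent_usd, limit_usd)
```

Storing the two amounts as attributes lets callers report or log them without parsing the message string.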

Semantic Caching

Cache semantically similar requests to avoid duplicate API calls:

def get_or_call(prompt: str, cache: dict, similarity_threshold: float = 0.95) -> str:
    """Return cached result for semantically similar prompts."""
    for cached_prompt, cached_result in cache.items():
        if semantic_similarity(prompt, cached_prompt) >= similarity_threshold:
            return cached_result
    result = call_model(MODEL_SONNET, prompt)
    cache[prompt] = result
    return result
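`semantic_similarity` is left undefined above. As a placeholder for local experiments, the sketch below uses cosine similarity over word counts. This is lexical rather than semantic matching, named plainly as a swapped-in technique: a real deployment would more likely compare embedding vectors. `lexical_similarity` is a hypothetical name.

```python
import math
from collections import Counter


def lexical_similarity(a: str, b: str) -> float:
    """Cosine similarity over lowercase word counts (a crude stand-in)."""
    counts_a, counts_b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty prompt matches nothing
    # Clamp to 1.0 to absorb floating-point rounding
    return min(1.0, dot / (norm_a * norm_b))
```

With a threshold of 0.95, this stand-in only matches prompts that use almost exactly the same words, so paraphrases still trigger a fresh call. The linear scan over the cache also grows with the number of stored prompts.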

Cost optimization patterns for LLM API usage — model routing by task complexity, budget tracking, retry logic, and prompt caching.

4 stars

development

Updated Jun 9, 2026

npx skillsauth add lidge-jun/cli-jaw-skills cost-aware-llm-pipeline

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy, container and dependency vulnerability scanner: Clean (95%)
Semgrep, static code analysis for vulnerabilities: Clean (95%)
mcp-scan (Snyk), Model Context Protocol security validation: Clean (95%)
Snyk (dep), open source security scanning: Skipped (50%)
Socket.dev, supply chain security analysis: Skipped (50%)
VirusTotal, multi-engine malware detection: Skipped (50%)
CrowdStrike, advanced threat intelligence: Skipped (50%)
OSV-Scanner, Open Source Vulnerability database check: Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Jun 9, 2026, 7:53 AM (180.6s, 1 file scanned)

Related Skills

lidge-jun/codex-imagegen

tools

Verified · Trusted · Community

Use only on the Codex CLI for native image generation or image editing without an API key. Save final PNG files under ~/.cli-jaw/uploads, report web-ready absolute-path markdown, and send to Telegram or Discord only when explicitly requested.

5 · SKILL.md · Updated Jul 10, 2026

lidge-jun/repo-map

tools

Verified · Trusted · Community

Ranked repository structure map via `cli-jaw map`. Use for codebase overview, structure map, symbol overview, unfamiliar codebase exploration, architecture orientation. Triggers: repo map, structure map, codebase overview, 와꾸, project structure, unfamiliar code.

5 · SKILL.md · Updated Jul 7, 2026

lidge-jun/design

tools

Verified · Trusted · Community

cli-jaw Design workspace: create, preview, run, and export design pages from the right sidebar. Covers panel UX, direct-write workflow, artifact lifecycle, wireframe generation, design system, and Open Design adapter.

5 · SKILL.md · Updated Jul 5, 2026

lidge-jun/dev-devops

development

Verified · Trusted · Community

MUST USE for infrastructure and delivery work: container builds, deploy pipelines, Kubernetes, Infrastructure as Code, SRE foundations, edge/serverless, ML infrastructure. Triggers: Dockerfile, K8s manifests, CI/CD pipeline, Terraform/IaC, release/deploy, devops/infra/deploy or release_cd task_tags.

5 · SKILL.md · Updated Jun 19, 2026

Install manually

# Clone the repo
git clone https://github.com/lidge-jun/cli-jaw-skills.git

# Copy into Claude Code skills folder (global), creating it if needed
mkdir -p ~/.claude/skills
cp -r cli-jaw-skills/cost-aware-llm-pipeline ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

lidge-jun/cli-jaw-skills

4 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT