skills/cost-aware-selection-text-classification/SKILL.md
Guides cost-aware model selection for text classification pipelines, applying multi-objective trade-off analysis (F1 vs cost vs latency) to choose between fine-tuned encoders (BERT/RoBERTa/DistilBERT) and LLM prompting (GPT-4o/Claude). Uses Pareto frontier analysis and a parameterized utility function to recommend the right model for a given deployment regime.
Trigger phrases:
- "Which model should I use for text classification?"
- "Is GPT-4o overkill for my classification task?"
- "Help me pick a cost-effective NLP model"
- "Compare BERT vs LLM for classification cost"
- "Optimize my text classification pipeline for production"
- "Build a cost-aware NLP system"
npx skillsauth add ndpvt-web/arxiv-claude-skills cost-aware-selection-text-classification
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to act as a production NLP architect who recommends the right text classification model by jointly optimizing predictive quality (macro F1), inference cost (USD per million requests), and latency (p50/p95 milliseconds). Rather than defaulting to the largest available LLM, Claude applies the multi-objective framework from Valdes Gonzalez (2026) — using Pareto frontier projections and a parameterized utility function — to match a model to the user's actual deployment constraints: interactive latency budgets, batch throughput targets, or monthly cost ceilings.
The core insight is that model selection for text classification should be treated as a multi-objective optimization problem across three axes: F1 score, inference cost, and latency. The paper demonstrates that fine-tuned BERT-family encoders (BERT, RoBERTa, DistilBERT) achieve F1 scores within 0-3 points of GPT-4o and Claude Sonnet 4.5 on standard benchmarks (IMDB, SST-2, AG News, DBPedia) while costing 30-250x less and running 2-10x faster. For example, DistilBERT classifies SST-2 at $5.19/1M requests with 98ms p50 latency; GPT-4o zero-shot costs $192.48/1M with 377ms p50 — and Claude zero-shot costs $326.67/1M with 1394ms p50.
The decision framework uses a parameterized utility function: U(F1, Cost, Latency) = F1 / Cost * exp(-Latency_p50 / τ), where τ is a latency tolerance parameter. Three deployment regimes are defined: latency-sensitive (τ=250ms, interactive UIs), balanced (τ=500ms, standard production APIs), and latency-tolerant (τ=1000ms, batch/async jobs). Under all three regimes, fine-tuned encoders dominate the Pareto frontier for classification tasks with fixed label spaces. LLMs only become competitive when the label space is evolving, training data is unavailable, or the task requires open-ended reasoning beyond pattern matching.
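The utility function and the three regimes can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function and dictionary names are hypothetical, and it assumes F1 is expressed in percent, cost in USD per 1M requests, and latency in milliseconds. The sample inputs combine the ~93.5% F1 from Example 1 with the DistilBERT SST-2 cost and latency figures quoted above.

```python
import math

# Latency tolerance tau (ms) for the three deployment regimes described above.
REGIME_TAU_MS = {
    "latency_sensitive": 250.0,   # interactive UIs
    "balanced": 500.0,            # standard production APIs
    "latency_tolerant": 1000.0,   # batch / async jobs
}


def utility(f1: float, cost_per_million: float,
            latency_p50_ms: float, tau_ms: float) -> float:
    """U = F1 / Cost * exp(-Latency_p50 / tau)."""
    if cost_per_million <= 0 or tau_ms <= 0:
        raise ValueError("cost and tau must be positive")
    return f1 / cost_per_million * math.exp(-latency_p50_ms / tau_ms)


# Assumed units: F1 in percent, cost in USD per 1M requests, latency in ms.
distilbert_u = {
    regime: utility(93.5, 5.19, 98.0, tau)
    for regime, tau in REGIME_TAU_MS.items()
}

for regime, score in distilbert_u.items():
    print(f"{regime}: U = {score:.2f}")
```

Because the latency penalty is exponential in Latency_p50 / τ, the same model scores lower under a smaller τ, so slow models are penalized most in the latency-sensitive regime.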
Few-shot prompting deserves special scrutiny: it typically improves F1 by only 0.5-3.5 points over zero-shot while doubling or quadrupling cost (token counts increase 2-4x). The paper concludes few-shot is "a budgeted choice: justified under explicit cost-latency constraints rather than adopted as default."
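One way to make the few-shot trade-off concrete is to price each extra F1 point. The sketch below is an illustration only: the helper name is hypothetical, and pairing the 2x multiplier with the 3.5-point gain (and 4x with 0.5 points) is an assumption used to bracket the range, not a result from the paper. The zero-shot cost is the GPT-4o SST-2 figure quoted earlier.

```python
def marginal_cost_per_f1_point(zero_shot_cost: float,
                               cost_multiplier: float,
                               f1_gain_points: float) -> float:
    """Extra USD per 1M requests paid for each F1 point gained by few-shot."""
    if f1_gain_points <= 0:
        raise ValueError("f1_gain_points must be positive")
    extra_cost = zero_shot_cost * (cost_multiplier - 1.0)
    return extra_cost / f1_gain_points


GPT4O_ZERO_SHOT_COST = 192.48  # USD per 1M requests (SST-2 figure above)

# Assumed pairings that bracket the quoted ranges (2-4x cost, 0.5-3.5 points).
best_case = marginal_cost_per_f1_point(GPT4O_ZERO_SHOT_COST, 2.0, 3.5)
worst_case = marginal_cost_per_f1_point(GPT4O_ZERO_SHOT_COST, 4.0, 0.5)

print(f"best case:  ${best_case:.2f} per F1 point per 1M requests")
print(f"worst case: ${worst_case:.2f} per F1 point per 1M requests")
```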
Characterize the classification task. Determine: Is the label space fixed or evolving? How many classes? Is labeled training data available (or obtainable)? What is the expected request volume per month? This determines whether fine-tuning is feasible.
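Define deployment constraints. Ask the user for their deployment regime: latency-sensitive (τ=250ms, interactive UIs), balanced (τ=500ms, standard production APIs), or latency-tolerant (τ=1000ms, batch/async jobs).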
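Estimate cost at scale. Calculate monthly cost from per-1M-request reference points such as the SST-2 figures above: $5.19 for fine-tuned DistilBERT, $192.48 for GPT-4o zero-shot, and $326.67 for Claude zero-shot.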
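The monthly estimate is a straight scaling of the per-1M-request price by volume. The sketch below uses the three SST-2 reference costs quoted earlier in this document; the dictionary keys and function names are hypothetical, and real costs will vary with prompt length, hardware, and provider pricing.

```python
# USD per 1M requests, taken from the SST-2 figures quoted in this document.
REFERENCE_COST_PER_MILLION = {
    "distilbert_finetuned": 5.19,
    "gpt4o_zero_shot": 192.48,
    "claude_zero_shot": 326.67,
}


def monthly_cost_usd(cost_per_million_usd: float, requests_per_month: int) -> float:
    """Scale a per-1M-request price to a monthly request volume."""
    return cost_per_million_usd * requests_per_month / 1_000_000


def monthly_costs(requests_per_month: int) -> dict:
    """Monthly cost for every reference model at the given volume."""
    return {
        name: monthly_cost_usd(price, requests_per_month)
        for name, price in REFERENCE_COST_PER_MILLION.items()
    }


def models_within_budget(requests_per_month: int, budget_usd: float) -> list:
    """Names of reference models whose monthly cost fits under the ceiling."""
    costs = monthly_costs(requests_per_month)
    return [name for name, cost in costs.items() if cost <= budget_usd]


if __name__ == "__main__":
    for name, cost in monthly_costs(2_000_000).items():
        print(f"{name}: ${cost:,.2f}/month")
```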
Benchmark F1 on the user's domain. If the user has a dataset, recommend fine-tuning a BERT-family encoder as the baseline. Use AdamW with weight decay, up to 4 epochs, selecting the checkpoint that maximizes F1_val - |F1_val - F1_train| (the paper's generalization-aware metric). Run 3 seeds. Simultaneously, test GPT-4o zero-shot with temperature=0.0 and top_p=1.0 for a prompt-based ceiling.
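The checkpoint rule above penalizes the gap between training and validation F1. A minimal sketch follows, assuming each checkpoint is recorded as a (name, train F1, validation F1) tuple; the checkpoint values are made-up placeholders, not benchmark results.

```python
def generalization_score(f1_val: float, f1_train: float) -> float:
    """F1_val - |F1_val - F1_train|: rewards validation F1, penalizes the train/val gap."""
    return f1_val - abs(f1_val - f1_train)


def select_checkpoint(checkpoints):
    """Pick the (name, f1_train, f1_val) tuple with the highest score."""
    return max(checkpoints, key=lambda c: generalization_score(c[2], c[1]))


# Hypothetical per-epoch results for one seed: (name, f1_train, f1_val).
checkpoints = [
    ("epoch1", 0.880, 0.870),
    ("epoch2", 0.930, 0.910),
    ("epoch3", 0.970, 0.915),
    ("epoch4", 0.990, 0.912),
]

best = select_checkpoint(checkpoints)
print(best[0])
```

In this made-up run, epoch3 has the highest raw validation F1, but its larger train/validation gap lowers its score, so the rule prefers epoch2.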
Construct the Pareto frontier. Plot each candidate model on three 2D projections: (F1 vs Cost), (F1 vs Latency), (Cost vs Latency). A model is Pareto-optimal if no other model is at least as good on all three metrics and strictly better on at least one. Discard dominated models.
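Filtering dominated models can be sketched as follows. The code uses the standard dominance test (no worse on every metric, strictly better on at least one). Cost and latency come from the SST-2 figures quoted earlier; the F1 values are hypothetical placeholders chosen only to show the mechanics, and the class and function names are illustrative.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    f1: float          # higher is better
    cost: float        # USD per 1M requests, lower is better
    latency: float     # p50 ms, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    no_worse = a.f1 >= b.f1 and a.cost <= b.cost and a.latency <= b.latency
    strictly_better = a.f1 > b.f1 or a.cost < b.cost or a.latency < b.latency
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# F1 values below are placeholders, NOT figures from the paper.
candidates = [
    Candidate("distilbert", 91.0, 5.19, 98.0),
    Candidate("gpt4o_zero_shot", 94.0, 192.48, 377.0),
    Candidate("claude_zero_shot", 93.0, 326.67, 1394.0),
]

front = pareto_front(candidates)
print([c.name for c in front])
```

With these placeholder F1 values, the Claude entry is dropped because the GPT-4o entry is better on all three axes; the other two remain because each wins on at least one axis.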
Compute the utility score. For each Pareto-optimal model, compute U = F1 / Cost * exp(-Latency_p50 / τ) using the user's τ. Rank by U. The highest-U model is the recommended default.
Design the hybrid architecture (if applicable). If LLMs score higher F1 on a meaningful subset of ambiguous inputs, recommend a two-tier system: route the bulk through the encoder, and escalate low-confidence predictions (below a calibrated threshold) to the LLM for a second opinion. This captures LLM quality on the hard tail while keeping average cost near encoder levels.
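The escalation logic reduces to a confidence check on the encoder's prediction. The sketch below assumes the encoder returns a (label, confidence) pair and the LLM returns a label; both callables, the stub implementations, and the 0.85 threshold are hypothetical and should be replaced by a real model, a real API client, and a threshold calibrated on validation data.

```python
from typing import Callable, Tuple


def classify_two_tier(
    text: str,
    encoder_predict: Callable[[str], Tuple[str, float]],
    llm_classify: Callable[[str], str],
    threshold: float = 0.85,
) -> Tuple[str, str]:
    """Return (label, tier). Escalate to the LLM when encoder confidence is low."""
    label, confidence = encoder_predict(text)
    if confidence >= threshold:
        return label, "encoder"
    return llm_classify(text), "llm"


def demo_encoder(text: str) -> Tuple[str, float]:
    # Stand-in for a fine-tuned encoder: confident only on obvious inputs.
    if "great" in text:
        return "positive", 0.97
    return "neutral", 0.55


def demo_llm(text: str) -> str:
    # Stand-in for an LLM API call.
    return "negative"


print(classify_two_tier("great product", demo_encoder, demo_llm))
print(classify_two_tier("well, it arrived", demo_encoder, demo_llm))
```

Returning the tier alongside the label makes it straightforward to log the escalation rate, which drives the blended cost.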
Specify the deployment stack. For encoders, recommend containerized inference on Google Cloud Run (or equivalent) with the model in an OCI image stored in Artifact Registry. For LLMs, use the provider's API with structured output parsing. Include retry/fallback logic.
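Retry/fallback logic for the LLM path might look like the sketch below. It is provider-agnostic: `primary` and `fallback` are hypothetical callables standing in for an API client and a backup classifier (for example, the encoder). Catching a bare `Exception` is a simplification; production code would catch the provider's specific transient error types.

```python
import time
from typing import Callable


def call_with_retry(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    text: str,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try primary with exponential backoff, then use fallback if all attempts fail."""
    for attempt in range(max_attempts):
        try:
            return primary(text)
        except Exception:
            # Back off before the next attempt: base, 2x base, 4x base, ...
            if attempt < max_attempts - 1:
                sleep(base_delay_s * 2 ** attempt)
    return fallback(text)
```

The `sleep` parameter is injectable so the backoff can be skipped in tests.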
Set up monitoring. Track per-model F1 on a held-out production sample, p50/p95 latency, and cumulative cost. Set alerts if F1 drops (data drift) or cost exceeds the monthly budget.
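A monitoring check over these metrics can be as simple as comparing a periodic snapshot against thresholds. The sketch below is illustrative: the `Snapshot` fields, alert names, and threshold parameters are hypothetical, and the p95 latency alert is an added assumption beyond the F1 and cost alerts named above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    f1: float                   # F1 on the held-out production sample
    p95_latency_ms: float
    cumulative_cost_usd: float  # month-to-date spend


def check_alerts(
    snapshot: Snapshot,
    baseline_f1: float,
    max_f1_drop: float,
    monthly_budget_usd: float,
    p95_budget_ms: float,
) -> List[str]:
    """Return the names of all alerts triggered by this snapshot."""
    alerts = []
    if baseline_f1 - snapshot.f1 > max_f1_drop:
        alerts.append("f1_drop")  # possible data drift
    if snapshot.cumulative_cost_usd > monthly_budget_usd:
        alerts.append("cost_over_budget")
    if snapshot.p95_latency_ms > p95_budget_ms:
        alerts.append("p95_latency_over_budget")  # assumed extra alert
    return alerts
```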
Document the decision. Produce a model selection report showing the Pareto chart, utility scores, and the rationale for the chosen model — so the team can revisit when costs or requirements change.
Example 1: Sentiment analysis API for a SaaS product
User: "I need to classify customer feedback as positive/negative/neutral. We get about 2M messages per month. Currently using GPT-4o with few-shot prompting and it's costing us a fortune."
Approach:
Output:
RECOMMENDATION: Fine-tune DistilBERT
- F1: ~93.5% (vs 90.5% GPT-4o few-shot)
- Cost: $10/month (vs $775/month) — 75x savings
- Latency: 98ms p50 (vs 377ms) — 3.8x faster
- Utility score (τ=500): 14.8 (vs 0.19 for GPT-4o FS)
- Action: Collect 5K labeled samples, fine-tune with AdamW, deploy on Cloud Run
Example 2: Topic classification at startup scale
User: "We're building a news aggregator and need to classify articles into 4 categories (World, Sports, Business, Sci/Tech). We have no labeled data yet. Volume is low — maybe 10K articles/month to start."
Approach:
Output:
PHASE 1 (NOW, <50K/month):
Model: GPT-4o zero-shot
Cost: ~$3-14/month — acceptable at this scale
F1: ~88% — good enough for MVP
PHASE 2 (WHEN volume >100K/month):
Model: Fine-tuned RoBERTa
Cost: $1/month (vs $28+ for GPT-4o) — 28x savings
F1: ~94.6% — 7 point improvement over LLM
Bootstrap: Use Phase 1 GPT-4o outputs as training labels
Example 3: Hybrid architecture for ambiguous content moderation
User: "We do content moderation — classify posts into 14 categories. Some categories are clear-cut, others are nuanced. We need high accuracy on the ambiguous ones."
Approach:
Output:
HYBRID ARCHITECTURE:
Tier 1 (80% of traffic): Fine-tuned BERT
- Cost: $10.77/1M × 0.8 = $8.62/1M
- Latency: 203ms p50
Tier 2 (20% escalated): Claude zero-shot
- Cost: $751.89/1M × 0.2 = $150.38/1M
Blended cost: $159/1M (vs $751.89/1M pure LLM — 4.7x savings)
Blended F1: Higher than either model alone (encoder handles easy, LLM handles hard)
Confidence threshold: Tune on validation set to balance cost vs accuracy
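The blended-cost arithmetic in this example can be reproduced with a short helper. The function name and the `encoder_sees_all_traffic` flag are hypothetical. By default the sketch mirrors the example, which weights encoder cost by the 80% of traffic it resolves; if every request is scored by the encoder before escalation, setting the flag charges the encoder for all traffic instead.

```python
def blended_cost_per_million(
    encoder_cost: float,
    llm_cost: float,
    escalation_rate: float,
    encoder_sees_all_traffic: bool = False,
) -> float:
    """Blended USD per 1M requests for a two-tier encoder + LLM system."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    encoder_share = 1.0 if encoder_sees_all_traffic else 1.0 - escalation_rate
    return encoder_cost * encoder_share + llm_cost * escalation_rate


BERT_COST = 10.77      # USD per 1M requests, from the example above
CLAUDE_COST = 751.89   # USD per 1M requests, from the example above

blended = blended_cost_per_million(BERT_COST, CLAUDE_COST, 0.20)
savings = CLAUDE_COST / blended

print(f"blended: ${blended:.2f}/1M, {savings:.1f}x cheaper than pure LLM")
```

Since the LLM term dominates the total, the escalation rate is the main cost lever, which is why the confidence threshold deserves careful tuning.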
Use F1_val - |F1_val - F1_train| for checkpoint selection when fine-tuning encoders: it selects checkpoints that generalize rather than overfit.
The utility function U = F1 / Cost * exp(-Latency/τ) is a useful heuristic but may not capture all production concerns (e.g., cold-start latency, rate limits, model update cadence, compliance requirements).
Paper: Valdes Gonzalez, A. (2026). Cost-Aware Model Selection for Text Classification: Multi-Objective Trade-offs Between Fine-Tuned Encoders and LLM Prompting in Production. arXiv:2602.06370v1. https://arxiv.org/abs/2602.06370v1
What to look for: Tables 2-5 for per-model F1/cost/latency numbers across all four benchmarks; Figures 4-6 for Pareto frontier visualizations; Section 4.3 for the utility function formulation and deployment regime analysis; Section 5 for the hybrid architecture recommendations.