skills/creditaudit-2textnd-dimension-evaluation/SKILL.md
Evaluate and select LLMs using CreditAudit's 2D framework: mean ability plus stability risk (fluctuation) across system prompt variations. Assigns credit grades (AAA–BBB) to models based on performance volatility. Use when: 'compare models for deployment', 'which LLM is most stable', 'evaluate model robustness to prompt changes', 'credit grade these models', 'model selection for agentic pipeline', 'rank models by reliability'.
npx skillsauth add ndpvt-web/arxiv-claude-skills creditaudit-2textnd-dimension-evaluation
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
CreditAudit enables Claude to evaluate and compare language models not just by average benchmark scores, but by a second critical dimension: stability under routine system prompt variation. Based on the paper "CreditAudit: 2nd Dimension for LLM Evaluation and Selection" (arXiv:2602.02515v2), this skill implements a deployment-oriented credit audit framework that tests models across semantically aligned prompt templates, computes mean ability (mu) and scenario-induced fluctuation (sigma), and maps volatility into interpretable credit grades from AAA (most stable) to BBB (most volatile). This directly addresses the real-world problem where leaderboard-similar models behave very differently when system prompts, output protocols, or interaction modes shift during production use.
CreditAudit treats model evaluation as a 2D problem: the X-axis is mean ability (mu) — average performance across prompt scenarios — and the Y-axis is fluctuation (sigma) — standard deviation of performance across those same scenarios. Two models with identical mean scores can land in entirely different risk categories if one is stable (sigma=0.5) while the other swings wildly (sigma=3.0). This matters because in agentic pipelines, a model that occasionally fails badly under minor prompt rewording will cause compounding downstream failures.
The framework constructs a family of semantically aligned, non-adversarial system prompt templates (typically 8–10 variants) that represent routine protocol variations practitioners actually encounter: "output only the option letter," "be concise," "think step by step," "be cautious," format-constrained variants, etc. These are not adversarial jailbreaks — they are the mundane prompt rewrites that happen during normal iteration. Each model is evaluated on the same question set under every template, producing a model x template x benchmark score cube.
Fluctuation sigma is then mapped to credit grades using cross-model quantile thresholds: AAA (sigma <= q0.25, most stable), AA (q0.25 < sigma <= q0.50), A (q0.50 < sigma <= q0.75), and BBB (sigma > q0.75, most volatile). A scenario neutrality diagnostic checks that the templates do not introduce systematic difficulty bias, so that the observed fluctuation can be read as model-specific sensitivity rather than template difficulty. Selection then follows regime-specific rules: for agentic or other high-failure-cost settings, prioritize low sigma first; for single-shot controlled deployments, mean score can weigh more heavily.
Define the evaluation task set. Select or sample a fixed set of questions from one or more benchmarks relevant to the deployment (e.g., domain-specific QA, coding tasks, reasoning problems). Use a fixed random seed for reproducibility. Aim for 100–500 questions per benchmark.
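The fixed-seed sampling in this step could be sketched as follows. This is a minimal illustration, assuming question ids are available as a list; `sample_questions` and the id format are hypothetical names, not part of CreditAudit.

```python
import random

def sample_questions(question_ids, k, seed=0):
    """Draw a reproducible sample of k question ids (hypothetical helper)."""
    rng = random.Random(seed)  # private RNG: global random state is untouched
    # Sort first so the sample does not depend on the input ordering.
    return rng.sample(sorted(question_ids), k)

pool = [f"q{i:04d}" for i in range(2000)]  # made-up ids for illustration
subset = sample_questions(pool, k=200, seed=42)
```

Reusing the same `seed` and the same pool should give every model the identical question subset, which keeps the later per-template comparisons like-for-like.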
Construct 8–10 semantically aligned system prompt templates. Each template should express a different but realistic protocol intent: bare-minimum instruction, concise output, verbose reasoning, structured JSON output, cautious hedging, step-by-step chain-of-thought, role-play framing, format-constrained (e.g., "answer with only A/B/C/D"), etc. Crucially, the same template index must express the same intent across all benchmarks.
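One way to enforce the "same index, same intent" requirement is a small consistency check over the template families. The sketch below assumes each benchmark's templates are stored as an ordered list of `(intent, prompt_text)` pairs; that layout, the intent labels, and `intents_aligned` are illustrative assumptions.

```python
def intents_aligned(families):
    """families: {benchmark: [(intent, prompt_text), ...]} in template-index order.

    Returns True when every benchmark lists the same intents at the same indices.
    """
    intent_rows = [[intent for intent, _ in family] for family in families.values()]
    return all(row == intent_rows[0] for row in intent_rows)

# Illustrative data only: two benchmarks, three template slots each.
families = {
    "qa": [
        ("baseline", "Answer the question."),
        ("concise", "Be concise."),
        ("cot", "Think step by step."),
    ],
    "coding": [
        ("baseline", "Solve the task."),
        ("concise", "Be concise. Return only the code."),
        ("cot", "Think step by step, then write the code."),
    ],
}
ok = intents_aligned(families)
```

The prompt wording may differ per benchmark; only the intent at each index is compared.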
Run each model against every (template, question) pair. For M models, T templates, and N questions, this produces M x T x N raw responses. Extract the answer from each response and compute accuracy per (model, template, benchmark) cell.
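The per-cell accuracy computation might look like the sketch below. It assumes answer extraction and correctness checking have already happened, so each record carries a boolean; the record layout and `cell_accuracy` are hypothetical.

```python
from collections import defaultdict

def cell_accuracy(records):
    """records: iterable of (model, template, benchmark, is_correct).

    Returns {(model, template, benchmark): accuracy in percent}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for model, template, benchmark, is_correct in records:
        key = (model, template, benchmark)
        totals[key] += 1
        hits[key] += int(bool(is_correct))
    return {key: 100.0 * hits[key] / totals[key] for key in totals}

# Made-up records for illustration.
records = [
    ("m1", "Temp00", "qa", True),
    ("m1", "Temp00", "qa", False),
    ("m1", "Temp01", "qa", True),
]
acc = cell_accuracy(records)
```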
Compute per-model aggregate scores. For each model m and template t, compute the equal-weight average score across benchmarks: S(m,t). Then compute mean ability: mu_m = (1/T) * sum(S(m,t) for t in templates) and fluctuation: sigma_m = sqrt((1/(T-1)) * sum((S(m,t) - mu_m)^2 for t in templates)).
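The formulas in this step translate directly to the Python standard library, since `statistics.stdev` uses the same 1/(T-1) sample normalization. The nested-dict layout and function names below are assumptions made for the sketch, and the scores are made up.

```python
from statistics import mean, stdev

def template_scores(cube_for_model):
    """cube_for_model: {template: {benchmark: score}} -> {template: S(m,t)}.

    S(m,t) is the equal-weight average over benchmarks.
    """
    return {t: mean(list(by_bench.values())) for t, by_bench in cube_for_model.items()}

def ability_and_fluctuation(scores_by_template):
    """Return (mu_m, sigma_m) for one model; needs at least two templates."""
    values = list(scores_by_template.values())
    return mean(values), stdev(values)  # stdev divides by T-1

# Illustrative scores for a single model under three templates.
model_cube = {
    "Temp00": {"qa": 80.0, "code": 84.0},
    "Temp01": {"qa": 78.0, "code": 82.0},
    "Temp02": {"qa": 83.0, "code": 85.0},
}
s_mt = template_scores(model_cube)           # 82.0, 80.0, 84.0
mu, sigma = ability_and_fluctuation(s_mt)    # mu = 82.0, sigma = 2.0
```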
Run the scenario neutrality diagnostic. For each template t, compute the cross-model average: S_bar_t = (1/M) * sum(S(m,t) for m in models). Verify the trend across templates is near-flat. If one template is dramatically harder/easier for ALL models, it's introducing difficulty drift rather than measuring model sensitivity — consider removing or rebalancing it.
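A rough version of the neutrality diagnostic is sketched below. The source only asks for a "near-flat" trend, so the cutoff `tol` (in score points) and the choice to compare each template against the median of the template means are assumptions of this sketch, not CreditAudit's definition.

```python
from statistics import mean, median

def neutrality_check(scores, tol=2.0):
    """scores: {model: {template: S(m,t)}}.

    Returns (S_bar_t per template, templates deviating from the median by > tol).
    """
    templates = sorted(next(iter(scores.values())))
    s_bar = {t: mean(scores[m][t] for m in scores) for t in templates}
    center = median(s_bar.values())  # median resists a single outlier template
    flagged = [t for t in templates if abs(s_bar[t] - center) > tol]
    return s_bar, flagged

# Made-up scores: Temp02 is markedly harder for both models.
scores = {
    "m1": {"Temp00": 82.0, "Temp01": 80.0, "Temp02": 70.0},
    "m2": {"Temp00": 78.0, "Temp01": 80.0, "Temp02": 68.0},
}
s_bar, flagged = neutrality_check(scores)
```

Here `flagged` contains only `Temp02`, the template whose cross-model mean sits well below the others; such a template is a candidate for removal or rebalancing.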
Assign credit grades using cross-model quantiles. Compute q0.25, q0.50, q0.75 of sigma across all evaluated models. Map each model: AAA if sigma <= q0.25, AA if sigma <= q0.50, A if sigma <= q0.75, BBB otherwise.
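The grade mapping can be sketched with `statistics.quantiles`. The source does not say which quantile interpolation to use, so `method="inclusive"` is an assumption here, as is the function name. The sigma values are the ones listed in Example 2.

```python
from statistics import quantiles

def credit_grades(sigma_by_model):
    """Map each model's sigma to AAA/AA/A/BBB via cross-model quartiles."""
    q25, q50, q75 = quantiles(list(sigma_by_model.values()), n=4, method="inclusive")

    def grade(sigma):
        if sigma <= q25:
            return "AAA"
        if sigma <= q50:
            return "AA"
        if sigma <= q75:
            return "A"
        return "BBB"

    return {model: grade(s) for model, s in sigma_by_model.items()}, (q25, q50, q75)

sigmas = {"Model A": 0.9, "Model B": 1.1, "Model C": 3.2, "Model D": 1.5, "Model E": 2.7}
grades, thresholds = credit_grades(sigmas)
```

Because the thresholds are relative to the evaluated pool, adding or removing a model can shift another model's grade.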
Plot the 2D evaluation map. Place models on a (mu, sigma) plane. Identify four quadrants: Q1 (high score, low sigma) = safe default; Q2 (lower score, low sigma) = predictable baseline; Q3 (lower score, high sigma) = avoid; Q4 (high score, high sigma) = scenario-fragile, use with caution.
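Quadrant assignment needs a rule for what counts as "high" and "low", which the text does not specify. The sketch below assumes a median split on both axes (mu at or above the median is "high", sigma at or below the median is "low"); the split rule and the function name are assumptions. The input values are taken from Example 1.

```python
from statistics import median

def quadrants(points):
    """points: {model: (mu, sigma)} -> {model: 'Q1'..'Q4'} using a median split."""
    mu_cut = median(mu for mu, _ in points.values())
    sigma_cut = median(sigma for _, sigma in points.values())
    labels = {
        (True, True): "Q1",    # high score, low sigma
        (False, True): "Q2",   # lower score, low sigma
        (False, False): "Q3",  # lower score, high sigma
        (True, False): "Q4",   # high score, high sigma
    }
    return {
        model: labels[(mu >= mu_cut, sigma <= sigma_cut)]
        for model, (mu, sigma) in points.items()
    }

points = {
    "Claude Sonnet": (82.3, 1.1),
    "GPT-4o": (83.1, 2.8),
    "Gemini Pro": (79.5, 1.4),
}
quads = quadrants(points)
```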
Apply regime-specific selection rules. For agentic/multi-step pipelines: filter to AAA/AA grades first, then rank by mu within that tier. For single-shot controlled deployments: rank by mu but flag Q4 models with a stability warning. For cost-sensitive tiers: Q2 models offer predictable behavior at lower capability.
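The two main selection rules can be sketched as a small ranking function. The regime names, the dict keys, and the reading of "rank by mu within that tier" as ranking the whole AAA/AA pool by mu are assumptions; the cost-sensitive rule is not covered. The data comes from Example 2.

```python
def rank_for_regime(models, regime):
    """models: list of dicts with keys name, mu, grade, quadrant.

    Returns [(name, warning_or_None), ...] with the best candidate first.
    """
    by_mu = sorted(models, key=lambda m: m["mu"], reverse=True)
    if regime == "agentic":
        # Stability first: keep only AAA/AA, then order by mean ability.
        return [(m["name"], None) for m in by_mu if m["grade"] in ("AAA", "AA")]
    if regime == "single_shot":
        # Ability first, but flag scenario-fragile (Q4) models.
        return [
            (m["name"], "stability warning" if m["quadrant"] == "Q4" else None)
            for m in by_mu
        ]
    raise ValueError(f"unknown regime: {regime!r}")

models = [
    {"name": "Model A", "mu": 88.2, "grade": "AAA", "quadrant": "Q1"},
    {"name": "Model B", "mu": 87.9, "grade": "AAA", "quadrant": "Q1"},
    {"name": "Model C", "mu": 89.1, "grade": "BBB", "quadrant": "Q4"},
    {"name": "Model D", "mu": 84.3, "grade": "AA", "quadrant": "Q2"},
    {"name": "Model E", "mu": 83.0, "grade": "A", "quadrant": "Q3"},
]
agentic = rank_for_regime(models, "agentic")
single_shot = rank_for_regime(models, "single_shot")
```

Under the agentic rule Model C drops out despite having the highest mean; under the single-shot rule it ranks first but carries a warning.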
Generate the CreditAudit report. Produce a structured summary: model rankings table with mu, sigma, grade, and quadrant; per-benchmark breakdowns of mu and sigma; the scenario neutrality check results; and regime-specific deployment recommendations.
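As a sketch of the rankings-table part of the report only, the function below renders rows as a plain-text table. The column widths and the function name are arbitrary choices for illustration; per-benchmark breakdowns, neutrality results, and recommendations would be appended separately.

```python
def render_rankings(rows):
    """rows: list of (name, mu, sigma, grade, quadrant) -> plain-text table."""
    header = f"{'Model':<12}| {'mu (%)':<7}| {'sigma':<6}| {'Grade':<6}| Quadrant"
    rule = "-" * len(header)
    body = [
        f"{name:<12}| {mu:<7.1f}| {sigma:<6.1f}| {grade:<6}| {quadrant}"
        for name, mu, sigma, grade, quadrant in rows
    ]
    return "\n".join([header, rule, *body])

table = render_rankings([
    ("Model A", 88.2, 0.9, "AAA", "Q1"),
    ("Model D", 84.3, 1.5, "AA", "Q2"),
])
print(table)
```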
Iterate templates if needed. If sigma values cluster too tightly (all models grade AAA), templates may be too similar — add more diverse protocol variations. If all models grade BBB, templates may be too aggressive — verify they're non-adversarial.
Example 1: Comparing models for an agentic coding pipeline
User: "I'm choosing between GPT-4o, Claude Sonnet, and Gemini Pro for a multi-step code generation pipeline. Which is most reliable?"
Approach:
Output:
CreditAudit Report — Agentic Coding Pipeline Selection
═══════════════════════════════════════════════════════
Model           | mu (%)  | sigma | Grade | Quadrant
----------------|---------|-------|-------|---------
Claude Sonnet   | 82.3    | 1.1   | AAA   | Q1
GPT-4o          | 83.1    | 2.8   | BBB   | Q4
Gemini Pro      | 79.5    | 1.4   | AA    | Q2
Quantile thresholds: q25=1.2 q50=1.6 q75=2.4
Scenario Neutrality: PASS (cross-model template means range: 79.1–82.4%)
Regime Recommendation (Agentic/Multi-Step):
→ Deploy Claude Sonnet (AAA grade, Q1). GPT-4o's mean is 0.8 points
  higher, but its sigma of 2.8 means prompt rewording during iteration
  risks disproportionate failures in downstream pipeline steps.
→ Gemini Pro is a viable fallback (AA, predictable baseline).
→ GPT-4o requires prompt-locking if deployed: avoid routine prompt
  iteration without re-testing.
Example 2: Single-shot QA deployment with stability audit
User: "Rank these 5 models for a customer-facing Q&A bot. Accuracy matters most but I need to know which ones are flaky."
Approach:
Output:
CreditAudit Report — Customer QA Bot Selection
═══════════════════════════════════════════════
Model      | mu (%) | sigma | Grade | Quadrant
-----------|--------|-------|-------|---------
Model A    | 88.2   | 0.9   | AAA   | Q1
Model B    | 87.9   | 1.1   | AAA   | Q1
Model C    | 89.1   | 3.2   | BBB   | Q4 ⚠
Model D    | 84.3   | 1.5   | AA    | Q2
Model E    | 83.0   | 2.7   | A     | Q3
Regime Recommendation (Single-Shot, Customer-Facing):
→ Model A or B (both AAA, Q1) are safe defaults.
→ Model C scores highest but is graded BBB: its 89.1% mean hides
  template-specific drops to 83.4%. If your system prompt changes
  during A/B testing or iteration, expect inconsistent behavior.
→ Stability warning for Model C: per-benchmark sigma on TruthfulQA
  is 4.1, and truthfulness degrades significantly under casual-tone
  templates.
Example 3: Designing a CreditAudit template family from scratch
User: "Help me create a set of system prompt templates for evaluating models on my legal document analysis task."
Approach:
Output:
CreditAudit Template Family — Legal Document Analysis
══════════════════════════════════════════════════════
Temp00 (Baseline):
"You are a legal document analysis assistant. Answer the question
based on the provided document."
Temp01 (Format-constrained):
"Answer with only the relevant legal citation and a one-sentence
explanation. No preamble."
Temp02 (Chain-of-thought):
"Analyze the legal document step by step. Identify relevant clauses,
then provide your answer with reasoning."
Temp03 (Cautious):
"You are a careful legal analyst. If the answer is ambiguous, say so.
Err on the side of caution."
Temp04 (Structured output):
"Return your answer as JSON with keys: 'answer', 'confidence',
'relevant_clauses', 'reasoning'."
Temp05 (Role-play):
"You are a senior associate at a law firm reviewing this document
for a partner. Be thorough and precise."
Temp06 (Concise):
"Be brief. Provide only the direct answer to the question."
Temp07 (Ambiguity-tolerant, non-adversarial):
"The user may phrase questions ambiguously. Interpret charitably
and answer the most likely intended question."
Temp08 (Multi-part):
"First state whether the question can be answered from the document.
If yes, provide the answer. If partially, state what's missing."
Alignment check: Each template varies protocol intent (format, tone,
reasoning depth, output structure) while keeping the core task identical.
All templates are non-adversarial — they represent variations a legal
tech team would plausibly iterate through during development.
Paper: "CreditAudit: 2nd Dimension for LLM Evaluation and Selection" — Song et al., 2026. arXiv:2602.02515v2. https://arxiv.org/abs/2602.02515v2
Look for: The model x template x benchmark score cube methodology, the quantile-based grade mapping table, the scenario neutrality diagnostic, and the four-quadrant (mu, sigma) selection framework with regime-specific deployment guidance.