skills/decomposing-reasoning-efficiency/SKILL.md
Analyze and optimize LLM reasoning token efficiency using a multiplicative decomposition framework. Breaks down reasoning performance into completion rate, conditional correctness, verbosity, verbalization overhead, and coupling coefficient. Identifies bottleneck profiles and suggests targeted interventions. Includes trace-quality diagnostics (grounding, repetition, prompt copying). Trigger phrases: "analyze reasoning efficiency", "decompose token usage", "why is this model so verbose", "token budget analysis", "reasoning trace quality", "optimize reasoning tokens"
npx skillsauth add ndpvt-web/arxiv-claude-skills decomposing-reasoning-efficiency

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to analyze and optimize how reasoning tokens are spent (or wasted) in LLM outputs. Rather than treating accuracy as the sole metric, it applies the multiplicative decomposition framework from Kaiser et al. (2026) to separate token efficiency into interpretable factors: whether the model completes within budget, whether completed responses are correct, and how verbosely the model reasons. When task metadata or reasoning traces are available, it goes further -- measuring verbalization overhead per work unit, how overhead scales with difficulty, and whether verbose output reflects genuine reasoning or degenerate looping.
Standard LLM evaluations report a single accuracy number, hiding critical information about where tokens go. A model scoring 80% accuracy might be truncating 15% of its responses (never getting a chance to answer) and getting 94% of completed responses correct -- or it might complete everything but only get 80% right. These two failure modes demand completely different interventions (raise token budget vs. improve reasoning quality), yet accuracy alone cannot distinguish them.
The decomposition framework factors token-level efficiency multiplicatively:
Efficiency = Completion Rate × Conditional Correctness × Verbosity Score
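As a quick arithmetic check of the two 80%-accuracy scenarios described earlier, the sketch below multiplies completion rate by conditional correctness to recover overall accuracy. The function name `overall_accuracy` is illustrative rather than taken from the paper, and the sketch assumes that truncated responses are scored as incorrect.

```python
def overall_accuracy(completion_rate: float, conditional_correctness: float) -> float:
    """Overall accuracy when truncated responses count as wrong (assumption)."""
    return completion_rate * conditional_correctness


# Scenario A: 15% of responses truncated, 94% of completed responses correct.
truncation_limited = overall_accuracy(0.85, 0.94)

# Scenario B: nothing truncated, 80% of responses correct.
quality_limited = overall_accuracy(1.00, 0.80)

print(f"{truncation_limited:.3f} {quality_limited:.3f}")
```

Both scenarios land at roughly the same headline accuracy, which is why the individual factors have to be reported separately.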
When per-instance workload metadata is available (e.g., number of digits in an arithmetic problem, rows in a table), verbosity further decomposes into verbalization overhead (baseline tokens per work unit) and a coupling coefficient (how overhead scales with workload). A model with high overhead but low coupling wastes tokens on boilerplate regardless of difficulty. A model with low overhead but high coupling becomes disproportionately verbose on hard problems.
When reasoning traces are available, three deterministic trace-quality measures distinguish productive verbosity from degenerate output: grounding ratio (does the trace reference problem elements?), repetition ratio (does it loop over the same content?), and prompt-copying ratio (does it parrot the input?). These require no human judges or LLM-as-judge calls -- they are computed directly from text overlap and n-gram statistics.
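The text does not spell out how "problem elements" are extracted for the grounding ratio, so the following is only a minimal sketch under a stated assumption: problem elements are the numbers and the words of at least `min_len` characters in the prompt, and the ratio is the share of those elements that reappear in the trace. The function name `grounding_ratio` and the `min_len` parameter are hypothetical.

```python
import re


def grounding_ratio(prompt: str, trace: str, min_len: int = 4) -> float:
    """Share of the prompt's distinctive tokens that also appear in the trace.

    Assumption: 'problem elements' are numbers plus words of at least
    `min_len` characters; the source does not specify the extraction rule.
    """
    def elements(text: str) -> set[str]:
        tokens = re.findall(r"[a-z]+|\d+(?:\.\d+)?", text.lower())
        return {t for t in tokens if t[0].isdigit() or len(t) >= min_len}

    prompt_elems = elements(prompt)
    if not prompt_elems:
        return 0.0
    return len(prompt_elems & elements(trace)) / len(prompt_elems)


prompt = "A train travels 120 miles in 3 hours. What is its average speed?"
grounded = "The train covers 120 miles in 3 hours, so speed = 120 / 3 = 40."
ungrounded = "Let me think. Let me think again. Hmm, let me reconsider."

print(round(grounding_ratio(prompt, grounded), 3))
print(grounding_ratio(prompt, ungrounded))
```

A trace that keeps returning to the quantities in the problem scores higher than one that only talks about its own thinking. A real pipeline would likely need a task-specific extractor (table cells, variable names, entities).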
Collect structured response data. For each model response, record: (a) the problem/prompt, (b) the full reasoning trace if available, (c) the final answer, (d) the token count, (e) the token budget/limit, and (f) the ground-truth answer. Store as a list of structured records (JSON or dataframe rows).
Compute Completion Rate. For each response, check whether the output was truncated (hit the token limit without producing a final answer). Completion Rate = (number of non-truncated responses) / (total responses). Flag truncated instances for separate analysis.
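A minimal sketch of the truncation check, assuming each record carries hypothetical fields `finish_reason` (optional), `token_count`, `token_budget`, and `final_answer`. The values `"stop"` and `"length"` follow the convention used by OpenAI-style chat APIs; other providers may use different labels. When no finish reason is available, responses that sit exactly at the budget but still contain an answer are marked uncertain, and the completion rate is reported as a range.

```python
def completion_status(record: dict) -> str:
    """Return 'completed', 'truncated', or 'uncertain' for one response."""
    reason = record.get("finish_reason")
    if reason == "length":
        return "truncated"
    if reason == "stop":
        return "completed"
    # No reliable finish_reason: fall back to a budget-based heuristic.
    at_limit = record["token_count"] >= record["token_budget"]
    if at_limit and record["final_answer"] is None:
        return "truncated"
    if at_limit:
        return "uncertain"
    return "completed"


def completion_rate_range(records: list[dict]) -> tuple[float, float]:
    """(low, high) completion rate; uncertain cases count as truncated in low."""
    statuses = [completion_status(r) for r in records]
    n = len(statuses)
    if n == 0:
        return (0.0, 0.0)
    sure = statuses.count("completed")
    unsure = statuses.count("uncertain")
    return (sure / n, (sure + unsure) / n)


records = [
    {"finish_reason": "stop", "token_count": 900, "token_budget": 4096, "final_answer": "42"},
    {"finish_reason": "length", "token_count": 4096, "token_budget": 4096, "final_answer": None},
    {"token_count": 4096, "token_budget": 4096, "final_answer": "7"},
    {"token_count": 1200, "token_budget": 4096, "final_answer": "3"},
]
print(completion_rate_range(records))
```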
Compute Conditional Correctness. Among only the completed (non-truncated) responses, compute accuracy: Conditional Correctness = (correct completed responses) / (total completed responses). This isolates reasoning quality from budget constraints.
Compute raw Verbosity. For each correct, completed response, record the token count. Compute the mean and distribution. Verbosity Score = (reference token count) / (actual mean token count), where the reference can be the minimum observed, a theoretical lower bound, or a cross-model baseline.
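The sketch below shows one of the reference choices named above, the cross-model baseline, where the reference is the lowest mean token count among the models being compared. The function name `verbosity_scores` and the sample counts are illustrative. The input lists are assumed to contain token counts for correct, completed responses only.

```python
from statistics import mean


def verbosity_scores(token_counts_by_model: dict[str, list[int]]) -> dict[str, float]:
    """Verbosity Score = reference / mean tokens, with the leanest model as reference."""
    means = {m: mean(c) for m, c in token_counts_by_model.items() if c}
    if not means:
        return {}
    reference = min(means.values())
    return {m: reference / mu for m, mu in means.items()}


# Illustrative token counts, not measured data.
counts = {"model_a": [1000, 1400, 1200], "model_b": [2000, 2800, 2400]}
scores = verbosity_scores(counts)
print(scores)
```

With this choice the leanest model always scores 1.0 and every other model scores below it, so scores are only comparable within one evaluation run.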
Decompose Verbosity (if workload metadata available). If each problem has a workload proxy (e.g., number of reasoning steps, input size, digit count), fit a linear model: tokens_i = alpha + beta * workload_i. The intercept alpha is the verbalization overhead (fixed cost per response). The slope beta is the coupling coefficient (marginal token cost per work unit). Compare these across models or prompt strategies.
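As a small illustration of the linear fit, the sketch below recovers alpha and beta from synthetic, noise-free data using `numpy.polyfit`. The helper name `fit_overhead_and_coupling` and the data are made up for the example. Real token counts will be noisy, and a linear model may not fit every task.

```python
import numpy as np


def fit_overhead_and_coupling(workloads, tokens) -> tuple[float, float]:
    """Fit tokens = alpha + beta * workload and return (alpha, beta)."""
    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    beta, alpha = np.polyfit(np.asarray(workloads, dtype=float),
                             np.asarray(tokens, dtype=float), deg=1)
    return float(alpha), float(beta)


# Synthetic data generated with alpha = 200 tokens and beta = 15 tokens per unit.
workload = np.array([2, 4, 6, 8, 10], dtype=float)
tokens = 200.0 + 15.0 * workload

alpha, beta = fit_overhead_and_coupling(workload, tokens)
print(round(alpha, 1), round(beta, 2))
```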
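Compute Trace-Quality Measures (if reasoning traces available). For each trace, compute the grounding ratio (how much of the trace references elements of the problem), the repetition ratio (how much of the trace repeats earlier content, measured over n-grams), and the prompt-copying ratio (how much of the trace is copied from the prompt).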
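Classify the bottleneck profile. Based on the decomposition, assign each model/prompt to a profile. The classifier in Example 3 uses truncation-bound, correctness-bound, degenerate-looping, verbosity-bound (high overhead), verbosity-bound (high coupling), and balanced.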
Generate a structured efficiency report. Present findings as a table with columns: Model/Prompt, Completion Rate, Conditional Correctness, Mean Tokens, Verbosity Score, Bottleneck Profile. If trace data is available, add Grounding, Repetition, and Prompt-Copying columns.
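One possible way to render the report is a small formatter that emits a markdown table with the columns listed above. The dictionary keys (`name`, `mean_tokens`, and so on) are hypothetical field names chosen for this sketch. The sample rows reuse the values from the Example 1 table.

```python
COLUMNS = [
    ("Model/Prompt", "name"),
    ("Completion Rate", "completion_rate"),
    ("Conditional Correctness", "conditional_correctness"),
    ("Mean Tokens", "mean_tokens"),
    ("Verbosity Score", "verbosity_score"),
    ("Bottleneck Profile", "bottleneck"),
]


def _cell(value) -> str:
    """Format rates with two decimals and token counts with thousands separators."""
    if isinstance(value, float):
        return f"{value:.2f}"
    if isinstance(value, int):
        return f"{value:,}"
    return str(value)


def format_report(rows: list[dict]) -> str:
    """Render one markdown table row per model or prompt strategy."""
    header = "| " + " | ".join(title for title, _ in COLUMNS) + " |"
    divider = "|" + "|".join("---" for _ in COLUMNS) + "|"
    body = [
        "| " + " | ".join(_cell(row[key]) for _, key in COLUMNS) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])


rows = [
    {"name": "Standard CoT", "completion_rate": 0.88, "conditional_correctness": 0.82,
     "mean_tokens": 2847, "verbosity_score": 0.31, "bottleneck": "Truncation"},
    {"name": "Concise CoT", "completion_rate": 0.97, "conditional_correctness": 0.74,
     "mean_tokens": 1203, "verbosity_score": 0.73, "bottleneck": "Correctness"},
]
table = format_report(rows)
print(table)
```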
Recommend targeted interventions. Based on the bottleneck profile, provide specific, actionable recommendations (not generic "make it better"). Each recommendation should address the identified limiting factor.
Iterate and re-measure. After applying interventions, re-run the decomposition to verify the bottleneck has shifted or resolved. Efficiency improvements should show up as changes in the specific decomposed factor that was targeted.
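A re-measurement can be checked by comparing the decomposed factors before and after an intervention, as in the sketch below. The helper names are illustrative. The `after` values are hypothetical numbers standing in for a re-run with a larger token budget, not measured results.

```python
FACTORS = ("completion_rate", "conditional_correctness", "verbosity_score")


def factor_deltas(before: dict, after: dict) -> dict[str, float]:
    """Change in each decomposed factor between two evaluation runs."""
    return {f: round(after[f] - before[f], 3) for f in FACTORS}


def largest_shift(before: dict, after: dict) -> str:
    """Name of the factor that moved the most in absolute terms."""
    deltas = factor_deltas(before, after)
    return max(deltas, key=lambda f: abs(deltas[f]))


before = {"completion_rate": 0.88, "conditional_correctness": 0.82, "verbosity_score": 0.31}
# Hypothetical re-run after raising the token budget.
after = {"completion_rate": 0.95, "conditional_correctness": 0.81, "verbosity_score": 0.30}

print(factor_deltas(before, after))
print(largest_shift(before, after))
```

If the factor that moved most is not the one the intervention targeted, the bottleneck diagnosis is worth revisiting.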
Example 1: Comparing two prompting strategies on math problems
User: "I'm running GPT-4o on 200 competition math problems with two prompts -- a standard CoT prompt and a concise CoT prompt. Both get ~72% accuracy. Which is better?"
Approach:
Output:
| Strategy | Completion | Cond. Correctness | Mean Tokens | Verbosity Score | Bottleneck |
|--------------|------------|-------------------|-------------|-----------------|--------------|
| Standard CoT | 0.88 | 0.82 | 2,847 | 0.31 | Truncation |
| Concise CoT | 0.97 | 0.74 | 1,203 | 0.73 | Correctness |
Analysis: Standard CoT and Concise CoT both achieve ~72% accuracy, but for
different reasons. Standard CoT truncates 12% of responses (losing easy points)
but gets 82% of completed answers right. Concise CoT rarely truncates but its
compressed reasoning hurts quality (74% conditional correctness).
Recommendation:
- Standard CoT: Increase token budget from 4096 to 6144. Predicted accuracy
gain: ~4-6% from recovered truncated responses.
- Concise CoT: Add 2-3 worked examples to the prompt to improve reasoning
quality. Token savings are real but correctness needs work.
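The predicted gain for Standard CoT can be sanity-checked with a rough estimate. The sketch assumes that responses recovered by a larger budget are answered as accurately as those that already completed, which is optimistic if truncated problems are harder than average. The parameter `recovered_fraction` is a hypothetical knob, since the example does not say how many truncated responses a 6144-token budget would recover.

```python
def recovered_accuracy_gain(completion_rate: float,
                            conditional_correctness: float,
                            recovered_fraction: float) -> float:
    """Estimated accuracy gain from un-truncating part of the truncated responses."""
    truncated = 1.0 - completion_rate
    return truncated * recovered_fraction * conditional_correctness


# Standard CoT from the table: completion 0.88, conditional correctness 0.82.
upper_bound = recovered_accuracy_gain(0.88, 0.82, recovered_fraction=1.0)
half_recovered = recovered_accuracy_gain(0.88, 0.82, recovered_fraction=0.5)

print(f"{upper_bound:.3f} {half_recovered:.3f}")
```

Under these assumptions, recovering about half of the truncated responses yields roughly 5 accuracy points, in line with the 4-6% range quoted above. Recovering all of them would give just under 10 points.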
Example 2: Diagnosing a verbose reasoning model
User: "Our fine-tuned model gets 91% on our internal QA benchmark but costs 3x more tokens than the base model at 89%. Is the extra reasoning worth it?"
Approach:
Output:
| Model | Completion | Cond. Corr. | Overhead (α) | Coupling (β) | Grounding | Repetition |
|------------|------------|-------------|--------------|---------------|-----------|------------|
| Base | 0.99 | 0.90 | 180 tokens | 12 tok/unit | 0.72 | 0.04 |
| Fine-tuned | 0.98 | 0.93 | 620 tokens | 28 tok/unit | 0.68 | 0.19 |
Diagnosis: The fine-tuned model's 3x token cost comes from three sources:
1. Verbalization overhead (α) is 3.4x higher -- it produces ~440 extra tokens
of boilerplate per response regardless of difficulty.
2. Coupling coefficient (β) is 2.3x higher -- it scales worse on hard questions.
3. Repetition ratio of 0.19 suggests ~19% of reasoning content is redundant
looping, while grounding actually decreased vs. base.
The 2-point accuracy gain (91% vs 89%) does not justify the 3x cost. The fine-tuning
introduced verbose habits without proportional reasoning improvement.
Recommendation: Apply repetition penalty during decoding (rep_penalty=1.15).
Add a system instruction: "State each reasoning step once. Do not re-derive
previous conclusions." Expected token reduction: 30-40% with minimal accuracy loss.
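To see how the fitted α and β from the table combine into the overall cost ratio, the sketch below evaluates the linear model tokens = α + β × workload for both models. The workload values are hypothetical, since the example does not state the benchmark's workload distribution.

```python
def expected_tokens(alpha: float, beta: float, workload: float) -> float:
    """Predicted token count under the linear overhead/coupling model."""
    return alpha + beta * workload


BASE = (180.0, 12.0)        # (alpha, beta) from the table
FINE_TUNED = (620.0, 28.0)  # (alpha, beta) from the table


def cost_ratio(workload: float) -> float:
    """Fine-tuned tokens divided by base tokens at a given workload."""
    return expected_tokens(*FINE_TUNED, workload) / expected_tokens(*BASE, workload)


for w in (0, 10, 100):
    print(w, round(cost_ratio(w), 2))
```

Under this model the ratio starts near 3.4x for trivial inputs, where overhead dominates, and drifts toward the 2.3x coupling ratio as workload grows. It equals exactly 3x at a workload of 10 units.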
Example 3: Building an efficiency evaluation harness
User: "I'm building an eval pipeline for our reasoning models. How do I add efficiency metrics alongside accuracy?"
Approach:
Output:
```python
from collections import Counter

import numpy as np


def compute_efficiency_decomposition(results: list[dict]) -> dict:
    """
    Each result dict must have:
      - 'completed': bool (not truncated)
      - 'correct': bool (final answer matches ground truth)
      - 'token_count': int (total reasoning + answer tokens)
      - 'workload': float (optional, task difficulty proxy)
      - 'trace': str (optional, full reasoning text)
      - 'prompt': str (optional, original prompt text)
    """
    n = len(results)
    if n == 0:
        raise ValueError("results must contain at least one record")

    completed = [r for r in results if r['completed']]
    correct_completed = [r for r in completed if r['correct']]

    completion_rate = len(completed) / n
    cond_correctness = len(correct_completed) / len(completed) if completed else 0.0
    mean_tokens = (
        float(np.mean([r['token_count'] for r in correct_completed]))
        if correct_completed else float('inf')
    )

    report = {
        'completion_rate': round(completion_rate, 3),
        'conditional_correctness': round(cond_correctness, 3),
        'mean_tokens_correct': round(mean_tokens, 1),
        'n_total': n,
        'n_completed': len(completed),
        'n_correct_completed': len(correct_completed),
    }

    # Verbosity decomposition (if workload metadata present)
    if len(correct_completed) > 5 and all('workload' in r for r in correct_completed):
        workloads = np.array([r['workload'] for r in correct_completed], dtype=float)
        tokens = np.array([r['token_count'] for r in correct_completed], dtype=float)
        # Linear fit: tokens = alpha + beta * workload
        A = np.vstack([np.ones_like(workloads), workloads]).T
        alpha, beta = np.linalg.lstsq(A, tokens, rcond=None)[0]
        report['verbalization_overhead_alpha'] = round(float(alpha), 1)
        report['coupling_coefficient_beta'] = round(float(beta), 2)

    # Trace quality on a sample of up to 50 completed responses (if traces present).
    # Grounding ratio is not computed here: it needs a task-specific rule for
    # extracting "problem elements" from the prompt.
    sample = completed[:50]
    if sample and all('trace' in r and 'prompt' in r for r in sample):
        rep, cpy = [], []
        for r in sample:
            trace_4grams = _ngrams(r['trace'], 4)
            prompt_4grams = set(_ngrams(r['prompt'], 4))
            trace_set = set(trace_4grams)
            counts = Counter(trace_4grams)
            # Share of distinct trace 4-grams that occur more than once
            rep.append(sum(1 for c in counts.values() if c > 1) / max(len(trace_set), 1))
            # Share of distinct trace 4-grams that also appear in the prompt
            cpy.append(len(trace_set & prompt_4grams) / max(len(trace_set), 1))
        report['repetition_ratio'] = round(float(np.mean(rep)), 3)
        report['prompt_copying_ratio'] = round(float(np.mean(cpy)), 3)

    # Bottleneck classification
    report['bottleneck'] = _classify_bottleneck(report)
    return report


def _ngrams(text: str, n: int) -> list[tuple]:
    """Whitespace-tokenized n-grams; empty if the text has fewer than n tokens."""
    tokens = text.split()
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]


def _classify_bottleneck(report: dict) -> str:
    # Checks run in priority order; the first failing factor names the profile.
    if report['completion_rate'] < 0.90:
        return 'truncation-bound'
    if report['conditional_correctness'] < 0.80:
        return 'correctness-bound'
    if report.get('repetition_ratio', 0) > 0.15:
        return 'degenerate-looping'
    if report.get('verbalization_overhead_alpha', 0) > 500:
        return 'verbosity-bound (high overhead)'
    if report.get('coupling_coefficient_beta', 0) > 25:
        return 'verbosity-bound (high coupling)'
    return 'balanced'
```
When truncation status is not reported directly (e.g., there is no finish_reason field), use a heuristic: responses ending mid-sentence or at exactly the token limit are likely truncated. Flag these as uncertain and report completion rate as a range.

Kaiser, D., Frigessi, A., Ramezani-Kebrya, A., & Ricaud, B. (2026). Decomposing Reasoning Efficiency in Large Language Models. arXiv:2602.09805v1. https://arxiv.org/abs/2602.09805v1
Key insight to look for: Table 2 and Figure 2 showing how accuracy rankings vs. efficiency rankings diverge (Spearman rho=0.63), and the bottleneck profile taxonomy in Section 5 that maps different models to different efficiency interventions.