skills/skillxiv-v0.0.2-claude-opus-4.6/coding-comprehension-evaluation/SKILL.md
Evaluate LLM coding capabilities across three dimensions—problem analysis, code implementation, and test validation—to identify specific reasoning gaps and improve model robustness through ensemble approaches.
npx skillsauth add ADu2021/skillXiv coding-comprehension-evaluationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Large language models excel at individual coding tasks, but understanding how they break down problems remains opaque. Standard benchmarks measure only final correctness (pass/fail), hiding whether failures stem from misunderstanding the problem, poor implementation, or inadequate validation. A model might correctly understand a problem but produce buggy code, or implement code correctly yet fail to generate meaningful test cases.
The Coding Triangle framework decomposes coding ability into three interconnected dimensions—Editorial (problem analysis), Code (implementation), and Cases (test generation)—revealing alignment and misalignment between capabilities. This granular view exposes that LLMs make predictable, correlated errors across dimensions and that model ensembles can overcome individual model limitations by increasing solution diversity.
Rather than treating code generation as monolithic, the Coding Triangle treats it as three coupled competencies: (1) Editorial—understanding the problem statement, constraints, and approach, (2) Code—translating approach into correct implementation, and (3) Cases—generating test cases that validate solutions. Models can succeed or fail at each dimension independently. A model might solve a problem correctly but fail to write tests that catch edge cases. Or it might generate test cases that are actually correct solutions but fail to implement efficiently.
The key insight is that self-consistency emerges at the dimension level: if a model confidently solves a problem one way, it tends to make similar mistakes across all three dimensions. This predictability enables targeted improvement—rather than attempting generic "better reasoning," practitioners can identify which dimension causes failure and apply targeted interventions.
The following implements the Coding Triangle evaluation framework for assessing LLM coding abilities across dimensions.
Step 1: Problem Analysis (Editorial Dimension)
This code extracts problem understanding from model outputs and validates completeness against gold standards.
import json
from typing import Dict, List, Tuple
from dataclasses import dataclass
@dataclass
class EditorialAnalysis:
problem_statement: str
constraints: List[str]
input_format: str
output_format: str
approach: str
key_insights: List[str]
class EditorialEvaluator:
def __init__(self):
self.required_components = [
"constraints",
"input_format",
"output_format",
"approach",
"key_insights"
]
def extract_editorial(self, model_output: str) -> EditorialAnalysis:
"""Parse model's problem analysis from text output."""
lines = model_output.split("\n")
analysis = EditorialAnalysis(
problem_statement=model_output[:200],
constraints=[],
input_format="",
output_format="",
approach="",
key_insights=[]
)
current_section = None
for line in lines:
if "constraint" in line.lower():
current_section = "constraints"
elif "input" in line.lower():
current_section = "input_format"
elif "output" in line.lower():
current_section = "output_format"
elif "approach" in line.lower() or "algorithm" in line.lower():
current_section = "approach"
elif "insight" in line.lower():
current_section = "key_insights"
elif current_section and line.strip():
if current_section == "constraints":
analysis.constraints.append(line.strip())
elif current_section == "key_insights":
analysis.key_insights.append(line.strip())
return analysis
def score_editorial(self, analysis: EditorialAnalysis, gold_standard: Dict) -> float:
"""Score completeness of problem analysis."""
score = 0.0
total = len(self.required_components)
if analysis.constraints and any(
gold_constraint in str(analysis.constraints)
for gold_constraint in gold_standard.get("constraints", [])
):
score += 1
if analysis.input_format and gold_standard.get("input_format"):
score += 0.5 if len(analysis.input_format) > 20 else 0
if analysis.output_format and gold_standard.get("output_format"):
score += 0.5 if len(analysis.output_format) > 20 else 0
if analysis.approach and gold_standard.get("expected_approach"):
score += 1
if analysis.key_insights:
score += len(analysis.key_insights) * 0.2
return min(score / total, 1.0)
Step 2: Code Implementation Dimension
This evaluates whether the LLM-generated code correctly solves the problem.
from typing import Callable
import subprocess
import tempfile
class CodeEvaluator:
def __init__(self, language: str = "python"):
self.language = language
def extract_code(self, model_output: str) -> str:
"""Extract executable code from model output."""
lines = model_output.split("\n")
code_lines = []
in_code_block = False
for line in lines:
if "```" in line:
in_code_block = not in_code_block
elif in_code_block:
code_lines.append(line)
return "\n".join(code_lines)
def test_code(self, code: str, test_cases: List[Tuple]) -> Dict:
"""Execute code against test cases and report results."""
results = {
"passed": 0,
"failed": 0,
"errors": [],
"outputs": []
}
with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
f.write(code)
f.flush()
temp_path = f.name
for test_input, expected_output in test_cases:
try:
result = subprocess.run(
["python", temp_path],
input=str(test_input),
capture_output=True,
text=True,
timeout=5
)
actual_output = result.stdout.strip()
expected_output_str = str(expected_output).strip()
if actual_output == expected_output_str:
results["passed"] += 1
else:
results["failed"] += 1
results["outputs"].append({
"input": test_input,
"expected": expected_output_str,
"actual": actual_output
})
except subprocess.TimeoutExpired:
results["failed"] += 1
results["errors"].append(f"Timeout on input: {test_input}")
except Exception as e:
results["failed"] += 1
results["errors"].append(str(e))
import os
os.unlink(temp_path)
return results
def score_code(self, test_results: Dict) -> float:
"""Score code based on test passage rate."""
total = test_results["passed"] + test_results["failed"]
return test_results["passed"] / total if total > 0 else 0.0
Step 3: Test Case Generation (Cases Dimension)
This evaluates the diversity and effectiveness of test cases generated by the model.
class CasesEvaluator:
def __init__(self):
self.case_categories = [
"edge_case",
"boundary_case",
"normal_case",
"stress_case"
]
def extract_test_cases(self, model_output: str) -> List[Tuple]:
"""Extract test cases from model output."""
import re
test_cases = []
lines = model_output.split("\n")
for i, line in enumerate(lines):
if "test" in line.lower() or "case" in line.lower():
# Look for input/output patterns
if i + 1 < len(lines):
match = re.search(r"(\[.*?\]|{.*?})", lines[i + 1])
if match:
test_cases.append(lines[i + 1])
return test_cases
def categorize_test_case(self, test_case: str, problem_constraints: Dict) -> str:
"""Classify test case as edge, boundary, normal, or stress."""
if any(val == 0 or val == 1 for val in test_case if isinstance(val, int)):
return "boundary_case"
if any(str(val) in test_case for val in problem_constraints.get("limits", [])):
return "edge_case"
# Rough heuristic: large inputs are stress cases
if len(str(test_case)) > 100:
return "stress_case"
return "normal_case"
def score_test_coverage(self, test_cases: List[str], problem_constraints: Dict) -> float:
"""Score diversity of test case coverage."""
if not test_cases:
return 0.0
categories = [self.categorize_test_case(tc, problem_constraints) for tc in test_cases]
unique_categories = len(set(categories))
# Higher diversity score
coverage_score = unique_categories / len(self.case_categories)
return min(coverage_score + len(test_cases) * 0.05, 1.0)
Step 4: Integrated Triangle Evaluation and Ensemble
This combines all three dimensions and demonstrates ensemble improvement through diversity.
from dataclasses import dataclass
from typing import List
import numpy as np
@dataclass
class TriangleDimensions:
editorial_score: float
code_score: float
cases_score: float
@property
def overall_score(self) -> float:
return np.mean([self.editorial_score, self.code_score, self.cases_score])
class CodingTriangleEvaluator:
def __init__(self):
self.editorial_eval = EditorialEvaluator()
self.code_eval = CodeEvaluator()
self.cases_eval = CasesEvaluator()
def evaluate_response(self, model_output: str, problem: Dict, test_cases: List[Tuple]) -> TriangleDimensions:
"""Evaluate LLM response across all three dimensions."""
# Editorial dimension
editorial = self.editorial_eval.extract_editorial(model_output)
editorial_score = self.editorial_eval.score_editorial(editorial, problem)
# Code dimension
code = self.code_eval.extract_code(model_output)
test_results = self.code_eval.test_code(code, test_cases)
code_score = self.code_eval.score_code(test_results)
# Cases dimension
generated_cases = self.cases_eval.extract_test_cases(model_output)
cases_score = self.cases_eval.score_test_coverage(generated_cases, problem)
return TriangleDimensions(
editorial_score=editorial_score,
code_score=code_score,
cases_score=cases_score
)
def ensemble_evaluation(self, model_outputs: List[str], problem: Dict, test_cases: List[Tuple]) -> Dict:
"""Evaluate multiple models and combine results via diversity voting."""
all_scores = []
all_codes = []
for output in model_outputs:
scores = self.evaluate_response(output, problem, test_cases)
all_scores.append(scores)
all_codes.append(self.code_eval.extract_code(output))
# Select most diverse solution set
unique_codes = list(set(all_codes))
best_code = None
best_test_rate = 0
for code in unique_codes:
results = self.code_eval.test_code(code, test_cases)
test_rate = self.code_eval.score_code(results)
if test_rate > best_test_rate:
best_test_rate = test_rate
best_code = code
avg_editorial = np.mean([s.editorial_score for s in all_scores])
avg_cases = np.mean([s.cases_score for s in all_scores])
return {
"ensemble_code_score": best_test_rate,
"avg_editorial_score": avg_editorial,
"avg_cases_score": avg_cases,
"solution_diversity": len(unique_codes) / len(all_codes),
"best_solution": best_code
}
Hyperparameters and Configuration
| Parameter | Recommended Value | Range | Notes | |-----------|------------------|-------|-------| | Ensemble Size | 3-5 models | 1-10 | Diminishing diversity returns beyond 5 | | Test Case Diversity Threshold | 0.6-0.8 | 0.4-1.0 | Higher threshold ensures coverage of edge/stress cases | | Editorial Completeness Threshold | 0.7 | 0.5-1.0 | Minimum required components for problem understanding | | Timeout per Test Case | 5 seconds | 1-30 | Catches infinite loops without excessive waiting | | Code Extraction Confidence | 0.8 | 0.5-1.0 | Minimum confidence required to test extracted code |
When to Use
When NOT to Use
Common Pitfalls
Coding Triangle: How Does Large Language Model Understand Code? https://arxiv.org/abs/2507.06138
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.