Coding Triangle: Decomposing LLM Code Understanding into Three Dimensions

Large language models excel at individual coding tasks, but understanding how they break down problems remains opaque. Standard benchmarks measure only final correctness (pass/fail), hiding whether failures stem from misunderstanding the problem, poor implementation, or inadequate validation. A model might correctly understand a problem but produce buggy code, or implement code correctly yet fail to generate meaningful test cases.

The Coding Triangle framework decomposes coding ability into three interconnected dimensions—Editorial (problem analysis), Code (implementation), and Cases (test generation)—revealing alignment and misalignment between capabilities. This granular view exposes that LLMs make predictable, correlated errors across dimensions and that model ensembles can overcome individual model limitations by increasing solution diversity.

Core Concept

Rather than treating code generation as monolithic, the Coding Triangle treats it as three coupled competencies: (1) Editorial—understanding the problem statement, constraints, and approach, (2) Code—translating approach into correct implementation, and (3) Cases—generating test cases that validate solutions. Models can succeed or fail at each dimension independently. A model might solve a problem correctly but fail to write tests that catch edge cases. Or it might generate test cases that are actually correct solutions but fail to implement efficiently.

The key insight is that self-consistency emerges at the dimension level: if a model confidently solves a problem one way, it tends to make similar mistakes across all three dimensions. This predictability enables targeted improvement—rather than attempting generic "better reasoning," practitioners can identify which dimension causes failure and apply targeted interventions.

Architecture Overview

Editorial Dimension: Parse problem statement into constraints, identify input/output specifications, outline solution approach, explain key insights
Code Dimension: Implement solution in target language, ensure boundary condition handling, verify algorithmic efficiency
Cases Dimension: Generate diverse test cases covering edge cases, stress scenarios, and normal inputs; validate test effectiveness
Evaluation Framework: Assess each dimension independently, then analyze relationships—e.g., does Editorial correctness predict Code correctness?
Ensemble Strategy: Combine multiple models' outputs by voting or diversity selection to increase overall solution coverage
Analysis Tools: Measure dimension-level correctness, correlation matrices between dimensions, error clustering analysis, distribution comparison to human solutions

Implementation

The following implements the Coding Triangle evaluation framework for assessing LLM coding abilities across dimensions.

Step 1: Problem Analysis (Editorial Dimension)

This code extracts problem understanding from model outputs and validates completeness against gold standards.

import json
from typing import Dict, List, Tuple
from dataclasses import dataclass

@dataclass
class EditorialAnalysis:
    problem_statement: str
    constraints: List[str]
    input_format: str
    output_format: str
    approach: str
    key_insights: List[str]

class EditorialEvaluator:
    def __init__(self):
        self.required_components = [
            "constraints",
            "input_format",
            "output_format",
            "approach",
            "key_insights"
        ]

    def extract_editorial(self, model_output: str) -> EditorialAnalysis:
        """Parse model's problem analysis from text output."""
        lines = model_output.split("\n")
        analysis = EditorialAnalysis(
            problem_statement=model_output[:200],
            constraints=[],
            input_format="",
            output_format="",
            approach="",
            key_insights=[]
        )

        current_section = None
        for line in lines:
            if "constraint" in line.lower():
                current_section = "constraints"
            elif "input" in line.lower():
                current_section = "input_format"
            elif "output" in line.lower():
                current_section = "output_format"
            elif "approach" in line.lower() or "algorithm" in line.lower():
                current_section = "approach"
            elif "insight" in line.lower():
                current_section = "key_insights"
            elif current_section and line.strip():
                if current_section == "constraints":
                    analysis.constraints.append(line.strip())
                elif current_section == "key_insights":
                    analysis.key_insights.append(line.strip())

        return analysis

    def score_editorial(self, analysis: EditorialAnalysis, gold_standard: Dict) -> float:
        """Score completeness of problem analysis."""
        score = 0.0
        total = len(self.required_components)

        if analysis.constraints and any(
            gold_constraint in str(analysis.constraints)
            for gold_constraint in gold_standard.get("constraints", [])
        ):
            score += 1

        if analysis.input_format and gold_standard.get("input_format"):
            score += 0.5 if len(analysis.input_format) > 20 else 0

        if analysis.output_format and gold_standard.get("output_format"):
            score += 0.5 if len(analysis.output_format) > 20 else 0

        if analysis.approach and gold_standard.get("expected_approach"):
            score += 1

        if analysis.key_insights:
            score += len(analysis.key_insights) * 0.2

        return min(score / total, 1.0)

Step 2: Code Implementation Dimension

This evaluates whether the LLM-generated code correctly solves the problem.

from typing import Callable
import subprocess
import tempfile

class CodeEvaluator:
    def __init__(self, language: str = "python"):
        self.language = language

    def extract_code(self, model_output: str) -> str:
        """Extract executable code from model output."""
        lines = model_output.split("\n")
        code_lines = []
        in_code_block = False

        for line in lines:
            if "```" in line:
                in_code_block = not in_code_block
            elif in_code_block:
                code_lines.append(line)

        return "\n".join(code_lines)

    def test_code(self, code: str, test_cases: List[Tuple]) -> Dict:
        """Execute code against test cases and report results."""
        results = {
            "passed": 0,
            "failed": 0,
            "errors": [],
            "outputs": []
        }

        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
            f.write(code)
            f.flush()
            temp_path = f.name

        for test_input, expected_output in test_cases:
            try:
                result = subprocess.run(
                    ["python", temp_path],
                    input=str(test_input),
                    capture_output=True,
                    text=True,
                    timeout=5
                )

                actual_output = result.stdout.strip()
                expected_output_str = str(expected_output).strip()

                if actual_output == expected_output_str:
                    results["passed"] += 1
                else:
                    results["failed"] += 1
                    results["outputs"].append({
                        "input": test_input,
                        "expected": expected_output_str,
                        "actual": actual_output
                    })

            except subprocess.TimeoutExpired:
                results["failed"] += 1
                results["errors"].append(f"Timeout on input: {test_input}")
            except Exception as e:
                results["failed"] += 1
                results["errors"].append(str(e))

        import os
        os.unlink(temp_path)
        return results

    def score_code(self, test_results: Dict) -> float:
        """Score code based on test passage rate."""
        total = test_results["passed"] + test_results["failed"]
        return test_results["passed"] / total if total > 0 else 0.0

Step 3: Test Case Generation (Cases Dimension)

This evaluates the diversity and effectiveness of test cases generated by the model.

class CasesEvaluator:
    def __init__(self):
        self.case_categories = [
            "edge_case",
            "boundary_case",
            "normal_case",
            "stress_case"
        ]

    def extract_test_cases(self, model_output: str) -> List[Tuple]:
        """Extract test cases from model output."""
        import re
        test_cases = []
        lines = model_output.split("\n")

        for i, line in enumerate(lines):
            if "test" in line.lower() or "case" in line.lower():
                # Look for input/output patterns
                if i + 1 < len(lines):
                    match = re.search(r"(\[.*?\]|{.*?})", lines[i + 1])
                    if match:
                        test_cases.append(lines[i + 1])

        return test_cases

    def categorize_test_case(self, test_case: str, problem_constraints: Dict) -> str:
        """Classify test case as edge, boundary, normal, or stress."""
        if any(val == 0 or val == 1 for val in test_case if isinstance(val, int)):
            return "boundary_case"
        if any(str(val) in test_case for val in problem_constraints.get("limits", [])):
            return "edge_case"
        # Rough heuristic: large inputs are stress cases
        if len(str(test_case)) > 100:
            return "stress_case"
        return "normal_case"

    def score_test_coverage(self, test_cases: List[str], problem_constraints: Dict) -> float:
        """Score diversity of test case coverage."""
        if not test_cases:
            return 0.0

        categories = [self.categorize_test_case(tc, problem_constraints) for tc in test_cases]
        unique_categories = len(set(categories))

        # Higher diversity score
        coverage_score = unique_categories / len(self.case_categories)
        return min(coverage_score + len(test_cases) * 0.05, 1.0)

Step 4: Integrated Triangle Evaluation and Ensemble

This combines all three dimensions and demonstrates ensemble improvement through diversity.

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class TriangleDimensions:
    editorial_score: float
    code_score: float
    cases_score: float

    @property
    def overall_score(self) -> float:
        return np.mean([self.editorial_score, self.code_score, self.cases_score])

class CodingTriangleEvaluator:
    def __init__(self):
        self.editorial_eval = EditorialEvaluator()
        self.code_eval = CodeEvaluator()
        self.cases_eval = CasesEvaluator()

    def evaluate_response(self, model_output: str, problem: Dict, test_cases: List[Tuple]) -> TriangleDimensions:
        """Evaluate LLM response across all three dimensions."""
        # Editorial dimension
        editorial = self.editorial_eval.extract_editorial(model_output)
        editorial_score = self.editorial_eval.score_editorial(editorial, problem)

        # Code dimension
        code = self.code_eval.extract_code(model_output)
        test_results = self.code_eval.test_code(code, test_cases)
        code_score = self.code_eval.score_code(test_results)

        # Cases dimension
        generated_cases = self.cases_eval.extract_test_cases(model_output)
        cases_score = self.cases_eval.score_test_coverage(generated_cases, problem)

        return TriangleDimensions(
            editorial_score=editorial_score,
            code_score=code_score,
            cases_score=cases_score
        )

    def ensemble_evaluation(self, model_outputs: List[str], problem: Dict, test_cases: List[Tuple]) -> Dict:
        """Evaluate multiple models and combine results via diversity voting."""
        all_scores = []
        all_codes = []

        for output in model_outputs:
            scores = self.evaluate_response(output, problem, test_cases)
            all_scores.append(scores)
            all_codes.append(self.code_eval.extract_code(output))

        # Select most diverse solution set
        unique_codes = list(set(all_codes))
        best_code = None
        best_test_rate = 0

        for code in unique_codes:
            results = self.code_eval.test_code(code, test_cases)
            test_rate = self.code_eval.score_code(results)
            if test_rate > best_test_rate:
                best_test_rate = test_rate
                best_code = code

        avg_editorial = np.mean([s.editorial_score for s in all_scores])
        avg_cases = np.mean([s.cases_score for s in all_scores])

        return {
            "ensemble_code_score": best_test_rate,
            "avg_editorial_score": avg_editorial,
            "avg_cases_score": avg_cases,
            "solution_diversity": len(unique_codes) / len(all_codes),
            "best_solution": best_code
        }

Practical Guidance

Hyperparameters and Configuration

| Parameter | Recommended Value | Range | Notes | |-----------|------------------|-------|-------| | Ensemble Size | 3-5 models | 1-10 | Diminishing diversity returns beyond 5 | | Test Case Diversity Threshold | 0.6-0.8 | 0.4-1.0 | Higher threshold ensures coverage of edge/stress cases | | Editorial Completeness Threshold | 0.7 | 0.5-1.0 | Minimum required components for problem understanding | | Timeout per Test Case | 5 seconds | 1-30 | Catches infinite loops without excessive waiting | | Code Extraction Confidence | 0.8 | 0.5-1.0 | Minimum confidence required to test extracted code |

When to Use

Competitive programming evaluation pipelines where deep analysis of model capabilities is needed
Identifying which reasoning dimensions cause LLM failures on coding tasks
Building self-improving code systems that target specific dimensional weaknesses
Creating diverse solution ensembles for more robust code generation
Benchmarking LLM progress on fine-grained coding abilities rather than just pass rates

When NOT to Use

Not suitable when only final code correctness matters (use simpler pass/fail metrics)
Avoid for production code generation systems without extensive validation safeguards
Not recommended for real-time code systems with strict latency requirements (evaluation is slow)
Do not use as a replacement for human code review in production systems

Common Pitfalls

Ignoring distribution shift: Model-generated solutions differ systematically from human solutions. Measure this gap rather than assuming equivalence.
Overweighting ensemble diversity: More models don't automatically improve results. Measure actual improvement and stop at diminishing returns.
Poor code extraction: Fragile regex-based extraction fails on valid code formatting variations. Use robust parsing libraries.
Incomplete test coverage: Generated test cases often miss subtle edge cases. Always include domain expert-designed test cases alongside model-generated ones.
Treating dimensions as independent: Editorial, Code, and Cases dimensions interact. A poor understanding propagates to code quality. Address root problems, not symptoms.

Reference

Coding Triangle: How Does Large Language Model Understand Code? https://arxiv.org/abs/2507.06138

Coding Triangle: Decomposing LLM Code Understanding into Three Dimensions

Core Concept

Architecture Overview

Editorial Dimension: Parse problem statement into constraints, identify input/output specifications, outline solution approach, explain key insights
Code Dimension: Implement solution in target language, ensure boundary condition handling, verify algorithmic efficiency
Cases Dimension: Generate diverse test cases covering edge cases, stress scenarios, and normal inputs; validate test effectiveness
Evaluation Framework: Assess each dimension independently, then analyze relationships—e.g., does Editorial correctness predict Code correctness?
Ensemble Strategy: Combine multiple models' outputs by voting or diversity selection to increase overall solution coverage
Analysis Tools: Measure dimension-level correctness, correlation matrices between dimensions, error clustering analysis, distribution comparison to human solutions

Implementation

The following implements the Coding Triangle evaluation framework for assessing LLM coding abilities across dimensions.

Step 1: Problem Analysis (Editorial Dimension)

This code extracts problem understanding from model outputs and validates completeness against gold standards.

import json
from typing import Dict, List, Tuple
from dataclasses import dataclass

@dataclass
class EditorialAnalysis:
    problem_statement: str
    constraints: List[str]
    input_format: str
    output_format: str
    approach: str
    key_insights: List[str]

class EditorialEvaluator:
    def __init__(self):
        self.required_components = [
            "constraints",
            "input_format",
            "output_format",
            "approach",
            "key_insights"
        ]

    def extract_editorial(self, model_output: str) -> EditorialAnalysis:
        """Parse model's problem analysis from text output."""
        lines = model_output.split("\n")
        analysis = EditorialAnalysis(
            problem_statement=model_output[:200],
            constraints=[],
            input_format="",
            output_format="",
            approach="",
            key_insights=[]
        )

        current_section = None
        for line in lines:
            if "constraint" in line.lower():
                current_section = "constraints"
            elif "input" in line.lower():
                current_section = "input_format"
            elif "output" in line.lower():
                current_section = "output_format"
            elif "approach" in line.lower() or "algorithm" in line.lower():
                current_section = "approach"
            elif "insight" in line.lower():
                current_section = "key_insights"
            elif current_section and line.strip():
                if current_section == "constraints":
                    analysis.constraints.append(line.strip())
                elif current_section == "key_insights":
                    analysis.key_insights.append(line.strip())

        return analysis

    def score_editorial(self, analysis: EditorialAnalysis, gold_standard: Dict) -> float:
        """Score completeness of problem analysis."""
        score = 0.0
        total = len(self.required_components)

        if analysis.constraints and any(
            gold_constraint in str(analysis.constraints)
            for gold_constraint in gold_standard.get("constraints", [])
        ):
            score += 1

        if analysis.input_format and gold_standard.get("input_format"):
            score += 0.5 if len(analysis.input_format) > 20 else 0

        if analysis.output_format and gold_standard.get("output_format"):
            score += 0.5 if len(analysis.output_format) > 20 else 0

        if analysis.approach and gold_standard.get("expected_approach"):
            score += 1

        if analysis.key_insights:
            score += len(analysis.key_insights) * 0.2

        return min(score / total, 1.0)

Step 2: Code Implementation Dimension

This evaluates whether the LLM-generated code correctly solves the problem.

from typing import Callable
import subprocess
import tempfile

class CodeEvaluator:
    def __init__(self, language: str = "python"):
        self.language = language

    def extract_code(self, model_output: str) -> str:
        """Extract executable code from model output."""
        lines = model_output.split("\n")
        code_lines = []
        in_code_block = False

        for line in lines:
            if "```" in line:
                in_code_block = not in_code_block
            elif in_code_block:
                code_lines.append(line)

        return "\n".join(code_lines)

    def test_code(self, code: str, test_cases: List[Tuple]) -> Dict:
        """Execute code against test cases and report results."""
        results = {
            "passed": 0,
            "failed": 0,
            "errors": [],
            "outputs": []
        }

        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
            f.write(code)
            f.flush()
            temp_path = f.name

        for test_input, expected_output in test_cases:
            try:
                result = subprocess.run(
                    ["python", temp_path],
                    input=str(test_input),
                    capture_output=True,
                    text=True,
                    timeout=5
                )

                actual_output = result.stdout.strip()
                expected_output_str = str(expected_output).strip()

                if actual_output == expected_output_str:
                    results["passed"] += 1
                else:
                    results["failed"] += 1
                    results["outputs"].append({
                        "input": test_input,
                        "expected": expected_output_str,
                        "actual": actual_output
                    })

            except subprocess.TimeoutExpired:
                results["failed"] += 1
                results["errors"].append(f"Timeout on input: {test_input}")
            except Exception as e:
                results["failed"] += 1
                results["errors"].append(str(e))

        import os
        os.unlink(temp_path)
        return results

    def score_code(self, test_results: Dict) -> float:
        """Score code based on test passage rate."""
        total = test_results["passed"] + test_results["failed"]
        return test_results["passed"] / total if total > 0 else 0.0

Step 3: Test Case Generation (Cases Dimension)

This evaluates the diversity and effectiveness of test cases generated by the model.

class CasesEvaluator:
    def __init__(self):
        self.case_categories = [
            "edge_case",
            "boundary_case",
            "normal_case",
            "stress_case"
        ]

    def extract_test_cases(self, model_output: str) -> List[Tuple]:
        """Extract test cases from model output."""
        import re
        test_cases = []
        lines = model_output.split("\n")

        for i, line in enumerate(lines):
            if "test" in line.lower() or "case" in line.lower():
                # Look for input/output patterns
                if i + 1 < len(lines):
                    match = re.search(r"(\[.*?\]|{.*?})", lines[i + 1])
                    if match:
                        test_cases.append(lines[i + 1])

        return test_cases

    def categorize_test_case(self, test_case: str, problem_constraints: Dict) -> str:
        """Classify test case as edge, boundary, normal, or stress."""
        if any(val == 0 or val == 1 for val in test_case if isinstance(val, int)):
            return "boundary_case"
        if any(str(val) in test_case for val in problem_constraints.get("limits", [])):
            return "edge_case"
        # Rough heuristic: large inputs are stress cases
        if len(str(test_case)) > 100:
            return "stress_case"
        return "normal_case"

    def score_test_coverage(self, test_cases: List[str], problem_constraints: Dict) -> float:
        """Score diversity of test case coverage."""
        if not test_cases:
            return 0.0

        categories = [self.categorize_test_case(tc, problem_constraints) for tc in test_cases]
        unique_categories = len(set(categories))

        # Higher diversity score
        coverage_score = unique_categories / len(self.case_categories)
        return min(coverage_score + len(test_cases) * 0.05, 1.0)

Step 4: Integrated Triangle Evaluation and Ensemble

This combines all three dimensions and demonstrates ensemble improvement through diversity.

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class TriangleDimensions:
    editorial_score: float
    code_score: float
    cases_score: float

    @property
    def overall_score(self) -> float:
        return np.mean([self.editorial_score, self.code_score, self.cases_score])

class CodingTriangleEvaluator:
    def __init__(self):
        self.editorial_eval = EditorialEvaluator()
        self.code_eval = CodeEvaluator()
        self.cases_eval = CasesEvaluator()

    def evaluate_response(self, model_output: str, problem: Dict, test_cases: List[Tuple]) -> TriangleDimensions:
        """Evaluate LLM response across all three dimensions."""
        # Editorial dimension
        editorial = self.editorial_eval.extract_editorial(model_output)
        editorial_score = self.editorial_eval.score_editorial(editorial, problem)

        # Code dimension
        code = self.code_eval.extract_code(model_output)
        test_results = self.code_eval.test_code(code, test_cases)
        code_score = self.code_eval.score_code(test_results)

        # Cases dimension
        generated_cases = self.cases_eval.extract_test_cases(model_output)
        cases_score = self.cases_eval.score_test_coverage(generated_cases, problem)

        return TriangleDimensions(
            editorial_score=editorial_score,
            code_score=code_score,
            cases_score=cases_score
        )

    def ensemble_evaluation(self, model_outputs: List[str], problem: Dict, test_cases: List[Tuple]) -> Dict:
        """Evaluate multiple models and combine results via diversity voting."""
        all_scores = []
        all_codes = []

        for output in model_outputs:
            scores = self.evaluate_response(output, problem, test_cases)
            all_scores.append(scores)
            all_codes.append(self.code_eval.extract_code(output))

        # Select most diverse solution set
        unique_codes = list(set(all_codes))
        best_code = None
        best_test_rate = 0

        for code in unique_codes:
            results = self.code_eval.test_code(code, test_cases)
            test_rate = self.code_eval.score_code(results)
            if test_rate > best_test_rate:
                best_test_rate = test_rate
                best_code = code

        avg_editorial = np.mean([s.editorial_score for s in all_scores])
        avg_cases = np.mean([s.cases_score for s in all_scores])

        return {
            "ensemble_code_score": best_test_rate,
            "avg_editorial_score": avg_editorial,
            "avg_cases_score": avg_cases,
            "solution_diversity": len(unique_codes) / len(all_codes),
            "best_solution": best_code
        }

Practical Guidance

Hyperparameters and Configuration

When to Use

Competitive programming evaluation pipelines where deep analysis of model capabilities is needed
Identifying which reasoning dimensions cause LLM failures on coding tasks
Building self-improving code systems that target specific dimensional weaknesses
Creating diverse solution ensembles for more robust code generation
Benchmarking LLM progress on fine-grained coding abilities rather than just pass rates

When NOT to Use

Not suitable when only final code correctness matters (use simpler pass/fail metrics)
Avoid for production code generation systems without extensive validation safeguards
Not recommended for real-time code systems with strict latency requirements (evaluation is slow)
Do not use as a replacement for human code review in production systems

Common Pitfalls

Ignoring distribution shift: Model-generated solutions differ systematically from human solutions. Measure this gap rather than assuming equivalence.
Overweighting ensemble diversity: More models don't automatically improve results. Measure actual improvement and stop at diminishing returns.
Poor code extraction: Fragile regex-based extraction fails on valid code formatting variations. Use robust parsing libraries.
Incomplete test coverage: Generated test cases often miss subtle edge cases. Always include domain expert-designed test cases alongside model-generated ones.
Treating dimensions as independent: Editorial, Code, and Cases dimensions interact. A poor understanding propagates to code quality. Address root problems, not symptoms.

Reference

Coding Triangle: How Does Large Language Model Understand Code? https://arxiv.org/abs/2507.06138

Adoption

ADu2021/coding-comprehension-evaluation

$ install --global

Security Scan Results

SKILL.md

Coding Triangle: Decomposing LLM Code Understanding into Three Dimensions

Core Concept

Architecture Overview

Implementation

Practical Guidance

Reference

Related Skills

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution

ADu2021/coding-comprehension-evaluation

$ install --global

Security Scan Results

SKILL.md

Coding Triangle: Decomposing LLM Code Understanding into Three Dimensions

Core Concept

Architecture Overview

Implementation

Practical Guidance

Reference

Related Skills

ADu2021/flow-map-trajectory-tilting

ADu2021/flexible-data-mixture-of-experts

ADu2021/flexibility-trap-diffusion-reasoning

ADu2021/flex-continuous-agent-evolution