AgentDrive: Factorized Scenario Generation for Autonomous Driving Benchmarks

This skill enables Claude to generate structured, simulation-ready autonomous driving scenarios using AgentDrive's factorized 7-axis methodology. Rather than writing free-form scenario descriptions, you decompose the scenario space into orthogonal axes (scenario type, driver behavior, environment, road layout, objective, difficulty, traffic density), then use an LLM-driven prompt-to-JSON pipeline to produce validated specifications. The same methodology extends to generating multiple-choice reasoning benchmarks across five dimensions: physics, policy, hybrid, scenario, and comparative reasoning.

When to Use

When the user asks to generate synthetic driving scenarios for simulation or AV testing
When building benchmark datasets to evaluate LLM reasoning about driving decisions
When creating multiple-choice questions that test physics, policy, or safety reasoning in traffic contexts
When the user needs structured JSON scenario files compatible with driving simulators (CARLA, Highway-Env, SUMO)
When designing safety-critical test suites that systematically cover edge cases via combinatorial axis sampling
When evaluating or comparing LLMs on autonomous driving reasoning tasks
When the user wants to fine-tune models on driving decision-making data

Key Technique

AgentDrive's core insight is factorized scenario decomposition: instead of generating driving scenarios monolithically, the space is split into seven independent axes, each with enumerated values. Sampling across those axes gives systematic coverage and counteracts mode collapse toward common scenarios. The seven axes are:

Scenario type — lane change, intersection crossing, obstacle avoidance, merging, emergency braking, pedestrian crossing, etc.
Driver behavior — normal, aggressive, cautious, distracted, impaired, emergency vehicle
Environment — weather (clear/rain/fog/snow) crossed with time-of-day (day/dawn-dusk/night)
Road layout — highway, urban intersection, rural road, roundabout, multi-lane arterial
Objective — reach goal safely, minimize travel time, yield to emergency vehicle, stop at obstacle
Difficulty — easy (single agent, clear decisions), medium (two-vehicle interaction, temporal pressure), hard (multi-agent, competing objectives, uncertainty)
Traffic density — sparse, moderate, dense
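A minimal sketch of the axis space as a Python dict. The snake_case identifiers, the key names, and the flattening of environment into "weather/time" strings are illustrative assumptions, not taken from the AgentDrive code. The scenario-type list stops at the six values named above, although the source leaves that axis open-ended.

```python
import itertools
import math
import random

WEATHER = ["clear", "rain", "fog", "snow"]
TIME_OF_DAY = ["day", "dawn_dusk", "night"]

# The seven axes; identifiers are assumed spellings of the values listed above.
AXES = {
    "scenario_type": ["lane_change", "intersection_crossing", "obstacle_avoidance",
                      "merging", "emergency_braking", "pedestrian_crossing"],
    "driver_behavior": ["normal", "aggressive", "cautious", "distracted",
                        "impaired", "emergency_vehicle"],
    # Environment is weather crossed with time of day (4 x 3 = 12 values).
    "environment": [f"{w}/{t}" for w, t in itertools.product(WEATHER, TIME_OF_DAY)],
    "road_layout": ["highway", "urban_intersection", "rural_road", "roundabout",
                    "multi_lane_arterial"],
    "objective": ["reach_goal_safely", "minimize_travel_time",
                  "yield_to_emergency_vehicle", "stop_at_obstacle"],
    "difficulty": ["easy", "medium", "hard"],
    "traffic_density": ["sparse", "moderate", "dense"],
}


def full_factorial_size(axes):
    """Number of distinct axis combinations."""
    return math.prod(len(values) for values in axes.values())


def sample_combinations(axes, n, seed=0):
    """Draw n distinct axis combinations uniformly at random."""
    if n > full_factorial_size(axes):
        raise ValueError("n exceeds the number of distinct combinations")
    rng = random.Random(seed)
    seen = set()
    while len(seen) < n:
        seen.add(tuple(rng.choice(values) for values in axes.values()))
    # Sort for a reproducible order, then turn each tuple back into a dict.
    return [dict(zip(axes, combo)) for combo in sorted(seen)]
```

Calling `sample_combinations(AXES, 5)` returns five dicts, each naming one value per axis, which can then be turned into prompts.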

The generation pipeline works as follows: sample an axis combination, construct a structured prompt that specifies each axis value, pass it to an LLM to produce a JSON scenario, then validate the output against both a JSON schema and physical constraints (plausible acceleration bounds, lane geometry, kinematic continuity). Invalid scenarios are rejected and regenerated. Each validated scenario is then run through simulation, where surrogate safety metrics — Time-to-Collision (TTC), Post-Encroachment Time (PET), Deceleration Rate to Avoid Collision (DRAC) — are computed, and outcomes are labeled as safe_goal, safe_stop, inefficient, or unsafe.
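The validate-and-regenerate loop might look like the sketch below. It assumes the required fields named in the workflow and checks only spawn speeds against the road speed limit; `generate_fn` is a hypothetical callable that wraps whatever LLM client is in use and returns raw text. A real pipeline would add a full JSON Schema check and the remaining physics constraints.

```python
import json

REQUIRED_FIELDS = ("name", "road", "environment", "ego", "duration_steps", "metrics")


def validate_scenario(scenario):
    """Return a list of human-readable violations; an empty list means valid."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in scenario]
    if errors:
        return errors
    limit_mps = scenario["road"]["speed_limit_kph"] / 3.6
    speeds = [scenario["ego"]["spawn"]["v_mps"]]
    speeds += [a["v_mps"] for a in scenario.get("other_agents", []) if "v_mps" in a]
    for v in speeds:
        if not 0.0 <= v <= limit_mps:
            errors.append(f"speed {v} m/s outside [0, {limit_mps:.1f}] m/s")
    return errors


def generate_with_retry(generate_fn, prompt, max_attempts=3):
    """Call generate_fn(prompt) until the output validates or attempts run out."""
    for _ in range(max_attempts):
        try:
            scenario = json.loads(generate_fn(prompt))
            errors = validate_scenario(scenario)
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            errors = [f"malformed output: {exc!r}"]
        if not errors:
            return scenario
        # Feed the specific violations back into the next attempt.
        prompt += "\nFix these problems: " + "; ".join(errors)
    return None  # caller logs the failure and moves to the next combination
```

Returning `None` rather than raising keeps a batch run going when one axis combination repeatedly fails.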

For MCQ generation, critical decision points are extracted from scenario narratives, correct answers are grounded in simulation outcomes, and plausible distractors are generated by perturbing one reasoning dimension at a time. This produces questions that discriminate between surface-level pattern matching and genuine physics/policy understanding.

Step-by-Step Workflow

Generating Driving Scenarios

Define the axis value space. Create a Python dict or JSON object enumerating all values for each of the 7 axes. Start with the defaults from AgentDrive or customize for your domain (e.g., add "construction zone" to road layout).
Sample axis combinations. Use stratified or full-factorial sampling to select tuples across the 7 axes. For systematic coverage, use itertools.product with a subset; for scale, use Latin hypercube or random sampling with balance constraints.
Construct the structured prompt. For each axis tuple, build a prompt that explicitly specifies every axis value and instructs the LLM to produce a JSON scenario. Include the target JSON schema in the prompt so the model knows the expected output structure.
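Generate the JSON scenario via LLM call. Send the prompt to an LLM (GPT-4, Claude, etc.) requesting structured JSON output. Use response_format: { type: "json_object" } or the provider's equivalent so the reply parses as JSON; this enforces syntax only, so schema conformance is still checked in the next step.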
Validate against schema and physics constraints. Check the output against a JSON Schema (required fields: name, road, environment, ego, duration_steps, metrics). Then apply physics checks: vehicle speeds within road speed limits, acceleration/deceleration within [-8, 4] m/s^2, spawn positions on valid lane geometry.
Reject and regenerate invalid scenarios. If validation fails, log the error, adjust the prompt with the specific constraint violation, and retry (up to 3 attempts per combination).
Run simulation rollouts. Feed validated JSON into a driving simulator. Record timestep trajectories (position, velocity, heading, brake) for all agents.
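Compute surrogate safety metrics. From the trajectory data, calculate TTC (time until collision if both road users hold their current speed and path), PET (time between the first road user leaving a conflict area and the second one entering it), DRAC (the constant deceleration the following vehicle needs to avoid a crash), and minimum headway distance.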
Label outcomes. Apply rule-based classification: safe_goal (reached destination, no violations), safe_stop (stopped safely but didn't reach goal), inefficient (reached goal but with excessive time/distance), unsafe (collision or traffic violation occurred).
Export the dataset. Write scenarios, trajectories, metrics, and labels to structured files (JSON for scenarios, CSV/Parquet for trajectories and metrics).
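The metric and labeling steps above can be sketched for the simplest case, a follower closing on a leader in the same lane. The formulas are the standard car-following definitions of TTC and DRAC; the function names and the `time_budget_s` threshold are assumptions for illustration, and any collision-free run that misses the goal is treated as a safe stop.

```python
import math


def ttc(gap_m, v_follow_mps, v_lead_mps):
    """Time-to-Collision: gap divided by closing speed, infinite if not closing."""
    closing = v_follow_mps - v_lead_mps
    return gap_m / closing if closing > 0 else math.inf


def drac(gap_m, v_follow_mps, v_lead_mps):
    """Constant deceleration that removes the closing speed within the gap."""
    closing = v_follow_mps - v_lead_mps
    return closing ** 2 / (2 * gap_m) if closing > 0 else 0.0


def pet(first_exit_s, second_entry_s):
    """Post-Encroachment Time between two users of the same conflict area."""
    return second_entry_s - first_exit_s


def label_outcome(collided, violated_rule, reached_goal, elapsed_s, time_budget_s):
    """Rule-based outcome label; time_budget_s is an assumed efficiency threshold."""
    if collided or violated_rule:
        return "unsafe"
    if not reached_goal:
        return "safe_stop"
    return "inefficient" if elapsed_s > time_budget_s else "safe_goal"
```

In a rollout these functions would be evaluated at every timestep, keeping the minimum TTC and maximum DRAC per scenario.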

Generating MCQ Benchmarks

Select a scenario and reasoning dimension. Pick from: physics (braking distance, friction), policy (right-of-way, speed limits), hybrid (policy + physics interaction), scenario (situational judgment), comparative (which of two actions is safer).
Extract the decision point. Identify the critical moment in the scenario where the ego vehicle must choose an action.
Write the stem question. Frame it as a concrete driving situation with all necessary context (speed, distance, road condition, other agents).
Generate the correct answer. Ground it in the simulation outcome or physics calculation.
Generate 3 distractors. Create plausible wrong answers, each perturbing exactly one element of the reasoning: wrong physics (ignore friction), wrong policy (misapply right-of-way), or wrong risk assessment (underestimate closing speed).
Assign difficulty and reasoning tags. Label each question with its primary reasoning dimension and difficulty tier.
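As one way to realize the distractor step, the sketch below builds a braking-distance question in which each wrong option changes exactly one input to the physics. The function names, the option wording, and the dry-road friction value of 0.8 are assumptions for illustration. If `mu` is itself 0.8, the dry-road distractor would coincide with the correct answer and would need replacing.

```python
import random

G = 9.81      # gravitational acceleration, m/s^2
DRY_MU = 0.8  # assumed dry-asphalt friction coefficient, used by one distractor


def stopping_distance_m(v_kph, mu, reaction_s):
    """Reaction distance plus braking distance v^2 / (2 * mu * g)."""
    v = v_kph / 3.6
    return v * reaction_s + v ** 2 / (2 * mu * G)


def build_braking_mcq(v_kph, mu, reaction_s, seed=0):
    """Return an MCQ dict with one correct option and three single-error distractors."""
    candidates = [
        ("correct", stopping_distance_m(v_kph, mu, reaction_s)),
        ("ignores_reaction_time", stopping_distance_m(v_kph, mu, 0.0)),
        ("assumes_dry_road", stopping_distance_m(v_kph, DRY_MU, reaction_s)),
        # Treats the km/h figure as if it were already in m/s.
        ("skips_unit_conversion", stopping_distance_m(v_kph * 3.6, mu, reaction_s)),
    ]
    random.Random(seed).shuffle(candidates)
    labelled = list(zip("ABCD", candidates))
    return {
        "stem": (f"A vehicle travels at {v_kph} km/h on a road with friction "
                 f"coefficient {mu}. With a {reaction_s} s reaction time, "
                 "what is its total stopping distance?"),
        "options": {letter: f"About {dist:.0f} m" for letter, (_, dist) in labelled},
        "correct": next(letter for letter, (tag, _) in labelled if tag == "correct"),
        "option_tags": {letter: tag for letter, (tag, _) in labelled},
    }
```

Keeping `option_tags` alongside the options makes it possible to analyze which misconception a model falls for, not only whether it was wrong.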

Concrete Examples

Example 1: Generating a batch of structured driving scenarios

User: "Generate 5 diverse driving scenarios for testing an autonomous vehicle in bad weather."

Approach:

Fix environment axis to adverse weather (rain, fog, snow) and vary other axes.
Construct axis tuples ensuring diversity across scenario type, difficulty, and traffic density.
Generate JSON for each.

Output:

[
  {
    "name": "EmergencyBrake_FoggyUrban_v1",
    "seed": 101,
    "duration_steps": 400,
    "policy_frequency": 10,
    "scenario_type": "emergency_braking",
    "driver_behavior": "normal",
    "road": {
      "type": "urban_intersection",
      "lanes": 2,
      "speed_limit_kph": 50
    },
    "environment": { "weather": "fog", "visibility_m": 80, "time_of_day": "dawn" },
    "traffic_density": "moderate",
    "difficulty": "medium",
    "ego": {
      "spawn": { "x": -60.0, "y": 0.0, "heading_deg": 0, "v_mps": 12.5 },
      "objective": "stop_safely_before_obstacle"
    },
    "other_agents": [
      { "type": "pedestrian", "spawn": { "x": -10.0, "y": 3.5 }, "behavior": "crossing" }
    ],
    "metrics": ["ttc_front", "min_headway", "collisions", "drac"]
  },
  {
    "name": "LaneChange_RainyHighway_v1",
    "seed": 102,
    "duration_steps": 600,
    "policy_frequency": 10,
    "scenario_type": "lane_change",
    "driver_behavior": "cautious",
    "road": {
      "type": "highway",
      "lanes": 3,
      "speed_limit_kph": 110
    },
    "environment": { "weather": "rain", "visibility_m": 200, "time_of_day": "night" },
    "traffic_density": "dense",
    "difficulty": "hard",
    "ego": {
      "spawn": { "x": -200.0, "y": 0.0, "heading_deg": 0, "v_mps": 28.0 },
      "objective": "reach_goal_safely"
    },
    "other_agents": [
      { "type": "vehicle", "spawn": { "x": -180.0, "y": 3.7 }, "v_mps": 25.0, "behavior": "aggressive" },
      { "type": "vehicle", "spawn": { "x": -150.0, "y": -3.7 }, "v_mps": 30.0, "behavior": "normal" }
    ],
    "metrics": ["ttc_front", "ttc_rear", "pet", "collisions"]
  }
]

(3 more scenarios would follow with snow/freezing, fog/night, rain/rural combinations.)
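The approach for this request can be sketched as follows: hold the weather to adverse values, rotate the other axes so the five tuples differ, and render each tuple into a prompt. The function names, the prompt wording, and the subset of axes shown are assumptions for illustration; a full prompt would list all seven axes and embed the target schema.

```python
REQUIRED_FIELDS = ["name", "road", "environment", "ego", "duration_steps", "metrics"]

ADVERSE_WEATHER = ["rain", "fog", "snow"]
SCENARIO_TYPES = ["lane_change", "intersection_crossing", "obstacle_avoidance",
                  "merging", "emergency_braking"]
DIFFICULTIES = ["easy", "medium", "hard"]
DENSITIES = ["sparse", "moderate", "dense"]


def adverse_weather_batch(n):
    """Build n axis tuples with adverse weather, rotating the other axes."""
    batch = []
    for i in range(n):
        batch.append({
            "weather": ADVERSE_WEATHER[i % len(ADVERSE_WEATHER)],
            "scenario_type": SCENARIO_TYPES[i % len(SCENARIO_TYPES)],
            # Offsets keep difficulty and density from moving in lockstep with weather.
            "difficulty": DIFFICULTIES[(i + 1) % len(DIFFICULTIES)],
            "traffic_density": DENSITIES[(i + 2) % len(DENSITIES)],
        })
    return batch


def build_prompt(axis_values, required_fields=REQUIRED_FIELDS):
    """Render one axis tuple into an explicit generation prompt."""
    lines = ["Generate one driving scenario as a single JSON object.", "Axis values:"]
    lines += [f"- {axis}: {value}" for axis, value in axis_values.items()]
    lines.append("Required top-level fields: " + ", ".join(required_fields))
    lines.append("Return JSON only, with no surrounding text.")
    return "\n".join(lines)
```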

Example 2: Creating MCQ benchmark questions

User: "Create 3 multiple-choice questions to test an LLM's physics reasoning about driving."

Approach:

Select physics reasoning dimension.
Design questions around braking distance, friction, and closing speed.
Ground correct answers in physics, create distractors by introducing common misconceptions.

Output:

[
  {
    "id": "phys_001",
    "reasoning_dimension": "physics",
    "difficulty": "medium",
    "stem": "A vehicle traveling at 90 km/h on a wet highway (friction coefficient 0.5) must emergency brake to avoid a stationary obstacle 60 meters ahead. The vehicle's mass is 1500 kg and reaction time is 1.0 second. Can the vehicle stop in time?",
    "options": {
      "A": "Yes, the stopping distance is approximately 55 m, leaving a 5 m margin.",
      "B": "No, the total stopping distance including reaction is approximately 89 m, exceeding the 60 m available.",
      "C": "Yes, ABS will reduce stopping distance to under 40 m regardless of road conditions.",
      "D": "No, but the vehicle can swerve to avoid the obstacle at this speed safely."
    },
    "correct": "B",
    "explanation": "At 90 km/h (25 m/s), reaction distance = 25 m. Braking distance on wet road = v^2/(2*mu*g) = 625/(2*0.5*9.81) = 63.7 m. Total = 88.7 m > 60 m."
  },
  {
    "id": "phys_002",
    "reasoning_dimension": "physics",
    "difficulty": "easy",
    "stem": "Two vehicles approach each other head-on, each traveling at 60 km/h. What is the closing speed?",
    "options": {
      "A": "60 km/h — speed is relative to the road, not the other vehicle.",
      "B": "120 km/h — closing speed is the sum of both vehicles' speeds.",
      "C": "90 km/h — closing speed is the average times a safety factor.",
      "D": "84.85 km/h — use the Pythagorean theorem to combine the velocities."
    },
    "correct": "B",
    "explanation": "For a head-on approach along the same line, the closing speed is the magnitude of the relative velocity: v1 + v2 = 60 + 60 = 120 km/h. The Pythagorean combination in option D gives the relative speed for perpendicular paths, not opposing ones."
  },
  {
    "id": "phys_003",
    "reasoning_dimension": "physics",
    "difficulty": "hard",
    "stem": "A 2000 kg vehicle rounds a curve with radius 50 m at 72 km/h on a road with friction coefficient 0.7. Is the vehicle at risk of skidding?",
    "options": {
      "A": "No, the required centripetal force (16,000 N) is below the maximum friction force (13,734 N).",
      "B": "Yes, the required centripetal force (16,000 N) exceeds the maximum friction force (13,734 N).",
      "C": "No, modern tires provide enough grip for any speed below 100 km/h on curves.",
      "D": "Yes, but only if the road is banked at less than 5 degrees."
    },
    "correct": "B",
    "explanation": "v = 20 m/s. Centripetal force = mv^2/r = 2000*400/50 = 16,000 N. Max friction = mu*m*g = 0.7*2000*9.81 = 13,734 N. Since 16,000 > 13,734, the vehicle will skid."
  }
]
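Because the correct answers must be grounded in physics, it helps to recompute every number quoted in the explanations independently. The sketch below does that for the three questions above; the dictionary keys and function names are hypothetical, and the stated values are copied from the explanation fields.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

# Values quoted in the explanation fields of phys_001 to phys_003.
STATED = {
    "phys_001_total_m": 88.7,
    "phys_002_closing_kph": 120.0,
    "phys_003_required_n": 16000.0,
    "phys_003_max_friction_n": 13734.0,
}


def recompute():
    """Recalculate each quoted value from the numbers given in the stems."""
    v1 = 90 / 3.6  # phys_001 speed in m/s
    v3 = 72 / 3.6  # phys_003 speed in m/s
    return {
        "phys_001_total_m": v1 * 1.0 + v1 ** 2 / (2 * 0.5 * G),
        "phys_002_closing_kph": 60.0 + 60.0,
        "phys_003_required_n": 2000 * v3 ** 2 / 50,
        "phys_003_max_friction_n": 0.7 * 2000 * G,
    }


def mismatches(rel_tol=1e-3):
    """Return the keys whose stated value disagrees with the recomputed one."""
    fresh = recompute()
    return [key for key, stated in STATED.items()
            if not math.isclose(stated, fresh[key], rel_tol=rel_tol)]
```

An empty list from `mismatches()` means the quoted figures agree with the recomputation to within rounding.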

Example 3: Building an evaluation pipeline for LLMs

User: "I want to evaluate GPT-4 and Claude on driving reasoning. Set up the evaluation."

Approach:

Structure the MCQ dataset into the 5 reasoning dimensions.
Create an evaluation script that sends questions, collects answers, and scores by dimension.
Produce a comparison table.

Output:

import json
import asyncio
from collections import defaultdict

# Load MCQ benchmark
with open("agentdrive_mcq.json") as f:
    questions = json.load(f)

# Group by reasoning dimension
by_dimension = defaultdict(list)
for q in questions:
    by_dimension[q["reasoning_dimension"]].append(q)

async def evaluate_model(model_fn, questions, model_name):
    """Evaluate a model on AgentDrive-MCQ, returning per-dimension accuracy."""
    results = defaultdict(lambda: {"correct": 0, "total": 0})
    for q in questions:
        prompt = f"{q['stem']}\n\nOptions:\n"
        for key, val in q["options"].items():
            prompt += f"  {key}. {val}\n"
        prompt += "\nRespond with only the letter of the correct answer."

        response = await model_fn(prompt)
        answer = response.strip().upper()[:1]

        dim = q["reasoning_dimension"]
        results[dim]["total"] += 1
        if answer == q["correct"]:
            results[dim]["correct"] += 1

    print(f"\n=== {model_name} Results ===")
    for dim, counts in sorted(results.items()):
        acc = counts["correct"] / counts["total"] * 100
        print(f"  {dim:20s}: {acc:5.1f}% ({counts['correct']}/{counts['total']})")
    return results
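The comparison-table step is not covered by the script above. One possible sketch, assuming the per-model results have the same shape that `evaluate_model` returns (a mapping from dimension to `correct` and `total` counts); the function name and column widths are illustrative choices.

```python
def comparison_table(results_by_model):
    """Format per-dimension accuracy for several models as an aligned text table."""
    models = list(results_by_model)
    dims = sorted({dim for result in results_by_model.values() for dim in result})
    rows = [f"{'dimension':<14}" + "".join(f"{model:>12}" for model in models)]
    for dim in dims:
        cells = []
        for model in models:
            counts = results_by_model[model].get(dim)
            if counts and counts["total"]:
                accuracy = 100 * counts["correct"] / counts["total"]
                cells.append(f"{accuracy:>11.1f}%")
            else:
                # The model was not asked any question in this dimension.
                cells.append(f"{'n/a':>12}")
        rows.append(f"{dim:<14}" + "".join(cells))
    return "\n".join(rows)
```

Given results for two models, `print(comparison_table({"model_a": results_a, "model_b": results_b}))` prints one row per reasoning dimension and one column per model.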

Best Practices

Do:

Always specify all 7 axes explicitly when generating scenarios — omitting axes leads to implicit defaults that reduce coverage.
Validate generated JSON against both schema constraints AND physical plausibility before using in any pipeline.
Use stratified sampling across axes to ensure balanced representation, especially for rare combinations (e.g., snow + night + dense traffic + hard difficulty).
Include surrogate safety metrics (TTC, PET, DRAC) in every scenario specification — they are essential for outcome labeling.
When generating MCQs, ground correct answers in physics equations or explicit traffic law, not LLM intuition.
Tag every question and scenario with its axis values and reasoning dimension to enable fine-grained analysis.

Avoid:

Generating scenarios without the factorized axis structure — unconstrained generation produces clusters of similar easy scenarios.
Trusting LLM-generated physics calculations in MCQ distractors without manual verification — models frequently make arithmetic errors in braking distance and force calculations.
Using only one difficulty level — the easy/medium/hard split is critical for discriminating between model capabilities.
Skipping the validation-and-retry loop — approximately 10-15% of LLM-generated scenarios will violate physical constraints on first attempt.

Error Handling

| Error | Cause | Fix |
|-------|-------|-----|
| Schema validation failure | Missing required fields in generated JSON | Include the full JSON schema in the generation prompt; retry with explicit field list |
| Physics constraint violation | Impossible speed/acceleration values | Add explicit bounds to the prompt (e.g., "acceleration must be between -8 and 4 m/s^2"); clamp values post-generation |
| Duplicate scenarios | Axis sampling produces identical combinations | Hash axis tuples and deduplicate before generation |
| Incorrect MCQ answers | LLM arithmetic errors in physics questions | Always verify correct answers with independent calculation; include the computation in the explanation field |
| Simulator rejection | JSON format incompatible with target simulator | Maintain a mapping layer between AgentDrive schema and simulator-specific formats (CARLA, Highway-Env, SUMO) |
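The fix for duplicate scenarios, hashing axis tuples before generation, could be sketched as follows. The function names are hypothetical; the key is a SHA-256 digest of a canonical JSON encoding, so the same axis values hash identically regardless of key order.

```python
import hashlib
import json


def axis_key(axis_values):
    """Stable hash of one axis combination, independent of dict key order."""
    canonical = json.dumps(axis_values, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(combinations):
    """Keep the first occurrence of each axis combination, preserving order."""
    seen = set()
    unique = []
    for combo in combinations:
        key = axis_key(combo)
        if key not in seen:
            seen.add(key)
            unique.append(combo)
    return unique
```

Storing the hash with each generated scenario also gives a convenient identifier for joining scenarios to their trajectories and labels.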

Limitations

Simulation fidelity: LLM-generated scenarios describe idealized conditions. Real-world sensor noise, partial observability, and actuator latency are not captured in the JSON specification alone — these must be added by the simulator.
Physics approximation: Surrogate safety metrics assume simplified vehicle dynamics (point-mass model). Real braking distances depend on tire condition, road grade, and brake pad state.
Cultural policy variation: Traffic rules vary by jurisdiction. The MCQ benchmark's policy questions assume a generic set of rules, so adapt them for specific regions (e.g., countries that drive on the left, different speed limit conventions).
LLM generation bias: Even with factorized sampling, LLMs tend to generate more conventional scenarios. Rare but critical edge cases (e.g., wrong-way driver, debris from overhead) may need targeted prompting.
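Scale vs. diversity tradeoff: Full factorial coverage grows multiplicatively. Counting only the values listed under Key Technique, the seven axes already yield 6 × 6 × 12 × 5 × 4 × 3 × 3 = 77,760 combinations, and every added axis value multiplies that count. Practical datasets require intelligent subsampling.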
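One simple subsampling scheme, in the spirit of a Latin hypercube, balances each axis marginally: every value of an axis appears as evenly as possible across the sample, while the pairing between axes is random. This is a sketch under that assumption, not the sampling method used by AgentDrive, and `balanced_sample` is a hypothetical name.

```python
import random


def balanced_sample(axes, n, seed=0):
    """Sample n combinations with each axis's values used as evenly as possible."""
    rng = random.Random(seed)
    columns = {}
    for axis, values in axes.items():
        # Repeat the value list up to length n, then shuffle so that the
        # pairing between axes is random rather than aligned by index.
        column = [values[i % len(values)] for i in range(n)]
        rng.shuffle(column)
        columns[axis] = column
    return [{axis: columns[axis][i] for axis in axes} for i in range(n)]
```

The scheme balances single axes only. It does not guarantee coverage of specific rare combinations such as snow with night and dense traffic, and it can emit duplicate tuples, so targeted additions and deduplication may still be needed.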

Reference

Paper: AgentDrive: An Open Benchmark Dataset for Agentic AI Reasoning with LLM-Generated Scenarios in Autonomous Systems — Ferrag, Lakas, Debbah (2026). Focus on Section 3 (factorized scenario space definition), Section 4 (prompt-to-JSON pipeline and validation), and Section 6 (MCQ benchmark design and LLM evaluation results across 50 models).

Code & Data: github.com/maferrag/AgentDrive
