skills/agentdrive-open-benchmark-dataset/SKILL.md
Generate structured autonomous driving scenarios and MCQ benchmarks using AgentDrive's factorized 7-axis prompt-to-JSON pipeline. Use when asked to: 'generate driving scenarios for AV testing', 'create autonomous driving benchmarks', 'build LLM evaluation datasets for driving', 'design safety-critical simulation scenarios', 'evaluate LLM reasoning on driving tasks', 'create multiple-choice questions for autonomous systems'.
npx skillsauth add ndpvt-web/arxiv-claude-skills agentdrive-open-benchmark-dataset
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to generate structured, simulation-ready autonomous driving scenarios using AgentDrive's factorized 7-axis methodology. Rather than writing free-form scenario descriptions, you decompose the scenario space into orthogonal axes (scenario type, driver behavior, environment, road layout, objective, difficulty, traffic density), then use an LLM-driven prompt-to-JSON pipeline to produce validated specifications. The same methodology extends to generating multiple-choice reasoning benchmarks across five dimensions: physics, policy, hybrid, scenario, and comparative reasoning.
AgentDrive's core insight is factorized scenario decomposition: instead of generating driving scenarios monolithically, the space is split into seven independent axes, each with enumerated values. Sampling across those axes gives systematic coverage and guards against mode collapse toward common scenarios. The seven axes are scenario type, driver behavior, environment, road layout, objective, difficulty, and traffic density.
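As a rough illustration of the factorized space, the sketch below writes the seven axes as a Python dict and enumerates or samples combinations from it. The axis names follow the overview; the values listed under each axis are placeholders chosen for illustration and are not AgentDrive's actual enumerations.

```python
import itertools
import random

# Hypothetical axis values for illustration only; AgentDrive's real
# enumerations may differ in both names and size.
AXES = {
    "scenario_type": ["emergency_braking", "lane_change", "merging"],
    "driver_behavior": ["cautious", "normal", "aggressive"],
    "environment": ["clear", "rain", "fog"],
    "road_layout": ["highway", "urban_intersection"],
    "objective": ["reach_goal_safely", "stop_safely_before_obstacle"],
    "difficulty": ["easy", "medium", "hard"],
    "traffic_density": ["sparse", "moderate", "dense"],
}


def full_factorial(axes):
    """Yield every combination of axis values as a dict keyed by axis name."""
    names = list(axes)
    for values in itertools.product(*(axes[name] for name in names)):
        yield dict(zip(names, values))


def sample_combinations(axes, k, seed=0):
    """Draw k distinct combinations uniformly at random, reproducibly."""
    combos = list(full_factorial(axes))
    return random.Random(seed).sample(combos, k)


total = sum(1 for _ in full_factorial(AXES))
batch = sample_combinations(AXES, k=5)
print(f"{total} combinations in the full space; sampled {len(batch)}")
```

Uniform sampling is the simplest choice here. A balanced or stratified sampler would replace `sample_combinations` when some axis values must appear a minimum number of times.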
The generation pipeline works as follows: sample an axis combination, construct a structured prompt that specifies each axis value, pass it to an LLM to produce a JSON scenario, then validate the output against both a JSON schema and physical constraints (plausible acceleration bounds, lane geometry, kinematic continuity). Invalid scenarios are rejected and regenerated. Each validated scenario is then run through simulation, where surrogate safety metrics — Time-to-Collision (TTC), Post-Encroachment Time (PET), Deceleration Rate to Avoid Collision (DRAC) — are computed, and outcomes are labeled as safe_goal, safe_stop, inefficient, or unsafe.
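The validate-then-regenerate loop described above might look like the following minimal sketch. The required field names and the acceleration bounds are the ones quoted in the workflow steps; `generate_fn` is a hypothetical stand-in for the LLM call, and `accel_mps2` is an assumed field name that the example scenarios do not contain.

```python
REQUIRED_FIELDS = ("name", "road", "environment", "ego", "duration_steps", "metrics")
ACCEL_BOUNDS_MPS2 = (-8.0, 4.0)  # bounds quoted in the workflow steps


def validate_scenario(scenario):
    """Return a list of violations; an empty list means the scenario passed."""
    errors = [f"missing required field: {name}"
              for name in REQUIRED_FIELDS if name not in scenario]
    if errors:
        return errors

    # Physics check 1: ego spawn speed must not exceed the road speed limit.
    limit_kph = scenario["road"].get("speed_limit_kph")
    speed_mps = scenario["ego"].get("spawn", {}).get("v_mps")
    if limit_kph is not None and speed_mps is not None:
        limit_mps = limit_kph / 3.6
        if not 0.0 <= speed_mps <= limit_mps:
            errors.append(
                f"ego spawn speed {speed_mps} m/s outside [0, {limit_mps:.1f}] m/s")

    # Physics check 2: "accel_mps2" is a hypothetical field, checked only if present.
    accel = scenario["ego"].get("accel_mps2")
    low, high = ACCEL_BOUNDS_MPS2
    if accel is not None and not low <= accel <= high:
        errors.append(f"ego acceleration {accel} m/s^2 outside [{low}, {high}]")
    return errors


def generate_with_retry(generate_fn, axis_values, max_attempts=3):
    """Call generate_fn until its output validates, feeding violations back in.

    generate_fn(axis_values, feedback) is an assumed interface: it receives the
    previous attempt's violations so the prompt can be adjusted.
    """
    feedback = []
    for _ in range(max_attempts):
        scenario = generate_fn(axis_values, feedback)
        feedback = validate_scenario(scenario)
        if not feedback:
            return scenario
    return None  # every attempt failed; the caller decides how to log or skip
```

Lane-geometry and kinematic-continuity checks are omitted because they depend on the target simulator's road model.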
For MCQ generation, critical decision points are extracted from scenario narratives, correct answers are grounded in simulation outcomes, and plausible distractors are generated by perturbing one reasoning dimension at a time. This produces questions that discriminate between surface-level pattern matching and genuine physics/policy understanding.
Define the axis value space. Create a Python dict or JSON object enumerating all values for each of the 7 axes. Start with the defaults from AgentDrive or customize for your domain (e.g., add "construction zone" to road layout).
Sample axis combinations. Use stratified or full-factorial sampling to select tuples across the 7 axes. For systematic coverage, use itertools.product with a subset; for scale, use Latin hypercube or random sampling with balance constraints.
Construct the structured prompt. For each axis tuple, build a prompt that explicitly specifies every axis value and instructs the LLM to produce a JSON scenario. Include the target JSON schema in the prompt so the model knows the expected output structure.
Generate the JSON scenario via LLM call. Send the prompt to an LLM (GPT-4, Claude, etc.) requesting structured JSON output. Use response_format: { type: "json_object" } or equivalent to enforce valid JSON.
Validate against schema and physics constraints. Check the output against a JSON Schema (required fields: name, road, environment, ego, duration_steps, metrics). Then apply physics checks: vehicle speeds within road speed limits, acceleration/deceleration within [-8, 4] m/s^2, spawn positions on valid lane geometry.
Reject and regenerate invalid scenarios. If validation fails, log the error, adjust the prompt with the specific constraint violation, and retry (up to 3 attempts per combination).
Run simulation rollouts. Feed validated JSON into a driving simulator. Record timestep trajectories (position, velocity, heading, brake) for all agents.
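Compute surrogate safety metrics. From the trajectory data, calculate TTC (time until collision if both agents hold their current speed and heading), PET (time between one agent leaving a conflict area and the next agent entering it), DRAC (the constant deceleration the following agent would need to avoid a collision), and minimum headway distance.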
Label outcomes. Apply rule-based classification: safe_goal (reached destination, no violations), safe_stop (stopped safely but didn't reach goal), inefficient (reached goal but with excessive time/distance), unsafe (collision or traffic violation occurred).
Export the dataset. Write scenarios, trajectories, metrics, and labels to structured files (JSON for scenarios, CSV/Parquet for trajectories and metrics).
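The metric and labeling steps above can be sketched for the simplest case, a follower and a leader in the same lane. The formulas are the common car-following forms of TTC and DRAC. The `time_budget_s` threshold and the assumption that any collision-free run that misses the goal counts as `safe_stop` are illustrative choices, not rules taken from AgentDrive.

```python
import math


def time_to_collision(gap_m, v_follow_mps, v_lead_mps):
    """TTC for car-following: gap divided by closing speed, inf if not closing."""
    closing = v_follow_mps - v_lead_mps
    if closing <= 0:
        return math.inf
    return max(gap_m, 0.0) / closing


def drac(gap_m, v_follow_mps, v_lead_mps):
    """Deceleration the follower needs to match the leader's speed within the gap."""
    closing = v_follow_mps - v_lead_mps
    if closing <= 0:
        return 0.0
    if gap_m <= 0:
        return math.inf
    return closing ** 2 / (2.0 * gap_m)


def label_outcome(collided, violated, reached_goal, elapsed_s, time_budget_s):
    """Rule-based label; time_budget_s is a hypothetical efficiency threshold."""
    if collided or violated:
        return "unsafe"
    if not reached_goal:
        return "safe_stop"  # assumes the vehicle came to rest without incident
    if elapsed_s > time_budget_s:
        return "inefficient"
    return "safe_goal"


# Toy trajectory: (gap in m, follower speed, leader speed) per timestep.
steps = [(40.0, 20.0, 15.0), (30.0, 20.0, 12.0), (25.0, 14.0, 12.0)]
min_ttc = min(time_to_collision(*s) for s in steps)
max_drac = max(drac(*s) for s in steps)
print(f"min TTC = {min_ttc:.2f} s, max DRAC = {max_drac:.2f} m/s^2")
```

PET needs conflict-area entry and exit timestamps for two agents, so it is left out of this single-lane sketch.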
Select a scenario and reasoning dimension. Pick from: physics (braking distance, friction), policy (right-of-way, speed limits), hybrid (policy + physics interaction), scenario (situational judgment), comparative (which of two actions is safer).
Extract the decision point. Identify the critical moment in the scenario where the ego vehicle must choose an action.
Write the stem question. Frame it as a concrete driving situation with all necessary context (speed, distance, road condition, other agents).
Generate the correct answer. Ground it in the simulation outcome or physics calculation.
Generate 3 distractors. Create plausible wrong answers by perturbing one axis: wrong physics (ignore friction), wrong policy (misapply right-of-way), wrong risk assessment (underestimate closing speed).
Assign difficulty and reasoning tags. Label each question with its primary reasoning dimension and difficulty tier.
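One way to keep the correct answer grounded in a calculation is to compute it in code and derive each distractor by perturbing a single input, as in the sketch below. The function names, the `DRY_MU` value, and the option wording are all assumptions made for illustration; only the stopping-distance formula is standard physics.

```python
import random

G = 9.81       # gravitational acceleration, m/s^2
DRY_MU = 0.8   # hypothetical dry-road friction used for one distractor


def stopping_distance(v_kph, mu, reaction_s):
    """Reaction distance plus braking distance v^2 / (2 * mu * g), in meters."""
    v = v_kph / 3.6
    return v * reaction_s + v ** 2 / (2.0 * mu * G)


def build_stopping_mcq(qid, v_kph, mu, reaction_s, gap_m, seed=0):
    """Build one physics MCQ; each distractor changes exactly one input."""
    candidates = [
        ("correct", stopping_distance(v_kph, mu, reaction_s)),
        ("ignores_reaction_time", stopping_distance(v_kph, mu, 0.0)),
        ("assumes_dry_road", stopping_distance(v_kph, DRY_MU, reaction_s)),
        ("ignores_braking_distance", v_kph / 3.6 * reaction_s),
    ]
    random.Random(seed).shuffle(candidates)  # seeded so output is reproducible

    options, correct_key = {}, None
    for key, (tag, dist) in zip("ABCD", candidates):
        verdict = "can" if dist <= gap_m else "cannot"
        options[key] = (f"Stopping distance is about {dist:.0f} m, "
                        f"so the vehicle {verdict} stop within {gap_m:g} m.")
        if tag == "correct":
            correct_key = key
    if len(set(options.values())) != 4:
        raise ValueError("distractors collide with the correct answer")

    stem = (f"A vehicle traveling at {v_kph:g} km/h on a road with friction "
            f"coefficient {mu:g} must brake for a stationary obstacle "
            f"{gap_m:g} m ahead. Reaction time is {reaction_s:g} s. "
            f"Can it stop in time?")
    return {"id": qid, "reasoning_dimension": "physics", "stem": stem,
            "options": options, "correct": correct_key}


question = build_stopping_mcq("phys_demo", v_kph=90, mu=0.5, reaction_s=1.0, gap_m=60)
```

The `ValueError` guard matters because a perturbation can leave the answer unchanged, for example when the scenario already uses the dry-road friction value.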
Example 1: Generating a batch of structured driving scenarios
User: "Generate 5 diverse driving scenarios for testing an autonomous vehicle in bad weather."
Approach:
Output:
[
  {
    "name": "EmergencyBrake_FoggyUrban_v1",
    "seed": 101,
    "duration_steps": 400,
    "policy_frequency": 10,
    "scenario_type": "emergency_braking",
    "driver_behavior": "normal",
    "road": {
      "type": "urban_intersection",
      "lanes": 2,
      "speed_limit_kph": 50
    },
    "environment": { "weather": "fog", "visibility_m": 80, "time_of_day": "dawn" },
    "traffic_density": "moderate",
    "difficulty": "medium",
    "ego": {
      "spawn": { "x": -60.0, "y": 0.0, "heading_deg": 0, "v_mps": 12.5 },
      "objective": "stop_safely_before_obstacle"
    },
    "other_agents": [
      { "type": "pedestrian", "spawn": { "x": -10.0, "y": 3.5 }, "behavior": "crossing" }
    ],
    "metrics": ["ttc_front", "min_headway", "collisions", "drac"]
  },
  {
    "name": "LaneChange_RainyHighway_v1",
    "seed": 102,
    "duration_steps": 600,
    "policy_frequency": 10,
    "scenario_type": "lane_change",
    "driver_behavior": "cautious",
    "road": {
      "type": "highway",
      "lanes": 3,
      "speed_limit_kph": 110
    },
    "environment": { "weather": "rain", "visibility_m": 200, "time_of_day": "night" },
    "traffic_density": "dense",
    "difficulty": "hard",
    "ego": {
      "spawn": { "x": -200.0, "y": 0.0, "heading_deg": 0, "v_mps": 28.0 },
      "objective": "reach_goal_safely"
    },
    "other_agents": [
      { "type": "vehicle", "spawn": { "x": -180.0, "y": 3.7 }, "v_mps": 25.0, "behavior": "aggressive" },
      { "type": "vehicle", "spawn": { "x": -150.0, "y": -3.7 }, "v_mps": 30.0, "behavior": "normal" }
    ],
    "metrics": ["ttc_front", "ttc_rear", "pet", "collisions"]
  }
]
(3 more scenarios would follow with snow/freezing, fog/night, rain/rural combinations.)
Example 2: Creating MCQ benchmark questions
User: "Create 3 multiple-choice questions to test an LLM's physics reasoning about driving."
Approach:
Output:
[
  {
    "id": "phys_001",
    "reasoning_dimension": "physics",
    "difficulty": "medium",
    "stem": "A vehicle traveling at 90 km/h on a wet highway (friction coefficient 0.5) must emergency brake to avoid a stationary obstacle 60 meters ahead. The vehicle's mass is 1500 kg and reaction time is 1.0 second. Can the vehicle stop in time?",
    "options": {
      "A": "Yes, the stopping distance is approximately 55 m, leaving a 5 m margin.",
      "B": "No, the total stopping distance including reaction is approximately 89 m, exceeding the 60 m available.",
      "C": "Yes, ABS will reduce stopping distance to under 40 m regardless of road conditions.",
      "D": "No, but the vehicle can swerve to avoid the obstacle at this speed safely."
    },
    "correct": "B",
    "explanation": "At 90 km/h (25 m/s), reaction distance = 25 m/s * 1.0 s = 25 m. Braking distance on wet road = v^2/(2*mu*g) = 625/(2*0.5*9.81) = 63.7 m. Total = 88.7 m > 60 m. Vehicle mass cancels out of the braking-distance formula."
  },
  {
    "id": "phys_002",
    "reasoning_dimension": "physics",
    "difficulty": "easy",
    "stem": "Two vehicles approach each other head-on, each traveling at 60 km/h. What is the closing speed?",
    "options": {
      "A": "60 km/h: speed is relative to the road, not the other vehicle.",
      "B": "120 km/h: closing speed is the sum of both vehicles' speeds.",
      "C": "90 km/h: closing speed is the average times a safety factor.",
      "D": "84.85 km/h: use the Pythagorean theorem to combine the velocities."
    },
    "correct": "B",
    "explanation": "For a head-on approach along the same line, the velocities point in opposite directions, so the closing speed is the sum of the two speeds: 60 + 60 = 120 km/h."
  },
  {
    "id": "phys_003",
    "reasoning_dimension": "physics",
    "difficulty": "hard",
    "stem": "A 2000 kg vehicle rounds a curve with radius 50 m at 72 km/h on a road with friction coefficient 0.7. Is the vehicle at risk of skidding?",
    "options": {
      "A": "No, the required centripetal force (16,000 N) is below the maximum friction force (13,734 N).",
      "B": "Yes, the required centripetal force (16,000 N) exceeds the maximum friction force (13,734 N).",
      "C": "No, modern tires provide enough grip for any speed below 100 km/h on curves.",
      "D": "Yes, but only if the road is banked at less than 5 degrees."
    },
    "correct": "B",
    "explanation": "v = 20 m/s. Centripetal force = mv^2/r = 2000*400/50 = 16,000 N. Max friction = mu*m*g = 0.7*2000*9.81 = 13,734 N. Since 16,000 > 13,734, the vehicle will skid."
  }
]
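Since physics answers are easy to get wrong by arithmetic slip, it can help to recompute each one independently before a question enters the benchmark. The short script below rechecks the three example questions above. The helper names are illustrative, and g = 9.81 m/s^2 is taken from the explanations.

```python
G = 9.81  # gravitational acceleration used in the explanations, m/s^2


def kph_to_mps(v_kph):
    """Convert km/h to m/s."""
    return v_kph / 3.6


def total_stopping_distance(v_kph, mu, reaction_s):
    """Reaction distance plus friction-limited braking distance, in meters."""
    v = kph_to_mps(v_kph)
    return v * reaction_s + v ** 2 / (2.0 * mu * G)


def cornering_forces(mass_kg, v_kph, radius_m, mu):
    """Return (required centripetal force, maximum friction force) in newtons."""
    v = kph_to_mps(v_kph)
    return mass_kg * v ** 2 / radius_m, mu * mass_kg * G


stop_m = total_stopping_distance(90, 0.5, 1.0)                  # phys_001
closing_kph = 60 + 60                                           # phys_002
required_n, available_n = cornering_forces(2000, 72, 50, 0.7)   # phys_003

print(f"phys_001: {stop_m:.1f} m needed vs 60 m available")
print(f"phys_002: {closing_kph} km/h closing speed")
print(f"phys_003: {required_n:.0f} N required vs {available_n:.0f} N available")
```

Note that the 1500 kg mass in phys_001 never enters the calculation, which makes it a deliberate irrelevant detail in the stem.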
Example 3: Building an evaluation pipeline for LLMs
User: "I want to evaluate GPT-4 and Claude on driving reasoning. Set up the evaluation."
Approach:
Output:
import json
import asyncio
from collections import defaultdict

# Load MCQ benchmark
with open("agentdrive_mcq.json") as f:
    questions = json.load(f)

# Group by reasoning dimension
by_dimension = defaultdict(list)
for q in questions:
    by_dimension[q["reasoning_dimension"]].append(q)

async def evaluate_model(model_fn, questions, model_name):
    """Evaluate a model on AgentDrive-MCQ, returning per-dimension accuracy."""
    results = defaultdict(lambda: {"correct": 0, "total": 0})
    for q in questions:
        prompt = f"{q['stem']}\n\nOptions:\n"
        for key, val in q["options"].items():
            prompt += f"  {key}. {val}\n"
        prompt += "\nRespond with only the letter of the correct answer."
        response = await model_fn(prompt)
        answer = response.strip().upper()[:1]
        dim = q["reasoning_dimension"]
        results[dim]["total"] += 1
        if answer == q["correct"]:
            results[dim]["correct"] += 1
    print(f"\n=== {model_name} Results ===")
    for dim, counts in sorted(results.items()):
        acc = counts["correct"] / counts["total"] * 100
        print(f"  {dim:20s}: {acc:5.1f}% ({counts['correct']}/{counts['total']})")
    return results
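The evaluation loop above takes the first character of the response as the answer, which misreads replies such as "Answer: B". A slightly more tolerant parser, plus a stub model for smoke-testing the loop without API calls, could look like this. Both `extract_choice` and `mock_model` are hypothetical helpers, not part of AgentDrive.

```python
import asyncio
import re


def extract_choice(response):
    """Return the first standalone option letter A-D in the response, or None.

    Uppercase letters are preferred; lowercase is a fallback. A reply that
    opens with the article "A" would still be misread as option A.
    """
    text = response.strip()
    match = re.search(r"\b([A-D])\b", text) or re.search(r"\b([a-d])\b", text)
    return match.group(1).upper() if match else None


async def mock_model(prompt):
    """Stub with the same async signature as model_fn; always answers B."""
    await asyncio.sleep(0)  # yield control like a real network call would
    return "Answer: B"


reply = asyncio.run(mock_model("Which option is correct?"))
print(extract_choice(reply))
```

Passing `mock_model` as `model_fn` gives a quick end-to-end check of the loop, and swapping `response.strip().upper()[:1]` for `extract_choice(response)` makes scoring less sensitive to answer formatting.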
Do:
Avoid:
| Error | Cause | Fix |
|-------|-------|-----|
| Schema validation failure | Missing required fields in generated JSON | Include the full JSON schema in the generation prompt; retry with explicit field list |
| Physics constraint violation | Impossible speed/acceleration values | Add explicit bounds to the prompt (e.g., "acceleration must be between -8 and 4 m/s^2"); clamp values post-generation |
| Duplicate scenarios | Axis sampling produces identical combinations | Hash axis tuples and deduplicate before generation |
| Incorrect MCQ answers | LLM arithmetic errors in physics questions | Always verify correct answers with independent calculation; include the computation in the explanation field |
| Simulator rejection | JSON format incompatible with target simulator | Maintain a mapping layer between AgentDrive schema and simulator-specific formats (CARLA, Highway-Env, SUMO) |
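For the duplicate-scenario fix in the table, a minimal sketch of hashing axis combinations is shown below. It assumes each combination is a flat dict of axis name to value; the function names are illustrative.

```python
import hashlib
import json


def axis_key(combo):
    """Stable hash of an axis combination, independent of dict key order."""
    canonical = json.dumps(combo, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(combos):
    """Keep the first occurrence of each combination, preserving order."""
    seen, unique = set(), []
    for combo in combos:
        key = axis_key(combo)
        if key not in seen:
            seen.add(key)
            unique.append(combo)
    return unique


combos = [
    {"scenario_type": "lane_change", "environment": "rain"},
    {"environment": "rain", "scenario_type": "lane_change"},  # same, reordered
    {"scenario_type": "lane_change", "environment": "fog"},
]
unique = deduplicate(combos)
print(f"{len(combos)} sampled, {len(unique)} unique")
```

Storing the hash alongside each generated scenario also makes it cheap to skip combinations that were already generated in an earlier run.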
Paper: AgentDrive: An Open Benchmark Dataset for Agentic AI Reasoning with LLM-Generated Scenarios in Autonomous Systems — Ferrag, Lakas, Debbah (2026). Focus on Section 3 (factorized scenario space definition), Section 4 (prompt-to-JSON pipeline and validation), and Section 6 (MCQ benchmark design and LLM evaluation results across 50 models).
Code & Data: github.com/maferrag/AgentDrive