skills/gamedevbench-evaluating-agentic-capabilities/SKILL.md
Agentic game development with visual feedback loops for Godot Engine projects. Applies the GameDevBench methodology: navigating scene hierarchies, editing multimodal assets (sprites, shaders, animations), and using screenshot/video feedback to verify changes visually. Trigger phrases: 'build a game in Godot', 'fix my Godot scene', 'add animation to my game character', 'edit this shader effect', 'set up sprite animations from a spritesheet', 'create a game UI with Godot'
npx skillsauth add ndpvt-web/arxiv-claude-skills gamedevbench-evaluating-agentic-capabilities
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to tackle game development tasks in Godot Engine using the agentic workflow and visual feedback mechanisms from GameDevBench (arXiv:2602.11103). The core insight: game development requires simultaneously navigating large codebases, understanding hierarchical scene graphs, and manipulating intrinsically multimodal assets (sprites, shaders, animations, audio). By incorporating screenshot feedback from the Godot editor and runtime video capture, agents improve from ~33% to ~48% task success -- a technique this skill applies to real Godot projects.
Multi-file, multimodal-aware editing with visual feedback loops. GameDevBench found that game development tasks average 5 files modified and 106 lines changed per solution across 3.4 distinct file types -- far exceeding typical software engineering benchmarks. The critical differentiator is that errors are often visual, not logical: an agent might wire up correct gameplay code but select walking sprites instead of attack sprites, or attach a node at the wrong depth in the scene tree. Traditional text-only feedback (compiler errors, test output) cannot catch these mistakes.
Two feedback mechanisms close this gap. First, editor screenshot feedback: capturing a screenshot of the Godot editor after changes shows the scene graph structure, the inspector panel with node properties, and the 2D/3D viewport with the current visual state. This lets the agent verify node hierarchy, property values, and spatial layout. Second, runtime video feedback: using Godot's built-in recording to capture a short gameplay clip reveals temporal dynamics -- whether animations play correctly, physics behave as expected, and camera movement tracks properly. The paper found that either mechanism alone delivers most of the improvement; combining them yields marginal additional gains.
The practical implication for coding agents: after each significant edit pass, take a visual checkpoint. Compare the visual state against the task requirements. If the visual output diverges from intent, diagnose whether the issue is structural (wrong node hierarchy), referential (wrong asset path or sprite frame), or parametric (wrong property values). This feedback loop catches the dominant failure mode in game development: correct code structure with incorrect multimodal asset integration.
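A visual checkpoint can also be captured from inside the running game. The sketch below is one possible approach, not the paper's tooling: it saves the current viewport to a PNG after the first frame is drawn. The script name and output path are hypothetical, and a headless run has no rendered frame to capture, so this assumes the game runs with a window.

```gdscript
# checkpoint.gd: hypothetical helper; attach to any node in the running scene.
extends Node

# Assumed output location; user:// resolves to the project's user data folder.
const CHECKPOINT_PATH := "user://checkpoint.png"

func _ready() -> void:
    # Wait until a frame has been drawn so the viewport texture holds real pixels.
    await RenderingServer.frame_post_draw
    var image: Image = get_viewport().get_texture().get_image()
    var err := image.save_png(CHECKPOINT_PATH)
    if err != OK:
        push_error("Could not save checkpoint: %s" % error_string(err))
    else:
        print("Saved checkpoint to ", ProjectSettings.globalize_path(CHECKPOINT_PATH))
```

This captures only the game viewport, so it shows spatial layout and the selected sprite frame but not the editor's scene tree or inspector panels.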
Inventory the project structure. Scan the Godot project directory for project.godot, all .tscn scene files, .gd scripts, .gdshader shaders, and asset directories (sprites, audio, fonts). Map the dependency graph: which scenes instance other scenes, which scripts attach to which nodes.
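The inventory step can be sketched as a small headless script. This is a minimal example under stated assumptions: the file name `inventory.gd` and the extension list are illustrative, and the script only lists files rather than resolving which scene instances which.

```gdscript
# inventory.gd: hypothetical helper; run with
#   godot --headless --path . -s inventory.gd
extends SceneTree

const EXTENSIONS := ["tscn", "gd", "gdshader", "tres", "png", "svg", "wav", "ogg", "ttf"]

func _init() -> void:
    var found := {}  # extension -> Array of res:// paths
    _scan("res://", found)
    for ext in found:
        print("%s (%d)" % [ext, found[ext].size()])
        for path in found[ext]:
            print("  ", path)
    quit()

func _scan(dir_path: String, found: Dictionary) -> void:
    var dir := DirAccess.open(dir_path)
    if dir == null:
        return
    for file_name in dir.get_files():
        var ext: String = file_name.get_extension().to_lower()
        if ext in EXTENSIONS:
            if not found.has(ext):
                found[ext] = []
            found[ext].append(dir_path.path_join(file_name))
    for sub in dir.get_directories():
        # Skip hidden folders such as .godot (the import cache).
        if not sub.begins_with("."):
            _scan(dir_path.path_join(sub), found)
```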
Parse the target scene's node tree. Read the relevant .tscn file and reconstruct the node hierarchy. Godot's tscn format is text-based -- each [node] entry specifies name, type, parent (path relative to root), and property overrides. Identify the exact insertion point for new nodes and the parent-child relationships.
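As a rough illustration of reading the node tree from text, the following sketch pulls the name, type, and parent out of each [node] header with regular expressions. The scene path is an assumption, and the pattern covers only the quoted attributes in the header line, not property overrides.

```gdscript
# scene_tree_dump.gd: hypothetical helper; run with
#   godot --headless --path . -s scene_tree_dump.gd
extends SceneTree

const SCENE_PATH := "res://player.tscn"  # assumed path; replace with the target scene

func _init() -> void:
    var text := FileAccess.get_file_as_string(SCENE_PATH)
    # Matches header lines such as: [node name="Sprite" type="Sprite2D" parent="."]
    var header := RegEx.create_from_string("(?m)^\\[node ([^\\]]*)\\]")
    var attr := RegEx.create_from_string("(\\w+)=\"([^\"]*)\"")
    for m in header.search_all(text):
        var fields := {}
        for a in attr.search_all(m.get_string(1)):
            fields[a.get_string(1)] = a.get_string(2)
        # The root node has no parent attribute; "." marks a direct child of the root.
        var parent: String = fields.get("parent", "<root>")
        # Instanced scenes carry instance=ExtResource(...) instead of a type.
        var type: String = fields.get("type", "<instanced>")
        print("%s (%s) parent=%s" % [fields.get("name", "?"), type, parent])
    quit()
```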
Catalog available assets. List all image files (.png, .svg), audio files (.wav, .ogg), font files (.ttf), and resource files (.tres). For sprite sheets, determine the grid dimensions and which frames correspond to which animation states (idle, walk, attack, jump). This prevents the most common agent failure: selecting wrong sprite frames.
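Grid bookkeeping for a sprite sheet is simple arithmetic, sketched below. The sheet path, the 8 x 4 grid, and the row-to-state order are assumptions for illustration; the script cannot tell which row shows which pose, so that mapping still needs a visual check.

```gdscript
# sheet_info.gd: hypothetical helper; run with
#   godot --headless --path . -s sheet_info.gd
extends SceneTree

const SHEET_PATH := "res://assets/hero_spritesheet.png"  # assumed path
const COLUMNS := 8                                        # assumed grid width
const ROWS := 4                                           # assumed grid height
const STATES := ["idle", "walk", "attack", "jump"]       # assumed row order, top to bottom

func _init() -> void:
    var image := Image.load_from_file(ProjectSettings.globalize_path(SHEET_PATH))
    if image == null:
        push_error("Could not read " + SHEET_PATH)
        quit(1)
        return
    var frame_w: int = image.get_width() / COLUMNS
    var frame_h: int = image.get_height() / ROWS
    print("Frame size: %d x %d" % [frame_w, frame_h])
    if image.get_width() % COLUMNS != 0 or image.get_height() % ROWS != 0:
        push_warning("Sheet size is not an exact multiple of the grid; check for padding.")
    for row in ROWS:
        # Frames are counted left to right, then top to bottom.
        var first: int = row * COLUMNS
        print("%s: frames %d-%d, y offset %d" % [STATES[row], first, first + COLUMNS - 1, row * frame_h])
    quit()
```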
Plan the multi-file edit. Game tasks typically require coordinated changes across scene files (.tscn), scripts (.gd), resource files (.tres), and sometimes shaders (.gdshader). Draft which files change, what nodes are added/modified, and what signals need connecting. Verify that node paths used in scripts match the actual scene tree paths.
Edit scene files with correct hierarchy. When adding nodes to .tscn, ensure:
- The parent path is correct (use . for root children, NodeA for children of NodeA, NodeA/NodeB for deeper nesting)
- Required child nodes are present (for example, CharacterBody2D needs a CollisionShape2D child)
- Resource references (ExtResource / SubResource) point to valid IDs declared in the file header
Write or modify GDScript with proper node references. Use $NodeName or get_node("path") with paths that match the actual scene tree. Connect signals either in the scene file ([connection] entries) or via code (node.signal_name.connect(callable)). Verify that exported variables (@export) match the types expected by the inspector.
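The node-reference and signal rules above might look like this in a script. It is a sketch only: the HitArea child, the exported properties, and the handler name are hypothetical and would need to match the real scene.

```gdscript
# player.gd: sketch; the HitArea node and exported properties are hypothetical.
extends CharacterBody2D

@export var speed: float = 200.0     # shown in the inspector as a float field
@export var hit_effect: PackedScene  # the inspector accepts only scene resources here

# $HitArea resolves only if a direct child is named exactly "HitArea".
@onready var hit_area: Area2D = $HitArea

func _ready() -> void:
    # Code equivalent of a [connection] entry in the .tscn file.
    hit_area.body_entered.connect(_on_hit_area_body_entered)

func _on_hit_area_body_entered(body: Node2D) -> void:
    print("Hit by ", body.name)
```

Connecting in code keeps the wiring next to the handler; a [connection] entry keeps it visible in the editor's Node dock. Using both for the same signal and handler makes the second connect call fail with an "already connected" error.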
Handle sprite animations correctly. For AnimatedSprite2D, create a SpriteFrames resource with named animations. Each animation needs frames in the correct order, appropriate FPS, and loop settings. For sprite sheets, use AtlasTexture resources with correct region Rect2 values to extract individual frames. Double-check frame selection against the visual content of the sprite sheet.
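One way to build such a resource is from a script, sketched below. The sheet path, output path, 128-pixel frame size, 8-column grid, row order, and 10 FPS are all assumptions; the texture must already be imported (open the project in the editor once) for load() to succeed.

```gdscript
# build_frames.gd: hypothetical helper; run with
#   godot --headless --path . -s build_frames.gd
extends SceneTree

const SHEET_PATH := "res://assets/hero_spritesheet.png"  # assumed input
const OUT_PATH := "res://assets/hero_frames.tres"        # assumed output
const FRAME_SIZE := Vector2(128, 128)                    # assumed frame size in pixels
const COLUMNS := 8                                       # assumed frames per row
const ROW_NAMES := ["idle", "walk", "attack", "jump"]   # assumed row order, top to bottom

func _init() -> void:
    var sheet: Texture2D = load(SHEET_PATH)
    var frames := SpriteFrames.new()
    # A new SpriteFrames starts with an empty "default" animation.
    frames.remove_animation("default")
    for row in ROW_NAMES.size():
        var anim: String = ROW_NAMES[row]
        frames.add_animation(anim)
        frames.set_animation_speed(anim, 10.0)             # assumed FPS
        frames.set_animation_loop(anim, anim != "attack")  # attack plays once
        for col in COLUMNS:
            var atlas := AtlasTexture.new()
            atlas.atlas = sheet
            atlas.region = Rect2(Vector2(col, row) * FRAME_SIZE, FRAME_SIZE)
            frames.add_frame(anim, atlas)
    var err := ResourceSaver.save(frames, OUT_PATH)
    if err != OK:
        push_error("Could not save SpriteFrames: %s" % error_string(err))
    quit()
```

The saved resource can then be assigned to the sprite_frames property of an AnimatedSprite2D.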
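Verify with visual feedback. If the user can provide a screenshot of the Godot editor or a recording of gameplay, use it to verify node hierarchy, property values, and spatial layout (from the screenshot), and animation playback, physics behavior, and camera tracking (from the recording).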
Run deterministic checks. If tests exist (test.gd / test.tscn), describe how to execute them via godot --headless --path . -s test.gd. Check for node existence, property values, collision layer setup, and signal connections. Parse test output to identify remaining failures.
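A script passed to -s must extend SceneTree or MainLoop. The following is a sketch of what such a test.gd could contain, not the benchmark's actual tests: the scene path and the checked node name are hypothetical.

```gdscript
# test.gd: sketch of a headless check; scene path and node names are hypothetical.
extends SceneTree

func _init() -> void:
    var failures: Array[String] = []
    var packed: PackedScene = load("res://player.tscn")
    var player: Node = packed.instantiate()

    # Node existence and property checks.
    var shape_node := player.get_node_or_null("CollisionShape2D") as CollisionShape2D
    if shape_node == null:
        failures.append("Player is missing a CollisionShape2D child")
    elif shape_node.shape == null:
        failures.append("CollisionShape2D has no shape assigned")

    # Collision layer setup.
    var body := player as CollisionObject2D
    if body != null and body.collision_layer == 0:
        failures.append("Player is on no collision layer")

    for f in failures:
        print("FAIL: ", f)
    print("PASS" if failures.is_empty() else "%d check(s) failed" % failures.size())
    player.free()
    # A non-zero exit code lets the calling agent detect failure without parsing text.
    quit(0 if failures.is_empty() else 1)
```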
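Iterate on failures with targeted fixes. When something fails, classify the error type as structural (wrong node hierarchy), referential (wrong asset path or sprite frame), or parametric (wrong property values), and target the fix accordingly.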
Example 1: Adding a character animation from a sprite sheet
User: "I have a sprite sheet at assets/hero_spritesheet.png (8 columns, 4 rows). Row 1 is idle, row 2 is walk, row 3 is attack, row 4 is jump. Add these animations to my Player node."
Approach:
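1. Open the Player scene and locate its AnimatedSprite2D child (or add one).
2. Create a .tres SpriteFrames resource with four named animations, using zero-based rows: "idle" (frames 0-7 from row 0), "walk" (row 1), "attack" (row 2), "jump" (row 3).
3. Extract each frame as an AtlasTexture with region Rect2(col*128, row*128, 128, 128).
4. Assign the resource to the node's sprite_frames property.
5. Drive the animations from _physics_process:

    # In player.gd
    # is_attacking is assumed to be a bool member that the attack code sets elsewhere.
    func _physics_process(delta):
        var anim = $AnimatedSprite2D
        if not is_on_floor():
            anim.play("jump")
        elif is_attacking:
            # Checked before "walk" so an attack started while moving is still shown.
            anim.play("attack")
        elif velocity.length() > 10:
            anim.play("walk")
        else:
            anim.play("idle")
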
Output: Four animations correctly mapped to sprite sheet rows, with state machine logic driving transitions. The idle animation loops, attack plays once, walk/jump loop.
Example 2: Fixing a broken scene hierarchy for a platformer
User: "My player falls through the floor. The scene has a CharacterBody2D but collisions aren't working."
Approach:
1. Read the player's .tscn file and the level's .tscn file.
2. Confirm that the movement script calls move_and_slide() (not just modifying position directly).

Common fix: a missing collision shape in the tscn. The [sub_resource] entry must appear before the [node] that references it:

    [sub_resource type="RectangleShape2D" id="RectangleShape2D_abc12"]
    size = Vector2(32, 64)

    [node name="CollisionShape2D" type="CollisionShape2D" parent="Player"]
    shape = SubResource("RectangleShape2D_abc12")

Output: Player collides with floor. Root cause was either a missing CollisionShape2D node, an unassigned shape resource, or mismatched collision layer bits.
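The layer and mask cause can be checked at runtime with a temporary script. This is a hypothetical debugging aid: it assumes both the player and the floor are CollisionObject2D nodes assigned through the inspector, which does not hold for a floor built from a tile map.

```gdscript
# collision_debug.gd: hypothetical debugging aid; add it to the level scene temporarily.
extends Node

@export var player: CollisionObject2D      # assign in the inspector
@export var floor_body: CollisionObject2D  # assign in the inspector

func _ready() -> void:
    # The player is blocked by the floor only if its mask shares a bit with the floor's layer.
    var shared_bits := player.collision_mask & floor_body.collision_layer
    if shared_bits == 0:
        push_warning("Player mask %d and floor layer %d share no bits"
                % [player.collision_mask, floor_body.collision_layer])

    for body in [player, floor_body]:
        var has_shape := false
        for child in body.get_children():
            var shape_node := child as CollisionShape2D
            if shape_node != null and shape_node.shape != null:
                has_shape = true
            elif child is CollisionPolygon2D:
                has_shape = true
        if not has_shape:
            push_warning("%s has no usable collision shape as a direct child" % body.name)
```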
Example 3: Creating a dissolve shader effect
User: "Add a dissolve effect to my enemy sprite that I can trigger from code when the enemy dies."
Approach:
1. Create dissolve.gdshader with a noise-based dissolve:

    shader_type canvas_item;

    uniform float dissolve_amount : hint_range(0.0, 1.0) = 0.0;
    uniform sampler2D noise_texture;
    uniform vec4 edge_color : source_color = vec4(1.0, 0.5, 0.0, 1.0);
    uniform float edge_width = 0.05;

    void fragment() {
        vec4 tex = texture(TEXTURE, UV);
        float noise = texture(noise_texture, UV).r;
        // 0.0 where the pixel is dissolved, 1.0 where it is intact, ramping across edge_width.
        float edge = smoothstep(dissolve_amount, dissolve_amount + edge_width, noise);
        // Tint the band just above the threshold with the edge color.
        tex.rgb = mix(edge_color.rgb, tex.rgb, edge);
        // Dissolved pixels fade to transparent rather than showing edge_color.
        tex.a *= edge;
        COLOR = tex;
    }

2. Assign a texture to the noise_texture uniform in the sprite's ShaderMaterial.
3. Trigger the effect from code by tweening the dissolve_amount parameter:

    func die():
        var tween = create_tween()
        tween.tween_property(
            $Sprite2D.material, "shader_parameter/dissolve_amount",
            1.0, 0.8
        )
        tween.tween_callback(queue_free)

Output: Enemy sprite dissolves following the noise pattern, with an orange glow along the dissolving edge, over 0.8 seconds; then the node is freed.
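The material and noise texture can also be set up from code instead of the inspector. The sketch below assumes the shader lives at res://dissolve.gdshader and that the enemy has a Sprite2D child; the noise frequency is an arbitrary illustrative value.

```gdscript
# enemy.gd (setup portion): sketch; the shader path and noise settings are assumptions.
extends Node2D

func _ready() -> void:
    var noise := FastNoiseLite.new()
    noise.frequency = 0.05  # arbitrary; lower values give larger dissolve patches
    var noise_tex := NoiseTexture2D.new()
    noise_tex.noise = noise

    var mat := ShaderMaterial.new()
    mat.shader = load("res://dissolve.gdshader")
    mat.set_shader_parameter("noise_texture", noise_tex)
    mat.set_shader_parameter("dissolve_amount", 0.0)
    # Each enemy gets its own material so one death does not dissolve the others.
    $Sprite2D.material = mat
```

Creating the material per instance matters because a material saved in the scene is shared by every instance of that scene unless it is marked local to the scene.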
- Always read the full .tscn file before editing. Godot's text-based scene format has header sections ([gd_scene], [ext_resource], [sub_resource]) that must stay consistent with node references. Adding a node that references a nonexistent resource ID silently breaks the scene.
- Match node paths to the scene tree. A script on Player referencing $AnimatedSprite2D requires that node to be a direct child named exactly AnimatedSprite2D. Use get_node("Path/To/Node") for non-direct descendants.
- CollisionShape2D must be a direct child of a physics body (CharacterBody2D, StaticBody2D, RigidBody2D), not a grandchild or sibling. Check the parent field in .tscn entries.
- Do not set position directly on physics bodies for movement. Use velocity + move_and_slide() for CharacterBody2D, or apply_force()/apply_impulse() for RigidBody2D. Direct position changes bypass the physics engine and can cause tunneling.

| Error Pattern | Likely Cause | Fix |
|---|---|---|
| "Invalid get index 'sprite_frames' on base Sprite2D" | Used Sprite2D instead of AnimatedSprite2D | Change the node type to AnimatedSprite2D |
| Scene loads but viewport is empty | Nodes exist but are positioned off-screen or have zero scale | Check position, scale, and visible properties; verify camera is targeting the right area |
| "Node not found: $NodeName" | Script references a node that doesn't exist at that path in the scene tree | Cross-reference the script's node path with the actual .tscn hierarchy |
| Collision not detected | Missing CollisionShape2D, shape resource not assigned, or collision layer/mask mismatch | Verify shape assignment and that layer bits overlap between interacting bodies |
| Shader compiles but renders black | Uniform texture not assigned, or UV coordinates incorrect | Ensure all sampler2D uniforms have textures assigned in the material's shader parameters |
| Animation plays wrong frames | AtlasTexture regions calculated incorrectly from sprite sheet | Recalculate frame regions: verify sheet dimensions, column/row count, and frame order |
- Inspect project.godot before making edits. This skill targets Godot 4.x patterns.
- Use the project.godot autoload and main scene configuration as entry points.

Paper: GameDevBench: Evaluating Agentic Capabilities Through Game Development (Chi et al., 2026)
Key takeaway: Visual feedback loops (editor screenshots and runtime video capture) improve agent game development performance by ~14 percentage points. The dominant failure mode is not logic errors but multimodal asset misalignment -- wrong sprites, incorrect hierarchy depth, mismatched resource references. Prioritize verifying visual/structural correctness over code logic.