skills/code2world-gui-world-renderable/SKILL.md
Predict and simulate GUI state transitions by generating renderable HTML/CSS/SVG code from screenshots and user actions. Use when asked to: 'simulate what happens when I click this button', 'predict the next UI state', 'generate HTML that reproduces this screenshot', 'build a GUI world model', 'create a visual sandbox for UI testing', 'convert a mobile screenshot to interactive HTML'.
npx skillsauth add ndpvt-web/arxiv-claude-skills code2world-gui-world-renderable
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to simulate GUI state transitions by generating self-contained, renderable HTML/CSS/SVG code that represents the next visual state of a user interface after an action is performed. It is based on the Code2World paper (arXiv:2602.09856), whose core insight is that, instead of predicting pixels directly or describing changes in text, the model generates structured code that a browser renders deterministically into a pixel-accurate UI state. This combines high visual fidelity with fine-grained structural controllability -- the code is inspectable, editable, and renders consistently.
Code as intermediate representation for UI prediction. Traditional GUI world models either generate text descriptions (losing spatial/visual detail) or synthesize pixels directly (lacking structural control and producing blurry or hallucinated outputs). Code2World sidesteps this tradeoff by predicting the code that, when rendered by a browser engine, produces the next UI state. The prediction is modeled as: given a current screenshot I_t, an action a_t (e.g., click at coordinates, type text), and a task goal G, generate HTML code C_{t+1} such that rendering R(C_{t+1}) produces a faithful visual representation of the next screen.
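The formulation above can be written compactly. In this sketch, $f_\theta$ is an assumed name for the code-generating model and $\hat{I}_{t+1}$ an assumed name for the rendered prediction; the other symbols are the ones used in the paragraph above.

```latex
\[
C_{t+1} = f_\theta(I_t, a_t, G), \qquad \hat{I}_{t+1} = R(C_{t+1}) \approx I_{t+1}
\]
```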
Self-contained HTML with semantic placeholders. The generated code uses a fixed-dimension root container matching the original screenshot's coordinate space. Images are replaced with descriptive text placeholders (e.g., [IMG: Red sneaker product photo]) since image URLs are unreliable and unnecessary for structural prediction. UI icons are rendered as inline SVGs. No external assets or dependencies are required -- the HTML is fully self-contained and renderable in any browser.
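As a rough illustration of the constraints above, the following Python sketch assembles a fixed-size root container and an image placeholder. The helper names `image_placeholder` and `build_document`, and the grey placeholder styling, are assumptions made for this example and are not taken from the paper.

```python
from html import escape


def image_placeholder(description: str, left: int, top: int, width: int, height: int) -> str:
    """Return a grey box carrying an [IMG: ...] label instead of a real image."""
    style = (
        f"position:absolute;left:{left}px;top:{top}px;"
        f"width:{width}px;height:{height}px;background:#e0e0e0;"
        "display:flex;align-items:center;justify-content:center;"
        "font-size:28px;color:#666;"
    )
    return f'<div style="{style}">[IMG: {escape(description)}]</div>'


def build_document(width: int, height: int, children: list[str]) -> str:
    """Wrap child fragments in a fixed-size root container with no external assets."""
    body = "\n".join(children)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"></head><body style="margin:0;">\n'
        f'<div style="width:{width}px;height:{height}px;position:relative;">\n'
        f"{body}\n"
        "</div>\n</body></html>"
    )


# Illustrative values: a 1080x2400 screen with one product image placeholder.
page = build_document(
    1080, 2400, [image_placeholder("Red sneaker product photo", 40, 300, 400, 400)]
)
```

Because the output contains no `<img>` tags or URLs, it can be saved to a file and opened in a browser without network access.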
Visual-feedback refinement loop. Code quality is ensured by rendering the generated HTML and comparing it visually against the target screenshot. When the rendered output diverges (measured by visual similarity scoring), the code is revised by identifying specific discrepancies and correcting them. This render-compare-revise cycle is the key to achieving high fidelity without pixel-level generation.
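The render-compare-revise cycle can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `render` and `revise` are hypothetical callables supplied by the caller, screenshots are modeled as flat lists of grayscale values, and the mean-absolute-difference score and the 0.95 threshold are placeholders for whatever visual similarity measure is actually used.

```python
from typing import Callable, Sequence

# Assumed stand-in for a screenshot: flattened grayscale values in 0..255.
Pixels = Sequence[int]


def similarity(a: Pixels, b: Pixels) -> float:
    """1.0 for identical images, 0.0 for maximally different ones."""
    if len(a) != len(b) or not a:
        return 0.0
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / (255 * len(a))


def refine(
    code: str,
    target: Pixels,
    render: Callable[[str], Pixels],
    revise: Callable[[str, float], str],
    threshold: float = 0.95,
    max_rounds: int = 3,
) -> tuple[str, float]:
    """Render, score against the target, and revise until the score passes
    the threshold or the round budget is spent."""
    score = similarity(render(code), target)
    for _ in range(max_rounds):
        if score >= threshold:
            break
        code = revise(code, score)
        score = similarity(render(code), target)
    return code, score


# Toy stand-ins: the "renderer" emits one pixel per character, and the
# "reviser" fixes one wrong character per round.
target = [255, 255, 255]
fake_render = lambda code: [255 if ch == "#" else 0 for ch in code]
fake_revise = lambda code, score: code.replace(".", "#", 1)
final_code, final_score = refine("#..", target, fake_render, fake_revise)
```

In practice `render` would wrap a headless browser and `revise` would wrap a model call that receives the discrepancies; both are left abstract here.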
Analyze the source screenshot. Identify all visible UI elements: navigation bars, buttons, text fields, lists, images, icons, status bars, and their spatial layout. Note coordinates, sizes, colors, and hierarchy. For mobile UIs, identify the platform (Android/iOS) and standard UI patterns.
Parse the specified action. Determine the action type (tap/click at coordinates, long press, scroll, type text, swipe) and its target element. Map the action coordinates to the specific UI element being interacted with. Understand what this action would logically trigger (navigation, state change, modal, input focus, etc.).
Determine the expected state transition. Based on the action and standard UI behavior, predict what changes: new screens appearing, elements expanding/collapsing, text being entered, selections changing, dialogs opening, navigation occurring. Preserve all elements that should remain unchanged.
Generate the HTML document structure. Create a self-contained HTML file with a root <div> container set to the exact dimensions of the original screenshot (typically 1080x2400 for mobile). Use absolute positioning to place elements at their correct coordinates. Apply inline styles for all visual properties.
Encode visual elements as renderable code. Translate each UI component into HTML/CSS: use <div> elements with background colors and border-radius for buttons, inline SVG for icons, styled <span> elements for text, and descriptive placeholder blocks ([IMG: description]) for images. Match colors, fonts, spacing, and sizing from the original screenshot.
Apply the action-conditioned changes. Modify the HTML to reflect the predicted next state: add new elements for newly visible content, remove or hide dismissed elements, update text content for input actions, change visual states (e.g., button pressed, checkbox toggled), and adjust layout for scroll or navigation actions.
Validate structural consistency. Review the generated HTML to ensure: the coordinate system is preserved, no elements overlap incorrectly, text is readable, interactive elements are visually distinct, and the overall layout matches expected platform conventions.
Render and compare. If possible, render the HTML in a browser or headless renderer and visually compare against expectations. Identify discrepancies in element positioning, sizing, color, or missing components. Revise the code to correct any issues found.
Deliver the output. Provide the complete, self-contained HTML file that can be opened directly in a browser. Include a brief description of what changed and why, linking the action to the visual state transition.
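Two of the steps above, mapping action coordinates to an element and applying the action-conditioned change, can be sketched with plain bounding boxes. The `Element` class, the smallest-containing-box rule, and the coordinates below are assumptions chosen for illustration; a real screenshot would need its own element extraction.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Element:
    name: str
    left: int
    top: int
    width: int
    height: int
    checked: Optional[bool] = None  # None means the element is not a toggle

    def contains(self, x: int, y: int) -> bool:
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)

    @property
    def area(self) -> int:
        return self.width * self.height


def hit_test(elements: list[Element], x: int, y: int) -> Optional[Element]:
    """Return the smallest element containing (x, y), or None on a miss."""
    hits = [e for e in elements if e.contains(x, y)]
    return min(hits, key=lambda e: e.area) if hits else None


def apply_tap(elements: list[Element], x: int, y: int) -> list[Element]:
    """Flip the tapped toggle and leave every other element unchanged."""
    target = hit_test(elements, x, y)
    if target is None or target.checked is None:
        return list(elements)
    return [replace(e, checked=not e.checked) if e is target else e
            for e in elements]


# Illustrative layout for a 1080x2400 settings screen.
screen = [
    Element("wifi_row", 0, 240, 1080, 140),
    Element("wifi_toggle", 940, 282, 100, 56, checked=False),
    Element("bluetooth_toggle", 940, 422, 100, 56, checked=False),
]
next_screen = apply_tap(screen, 990, 310)
```

Picking the smallest containing box is one simple way to prefer a nested control over its parent row; a tap that returns `None` corresponds to the non-interactive case where asking the user is the safer choice.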
Example 1: Predicting a button tap on a settings screen
User: "Here's a screenshot of an Android settings page. What happens when I tap the 'Wi-Fi' toggle that's currently OFF?"
Approach:
Output:
<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><style>
* { margin: 0; padding: 0; box-sizing: border-box; }
body { background: #f5f5f5; font-family: 'Roboto', sans-serif; }
</style></head>
<body>
<div style="width:1080px;height:2400px;position:relative;background:#f5f5f5;">
<!-- Status bar -->
<div style="position:absolute;top:0;left:0;width:1080px;height:88px;background:#fff;
display:flex;align-items:center;padding:0 32px;">
<span style="font-size:28px;color:#333;">9:41</span>
<!-- ... battery, signal icons as inline SVG ... -->
</div>
<!-- Settings header -->
<div style="position:absolute;top:88px;left:0;width:1080px;height:120px;background:#fff;
display:flex;align-items:center;padding:0 40px;">
<svg width="48" height="48" viewBox="0 0 24 24" fill="none" stroke="#333" stroke-width="2">
<path d="M15 18l-6-6 6-6"/></svg>
<span style="font-size:44px;color:#1a1a1a;margin-left:32px;font-weight:500;">Settings</span>
</div>
<!-- Wi-Fi row - CHANGED: toggle is now ON -->
<div style="position:absolute;top:240px;left:0;width:1080px;height:140px;background:#fff;
display:flex;align-items:center;padding:0 40px;border-bottom:1px solid #e0e0e0;">
<svg width="48" height="48" viewBox="0 0 24 24" fill="#1a73e8">
<path d="M1 9l2 2c4.97-4.97 13.03-4.97 18 0l2-2C16.93 2.93 7.08 2.93 1 9zm8 8l3 3 3-3
c-1.65-1.66-4.34-1.66-6 0zm-4-4l2 2c2.76-2.76 7.24-2.76 10 0l2-2
C15.14 9.14 8.87 9.14 5 13z"/>
</svg>
<div style="margin-left:32px;flex:1;">
<span style="font-size:34px;color:#1a1a1a;">Wi-Fi</span>
<span style="font-size:26px;color:#666;display:block;margin-top:4px;">HomeNetwork_5G</span>
</div>
<!-- Toggle ON state -->
<div style="width:100px;height:56px;background:#1a73e8;border-radius:28px;position:relative;">
<div style="width:48px;height:48px;background:#fff;border-radius:50%;
position:absolute;top:4px;right:4px;box-shadow:0 2px 4px rgba(0,0,0,0.2);"></div>
</div>
</div>
<!-- Bluetooth row - unchanged -->
<div style="position:absolute;top:380px;left:0;width:1080px;height:140px;background:#fff;
display:flex;align-items:center;padding:0 40px;border-bottom:1px solid #e0e0e0;">
<svg width="48" height="48" viewBox="0 0 24 24" fill="#666">
<path d="M17.71 7.71L12 2h-1v7.59L6.41 5 5 6.41 10.59 12 5 17.59 6.41 19 11 14.41V22h1
l5.71-5.71-4.3-4.29 4.3-4.29zM13 5.83l1.88 1.88L13 9.59V5.83zm1.88 10.46L13 18.17
v-3.76l1.88 1.88z"/>
</svg>
<div style="margin-left:32px;flex:1;">
<span style="font-size:34px;color:#1a1a1a;">Bluetooth</span>
</div>
<div style="width:100px;height:56px;background:#bdbdbd;border-radius:28px;position:relative;">
<div style="width:48px;height:48px;background:#fff;border-radius:50%;
position:absolute;top:4px;left:4px;box-shadow:0 2px 4px rgba(0,0,0,0.2);"></div>
</div>
</div>
</div>
</body>
</html>
Key changes: Wi-Fi toggle moved from gray/left (OFF) to blue/right (ON), Wi-Fi icon turned blue, connected network name appeared below the label.
Example 2: Simulating text input in a search field
User: "I have a screenshot of Google's home page. Simulate what it looks like after I tap the search bar and type 'weather today'."
Approach:
Output:
<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><style>
* { margin: 0; padding: 0; box-sizing: border-box; }
</style></head>
<body>
<div style="width:1080px;height:2400px;position:relative;background:#fff;">
<!-- Google logo placeholder -->
<div style="position:absolute;top:600px;left:50%;transform:translateX(-50%);
width:520px;height:180px;display:flex;align-items:center;justify-content:center;">
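<!-- display:flex drops the whitespace between the letter spans so the logo renders as one word -->
<span style="display:flex;font-size:96px;font-weight:bold;">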
<span style="color:#4285F4;">G</span><span style="color:#EA4335;">o</span>
<span style="color:#FBBC05;">o</span><span style="color:#4285F4;">g</span>
<span style="color:#34A853;">l</span><span style="color:#EA4335;">e</span>
</span>
</div>
<!-- Search bar - CHANGED: focused with typed text -->
<div style="position:absolute;top:820px;left:80px;width:920px;height:100px;
border:2px solid #4285F4;border-radius:50px;display:flex;align-items:center;
padding:0 32px;box-shadow:0 4px 12px rgba(0,0,0,0.15);">
<svg width="40" height="40" viewBox="0 0 24 24" fill="#4285F4">
<path d="M15.5 14h-.79l-.28-.27C15.41 12.59 16 11.11 16 9.5 16 5.91 13.09 3
9.5 3S3 5.91 3 9.5 5.91 16 9.5 16c1.61 0 3.09-.59 4.23-1.57l.27.28v.79l5
4.99L20.49 19l-4.99-5zm-6 0C7.01 14 5 11.99 5 9.5S7.01 5 9.5 5 14 7.01 14
9.5 11.99 14 9.5 14z"/>
</svg>
<span style="font-size:34px;color:#1a1a1a;margin-left:20px;">weather today</span>
<span style="display:inline-block;width:2px;height:40px;background:#4285F4;
margin-left:2px;animation:blink 1s step-end infinite;"></span>
<div style="flex:1;"></div>
<svg width="40" height="40" viewBox="0 0 24 24" fill="#999" style="margin-left:16px;">
<path d="M19 11h-1.7c0 .74-.16 1.44-.43 2.08l1.27 1.27c.56-.97.86-2.1.86-3.35zm-4.02
.17c0-.06.02-.11.02-.17V5c0-1.66-1.34-3-3-3S9 3.34 9 5v6c0 .06 0 .11.02.17l5.96
0zM4.27 3L3 4.27l6.01 6.01V11c0 1.66 1.33 3 2.99 3 .22 0 .44-.03.65-.08l1.66
1.66c-.71.33-1.5.52-2.31.52-2.76 0-5.3-2.1-5.3-5.1H5c0 3.41 2.72 6.23 6 6.72V21h2
v-3.28c.91-.13 1.77-.45 2.54-.9L19.73 21 21 19.73 4.27 3z"/>
</svg>
</div>
<!-- Autocomplete suggestions - NEW element -->
<div style="position:absolute;top:930px;left:80px;width:920px;background:#fff;
border:1px solid #dfe1e5;border-radius:0 0 24px 24px;box-shadow:0 4px 6px rgba(0,0,0,0.1);
padding:16px 0;">
<div style="padding:16px 32px;display:flex;align-items:center;">
<svg width="36" height="36" viewBox="0 0 24 24" fill="#999">
<path d="M15.5 14h-.79l-.28-.27C15.41 12.59 16 11.11 16 9.5 16 5.91 13.09 3
9.5 3S3 5.91 3 9.5 5.91 16 9.5 16c1.61 0 3.09-.59 4.23-1.57l.27.28v.79l5
4.99L20.49 19l-4.99-5zm-6 0C7.01 14 5 11.99 5 9.5S7.01 5 9.5 5 14 7.01 14
9.5 11.99 14 9.5 14z"/>
</svg>
<span style="font-size:30px;color:#1a1a1a;margin-left:20px;">weather today in my area</span>
</div>
<div style="padding:16px 32px;display:flex;align-items:center;">
<svg width="36" height="36" viewBox="0 0 24 24" fill="#999">
<path d="M15.5 14h-.79l-.28-.27C15.41 12.59 16 11.11 16 9.5 16 5.91 13.09 3
9.5 3S3 5.91 3 9.5 5.91 16 9.5 16c1.61 0 3.09-.59 4.23-1.57l.27.28v.79l5
4.99L20.49 19l-4.99-5zm-6 0C7.01 14 5 11.99 5 9.5S7.01 5 9.5 5 14 7.01 14
9.5 11.99 14 9.5 14z"/>
</svg>
<span style="font-size:30px;color:#1a1a1a;margin-left:20px;">weather today hourly</span>
</div>
</div>
</div>
<style>@keyframes blink { 50% { opacity: 0; } }</style>
</body>
</html>
Key changes: search bar gained focus border (blue) and shadow, typed text appeared, cursor blinks after text, autocomplete dropdown appeared below with suggestions.
Example 3: Navigation action -- tapping a list item
User: "This is an email inbox screenshot. Predict what the screen looks like after I tap on the first email in the list."
Approach:
Output structure (abbreviated):
<!-- New screen: Email detail view -->
<div style="width:1080px;height:2400px;position:relative;background:#fff;">
<!-- App bar with back arrow and action icons -->
<div style="position:absolute;top:88px;left:0;width:1080px;height:120px;...">
<svg><!-- back arrow --></svg>
<div style="flex:1;"></div>
<svg><!-- archive icon --></svg>
<svg><!-- delete icon --></svg>
<svg><!-- more options --></svg>
</div>
<!-- Sender info with avatar placeholder -->
<div style="position:absolute;top:220px;...">
<div style="width:80px;height:80px;border-radius:50%;background:#1a73e8;
display:flex;align-items:center;justify-content:center;">
<span style="color:#fff;font-size:36px;font-weight:bold;">JD</span>
</div>
<div>
<span style="font-size:34px;font-weight:500;">John Doe</span>
<span style="font-size:26px;color:#666;">to me</span>
</div>
</div>
<!-- Email subject -->
<div style="position:absolute;top:340px;left:40px;">
<span style="font-size:40px;font-weight:400;color:#1a1a1a;">Q4 Report Review</span>
</div>
<!-- Email body text -->
<div style="position:absolute;top:420px;left:40px;right:40px;">
<p style="font-size:30px;color:#333;line-height:1.6;">
Hi, please find attached the Q4 report for your review.
Let me know if you have any questions...
</p>
</div>
<!-- Reply/Forward bar at bottom -->
<div style="position:absolute;bottom:0;left:0;width:1080px;height:120px;...">
<button style="...">Reply</button>
<button style="...">Forward</button>
</div>
</div>
Key change: entire screen transitioned from inbox list to email detail view, preserving platform UI conventions (Material Design app bar, avatar, action buttons).
Do:
Use descriptive image placeholders such as [IMG: Product thumbnail of red sneakers]. These convey content meaning without requiring external assets.
Avoid:
Ambiguous action targets: If the action coordinates fall between two UI elements or on a non-interactive area, ask the user to clarify which element they intended to interact with rather than guessing.
Unknown navigation destinations: When tapping a button that would navigate to a screen not visible in the provided screenshot (e.g., "Settings" from a home screen), generate the predicted screen based on standard platform conventions and clearly note that the prediction is based on typical patterns, not observed content.
Complex dynamic content: For actions that trigger animations, loading states, or asynchronous data fetches, generate the final settled state rather than intermediate frames. Note any loading states that would appear transiently.
Coordinate system mismatch: If the user provides coordinates that don't match the apparent resolution of the screenshot, normalize coordinates to the detected resolution before mapping to elements.
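The coordinate normalization mentioned in the last case can be sketched as a proportional rescale. The function name `normalize_point`, the rounding, and the clamping to the container bounds are assumptions for this example.

```python
def normalize_point(
    x: float,
    y: float,
    source_size: tuple[int, int],
    target_size: tuple[int, int],
) -> tuple[int, int]:
    """Rescale (x, y) from the resolution the user assumed to the
    resolution detected in the screenshot."""
    src_w, src_h = source_size
    dst_w, dst_h = target_size
    if src_w <= 0 or src_h <= 0:
        raise ValueError("source_size must be positive")
    nx = round(x * dst_w / src_w)
    ny = round(y * dst_h / src_h)
    # Clamp so a tap on the far edge still lands inside the root container.
    return min(max(nx, 0), dst_w - 1), min(max(ny, 0), dst_h - 1)


# A tap given in 360x800 logical units, mapped onto a 1080x2400 screenshot.
tap = normalize_point(180, 400, source_size=(360, 800), target_size=(1080, 2400))
```

The normalized point can then be matched against element bounding boxes in the screenshot's own coordinate space.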
Paper: Code2World: A GUI World Model via Renderable Code Generation -- Zheng et al., 2026. Key sections: Section 3 (methodology) for the code generation pipeline and visual-feedback revision loop; Section 4 (AndroidCode dataset) for understanding the HTML representation constraints; Section 5 (render-aware RL) for the dual reward structure that balances visual fidelity with action consistency.