skills/skillxiv-v0.0.2-claude-opus-4.6/computer-using-world-model/SKILL.md
Enable AI agents to safely explore action outcomes before execution by predicting UI state changes in desktop applications. The approach has two stages: first predict a textual description of what changes, then synthesize a visual representation of the resulting screen. This lets agents compare multiple candidate actions without risky trial-and-error. The model is trained on Microsoft Office interactions (Word, Excel, PowerPoint).
npx skillsauth add ADu2021/skillXiv computer-using-world-model
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Web and desktop automation agents face a critical challenge: many actions are irreversible and costly. Deleting a file, overwriting a spreadsheet cell, or closing an unsaved document cannot be undone, yet agents often learn through trial-and-error exploration. Unlike robotics where physical environments are forgiving, software environments demand careful planning before execution.
The standard approach—act first, observe consequences—is unsafe in software contexts where action consequences are immediate and permanent. A safer strategy is to enable agents to simulate action outcomes before committing to execution, supporting "think-then-act" decision-making that compares multiple candidate paths without risky exploration.
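The Computer-Using World Model (CUWM) predicts how desktop applications will change in response to user actions without actually executing them. The system operates in two stages: it first predicts a textual description of what changes, then synthesizes a visual representation of the resulting screen.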
This separation allows the model to focus on decision-relevant information (what changes) separately from appearance details (how it looks), improving generalization and prediction accuracy.
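The two-stage split can be illustrated with a deliberately tiny stand-in. The sketch below is not CUWM itself: it assumes a toy "screen" that is just a mapping from spreadsheet cell names to text, and the names `describe_change`, `render_change`, and `predict` are hypothetical. What it shows is the shape of the idea: stage 1 produces only a description of the change, and stage 2 builds the predicted screen from that description without touching the real state.

```python
import re
from typing import Dict, Tuple

# Hypothetical toy "screen": a spreadsheet as a mapping of cell name -> text.
Screen = Dict[str, str]

_CHANGE = re.compile(r"cell (\w+) changes from '(.*)' to '(.*)'")


def describe_change(screen: Screen, action: dict) -> str:
    """Stage 1 (toy): state in words what the action would change."""
    old = screen.get(action.get("target", ""), "")
    if action["type"] == "type":
        return f"cell {action['target']} changes from '{old}' to '{action['value']}'"
    if action["type"] == "clear":
        return f"cell {action['target']} changes from '{old}' to ''"
    return "nothing changes"


def render_change(screen: Screen, description: str) -> Screen:
    """Stage 2 (toy): apply the described change to a copy of the screen."""
    predicted = dict(screen)  # copy, so the real state is never mutated
    match = _CHANGE.fullmatch(description)
    if match:
        cell, _old, new = match.groups()
        predicted[cell] = new
    return predicted


def predict(screen: Screen, action: dict) -> Tuple[str, Screen]:
    """Run both stages and return (description, predicted screen)."""
    description = describe_change(screen, action)
    return description, render_change(screen, description)
```

Because stage 2 consumes only the description, an error in the predicted screen can be traced to either a wrong description or a wrong rendering, which is one practical benefit of keeping the stages separate.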
Implement a two-stage predictor combining text generation and visual modification:
    def predict_ui_state_change(screenshot, action, model):
        """
        Predict what will change when action is taken.
        screenshot: PIL Image of current state
        action: dict with type, target, value (e.g., {'type': 'click', 'target': (x, y)})
        Returns: (change_description, predicted_screenshot)
        """
        # Encode current screenshot to latent state
        state_latent = model.encode_screenshot(screenshot)
        action_latent = model.embed_action(action)
        # Predict textual change
        change_description = model.predict_change(
            state_latent, action_latent, temperature=0.3
        )
        # Synthesize visual outcome
        predicted_screenshot = model.render_screenshot(
            screenshot, change_description, action
        )
        return change_description, predicted_screenshot
Implement multi-action planning by comparing outcomes:
    def plan_action_sequence(
        screenshot, goal, candidate_actions, model, num_lookahead=3
    ):
        """
        Evaluate multiple action candidates and select the best one.
        Nothing is executed until the final action has been selected.
        Note: num_lookahead is accepted but not used here; only a single
        step is simulated per candidate.
        """
        if not candidate_actions:
            raise ValueError("candidate_actions must not be empty")
        outcomes = []
        for action in candidate_actions:
            change_desc, pred_screenshot = predict_ui_state_change(
                screenshot, action, model
            )
            # Score predicted outcome against goal
            outcome_score = model.score_against_goal(
                pred_screenshot, goal, change_desc
            )
            outcomes.append({
                'action': action,
                'prediction': change_desc,
                'screenshot': pred_screenshot,
                'score': outcome_score
            })
        # Rank by score and return best action
        best_outcome = max(outcomes, key=lambda x: x['score'])
        return best_outcome['action']
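The planner above takes a `num_lookahead` argument but scores each candidate after a single simulated step. One way multi-step lookahead could work is sketched below as a depth-limited search. This is an illustrative assumption, not the skill's documented method: `simulate` and `score` are hypothetical callables standing in for the world model and the goal scorer, and the toy state is a plain integer.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")


def best_first_action(
    state: State,
    actions: Sequence[Action],
    simulate: Callable[[State, Action], State],
    score: Callable[[State], float],
    depth: int,
) -> Tuple[float, Optional[Action]]:
    """Return (best score reachable within `depth` simulated steps, first action).

    Only `simulate` is called, so nothing is executed on the real application.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    best_score, best_action = float("-inf"), None
    for action in actions:
        predicted = simulate(state, action)
        value = score(predicted)
        if depth > 1:
            # Look further ahead from the predicted state.
            deeper, _ = best_first_action(predicted, actions, simulate, score, depth - 1)
            value = max(value, deeper)
        if value > best_score:
            best_score, best_action = value, action
    return best_score, best_action


# Toy world: the state is an integer and the goal is to reach 6.
def toy_simulate(state: int, action: str) -> int:
    return state + 1 if action == "inc" else state * 2


def toy_score(state: int) -> float:
    return -abs(state - 6)
```

Starting from 2, a one-step search prefers `"double"` (4 is closer to 6 than 3 is), while a two-step search prefers `"inc"` because 3 can then be doubled to exactly 6. The cost grows as the number of actions raised to the power of the depth, which is why the depth is usually kept small.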
Implement training on observed (action, outcome) pairs from interaction traces:
    def train_world_model(
        interaction_traces, model, optimizer, num_epochs=10
    ):
        """
        Train on pairs of (screenshot_before, action, screenshot_after).
        interaction_traces: list of trajectories with screenshots and actions
        """
        for epoch in range(num_epochs):
            total_loss = 0.0
            num_steps = 0
            for trajectory in interaction_traces:
                for step_idx in range(len(trajectory) - 1):
                    before_screenshot = trajectory[step_idx]['screenshot']
                    action = trajectory[step_idx]['action']
                    after_screenshot = trajectory[step_idx + 1]['screenshot']
                    # Predict change
                    pred_change, pred_screenshot = predict_ui_state_change(
                        before_screenshot, action, model
                    )
                    # Compute losses
                    text_loss = model.change_loss(
                        pred_change, trajectory[step_idx + 1]['change_description']
                    )
                    image_loss = model.visual_loss(pred_screenshot, after_screenshot)
                    loss = text_loss + image_loss
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()
                    # Accumulate a plain float so the autograd graph is not kept
                    # alive (assumes a PyTorch-style tensor with .item())
                    total_loss += loss.item()
                    num_steps += 1
            # Average over training steps rather than over trajectories
            print(f"Epoch {epoch}: avg loss = {total_loss / max(num_steps, 1)}")
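The training loop above reads each trajectory as a list of step dicts, pairing step `i` with step `i + 1`. A small helper can make that data layout explicit and easy to check before any model is involved. The sketch below assumes the same keys the loop uses (`screenshot`, `action`, `change_description`); the `Transition` type and `iter_transitions` name are hypothetical, and strings stand in for real screenshots.

```python
from typing import Any, Dict, Iterator, List, NamedTuple


class Transition(NamedTuple):
    before: Any               # screenshot before the action
    action: Any               # action taken at this step
    after: Any                # screenshot observed afterwards
    change_description: str   # text label stored on the following step


def iter_transitions(traces: List[List[Dict[str, Any]]]) -> Iterator[Transition]:
    """Yield one training example per consecutive pair of steps."""
    for trajectory in traces:
        for current, following in zip(trajectory, trajectory[1:]):
            yield Transition(
                before=current["screenshot"],
                action=current["action"],
                after=following["screenshot"],
                change_description=following["change_description"],
            )
```

A trajectory with `n` steps yields `n - 1` transitions, so a single-step trajectory contributes nothing, and the final step's `action` is never used.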
| Parameter | Default | Guidance |
|---|---|---|
| Change prediction temperature | 0.3 | Lower (0.1–0.2) for deterministic predictions; higher for diversity |
| Visual synthesis method | Copy + modify | Use OCR-based element detection for text field updates |
| Action space | 50–100 candidates | Score top-k (k=5) actions to avoid redundant computation |
| Lookahead depth | 3 steps | Increase for planning-heavy tasks; limit for speed |
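The top-k guidance in the table can be sketched as a two-pass selection: rank all candidates with something cheap, then run the expensive world-model scoring only on the shortlist. The source does not say where the cheap ranking comes from, so `cheap_prior` below is an assumption (for example, a heuristic or the agent policy's own preference), and `expensive_score` is a hypothetical stand-in for simulate-then-score.

```python
import heapq
from typing import Callable, Sequence, TypeVar

Action = TypeVar("Action")


def select_action(
    candidates: Sequence[Action],
    cheap_prior: Callable[[Action], float],
    expensive_score: Callable[[Action], float],
    k: int = 5,
) -> Action:
    """Shortlist the k most promising candidates, then score only those."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    # Pass 1: cheap ranking over the full action space.
    shortlist = heapq.nlargest(k, candidates, key=cheap_prior)
    # Pass 2: expensive world-model scoring, at most k calls.
    return max(shortlist, key=expensive_score)
```

With 50 candidates and `k=5`, the expensive scorer runs 5 times instead of 50. The trade-off is that a good action ranked poorly by the cheap prior is never simulated, so `k` should grow when the prior is unreliable.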
When to use: For autonomous desktop automation agents (RPA, form filling, document editing) where mistakes are costly or irreversible.
When not to use: For simple click-and-wait tasks where exploration risk is low; overhead of prediction may exceed benefit.
Common pitfalls:
CUWM enables "think-then-act" planning by allowing agents to evaluate action outcomes before execution. Training on Microsoft Office interactions demonstrates the approach scales to complex applications with multiple interacting components and rich state representations.