skills/agent-desktop/SKILL.md
Desktop automation via native OS accessibility trees using the agent-desktop CLI. Use when an AI agent needs to observe, interact with, or automate desktop applications (click buttons, fill forms, navigate menus, read UI state, toggle checkboxes, scroll, drag, type text, take screenshots, manage windows, use clipboard, manage notifications). Covers 54 commands across observation, interaction, keyboard/mouse, app lifecycle, notifications (macOS), clipboard, wait, and a `skills` command that prints these bundled docs straight from the binary. Triggers on: "click button", "fill form", "open app", "read UI", "automate desktop", "accessibility tree", "snapshot app", "type into field", "navigate menu", "toggle checkbox", "take screenshot", "desktop automation", "agent-desktop", or any desktop GUI interaction task. Supports the macOS Phase 1 adapter, with Windows and Linux planned against the same core contracts.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add lahfir/agent-desktop agent-desktop
CLI tool enabling AI agents to observe and control desktop applications via native OS accessibility trees.
Core principle: agent-desktop is NOT an AI agent. It is a tool that AI agents invoke. It outputs structured JSON with ref-based element identifiers. The observation-action loop lives in the calling agent.
npm install -g agent-desktop
# or
bun install -g --trust agent-desktop
Requires macOS 12+ with Accessibility permission granted to your terminal. Screen Recording permission is also required for screenshots.
Detailed documentation is split into focused reference files. Read them as needed:
| Reference | Contents |
|-----------|----------|
| references/commands-observation.md | snapshot, find, get, is, screenshot, list-surfaces — all flags, output examples |
| references/commands-interaction.md | click, type, set-value, select, toggle, scroll, drag, keyboard, mouse — choosing the right command |
| references/commands-system.md | launch, close, windows, clipboard, wait, batch, status, permissions, version |
| references/workflows.md | 12 common patterns: forms, menus, dialogs, scroll-find, drag-drop, async wait, anti-patterns |
| references/macos.md | macOS permissions/TCC, AX API internals, smart activation chain, surfaces, Notification Center, troubleshooting |
Use progressive skeleton traversal as the default approach. It reduces token consumption by 78-96% for dense apps by exploring the UI in two phases: a shallow skeleton overview, then targeted drill-downs into regions of interest.
1. SKELETON → agent-desktop snapshot --skeleton --app "App" -i --compact
Parse the overview. Identify the region containing your target.
Regions show children_count (e.g., "Sidebar" with children_count: 42).
Named containers at truncation boundary have refs for drill-down.
Keep the returned snapshot_id.
2. DRILL → agent-desktop snapshot --root @e3 --snapshot <snapshot_id> -i --compact
Expand the target region. Now you see its interactive elements.
3. ACT → agent-desktop click @e12 --snapshot <snapshot_id> (or type, select, toggle...)
4. VERIFY → agent-desktop snapshot --root @e3 --snapshot <snapshot_id> -i --compact
Re-drill the same region to confirm the state change.
Scoped invalidation: only @e3's subtree refs are replaced.
5. REPEAT → Continue drilling other regions or acting as needed.
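The loop above can be sketched from the calling agent's side as a set of small argument-list builders. This is a minimal illustration, not part of agent-desktop: the helper names (`skeleton_cmd`, `drill_cmd`, `act_cmd`) are hypothetical, and the `snapshot_id` is taken as a plain parameter because its exact location inside the JSON `data` payload is not shown here.

```python
import shlex
from typing import List

CLI = "agent-desktop"


def skeleton_cmd(app: str) -> List[str]:
    """Step 1 (SKELETON): shallow overview of the app's UI."""
    return [CLI, "snapshot", "--skeleton", "--app", app, "-i", "--compact"]


def drill_cmd(root_ref: str, snapshot_id: str) -> List[str]:
    """Steps 2 and 4 (DRILL / VERIFY): expand or re-expand one region."""
    return [CLI, "snapshot", "--root", root_ref,
            "--snapshot", snapshot_id, "-i", "--compact"]


def act_cmd(action: str, ref: str, snapshot_id: str, *extra: str) -> List[str]:
    """Step 3 (ACT): a ref-consuming command such as click, type, or toggle."""
    return [CLI, action, ref, "--snapshot", snapshot_id, *extra]


if __name__ == "__main__":
    # "<snapshot_id>" is a placeholder; use the ID returned by the skeleton step.
    steps = [
        skeleton_cmd("App"),
        drill_cmd("@e3", "<snapshot_id>"),
        act_cmd("click", "@e12", "<snapshot_id>"),
        drill_cmd("@e3", "<snapshot_id>"),  # verify by re-drilling the same region
    ]
    for argv in steps:
        print(shlex.join(argv))
```

Building argument lists rather than shell strings lets the caller pass them directly to `subprocess.run` without quoting concerns for app names that contain spaces.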
When to skip skeleton and use full snapshot instead:

- … `find` instead

When skeleton shines:

- … `@e1`, `@e2`, `@e3`...
- … `available_actions`)
- … `snapshot_id`; ref-consuming commands accept `--snapshot <snapshot_id>`
- `last_refmap.json` is only a latest-snapshot inspection artifact. The command path uses snapshot-scoped storage.
- `--root @e3` only replaces refs from `@e3`'s previous drill; refs from other regions and the skeleton itself are preserved

Every command returns a JSON envelope on stdout:
Success: { "version": "2.0", "ok": true, "command": "snapshot", "data": { ... } }
Error: { "version": "2.0", "ok": false, "command": "click", "error": { "code": "STALE_REF", "message": "...", "suggestion": "..." } }
Exit codes: 0 success, 1 structured error, 2 argument error.
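A calling agent will typically decode this envelope before doing anything else. The following is a minimal sketch under stated assumptions: `parse_envelope`, `AgentDesktopError`, and `EXIT_MEANINGS` are hypothetical names, and only the envelope fields shown above (`ok`, `command`, `data`, `error.code`, `error.message`, `error.suggestion`) are relied on.

```python
import json
from typing import Any, Dict

# Exit codes as documented above.
EXIT_MEANINGS = {0: "success", 1: "structured error", 2: "argument error"}


class AgentDesktopError(Exception):
    """Raised when an envelope reports ok == false."""

    def __init__(self, command: str, code: str, message: str, suggestion: str):
        super().__init__(f"{command}: {code}: {message}")
        self.command = command
        self.code = code
        self.message = message
        self.suggestion = suggestion


def parse_envelope(stdout: str) -> Dict[str, Any]:
    """Return the `data` payload on success; raise AgentDesktopError otherwise."""
    envelope = json.loads(stdout)
    if envelope.get("ok"):
        return envelope.get("data", {})
    error = envelope.get("error", {})
    raise AgentDesktopError(
        command=envelope.get("command", ""),
        code=error.get("code", ""),
        message=error.get("message", ""),
        suggestion=error.get("suggestion", ""),
    )


if __name__ == "__main__":
    sample = ('{"version": "2.0", "ok": false, "command": "click", '
              '"error": {"code": "STALE_REF", "message": "...", "suggestion": "..."}}')
    try:
        parse_envelope(sample)
    except AgentDesktopError as exc:
        print(exc.code, "->", exc.suggestion)
```

Raising a typed exception keeps the error `code` and `suggestion` available to the observation-action loop, which is where recovery decisions belong.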
| Code | Meaning | Recovery |
|------|---------|----------|
| PERM_DENIED | Accessibility or Screen Recording permission not granted | Grant the named permission in System Settings |
| ELEMENT_NOT_FOUND | Ref cannot be resolved against the live UI | Re-run snapshot, use fresh ref |
| APP_NOT_FOUND | App not running | Launch it first |
| ACTION_FAILED | AX action rejected | Try an explicit alternative command |
| ACTION_NOT_SUPPORTED | Element can't do this | Use different command |
| STALE_REF | Ref from old snapshot | Re-run snapshot |
| SNAPSHOT_NOT_FOUND | Snapshot ID is missing or expired | Run snapshot again and use the returned ID |
| POLICY_DENIED | A physical/headed path was blocked | Use an explicit mouse/focus/keyboard command if physical interaction is intended |
| WINDOW_NOT_FOUND | No matching window | Check app name, use list-windows |
| PLATFORM_NOT_SUPPORTED | Adapter method not implemented on this platform | Use a supported platform adapter |
| TIMEOUT | Wait condition not met | Increase --timeout |
| INVALID_ARGS | Bad arguments | Check command syntax |
| NOTIFICATION_NOT_FOUND | Notification index no longer exists | Re-run list-notifications |
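Several rows in the table share the same recovery: take a new snapshot and use fresh refs. A caller might encode that as a small retry helper. This is a sketch only: `act_with_resnapshot` and `RESNAPSHOT_CODES` are hypothetical names, the two callables stand in for whatever code actually invokes the CLI, and the choice of two attempts is an assumption rather than a documented policy.

```python
from typing import Any, Callable, Dict

Envelope = Dict[str, Any]

# Codes whose documented recovery is to re-run snapshot and use fresh refs or IDs.
RESNAPSHOT_CODES = {"STALE_REF", "ELEMENT_NOT_FOUND", "SNAPSHOT_NOT_FOUND"}


def act_with_resnapshot(
    act: Callable[[Any], Envelope],
    resnapshot: Callable[[], Any],
    target: Any,
    max_attempts: int = 2,
) -> Envelope:
    """Run act(target); on a re-snapshot code, refresh the target and retry.

    `act` returns a decoded JSON envelope. `resnapshot` returns a fresh target
    (for example a new ref plus snapshot ID). Other error codes are returned
    to the caller unchanged, since their recovery differs.
    """
    envelope = act(target)
    for _ in range(max_attempts - 1):
        if envelope.get("ok"):
            break
        code = envelope.get("error", {}).get("code")
        if code not in RESNAPSHOT_CODES:
            break
        target = resnapshot()
        envelope = act(target)
    return envelope


if __name__ == "__main__":
    # Stub callables that simulate one stale ref followed by success.
    def fake_act(t: Any) -> Envelope:
        if t == "old":
            return {"ok": False, "error": {"code": "STALE_REF"}}
        return {"ok": True, "data": {}}

    print(act_with_resnapshot(fake_act, lambda: "new", "old"))
```

Codes such as `PERM_DENIED` or `INVALID_ARGS` are deliberately not retried, because re-running the same command cannot resolve them.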
agent-desktop snapshot --skeleton --app "App" -i --compact # Skeleton overview (preferred)
agent-desktop snapshot --root @e3 -i --compact # Drill into region
agent-desktop snapshot --app "App" -i # Full tree (simple apps)
agent-desktop snapshot --app "App" --surface menu -i # Surface snapshot
agent-desktop screenshot --app "App" out.png # PNG screenshot
agent-desktop find --app "App" --role button # Search elements
agent-desktop get @e1 --snapshot <snapshot_id> --property text # Read element property
agent-desktop is @e1 --snapshot <snapshot_id> --property enabled # Check element state
agent-desktop list-surfaces --app "App" # Available surfaces
agent-desktop click @e5 --snapshot <snapshot_id> # AX-first click, no cursor move by default
agent-desktop double-click @e3 # AXOpen; physical double-click uses mouse-click --count 2
agent-desktop triple-click @e2 # Physical triple-click uses mouse-click --count 3
agent-desktop right-click @e5 # Right-click; menu returned when verified
agent-desktop type @e2 --snapshot <snapshot_id> "hello" # Headless AX text insertion when supported
agent-desktop set-value @e2 "new value" # Set value directly
agent-desktop clear @e2 # Clear element value
agent-desktop focus @e2 # Set keyboard focus
agent-desktop select @e4 "Option B" # Select dropdown/list option
agent-desktop toggle @e6 # Toggle checkbox/switch
agent-desktop check @e6 # Idempotent check
agent-desktop uncheck @e6 # Idempotent uncheck
agent-desktop expand @e7 # Expand disclosure
agent-desktop collapse @e7 # Collapse disclosure
agent-desktop scroll @e1 --direction down # Scroll element
agent-desktop scroll-to @e8 # Scroll into view
agent-desktop press cmd+c # Key combo
agent-desktop press return --app "App" # Targeted key press
agent-desktop key-down shift # Hold key
agent-desktop key-up shift # Release key
agent-desktop hover @e5 # Explicit cursor movement
agent-desktop hover --xy 500,300 # Cursor to coordinates
agent-desktop drag --from @e1 --to @e5 # Drag between elements
agent-desktop mouse-click --xy 500,300 # Click at coordinates
agent-desktop mouse-move --xy 100,200 # Move cursor
agent-desktop mouse-down --xy 100,200 # Press mouse button
agent-desktop mouse-up --xy 300,400 # Release mouse button
agent-desktop launch "System Settings" # Launch and wait
agent-desktop close-app "TextEdit" # Quit gracefully
agent-desktop close-app "TextEdit" --force # Force kill
agent-desktop list-windows --app "Finder" # List windows
agent-desktop list-apps # List running GUI apps
agent-desktop focus-window --app "Finder" # Bring to front
agent-desktop resize-window --app "App" --width 800 --height 600
agent-desktop move-window --app "App" --x 0 --y 0
agent-desktop minimize --app "App"
agent-desktop maximize --app "App"
agent-desktop restore --app "App"
agent-desktop list-notifications # List all notifications
agent-desktop list-notifications --app "Slack" # Filter by app
agent-desktop list-notifications --text "deploy" --limit 5 # Filter by text
agent-desktop dismiss-notification 1 # Dismiss by index
agent-desktop dismiss-all-notifications # Dismiss all
agent-desktop dismiss-all-notifications --app "Slack" # Dismiss all from app
agent-desktop notification-action 1 "Reply" --expected-app Slack # Click action (with NC reorder guard)
agent-desktop clipboard-get # Read clipboard
agent-desktop clipboard-set "text" # Write to clipboard
agent-desktop clipboard-clear # Clear clipboard
agent-desktop wait 1000 # Pause 1 second
agent-desktop wait --element @e5 --snapshot <snapshot_id> --timeout 5000 # Wait for element
agent-desktop wait --window "Title" # Wait for window
agent-desktop wait --text "Done" --app "App" # Wait for text
agent-desktop wait --menu --app "App" # Wait for menu surface
agent-desktop wait --menu-closed --app "App" # Wait for menu dismissal
agent-desktop wait --notification --app "App" # Wait for new notification
agent-desktop status # Health check
agent-desktop permissions # Check permission
agent-desktop permissions --request # Trigger permission dialog
agent-desktop version --json # Version info
agent-desktop batch '[...]' --stop-on-error # Batch uses the same typed command path as CLI
agent-desktop skills # List bundled skill docs
agent-desktop skills get desktop --full # Load this skill + all references
- Use `--skeleton -i --compact` for dense apps. Drill into regions with `--root @ref`. Full snapshot only for simple apps.
- Use the `-i --compact` flags. They filter to interactive elements and collapse empty wrappers, minimizing tokens.
- Keep the `snapshot_id` for deterministic multi-step use; re-drill the affected region after any UI-changing action. Scoped invalidation keeps other refs intact.
- Prefer refs over coordinates: `click @e5` > `mouse-click --xy 500,300`.
- Use `wait` for async UI. After launch/dialog triggers, wait for expected state.
- Check `permissions` on first use; screenshots also need Screen Recording.
- On failure, read `error.code` and follow `error.suggestion`.
- Use `find` for targeted searches. Faster than any snapshot when you know role/name.
- Use `snapshot --surface menu` for menus, `--surface sheet` for dialogs. Never `--skeleton` for surfaces; they're already focused.
- Use `focus`, `press`, `hover`, `drag`, or `mouse-*` commands only when physical/headed interaction is intended.