skills/chatting-images-introspective-visual/SKILL.md
Apply introspective visual thinking by iteratively 'chatting with images' — using language-guided re-examination of visual content to reason over fine-grained details, spatial relationships, and multi-image comparisons. Use when: 'analyze this image in detail', 'compare these images', 'reason about spatial layout', 'what's different between these screenshots', 'explain the visual relationship', 'trace the visual logic step by step'.
Install this skill globally with one command: `npx skillsauth add ndpvt-web/arxiv-claude-skills chatting-images-introspective-visual`. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to perform iterative, language-guided visual reasoning instead of relying on a single-pass description of an image. Inspired by the "chatting with images" framework from Wu et al. (2026), the technique treats visual analysis as a multi-turn dialogue between linguistic reasoning and visual re-examination. Rather than describing an image once and reasoning purely from that text, Claude iteratively formulates targeted visual questions, re-examines specific image regions with those questions in mind, and refines its understanding — producing tighter coupling between what it sees and what it reasons about.
Standard vision-language reasoning follows a pipeline: encode the image once, convert it to tokens, then reason in text. This loses fine-grained information — once the visual encoding is fixed, details not captured in that initial pass are gone. "Thinking with images" approaches attempt to fix this by calling external tools (cropping, annotating, running code on images), but the resulting visual states are disconnected from the linguistic reasoning context.
Chatting with images takes a different approach: instead of manipulating pixels, it treats visual re-examination as language-guided feature modulation. The model formulates an explicit linguistic query ("What color is the label in the top-right corner?"), then re-examines the relevant image region with that query as a lens. This produces a new visual understanding that is tightly coupled to the reasoning chain. The key insight is that what you're looking for changes what you see — a targeted re-examination guided by a specific question extracts different information than an open-ended first pass.
The ViLaVT model implements this via a dynamic vision encoder that performs joint re-encoding of image regions conditioned on language prompts. For Claude (which doesn't have a trainable vision encoder), we approximate this by structuring the reasoning process as an explicit multi-pass protocol: (1) initial broad examination, (2) formulation of targeted follow-up visual queries based on what's needed, (3) focused re-examination of specific regions guided by those queries, (4) synthesis of findings. This mirrors the paper's two-stage training — first learning to describe (SFT), then learning when and what to re-examine (RL-driven reasoning behaviors).
1. Perform an initial broad scan of the image(s). Describe the overall layout, major elements, and general structure. Do NOT try to answer the user's question yet; this pass is for orientation only.
2. Identify what the question actually requires visually. Decompose the user's question into specific visual sub-questions. For "Are these two UIs the same?", the sub-questions might be: "What elements are in UI-A?", "What elements are in UI-B?", "Do they share the same layout grid?", "Are colors/fonts identical?"
3. Formulate targeted re-examination prompts. For each sub-question, write an explicit directive like: "Focus on the top-left quadrant and list every text label and its approximate position." These prompts serve as the language guidance for visual feature modulation.
4. Re-examine specific image regions with each prompt in mind. Look at the image again, but this time constrained to the specific region and specific visual feature your prompt targets. Record observations with precise spatial language (coordinates, relative positions, directional relationships).
5. Cross-reference observations across regions or images. If comparing multiple images or distant regions, explicitly align your observations: "In Image A, the button is at top-right; in Image B, it's at center-left." Use a structured comparison format.
6. Check for contradictions or gaps. Review your accumulated observations. If any sub-question remains unanswered, or if observations conflict, perform another targeted re-examination pass on the specific area of uncertainty.
7. Synthesize findings into a coherent answer. Combine all observations, explicitly linking each claim to the specific visual evidence gathered during re-examination passes.
8. State confidence and residual uncertainty. Flag any aspects where the image resolution, occlusion, or ambiguity prevented a definitive answer.
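The re-examine, gap-check, and flag-uncertainty steps can be sketched as a small loop. This is only an illustrative sketch, not the paper's method or any real API: `Trace`, `run_protocol`, and the `examine` callable are hypothetical names, and `examine` stands in for whatever actually looks at the image and returns an observation (or `None` when the region cannot be resolved).

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical interface: given a targeted prompt, return an observation,
# or None when the region cannot be resolved (blur, occlusion, ...).
Examiner = Callable[[str], Optional[str]]


@dataclass
class Trace:
    observations: dict = field(default_factory=dict)  # sub-question -> evidence
    unresolved: list = field(default_factory=list)    # residual uncertainty


def run_protocol(sub_questions, examine: Examiner, max_passes: int = 2) -> Trace:
    """Re-examine once per sub-question, retrying unanswered ones."""
    trace = Trace()
    pending = list(sub_questions)
    for _ in range(max_passes):
        still_pending = []
        for question in pending:
            # Turn the sub-question into an explicit re-examination directive.
            prompt = f"Focus only on what is needed to answer: {question}"
            answer = examine(prompt)
            if answer:
                trace.observations[question] = answer
            else:
                still_pending.append(question)
        pending = still_pending
        if not pending:
            break
    # Whatever is still unanswered is reported rather than guessed.
    trace.unresolved = pending
    return trace
```

Calling `run_protocol(["sidebar items", "save button"], examine)` returns a `Trace` whose `observations` back each claim in the final answer and whose `unresolved` list feeds the confidence statement.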
Example 1: Comparing Two UI Screenshots
User: [Provides two screenshots] What changed between version A and version B of this settings page?
Approach:
1. INITIAL SCAN: "Image A shows a settings page with a sidebar on the left,
   a main content area with form fields, and a save button at the bottom.
   Image B has the same general layout."
2. DECOMPOSE INTO SUB-QUESTIONS:
   - Are the sidebar items the same in both?
   - Are the form fields identical (labels, order, types)?
   - Has the save button changed (position, style, text)?
   - Are there any new or removed elements?
3. TARGETED RE-EXAMINATION:
   Prompt: "Examine the sidebar items in Image A, listing each from top to bottom."
   → "Profile, Security, Notifications, Billing, Advanced"
   Prompt: "Examine the sidebar items in Image B, listing each from top to bottom."
   → "Profile, Security, Notifications, Integrations, Billing, Advanced"
   Prompt: "Examine the form field labels in the main content area of Image A."
   → "Display Name, Email, Bio (text area), Timezone (dropdown)"
   Prompt: "Examine the form field labels in the main content area of Image B."
   → "Display Name, Email, Bio (text area), Language (dropdown), Timezone (dropdown)"
   Prompt: "Compare the save button in Image A and Image B: position, style, and text."
   → "Same position, style, and text in both images."
4. CROSS-REFERENCE:
   - Sidebar: Image B added "Integrations" between Notifications and Billing.
   - Form fields: Image B added "Language" dropdown before Timezone.
   - Save button: unchanged in both.
Output:
Two changes between versions:
1. **New sidebar item**: "Integrations" was added between "Notifications" and "Billing."
2. **New form field**: A "Language" dropdown was added above the existing "Timezone" dropdown.
No other visual differences detected in layout, styling, or other elements.
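Once the per-image observations are written down as ordered lists, the cross-reference step becomes mechanical. A minimal sketch, assuming the lists were transcribed by hand from the re-examination passes; `list_changes` is a hypothetical helper, and using Python's standard `difflib` is a choice made here, not something the skill prescribes.

```python
from difflib import SequenceMatcher


def list_changes(before, after):
    """Return (operation, removed_items, added_items) for each differing run."""
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, before[i1:i2], after[j1:j2]))
    return changes


# Observations transcribed from the targeted re-examination passes.
sidebar_a = ["Profile", "Security", "Notifications", "Billing", "Advanced"]
sidebar_b = ["Profile", "Security", "Notifications", "Integrations",
             "Billing", "Advanced"]
fields_a = ["Display Name", "Email", "Bio", "Timezone"]
fields_b = ["Display Name", "Email", "Bio", "Language", "Timezone"]

sidebar_changes = list_changes(sidebar_a, sidebar_b)
field_changes = list_changes(fields_a, fields_b)
```

An ordered diff keeps position information, so "added between Notifications and Billing" can be read off the opcode indices instead of being re-derived from the image.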
Example 2: Spatial Reasoning in a Floor Plan
User: [Provides floor plan image] Can someone in the kitchen see directly into the master bedroom?
Approach:
1. INITIAL SCAN: Identify room locations and overall layout shape.
2. TARGETED RE-EXAMINATION:
   Prompt: "Locate the kitchen: what are its boundaries and door/opening positions?"
   → "Kitchen is in the northwest corner. Open archway faces east into the
   living room. No other openings."
   Prompt: "Locate the master bedroom: what are its boundaries and
   door positions?"
   → "Master bedroom is in the southeast corner. Single door opens north
   into the hallway."
   Prompt: "Trace the line of sight from the kitchen archway toward
   the master bedroom door. What walls or rooms intervene?"
   → "The kitchen archway opens east into the living room. The living room's
   south wall is solid. The hallway to the master bedroom branches south
   from the living room's east end, turns east, then reaches the
   master bedroom door, so there is no straight line from the kitchen."
3. SYNTHESIS: No direct line of sight exists. The kitchen archway faces
   east into the living room, but the master bedroom is accessed via a
   hallway that turns, with intervening walls blocking any direct view.
Output:
No — someone in the kitchen cannot see directly into the master bedroom.
The kitchen opens east into the living room via an archway, but the master
bedroom is accessed through a hallway that turns south and then east from
the living room, with solid walls blocking any direct sightline.
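The sightline check above can be made explicit once the floor plan is transcribed into a coarse grid. The sketch below is an illustration under loud assumptions: the `PLAN` grid is a hypothetical hand transcription (not the user's actual floor plan), the helper names are invented, and Bresenham's line algorithm is a technique swapped in here to approximate a straight sightline between two cells.

```python
WALL = "#"

# Hypothetical transcription: K = kitchen archway, L = living room's east end,
# M = master bedroom door, "." = open floor, "#" = wall.
PLAN = [
    "#########",
    "#K.....L#",
    "######.##",
    "#....#.##",
    "#....#..#",
    "#....##M#",
    "#########",
]


def find(plan, symbol):
    """Return the (row, col) of the first cell containing symbol."""
    for r, row in enumerate(plan):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"{symbol!r} not found in plan")


def cells_on_line(start, end):
    """Bresenham's line: grid cells from start to end, inclusive."""
    (r0, c0), (r1, c1) = start, end
    dr, dc = abs(r1 - r0), abs(c1 - c0)
    step_r = 1 if r1 >= r0 else -1
    step_c = 1 if c1 >= c0 else -1
    err = dc - dr
    cells = [(r0, c0)]
    while (r0, c0) != (r1, c1):
        e2 = 2 * err
        if e2 >= -dr:
            err -= dr
            c0 += step_c
        if e2 <= dc:
            err += dc
            r0 += step_r
        cells.append((r0, c0))
    return cells


def blocking_cells(plan, start, end):
    """Wall cells that a straight sightline from start to end passes through."""
    return [(r, c) for r, c in cells_on_line(start, end) if plan[r][c] == WALL]


kitchen_to_bedroom = blocking_cells(PLAN, find(PLAN, "K"), find(PLAN, "M"))
kitchen_to_living = blocking_cells(PLAN, find(PLAN, "K"), find(PLAN, "L"))
```

A non-empty `kitchen_to_bedroom` list corresponds to the "intervening walls" in the synthesis, and each returned cell points to a region worth re-examining in the image if the answer is disputed.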
Example 3: Multi-Frame Visual Sequence Analysis
User: [Provides 4 sequential screenshots of an animation] What is the trajectory of the red circle across these frames?
Approach:
1. INITIAL SCAN: Four frames showing a canvas with geometric shapes.
2. TARGETED RE-EXAMINATION per frame:
   Frame 1: "Red circle is at approximately (100, 300), bottom-left area."
   Frame 2: "Red circle is at approximately (250, 200), center, moved
   up and right."
   Frame 3: "Red circle is at approximately (400, 150), upper-right area,
   continued same direction."
   Frame 4: "Red circle is at approximately (400, 300), right side,
   dropped back down to original y."
3. CROSS-REFERENCE positions into trajectory:
   (100,300) → (250,200) → (400,150) → (400,300)
   Pattern: moves right and up for frames 1-3, then drops vertically
   in frame 4.
Output:
The red circle rises to the right, then drops straight down:
- Frames 1→3: Diagonal movement up and to the right (bottom-left to
  upper-right), covering ~300px horizontally and ~150px vertically. The
  climb flattens from ~100px (frames 1→2) to ~50px (frames 2→3).
- Frames 3→4: Vertical drop of ~150px with no horizontal movement,
  returning to the starting height.
Overall shape: an inverted L, like a throw-and-drop arc.
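The trajectory summary follows from simple arithmetic on the per-frame positions. A small sketch, assuming screen coordinates where y grows downward (which is what "up" for a decreasing y implies in the example); `displacements` and `describe` are hypothetical helper names.

```python
def displacements(points):
    """Per-step (dx, dy) between consecutive positions."""
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(points, points[1:])]


def describe(dx, dy):
    """Direction label, assuming y increases downward (screen coordinates)."""
    horizontal = "right" if dx > 0 else "left" if dx < 0 else ""
    vertical = "down" if dy > 0 else "up" if dy < 0 else ""
    return " and ".join(p for p in (horizontal, vertical) if p) or "stationary"


# Approximate positions recorded during the per-frame re-examination.
positions = [(100, 300), (250, 200), (400, 150), (400, 300)]
steps = displacements(positions)
labels = [describe(dx, dy) for dx, dy in steps]
```

Here `steps` is `[(150, -100), (150, -50), (0, 150)]`: two moves right and up with a flattening climb, then a purely vertical drop back to the starting height.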
Wu, J., Guan, J., Liu, Q., Wu, S., & Wang, L. (2026). Chatting with Images for Introspective Visual Thinking. arXiv:2602.11073. https://arxiv.org/abs/2602.11073v1
Key takeaway from the paper: Visual reasoning improves substantially when the model iteratively re-examines images guided by its own evolving linguistic reasoning, rather than relying on a single initial encoding. The "chatting with images" framing — treating each re-examination as a language-guided query to the visual input — achieves tighter cross-modal alignment than either text-only reasoning or tool-based image manipulation.