ADu2021/diffthinker-multimodal-reasoning

Name: diffthinker-multimodal-reasoning
Author: ADu2021

skills/skillxiv-v0.0.2-claude-opus-4.6/diffthinker-multimodal-reasoning/SKILL.md

npx skillsauth add ADu2021/skillXiv diffthinker-multimodal-reasoning

When to Use This Skill

Sequential visual planning problems (step-by-step manipulation, assembly tasks)
Combinatorial optimization with spatial constraints
Constraint satisfaction problems visualizable as images
Spatial configuration and layout reasoning tasks
Problems where intermediate visual representations clarify the solution path

When NOT to Use This Skill

Pure text reasoning tasks without visual grounding
Tasks requiring guaranteed deterministic outputs
Very large-scale problems (diffusion inference is iterative, not one-shot)
Applications requiring real-time processing (<100ms latency)

Core Innovation

Traditional multimodal reasoning is carried out in text: LLMs decompose visual problems into linguistic steps, then reason through them sequentially. DiffThinker inverts this:

Instead of: Image → Extract facts → Reason in text → Generate image
DiffThinker does: Image → Condition diffusion → Iteratively refine visual plan → Extract answer

This treats reasoning itself as an image-generation process where:

Conditioning: Problem constraints come from the input image
Iteration: Diffusion steps progressively refine the solution
Extraction: The final image encodes the answer (trajectories, configurations, etc.)

Why Diffusion for Reasoning?

Diffusion models possess three properties that benefit multimodal reasoning:

Native parallelism: Multiple solution aspects evolve simultaneously (vs. sequential token generation)
Iterative refinement: Solutions improve gradually with feedback from intermediate states
Controllability: Fine-grained control over output structure via conditioning mechanisms

Architecture Pattern

# DiffThinker reasoning loop structure (illustrative; component interfaces are placeholders)
import torch


class DiffusionReasoningAgent:
    def __init__(self, vision_encoder, diffusion_model, solution_decoder):
        self.encoder = vision_encoder  # Extracts semantic features from the input image
        self.diffusion = diffusion_model  # Performs generative reasoning in image space
        self.decoder = solution_decoder  # Extracts actionable output from the final image

    @torch.no_grad()  # Inference only: no gradients are needed
    def solve(self, input_image, problem_type, num_steps=50, guidance_scale=7.5):
        """Iterative reasoning via diffusion."""
        # Encode problem constraints from the input image
        problem_context = self.encoder(input_image)

        # Initialize with random noise of the same shape (unconstrained solution space).
        # input_image must be a floating-point tensor for randn_like to work.
        x_t = torch.randn_like(input_image)

        # Diffusion loop: iteratively refine the solution
        for step in range(num_steps):
            # Predict a cleaner solution, conditioned on the problem and the current state
            denoised = self.diffusion.denoise_step(
                x_t,
                context=problem_context,
                step=step,
                guidance_scale=guidance_scale,  # Strength of the problem constraint
            )
            # Blend the prediction with the current state for smooth refinement
            x_t = self.diffusion.reverse_step(denoised, x_t, step)

        # Extract the solution from the final image
        return self.decoder(x_t, problem_type)
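To make the loop's control flow concrete, here is a minimal pure-Python sketch with toy stand-ins. `ToyDiffusion`, its linear blend schedule, and the use of a target vector as the "context" are assumptions made purely for illustration; no neural network is involved and this is not the paper's model. It only shows the pattern of starting from noise, predicting a clean estimate at each step, and blending it with the current state.

```python
import random


class ToyDiffusion:
    """Hypothetical stand-in exposing the denoise_step / reverse_step interface."""

    def __init__(self, num_steps):
        self.num_steps = num_steps

    def denoise_step(self, x_t, context, step, guidance_scale=1.0):
        # A real model would predict the clean image with a network.
        # Here the "prediction" is simply the target stored in context.
        return list(context)

    def reverse_step(self, denoised, x_t, step):
        # Blend the current state toward the prediction.
        # The blend weight grows linearly and reaches 1.0 on the final step.
        w = (step + 1) / self.num_steps
        return [(1 - w) * x + w * d for x, d in zip(x_t, denoised)]


def toy_solve(context, num_steps=10, seed=0):
    rng = random.Random(seed)
    diffusion = ToyDiffusion(num_steps)
    x_t = [rng.gauss(0.0, 1.0) for _ in context]  # start from pure noise
    errors = []
    for step in range(num_steps):
        denoised = diffusion.denoise_step(x_t, context, step)
        x_t = diffusion.reverse_step(denoised, x_t, step)
        # Track the largest remaining deviation from the target
        errors.append(max(abs(x - c) for x, c in zip(x_t, context)))
    return x_t, errors


solution, errors = toy_solve([0.0, 1.0, 1.0, 0.0])
print(solution)
print(errors)
```

In this toy the error shrinks at every step and the state lands on the target on the final step, because the blend weight reaches 1.0. A trained model would only approximate that behaviour.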

Application Examples

Sequential Planning (e.g., Rearrange objects from start to goal):

Input: Current scene image + goal image
Output: Series of intermediate configurations showing manipulation steps
Diffusion iteratively computes feasible transition paths
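As an illustration of the extraction step for planning tasks, the sketch below reads an ordered trajectory out of a small grid. It assumes the final image has already been thresholded into a binary grid in which path cells are marked with 1 and the path does not branch; the grid encoding and the helper name `extract_path` are hypothetical and not taken from the paper.

```python
def extract_path(grid, start):
    """Follow marked cells (value 1) from `start`, moving to an unvisited 4-neighbour each step."""
    rows, cols = len(grid), len(grid[0])
    path, visited = [start], {start}
    r, c = start
    while True:
        for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 1 and (nr, nc) not in visited:
                r, c = nr, nc
                path.append((r, c))
                visited.add((r, c))
                break
        else:
            # No unvisited marked neighbour is left: the path ends here
            return path


# Toy "solution image": 1 marks a cell on the planned trajectory
grid = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
]
path = extract_path(grid, (0, 0))
print(path)
```

A real decoder would also need to handle noise, branching, and thicker strokes. This sketch covers only the clean, single-path case.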

Constraint Satisfaction (e.g., Packing, layout problems):

Input: Items to pack + container boundaries
Output: Valid packing arrangement satisfying all constraints
Diffusion naturally enforces geometric feasibility
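Because sampling is stochastic, a decoded arrangement is worth checking before it is used. The sketch below is one way to do that, assuming the decoder returns axis-aligned rectangles as `(x, y, width, height)` tuples. That output format and the helper name `is_valid_packing` are hypothetical.

```python
from itertools import combinations


def is_valid_packing(rects, container_w, container_h):
    """Return True if every rectangle lies inside the container and no two overlap."""
    for x, y, w, h in rects:
        if x < 0 or y < 0 or x + w > container_w or y + h > container_h:
            return False
    for (ax, ay, aw, ah), (bx, by, bw, bh) in combinations(rects, 2):
        # Rectangles overlap when their interiors intersect on both axes.
        # Shared edges are allowed.
        overlap_x = ax < bx + bw and bx < ax + aw
        overlap_y = ay < by + bh and by < ay + ah
        if overlap_x and overlap_y:
            return False
    return True


good = [(0, 0, 2, 2), (2, 0, 2, 2), (0, 2, 4, 1)]  # tiles a 4x3 container exactly
bad = [(0, 0, 3, 2), (2, 0, 2, 2)]  # the two rectangles overlap

print(is_valid_packing(good, 4, 3))
print(is_valid_packing(bad, 5, 2))
```

A check like this can gate a resampling loop: if the decoded arrangement fails, draw another sample with a different seed.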

Spatial Reasoning (e.g., 3D object arrangement):

Input: Objects + spatial relationships
Output: Valid 3D configuration image showing depth and occlusion
Diffusion captures 3D consistency that text reasoning misses

Training Workflow

Data Collection: Gather (problem image, solution image) pairs for your domain
Encoder Training: Learn to extract semantic constraints from problem images
Diffusion Training: Train generative model to denoise solution images conditioned on constraints
Decoder Training: Learn to extract structured output from solution images
End-to-end Fine-tuning: Joint optimization for task-specific performance
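The Diffusion Training step can be illustrated with a standard DDPM-style noise-prediction objective: noise a solution image, then train the model to predict the noise that was added. Whether the paper uses this exact objective is an assumption; its loss, noise schedule, and conditioning are not stated here. In this sketch an image is a flat list of floats, and `predict_noise` is a hypothetical placeholder for the conditioned network.

```python
import math
import random


def add_noise(x0, alpha_bar, rng):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = [a * x + b * e for x, e in zip(x0, eps)]
    return x_t, eps


def noise_prediction_loss(predict_noise, x0, context, alpha_bar, rng):
    """Mean squared error between the true noise and the model's prediction."""
    x_t, eps = add_noise(x0, alpha_bar, rng)
    eps_hat = predict_noise(x_t, context, alpha_bar)
    return sum((e - p) ** 2 for e, p in zip(eps, eps_hat)) / len(eps)


def zero_predictor(x_t, context, alpha_bar):
    # Untrained placeholder: always predicts "no noise"
    return [0.0] * len(x_t)


rng = random.Random(0)
solution_image = [0.0, 1.0, 1.0, 0.0]  # toy solution image
problem_context = [1.0, 0.0, 0.0, 1.0]  # toy encoded problem (unused by the placeholder)
loss = noise_prediction_loss(zero_predictor, solution_image, problem_context, 0.5, rng)
print(loss)
```

Training would minimise this loss over many (problem image, solution image) pairs and noise levels. A predictor that recovers the noise exactly drives the loss to zero.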

Performance Comparison

Empirical results reported in the paper on visual reasoning benchmarks:

| Task | DiffThinker | GPT-5 | Gemini-3-Flash | Improvement vs. GPT-5 |
|------|---|---|---|---|
| Sequential Planning | 94.2% | 30% | 45% | +214% |
| Combinatorial Opt. | 87.5% | 28% | 38% | +212% |
| Constraint Satisfaction | 91.8% | 25% | 42% | +267% |

(Note: Metric definitions are specific to the paper. Improvement is the relative gain over the GPT-5 score.)
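The Improvement column can be checked by hand. Relative gain over a baseline is (new - baseline) / baseline, while the plain ratio new / baseline is the kind of multiple meant by "3x+". The sketch below applies both to the DiffThinker and GPT-5 scores from the table. The helper names are illustrative.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, as a percentage."""
    return (new - baseline) / baseline * 100.0


def ratio(new, baseline):
    """How many times larger `new` is than `baseline`."""
    return new / baseline


# Scores (in %) for DiffThinker and GPT-5, copied from the table above
scores = {
    "Sequential Planning": (94.2, 30.0),
    "Combinatorial Opt.": (87.5, 28.0),
    "Constraint Satisfaction": (91.8, 25.0),
}

for task, (diffthinker, gpt5) in scores.items():
    print(f"{task}: {ratio(diffthinker, gpt5):.2f}x, {relative_gain(diffthinker, gpt5):+.0f}%")
```

For the first row this gives a ratio of about 3.14x, which corresponds to a relative gain of about 214%.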

Trade-offs vs. Text-Based Reasoning

| Aspect | Diffusion | Text-LLM |
|--------|---|---|
| Latency | 50-500ms (iterative) | 100-2000ms (token generation) |
| Determinism | Stochastic | Deterministic only with greedy decoding (temperature 0) |
| Spatial reasoning | Native geometric constraints | Learned from language |
| Interpretability | Visual solution path | Linguistic explanation |
| Scalability | Fixed image resolution | Variable sequence length (bounded by the context window) |

Implementation Considerations

Resolution choice: Higher resolution gives better spatial detail but slower inference
Guidance strength: Balance constraint satisfaction (high guidance) against solution diversity (low guidance)
Diffusion steps: 30-100 are typically sufficient; more steps improve quality but slow inference
Conditioning mechanism: ControlNet-style spatial conditioning is often needed for spatial tasks
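The guidance-strength trade-off is commonly implemented with classifier-free guidance, which extrapolates from an unconditional prediction toward a conditional one. Whether DiffThinker uses this exact rule is an assumption. The sketch only illustrates how a scale such as the `guidance_scale=7.5` in the architecture pattern would act on two predictions, represented here as plain lists.

```python
def apply_guidance(uncond, cond, guidance_scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), element-wise."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 0.5]  # toy prediction without the problem condition
cond = [1.0, 0.5]  # toy prediction with the problem condition

no_guidance = apply_guidance(uncond, cond, 0.0)
plain_cond = apply_guidance(uncond, cond, 1.0)
strong = apply_guidance(uncond, cond, 7.5)

print(no_guidance, plain_cond, strong)
```

A scale of 0 ignores the condition, 1 reproduces the conditional prediction, and larger values push further in the direction the condition suggests. That tends to trade solution diversity for constraint adherence.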

References

Original paper: https://arxiv.org/abs/2512.24165
Related: ControlNet, Spatial Transformer Networks, Visual Question Answering
Baseline comparisons: GPT-5, Gemini-3-Flash, Qwen3-VL-32B (fine-tuned)

2 stars

data-ai

Updated Apr 17, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add ADu2021/skillXiv diffthinker-multimodal-reasoning

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy, container and dependency vulnerability scanner: Clean (95%)
Semgrep, static code analysis for vulnerabilities: Clean (95%)
mcp-scan (Snyk), Model Context Protocol security validation: Clean (95%)
Snyk (dep), open source security scanning: Skipped (50%)
Socket.dev, supply chain security analysis: Skipped (50%)
VirusTotal, multi-engine malware detection: Skipped (50%)
CrowdStrike, advanced threat intelligence: Skipped (50%)
OSV-Scanner, Open Source Vulnerability database check: Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Apr 17, 2026, 5:33 AM · 11.4s · 1 file scanned

SKILL.md

name: diffthinker-multimodal-reasoning
title: DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models
version: 0.0.2
engine: skillxiv-v0.0.2-claude-opus-4.6
license: MIT
url: https://arxiv.org/abs/2512.24165
keywords: [diffusion models, multimodal reasoning, vision reasoning, sequential planning, spatial reasoning, MLLM]
description: Apply diffusion models as native generative agents for vision-centric reasoning tasks (sequential planning, constraint satisfaction, spatial configuration) instead of text-based LLM chains. Achieves 3x+ improvements over GPT-5 and Gemini-3 on visual reasoning. Use when image-to-image generation better captures the reasoning constraints than text-based problem decomposition.

Related Skills

ADu2021/flow-map-trajectory-tilting

testing

Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.

2 · SKILL.md · Updated Apr 17, 2026

ADu2021/flexible-data-mixture-of-experts

testing

Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.

2 · SKILL.md · Updated Apr 17, 2026

ADu2021/flexibility-trap-diffusion-reasoning

data-ai

Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.

2 · SKILL.md · Updated Apr 17, 2026

ADu2021/flex-continuous-agent-evolution

devops

Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures, achieving 23% improvement on reasoning without gradient-based parameter updates or external training.

2 · SKILL.md · Updated Apr 17, 2026

Install manually

# Clone the repo
git clone https://github.com/ADu2021/skillXiv.git

# Create the Claude Code skills folder (global) if it does not exist, then copy the skill into it
mkdir -p ~/.claude/skills
cp -r skillXiv/skills/skillxiv-v0.0.2-claude-opus-4.6/diffthinker-multimodal-reasoning ~/.claude/skills/

Repository: ADu2021/skillXiv (2 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT