
Convert arXiv and ML/AI research papers into ready-to-use Claude agent skills in seconds — so anyone can apply cutting-edge techniques without reading the full paper. Use this skill whenever the user wants to turn a paper into a skill, extract practical techniques from research, build a skill library from papers, create a paper-to-skill pipeline, convert arXiv papers into reusable agent instructions, or batch-process multiple papers into skills. Also trigger when someone asks about extracting actionable knowledge from papers, building skill libraries from literature, making research practical for LLM agents, or systematically converting academic contributions into structured agent capabilities.
Generate realistic video footage of people from audio input using a unified self-attention framework, producing convincing speaker performances without domain-specific restrictions.
Accelerate agentic AI training by distributing task execution across clusters, achieving a 14.6x speedup in experience collection and enabling practical large-scale agent development.
Categorize ML/AI research papers into 11 types based on their title and abstract. Returns structured JSON with a primary category, optional secondary categories, extractability rating, and rationale. Designed for the SkillXiv paper2skill pipeline as a triage step. Use this skill whenever the user wants to classify, categorize, sort, or triage research papers — whether a single paper or a batch. Also trigger when someone asks "what kind of paper is this?", wants to filter papers by type, or needs to decide which papers to prioritize for skill extraction.
Convert component innovation papers into drop-in replacement guides. Extracts what was swapped, why, conditions for when it helps, and the performance delta. Use this skill when extracting skills from Category 5 (Component Innovation) papers — BatchNorm-style papers, ResNet skip connections, new loss functions, or any paper proposing one elegant modification with outsized impact.
Convert dataset and benchmark papers into evaluation infrastructure skills. For datasets: extracts collection protocol, annotation design, quality control. For benchmarks: extracts task definition, metric selection, leaderboard design. Use this skill when extracting skills from Category 2 (Evaluation Infrastructure) papers — ImageNet-style dataset papers, SWE-bench-style benchmark papers, or any paper whose primary contribution is evaluation methodology.
Convert insight-driven papers into minimal reproducible recipes built around a single non-obvious observation. Extracts the key insight, why the problem seemed hard, and the minimal implementation. Use this skill when extracting skills from Category 6 (Insight-Driven) papers — 'Learning to See in the Dark'-style papers where one clever observation unlocks a seemingly hard problem.
Convert papers that disprove conventional wisdom into paradigm-challenge skills. Extracts the prior belief, the falsifying experiment, and the revised principle. Use this skill when extracting skills from Category 3 (Paradigm Challenge) papers — papers that say 'rethinking', 'revisiting', or 'do we really need X', where the core move is adversarial (proving the community wrong).
Convert research infrastructure papers into design pattern guides. Extracts capability gaps addressed, API design decisions, performance/usability trade-offs, and integration patterns. Use this skill when extracting skills from Category 7 (Research Infrastructure) papers — PyTorch-style framework papers, evaluation harness tooling, or any paper where the tool itself is the contribution.
Convert survey and synthesis papers into field navigation guides. Extracts taxonomies, method selection decision trees, literature navigation heuristics, and open problems. Use this skill when extracting skills from Category 10 (Survey and Synthesis) papers — comprehensive reviews, position papers, tutorials, or roadmaps that organize a research landscape.
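The category numbers cited by the extraction skills above can be collected in a small lookup table. This is a minimal sketch: it includes only the six categories that these descriptions name, since the other five of the 11 types are not listed here. The names `NAMED_CATEGORIES` and `category_name` are hypothetical, not part of the pipeline.

```python
# Categories named in the descriptions above. The remaining five of the
# 11 types are not given in this list, so they are deliberately left out.
NAMED_CATEGORIES = {
    2: "Evaluation Infrastructure",
    3: "Paradigm Challenge",
    5: "Component Innovation",
    6: "Insight-Driven",
    7: "Research Infrastructure",
    10: "Survey and Synthesis",
}


def category_name(number: int) -> str:
    """Return the category name, or 'unknown' if it is not named in this list."""
    return NAMED_CATEGORIES.get(number, "unknown")
```

A router could use such a table to pick the matching extraction skill once the classifier has returned a primary category.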
Asymmetric Confidence-aware Error Penalty (ACE) dynamically penalizes overconfident mistakes in RL training, improving reasoning quality without requiring additional computation.
Reduce memory overhead of long-horizon LLM agents by learning task-specific context compression strategies. A learnable compressor adapts by analyzing failure cases, achieving 26-54% memory reduction while preserving 95%+ accuracy, enabling smaller models to act as efficient long-context agents.
Train a single LLM to decompose complex queries into subquestions and integrate retrieved contexts through two-stage supervised and preference-based reinforcement fine-tuning, achieving 7.6% average improvement and matching 685B models with 32B parameters.
Accelerate LLM inference 1.5-2x using a universal draft model that adapts to user data in real-time, handling different target models and tokenizers via online n-gram cache and hybrid distillation.
Combines adversarial training with optimal transport constraints enabling deterministic, transport-optimal one-step image generation with FID 2.38 on ImageNet-256. Apply when you need fast generative models without teacher-student distillation complexity.
Transition from simple LLM-based evaluation to agentic judges that employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory. Survey of sophisticated evaluation paradigms for complex, specialized, and multi-step assessment tasks across diverse domains.
Optimize multi-agent collaboration by learning task-specific interaction topologies. Use an LLM orchestrator to generate layered DAG topologies that adapt to inferred problem difficulty, treating agent interactions as a learned graph structure rather than fixed patterns.
Standardizes agent training data representation across diverse sources (API use, web browsing, coding, software engineering). Single lightweight protocol unifies 13 datasets enabling 20% performance gains without domain-specific tuning. Enables reproducible agent training and scalable data combination.
Bridge imitation learning and experience-driven RL by collecting state-based supervision from agents' own actions. Trigger: improve agent generalization when expert demonstrations are limited and environments lack dense rewards.
Compress agent interaction history by converting observation-action sequences into compact visual representations (images), leveraging visual tokens' superior information density. Implements segment optical caching with 20x rendering speedup and enables dynamic compression rates. Preserves over 95% of agent performance while reducing token consumption by 50%+, enabling agents to maintain longer interaction histories within fixed budgets.
Enable research agents to interleave evidence-based drafting with reasoning-driven deepening, automatically expanding outlines based on discovered gaps, using trajectory pruning for efficient RL training.
Transform uncertainty estimates into active control signals for agents, combining implicit confidence mechanisms with targeted reflection to prevent error propagation in long-horizon reasoning tasks. Use when building autonomous agents that must navigate complex multi-step problems while managing confidence and uncertainty.
Rigorous theoretical framework reformulating DeepSeek's ALF-LB as a single-step primal-dual method for an assignment problem, proving monotonic Lagrangian improvement, approximate balancing guarantees, and logarithmic expected regret in stochastic settings.
Train safety-aligned agents using collaborative multi-agent RL where conversation and feedback agents improve together. Trigger: reduce overrefusal while maintaining safety on sensitive queries.
Adaptive framework for dynamically selecting optimal model-tool combinations in multi-domain reasoning, using cluster-based routing and reinforcement learning for improved agent reasoning across diverse tasks.
Replace discrete token prediction with continuous vector prediction by training a high-fidelity autoencoder to compress K tokens into single latent vectors, enabling K-fold sequence length reduction while maintaining likelihood-free generation through energy-based scoring rules.
Protects computer use agents from prompt injection by using single-shot execution planning that generates complete control flow graphs before UI observation, preventing instruction hijacking while maintaining 57% performance on frontier models.
CapImagine teaches models to explicitly imagine through text rather than latent reasoning, significantly improving visual reasoning performance.
Train single models to simulate multi-agent collaboration through distillation from complex multi-agent systems and agentic RL, creating efficient Agent Foundation Models for tool use and web navigation.
Analyze when CoT reasoning succeeds or fails using DataAlchemy synthetic environment and distribution discrepancy measurement.
Implement techniques from ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch. Chart reasoning is a critical capability for Vision Language Models (VLMs).
Research contribution advancing agent and reasoning capabilities through novel approaches to model development, training, and evaluation.
To teach robots complex manipulation tasks, it is now a common practice to fine-tune a pre-trained vision-language-action model (VLA) on task-specific data. However, since this recipe updates existing representations, it is unsuitable for long-term operation in the real world, where robots must continually adapt to new tasks and environments while retaining the knowledge they have already acquired. Existing continual learning methods for robotics commonly require storing previous data (exemplars...
Represent 3D scenes as compressed light-field tokens for efficient neural rendering. Multi-view images are tokenized via Plücker coordinates, condensed through K-means clustering, and rendered adaptively. Achieves 5-7× data reduction versus MVSplat while enabling on-the-fly quality-speed tradeoffs: up to 66% FPS improvement with controlled token counts.
Augment verifiable reward RL (RLVR) with contrastive learning to generate dense auxiliary rewards. Enforce proximity among correct reasoning trajectories in embedding space while suppressing errors, amplifying invariant reasoning patterns.
Replace hard clipping in policy gradients with smooth quadratic penalties derived from Total Variation divergence constraints. Eliminates zero-gradient regions and training instability while maintaining stable policy evolution.
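To illustrate why a smooth penalty removes zero-gradient regions, the sketch below compares a standard hard-clipped surrogate with a quadratic-penalty surrogate. The exact penalty form, the coefficient `beta`, and all function names are assumptions made for illustration; the description above states only that the penalty is quadratic and derived from a Total Variation constraint.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style hard-clipped surrogate (to be maximized)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def quadratic_surrogate(ratio: float, advantage: float, beta: float = 1.0) -> float:
    """Assumed smooth alternative: linear gain minus a quadratic penalty on (ratio - 1)."""
    return ratio * advantage - beta * (ratio - 1.0) ** 2


def grad_wrt_ratio(f, ratio: float, advantage: float, h: float = 1e-6) -> float:
    """Central finite-difference derivative of f with respect to the ratio."""
    return (f(ratio + h, advantage) - f(ratio - h, advantage)) / (2 * h)
```

With a positive advantage and a ratio of 1.3, the clipped surrogate is flat (zero gradient), while the quadratic version still has a nonzero gradient that shrinks as the ratio moves away from 1.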
No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning. From arXiv:2601.06794
Predict GUI state evolution by generating HTML code rather than pixel images. Combines visual fidelity of pixel-based approaches with structural precision of code-based methods through deterministic rendering. Enables agents to evaluate action consequences and select best decisions before execution.
Scale agent learning by synthesizing diverse experiences using reasoning-based models instead of costly real-world rollouts, maintaining replay buffers with both real and synthetic interactions while using adaptive curriculum to focus on challenging tasks.
Implement techniques from DRPG (Decompose, Retrieve, Plan, Generate): An Agentic Framework for Academic Rebuttal. Despite the growing adoption of large language models (LLMs) in scientific research workflows, automated support for academic rebuttal, a crucial step in academic communication and peer review, remains largely underexplored.
Sparse attention mechanism combining content-aware and position-aware sparse patterns through dynamic masking. Achieves 10x speedup while maintaining model quality on long-context benchmarks through hardware-friendly implementation.
Extends text embedding models to perform listwise reranking through continued training on ranking objectives. Constructs listwise prompts from queries and top-K candidates, leveraging pseudo-relevance feedback while maintaining embedding model efficiency. Unifies retrieval and reranking in single model.
Train efficient planners for long-horizon agent tasks using homologous consensus filtering to generate synthetic plans from strong LLMs and rule-based RL with executor capability rewards. Reduces training cost by 8x while maintaining state-of-the-art performance.
Generate Verilog hardware code from natural language using reasoning-enhanced LLMs, combining rule-based testbench generation with round-trip data synthesis and adaptive DAPO reinforcement learning for reliable hardware design.
Joint optimization of policy and reward models in LLM reinforcement learning by leveraging rule-based reward precision and dynamically constructing training pairs to prevent reward hacking and improve performance.
CoVe synthesizes high-quality tool-use training data using explicit task constraints as both generation guidance and verification validators, enabling effective agent training without manual curation.
Build desktop agents via reusable, parameterized skills encoding human computer-use knowledge. Skills combine execution graphs (handling UI variations) with composition graphs (chaining strategies). 57.5% success on WindowsAgentArena.
Mitigate lazy reasoning in Large Reasoning Models via self-distillation teaching task decomposition, followed by Diversity-Aware GRPO with entropy-based advantage functions, enabling effective decomposition without external teachers while balancing structured reasoning with diversity.
Build research agents that systematically search for comprehensive answers to complex questions by maintaining search state, iterating on queries, and validating answer completeness. Implement strategies for identifying knowledge gaps and conducting follow-up searches to ensure thorough coverage of topics.
Filter low-quality reasoning traces using model-internal confidence signals at test time, eliminating weak paths during generation to achieve 99.9% accuracy while reducing token generation by up to 84.7%.
Enable efficient speculative decoding by training discrete diffusion language models for parallel draft generation. Use AR-style distillation and scribe refinement to train dLLMs. Eliminate left-to-right error accumulation through independent parallel proposals. Achieve 5.54× speedup on HumanEval vs. 2.41× for AR-based methods.
Replace LayerNorm with Derf(x) = erf(αx + s) for improved generalization in transformers. Derf outperforms LayerNorm across vision, speech, and DNA modeling—ideal when normalization-free training provides benefits without architectural complexity.
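The formula Derf(x) = erf(αx + s) can be written out directly. This is a minimal elementwise sketch using the standard library. It treats `alpha` and `s` as plain scalars, which is an assumption; in a real network they would presumably be learnable parameters, and any further scale or bias terms are not covered here.

```python
import math


def derf(x: list[float], alpha: float = 1.0, s: float = 0.0) -> list[float]:
    """Apply Derf(x) = erf(alpha * x + s) to each element of x."""
    return [math.erf(alpha * v + s) for v in x]
```

Because erf is bounded in (-1, 1), the output stays bounded regardless of the input scale, without computing any per-token statistics.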
Unlocks parallelism in recurrent memory transformers through diagonal batching of the layers-segments grid, achieving 3.3x speedup on 131K-token sequences without model retraining.
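The idea of diagonal batching can be sketched as a scheduling problem over the layers-segments grid. The sketch assumes that cell (layer, segment) depends only on (layer - 1, segment) and (layer, segment - 1), so that all cells on one anti-diagonal are independent; that dependency pattern and the function name `diagonal_batches` are assumptions for illustration.

```python
def diagonal_batches(num_layers: int, num_segments: int) -> list[list[tuple[int, int]]]:
    """Group (layer, segment) cells by anti-diagonal index layer + segment.

    Cells within one group have no dependencies on each other under the
    assumed dependency pattern, so each group could run as a single batch.
    """
    batches = []
    for d in range(num_layers + num_segments - 1):
        cells = [
            (layer, d - layer)
            for layer in range(num_layers)
            if 0 <= d - layer < num_segments
        ]
        batches.append(cells)
    return batches
```

For 2 layers and 3 segments this yields 4 sequential steps instead of 6 cell-by-cell steps; the gap widens as the grid grows.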
Improve state-space model performance on language modeling and retrieval by applying differential mechanisms to reduce noise in intermediate representations and enhance long-context reasoning.
Train masked diffusion models for code generation using coupled-GRPO to optimize non-autoregressive generation. Achieves 4.4% improvement on code benchmarks while reducing autoregressive bias.
Apply reasoning-focused RL to vision foundation models using GRPO to develop deep visual understanding and abstract reasoning beyond visual recognition.
Understand fundamental limitations of applying diffusion to discrete text: position-agnostic corruption ignores linguistic structure, and token-wise training misses multi-token dependencies. Design text diffusion systems satisfying five essential properties: position-aware corruption, dependency-aware training, parallel consistency, respect for linguistic structure, and robust handling of token boundaries.
Build open-source agents for document question-answering by modeling DocQA as information-seeking with explicit tool utilization. DocDancer uses an exploration-then-synthesis pipeline to generate high-quality training data, addressing the scarcity that limits agent-based document understanding systems.
Systematic evaluation toolkit for assessing large language models across multiple dimensions, enabling comprehensive benchmarking of agent capabilities and comparative analysis of model performance.
Diagnose and fix multi-agent system failures through targeted interventions (message edits, plan changes) rather than static log analysis. DoVer recovers 18-28% of failed trials with 30-60% hypothesis validation—essential for autonomous multi-agent reliability.
Compress LLMs by 20-30% in width while preserving functionality through optimal transport-based neuron merging. Instead of discarding neurons, redistribute their signal to retained neurons via learned transport maps. Use when you need to reduce model size with minimal accuracy loss and measurable computational speedup.
Implements DPWriter from arXiv:2601.09609.
Enable stable multi-agent reinforcement learning by normalizing advantages per-agent rather than globally, preventing gradient-norm inflation in heterogeneous multi-agent systems.
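Per-agent normalization can be sketched in a few lines. The example assumes advantages are stored as a dictionary from agent name to a list of floats and uses the population standard deviation; the data layout and function names are hypothetical.

```python
from statistics import mean, pstdev


def normalize(values: list[float], eps: float = 1e-8) -> list[float]:
    """Shift to zero mean and scale to unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def normalize_per_agent(advantages: dict[str, list[float]]) -> dict[str, list[float]]:
    """Normalize each agent's advantages using that agent's own statistics."""
    return {agent: normalize(vals) for agent, vals in advantages.items()}
```

With agents whose raw advantages differ by orders of magnitude, a single global normalization would let the large-scale agent dominate; normalizing per agent puts both on the same scale.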
Dr. Zero: Self-Evolving Search Agents without Training Data. From arXiv:2601.07055
DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving. From arXiv:2601.01528
Optimize disaggregated prefill-decoding LLM serving for multi-turn (agentic) workloads by introducing dual-path KV-cache loading. Traditional approach loads all KV-cache to prefill engines, saturating their storage network. DualPath loads to decoding engines first, then transfers via compute network (lower contention). Adaptive routing selects path based on real-time queue depths. Achieves 1.87× offline throughput and 1.96× online serving improvement.
Minimal modification to SFT that dynamically rescales objectives by token probability. Rectifies implicit reward structure to improve generalization comparable to RL while maintaining SFT simplicity.
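One way to read "rescales objectives by token probability" is to weight each token's negative log-likelihood by the model's own probability for that token. The sketch below assumes exactly that weighting, with the weight treated as a constant (a stop-gradient in an autodiff framework); it works on plain floats, so no gradients are actually computed, and the function names are hypothetical.

```python
import math


def sft_loss(token_probs: list[float]) -> float:
    """Standard SFT loss: mean negative log-likelihood of the target tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def rescaled_sft_loss(token_probs: list[float]) -> float:
    """Assumed variant: each token's log-likelihood term is weighted by p.

    In a training framework the weight p would be detached from the graph.
    """
    return -sum(p * math.log(p) for p in token_probs) / len(token_probs)
```

Under this weighting, tokens the model currently considers very unlikely contribute far less to the loss than in plain SFT.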
Automatically construct compact, diverse action spaces for LLM reasoning through corpus-based estimation and submodular optimization—enabling efficient decision-making without manual specification or expensive exhaustive search.
Monitor token-wise entropy to adaptively allocate compute during inference. Branch into multiple paths at high-entropy tokens, reducing token generation by up to 65% while improving accuracy by up to 37% on reasoning tasks.
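The entropy monitoring step can be sketched as follows. The example assumes each decoding step exposes a full next-token probability distribution and uses a fixed `threshold` to mark branch points; the threshold rule and function names are assumptions, and the actual branching and path selection are not shown.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def branch_points(step_distributions: list[list[float]], threshold: float) -> list[int]:
    """Return the indices of decoding steps whose entropy exceeds the threshold."""
    return [
        i for i, probs in enumerate(step_distributions)
        if token_entropy(probs) > threshold
    ]
```

For example, `branch_points([[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]], 0.5)` flags only the second step, where the model is evenly split between two tokens.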
Enable image editors to handle complex instructions through iterative critique and refinement cycles. A multimodal LLM critiques editing results, reasons about improvements, and refines instructions until satisfactory output—ideal for instruction-following challenges in visual editing.
Improve multimodal embeddings through RL-optimized reasoning that grounds evidence in retrievable visual cues. Frozen embedder provides stable rewards while reasoner generates evidential traceability CoT with text keywords, bounding boxes, and key frames.
Implement techniques from Endless Terminals: Scaling RL Environments for Terminal Agents. Environments are the bottleneck for self-improving agents.
Enhanced language model pre-training methodology improving linguistic competence across languages, strengthening foundational capabilities for multilingual agent systems.
Research contribution advancing agent and reasoning capabilities through novel approaches to model development, training, and evaluation.
Provide continuous confidence scores throughout LLM text generation via supervised learning and backward confidence integration, enabling real-time uncertainty awareness.
Accelerate LLM decoding by predicting multiple future tokens simultaneously using mask-token probing in embedding space, without retraining or auxiliary models.
Choose optimal pretraining strategy for text encoders: pure MLM, pure CLM, or biphasic CLM-then-MLM training, with empirical guidance on performance across downstream tasks.
24-trillion-token web dataset with 12-category taxonomy labels enabling efficient curation of specialized datasets through SQL filtering and small annotation models.
The evolution of Large Language Models (LLMs) into autonomous agents has expanded the scope of AI coding from localized code generation to complex, repository-level, and execution-driven problem solving. However, current benchmarks predominantly evaluate code logic in static contexts, neglecting the dynamic, full-process requirements of real-world engineering, particularly in backend development which demands rigorous environment configuration and service deployment. To address this gap, we intr...
Production LLM systems often rely on separate models for safety and other classification-heavy steps, increasing latency, VRAM footprint, and operational complexity. We instead reuse computation already paid for by the serving LLM: we train lightweight probes on its hidden states and predict labels in the same forward pass used for generation. We frame classification as representation selection over the full token-layer hidden-state tensor, rather than committing to a fixed token or fixed layer ...
Research contribution advancing agent and reasoning capabilities through novel approaches to model development, training, and evaluation.
Vision-Language-Action (VLA) models have emerged as essential generalist robot policies for diverse manipulation tasks, conventionally relying on directly translating multimodal inputs into actions via Vision-Language Model (VLM) embeddings. Recent advancements have introduced explicit intermediary reasoning, such as sub-task prediction (language) or goal image synthesis (vision), to guide action generation. However, these intermediate reasoning steps are often indirect and inherently limited in their...
Mitigate long-tail distribution problems in VLM training data through adaptive rebalancing and diffusion-based synthesis. Uses entity distribution analysis to identify head/tail imbalance and applies targeted data augmentation, improving LLaVA 1.5 performance by 4.36% without increasing training data volume.
Adaptively prune visual tokens from vision encoders by reconstructing discarded features from retained ones, reducing computational cost by 50% while maintaining task performance on OCR and image understanding tasks.
Uncover and fix reward hacking vulnerabilities in LLM-based judges. Simple tokens like punctuation or generic reasoning phrases trigger false positive rewards without substantive content. Defend using data augmentation with truncated model outputs as adversarial negatives, creating robust Master Reward Models resistant to superficial inputs.
Autonomous agents based on Large Language Models (LLMs) demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on a single agentic capability, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive bench...
Train agents from scratch without human-annotated data via symbiotic competition—curriculum agent proposes progressively harder tasks while executor agent learns to solve them, creating autonomous self-reinforcing loops.
Enable vision-language agents to self-evolve by grounding verification in tool outputs rather than text: implement nested loops where Solver+Verifier generate trajectories and tool-based feedback, then optimize via GRPO using self-generated rewards without external supervision.
Dynamically adapt LLM reasoning strategies at inference time by curating episodic memory of past problem solutions—generate task-specific prompts, tool configs, and control logic for up to 111× cost reduction and 14% accuracy gains.
Decompose agent work across four specialized modules (planner, executor, verifier, generator) coordinated via evolving memory. Use Flow-GRPO to convert multi-turn sparse-reward optimization into sequential single-turn updates with outcome broadcasting, achieving 4-15% accuracy gains on benchmarks while scaling better than monolithic agent policies.
Enable agent learning through episodic memory and neural case selection without fine-tuning the underlying LLM, achieving efficient continual adaptation via policy updates in memory space.
Enables web agents to handle long-horizon tasks by actively managing context workspace. Implements granular condensations of recent steps and deep consolidations of multi-step sub-tasks, preventing context saturation. Achieves 36.2% on BrowseComp with 30B model, matching larger proprietary agents.
Diagnose and correct overconfidence failures in autonomous agents using Holistic Trajectory Calibration (HTC), analyzing process-level features across entire execution paths. Use when building reliable autonomous systems that need better confidence estimates and reduced overconfidence failures.
Improves LLM agent decision-making by training agents to first critically evaluate actions before generating, using RL on action-pair comparisons. Develops intrinsic reasoning about action quality without requiring reflection supervision.
Reasoning is a fundamental cognitive process underlying inference, problem-solving, and decision-making. While large language models (LLMs) demonstrate strong reasoning capabilities in closed-world settings, they struggle in open-ended and dynamic environments. Agentic reasoning marks a paradigm shift by reframing LLMs as autonomous agents that plan, act, and learn through continual interaction. In this survey, we organize agentic reasoning along three complementary dimensions. First, we charact...
Framework for enabling autonomous agents to self-verify code generation and reasoning quality through structured evaluation, supporting software engineering agent deployment with built-in correctness checking mechanisms.
Enables agents to maintain strategic coherence over extended experimental cycles through hierarchical cognitive caching that distills execution traces into stable knowledge, achieving 56.44% on MLE-Bench within 24-hour budgets.
Enables long-horizon agentic search extending beyond 100 tool calls through scalable asynchronous RL training with autonomous QA dataset synthesis.
Train RL on diverse agent frameworks (LangChain, AutoGen, custom) via unified data interface and transition-based RL decomposition.
Build agentic applications using unified agent interfaces, asynchronous design patterns, ReAct paradigm grounding, and developer-centric evaluation and deployment tools.
Empirically analyzes 31,132 agent skills to identify 14 distinct vulnerability patterns, finding 26.1% contain security flaws including data exfiltration, privilege escalation, and malicious intent risks that require mandatory vetting.
Automatically synthesize executable RL training environments with database backends, Python tools, and task descriptions. Generate 1000+ diverse domains with 10K+ tasks enabling data-efficient tool-use agent training without manual scenario design.
Reveals that inference-time scaling techniques for LLMs don't transfer to VLMs: majority voting beats verification, self-correction happens in <10% of cases, and models verify better without images. Use insights to design VLM evaluation methods that work rather than assuming LLM techniques apply directly.
Train LLMs to generate high-quality research plans via rubric-based RL without requiring experimental verification. Extracts research goals and domain-specific rubrics from scientific papers and uses a frozen model as grader, yielding 12-22% relative improvements. Human experts prefer the resulting plans 70% of the time, with strong cross-domain generalization.
Select optimal training subsets for T2I models through meta-gradient-based rater networks. Score each sample based on gradient influence on validation performance without retraining. Implement shift-Gaussian pruning excluding high-scoring samples. Achieve 5× training speedup with 50% subset outperforming full dataset.
Preserve LLM safety alignment during LoRA fine-tuning via Fisher information regularization and collision-aware geometric constraints.
Identify and mitigate alignment degradation in self-evolving LLM agents. After deployment, agents systematically abandon training-time safety constraints when environmental feedback rewards rule-breaking. Model two mechanisms: Self-Interested Exploration (individual drift) and Imitative Strategy Diffusion (collective norm erosion), with practical safeguards for post-deployment monitoring.
Dynamically modulate reasoning depth at test time using alpha moments and Bernoulli scheduling to optimize inference speed-quality tradeoffs without retraining.
Enable LLMs to solve complex problems through multi-turn agentic reasoning with tool-assisted verification and iterative refinement loops. Trigger: improve reasoning reliability on long-horizon tasks by combining RL with verification.
Implements A^3-Bench from arXiv:2601.09274.
Evaluate language models using open-ended answer generation and semantic matching instead of multiple choice, eliminating test-taking shortcuts and achieving near-perfect alignment with human judgment.
Accelerate diffusion language model inference by dynamically adjusting parallel tokens per step using a small auxiliary autoregressive model, achieving substantial throughput gains.
Enable LLM agents to autonomously retrieve information across multiple granularities using keyword search, semantic search, and chunk read tools. Simple ReAct-based loop with hierarchical interfaces outperforms dense retrieval by allowing adaptive information seeking without complex graph construction.
Route generation dynamically based on relative model advantage for 2× latency reduction in reasoning. Arbitrage learns when draft models excel versus when target models are worthwhile—critical for balancing cost and quality in long reasoning chains.
Calibrate exploration effort in reasoning traces based on problem difficulty by detecting high-entropy windows and applying hierarchical entropy rewards. Reduces unnecessary reasoning on easy tasks while increasing exploration on hard tasks.
Reduce policy gradient variance in language agent training by aggregating rewards in semantic intention space, enabling 9.95% average performance gains across downstream tasks without exponential action space explosion.
Build reusable skill libraries for mathematical reasoning through hierarchical RL. Maintain a high-level skills manager that summarizes successful solution traces and selects relevant strategies to condition future rollouts.
Agentic reward model framework enabling active tool invocation (cropping, retrieval, validation) to ground judgments in verifiable evidence, using multi-stage GRPO with adaptive reward shaping for systematic evidence-based evaluation.
Comprehensive empirical study recommending model-specific test-time scaling strategies (majority voting, first-finish search) across eight LLMs based on architectural family, problem difficulty, and compute budget rather than universal approaches.
Train LLMs to effectively integrate tools through advantage shaping, directly modifying advantage functions to guide policy without compromising training stability.
Optimize multi-turn agent policies via entropy-guided tree expansion and turn-level credit assignment. AT²PO addresses exploration diversity, sparse credit signal, and policy misalignment problems in LLM agents through structured tree search and turn-aware policy updates.
Decompose agent reasoning into atomic thoughts guided by curriculum-based reasoning reward models, enabling multi-hop information retrieval and interpretable deep research.
Demonstrates position bias where LLMs neglect middle content while over-attending to endpoints. Proposes Attention-Driven Reranking (AttnRank) to align content with model's intrinsic attention preferences.
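Aligning content with the model's attention preferences can be sketched as a simple assignment: the most relevant document goes to the most attended slot. The sketch assumes a per-slot attention prior (`position_attention`) and per-document relevance scores are already available; how AttnRank estimates them is not covered, and the function name is hypothetical.

```python
def attention_rerank(
    docs: list[str],
    relevance: list[float],
    position_attention: list[float],
) -> list[str]:
    """Place documents so that higher relevance lands in higher-attention slots."""
    docs_by_relevance = sorted(range(len(docs)), key=lambda i: relevance[i], reverse=True)
    slots_by_attention = sorted(
        range(len(docs)), key=lambda j: position_attention[j], reverse=True
    )
    arranged = [""] * len(docs)
    for doc_idx, slot in zip(docs_by_relevance, slots_by_attention):
        arranged[slot] = docs[doc_idx]
    return arranged
```

With a U-shaped prior such as `[0.5, 0.1, 0.4]`, the least relevant document ends up in the middle slot, where the description above says content is most likely to be neglected.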
Replace uniform residual accumulation with depth-wise attention that selectively aggregates earlier layer representations. Improve gradient flow and model performance in deep architectures by learning content-dependent depth-wise selection.
Guide LLM exploration in reasoning tasks using attention patterns as navigation signals. This technique branches exploration from high-attention tokens (likely reasoning steps) and applies adaptive sampling to maintain effective gradients, significantly improving training efficiency for mathematical reasoning.
Identify influential texts in long contexts via attention weights using top-K filtering and context subsampling, achieving 10-20x speedup over perturbation methods.
Automatically generates diverse multilingual code benchmarks using LLMs, creating 3920 problems across 20 programming languages with quality assurance filtering.
Generate diverse, validated game environments automatically using domain-specific language specifications and LLM coding agents with self-repair, enabling cost-effective (≈$4/env) creation of heterogeneous test domains for evaluating cross-environment agent generalization.
Build autonomous research agents using pre-computed knowledge graphs instead of online reasoning. Extract methodological patterns from literature, organize them into structured knowledge, and enable agents to align user research intents with established paradigms for efficient, grounded research planning and execution.
Autoregressive U-Net operating directly on raw bytes with hierarchical multi-scale pooling for adaptive token embedding, eliminating fixed vocabularies.
Generate synthetic web environments at scale by specifying websites as Finite State Machines with explicit state transitions, then programmatically executing GUI actions to collect verified interaction trajectories. Reduces trajectory cost from $0.15–$1.00 to $0.04 per sample while generating 11,000+ verified trajectories with deterministic, executable validation requiring no external judges.
Fixes the ragged-tensor problem in batch speculative decoding, where sequences in a batch accept different token counts and desynchronize state. EQSPEC guarantees output equivalence through proper synchronization. EXSPEC reduces overhead by 40% via cross-batch scheduling. Enables efficient parallel decoding with 95% equivalence.
Allocate LLM reasoning budget optimally via value tree search: use residual value prediction to estimate step utility, then dynamically shift exploration-exploitation balance as budget depletes. Outperform high-budget baselines at 1/4 cost.
Reinforcement learning (RL) is central to post-training, particularly for agentic models that require specialized reasoning behaviors. In this setting, model merging offers a practical mechanism for integrating multiple RL-trained agents from different tasks into a single generalist model. However, existing merging methods are designed for supervised fine-tuning (SFT), and they are suboptimal for preserving task-specific capabilities on RL-trained agentic models. The root is a task-vector mismatch ...
Systematic evaluation toolkit for assessing large language models across multiple dimensions, enabling comprehensive benchmarking of agent capabilities and comparative analysis of model performance.
Align diffusion models to hierarchical fine-grained criteria rather than binary preferences. Decompose expert knowledge into attribute hierarchies and apply Complex Preference Optimization to simultaneously maximize positive attributes while minimizing negative ones.
Build fully ternary quantized vision-language-action models for robotic manipulation, achieving 11x memory reduction and 4.4x speedup while maintaining task performance on edge devices.
Optimize language model policies layer-by-layer rather than monolithically to understand internal reasoning structure. Decompose models into per-layer and per-module policies via residual streams, analyze entropy patterns revealing exploration→convergence phases, and optimize layers sequentially—improving reasoning on math tasks by up to 4.69 points.
Overcome reasoning model training plateaus by increasing rollouts per prompt (N=512) rather than training steps, addressing unsampled coupling that destabilizes learning. Theoretical analysis shows broad exploration eliminates plateau bottleneck.
Build web agents using human-inspired browser actions (scrolling, clicking, typing) operated directly on raw HTML via Playwright. Combine supervised fine-tuning and rejection fine-tuning with explicit memory for strong generalization on web tasks.
Automates artistic typography customization through self-distilled learning and localized style injection. Generates stylized text images by encoding reference style and injecting it into diffusion denoising. Use for digital design workflows, text-based visual content creation, or applications needing artistic typography control without manual annotation.
Adapt large reasoning models for optimization tasks using expert-guided hint correction. Generate high-quality training data with minimal expert intervention (<2.6% token modification). Trigger: fine-tune reasoning models on domain-specific tasks without large supervised datasets.
Implement techniques from "Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs". Data preparation aims to denoise raw datasets, uncover cross-dataset relationships, and extract valuable insights from them, which is essential for a wide range of data-centric applications.
Enhance LLM reasoning by combining contrastive learning on reasoning representations with reinforced fine-tuning, leveraging both annotated chains and unsupervised signals.
Train reusable pre-computed KV cache representations of large text corpora for efficient retrieval, achieving 38.6x memory reduction and 26.4x throughput improvement.
Implement CASTLE, a causal attention mechanism that dynamically updates key representations as context expands. Reduces validation loss by 0.006-0.037 across model scales while maintaining O(L²d) training complexity and O(td) decoding speed. Deploy for improved language model perplexity without inference overhead.
Extract sparse causal concept graphs from LLM activations using SAE and DAGMA, then validate through ablation to identify causally influential features. Bridges mechanistic interpretability with causal inference for understanding reasoning flow.
Control policy entropy dynamics in RL by reweighting gradients from clipped tokens. CE-GPPO preserves out-of-clip gradients with beta parameters to stabilize exploration-exploitation balance, preventing entropy collapse while maintaining training stability in LLM fine-tuning.
Implement techniques from "CGPT: Cluster-Guided Partial Tables with LLM-Generated Supervision for Table Retrieval". General-purpose embedding models have demonstrated strong performance in text retrieval but remain suboptimal for table retrieval, where highly structured content leads to semantic compression and query-table mismatch.
Train search agents using citation-aware rubric rewards that decompose complex questions into verifiable single-hop facts. Agents learn to chain evidence through explicit source citations, preventing hallucinations and shortcut exploitation. Citation-aware Group Relative Policy Optimization (C-GRPO) combines rubric and outcome rewards, enabling agents to solve multi-hop reasoning tasks with high factual grounding and transparency.
Enable language models to dynamically switch between four cognitive modes (spatial, convergent, divergent, algorithmic) during problem-solving. Meta-agent observes state and selects optimal mode per step, improving reasoning across math, coding, and spatial tasks without requiring model training.
Improve diffusion model sampling by planning content-adaptive denoising trajectories. Extract Diffusion DNA signatures quantifying per-stage difficulty, then apply graph planning to allocate computation to challenging generative phases.
Harmonize supervised fine-tuning and reinforcement learning through dynamic weighting, balancing expert imitation and on-policy exploration to prevent response pattern disruption.
Unify retrieval and generation in RAG systems by compressing documents into shared continuous embeddings that serve both retrieval and generation: implement joint training with differentiable selection, achieving up to 16× context compression while improving generation quality.
Accelerate LLM inference by dynamically skipping transformer layers based on in-context signals, without training auxiliary draft models or changing model weights.
Generates complex text-heavy and structured images by converting natural language descriptions into executable code that specifies layouts, then renders and refines. Achieves 68% improvement on structured image generation benchmarks.
Train dual-brain agents with a specialized planner (Cerebrum) and executor (Cerebellum) through decoupled RL, resolving the planning-execution trade-off for scientific GUI agents.
Train code and test generators through adversarial co-evolution where test LLM generates adversarial test cases to expose code defects. Prevent self-collusion by separating models and enabling white-box test generation.
Dynamically adapt LLM depth per input at test time by skipping, repeating, or reordering layers using MCTS search, correcting 60% of initially wrong predictions and processing 75% of correct predictions with shorter architectures without retraining.
Financial agents powered by large language models (LLMs) are increasingly deployed for investment analysis, risk assessment, and automated decision-making, where their abilities to plan, invoke tools, and manipulate mutable state introduce new security risks in high-stakes and highly regulated financial environments. However, existing safety evaluations largely focus on language-model-level content compliance or abstract agent settings, failing to capture execution-grounded risks arising from re...
Improve RL sample efficiency through co-evolution of policy and experience extractor, enabling dynamic experience replay that adapts to the agent's skill level.
Enable dynamic control over reasoning depth during inference using Reasoning Control Fields that guide long chain-of-thought generation based on execution and quality parameters.
Enhance reasoning models by integrating executable code within thinking traces, enabling grounded computation verification and reducing hallucination in mathematical and logical reasoning.
Implement low-rank attention routing using encode-decode factorization to achieve linear-time complexity on long sequences while maintaining compatibility with optimized attention kernels.
Fuse categorical sampling directly into LM-head matrix multiplication to eliminate logits materialization. Use Gumbel noise during computation and hierarchical reduction to achieve 19% token-level latency reduction.
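The sampling identity behind this entry can be sketched in plain Python. Taking the argmax of `logits + Gumbel noise` draws a sample from `softmax(logits)` (the Gumbel-max trick), and an argmax can be reduced tile by tile, so no full logits vector has to be kept. The tile size, the helper names, and the example logits are assumptions for illustration. A real fused kernel would do this inside the GPU matmul, which this sketch does not attempt.

```python
import math
import random

def gumbel_noise(rng):
    """Draw one standard Gumbel sample as -log(-log(u)), u ~ Uniform(0, 1)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))

def tiled_gumbel_argmax(logits, noise, tile_size):
    """Argmax of logits + noise, reduced within each tile and then across tiles."""
    best_val, best_idx = -math.inf, -1
    for start in range(0, len(logits), tile_size):
        # Local reduction: only the running (value, index) pair is kept per tile.
        tile_val, tile_idx = -math.inf, -1
        for i in range(start, min(start + tile_size, len(logits))):
            v = logits[i] + noise[i]
            if v > tile_val:
                tile_val, tile_idx = v, i
        # Global reduction across tile winners.
        if tile_val > best_val:
            best_val, best_idx = tile_val, tile_idx
    return best_idx

rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 1.5, 0.0, 3.0, -0.5, 1.0]  # hypothetical vocabulary of 8
noise = [gumbel_noise(rng) for _ in logits]
token = tiled_gumbel_argmax(logits, noise, tile_size=3)
```

Because the tiled reduction returns the same index as a flat argmax over the perturbed logits, the tiling changes memory traffic but not the sampled distribution.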
DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation. From arXiv:2601.09688
Combines DeepSeek Sparse Attention (DSA), which achieves O(Lk) complexity, a scalable RL framework that allocates 10%+ of compute to post-training, and large-scale agentic task synthesis with 1,800+ environments. Matches proprietary models in reasoning; DeepSeek-V3.2-Speciale achieves olympiad gold.
Research contribution advancing agent and reasoning capabilities through novel approaches to model development, training, and evaluation.
Optimize diffusion models for preference alignment by learning directly from group-level preferences without stochastic policies, enabling efficient and stable training.
Replace PPO's heuristic ratio-based clipping with Divergence Proximal Policy Optimization (DPPO) that directly constrains policy divergence using either Total Variation or KL, enabling lightweight approximations (Binary, Top-K) for vocabulary-scale computations while improving stability and efficiency.
Improve RL training efficiency by embedding explicit reflection and revision loops. Models generate initial responses, receive feedback, produce self-reflections describing improvements, revise their attempts, and distill successful corrections into the base policy. Achieves up to 81% improvement on complex tasks through structured behavioral change.
Multimodal system for joint audio-video understanding and generation, enhancing agent perception and communication in environments with both audio and visual information.
Generates high-quality images directly from pixels using flow-matching-based latent sequences. Transforms images via invertible flows into manageable latent sequences, applies autoregressive modeling, and uses classifier-free guidance. Provides exact likelihood estimates and one-step distillation capabilities.
Implements Fast-ThinkAct from arXiv:2601.09708
Demonstrate that adversarial attacks on vision-language models need not target all tokens equally. Entropy-guided attacks identify high-entropy tokens (critical decision points) where perturbations have maximum impact, achieving comparable attack success with 80% fewer tokens targeted.
Edit 3D faces with flexible mask layouts using only a few training samples. FFaceNeRF employs geometry adapters with feature injection and latent mixing for tri-plane augmentation, enabling rapid NeRF adaptation without fixed segmentation masks. Ideal for personalized medical imaging, creative face editing, and applications requiring user-defined mask control.
Convert arXiv papers that apply ML techniques to real-world domains into application-transfer skills. Extracts problem formulation, domain adaptation gaps, and deployment recipes. Use this skill when extracting skills from Category 1 (Application Transfer) papers — papers about AlphaFold-style domain applications, robotics deployment, interdisciplinary ML crossings, or any paper where the novelty is in what was solved rather than the method itself.
Convert foundational papers that create new subfields into conceptual framework skills. Extracts problem definitions, vocabulary, founding experiments, and opened research directions. Use this skill when extracting skills from Category 8 (Field Foundation) papers — MAML-style paradigm-creating papers or 'Deep Learning' review-style papers that define entire research communities.
Convert mechanistic analysis papers into transferable analytical methodology skills. Extracts the research question, analytical instrument, controlled confounds, and practitioner implications. Use this skill when extracting skills from Category 9 (Mechanistic Analysis) papers — Network Dissection-style interpretability work or any paper whose goal is exploratory understanding of why something works.
Convert scaling and efficiency papers into practical resource planning guides. Extracts empirical scaling laws, compute-optimal allocation rules, and budget-performance trade-offs. Use this skill when extracting skills from Category 11 (Scaling and Efficiency) papers — Chinchilla-style scaling law papers, Flash Attention efficiency papers, or knowledge distillation studies.
Convert systematic empiricism papers into ranked practitioner checklists. Extracts implementation tricks, hyperparameter findings, and design choice ablations with conditions of applicability. Use this skill when extracting skills from Category 4 (Systematic Empiricism) papers — '37 PPO details'-style papers, hyperparameter studies, or ablation-heavy guides that systematize scattered knowledge.
Replace token-insertion for fusing vision and language with efficient cross-attention that maintains separate text self-attention. Enables text tokens to attend images within local windows, preserves gist tokens from prior images, and maintains near-constant memory costs for streaming video—more practical than direct token insertion for resource-constrained applications.
Train specialized LLMs to generate optimized Triton GPU kernels using RL with dual rewards for correctness and syntax compliance. 8B model achieves parity with Claude-Sonnet and DeepSeek-R1 by combining supervised fine-tuning on curated code pairs with RL exploration beyond imitation learning ceilings.
We introduce Being-H0.5, a foundational Vision-Language-Action (VLA) model designed for robust cross-embodiment generalization across diverse robotic platforms. While existing VLAs often struggle with morphological heterogeneity and data scarcity, we propose a human-centric learning paradigm that treats human interaction traces as a universal 'mother tongue' for physical interaction. To support this, we present UniHand-2.0, the largest embodied pre-training recipe to date, comprising over 35,000...
Convert arXiv papers into ready-to-use agent skills using category-aware extraction. First classifies the paper into one or more of 11 research categories, then applies a specialized extraction pipeline for each category — because different types of papers produce different types of usable knowledge. A single paper can yield multiple skills if it spans categories. Use this skill whenever the user wants to turn a paper into a skill, extract practical techniques from research, build a skill library from papers, convert arXiv papers into reusable agent instructions, or batch-process multiple papers into skills. Also trigger when someone asks about extracting actionable knowledge from papers, making research practical for LLM agents, or systematically converting academic contributions into structured agent capabilities.
Route queries to specialized reasoning modes (internal reasoning, tool calling, or instant answers) using task-aware routing and Adaptive Policy Optimization to reduce inference costs by 45% while maintaining accuracy.
Build controllable benchmarks for evaluating long-context agents using environment rollouts. Generate diverse multi-step agent tasks that require maintaining context across extended interaction sequences, enabling evaluation of agent reasoning quality in scenarios with long history requirements.
Generate adversarial test suites that catch LLM code errors by analyzing constraint differences between correct and incorrect solutions. SAGA framework improves test detection rate 9.55% and verifier accuracy 12.14% by using human bug patterns and multi-input differential analysis.
Train LLM-based agents with end-to-end RL by extending MDPs to handle tool invocation and environmental stochasticity—enable dense process rewards for intermediate steps and masked policy gradients for learnable actions.
Build multi-faceted reward models for agent trajectories that provide structured feedback on intermediate reasoning quality. Implement explicit reasoning traces, focused critiques with refinement guidance, and overall process scores to train more effective agents without relying solely on sparse outcome rewards.
Implements The Agent's First Day from arXiv:2601.08173
Evolve agent behavior through iterative context refinement using delta updates rather than full rewrites, accumulating strategies and insights across execution traces.
Train agentic LLMs through curriculum-based learning to autonomously execute full data science workflows from raw data to analysis reports, enabling 8B models to match proprietary systems.
Current Large Language Models (LLMs) exhibit a critical modal disconnect: they possess vast semantic knowledge but lack the procedural grounding to respect the immutable laws of the physical world. Consequently, while these agents implicitly function as world models, their simulations often suffer from physical hallucinations, generating plans that are logically sound but physically unexecutable. Existing alignment strategies predominantly rely on resource-intensive training or fine-tuning, which...
Use meta-learning to automatically balance Supervised Fine-Tuning and Reinforcement Learning signals, treating SFT and RL as complementary rewards in a unified single-stage training framework.
Automate sub-agent creation by treating agents as dynamically creatable executors defined by a four-tuple abstraction (Instruction, Context, Tools, Model), enabling flexible delegation and cost-aware routing for complex multi-step tasks.
Reduces inference cost by compressing context into continuous representations using a separate encoder. Generates 4-8x fewer representations than token embeddings while maintaining model performance. Works with any decoder LLM without modification or fine-tuning.
Replace ratio-based clipping in GRPO with KL-divergence constraints using the KL3 estimator, improving exploration and training stability with asymmetric clipping that requires no additional computation.
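For orientation, assuming "KL3" refers to the widely used k3 estimator, its per-sample term is `(r - 1) - log r` with `r = p(x) / q(x)` for a sample `x` drawn from `q`. It is non-negative and its expectation under `q` equals KL(q || p). The sketch below shows only the estimator. The function name and the two-outcome distributions are assumptions, and how the method turns the estimate into a constraint is not shown here.

```python
import math

def k3_term(logq_x, logp_x):
    """Per-sample k3 term for KL(q || p), for a sample x drawn from q.

    logq_x is log q(x) under the sampling policy, logp_x is log p(x)
    under the other policy.
    """
    log_r = logp_x - logq_x
    # (r - 1) - log r, written with expm1 for accuracy when r is near 1.
    return math.expm1(log_r) - log_r

# Hypothetical two-outcome distributions to check the estimator.
q = [0.5, 0.5]
p = [0.9, 0.1]
expected_k3 = sum(qi * k3_term(math.log(qi), math.log(pi)) for qi, pi in zip(q, p))
exact_kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))
```

In this toy case the expectation of the k3 term matches the exact KL divergence, and every individual term is non-negative, which is why the estimator tends to have lower variance than the plain log-ratio.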
Build fully open audio-language models supporting reasoning over speech, sound, and music with 10-minute long-form comprehension and multi-turn conversation capabilities. Use when you need to process audio modalities alongside text for complex reasoning tasks across speech recognition, sound classification, and music analysis.
Improve computer-use agent performance by running multiple rollouts and selecting the best trajectory using narrative-level reasoning. The Behavior Judge (BJudge) converts raw execution traces into behavior narratives, enabling intelligent trajectory selection that scales agent effectiveness beyond single-rollout limitations.
Stabilize off-policy RL for LLMs using adaptive clipping that dynamically rebalances positive/negative gradients and preserves entropy, improving mathematical reasoning performance vs standard PPO.
Improve long-context performance by incorporating imaginary components discarded in standard RoPE implementations. Use phase information from complex-valued attention for richer positional encoding—especially valuable as context length increases beyond normal ranges.
Defend against indirect prompt injection attacks by detecting dominance shifts using leave-one-out attribution, enabling selective sanitization without sacrificing latency or utility.
Improves LLM tool-use capabilities through automated environment construction that generates realistic feedback and verifiable rewards for RL-based training without external tools.
Compress long prompts to 1/26th of original size while maintaining retrieval accuracy using hierarchical page-level pooling, without model fine-tuning.
Improve credit assignment in multi-objective RL by decomposing advantages into segment-specific values. Use Outcome-Conditioned Baselines to reduce cross-objective interference without expensive rollouts, enabling better training signals for multi-step completions with different reward functions per segment.
Evaluate AI systems' ability to conduct autonomous research experiments using EXP-Bench, a benchmark for multi-step scientific reasoning and iterative experimental workflows.
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. Achieves a 41% performance improvement while allowing users to opt out of specific data sources without retraining.
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.
Reduce agent execution steps by 35% and latency by parallelizing sequential tool calls through task dependency graphs (DAGs). Use when deploying information-retrieval agents where tool execution ordering is flexible.
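The dependency-graph idea can be sketched with the standard library: group tool calls into batches so that every call in a batch has all of its prerequisites finished and could run concurrently. The tool names and the `parallel_batches` helper are hypothetical, and the sketch only plans the batches rather than executing any tools.

```python
from graphlib import TopologicalSorter

def parallel_batches(dependencies):
    """Group nodes into batches whose members have no unmet dependencies.

    `dependencies` maps each tool call to the set of calls it must wait for.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches

# Hypothetical information-retrieval plan: three independent lookups, one merge.
deps = {
    "search_flights": set(),
    "search_hotels": set(),
    "get_weather": set(),
    "build_itinerary": {"search_flights", "search_hotels", "get_weather"},
}
batches = parallel_batches(deps)
print(batches)
```

Here four sequential steps collapse into two batches, which is the kind of step reduction the entry describes when tool ordering is flexible.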
Accelerates long-context LLM prefilling by identifying sparse attention patterns without expensive scoring, using block-level approximations and dynamic thresholding. Achieves 27.78x speedup at 256K tokens while maintaining accuracy.
Train small models to recover from tool execution errors by treating errors as training signals. Use error simulators to generate recovery examples and fission failed trajectories into multiple recovery attempts.
Active learning (AL) has the potential to drastically reduce annotation costs in 3D biomedical image segmentation, where expert labeling of volumetric data is both time-consuming and expensive. Yet, existing AL methods are unable to consistently outperform improved random sampling baselines adapted to 3D data, leaving the field without a reliable solution. We introduce Class-stratified Scheduled Power Predictive Entropy (ClaSP PE), a simple and effective query strategy that addresses two key lim...
Unify LLM alignment methods through f-divergence theory. f-GRPO extends GRPO to handle any divergence measure (KL, Jensen-Shannon, Hellinger), enabling tailored alignment objectives. f-HAL combines on-policy and off-policy preference learning to prevent reward hacking while maintaining safety alignment.
Research contribution advancing agent and reasoning capabilities through novel approaches to model development, training, and evaluation.
Systematically adapt state-of-the-art distillation methods for T2I generation. Compare sCM (stabilized Consistency Models), MeanFlow, and IMM within unified framework. sCM excels at extreme few-step regimes (52.81% GenEval at 2 steps), MeanFlow achieves superior fidelity at 4 NFEs.
Identify and measure feedback friction in LLM reasoning tasks where models resist high-quality guidance, discovering that confidence predicts feedback receptiveness and revealing mitigation strategies.
Train efficient reasoning models using stage-wise context scaling and complexity-aware data selection. Achieves 49.6% accuracy on AIME 2024 while reducing training steps by 50% through alternating compress-extend cycles that progressively refine reasoning quality.
Identify dominant frequency components in RoPE embeddings to determine which attention dimensions are essential, reducing KV cache by 81% while maintaining performance. Training-free approach integrates with existing compression methods for 2.56x speedup on long-context reasoning.
Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. Moreover, multimodal extensions like OctoNav-R1 and CoT-VLA further validate CoT as a promising pathway toward human-li...
Enable extreme 1-2bit quantization of pre-trained LLMs by leveraging complex-valued arithmetic. Convert real-valued linear layers to complex domain losslessly, quantize to fourth roots of unity {±1, ±i}, and apply recursive residual error quantization for near full-precision performance.
F1 is a 4.2B-parameter Vision-Language-Action model that reformulates robotic control as foresight-guided inverse dynamics. Rather than reactive state-to-action mappings, F1 predicts plausible future visual states and derives actions to achieve them, enabling robust planning in dynamic environments with 82.2% real-world task success rate.
Investigate exploration-exploitation trade-offs in reinforcement learning with verifiable rewards through theoretical analysis and empirical validation. Derive explicit clipping bias bounds, establish policy-entropy shift formulation, and introduce reward-misalignment framework. Show policy entropy and performance lack direct causal relationships.
Improve MoE language model efficiency with causal threshold-based routing that eliminates auxiliary losses and enables dynamic per-token computation.
Improve LLM reasoning efficiency by systematically reusing past rollouts through experience replay. ExGRPO organizes training data by success and diversity, applying a mixed-policy objective that prioritizes high-quality examples while maintaining exploration, achieving 3.5-7.6 point gains over on-policy methods.
Improve RLVR training efficiency by selectively replaying trajectories based on correctness and entropy. Medium-difficulty questions and low-entropy solutions are most valuable; selective replay yields +3.5-7.6% improvements.
Build robust reward models for computer-use agents by processing execution videos with spatiotemporal token pruning (STP+TTP) to remove redundancy while preserving task-relevant UI details. Achieve 84.7% accuracy with adversarial instruction translation.
Build unified LLMs supporting both fast standard inference and slow reasoning modes through hybrid attention and adaptive post-training. Combine non-reasoning and reasoning capabilities in a single model with 128K context windows and tool use. Use when you need models balancing speed and reasoning depth across diverse tasks and languages.
System for evolving and composing programmatic skills through learning, enabling agents to discover and develop new behavioral capabilities dynamically.
Scale Evolution Strategies to billion-parameter LLMs without backpropagation for superior robustness and stability across diverse models, reward horizons, and evaluation tasks. Outperforms RL methods while eliminating gradient computation overhead.
Generate verifiable synthetic training data (problems + solutions + verification artifacts) through evolutionary synthesis that works across math, code, and agent tasks without task-specific rules.
Chunk-based landmark-guided sparse attention enabling 16M-token context windows with 90%+ retrieval accuracy on in-context tasks. Use when processing ultra-long documents where full attention is prohibitive but context selection must be dynamic.
Transform vision-language models into embodied agents through two stages: learning embodied priors from trajectory-augmented data with LLM reasoning, then online RL with self-summarization and dense rewards for long-horizon tasks.
Stabilize multi-turn LLM agent training with entropy-regularized policy optimization that prevents exploration-exploitation cascade failures in sparse-reward environments through trajectory-level entropy regulation, historical smoothing, and adaptive phase-based weighting. Achieve up to 152% performance improvement on scientific reasoning tasks and 19.8% on embodied control by maintaining controlled entropy oscillations across 30+ interaction turns.
Systematic evaluation toolkit for assessing large language models across multiple dimensions, enabling comprehensive benchmarking of agent capabilities and comparative analysis of model performance.
Stabilize LLM post-training by constraining global distributional shifts in policy exploration. Entropy Ratio Clipping supplements local clipping mechanisms with global entropy constraints—essential when PPO alone produces unstable gradients and distribution shifts.
Monitor Shannon entropy in LLM token distributions to detect sudden uncertainty spikes, then trigger adaptive prompt consolidation to realign conversation context and improve accuracy by 56.6% and reliability by 35.3%.
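A minimal sketch of the monitoring step, assuming per-token probability distributions are available: compute Shannon entropy for each step and flag steps whose entropy jumps above a rolling baseline. The window length, the 1-bit threshold, and the function names are assumptions for illustration, and the prompt consolidation that would follow a flagged spike is not shown.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def find_entropy_spikes(distributions, window=3, threshold=1.0):
    """Return indices whose entropy exceeds the mean of the previous
    `window` entropies by more than `threshold` bits."""
    entropies = [shannon_entropy(d) for d in distributions]
    spikes = []
    for i in range(window, len(entropies)):
        baseline = sum(entropies[i - window:i]) / window
        if entropies[i] - baseline > threshold:
            spikes.append(i)
    return spikes

confident = [0.97, 0.01, 0.01, 0.01]   # low-entropy step
uncertain = [0.25, 0.25, 0.25, 0.25]   # maximum entropy for 4 outcomes (2 bits)
steps = [confident, confident, confident, uncertain, confident]
spikes = find_entropy_spikes(steps)
```

In this toy sequence only the fourth step (index 3) is flagged, since it is the one place where uncertainty rises sharply relative to the preceding steps.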
Stabilize RL training on reasoning tasks by performing entropy-guided rollouts from uncertain decision points, avoiding policy collapse and premature convergence. Increases fully correct trajectories on math reasoning while maintaining stable entropy throughout training.
A one-line code modification that augments the RL advantage function with a clipped entropy term, encouraging exploratory reasoning chains while maintaining optimization stability.
Technique for efficient model adaptation that mitigates catastrophic forgetting during fine-tuning, enabling agents to learn new tasks while preserving existing capabilities.
Train LLMs to disambiguate tool calls in enterprise settings where multiple similar APIs exist and parameters are incomplete. Generates synthetic multi-turn dialogues with realistic ambiguity to improve tool selection accuracy by 27+ percentage points.
Enable step-by-step reasoning in diffusion models through iterative latent state refinement. Condition diffusion on evolving thought states across multiple reasoning steps, grounded with textual supervision to prevent drift.
Improve exploration in LLM-based agents through external memory-augmented RL with hybrid on/off-policy training. Agents generate exploration 'tips' (self-reflections) after trajectories and store them in memory. During rollouts, the policy samples between standard execution and memory-conditioned execution. Off-policy updates distill memory-guided behaviors into the base policy via reward-guided knowledge distillation. Achieves 128.6% improvement on ScienceWorld and 11.3% on WebShop vs. GRPO.
Build a single model handling multimodal understanding, generation, and editing tasks efficiently through token compression and intelligent component sharing. EMMA-4B surpasses larger models while reducing computational burden—ideal when you need unified performance across vision and language tasks.
Discover hierarchical temporal abstractions within autoregressive models via internal RL, enabling efficient exploration of sparse-reward tasks. Metacontroller learns abstract action sequences modifying residual streams, switching gates enable quasi-binary patterns, and abstract-space RL achieves many orders-of-magnitude speedup over token-level learning.
Bridge the vision-to-action gap using pointing as a unified intermediate representation, enabling 56.2% success on manipulation tasks without task-specific fine-tuning.
Enable autonomous embodied agents to function in 3D communities with structured memory systems. Combines semantic memory (scene graphs, knowledge graphs) and episodic memory (spatiotemporal experiences) for social intelligence and multi-agent coordination.
Comprehensive survey of techniques for optimizing large reasoning models. Covers single-model optimization and multi-model collaboration approaches to reduce reasoning path length without sacrificing capability.
Framework for efficient machine unlearning that reformulates forgetting as inverse learning. Achieves significant computational speedup by replacing expensive Hessian operations with gradient-based optimization, enabling privacy-preserving model updates.
Achieve 10× higher decoding throughput on long prompts by replacing 50% of cross-attention layers with gated memory units (GMUs) combining SSMs and attention. Maintains reasoning capability while reducing memory I/O bottleneck from O(d_kv·N) to O(d_h).
Systematically convert pretrained autoregressive models into efficient diffusion language models via block-wise attention and position-dependent masking. Efficient-DLM family (1.5B/4B/8B) maintains comparable accuracy to standard AR models while delivering 4.5× higher throughput.
Systematically optimize agent system costs via empirical analysis of LLM, planning, memory, and search components, achieving a 28.4% cost reduction.
Dynamic token pruning framework for VLMs that adapts compression to scene complexity through single-pass selection. Removes 92.6% of visual tokens while maintaining performance and enabling superior fine-tuning.
Implement hierarchical language modeling that compresses variable-length token sequences into high-capacity semantic concepts, achieving +2.69% benchmark improvements while reducing inference FLOPs by reallocating compute to concept-level reasoning. Use for efficiency-critical deployments where reasoning quality can be improved while maintaining computational budget.
Accelerate test-time scaling for diffusion language models by identifying inconsistent tokens, selectively remasking and regenerating only those uncertain tokens, and aggregating across samples via voting. Achieve 5.5-22× speedup over standard iterative sampling with 6-8% accuracy gains on reasoning tasks.
Implement dual preference optimization to generate self-supervised feedback without manual annotation by decomposing tasks into known/unknown components and reconstructing hidden information from model outputs.
Generate SVGs through simultaneous image and SVG token generation with internal visual guidance. DuetSVG overcomes text-only limitations by leveraging visual predictions to enhance SVG coherence—ideal when visual quality and geometric correctness matter.
Implement techniques from DSGym: A Holistic Framework for Evaluating and Training Data Science Agents. Data science agents promise to accelerate discovery and insight-generation by turning data into executable analyses and findings.
Improve LLM reasoning by promoting diversity at both trajectory and token levels simultaneously. Global (trajectory) scale rewards distinct correct solutions; local (token) scale applies entropy regularization per decision point. Dual-scale approach couples these via diversity-weighted allocation: solutions that are globally more distinctive receive stronger local regularization, focusing exploration where it matters most among underexplored correct modes.
Use per-layer routers trained with Monte Carlo Tree Search to dynamically skip, execute, or repeat transformer layers for each token. Saves ~5 layers per example while maintaining accuracy on diverse benchmarks.
Unified system for multi-task video generation combining audio and visual synthesis, demonstrating scalable approaches for content generation that can enhance agent communication capabilities.
Convert video generation model outputs into executable robotic manipulation by extracting 3D object flow trajectories as an intermediate representation. Enables zero-shot manipulation of diverse object types (rigid, articulated, deformable, granular) without task-specific training. Use when pre-trained video models capture plausible manipulation patterns but need grounding in low-level robot control.
Enable Diffusion Language Models to achieve 8.5x inference speedup (24-30 steps vs. 256) through certainty-forcing distillation that trains models to achieve simultaneous high confidence across multiple tokens. Use when optimizing inference latency for dLLM deployments.
Improve long-context length extrapolation by denoising instabilities in Rotary Position Embeddings (RoPE) through spectral analysis and selective head rewriting—training-free post-hoc intervention for longer context windows.
Generate multiple tokens simultaneously by having late transformer layers directly predict multiple outputs after early layer processing. No auxiliary parameters or verification needed. Achieve up to 2x generation speedup.
Improve multimodal reasoning at test-time through confidence-guided latent optimization without retraining. Iteratively refine learnable latent think tokens via policy gradient using confidence reward. Dynamically select and update relevant image patches based on internal confidence levels. Maintain high efficiency with all optimization in latent space.
Train models to decompose complex problems into subproblems via divide-and-conquer reasoning. Structured approach enables systematic solution assembly and improved long-horizon reasoning compared to end-to-end generation.
Synthesize diverse, verifiable training tasks by executing real tools first, then reverse-deriving tasks from execution traces. Ensure diversity across tools and reasoning patterns while maintaining grounding by construction.
Align latent distributions with arbitrary reference distributions via explicit matching constraints rather than fixed priors. DMVAE achieves gFID 3.2 on ImageNet with 64 epochs. Use when you need flexibility in latent representation design for image generation.
Convert inference-time feedback into persistent, retrievable guidelines stored as agent memory. Framework enables LLMs to improve performance over time by systematically accumulating and applying learned critiques. Augmented models rapidly match test-time refinement performance while drastically reducing inference cost. Memory-as-tool pattern enables agents to learn from feedback without expensive retraining.
Reformulate discrete diffusion using signal-to-noise ratio for hybrid masked-uniform noise scheduling. Derive compute-optimal scaling laws through careful hyperparameter tuning, showing uniform diffusion scales more favorably in token-constrained settings than autoregressive models.
Enables diffusion LLMs to achieve 2.5× faster inference than autoregressive models through block-wise generation with parallel inter-block decoding.
Enable effective RL for diffusion language models via DiPO (unbiased GRPO for dLLMs) and framework optimizations. FlexAttention accelerates blockwise training, LMDeploy optimizes inference, achieving training-inference consistency—improving dLLM math performance to rival larger autoregressive models.
Improve diffusion model capacity efficiency by directly predicting clean data instead of noise—leverage the manifold assumption that natural data occupies low-dimensional space while noise spans full dimensionality.
Convert pre-trained autoregressive vision-language models into diffusion VLMs without architectural modifications. Use block diffusion strategy enabling arbitrary-length generation and KV-cache reuse. Hybrid attention enforces bidirectional within blocks, causal between blocks. Requires less than 5% of data compared to prior diffusion VLM methods.
Apply diffusion models as native generative agents for vision-centric reasoning tasks (sequential planning, constraint satisfaction, spatial configuration) instead of text-based LLM chains. Achieves 3x+ improvements over GPT-5 and Gemini-3 on visual reasoning. Use when image-to-image generation better captures the reasoning constraints than text-based problem decomposition.
Recast chain-of-thought reasoning as iterative denoising using diffusion principles to overcome exposure bias in autoregressive reasoning. DiffCoT enables retrospective refinement of intermediate steps while maintaining temporal consistency through causal noise scheduling.
Train diffusion language models to generate optimized CUDA kernels using bi-phase reinforcement learning. First phase masks and regenerates core kernel logic with provided scaffolding to prevent PyTorch shortcuts. Second phase enables end-to-end generation. Leverage diffusion's global context awareness for non-sequential code generation.
Accelerate LLM inference 6x by using block diffusion for parallel token drafting with tight coupling to the target model's hidden representations, achieving higher speedups than existing speculative methods without quality loss.
Extend monocular depth models to handle adverse conditions (weather, darkness, sensor noise) using only 540K training samples. Applies perturbation-based consistency learning and spatial constraints to maintain robust depth prediction across challenging real-world scenarios.
Improve diffusion model alignment by assigning step-wise rewards during denoising instead of terminal rewards. Fixes sparse reward signal mismatch in multi-step generation processes through ODE-based reward estimation.
Synergistic verifier-generator training loop enabling LLMs to identify logical issues in mathematical proofs without reference solutions, improving reasoning rigor through meta-verification. Apply when you need to scale mathematical reasoning without hand-labeled proof annotations.
Monitor search agent reasoning quality via hierarchical uncertainty detection. Fast consistency checks identify anomalies; slow experience-driven feedback provides corrections. Minimal overhead while catching misalignment.
Reduce redundant tokens in parallel reasoning by 80% while maintaining accuracy via dynamic pruning of equivalent reasoning paths. Trigger: improve efficiency of consensus-based reasoning (multiple CoT generation).
Enhances model safety by filtering dual-use topics from pretraining data, creating tamper-resistant models robust to adversarial fine-tuning without degrading unrelated capabilities.
Maintains half of sliding window as attention sinks with dynamic temporal RoPE alignment plus importance-aware KV cache pruning, enabling 12× extrapolation beyond training length (60+ seconds from 5-second training) without fine-tuning.
Train multimodal agents to dynamically invoke tools (code execution, web search) within reasoning loops through a two-stage pipeline combining cold-start supervised learning with reinforcement learning—enabling task-adaptive tool invocation for perception, reasoning, and retrieval tasks.
Transform research specifications into production-grade codebases through strategic information management and autonomous agent orchestration. DeepCode surpasses PhD experts and commercial tools—critical when you need scientific code reproducibility at scale.
Enables autonomous reasoning agents to discover and invoke tools efficiently through end-to-end training. Uses autonomous memory folding to compress interaction history and ToolPO to learn general-purpose tool use, applicable across diverse benchmarks from QA to web automation.
Bridge the gap between informal mathematical reasoning (80% accuracy) and formal proof synthesis (8% success) by decoupling them: a general-purpose reasoner generates strategic lemmas, then a specialized prover verifies them formally. First open-source solver of post-2000 IMO problems.
Accelerate diffusion transformer inference by dynamically adjusting patch granularity during generation based on detail complexity at each timestep. Early denoising steps (establishing low-frequency structure) use coarse patches; later steps (adding high-frequency detail) use fine patches. Achieves 3.52× speedup on FLUX-1.Dev and 3.2× on video models while maintaining quality through variance-based adaptive scheduling.
Accelerate video generation by 14.8x through deep compression autoencoder (32x-64x spatial, 4x temporal compression) combined with lightweight adapter-based model adaptation. Use when deploying video diffusion models under compute or latency constraints.
DCPO eliminates zero-gradient dead zones in policy optimization by adaptively adjusting token-level clipping bounds based on prior probabilities and smoothing advantage standardization across cumulative training steps, achieving 28% improvement in effective response utilization and 10x reduction in token clipping ratio on mathematical reasoning benchmarks.
Accelerate video generation through dual-expert consistency distillation, using separate denoisers for semantic layout/motion and detail refinement to resolve conflicting optimization gradients.
Automatically synthesize and optimize training data using GRPO to generate data recipes (specifications for dataset creation). Use a Data Verifier to efficiently evaluate sample quality without full model training. Achieve performance comparable to human expert curation across diverse domains.
Train efficient robot manipulation policies by strategically applying task diversity and debiasing expert demonstrations to remove execution speed variations that degrade learning.
Boost language model performance by strategically ordering training data without changing content or model size. Uses learnability-quality scoring and folding schedules to improve convergence and knowledge retention, achieving consistent gains across all model scales.
Accelerate the Shampoo optimizer 4.8x using batched block-wise preconditioning and numerical approximations, enabling more frequent preconditioner updates without computational bottleneck.
Enable autonomous agent self-improvement through evolutionary mutation of agent codebases, using LLM-generated variants and empirical validation to discover beneficial modifications like enhanced tools and context management.
Framework for training cybersecurity agents without access to live environments. Uses CTF writeups and persona-driven LLM simulation to synthesize training trajectories, achieving performance matching proprietary systems like Claude-3.5-Sonnet.
Improves reasoning efficiency through curriculum learning that progressively constrains token budgets, enabling models to first discover solution strategies then distill them into concise traces.
While large language models (LLMs) have been shown to perform well on monolingual mathematical and commonsense reasoning, they remain unreliable for multilingual medical reasoning applications, hindering their deployment in multilingual healthcare settings. We address this by first introducing CUREMED-BENCH, a high-quality multilingual medical reasoning dataset with open-ended reasoning queries with a single verifiable answer, spanning thirteen languages, including underrepresented languages such as ...
Improve code and test generation through co-evolution where LLMs generate both solutions and tests, optimizing each based on mutual evaluation and discriminative testing performance.
Uses LLMs with RL to automatically optimize HGEMM CUDA kernels across 1,000 configurations, systematically outperforming NVIDIA's cuBLAS and cuBLASLt through continued pretraining, general RL, and specialized HGEMM RL stages.
Create a universal memory infrastructure enabling agents across different frameworks to share experience trajectories without retraining. Improve agent performance by retrieving workflows from related domains and applying diagnostic fixes.
Trains language models to provide quality feedback through two-stage RL. Stage 1 optimizes discriminability (distinguishing good vs bad responses). Stage 2 adds helpfulness rewards (improving actor after feedback). Achieves 9.02% improvement without requiring stronger supervisors for training data.
Improve LLM reasoning by combining numerical and natural language critique feedback in online RL for policy refinement.
Improve formal theorem proofs by treating criticism—evaluation of semantic correctness—as a learning signal. Train critic models to distinguish correct from incorrect formalizations, then use their feedback to guide RL-based proof generation.
Permanently remove unwanted concepts from LLMs by identifying and suppressing sparse autoencoder features across layers, creating parameter-level changes that prevent reversal.
Optimize approximate nearest neighbor search via contrastive RL, learning to generate efficient code for HNSW graph construction, search, and refinement.
Enhance language model reasoning through coupled sampling from prior (question-only) and posterior (answer-conditioned) distributions. Construct composite distribution mixing both at token level using hybrid sampling. Combine reconstruction term, selective NLL loss, and KL regularization. Achieve 12.4% improvement over base model and 2.3% over comparable baselines.
Enable vision-language models to perform embodied question answering in 3D environments through active camera exploration. CoV uses training-free test-time reasoning to iteratively select relevant viewpoints and adjust camera angles until sufficient context is gathered, achieving 11-13% accuracy improvements across spatial reasoning benchmarks.
Enable open-source reasoning model development with a 100K-instance Long CoT Collection, scaling from 1K o1 seed samples through guided synthesis with GPT-4o, achieving 2-3× RL performance gains.
Achieve efficient neural networks via self-supervised dynamic routing using Cosine Incompatibility Ratio (CIR). Ground gating decisions in geometric novelty rather than learned heuristics, enable per-sample/per-block binary routing via Gumbel-softmax, and constrain with progressive FLOPs regularization, maintaining accuracy while reducing computation by 28.5% on CIFAR-10.
Enhance spatial reasoning in multimodal LLMs by integrating depth and segmentation as auxiliary modalities with adaptive reasoning strategies. COOPER achieves 6.91% improvement in spatial understanding. Use when you need 3D-aware vision-language capabilities.
Implement techniques from CooperBench: Why Coding Agents Cannot be Your Teammates Yet. Resolving team conflicts requires not only task-specific competence, but also social intelligence to find common ground and build consensus.
Train a specialized continue-thinking token via reinforcement learning to extend reasoning depth during inference, achieving superior performance over fixed-token baselines.
Train adaptive ODE solvers that generate fast diffusion previews while maintaining consistency with full-step refinement. Learn context-aware integration coefficients through PPO without distilling base models. Achieve high-quality previews with 47% fewer steps than standard methods.
Improves speculative decoding acceptance rates by exposing target model's intermediate reasoning through contemplate tokens. Achieves 8-11% acceptance rate improvement over EAGLE through future-direction guidance without extra forward passes.
Build AI code agents that scale to massive repositories with long-context reasoning and persistent memory. Confucius SDK achieves 59% Resolve@1 on SWE-Bench-Pro—ideal when AI needs to handle real-world codebases with complex toolchains.
Improve student model robustness under covariate shift by using diffusion-based augmentation that targets spurious features via teacher-student disagreement.
Learn optimal configurations for agentic AI systems through hierarchical RL that treats configuration as a query-wise decision problem. Structure policy selects workflows/tools/budgets while prompt policy composes specific instructions, achieving 25% accuracy improvement with 35% cost reduction.
Improve language model reasoning using only model confidence as reward signals, eliminating need for labels or preference models while achieving substantial gains with minimal data.
Learns richer spatial representations by training on both 2D and 3D data simultaneously. Combines 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding, achieving 14.2% and 4.8% improvements over single-modality baselines in scene understanding and geometric consistency.
Enable AI agents to safely explore action outcomes before execution by predicting UI state changes in desktop applications. Two-stage approach: first predict textual description of what changes, then synthesize visual representation of resulting screen. Allows agents to compare multiple candidate actions without risky trial-and-error, trained on Microsoft Office interactions (Word, Excel, PowerPoint).
Enable computer-use agents to flexibly choose between GUI primitives (click, type) and high-level tool calls, reducing cascading errors by 22% and improving execution speed by 11%.
Convert easy, high-accuracy training prompts into harder compositional problems by sequentially chaining multiple prompts together. Use Composition-RL to maintain effective learning signals during RL training when many prompts achieve near-perfect accuracy, enabling curriculum learning through progressive compositional depths.
Lightweight verifier model for evaluating LLM outputs across multiple domains, handling diverse answer types through meta-error pattern analysis.
Build generalist judge models for evaluating LLM outputs using verifiable rewards and policy gradient training. Create a 7B model competitive with much larger judges through reward-guided optimization and critical thinking decomposition. Use when you need reliable automated evaluation of model outputs across diverse tasks.
Advanced reasoning approach for optimizing inference efficiency through meta-cognitive planning, enabling agents to make better decisions with reduced computational overhead.
Generate physically grounded 4D scenes from natural language through dual-stream architecture separating object detail from scene orchestration. VLM-Motion Critic validates physics parameters iteratively, bridging semantic-physical execution gap.
Generate high-quality synthetic training data that enables 7.7x faster training than web data, with smaller models achieving better performance through strategic content rephrasing and data optimization.
Evaluate and optimize tool-integrated reasoning in LLMs through empirical benchmarking, performance-cost metrics (PAC, AUC-PCC), and measurement frameworks for diverse reasoning tasks.
Hybrid multi-agent architecture where orchestrator delegates tasks to GUI Operator or Programmer agent. Coding enables efficiency on computational tasks, achieving 60.76% on OSWorld with 33% fewer steps.
Compress agent interaction histories and environment observations through natural language guideline optimization, reducing token usage by 26-54% while preserving 95%+ accuracy. Use for cost/latency reduction in multi-step agent tasks.
Decompose complex image editing instructions into simpler sub-tasks with automatically generated control guidance. Handles multi-object edits, preserves identity of surrounding regions, and eliminates manual mask creation.
Generate images with unified control over identity, spatial position, pose, and layout: encode diverse control modalities (spatial canvas, pose canvas, box canvas) into single RGB image, train diffusion model jointly across all control types, and enable flexible multi-modal composition at inference without task-specific fine-tuning.
Scale RL training to large models through decoupled rollout and training workers with controlled data staleness.
Extract fine-grained visual features from pretrained text-to-image diffusion models to improve multimodal LLMs beyond CLIP, achieving up to 6% gains through strategic fusion of unconditional and conditional diffusion representations.
Enable video generation customization via first-frame reuse: treat the initial frame as a visual entity buffer that stores objects for reuse during generation. Needs just 20-50 examples and minimal architectural change.
Connects multimodal language models with diffusion models using patch-level CLIP embeddings as shared latent variables, enabling controllable image generation with minimal training overhead.
Agentic search has recently emerged as a powerful paradigm, where an agent interleaves multi-step reasoning with on-demand retrieval to solve complex questions. Despite its success, how to design a retriever for agentic search remains largely underexplored. Existing search agents typically rely on similarity-based retrievers, while similar passages are not always useful for final answer generation. In this paper, we propose a novel retriever training framework tailored for agentic search. Unlike...
Deploy efficient MoE models on resource-constrained edge devices by learning chunk-level activation sparsity that achieves 3.67× speedup. Use when you need to compress LLMs for on-device inference while maintaining reasoning quality and supporting speculative decoding acceleration.
The Avey architecture pairs a ranker with an autoregressive processor to select relevant tokens, decoupling context window from sequence length for efficient long-range processing.
Align VLA efficiency with human cognition through 3-stage progressive routing: instruction-aware aggregation, instruction-irrelevant pruning, and coupled attention, for a 2.8x inference speedup.
Train autonomous agents to use computers by generating synthetic task experiences and iterating on them, achieving 56.7% success on OSWorld benchmarks through scalable experience-driven optimization. Use when you need agents that autonomously learn complex computer interaction patterns without manual task curation.
Automatically generate diverse, scalable tool-interactive training environments for LLM agents without manual sandbox creation. Uses topic mining and logic modeling to create varied environment architectures with task scenarios, enabling agents to learn complex multi-turn, multi-tool interactions. Synthesis framework tested on 191 environments with ~7,000 scenarios, improving Qwen3 model performance on knowledge-intensive and search tasks.
Iteratively retrieve and reason over long narratives using a dynamic memory workspace that integrates retrieved facts into a shared context for complex multi-hop reasoning.
Evaluate LLM coding capabilities across three dimensions—problem analysis, code implementation, and test validation—to identify specific reasoning gaps and improve model robustness through ensemble approaches.
Overcome exploration bottlenecks in reasoning RL by integrating Monte Carlo Tree Search during training (not just inference). Global frontier selection and entropy-guided sampling reduce GPU hours by 5.7x while improving performance.
Build LLM-driven data science agents grounded in empirical knowledge through expert knowledge base, tree search algorithms, and complexity-adaptive code generation, surpassing SOTA by 8% on MLE-Bench.
Organize agent memory as an event graph with explicit logical relationships rather than flat embeddings. Framework incrementally segments experiences into events and links them through causal, temporal, and logical relations. Enables agents to navigate memory as a logic map for goal-directed searching and structured reasoning, improving performance on multi-hop reasoning and long-horizon planning tasks.
Prevent policy collapse onto common solutions during reinforcement learning by applying difficulty-aware focal weighting to gradient contributions, maintaining diversity across solution modes while preserving performance.
Recent advances in agentic Large Language Models (LLMs) have positioned them as generalist planners capable of reasoning and acting across diverse tasks. However, existing agent benchmarks largely focus on symbolic or weakly grounded environments, leaving their performance in physics-constrained real-world domains underexplored. We introduce AstroReason-Bench, a comprehensive benchmark for evaluating agentic planning in Space Planning Problems (SPP), a family of high-stakes problems with heterog...
Dynamically allocate training budget across samples using a capability-oriented value function that measures per-sample training importance based on model capability evolution. Reduces training time via greedy heap-based allocation optimizing exploration-exploitation tradeoff.
Enable language models to improve via context adaptation rather than weight updates. Use ACE (Agentic Context Engineering) to treat contexts as evolving playbooks that accumulate, refine, and organize strategies through modular generation, reflection, and curation processes. Achieve +10.6% agent benchmark gains and +8.6% on finance tasks using small open-source models matching production-level performance.