astroreason-bench-evaluating-unified-agentic

Recent advances in agentic Large Language Models (LLMs) have positioned them as generalist planners capable of reasoning and acting across diverse tasks. However, existing agent benchmarks largely focus on symbolic or weakly grounded environments, leaving their performance in physics-constrained real-world domains underexplored. We introduce AstroReason-Bench, a comprehensive benchmark for evaluating agentic planning in Space Planning Problems (SPP), a family of high-stakes problems with heterog...

data-ai2

coba-rl-budget-allocation

Dynamically allocate the training budget across samples using a capability-oriented value function that measures each sample's training importance as the model's capability evolves. Reduce training time via greedy heap-based allocation that balances the exploration-exploitation tradeoff.

data-ai2
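The greedy heap-based allocation named in the coba-rl-budget-allocation description can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the skill's actual method: the function name `allocate_budget`, the `pass_rates` input, and the stand-in value function `p * (1 - p) / (n + 1)`, which favors samples of uncertain difficulty and shrinks as a sample receives more rollouts. The real capability-oriented value function is not given in this listing.

```python
import heapq


def allocate_budget(pass_rates, total_budget, min_per_sample=1):
    """Greedily split `total_budget` rollouts across samples.

    pass_rates: hypothetical per-sample success estimates in [0, 1].
    Returns a list with the number of rollouts assigned to each sample.
    """
    n = len(pass_rates)
    if total_budget < n * min_per_sample:
        raise ValueError("budget too small for the per-sample minimum")

    # Every sample gets a floor allocation so none is starved (exploration).
    alloc = [min_per_sample] * n

    def marginal_value(i):
        # Assumed stand-in value: Bernoulli variance p(1 - p), discounted by
        # the rollouts the sample already holds (diminishing returns).
        p = pass_rates[i]
        return p * (1.0 - p) / (alloc[i] + 1)

    # heapq is a min-heap, so store negated values to pop the largest first.
    heap = [(-marginal_value(i), i) for i in range(n)]
    heapq.heapify(heap)

    # Hand out the remaining budget one rollout at a time (exploitation).
    for _ in range(total_budget - n * min_per_sample):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        heapq.heappush(heap, (-marginal_value(i), i))
    return alloc
```

Each step costs O(log n), so distributing B rollouts over n samples takes O(B log n) time. Under this assumed value function, samples the model always solves or always fails keep only the floor allocation, while samples near a 50% pass rate absorb most of the budget.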

agentic-context-engineering-evolution

Enable language models to improve via context adaptation rather than weight updates. Use ACE (Agentic Context Engineering) to treat contexts as evolving playbooks that accumulate, refine, and organize strategies through modular generation, reflection, and curation. Achieve +10.6% gains on agent benchmarks and +8.6% on finance tasks, with small open-source models matching production-level performance.

data-ai2

Adoption

ADu2021

paper2skill

audio-roll-video-generation

aworld-distributed-agent-training

paper-categorizer

paper2skill-component-innovation

paper2skill-evaluation-infrastructure

paper2skill-insight-driven

paper2skill-paradigm-challenge

paper2skill-research-infrastructure

paper2skill-survey-synthesis

ace-confidence-penalty

acon-context-compression-agents

acesearcher-reasoning-search-self-play

adaptive-speculative-decoding

adversarial-flow-models

agent-as-a-judge-evaluation-framework

agent-conductor-topology-evolution

agent-data-protocol

agent-early-experience

agent-ocr-history-compression

agentcpm-report

agentic-uncertainty-quantification

alf-load-balancing-theory

alignment-waltz

atlas-orchestrating-heterogeneous-models-and-tools

calm-continuous-autoregressive-language-models

camels-computer-use-security

capmagine-visual-reasoning

chain-of-agents

chain-of-thought-distribution-lens

chartverse-scaling-chart-reasoning-via-reliable-pr

choreographing-a-world-of-dynamic-objects

clare-continual-learning-for-vision-language

clift-light-field-tokens

clipo-contrastive-policy-optimization

clipping-free-policy-optimization

co-evolving-critics-agent

code2world-gui-synthesis

dreamgym-experience-synthesis-rl

drpg-decompose-retrieve-plan-generate-an-agentic-f

dynamic-mask-sparse-attention

e2rank-embedding-reranking

eaglet-planner

codev-verilog-reasoning

cooper-co-optimized-policy-reward

cove-tool-use-training

cua-skill-computer-use

d-core-task-decomposition

deep-search-research-agent

deepconf-confidence-filtering

deer-diffusion-speculative-decoding

derf-normalization-free-transformers

diagonal-batching-recurrent-transformers

differential-sequence-modeling

diffucoder-diffusion-code

dino-r1-vision-reasoning

discreteness-diffusion-llm

docdancer-document-agent

doc-pp-document-policy-preservation-benchmark-for

dover-auto-debugging

dot-resize-optimal-transport-compression

dpwriter-planning

dr-mas

dr-zero-self-evolving-search

driving-gen-video-benchmark

dualpath-storage-bandwidth-optimization

dynamic-fine-tuning-sft-rl

dynaact-dynamic-action-spaces-reasoning

eager-entropy-scaling

editthinker-iterative-reasoning

embed-rl

endless-terminals-scaling-rl-environments-for-term

enhancing-linguistic-competence-of-language-models

e-grpo-high-entropy-steps-drive-effective-reinforc

finece-confidence

embedding-space-multitoken-prediction

encoder-pretraining-strategy

essential-web-taxonomy