alphaXiv

Cosmos 3: Omnimodal World Models for Physical AI
01 Jun 2026
Aditi
Niket Agarwal
Arslan Ali

NVIDIA introduces Cosmos 3, a family of omnimodal world models that jointly process and generate language, image, video, audio, and action sequences within a unified Mixture-of-Transformers architecture for Physical AI. The framework achieves competitive or state-of-the-art performance across 48 understanding benchmarks and leads open-source models on various image, video, and robot policy generation tasks, including a 39.7% success rate on RoboLab.

#agents #computer-science #artificial-intelligence

Qwen-Image-Flash: Beyond Objective Design
03 Jun 2026
Tianhe Wu
Kun Yan
Zikai Zhou

Researchers from Alibaba developed Qwen-Image-Flash, a unified visual foundation model for text-to-image generation and instruction-guided editing that runs with just 4 function evaluations (NFEs). The model matches or exceeds its 80-NFE teacher by systematically optimizing the distillation training recipe, including data composition, teacher guidance, and task mixture.

#computer-science #artificial-intelligence #computer-vision-and-pattern-recognition

MAI-Thinking-1: Building a Hill-Climbing Machine
04 Jun 2026
Microsoft
The Microsoft AI Team

Microsoft AI developed MAI-Thinking-1, a 35B active / 1T total parameter Mixture-of-Experts (MoE) model, using a "hill-climbing machine" framework for continuous, systematic improvement by training exclusively on human-generated data. The model achieved competitive performance against other frontier LLMs in STEM reasoning, coding, and general capabilities, while maintaining a strong helpfulness-safety balance.

#computer-science
Autoresearch Nanochat

An autoresearch template built on Andrej Karpathy's nanochat: tokenizer, pretraining, SFT, and eval in a single script. Fork it, hit launch, and explore the full ChatGPT-style training pipeline end to end.

#autoresearch #transformers #ml-systems

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
02 Jun 2026
Hao Zhong
Muzhi Zhu
Shenyan Zeng

This research introduces ReasonMatch-Bench, a benchmark for evaluating complex spatial reasoning in multimodal large language models (MLLMs) through wide-baseline matching tasks, and proposes Dynamic Correspondence Reinforcement Learning (DCRL). DCRL significantly enhances MLLM performance on cross-view spatial reasoning, achieving an F1 score of 70.5 on ReasonMatch-Bench and demonstrating positive transfer to other spatial intelligence benchmarks.

#computer-science #computer-vision-and-pattern-recognition #data-curation

Do Transformers Need Three Projections? Systematic Study of QKV Variants
04 Jun 2026
Ali Kayyam
Anusha Madan Gopal
M Anthony Lewis

Researchers systematically evaluated Transformer self-attention QKV projection variants, finding that unifying Key and Value projections (Q-K=V) reduces KV cache memory by 50%. This approach maintains model quality with a 2.48% perplexity degradation for 1.2B models and a 0.41% average loss in downstream task accuracy, enabling doubled context window capacity or throughput during inference.
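
As a rough illustration of the 50% figure above, the cache arithmetic can be sketched as follows. The model dimensions (32 layers, 8 KV heads, head size 128, fp16 storage, 8,192 tokens) are hypothetical and not taken from the paper, and `kv_cache_bytes` is an illustrative helper. The sketch only assumes that a standard cache stores one key tensor and one value tensor per layer, while a shared K=V projection needs a single tensor.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, shared_kv=False):
    """Estimate the per-sequence KV cache size in bytes.

    shared_kv=True models the assumption that keys and values come from
    one shared projection, so only one tensor per layer is cached.
    """
    tensors_per_layer = 1 if shared_kv else 2
    return (tensors_per_layer * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_elem)


# Hypothetical configuration, chosen only to make the arithmetic concrete.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

standard = kv_cache_bytes(**cfg)
shared = kv_cache_bytes(**cfg, shared_kv=True)

print(f"standard K and V: {standard / 2**20:.0f} MiB")
print(f"shared K=V:       {shared / 2**20:.0f} MiB")
print(f"reduction:        {1 - shared / standard:.0%}")
```

Under a fixed memory budget, the halved cache leaves room for roughly twice the context length or twice the number of concurrent sequences, which is the trade-off the summary describes.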

#attention-mechanisms #computer-science #artificial-intelligence

Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation
03 Jun 2026
Yuxuan Bian
Zeyue Xue
Songchun Zhang

Echo-Infinity introduces an autoregressive framework for real-time, infinite video generation by integrating learnable memory queries and a unified relative positional encoding strategy. The framework achieves state-of-the-art performance on various video benchmarks and successfully demonstrates consistent video generation for durations up to 24 hours.

#attention-mechanisms #computer-science #computer-vision-and-pattern-recognition

Trust Region On-Policy Distillation
03 Jun 2026
Xingrun Xing
Haoqing Wang
Boyan Gao

Trust Region On-Policy Distillation (TrOPD) introduces an adaptive trust-region mechanism and outlier-aware supervision for On-Policy Distillation, enhancing the stability and reliability of training small reasoning models. This method consistently improves performance over baseline OPD approaches by 3-6 points across mathematical reasoning, code generation, and STEM benchmarks.

#agents #computer-science #computation-and-language

OneReason Technical Report
04 Jun 2026
OneRec Team
Biao Yang
Boyang Ding

OneReason is a generative recommendation foundation model that integrates robust itemic token perception and sophisticated, recommendation-specific cognition. It enables a "thinking mode" to consistently outperform a "non-thinking mode" on Kuaishou's real-world benchmarks, resulting in significant online business uplifts.

#chain-of-thought #computer-science #artificial-intelligence

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization
01 Jun 2026
Junhao Cheng
Liang Hou
Tianxiong Zhong

A framework termed "VLM-as-Teacher" leverages Vision-Language Models (VLMs) as supervisors providing differentiable feedback to guide Video Generation Models (VGMs) in generating logically consistent video trajectories. The approach improves overall reasoning performance by 0.115 points on VBVR-Bench and 21.8 points on RULER-Bench while maintaining computational efficiency.

#computer-science #computer-vision-and-pattern-recognition #fine-tuning

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses
01 Jun 2026
Pengcheng Jiang
Zhiyi Shi
Kelly Hong

The paper "Harness-1" introduces a 20B search agent trained with reinforcement learning (RL) within a stateful harness that offloads complex state management from the LLM policy to the environment. This design enabled the agent to achieve an average curated recall of 0.730 across eight diverse retrieval benchmarks, outperforming open-source baselines by 11.4 points and demonstrating stronger generalization on unseen domains.

#agentic-frameworks #agents #computer-science

SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents
04 Jun 2026
Wenxuan Wang
Haoyu Sun
Fukuan Hou
Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these memories grow, they may reinforce one another, diverge across contexts, or directly conflict, making correct assistance depend on memory relations rather than isolated recall. Existing long-term memory benchmarks rarely probe how agents preserve and utilize such relations during downstream tasks. To address this gap, we introduce SubtleMemory, a benchmark for fine-grained relational memory discrimination in long-running AI agents. SubtleMemory constructs relation-controlled latent semantic artifacts whose variants instantiate complementary, nuanced, or contradictory relations, and embeds them into realistic user-agent histories, requiring agents to recover distributed relational structures during later queries and instructions. The benchmark contains 1,522 evaluation instances over 10 long histories, grounded in 1,090 relation-controlled memory-variant sets and spanning user-related and non-user-related queries. Evaluating six standalone memory systems, two Claw-style agents with native memory modules, and three Claw-style agents with plugin memory modules, we find that current systems remain weak on fine-grained relational memory discrimination. We further introduce diagnostic protocols that reveal distinct capability profiles across memory preservation, retrieval, and downstream reasoning stages.
#agentic-frameworks #agents #computer-science

GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors
03 Jun 2026
Tianyi Xie
Haotian Zhang
Jinhyung Park

GRAIL presents a fully digital pipeline for generating robot-compatible 4D human-object interaction data to train humanoid loco-manipulation policies. The system synthesizes over 20,000 physically plausible sequences from 3D assets and video priors, allowing egocentric visual policies trained exclusively on this synthetic data to achieve up to 90% real-world success rates on a Unitree G1 robot for tasks like stair-climbing and object pick-up.

#computer-science #robotics

On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters
02 Jun 2026
Mind Lab
Vin Bo
Song Cao

Mind Lab researchers propose a framework that re-conceptualizes Parameter-Efficient Fine-Tuning (PEFT) as a fundamental mechanism for scaling personalized AI. They demonstrate the operational feasibility of adapting trillion-parameter foundation models with lightweight adapters, manage adaptive states efficiently through a new infrastructure, and show how population-level personalization can lead to collective intelligence.

#agentic-frameworks #agents #computer-science

Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models
01 Jun 2026
Guangzhao He
Rundong Luo
Wei-Chiu Ma

A new framework named Staged Executable Inverse Graphics (SEIG) reconstructs editable 3D scenes as Blender programs directly from single 2D images, utilizing only pretrained Vision-Language Models without specialized tools. This staged approach, which progressively refines geometry, materials, composition, and lighting, yields more accurate and coherent reconstructions than monolithic baselines, demonstrating the latent 3D reasoning capabilities of general-purpose VLMs.

#agentic-frameworks #agents #chain-of-thought

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling
02 Jun 2026
Wenxi Chen
Dongya Jia
Yushen Chen

WavTTS establishes high-quality zero-shot Text-to-Speech by directly generating raw audio waveforms using a flow matching framework with a Diffusion Transformer, eliminating reliance on lossy intermediate representations. It achieves competitive intelligibility and naturalness, recording a 1.50% WER and 3.92 UTMOS on English test sets, outperforming previous end-to-end models.

#computer-science #sound #audio-and-speech-processing

Self-Distilled Policy Gradient
02 Jun 2026
Yifeng Liu
Shiyuan Zhang
Yifan Zhang

Self-Distilled Policy Gradient (SDPG) introduces a framework for training large language models on complex reasoning tasks, combining sparse verifier-based outcome rewards with dense, full-vocabulary on-policy self-distillation and KL regularization. The method achieves higher accuracy and improved training stability on mathematical reasoning benchmarks by mitigating issues like sparse credit assignment and mode collapse.

#computer-science #machine-learning #deep-reinforcement-learning

Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking
02 Jun 2026
Zekun Qi
Xuchuan Chen
Dairu Liu

Researchers from Tsinghua University, Galbot Inc., and collaborators developed Humanoid-GPT, a GPT-style Transformer that achieves robust, real-time, zero-shot whole-body motion tracking for humanoid robots. By training on an unprecedented 2-billion-frame motion corpus, the system demonstrates high fidelity on unseen, dynamic tasks and successfully transfers to physical hardware, addressing a long-standing trade-off between tracking agility and generalization.

#computer-science #artificial-intelligence #computer-vision-and-pattern-recognition

Benchmarking Visual State Tracking in Multimodal Video Understanding
02 Jun 2026
Sihyun Yu
Nanye Ma
Pinzhi Huang

A new benchmark, VSTAT, evaluates Multimodal Large Language Models (MLLMs) on continuous visual state tracking within videos, revealing a significant performance gap where human accuracy reaches 90.5% compared to 44.4% for the top proprietary MLLM and 35.1% for the best open-source MLLM. The study identifies visual perception, rather than reasoning, as the primary bottleneck for MLLMs in these tasks.

#agents #computer-science #computer-vision-and-pattern-recognition

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?
03 Jun 2026
Zhangchen Xu
Junda Chen
Yue Huang

The AUTOLAB benchmark evaluates frontier large language model agents on ultra-long-horizon, closed-loop optimization tasks across diverse scientific and engineering domains. It shows that while some top models, such as Claude Opus 4.6, can sustain iterative improvements for hours, many frontier models struggle with time management, persistent iteration, and learning from empirical feedback.

#agentic-frameworks #agents #computer-science

Reinforcement Learning from Rich Feedback with Distributional DAgger
03 Jun 2026
Rishabh Agrawal
Jacob Fein-Ashley
Paria Rashidinejad

University of Southern California researchers developed DistIL, a distributional imitation learning algorithm that leverages rich feedback to enhance reasoning capabilities in large language models. DistIL guarantees monotonic policy improvement and achieves superior performance across scientific reasoning, coding, and mathematical tasks, outperforming previous methods by up to 9.6 points on scientific reasoning and 3.8 points on hard mathematical problems.

#agents #computer-science #artificial-intelligence