NVIDIA introduces Cosmos 3, a family of omnimodal world models that jointly process and generate language, image, video, audio, and action sequences within a unified Mixture-of-Transformers architecture for Physical AI. This framework achieves competitive to state-of-the-art performance across 48 understanding benchmarks and leads open-source models in various image, video, and robot policy generation tasks, including a 39.7% success rate on RoboLab.
Researchers from Alibaba developed Qwen-Image-Flash, a unified visual foundation model for text-to-image generation and instruction-guided editing that operates with just 4 function evaluations (NFEs). The model achieves performance levels comparable to or exceeding its 80-NFE teacher by systematically optimizing the distillation training recipe, including data composition, teacher guidance, and task mixture.
Microsoft AI developed MAI-Thinking-1, a 35B active / 1T total parameter Mixture-of-Experts (MoE) model, using a "hill-climbing machine" framework for continuous, systematic improvement by training exclusively on human-generated data. The model achieved competitive performance against other frontier LLMs in STEM reasoning, coding, and general capabilities, while maintaining a strong helpfulness-safety balance.
An autoresearch template built on Andrej Karpathy's nanochat: tokenizer, pretraining, SFT, and eval in a single script. Fork it, hit launch, and explore the full ChatGPT-style training pipeline end to end.
This research introduces ReasonMatch-Bench, a benchmark for evaluating complex spatial reasoning in multimodal large language models (MLLMs) through wide-baseline matching tasks, and proposes Dynamic Correspondence Reinforcement Learning (DCRL). DCRL significantly enhances MLLM performance on cross-view spatial reasoning, achieving an F1 score of 70.5 on ReasonMatch-Bench and demonstrating positive transfer to other spatial intelligence benchmarks.
Researchers systematically evaluated Transformer self-attention QKV projection variants, finding that unifying Key and Value projections (Q-K=V) reduces KV cache memory by 50%. This approach maintains model quality with a 2.48% perplexity degradation for 1.2B models and a 0.41% average loss in downstream task accuracy, enabling doubled context window capacity or throughput during inference.
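The 50% figure in the Q-K=V summary above follows from simple bookkeeping: a standard KV cache stores two tensors per layer (keys and values), while a shared K=V projection needs only one. The sketch below is a back-of-the-envelope estimate, not the authors' code; the function name `kv_cache_bytes` and the configuration values are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2, shared_kv=False):
    """Estimate KV cache size in bytes for a decoder-only Transformer.

    A standard cache holds 2 tensors per layer (K and V). With a shared
    K=V projection only 1 tensor per layer has to be cached.
    """
    tensors_per_layer = 1 if shared_kv else 2
    return (tensors_per_layer * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical configuration (NOT taken from the paper), fp16 values.
config = dict(num_layers=24, num_kv_heads=16, head_dim=128, seq_len=4096)

separate = kv_cache_bytes(**config)
shared = kv_cache_bytes(**config, shared_kv=True)

print(f"separate K and V: {separate / 2**20:.0f} MiB")
print(f"shared K=V:       {shared / 2**20:.0f} MiB")
```

Under the same memory budget, the shared variant can cache twice as many tokens, which is where the doubled context window capacity comes from.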
Echo-Infinity introduces an autoregressive framework for real-time, infinite video generation by integrating learnable memory queries and a unified relative positional encoding strategy. The framework achieves state-of-the-art performance on various video benchmarks and successfully demonstrates consistent video generation for durations up to 24 hours.
Trust Region On-Policy Distillation (TrOPD) introduces an adaptive trust-region mechanism and outlier-aware supervision for On-Policy Distillation, enhancing the stability and reliability of training small reasoning models. This method consistently improves performance over baseline OPD approaches by 3-6 points across mathematical reasoning, code generation, and STEM benchmarks.
OneReason is a generative recommendation foundation model that integrates robust itemic token perception and sophisticated, recommendation-specific cognition. Its "thinking mode" consistently outperforms its "non-thinking mode" on Kuaishou's real-world benchmarks and yields significant online business uplifts.
A framework termed "VLM-as-Teacher" leverages Vision-Language Models (VLMs) as supervisors providing differentiable feedback to guide Video Generation Models (VGMs) in generating logically consistent video trajectories. The approach improves overall reasoning performance by 0.115 points on VBVR-Bench and 21.8 points on RULER-Bench while maintaining computational efficiency.
The paper "Harness-1" introduces a 20B search agent trained with reinforcement learning (RL) within a stateful harness that offloads complex state management from the LLM policy to the environment. This design enabled the agent to achieve an average curated recall of 0.730 across eight diverse retrieval benchmarks, outperforming open-source baselines by 11.4 points and demonstrating stronger generalization on unseen domains.
GRAIL presents a fully digital pipeline for generating robot-compatible 4D human-object interaction data to train humanoid loco-manipulation policies. The system synthesizes over 20,000 physically plausible sequences from 3D assets and video priors, allowing egocentric visual policies trained exclusively on this synthetic data to achieve up to 90% real-world success rates on a Unitree G1 robot for tasks like stair-climbing and object pick-up.
Mind Lab researchers propose a framework that re-conceptualizes Parameter-Efficient Fine-Tuning (PEFT) as a fundamental mechanism for scaling personalized AI. They demonstrate the operational feasibility of adapting trillion-parameter foundation models with lightweight adapters, manage adaptive states efficiently through a new infrastructure, and show how population-level personalization can lead to collective intelligence.
A new framework named Staged Executable Inverse Graphics (SEIG) reconstructs editable 3D scenes as Blender programs directly from single 2D images, utilizing only pretrained Vision-Language Models without specialized tools. This staged approach, which progressively refines geometry, materials, composition, and lighting, yields more accurate and coherent reconstructions than monolithic baselines, demonstrating the latent 3D reasoning capabilities of general-purpose VLMs.
WavTTS establishes high-quality zero-shot Text-to-Speech by directly generating raw audio waveforms using a flow matching framework with a Diffusion Transformer, eliminating reliance on lossy intermediate representations. It achieves competitive intelligibility and naturalness, recording a 1.50% WER and 3.92 UTMOS on English test sets, outperforming previous end-to-end models.
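For readers unfamiliar with the WER figure quoted in the WavTTS summary above: word error rate is conventionally the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The sketch below shows that standard definition only; the summary does not describe the paper's evaluation pipeline (ASR model, text normalization), so the function name `word_error_rate` and the sample sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Counts substitutions, insertions, and deletions with equal cost.
    No text normalization is applied; real evaluations usually
    lowercase and strip punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative transcripts (made up): one dropped word out of six.
rate = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {rate:.2%}")
```

A 1.50% WER therefore means roughly 1.5 word errors per 100 reference words after the synthesized speech is transcribed back to text.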
Self-Distilled Policy Gradient (SDPG) introduces a framework for training large language models on complex reasoning tasks, combining sparse verifier-based outcome rewards with dense, full-vocabulary on-policy self-distillation and KL regularization. The method achieves higher accuracy and improved training stability on mathematical reasoning benchmarks by mitigating issues like sparse credit assignment and mode collapse.
Researchers from Tsinghua University, Galbot Inc., and collaborators developed Humanoid-GPT, a GPT-style Transformer that achieves robust, real-time, zero-shot whole-body motion tracking for humanoid robots. By training on an unprecedented 2-billion-frame motion corpus, the system demonstrates high fidelity on unseen, dynamic tasks and successfully transfers to physical hardware, addressing a long-standing trade-off between tracking agility and generalization.
A new benchmark, VSTAT, evaluates Multimodal Large Language Models (MLLMs) on continuous visual state tracking within videos, revealing a significant performance gap where human accuracy reaches 90.5% compared to 44.4% for the top proprietary MLLM and 35.1% for the best open-source MLLM. The study identifies visual perception, rather than reasoning, as the primary bottleneck for MLLMs in these tasks.
The AUTOLAB benchmark evaluates frontier large language model agents on ultra-long-horizon, closed-loop optimization tasks across diverse scientific and engineering domains. It demonstrates that while some top models like Claude Opus 4.6 can sustain iterative improvements for hours, many frontier models struggle with time management, persistent iteration, and learning from empirical feedback.
University of Southern California researchers developed DistIL, a distributional imitation learning algorithm that leverages rich feedback to enhance reasoning capabilities in large language models. DistIL guarantees monotonic policy improvement and achieves superior performance across scientific reasoning, coding, and mathematical tasks, outperforming previous methods by up to 9.6 points on scientific reasoning and 3.8 points on hard mathematical problems.