{"ok":true,"snapshot":{"date":"2026-05-04","capturedAt":"2026-05-04T22:25:19.481Z","total_papers":30,"raw_count":50,"papers":[{"paperId":"2604.27351","title":"Heterogeneous Scientific Foundation Model Collaboration","summary":"Agentic large language model systems have demonstrated strong capabilities. However, their reliance on language as the universal interface fundamentally limits their applicability to many real-world problems, especially in scientific domains where domain-specific foundation models have been developed to address specialized tasks beyond natural language. In this work, we introduce Eywa, a heterogeneous agentic framework designed to extend language-centric systems to a broader class of scientific foundation models. The key idea of Eywa is to augment domain-specific foundation models with a language-model-based reasoning interface, enabling language models to guide inference over non-linguistic data modalities. This design allows predictive foundation models, which are typically optimized fo…","authors":["Zihao Li","Jiaru Zou","Feihao Fang","Xuying Ning","Mengting Ai","Tianxin Wei","Sirui Chen","Xiyuan Yang"],"publishedAt":"2026-04-30T00:00:00.000Z","submittedAt":null,"upvotes":196,"num_comments":3,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.27351.png","hf_url":"https://huggingface.co/papers/2604.27351","arxiv_url":"https://arxiv.org/abs/2604.27351","github_repo":null,"github_stars":null,"ai_keywords":["agentic framework","domain-specific foundation models","language-model-based reasoning","non-linguistic data modalities","predictive foundation models","multi-agent systems","planning-based orchestration","heterogeneous data modalities"]},{"paperId":"2604.28185","title":"Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling","summary":"Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators. 
We analyze key technical drivers, including flow matching, unified understanding-and-generation …","authors":["Keming Wu","Zuhao Yang","Kaichen Zhang","Shizun Wang","Haowei Zhu","Sicong Leng","Zhongyu Yang","Qijie Wang"],"publishedAt":"2026-04-30T00:00:00.000Z","submittedAt":null,"upvotes":83,"num_comments":4,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.28185.png","hf_url":"https://huggingface.co/papers/2604.28185","arxiv_url":"https://arxiv.org/abs/2604.28185","github_repo":null,"github_stars":null,"ai_keywords":["visual generation models","photorealism","spatial reasoning","long-horizon consistency","causal understanding","flow matching","unified understanding-and-generation models","visual representations","post-training","reward modeling"]},{"paperId":"2605.00658","title":"UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors","summary":"Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation. UniVidX formulates pixel-aligned tasks as conditional generation in a shared multimodal space, adapts to modality-specific distributions while preserving the backbone's native priors, and promotes cross-modal consistency during synthesis. It is built on three key designs. Stochastic Condition Masking (SCM) randomly partitions modalities into clean conditions and noisy targets during training, …","authors":["Houyuan Chen","Hong Li","Xianghao Kong","Tianrui Zhu","Shaocong Xu","Weiqing Xiao","Yuwei Guo","Chongjie Ye"],"publishedAt":"2026-05-01T00:00:00.000Z","submittedAt":null,"upvotes":67,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2605.00658.png","hf_url":"https://huggingface.co/papers/2605.00658","arxiv_url":"https://arxiv.org/abs/2605.00658","github_repo":null,"github_stars":null,"ai_keywords":["video diffusion models","multimodal graphics tasks","conditional generation","stochastic condition masking","decoupled gated LoRA","cross-modal self-attention","modality-specific distributions","omnidirectional conditional generation","video generation"]},{"paperId":"2604.27083","title":"Co-Evolving Policy Distillation","summary":"RLVR and OPD have become standard paradigms for post-training. We provide a unified analysis of these two paradigms in consolidating multiple expert capabilities into a single model, identifying capability loss in different ways: mixed RLVR suffers from inter-capability divergence cost, while the pipeline of first training experts and then performing OPD, though avoiding divergence, fails to fully absorb teacher capabilities due to large behavioral pattern gaps between teacher and student. We propose Co-Evolving Policy Distillation (CoPD), which encourages parallel training of experts and introduces OPD during each expert's ongoing RLVR training rather than after complete expert training, with experts serving as mutual teachers (making OPD bidirectional) to co-evolve. 
This enables more co…","authors":["Naibin Gu","Chenxu Yang","Qingyi Si","Chuanyu Qin","Dingyu Yao","Peng Fu","Zheng Lin","Weiping Wang"],"publishedAt":"2026-04-29T00:00:00.000Z","submittedAt":null,"upvotes":51,"num_comments":3,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.27083.png","hf_url":"https://huggingface.co/papers/2604.27083","arxiv_url":"https://arxiv.org/abs/2604.27083","github_repo":null,"github_stars":null,"ai_keywords":["post-training","RLVR","OPD","policy distillation","Co-Evolving Policy Distillation","expert capabilities","behavioral pattern gaps","mutual teachers","bidirectional policy distillation","multi-modal reasoning"]},{"paperId":"2604.28158","title":"Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists","summary":"Existing research infrastructure is fundamentally document-centric, providing citation links between papers but lacking explicit representations of methodological evolution. In particular, it does not capture the structured relationships that explain how and why research methods emerge, adapt, and build upon one another. With the rise of AI-driven research agents as a new class of consumers of scientific knowledge, this limitation becomes increasingly consequential, as such agents cannot reliably reconstruct method evolution topologies from unstructured text. We introduce Intern-Atlas, a methodological evolution graph that automatically identifies method-level entities, infers lineage relationships among methodologies, and captures the bottlenecks that drive transitions between successive…","authors":["Yujun Wu","Dongxu Zhang","Xinchen Li","Jinhang Xu","Yiling Duan","Yumou Liu","Jiabao Pan","Xuanhe Zhou"],"publishedAt":"2026-04-30T00:00:00.000Z","submittedAt":null,"upvotes":40,"num_comments":4,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.28158.png","hf_url":"https://huggingface.co/papers/2604.28158","arxiv_url":"https://arxiv.org/abs/2604.28158","github_repo":null,"github_stars":null,"ai_keywords":["methodological evolution graph","temporal tree search algorithm","causal network","automated scientific discovery"]},{"paperId":"2604.27711","title":"ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control","summary":"Humanoid control systems have made significant progress in recent years, yet modeling fluent interaction-rich behavior between a robot, its surrounding environment, and task-relevant objects remains a fundamental challenge. This difficulty arises from the need to jointly capture spatial context, temporal dynamics, robot actions, and task intent at scale, which is a poor match to conventional supervision. We propose ExoActor, a novel framework that leverages the generalization capabilities of large-scale video generation models to address this problem. The key insight in ExoActor is to use third-person video generation as a unified interface for modeling interaction dynamics. Given a task instruction and scene context, ExoActor synthesizes plausible execution processes that implicitly enco…","authors":["Yanghao Zhou","Jingyu Ma","Yibo Peng","Zhenguo Sun","Yu Bai","Börje F. 
Karlsson"],"publishedAt":"2026-04-30T00:00:00.000Z","submittedAt":null,"upvotes":39,"num_comments":4,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.27711.png","hf_url":"https://huggingface.co/papers/2604.27711","arxiv_url":"https://arxiv.org/abs/2604.27711","github_repo":null,"github_stars":null,"ai_keywords":["video generation","interaction dynamics","humanoid control","task-conditioned behavior","motion estimation","motion controller","generative models","end-to-end system"]},{"paperId":"2604.27085","title":"Efficient Training on Multiple Consumer GPUs with RoundPipe","summary":"Fine-tuning Large Language Models (LLMs) on consumer-grade GPUs is highly cost-effective, yet constrained by limited GPU memory and slow PCIe interconnects. Pipeline parallelism combined with CPU offloading mitigates these hardware bottlenecks by reducing communication overhead. However, existing PP schedules suffer from an inherent limitation termed the weight binding issue. Binding uneven model stages (e.g., the LM head is large) to GPUs limits the pipeline's throughput to that of the GPU with the heaviest load, leading to severe pipeline bubbles.\n  In this paper, we propose RoundPipe, a novel pipeline schedule that breaks the weight binding constraint on consumer GPU servers. RoundPipe treats GPUs as a pool of stateless execution workers and dynamically dispatches computation stages ac…","authors":["Yibin Luo","Shiwei Gao","Huichuan Zheng","Youyou Lu","Jiwu Shu"],"publishedAt":"2026-04-29T00:00:00.000Z","submittedAt":null,"upvotes":35,"num_comments":3,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.27085.png","hf_url":"https://huggingface.co/papers/2604.27085","arxiv_url":"https://arxiv.org/abs/2604.27085","github_repo":null,"github_stars":null,"ai_keywords":["pipeline parallelism","CPU offloading","weight binding issue","pipeline bubbles","RoundPipe","stateless execution workers","round-robin dispatching","distributed event-based synchronization","layer partitioning","LoRA fine-tuning"]},{"paperId":"2604.28139","title":"Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows","summary":"LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or verify whether a task was executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is constructed from public workflow-demand signals, with ClawHub Top-500 skills used in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. 
For grading, Claw-Eva…","authors":["Chenxin Li","Zhengyang Tang","Huangxin Lin","Yunlong Lin","Shijue Huang","Shengyuan Liu","Bowen Ye","Rang Li"],"publishedAt":"2026-04-30T00:00:00.000Z","submittedAt":null,"upvotes":33,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.28139.png","hf_url":"https://huggingface.co/papers/2604.28139","arxiv_url":"https://arxiv.org/abs/2604.28139","github_repo":null,"github_stars":null,"ai_keywords":["workflow agents","live benchmark","execution traces","audit logs","structured LLM judging","task families","execution surface","workflow automation"]},{"paperId":"2604.27505","title":"Leveraging Verifier-Based Reinforcement Learning in Image Editing","summary":"While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each principle, and aggrega…","authors":["Hanzhong Guo","Jie Wu","Jie Liu","Yu Gao","Zilyu Ye","Linxiao Yuan","Xionghui Wang","Yizhou Yu"],"publishedAt":"2026-04-30T00:00:00.000Z","submittedAt":null,"upvotes":30,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.27505.png","hf_url":"https://huggingface.co/papers/2604.27505","arxiv_url":"https://arxiv.org/abs/2604.27505","github_repo":null,"github_stars":null,"ai_keywords":["Reinforcement Learning from Human Feedback","image editing","reward model","chain-of-thought","reasoning verifier","supervised fine-tuning","Group Contrastive Preference Optimization","reinforcement learning","non-differentiable reward model","GRPO"]},{"paperId":"2604.27221","title":"Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction","summary":"Agentic web search increasingly faces two distinct demands: deep reasoning over a single target, and structured aggregation across many entities and heterogeneous sources. Current systems struggle on both fronts. Breadth-oriented tasks demand schema-aligned outputs with wide coverage and cross-entity consistency, while depth-oriented tasks require coherent reasoning over long, branching search trajectories. We introduce Web2BigTable, a multi-agent framework for web-to-table search that supports both regimes. Web2BigTable adopts a bi-level architecture in which an upper-level orchestrator decomposes the task into sub-problems and lower-level worker agents solve them in parallel. 
Through a closed-loop run--verify--reflect process, the framework jointly improves decomposition and execution o…","authors":["Yuxuan Huang","Yihang Chen","Zhiyuan He","Yuxiang Chen","Ka Yiu Lee","Huichi Zhou","Weilin Luo","Meng Fang"],"publishedAt":"2026-04-29T00:00:00.000Z","submittedAt":null,"upvotes":25,"num_comments":3,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.27221.png","hf_url":"https://huggingface.co/papers/2604.27221","arxiv_url":"https://arxiv.org/abs/2604.27221","github_repo":null,"github_stars":null,"ai_keywords":["multi-agent framework","bi-level architecture","task decomposition","parallel execution","closed-loop run--verify--reflect process","external memory","shared workspace","coordinated agents","iterative improvement"]},{"paperId":"2604.28190","title":"Representation Fréchet Loss for Visual Generation","summary":"We show that Fréchet Distance (FD), long considered impractical as a training objective, can in fact be effectively optimized in the representation space. Our idea is simple: decouple the population size for FD estimation (e.g., 50k) from the batch size for gradient computation (e.g., 1024). We term this approach FD-loss. Optimizing FD-loss reveals several surprising findings. First, post-training a base generator with FD-loss in different representation spaces consistently improves visual quality. Under the Inception feature space, a one-step generator achieves 0.72 FID on ImageNet 256x256. Second, the same FD-loss repurposes multi-step generators into strong one-step generators without teacher distillation, adversarial training or per-sample targets. Third, FID can misrank visual quality…","authors":["Jiawei Yang","Zhengyang Geng","Xuan Ju","Yonglong Tian","Yue Wang"],"publishedAt":"2026-04-30T00:00:00.000Z","submittedAt":null,"upvotes":21,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.28190.png","hf_url":"https://huggingface.co/papers/2604.28190","arxiv_url":"https://arxiv.org/abs/2604.28190","github_repo":null,"github_stars":null,"ai_keywords":["Fréchet Distance","FD-loss","representation space","Inception feature space","FID","multi-step generators","one-step generators","distributional distances"]},{"paperId":"2604.27039","title":"Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling","summary":"Token serves as the fundamental unit of computation in modern autoregressive models, and generation length directly influences both inference cost and reasoning performance. Despite its importance, existing approaches lack fine-grained length modeling, operating primarily at the coarse-grained sequence level. We introduce the Length Value Model (LenVM), a token-level framework that models the remaining generation length. By formulating length modeling as a value estimation problem and assigning a constant negative reward to each generated token, LenVM predicts a bounded, discounted return that serves as a monotone proxy for the remaining generation horizon. This formulation yields supervision that is annotation-free, dense, unbiased, and scalable. 
Experiments on LLMs and VLMs demonstrate …","authors":["Zhen Zhang","Changyi Yang","Zijie Xia","Zhen Yang","Chengzhi Liu","Zhaotiao Weng","Yepeng Liu","Haobo Chen"],"publishedAt":"2026-04-29T00:00:00.000Z","submittedAt":null,"upvotes":20,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.27039.png","hf_url":"https://huggingface.co/papers/2604.27039","arxiv_url":"https://arxiv.org/abs/2604.27039","github_repo":null,"github_stars":null,"ai_keywords":["Length Value Model","token-level framework","autoregressive models","generation length","value estimation","reinforcement learning","token budget","LLMs","VLMs","GSM8K"]},{"paperId":"2604.24954","title":"Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence","summary":"We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, enabled by advances in architecture, training data and recipes. In particular, Nemotron 3 delivers leading results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the highly efficient Nemotron 3 Nano 30B-A3B backbone, Nemotron 3 Nano Omni further incorporates innovative multimodal token-reduction techniques to deliver substantially lower inference latency and higher throughput than other models of similar size. We are releasing model checkpo…","authors":["NVIDIA","Amala Sanjay Deshmukh","Kateryna Chumachenko","Tuomas Rintamaki","Matthieu Le","Tyler Poon","Danial Mohseni Taheri","Ilia Karmanov"],"publishedAt":"2026-04-27T00:00:00.000Z","submittedAt":null,"upvotes":17,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.24954.png","hf_url":"https://huggingface.co/papers/2604.24954","arxiv_url":"https://arxiv.org/abs/2604.24954","github_repo":null,"github_stars":null,"ai_keywords":["multimodal","audio inputs","text inputs","image inputs","video inputs","document understanding","long audio-video comprehension","agentic computer use","token-reduction techniques","inference latency"]},{"paperId":"2604.28130","title":"MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons","summary":"Recent methods for arbitrary-skeleton motion capture from monocular video follow a factorized pipeline, where a Video-to-Pose network predicts joint positions and an analytical inverse-kinematics (IK) stage recovers joint rotations. While effective, this design is inherently limited, since joint positions do not fully determine rotations and leave degrees of freedom such as bone-axis twist ambiguous, and the non-differentiable IK stage prevents the system from adapting to noisy predictions or optimizing for the final animation objective. In this work, we present the first fully end-to-end framework in which both Video-to-Pose and Pose-to-Rotation are learnable and jointly optimized. 
We observe that the ambiguity in pose-to-rotation mapping arises from missing coordinate system information…","authors":["Kehong Gong","Zhengyu Wen","Dao Thien Phong","Mingxi Xu","Weixia He","Qi Wang","Ning Zhang","Zhengyu Li"],"publishedAt":"2026-04-30T00:00:00.000Z","submittedAt":null,"upvotes":16,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.28130.png","hf_url":"https://huggingface.co/papers/2604.28130","arxiv_url":"https://arxiv.org/abs/2604.28130","github_repo":null,"github_stars":null,"ai_keywords":["Video-to-Pose network","inverse-kinematics","joint positions","joint rotations","end-to-end framework","rotation prediction","coordinate system information","reference pose-rotation pair","rest pose","skeleton-aware Global-Local Graph-guided Multi-Head Attention"]},{"paperId":"2604.28181","title":"Synthetic Computers at Scale for Long-Horizon Productivity Simulation","summary":"Realistic long-horizon productivity work is strongly conditioned on user-specific computer environments, where much of the work context is stored and organized through directory structures and content-rich artifacts. To scale synthetic data creation for such productivity scenarios, we introduce Synthetic Computers at Scale, a scalable methodology for creating such environments with realistic folder hierarchies and content-rich artifacts (e.g., documents, spreadsheets, and presentations). Conditioned on each synthetic computer, we run long-horizon simulations: one agent creates productivity objectives that are specific to the computer's user and require multiple professional deliverables and about a month of human work; another agent then acts as that user and keeps working across the comp…","authors":["Tao Ge","Baolin Peng","Hao Cheng","Jianfeng Gao"],"publishedAt":"2026-04-30T00:00:00.000Z","submittedAt":null,"upvotes":15,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.28181.png","hf_url":"https://huggingface.co/papers/2604.28181","arxiv_url":"https://arxiv.org/abs/2604.28181","github_repo":null,"github_stars":null,"ai_keywords":["synthetic data creation","long-horizon simulations","agent-based modeling","experiential learning","agent self-improvement","agentic reinforcement learning"]},{"paperId":"2604.27151","title":"Step-level Optimization for Efficient Computer-use Agents","summary":"Computer-use agents provide a promising path toward general software automation because they can interact directly with arbitrary graphical user interfaces instead of relying on brittle, application-specific integrations. Despite recent advances in benchmark performance, strong computer-use agents remain expensive and slow in practice, since most systems invoke large multimodal models at nearly every interaction step. We argue that this uniform allocation of compute is fundamentally inefficient for long-horizon GUI tasks. Such trajectories are highly heterogeneous: many steps are routine and can be handled reliably by smaller, cheaper policies, while errors tend to concentrate at a relatively small number of high-risk moments. 
Across computer-use benchmarks, these failures repeatedly take…","authors":["Jinbiao Wei","Kangqi Ni","Yilun Zhao","Guo Gan","Arman Cohan"],"publishedAt":"2026-04-29T00:00:00.000Z","submittedAt":null,"upvotes":14,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.27151.png","hf_url":"https://huggingface.co/papers/2604.27151","arxiv_url":"https://arxiv.org/abs/2604.27151","github_repo":null,"github_stars":null,"ai_keywords":["computer-use agents","graphical user interfaces","multimodal models","compute allocation","event-driven cascade","Stuck Monitor","Milestone Monitor","semantic drift","progress stalls","risk detection"]},{"paperId":"2604.23774","title":"Prox-E: Fine-Grained 3D Shape Editing via Primitive-Based Abstractions","summary":"Text-based 2D image editing models have recently reached an impressive level of maturity, motivating a growing body of work that heavily depends on these models to drive 3D edits. While effective for appearance-based modifications, such 2D-centric 3D editing pipelines often struggle with fine-grained 3D editing, where localized structural changes must be applied while strictly preserving an object's overall identity. To address this limitation, we propose Prox-E, a training-free framework that enables fine-grained 3D control through an explicit, primitive-based geometric abstraction. Our framework first abstracts an input 3D shape into a compact set of geometric primitives. A pretrained vision-language model (VLM) then edits this abstraction to specify primitive-level changes. These struc…","authors":["Etai Sella","Hao Phung","Nitay Amiel","Or Litany","Or Patashnik","Hadar Averbuch-Elor"],"publishedAt":"2026-04-29T00:00:00.000Z","submittedAt":null,"upvotes":13,"num_comments":3,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.23774.png","hf_url":"https://huggingface.co/papers/2604.23774","arxiv_url":"https://arxiv.org/abs/2604.23774","github_repo":null,"github_stars":null,"ai_keywords":["geometric primitives","vision-language model","3D generative model","3D editing","identity preservation","structural changes"]},{"paperId":"2605.00781","title":"Map2World: Segment Map Conditioned Text to 3D World Generation","summary":"3D world generation is essential for applications such as immersive content creation or autonomous driving simulation. Recent advances in 3D world generation have shown promising results; however, these methods are constrained by grid layouts and suffer from inconsistencies in object scale throughout the entire world. In this work, we introduce a novel framework, Map2World, that first enables 3D world generation conditioned on user-defined segment maps of arbitrary shapes and scales, ensuring global-scale consistency and flexibility across expansive environments. To further enhance the quality, we propose a detail enhancer network that generates fine details of the world. 
The detail enhancer enables the addition of fine-grained details without compromising overall scene coherence by incor…","authors":["Jaeyoung Chung","Suyoung Lee","Jianfeng Xiang","Jiaolong Yang","Kyoung Mu Lee"],"publishedAt":"2026-05-01T00:00:00.000Z","submittedAt":null,"upvotes":13,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2605.00781.png","hf_url":"https://huggingface.co/papers/2605.00781","arxiv_url":"https://arxiv.org/abs/2605.00781","github_repo":null,"github_stars":null,"ai_keywords":["3D world generation","segment maps","scale consistency","detail enhancer network","asset generators","scene generation"]},{"paperId":"2604.24658","title":"The Last Human-Written Paper: Agent-Native Research Artifacts","summary":"Scientific publication compresses a branching, iterative research process into a linear narrative, discarding the majority of what was discovered along the way. This compilation imposes two structural costs: a Storytelling Tax, where failed experiments, rejected hypotheses, and the branching exploration process are discarded to fit a linear narrative; and an Engineering Tax, where the gap between reviewer-sufficient prose and agent-sufficient specification leaves critical implementation details unwritten. Tolerable for human readers, these costs become critical when AI agents must understand, reproduce, and extend published work. We introduce the Agent-Native Research Artifact (ARA), a protocol that replaces the narrative paper with a machine-executable research package structured around …","authors":["Jiachen Liu","Jiaxin Pei","Jintao Huang","Chenglei Si","Ao Qu","Xiangru Tang","Runyu Lu","Lichang Chen"],"publishedAt":"2026-04-29T00:00:00.000Z","submittedAt":null,"upvotes":13,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.24658.png","hf_url":"https://huggingface.co/papers/2604.24658","arxiv_url":"https://arxiv.org/abs/2604.24658","github_repo":null,"github_stars":null,"ai_keywords":[]},{"paperId":"2604.27419","title":"InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?","summary":"With the advancement of multimodal large language models (MLLMs) and coding agents, the website development has shifted from manual programming to agent-based project-level code synthesis. Existing benchmarks rely on idealized assumptions, especially for well-structured, information-rich inputs and static execution settings. In contrast, real-world development is constrained by a critical bottleneck: the semantic misalignment between ambiguous, low-quality instructions from non-expert users and model understanding, which results in a failure mode that we term blind execution. To address this gap, we introduce InteractWeb-Bench, the first multimodal interactive benchmark for website generation under non-expert low-code user conditions. 
InteractWeb-Bench introduces four types of user agents…","authors":["Qiyao Wang","Haoran Hu","Longze Chen","Hongbo Wang","Hamid Alinejad-Rokny","Yuan Lin","Min Yang"],"publishedAt":"2026-04-30T00:00:00.000Z","submittedAt":null,"upvotes":11,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.27419.png","hf_url":"https://huggingface.co/papers/2604.27419","arxiv_url":"https://arxiv.org/abs/2604.27419","github_repo":null,"github_stars":null,"ai_keywords":["multimodal large language models","coding agents","website generation","interactive benchmark","user agents","persona-driven instruction perturbations","requirement engineering defect taxonomies","interactive execution environment","unified action space","blind execution"]},{"paperId":"2604.28169","title":"PhyCo: Learning Controllable Physical Priors for Generative Motion","summary":"Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebound, and material responses seldom match their underlying properties. We present PhyCo, a framework that introduces continuous, interpretable, and physically grounded control into video generation. Our approach integrates three key components: (i) a large-scale dataset of over 100K photorealistic simulation videos where friction, restitution, deformation, and force are systematically varied across diverse scenarios; (ii) physics-supervised fine-tuning of a pretrained diffusion model using a ControlNet conditioned on pixel-aligned physical property maps; and (iii) VLM-guided reward optimization, where a fine-tuned vision-language model evalu…","authors":["Sriram Narayanan","Ziyu Jiang","Srinivasa Narasimhan","Manmohan Chandraker"],"publishedAt":"2026-04-30T00:00:00.000Z","submittedAt":null,"upvotes":11,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.28169.png","hf_url":"https://huggingface.co/papers/2604.28169","arxiv_url":"https://arxiv.org/abs/2604.28169","github_repo":null,"github_stars":null,"ai_keywords":["video diffusion models","physics-supervised fine-tuning","ControlNet","pixel-aligned physical property maps","vision-language model","reward optimization","physics-IQ benchmark","generative video models"]},{"paperId":"2604.24026","title":"From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills","summary":"LLM agents increasingly rely on reusable skills, capability packages that combine instructions, control flow, constraints, and tool calls. In most current agent systems, however, skills are still represented by text-heavy artifacts, including SKILL.md-style documents and structured records whose machine-usable evidence remains embedded largely in natural-language descriptions. This poses a challenge for skill-centered agent systems: managing skill collections and using skills to support agent both require reasoning over invocation interfaces, execution structure, and concrete side effects that are often entangled in a single textual surface. An explicit representation of skill knowledge may therefore help make these artifacts easier for machines to acquire and leverage. 
Drawing on Memory …","authors":["Qiliang Liang","Hansi Wang","Zhong Liang","Yang Liu"],"publishedAt":"2026-04-27T00:00:00.000Z","submittedAt":null,"upvotes":10,"num_comments":3,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.24026.png","hf_url":"https://huggingface.co/papers/2604.24026","arxiv_url":"https://arxiv.org/abs/2604.24026","github_repo":null,"github_stars":null,"ai_keywords":["LLM agents","reusable skills","skill-centered agent systems","skill discovery","risk assessment","SSL representation","scheduling-structural-logical representation","memory organization packets","script theory","conceptual dependency"]},{"paperId":"2605.00416","title":"Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies","summary":"Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward …","authors":["Yi Wang","Xinchen Li","Pengwei Xie","Pu Yang","Buqing Nie","Yunuo Cai","Qinglin Zhang","Chendi Qu"],"publishedAt":"2026-05-01T00:00:00.000Z","submittedAt":null,"upvotes":9,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2605.00416.png","hf_url":"https://huggingface.co/papers/2605.00416","arxiv_url":"https://arxiv.org/abs/2605.00416","github_repo":null,"github_stars":null,"ai_keywords":["Vision-Language-Action","reinforcement learning","policy improvement","autonomous rollouts","human interventions","Distributional Implicit Value Learning","Q-learning","Adjoint Matching","flow-based action generators"]},{"paperId":"2605.00809","title":"Let ViT Speak: Generative Language-Image Pre-training","summary":"In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. 
Trained on 8B samples from Recap-Dat…","authors":["Yan Fang","Mengcheng Lan","Zilong Huang","Weixian Lei","Yunqing Zhao","Yujie Zhong","Yingchen Yu","Qi She"],"publishedAt":"2026-05-01T00:00:00.000Z","submittedAt":null,"upvotes":9,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2605.00809.png","hf_url":"https://huggingface.co/papers/2605.00809","arxiv_url":"https://arxiv.org/abs/2605.00809","github_repo":null,"github_stars":null,"ai_keywords":["Vision Transformers","multimodal large language models","autoregressive nature","language modeling objective","visual tokens","language tokens","transformer","multimodal benchmarks","Recap-DataComp-1B","multi-resolution images"]},{"paperId":"2605.00553","title":"Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance","summary":"Large Language Model (LLM) Red-Teaming, which proactively identifies vulnerabilities of LLMs, is an essential process for ensuring safety. Finding effective and diverse attacks in red-teaming is important, but achieving both is challenging. Generative Flow Networks (GFNs) that perform distribution matching are promising methods, but they are notorious for training instability and mode collapse. In particular, unstable rewards in red-teaming accelerate mode collapse. We propose Stable-GFN (S-GFN), which eliminates partition function Z estimation in GFN and reduces training instability. S-GFN avoids Z-estimation through pairwise comparisons and employs a robust masking methodology against noisy rewards. Additionally, we propose a fluency stabilizer to prevent the model from getting stuck …","authors":["Minchan Kwon","Sunghyun Baek","Minseo Kim","Jaemyung Yu","Dongyoon Han","Junmo Kim"],"publishedAt":"2026-05-01T00:00:00.000Z","submittedAt":null,"upvotes":9,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2605.00553.png","hf_url":"https://huggingface.co/papers/2605.00553","arxiv_url":"https://arxiv.org/abs/2605.00553","github_repo":null,"github_stars":null,"ai_keywords":["Generative Flow Networks","GFN","partition function Z estimation","training instability","mode collapse","pairwise comparisons","robust masking","fluency stabilizer","local optima","gibberish"]},{"paperId":"2604.25135","title":"FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments","summary":"Large Language Models are being increasingly deployed as the decision-making core of autonomous agents capable of effecting change in external environments. Yet, in conversational benchmarks, which simulate real-world customer-centric issue resolution scenarios, these agents frequently fail due to the cascading effects of incorrect decision-making. These challenges are particularly pronounced for open-source LLMs with smaller parameter sizes, limited context windows, and constrained inference budgets, which contribute to increased error accumulation in agentic settings. To tackle these challenges, we present the Failure-Aware Meta-Agentic (FAMA) framework. 
FAMA operates in two stages: first, it analyzes failure trajectories from baseline agents to identify the most prevalent errors; secon…","authors":["Amir Saeidi","Venkatesh Mishra","Souradeep Mukhopadhyay","Gaowen Liu","Ali Payani","Jayanth Srinivasa","Chitta Baral"],"publishedAt":"2026-04-28T00:00:00.000Z","submittedAt":null,"upvotes":8,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.25135.png","hf_url":"https://huggingface.co/papers/2604.25135","arxiv_url":"https://arxiv.org/abs/2604.25135","github_repo":null,"github_stars":null,"ai_keywords":["large language models","autonomous agents","decision-making","failure trajectories","orchestration mechanism","specialized agents","tool-use agents","context injection","error accumulation","multi-turn conversations"]},{"paperId":"2604.27251","title":"Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models","summary":"Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices. However, whether fundamental reasoning patterns, such as induction, deduction, and abduction, can be decoupled from specific problem instances remains a critical challenge for model controllability, and for shedding light on reasoning controllability. In this paper, we present the first systematic investigation of this problem through the lens of reasoning conflicts: an explicit tension between parametric and contextual information induced by mandating logical schemata that deviate from those expected for a target task. Our evaluation reveals that LLMs consistently prioritize sensibility over compl…","authors":["Xingwei Tan","Marco Valentino","Mahmud Elahi Akhter","Yuxiang Zhou","Maria Liakata","Nikolaos Aletras"],"publishedAt":"2026-04-29T00:00:00.000Z","submittedAt":null,"upvotes":5,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.27251.png","hf_url":"https://huggingface.co/papers/2604.27251","arxiv_url":"https://arxiv.org/abs/2604.27251","github_repo":null,"github_stars":null,"ai_keywords":["Chain-of-Thought","parametric memory","logical schemata","reasoning conflicts","instruction following","activation-level controllability","internalized parametric memory"]},{"paperId":"2604.26091","title":"Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital","summary":"We study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX Terminal Pro, a 21-day deployment in which 3,505 user-funded agents traded real ETH in a bounded onchain market. Users configured vaults through structured controls and natural-language strategies, but only agents could choose normal buy/sell trades. The system produced 7.5M agent invocations, roughly 300K onchain actions, about $20M in volume, more than 5,000 ETH deployed, roughly 70B inference tokens, and 99.9% settlement success for policy-valid submitted transactions. Long-running agents accumulated thousands of sequential decisions, including 6,000+ prompt-state-action cycles for continuously active agents, yielding a large-scale trac…","authors":["T. J. 
Barton","Chris Constantakis","Patti Hauseman","Annie Mous","Alaska Hoffman","Brian Bergeron","Hunter Goodreau"],"publishedAt":"2026-04-28T00:00:00.000Z","submittedAt":null,"upvotes":5,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2604.26091.png","hf_url":"https://huggingface.co/papers/2604.26091","arxiv_url":"https://arxiv.org/abs/2604.26091","github_repo":null,"github_stars":null,"ai_keywords":["language-model agents","tool actions","onchain market","agent invocations","inference tokens","settlement success","prompt compilation","policy validation","execution guards","memory design"]},{"paperId":"2605.00414","title":"Trees to Flows and Back: Unifying Decision Trees and Diffusion Models","summary":"Decision trees and diffusion models are ostensibly disparate model classes, one discrete and hierarchical, the other continuous and dynamic. This work unifies the two by establishing a crisp mathematical correspondence between hierarchical decision trees and diffusion processes in appropriate limiting regimes. Our unification reveals a shared optimization principle: Global Trajectory Score Matching (GTSM), for which gradient boosting (in an idealized version) is asymptotically optimal. We underscore the conceptual value of our work through two key practical instantiations: \\treeflow, which achieves competitive generation quality on tabular data with higher fidelity and a 2\\times computational speedup, and \\dsmtree, a novel distillation method that transfers hierarchical decision logic int…","authors":["Sai Niranjan Ramachandran","Suvrit Sra"],"publishedAt":"2026-05-01T00:00:00.000Z","submittedAt":null,"upvotes":4,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2605.00414.png","hf_url":"https://huggingface.co/papers/2605.00414","arxiv_url":"https://arxiv.org/abs/2605.00414","github_repo":null,"github_stars":null,"ai_keywords":["decision trees","diffusion models","hierarchical decision trees","diffusion processes","Global Trajectory Score Matching","gradient boosting","\\treeflow","\\dsmtree","neural network distillation","generative models"]},{"paperId":"2605.00273","title":"When Do Diffusion Models learn to Generate Multiple Objects?","summary":"Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce mosaic (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. 
By training diffusion models on …","authors":["Yujin Jeong","Arnas Uselis","Iro Laina","Seong Joon Oh","Anna Rohrbach"],"publishedAt":"2026-04-30T00:00:00.000Z","submittedAt":null,"upvotes":4,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2605.00273.png","hf_url":"https://huggingface.co/papers/2605.00273","arxiv_url":"https://arxiv.org/abs/2605.00273","github_repo":null,"github_stars":null,"ai_keywords":["diffusion models","multi-object generation","concept generalization","compositional generalization","dataset size","scene complexity","counting","mosaic framework"]}],"summary":{"by_keyword":[{"keyword":"reinforcement learning","count":3},{"keyword":"post-training","count":2},{"keyword":"video diffusion models","count":2},{"keyword":"video generation","count":2},{"keyword":"generative models","count":2},{"keyword":"vision-language model","count":2},{"keyword":"multimodal large language models","count":2},{"keyword":"diffusion models","count":2},{"keyword":"agentic framework","count":1},{"keyword":"domain-specific foundation models","count":1},{"keyword":"language-model-based reasoning","count":1},{"keyword":"non-linguistic data modalities","count":1},{"keyword":"predictive foundation models","count":1},{"keyword":"multi-agent systems","count":1},{"keyword":"planning-based orchestration","count":1}],"most_upvoted":{"paperId":"2604.27351","title":"Heterogeneous Scientific Foundation Model Collaboration","upvotes":196},"most_discussed":{"paperId":"2604.28185","title":"Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling","comments":4}}}}
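A minimal consumer sketch for the snapshot above, assuming the capture is saved to a hypothetical file `papers_snapshot.json`. It re-derives the keyword tally and the most-upvoted/most-discussed highlights that the `summary` block reports, directly from the `papers` array. Because this raw capture carries literal line breaks inside some string values (e.g., author names split across lines), the sketch parses with `strict=False`; a stricter pipeline would escape or strip those control characters before parsing.

```python
import json
from collections import Counter

# Hypothetical path; point this at wherever the capture above is stored.
SNAPSHOT_PATH = "papers_snapshot.json"

with open(SNAPSHOT_PATH, encoding="utf-8") as fh:
    raw = fh.read()

# strict=False lets the decoder accept the literal newlines that this
# capture embeds inside string values; standard strict JSON would reject them.
snapshot = json.loads(raw, strict=False)["snapshot"]
papers = snapshot["papers"]

# Recompute the keyword tally that summary.by_keyword reports.
keyword_counts = Counter(
    kw for paper in papers for kw in (paper.get("ai_keywords") or [])
)

# Cross-check the precomputed highlights against the paper list.
top_by_upvotes = max(papers, key=lambda p: p["upvotes"])
top_by_comments = max(papers, key=lambda p: p["num_comments"])

print(f"{snapshot['date']}: {snapshot['total_papers']} papers "
      f"(from {snapshot['raw_count']} raw)")
print("Most upvoted:", top_by_upvotes["title"], top_by_upvotes["upvotes"])
print("Most discussed:", top_by_comments["title"], top_by_comments["num_comments"])
for kw, n in keyword_counts.most_common(10):
    print(f"{n:3d}  {kw}")
```

The field names used here (`date`, `total_papers`, `raw_count`, `papers`, `upvotes`, `num_comments`, `ai_keywords`, `title`) come directly from the snapshot; the filename and the choice of aggregations are illustrative only.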