{"ok":true,"snapshot":{"date":"2026-06-18","capturedAt":"2026-06-18T14:15:39.142Z","total_papers":30,"raw_count":50,"papers":[{"paperId":"2606.18023","title":"LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling","summary":"Looped Transformers scale latent computation by repeatedly applying shared blocks, but sequential looping increases latency and KV-cache memory with the loop count. Parallel loop Transformers (PLT) alleviate this cost through cross-loop position offsets (CLP) and shared-KV gated sliding-window attention, making loop count a practical design choice. We therefore study PLT loop-count selection through a gain--cost view: an extra loop may refine representations, but CLP also introduces a positional mismatch at each loop boundary. We instantiate this study by training LoopCoder-v2, a family of 7B PLT coders with different loop counts, from scratch on 18T tokens, followed by matched instruction tuning and evaluation. Empirically, the two-loop variant delivers broad gains over the non-looped ba…","authors":["Jian Yang","Shawn Guo","Wei Zhang","Tianyu Zheng","Yaxin Du","Haau-Sing Li","Jiajun Wu","Yue Song"],"publishedAt":"2026-06-16T00:00:00.000Z","submittedAt":null,"upvotes":133,"num_comments":3,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.18023.png","hf_url":"https://huggingface.co/papers/2606.18023","arxiv_url":"https://arxiv.org/abs/2606.18023","github_repo":null,"github_stars":null,"ai_keywords":["Looped Transformers","parallel loop Transformers","cross-loop position offsets","shared-KV gated sliding-window attention","loop-count selection","LoopCoder-v2","instruction tuning","SWE-bench","Multi-SWE"]},{"paperId":"2606.18195","title":"Learning from the Self-future: On-policy Self-distillation for dLLMs","summary":"On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. 
Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from \"self future-experience\" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, al…","authors":["Yifu Luo","Zeyu Chen","Haoyu Wang","Xinhao Hu","Yuxuan Zhang","Zhizhou Sha","Shiwei Liu"],"publishedAt":"2026-06-16T00:00:00.000Z","submittedAt":null,"upvotes":68,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.18195.png","hf_url":"https://huggingface.co/papers/2606.18195","arxiv_url":"https://arxiv.org/abs/2606.18195","github_repo":null,"github_stars":null,"ai_keywords":["on-policy self-distillation","diffusion LLMs","self-teacher construction","suffix conditioning","step-level supervision","iterative denoising process","reasoning benchmarks","sample efficiency","RLVR","SFT"]},{"paperId":"2606.18216","title":"Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients","summary":"Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. 
However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradie…","authors":["Byung-Kwan Lee","Ximing Lu","Shizhe Diao","Minki Kang","Saurav Muralidharan","Karan Sapra","Andrew Tao","Pavlo Molchanov"],"publishedAt":"2026-06-16T00:00:00.000Z","submittedAt":null,"upvotes":48,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.18216.png","hf_url":"https://huggingface.co/papers/2606.18216","arxiv_url":"https://arxiv.org/abs/2606.18216","github_repo":null,"github_stars":null,"ai_keywords":["knowledge distillation","student model","teacher model","reinforcement learning","policy gradient","on-policy assumption","prompt replay buffer","Binary Candidate-included Question","Negative Candidate-included Question","zone of proximal development"]},{"paperId":"2606.17200","title":"ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining","summary":"Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-EGO-0, a unified VLA pretraining framework jointly leveraging heterogeneous data sources. 
To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels compara…","authors":["Hao Li","Ganlong Zhao","Yufei Liu","Haotian Hou","Guoquan Ye","Tongyan Fang","Chunxiao Liu","Siyuan Huang"],"publishedAt":"2026-06-15T00:00:00.000Z","submittedAt":null,"upvotes":42,"num_comments":3,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.17200.png","hf_url":"https://huggingface.co/papers/2606.17200","arxiv_url":"https://arxiv.org/abs/2606.17200","github_repo":null,"github_stars":null,"ai_keywords":["Vision-Language-Action models","egocentric human videos","robot trajectory collection","unified action representation","camera-space actions","time-aligned action chunking","reliability-aware training objective","human auxiliary loss","pseudo-action trajectories","embodied AI tasks"]},{"paperId":"2606.19338","title":"Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games","summary":"Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. 
Both games are evaluated under a unified harness with three c…","authors":["Shengyuan Ding","Xilin Wei","Xinyu Fang","Haodong Duan","Dahua Lin","Jiaqi Wang","Yuhang Zang"],"publishedAt":"2026-06-17T00:00:00.000Z","submittedAt":null,"upvotes":35,"num_comments":4,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.19338.png","hf_url":"https://huggingface.co/papers/2606.19338","arxiv_url":"https://arxiv.org/abs/2606.19338","github_repo":null,"github_stars":null,"ai_keywords":["multimodal foundation models","closed-loop policies","observation reconstruction","multi-step interaction","RNG-Bench","Matching Pairs","3D Maze","memory gap","fine-tuning","Qwen3.5-9B"]},{"paperId":"2606.17628","title":"OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation","summary":"Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. 
In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy.…","authors":["Guibin Zhang","Xun Xu","Yanwei Yue","Zikun Su","Wangchunshu Zhou","Xiaobin Hu","Shuicheng Yan"],"publishedAt":"2026-06-16T00:00:00.000Z","submittedAt":null,"upvotes":26,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.17628.png","hf_url":"https://huggingface.co/papers/2606.17628","arxiv_url":"https://arxiv.org/abs/2606.17628","github_repo":null,"github_stars":null,"ai_keywords":["self-evolving agents","memory hierarchy","on-policy self-distillation","slow-fast co-evolution","policy learning","memory management","experience retention","agent evolver"]},{"paperId":"2606.18363","title":"Guava: An Effective and Universal Harness for Embodied Manipulation","summary":"Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tools use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. 
Our study identifies three key ingredients for effective embodied agents: iterativ…","authors":["Haowen Liu","Xirui Li","Shaoxiong Yao","Peng Shi","Tianyi Zhou","Jia-Bin Huang","Furong Huang","Jiayuan Mao"],"publishedAt":"2026-06-16T00:00:00.000Z","submittedAt":null,"upvotes":21,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.18363.png","hf_url":"https://huggingface.co/papers/2606.18363","arxiv_url":"https://arxiv.org/abs/2606.18363","github_repo":null,"github_stars":null,"ai_keywords":["embodied agents","vision-language models","embodied manipulation","agent workflows","action spaces","observation spaces","iterative perception-reasoning-action loops","semantic action abstractions","multimodal observations","end-to-end training"]},{"paperId":"2606.16533","title":"Kairos: A Native World Model Stack for Physical AI","summary":"World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. 
(2) Kairos maintains the world by unified world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Att…","authors":["Kairos Team","Fei Wang","Shan You","Qiming Zhang","Tao Huang","Zuoyi Fu","Zhisheng Zheng","Yunlong Xi"],"publishedAt":"2026-06-16T00:00:00.000Z","submittedAt":null,"upvotes":20,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.16533.png","hf_url":"https://huggingface.co/papers/2606.16533","arxiv_url":"https://arxiv.org/abs/2606.16533","github_repo":null,"github_stars":null,"ai_keywords":["world models","native pre-training paradigm","cross-embodiment data curriculum","native unified architecture","hybrid linear temporal attention","sliding-window attention","dilated sliding windows","gated linear attention","temporal factorization","error accumulation"]},{"paperId":"2606.15236","title":"Show the Signal, Hide the Noise: Spectral Forcing for Pixel-Space Diffusion","summary":"Pixel-space diffusion models are trained on full-bandwidth noisy images, yet the useful signal available to the denoiser is strongly frequency dependent. Under rectified-flow diffusion and natural-image power-law spectra, the per-band data-to-noise contour k^{*}(t) = (1-t)^{-2/α} separates a signal-bearing low-frequency region from a noise-dominated high-frequency region at each time t. We show that this implicit coarse-to-fine structure is not merely descriptive: it induces a capacity-allocation problem. A standard pixel-space denoiser must discover the moving bandwidth boundary internally and can spend computation on frequency-time regions where the optimal prediction collapses to deterministic baselines rather than data-distribution modeling. 
To make this boundary explicit, we introduc…","authors":["Weichen Fan","Haiwen Diao","Penghao Wu","Ziwei Liu"],"publishedAt":"2026-06-16T00:00:00.000Z","submittedAt":null,"upvotes":18,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.15236.png","hf_url":"https://huggingface.co/papers/2606.15236","arxiv_url":"https://arxiv.org/abs/2606.15236","github_repo":null,"github_stars":null,"ai_keywords":["diffusion models","pixel-space","denoiser","frequency-dependent","rectified-flow diffusion","power-law spectra","data-to-noise contour","capacity-allocation problem","spectral forcing","2D-DCT"]},{"paperId":"2606.16767","title":"Text-Vision Co-Instructed Image Editing","summary":"Existing image editing methods can be generally categorized into textual instruction-based and visual prompt-based ones. Textual instructions are semantically expressive, but are limited by the coarse granularity of spatial control of the editing results. In contrast, visual prompts such as drag and point can provide precise spatial guidance, but are limited by the inherent ambiguity in semantic intent. To unify the strength of textual and visual prompts, we present Text-Vision Co-Instructed Image Editing, which jointly models textual instructions as semantic intent and sparse visual instructions as spatial guidance, aiming to achieve precise and intent-faithful image manipulation. 
To this end, we first construct a textual-visual instruction paired dataset with more than 23K samples deriv…","authors":["Chenxi Xie","Yuhui Wu","Qiaosi Yi","Lei Zhang"],"publishedAt":"2026-06-15T00:00:00.000Z","submittedAt":null,"upvotes":14,"num_comments":3,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.16767.png","hf_url":"https://huggingface.co/papers/2606.16767","arxiv_url":"https://arxiv.org/abs/2606.16767","github_repo":null,"github_stars":null,"ai_keywords":["textual instruction-based","visual prompt-based","textual-visual instruction paired dataset","TV-Edit","cross-modal instruction","semantic intent","spatial guidance","pretrained editing backbones","semantic-aware control representations","spatial control"]},{"paperId":"2606.18322","title":"SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior","summary":"Sparse Autoencoders (SAEs) decompose residual-stream activations into interpretable features. Recent latent-space defenses increasingly rely on these decompositions, assuming that identified \"unsafe\" SAE features serve as actionable handles for monitoring and intervention. In this paradigm, clamping a specific harmful feature is expected to reliably prevent model misbehavior. However, we show that this success may hide a recoverable failure mode: the clamp may block one visible route to a behavior without eliminating the behavior itself. We formulate this vulnerability as post-intervention recovery, a constrained residual-space optimization problem. 
Starting from the post-intervention residual state, we optimize residual perturbations to recover the pre-intervention behavior while preserv…","authors":["Mingyue Cui","Linghui Shen","Xingyi Yang"],"publishedAt":"2026-06-16T00:00:00.000Z","submittedAt":null,"upvotes":14,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.18322.png","hf_url":"https://huggingface.co/papers/2606.18322","arxiv_url":"https://arxiv.org/abs/2606.18322","github_repo":null,"github_stars":null,"ai_keywords":["Sparse Autoencoders","residual-stream activations","latent-space defenses","feature-level intervention","post-intervention recovery","residual-space optimization","encoder-orthogonal updates","feature-map Jacobian","TPP","unlearning"]},{"paperId":"2606.15378","title":"Rethinking the Role of Efficient Attention in Hybrid Architectures","summary":"Modern language models increasingly adopt hybrid architectures that combine full attention with efficient attention modules, such as sliding-window attention (SWA) and recurrent sequence mixers. However, how these efficient modules shape model capabilities remains poorly understood. To address this gap, we conduct a systematic analysis across hybrid architectures from three perspectives: scaling behavior, mechanism analysis, and architecture design. First, from a scaling perspective, we find that efficient-attention design primarily affects how fast long-context capability emerges, while different hybrids eventually converge to comparable long-context performance under sufficient training. 
Second, mechanistically, we show that long-range retrieval is mainly carried by full attention, wher…","authors":["Ziqing Qiao","Yinuo Xu","Chaojun Xiao","Zhou Su","Zihan Zhou","Yingfa Chen","Xiaoyue Xu","Xu Han"],"publishedAt":"2026-06-13T00:00:00.000Z","submittedAt":null,"upvotes":12,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.15378.png","hf_url":"https://huggingface.co/papers/2606.15378","arxiv_url":"https://arxiv.org/abs/2606.15378","github_repo":null,"github_stars":null,"ai_keywords":["hybrid architectures","full attention","efficient attention modules","sliding-window attention","recurrent sequence mixers","scaling behavior","mechanism analysis","architecture design","long-range retrieval","optimization trajectory"]},{"paperId":"2606.18101","title":"Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding","summary":"Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promising post-training approach for this coordinate-sensitive task, since it provides dense token-level teacher signals beyond hard coordinate labels. However, naive OPSD is not well suited to GUI grounding: OPSD evaluates the teacher on student-generated prefixes, the quality of coordinate-token teacher signals can degrade when the prefix has already deviated from the target coordinate, leading to unreliable teacher signal. 
To mitigate this, we propose quality-aware self-distillation for VLM-based GUI grounding, which improves coordinate-token teacher-signal quality…","authors":["Jingyuan Huang","Zuming Huang","Yucheng Shi","Tianze Yang","Xiaoming Zhai","Wei Chu","Ninghao Liu"],"publishedAt":"2026-06-16T00:00:00.000Z","submittedAt":null,"upvotes":12,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.18101.png","hf_url":"https://huggingface.co/papers/2606.18101","arxiv_url":"https://arxiv.org/abs/2606.18101","github_repo":null,"github_stars":null,"ai_keywords":["vision-language models","on-policy self-distillation","coordinate-sensitive task","dense token-level teacher signals","soft correctness-aware gating","teacher-probability scaling","GUI grounding","screen coordinates","vision-language models"]},{"paperId":"2606.18180","title":"EgoCS-400K: An Egocentric Gameplay Dataset for World Models","summary":"The shift from video generation to interactive world modeling places new demands on data: beyond captioned videos, world models require temporally aligned video-action-language trajectories grounded in the actions, camera motion, states, and events that drive future scene changes. However, such data is difficult to obtain at scale. Web video datasets offer broad visual coverage but lack executable actions and reliable states; robotic datasets provide action and state supervision but are costly and limited in scene diversity; and existing simulators often lack large-scale human-driven interaction trajectories. In this paper, we introduce EgoCS-400K, a large-scale replay-grounded egocentric Counter-Strike dataset for world models, built from public professional CS and CS2 match demos that p…","authors":["Rongjin Guo","Dong Liang","Yuhao Liu","Fang Liu","Tianyu Huang","Gerhard P. Hancke","Rynson W. H. 
Lau"],"publishedAt":"2026-06-16T00:00:00.000Z","submittedAt":null,"upvotes":12,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.18180.png","hf_url":"https://huggingface.co/papers/2606.18180","arxiv_url":"https://arxiv.org/abs/2606.18180","github_repo":null,"github_stars":null,"ai_keywords":["egocentric","world models","video-action-language trajectories","player states","game events","first-person videos","temporal alignment","replay-grounded","action-conditioned future prediction","state-aware scene rollout"]},{"paperId":"2606.17539","title":"Reinforcing Dual-Path Reasoning in Spatial Vision Language Models","summary":"Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. 
We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens b…","authors":["Yatai Ji","An-Chieh Cheng","Yang Fu","Yukang Chen","Han Zhang","Zhaojing Yang","Wei Huang","Ka Chun Cheung"],"publishedAt":"2026-06-16T00:00:00.000Z","submittedAt":null,"upvotes":12,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.17539.png","hf_url":"https://huggingface.co/papers/2606.17539","arxiv_url":"https://arxiv.org/abs/2606.17539","github_repo":null,"github_stars":null,"ai_keywords":["spatial VLMs","reinforcement learning","language-only reasoning","detect-then-reason","chain-of-thought supervision","region tokens","3D geometric cues","discrete center-based detection","cold-start supervised fine-tuning","policy model"]},{"paperId":"2606.13929","title":"Self-Evolving Visual Questioner","summary":"Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric and grounded questions remains underexplored. Existing visual questioners' performance is bottlenecked by the availability of high-quality training data or the cost of curating them. We show that a VLM can continuously improve itself as a visual questioner without any external supervision. We propose a self-evolving framework that uses a VLM itself as both a proposer and a filter to produce harder, more informative, and visual-centric questions, while maintaining their exploration diversity to avoid training collapse. These questions are then used to train the VLM in both questioner and answerer modes. 
To evaluate the questioner, we introduce a…","authors":["Yijun Liang","Hengguang Zhou","Ming Li","Lichen Li","Cho-Jui Hsieh","Tianyi Zhou"],"publishedAt":"2026-06-11T00:00:00.000Z","submittedAt":null,"upvotes":12,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.13929.png","hf_url":"https://huggingface.co/papers/2606.13929","arxiv_url":"https://arxiv.org/abs/2606.13929","github_repo":null,"github_stars":null,"ai_keywords":["vision-language models","visual questioner","self-evolving framework","visual-centric questions","training data","question generation","answerer mode","questioner mode","agentic protocol","training collapse"]},{"paperId":"2606.18967","title":"EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts","summary":"Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck because autoregressive sampling decodes responses sequentially and a small number of long-tailed generations often determine completion time. Speculative decoding (SD) offers a natural way to address this bottleneck, as it is a well-established technique for serving fixed LLMs that reduces latency by rapidly drafting tokens and accepting them through parallel verification while preserving the target-model distribution. 
However, its practical speedups do not directly carry over to RL rollouts: (i) the evolving target policy makes any fixed drafter increasingly mismatched with the policy…","authors":["Minseo Kim","Minjae Lee","Seunghyuk Oh","Kevin Galim","Donghoon Kim","Coleman Hooper","Harman Singh","Amir Gholami"],"publishedAt":"2026-06-17T00:00:00.000Z","submittedAt":null,"upvotes":10,"num_comments":3,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.18967.png","hf_url":"https://huggingface.co/papers/2606.18967","arxiv_url":"https://arxiv.org/abs/2606.18967","github_repo":null,"github_stars":null,"ai_keywords":["reinforcement learning","autoregressive sampling","speculative decoding","rollout generation","self-speculative decoding","drafters","acceptance-aware draft-length adaptation","compute-bound regimes","memory-bound regimes"]},{"paperId":"2606.17682","title":"From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning","summary":"Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. 
On this testbed, we condition the environment engineer on structured summaries of policy behavior,…","authors":["Chao Chen","Chengzu Li","Zhiwei Li","Yinhong Liu","Zhijiang Guo"],"publishedAt":"2026-06-16T00:00:00.000Z","submittedAt":null,"upvotes":9,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.17682.png","hf_url":"https://huggingface.co/papers/2606.17682","arxiv_url":"https://arxiv.org/abs/2606.17682","github_repo":null,"github_stars":null,"ai_keywords":["reinforcement learning","Large Language Models","environment redesign","policy analysis","failure trajectories","environment engineering","Qwen3-4B","benchmarking","policy learning","environment configuration"]},{"paperId":"2606.19341","title":"Native Active Perception as Reasoning for Omni-Modal Understanding","summary":"Passive models for long video understanding typically rely on a \"watch-it-all\" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. 
To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-…","authors":["Zhenghao Xing","Ruiyang Xu","Yuxuan Wang","Jinzheng He","Ziyang Ma","Qize Yang","Yunfei Chu","Jin Xu"],"publishedAt":"2026-06-17T00:00:00.000Z","submittedAt":null,"upvotes":9,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.19341.png","hf_url":"https://huggingface.co/papers/2606.19341","arxiv_url":"https://arxiv.org/abs/2606.19341","github_repo":null,"github_stars":null,"ai_keywords":["POMDP","Observation-Thought-Action cycle","active perception","agentic supervised fine-tuning","agentic reinforcement learning","TAURA","turn-level entropy","video understanding","omni-modal agent"]},{"paperId":"2606.14885","title":"Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion","summary":"Agentic search over large corpora relies on retriever-mediated interfaces (e.g., BM25 or ColBERT) for scalable candidate discovery. While effective at ranking relevant documents, these interfaces expose evidence only as ranked results or bounded document views, limiting agents' ability to reorganize material and verify constraints across documents. Direct Corpus Interaction (DCI) addresses this limitation by exposing shell-executable corpus operations for flexible search, filtering, comparison, and verification. However, full-corpus terminal commands become slow and unstable as the corpus grows, degrading performance and efficiency. We introduce DR-DCI, a retriever-steered DCI framework that treats retrieval as an agent-callable action for expanding a local workspace. 
Rather than operatin…","authors":["Yi Lu","Zhuofeng Li","Ping Nie","Haoxiang Zhang","Yuyu Zhang","Kai Zou","Wenhu Chen","Jimmy Lin"],"publishedAt":"2026-06-12T00:00:00.000Z","submittedAt":null,"upvotes":8,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.14885.png","hf_url":"https://huggingface.co/papers/2606.14885","arxiv_url":"https://arxiv.org/abs/2606.14885","github_repo":null,"github_stars":null,"ai_keywords":["retriever-mediated interfaces","direct corpus interaction","shell-executable corpus operations","agent-callable action","local workspace","corpus-scaling experiments","ranked previews","inter-document DCI"]},{"paperId":"2606.05985","title":"Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems","summary":"Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a single agent matches a target culture. Yet alignment is a per-agent property and cannot reveal whether a system, taken as a whole, preserves the cultural plurality it is meant to represent. We propose value diversity as a system-level evaluation axis for multicultural agent systems, defined through the dissimilarity between culturally conditioned agents' responses on a shared value survey. Using the World Values Survey, we evaluate 19 cultures and 18 backbone models across a wide range of system configurations. We find that diversity is largely uncorrelated wit…","authors":["Shaoyang Xu","Jingshen Zhang","Long P. 
Hoang","Jinyuan Li","Wenxuan Zhang"],"publishedAt":"2026-06-04T10:26:33.000Z","submittedAt":null,"upvotes":7,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.05985.png","hf_url":"https://huggingface.co/papers/2606.05985","arxiv_url":"https://arxiv.org/abs/2606.05985","github_repo":null,"github_stars":null,"ai_keywords":["multicultural multi-agent systems","cultural evaluation","value alignment","value diversity","cultural plurality","World Values Survey","backbone models","social interaction","collective decision-making"]},{"paperId":"2606.15158","title":"RefGC-SR^2: Reference-guided Generated Content Super-Resolution and Refinement","summary":"Reference-guided generation (e.g., object compositing, customization) has progressed rapidly, yet current pipelines share a fundamental limitation: the object-centric high-resolution reference image (HRRI) provided by users is downsampled to a fixed low-resolution (LR) before being fed into the model, so the fine-grained details are discarded before the output is even produced. In addition, the generation step then introduces its own artifacts (e.g., identity distortion) on top of this loss. 
Existing reference-guided generated content refinement (RefGCR) methods can correct some of these artifacts but still operate in the LR domain; reference-guided super-resolution (RefSR) methods recover resolution but assume natural-image degradations and ignore the artifact distribution of generative …","authors":["Jeahun Sung","Dahyeon Kye","Soo Ye Kim","Jihyong Oh"],"publishedAt":"2026-06-13T00:00:00.000Z","submittedAt":null,"upvotes":7,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.15158.png","hf_url":"https://huggingface.co/papers/2606.15158","arxiv_url":"https://arxiv.org/abs/2606.15158","github_repo":null,"github_stars":null,"ai_keywords":["reference-guided generation","high-resolution reference image","low-resolution","diffusion transformer","frequency-aware","diptych-conditioned generator","generative artifacts","super-resolution-refinement","real-world triplet data generation","object compositing"]},{"paperId":"2606.19236","title":"STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability","summary":"Reinforcement Learning with Verifiable Rewards algorithms like GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. 
Motivated by it, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surp…","authors":["Haipeng Luo","Qingfeng Sun","Songli Wu","Can Xu","Wenfeng Deng","Han Hu","Yansong Tang"],"publishedAt":"2026-06-17T00:00:00.000Z","submittedAt":null,"upvotes":7,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.19236.png","hf_url":"https://huggingface.co/papers/2606.19236","arxiv_url":"https://arxiv.org/abs/2606.19236","github_repo":null,"github_stars":null,"ai_keywords":["Reinforcement Learning","GRPO","policy entropy collapse","first-order gradient analysis","token-level entropy dynamics","trajectory-level advantage","entropy sensitivity function","surprisal-guided token-level advantage reweighting","target-entropy closed-loop gate","policy entropy stability"]},{"paperId":"2606.19005","title":"Sumi: Open Uniform Diffusion Language Model from Scratch","summary":"Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. 
To this end, we introduce Sumi (\"ink\" in Japanese), a fully open 7B uniform d…","authors":["Mengyu Ye","Keito Kudo","Wataru Ikeda","Ryosuke Matsuda","Keisuke Sakaguchi","Jun Suzuki"],"publishedAt":"2026-06-17T00:00:00.000Z","submittedAt":null,"upvotes":7,"num_comments":0,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.19005.png","hf_url":"https://huggingface.co/papers/2606.19005","arxiv_url":"https://arxiv.org/abs/2606.19005","github_repo":null,"github_stars":null,"ai_keywords":["uniform diffusion language models","autoregressive models","diffusion models","pretraining","token budget","model scaling","generation dynamics","controllability","data mixture","model weights"]},{"paperId":"2606.17905","title":"ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions","summary":"Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English--Chinese aligned benchmark that tests whether models preserve logical reasoning performance when the same latent logical structure is expressed in English and diverse Chinese surface realizations. Built from formal logical templates, the benchmark contains three data sets: (i) the General aligned set, derived from 60 General Propositions across nine template families; (ii) the Difficult aligned set, derived from 40 Difficult Problems; and (iii) the Chinese-only set, covering 15 language-specific phenomenon types. 
Each aligned item pairs one English reference expression with five Chinese realization…","authors":["Peixian Zhou","Yuxu Chen","Chaorui Zhang","Wei Han","Bo Bai","Xueyan Niu"],"publishedAt":"2026-06-16T00:00:00.000Z","submittedAt":null,"upvotes":6,"num_comments":3,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.17905.png","hf_url":"https://huggingface.co/papers/2606.17905","arxiv_url":"https://arxiv.org/abs/2606.17905","github_repo":null,"github_stars":null,"ai_keywords":["large language models","logical reasoning benchmarks","multilingual reasoning","back-translation","surface realization"]},{"paperId":"2606.18558","title":"MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction","summary":"Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. 
We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M uncons…","authors":["Jianing Zhang","Chenhao Zheng","Yajun Yang","Max Argus","Rustin Soraki","Winson Han","Taira Anderson","Chun-Liang Li"],"publishedAt":"2026-06-17T00:00:00.000Z","submittedAt":null,"upvotes":5,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.18558.png","hf_url":"https://huggingface.co/papers/2606.18558","arxiv_url":"https://arxiv.org/abs/2606.18558","github_repo":null,"github_stars":null,"ai_keywords":["motion forecasting","3D point trajectories","goal-conditioned","language description","autoregressive coordinate prediction","flow-matching-based trajectory generation","robot manipulation","generative models","video synthesis"]},{"paperId":"2606.17321","title":"ProCUA-SFT Technical Report","summary":"Training computer-use agents (CUAs) -- models that interact with graphical desktops through screenshots and keyboard/mouse actions -- requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negative transfer when used for supervised fine-tuning (SFT): continuing training UI-TARS 7B on AgentNet causes OSWorld success rate to fall from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories across 2,484 application combinations. 
The dataset is produced by a fully automated pipeline that (i) synthesizes grounded tasks on live desktops seeded with real-world content -- 912 spreadsheets from SpreadsheetBench, approximately 10…","authors":["Jaehun Jung","Ximing Lu","Brandon Cui","Muhammad Khalifa","Shaokun Zhang","Hao Zhang","Jin Xu","Amala Sanjay Deshmukh"],"publishedAt":"2026-06-15T00:00:00.000Z","submittedAt":null,"upvotes":5,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.17321.png","hf_url":"https://huggingface.co/papers/2606.17321","arxiv_url":"https://arxiv.org/abs/2606.17321","github_repo":null,"github_stars":null,"ai_keywords":["computer-use agents","supervised fine-tuning","UI-TARS","OSWorld","synthetic trajectories","precondition checking","VLM","step-prefix samples"]},{"paperId":"2606.18246","title":"Variable-Width Transformers","summary":"Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a $\\times$-shaped > <former architecture. This design maintains wider early and late layers while narrowing the middle layers, utilizing a parameter-free residual resizing mechanism. 
Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our > <former consistently outperforms parameter-matched uniform baselines on lang…","authors":["Zhaofeng Wu","Oliver Sieberling","Shawn Tan","Rameswar Panda","Yury Polyanskiy","Yoon Kim"],"publishedAt":"2026-06-16T00:00:00.000Z","submittedAt":null,"upvotes":5,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.18246.png","hf_url":"https://huggingface.co/papers/2606.18246","arxiv_url":"https://arxiv.org/abs/2606.18246","github_repo":null,"github_stars":null,"ai_keywords":["transformer-based language models","model size scaling","depth","width","parameter-efficient fine-tuning","$\\times$-shaped architecture","residual resizing mechanism","decoder-only language models","MoE","FLOPs"]},{"paperId":"2606.15231","title":"Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning","summary":"Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. While recent multimodal deep search agents attempt to address this issue by utilizing external tools, the visual-native search paradigm remains underexplored. Existing methods primarily rely on simple images with explicit semantics and text-only evidence trajectories, limiting the agent's ability to perform multi-hop, cross-modal reasoning and search. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent via active visual reasoning. 
Rather than treating vision as a static input, our agent actively attends to fine-grained visual details, dynam…","authors":["Zhengbo Zhang","Changtao Miao","Jinbo Su","Zhaowen Zhou","Chunxia Zhang","Xukai Wang","Ruiqi Liu","Kaiyuan Zheng"],"publishedAt":"2026-06-13T00:00:00.000Z","submittedAt":null,"upvotes":3,"num_comments":2,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.15231.png","hf_url":"https://huggingface.co/papers/2606.15231","arxiv_url":"https://arxiv.org/abs/2606.15231","github_repo":null,"github_stars":null,"ai_keywords":["multimodal large language models","visual-native search","deep search agents","active visual reasoning","multimodal trajectories","visual evidence harvesting"]},{"paperId":"2606.15872","title":"SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks","summary":"Frontier scientific reasoning remains a major challenge for large language models (LLMs), where even the strongest commercial systems fall short of expert-level performance. A closer look at model behavior reveals substantial complementarity that single-model evaluation hides: different frontier models excel on different question types, and no single model captures the full picture. We present SciOrch, a framework that trains a lightweight 8B model to orchestrate frontier LLMs for scientific reasoning. The orchestrator decomposes each question, delegates sub-problems to selected commercial models through API calls, and synthesizes a final answer. 
Training such an orchestrator is fundamentally harder than conventional agentic RL: each action triggers an API call that is expensive in both d…","authors":["Jingru Guo","Xiangyuan Xue","Lian Zhang","Wanghan Xu","Siki Chen","Philip Torr","Wanli Ouyang","Lei Bai"],"publishedAt":"2026-06-14T15:45:34.000Z","submittedAt":null,"upvotes":3,"num_comments":1,"thumbnail":"https://cdn-thumbnails.huggingface.co/social-thumbnails/papers/2606.15872.png","hf_url":"https://huggingface.co/papers/2606.15872","arxiv_url":"https://arxiv.org/abs/2606.15872","github_repo":null,"github_stars":null,"ai_keywords":["large language models","scientific reasoning","frontier models","orchestrator model","API calls","MCTS-based approach","GRPO-style training","multi-agent baseline","SGI-Reasoning","Scientists' First Exam"]}],"summary":{"by_keyword":[{"keyword":"reinforcement learning","count":4},{"keyword":"vision-language models","count":4},{"keyword":"on-policy self-distillation","count":3},{"keyword":"policy learning","count":2},{"keyword":"world models","count":2},{"keyword":"sliding-window attention","count":2},{"keyword":"diffusion models","count":2},{"keyword":"large language models","count":2},{"keyword":"Looped Transformers","count":1},{"keyword":"parallel loop Transformers","count":1},{"keyword":"cross-loop position offsets","count":1},{"keyword":"shared-KV gated sliding-window attention","count":1},{"keyword":"loop-count selection","count":1},{"keyword":"LoopCoder-v2","count":1},{"keyword":"instruction tuning","count":1}],"most_upvoted":{"paperId":"2606.18023","title":"LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling","upvotes":133},"most_discussed":{"paperId":"2606.19338","title":"Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games","comments":4}}}}