{"ok":true,"snapshot":{"date":"2026-07-04","capturedAt":"2026-07-04T11:30:28.010Z","total_papers":50,"categories_queried":["cs.AI","cs.LG","cs.CL","cs.CV"],"raw_count":100,"papers":[{"arxivId":"2607.02517","version":"v1","title":"WorldDirector: Building Controllable World Simulators with Persistent Dynamic Memory","abstract":"We present WorldDirector, a highly controllable video world model framework designed for persistent dynamic object memory and unrestricted viewpoint exploration. Unlike existing world models that entangle physical dynamics with pixel rendering and rely on continuous visual observation to sustain motion, our framework explicitly decouples semantic motion orchestration from visual generation. By leveraging an LLM to coordinate 3D trajectories with camera movements and subsequently employing these orchestrated trajectories as control signals for video generation, our approach ensures strict physical logic and appearance stability, successfully preserving the exact visual identities of dynamic entities even when they re-enter the scene after prolonged periods out of view. Experimental results…","authors":["Hanlin Wang","Hao Ouyang","Qiuyu Wang","Wen Wang","Qingyan Bai","Ka Leong Cheng","Yue Yu","Yixuan Li"],"primaryCategory":"cs.CV","categories":["cs.CV"],"publishedAt":"2026-07-02T17:59:59Z","updatedAt":"2026-07-02T17:59:59Z","htmlUrl":"https://arxiv.org/abs/2607.02517v1","pdfUrl":"https://arxiv.org/pdf/2607.02517v1","doi":null},{"arxivId":"2607.02516","version":"v1","title":"Alignment Is All You Need For X-to-4D Generation","abstract":"Generative diffusion models excel at synthesizing high-quality images, videos, and 3D content under multimodal control. However, arbitrary user-defined modality-to-4D (X-to-4D) generation remains challenging due to the high cost of constructing diverse datasets and the limited scalability of existing methods. 
This paper presents Align4D, a flexible framework that translates any-modal input into coherent video-3D pairs, using video to guide 4D motion and 3D data to shape 4D geometry. Align4D introduces three key techniques: (1) Object Distance Alignment, which searches Video-Aligned and Multiview-Aligned Object Distances (VAOD/MAOD), respectively, to reconcile 4D renderings with video and the priors of multiview diffusion models; (2) Motion-Geometry Joint Alignment, which constrains known …","authors":["Qiaowei Miao","Kehan Li","Yawei Luo","Yi Yang"],"primaryCategory":"cs.CV","categories":["cs.CV"],"publishedAt":"2026-07-02T17:59:57Z","updatedAt":"2026-07-02T17:59:57Z","htmlUrl":"https://arxiv.org/abs/2607.02516v1","pdfUrl":"https://arxiv.org/pdf/2607.02516v1","doi":null},{"arxivId":"2607.02514","version":"v1","title":"Distributed Attacks in Persistent-State AI Control","abstract":"As AI coding agents become more autonomous, they increasingly ship code iteratively, with the codebase persisting across sessions. This persistence creates a new attack surface: a misaligned or prompt-injected agent can distribute attacks across pull requests (PRs) and time its payload for the PR with the best natural cover. To study the resulting dynamics, we introduce Iterative VibeCoding, a setting for AI control, the study of safely deploying capable but potentially untrusted AI. In Iterative VibeCoding, a coding agent builds software over a sequence of PRs in a persistent codebase while pursuing a covert side task. Our benchmark includes two task families: CLI tools and Flask web services, across 20 total task variations. 
We use Claude Sonnet 4.5 as the attack agent and GPT-4o as the…","authors":["Josh Hills","Ida Caspary","Asa Cooper Stickland"],"primaryCategory":"cs.AI","categories":["cs.AI"],"publishedAt":"2026-07-02T17:59:56Z","updatedAt":"2026-07-02T17:59:56Z","htmlUrl":"https://arxiv.org/abs/2607.02514v1","pdfUrl":"https://arxiv.org/pdf/2607.02514v1","doi":null},{"arxivId":"2607.02515","version":"v1","title":"PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation","abstract":"State-of-the-art single-image 3D reconstruction methods often rely on complex hybrid architectures and loss functions, or compress geometry into latent spaces in order to leverage pre-trained latent diffusion models. In this work, we show that such architectural overhead and intricate loss formulations are unnecessary. We introduce a minimalist pixel-space Diffusion Transformer, built on a plain ViT, that operates directly on raw 3D point map patches and is conditioned on image tokens from a pre-trained DINOv3. Unlike existing latent diffusion approaches, we train our diffusion backbone entirely from scratch, eliminating the need for point map tokenizers. Despite its simplicity, our approach surpasses complex latent-based diffusion models while remaining significantly simpler than hybrid …","authors":["Haofei Xu","Rundi Wu","Philipp Henzler","Nikolai Kalischek","Michael Oechsle","Fabian Manhardt","Marc Pollefeys","Andreas Geiger"],"primaryCategory":"cs.CV","categories":["cs.CV"],"publishedAt":"2026-07-02T17:59:56Z","updatedAt":"2026-07-02T17:59:56Z","htmlUrl":"https://arxiv.org/abs/2607.02515v1","pdfUrl":"https://arxiv.org/pdf/2607.02515v1","doi":null},{"arxivId":"2607.02513","version":"v1","title":"LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning","abstract":"LLMs memorize sensitive training data, including personally identifiable information (PII), creating a pressing need for reliable post hoc removal methods. 
Unlearning has emerged as a promising solution, with state-of-the-art (SOTA) methods often following a localize-first, unlearn-second paradigm that targets specific model parameters. However, existing benchmarks evaluate unlearning solely at the output level, leaving open the question of whether unlearning truly erases knowledge from a model's parameters or merely obfuscates it, a concern reinforced by the success of resurfacing attacks. To bridge this gap, we introduce LACUNA: the first unlearning testbed with ground-truth parameter-level localization. LACUNA injects PII of synthetic individuals into predefined parameters of 1B and 7B …","authors":["Matteo Boglioni","Thibault Rousset","Siva Reddy","Marius Mosbach","Verna Dankers"],"primaryCategory":"cs.CL","categories":["cs.CL","cs.AI","cs.LG"],"publishedAt":"2026-07-02T17:59:52Z","updatedAt":"2026-07-02T17:59:52Z","htmlUrl":"https://arxiv.org/abs/2607.02513v1","pdfUrl":"https://arxiv.org/pdf/2607.02513v1","doi":null},{"arxivId":"2607.02512","version":"v1","title":"Program-as-Weights: A Programming Paradigm for Fuzzy Functions","abstract":"Many everyday programming tasks resist clean rule-based implementation, such as alerting on important log lines, repairing malformed JSON, or ranking search results by intent, and are increasingly outsourced to large language model APIs at the cost of locality, reproducibility, and price. We propose fuzzy-function programming: compiling such a function from a natural-language specification into a compact, locally-executable neural artifact. We instantiate this paradigm with Program-as-Weights (PAW), in which a 4B compiler trained on FuzzyBench, a 10M-example dataset we release, emits parameter-efficient adapters for a frozen, lightweight interpreter. 
A 0.6B Qwen3 interpreter executing PAW programs matches the performance of direct prompting of Qwen3-32B, while using roughly one fiftieth o…","authors":["Wentao Zhang","Liliana Hotsko","Woojeong Kim","Pengyu Nie","Stuart Shieber","Yuntian Deng"],"primaryCategory":"cs.LG","categories":["cs.LG","cs.AI","cs.CL"],"publishedAt":"2026-07-02T17:59:50Z","updatedAt":"2026-07-02T17:59:50Z","htmlUrl":"https://arxiv.org/abs/2607.02512v1","pdfUrl":"https://arxiv.org/pdf/2607.02512v1","doi":null},{"arxivId":"2607.02510","version":"v1","title":"Online Safety Monitoring for LLMs","abstract":"Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor that turns a verifier signal from an external model into an alarm decision by thresholding, with the threshold calibrated via risk control. In experiments on mathematical reasoning and red teaming datasets, we show that this simple design is competitive with more advanced monitors based on sequential hypothesis testing.","authors":["Mona Schirmer","Metod Jazbec","Alexander Timans","Christian Naesseth","Maja Waldron","Eric Nalisnick"],"primaryCategory":"cs.AI","categories":["cs.AI","cs.CL","cs.LG","stat.AP","stat.ML"],"publishedAt":"2026-07-02T17:59:43Z","updatedAt":"2026-07-02T17:59:43Z","htmlUrl":"https://arxiv.org/abs/2607.02510v1","pdfUrl":"https://arxiv.org/pdf/2607.02510v1","doi":null},{"arxivId":"2607.02509","version":"v1","title":"ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning","abstract":"Understanding and reasoning over long contexts has become a key requirement for deploying large language models (LLMs) in realistic applications. 
Although recent LLMs support increasingly long context windows, they often fail to use relevant evidence that is already present in the input, revealing a gap between context access and effective context utilization. In this work, we propose Recursive Evidence Replay as LLM Harness for Long-Context Reasoning (RECONTEXT), a training-free inference method for improving long-context reasoning. RECONTEXT uses model-internal relevance signals to construct a query-conditioned evidence pool and replays it before final generation while preserving the full original context. This recursive selection process separates evidence organization from answer gene…","authors":["Yanjun Zhao","Ruizhong Qiu","Tianxin Wei","Yuanchen Bei","Zhining Liu","Lingjie Chen","Ismini Lourentzou","Hanghang Tong"],"primaryCategory":"cs.AI","categories":["cs.AI"],"publishedAt":"2026-07-02T17:59:26Z","updatedAt":"2026-07-02T17:59:26Z","htmlUrl":"https://arxiv.org/abs/2607.02509v1","pdfUrl":"https://arxiv.org/pdf/2607.02509v1","doi":null},{"arxivId":"2607.02508","version":"v1","title":"From SRA to Self-Flow: Data Augmentation or Self-Supervision?","abstract":"Representation alignment has become an effective way to accelerate diffusion transformer training and improve generation quality. Recent self-alignment methods, such as SRA and Self-Flow, further remove the dependency on external pretrained encoders by constructing alignment within the diffusion model itself. However, the mechanism behind the improvement from SRA to Self-Flow, dual-time scheduling, remains under-examined: Self-Flow attributes its gain to interactions between tokens at different noise levels, where cleaner tokens help infer noisier ones. In this work, we revisit this explanation and ask whether the gain instead comes from data augmentation along the noise dimension. 
To disentangle these factors, we introduce Attention Separation, which preserves the same dual-timestep inpu…","authors":["Dengyang Jiang","Mengmeng Wang","Harry Yang","Jingdong Wang"],"primaryCategory":"cs.CV","categories":["cs.CV"],"publishedAt":"2026-07-02T17:59:25Z","updatedAt":"2026-07-02T17:59:25Z","htmlUrl":"https://arxiv.org/abs/2607.02508v1","pdfUrl":"https://arxiv.org/pdf/2607.02508v1","doi":null},{"arxivId":"2607.02507","version":"v1","title":"What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates","abstract":"LLM agents will increasingly act in socially structured settings where role, audience, and relational context can shape what is advantageous or costly to say. We study whether such social structure, without any explicit objective in the prompt, changes what an agent expresses publicly relative to an off-the-record (OTR) channel elicited under the same condition. We introduce a dual-channel debate framework in which agents produce public utterances that enter the shared history alongside OTR responses that are recorded but never shown to the other participant. 
Across 10 models, 3 scenarios, and 5 variations within each scenario, alignment-inducing settings produce systematic public-OTR divergence in the targeted agent, with its decision divergence rising from a $\\sim$3% baseline to roughly…","authors":["Arman Ghaffarizadeh","Danyal Mohaddes","Aliakbar Izadkhah","Shahriar Noroozizadeh"],"primaryCategory":"cs.AI","categories":["cs.AI","cs.CL","cs.LG","cs.MA"],"publishedAt":"2026-07-02T17:59:23Z","updatedAt":"2026-07-02T17:59:23Z","htmlUrl":"https://arxiv.org/abs/2607.02507v1","pdfUrl":"https://arxiv.org/pdf/2607.02507v1","doi":null},{"arxivId":"2607.02504","version":"v1","title":"Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas","abstract":"Long-form TV dramas present a formidable challenge for comprehensive video understanding, where deciphering complex storyline often relies on \\textbf{speaker recognition}, the task of accurately attributing each spoken utterance to its respective character. In this paper, we advance this field through two primary contributions. (1) We introduce \\textbf{DramaSR-532K}, a large-scale benchmark comprising 532K annotated dialogue lines across more than 900 unique characters, necessitating the integration of auditory, linguistic, and visual cues for speaker recognition. (2) We propose \\textbf{DramaSR-LRM}, a robust approach built upon a large reasoning model (LRM). 
DramaSR-LRM is designed to autonomously aggregate contextual evidence via multimodal tool-use, synthesizing diverse inputs to achie…","authors":["Yuxuan Li","Lingxi Xie","Xinyue Huo","Jihao Qiu","Jiacheng Shao","Pengfei Chen","Jiannan Ge","Kaiwen Duan"],"primaryCategory":"cs.CL","categories":["cs.CL","cs.AI","cs.CV"],"publishedAt":"2026-07-02T17:58:52Z","updatedAt":"2026-07-02T17:58:52Z","htmlUrl":"https://arxiv.org/abs/2607.02504v1","pdfUrl":"https://arxiv.org/pdf/2607.02504v1","doi":null},{"arxivId":"2607.02502","version":"v1","title":"DemoPSD: Disagreement-Modulated Policy Self-Distillation","abstract":"On-policy self-distillation (OPSD) has emerged as a practical method for training large language models (LLMs) to reason, where a single model acts as both the teacher and the student with different levels of information access. However, recent studies have found that the teacher's dense token-level supervision, conditioned on privileged information, can lead to overfitting to in-domain patterns, suppress exploration, and hurt cross-domain generalization, while also introducing a more fundamental issue: *privileged information leakage*, where the student encodes answer-dependent shortcuts that are unavailable at test time. We introduce **DemoPSD**, a novel framework that resolves such problems through the idea of *selective adoption of teacher guidance*. 
Instead of fitting the full teache…","authors":["Yunhe Li","Hao Shi","Wenhao Liu","Mengzhe Ruan","Hanxu Hou","Zhongxiang Dai","Shuang Qiu","Linqi Song"],"primaryCategory":"cs.LG","categories":["cs.LG","cs.AI"],"publishedAt":"2026-07-02T17:58:29Z","updatedAt":"2026-07-02T17:58:29Z","htmlUrl":"https://arxiv.org/abs/2607.02502v1","pdfUrl":"https://arxiv.org/pdf/2607.02502v1","doi":null},{"arxivId":"2607.02501","version":"v1","title":"Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots","abstract":"Embodied AI models now span vision-language-action (VLA) models and world-action models (WAMs), but practical deployment remains fragmented across model-specific Python stacks, backend assumptions, and robot-side glue code, especially on heterogeneous edge devices. Existing inference runtimes are designed mainly for request-response serving and therefore do not satisfy the runtime contract of embodied deployment: multi-rate execution inside closed-loop control, latency-first batch-1 inference on heterogeneous hardware, and extensible embodied interfaces beyond fixed token I/O. We present Embodied.cpp, a portable C++ inference runtime for embodied models. Based on an architectural analysis of representative VLA models and WAMs, Embodied.cpp captures a shared execution path and organizes it…","authors":["Ling Xu","Chuyu Han","Borui Li","Hao Wu","Shiqi Jiang","Ting Cao","Chuanyou Li","Sheng Zhong"],"primaryCategory":"cs.RO","categories":["cs.RO","cs.CV","cs.OS"],"publishedAt":"2026-07-02T17:58:28Z","updatedAt":"2026-07-02T17:58:28Z","htmlUrl":"https://arxiv.org/abs/2607.02501v1","pdfUrl":"https://arxiv.org/pdf/2607.02501v1","doi":null},{"arxivId":"2607.02499","version":"v1","title":"Beyond Adam: SOAP and Muon for Faster, Label-Efficient Training of Machine Learning Interatomic Potentials","abstract":"Machine learning interatomic potentials (MLIPs) have become a hallmark of AI for scientific simulation. 
While efforts on new architectures and datasets have led to increasingly accurate and general models, the choice of optimizer for training has largely remained unexplored, defaulting to Adam and its variants in the community. Here, we implement and systematically compare a class of recently proposed matrix-structured optimizers, including Muon, SOAP, and the hybrid SOAP-Muon, for training NequIP and Allegro MLIP models. We find that these optimizers can substantially outperform Adam in both convergence speed and final accuracy. SOAP and SOAP-Muon emerge as robust and consistently strong methods, while Muon only provides partial gains relative to Adam. The improvements are particularly p…","authors":["Gil Harari","Yoel Zimmermann","Ola Tangen Kulseng","Laura Zichi","Chuin Wei Tan","Marc L. Descoteaux","Boris Kozinsky"],"primaryCategory":"cs.LG","categories":["cs.LG","cs.AI","physics.chem-ph","physics.comp-ph"],"publishedAt":"2026-07-02T17:57:31Z","updatedAt":"2026-07-02T17:57:31Z","htmlUrl":"https://arxiv.org/abs/2607.02499v1","pdfUrl":"https://arxiv.org/pdf/2607.02499v1","doi":null},{"arxivId":"2607.02497","version":"v1","title":"Seek to Segment: Active Perception for Panoramic Referring Segmentation","abstract":"Existing referring segmentation models passively process static images captured from fixed perspectives, limiting their applicability in Embodied AI, where agents must perform active perception in the continuous 360$^\\circ$ environments. To bridge this gap, we introduce a novel task: Active Panoramic Referring Segmentation (APRS). In this setting, an agent is required to adjust its viewing direction ($Δθ, Δφ$) to explore the 360$^\\circ$ environment, seeking the object specified by a user instruction for segmentation. To tackle this challenging task, we propose PanoSeeker, a memory-augmented agent for efficient APRS. 
Rather than relying on heuristic scanning, PanoSeeker integrates a Vision-Language Model (VLM) with EgoSphere, an explicit spatial visual memory. By progressively integrating …","authors":["Song Tang","Shuming Hu","Xincheng Shuai","Henghui Ding","Yu-Gang Jiang"],"primaryCategory":"cs.CV","categories":["cs.CV"],"publishedAt":"2026-07-02T17:56:49Z","updatedAt":"2026-07-02T17:56:49Z","htmlUrl":"https://arxiv.org/abs/2607.02497v1","pdfUrl":"https://arxiv.org/pdf/2607.02497v1","doi":null},{"arxivId":"2607.02496","version":"v1","title":"Controllable Sim Agents with Behavior Latents","abstract":"Realistic traffic simulation requires agents that imitate logged behavior and can also be steered along interpretable axes. Such controllability enables engineers to isolate variables, reproduce specific edge cases, and test autonomous systems without real-world risk. We introduce Controllable Neural Variational Agents (CNeVA), a controllable simulated-agent framework that learns to infer a per-agent Gaussian behavior latent from per-channel discounted returns via a closed-form conjugate variational update, conditioning a rectified-flow trajectory generator trained on a mixed channel-mask curriculum for classifier-free guidance. To tackle scarcity in reward signals, we propose soft eligibility gates that replace hard binary thresholds with smooth exponential decay, preserving the gradient…","authors":["Juanwu Lu","Junyu Zhu","Ziran Wang"],"primaryCategory":"cs.RO","categories":["cs.RO","cs.LG"],"publishedAt":"2026-07-02T17:55:39Z","updatedAt":"2026-07-02T17:55:39Z","htmlUrl":"https://arxiv.org/abs/2607.02496v1","pdfUrl":"https://arxiv.org/pdf/2607.02496v1","doi":null},{"arxivId":"2607.02494","version":"v1","title":"Towards Robustness against Typographic Attack with Training-free Concept Localization","abstract":"Models trained via Contrastive Language-Image Pretraining (CLIP) serve as the foundational vision encoders for most modern Large Vision Language Models (LVLMs). 
Despite their widespread adoption, CLIP models exhibit a critical yet underexplored failure mode: irrelevant text appearing within images confounds visual representations, biasing them toward lexical meaning rather than true visual semantics. This robustness issue, commonly described as a Typographic Attack (TA), exposes a vulnerability that poses a significant risk to safety-critical applications such as autonomous driving. To achieve interpretable and effective robustness against TA, we propose a novel, training-free mechanistic interpretability method. Our method provides sampling-based interpretations of hidden state represent…","authors":["Bohan Liu","Wenqian Ye","Guangzhi Xiong","Zhenghao He","Sanchit Sinha","Aidong Zhang"],"primaryCategory":"cs.CV","categories":["cs.CV","cs.CL"],"publishedAt":"2026-07-02T17:55:24Z","updatedAt":"2026-07-02T17:55:24Z","htmlUrl":"https://arxiv.org/abs/2607.02494v1","pdfUrl":"https://arxiv.org/pdf/2607.02494v1","doi":null},{"arxivId":"2607.02491","version":"v1","title":"G-RRM: Guiding Symbolic Solvers with Recurrent Reasoning Models","abstract":"In this work, we focus on SE-RRMs, a symbol-equivariant instantiation of RRMs that exhibits improved extrapolation to larger problem sizes. We propose a neuro-symbolic approach, ``Guiding with Recurrent Reasoning Models'' (G-RRM), which integrates SE-RRMs with symbolic solvers for constraint satisfaction problems. SE-RRMs act as neural solvers that generate full solution proposals and guide classical symbolic solvers, such as backtracking or SAT-based methods like Glucose 4.1 and CaDiCaL 3.0.0, that produce globally correct solutions. Centrally, we investigate when neural guidance with G-RRM improves the search efficiency of symbolic solvers. 
Our experiments show that the efficacy of G-RRM depends on two conditions: first, the problem instances must have an expansive combinatorial searc…","authors":["Timo Bertram","Sidhant Bhavnani","Richard Freinschlag","Erich Kobler","Andreas Mayr","Günter Klambauer"],"primaryCategory":"cs.AI","categories":["cs.AI"],"publishedAt":"2026-07-02T17:53:31Z","updatedAt":"2026-07-02T17:53:31Z","htmlUrl":"https://arxiv.org/abs/2607.02491v1","pdfUrl":"https://arxiv.org/pdf/2607.02491v1","doi":null},{"arxivId":"2607.02490","version":"v1","title":"Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning","abstract":"Large vision-language models can reason over multimodal inputs by generating textual chains of thought (CoT). A key capability exhibited in CoT reasoning is self-reflection: revisiting earlier decisions and correcting previous errors. However, existing LVLMs often fail to properly attend to visual inputs during reflection, limiting their ability to translate feedback into grounded corrections, especially for out-of-distribution images. To address this issue, we propose a novel reinforcement learning training framework VRRL, with two components explicitly designed to elicit visually grounded self-reflection. First, we randomly mask trajectory prefixes during training to emphasize recovery from incorrect intermediate predictions rather than making early mistakes. 
Second, we introduce buffer…","authors":["Liyan Tang","Fangcong Yin","Greg Durrett"],"primaryCategory":"cs.CL","categories":["cs.CL","cs.CV"],"publishedAt":"2026-07-02T17:53:15Z","updatedAt":"2026-07-02T17:53:15Z","htmlUrl":"https://arxiv.org/abs/2607.02490v1","pdfUrl":"https://arxiv.org/pdf/2607.02490v1","doi":null},{"arxivId":"2607.02486","version":"v1","title":"GeoMix: Descriptor-Free Visual Localization via Global Context and Multi-Detector Training","abstract":"Descriptor-free visual localization eliminates high-dimensional descriptor storage, preserves scene privacy, and simplifies map maintenance, yet its accuracy still lags far behind descriptor-based pipelines. We attribute this gap to insufficient geometric discriminability in geometry-only matching. Without visual appearance, current methods underutilize local geometry cues, lack the global context among keypoints, and overfit to a single keypoint detector. We further observe that descriptor-free matching naturally enables multi-detector training, as heterogeneous keypoints can be optimized in a shared geometry-only space without aligning descriptor spaces. Building on these insights, we propose GeoMix, a descriptor-free 2D-3D matching framework that strengthens geometric discriminability a…","authors":["Yejun Zhang","Xinjue Wang","Zihan Wang","Esa Rahtu","Juho Kannala"],"primaryCategory":"cs.CV","categories":["cs.CV"],"publishedAt":"2026-07-02T17:52:41Z","updatedAt":"2026-07-02T17:52:41Z","htmlUrl":"https://arxiv.org/abs/2607.02486v1","pdfUrl":"https://arxiv.org/pdf/2607.02486v1","doi":null},{"arxivId":"2607.02484","version":"v1","title":"Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning","abstract":"Visual token pruning is a crucial strategy for accelerating VLMs by compressing redundant image patches, yet existing methods often fail to preserve critical cues under dense instructions and fine-grained queries. 
In this paper, we investigate this failure and identify two underlying bottlenecks: the widespread dispersion of textual noise that corrupts dense cross-modal scoring, and the feature fragmentation inherent to standard token selection. To address these issues, we propose Entropy-Aware Dense Pruning (EADP), a framework that reformulates pruning as a structured compression problem. EADP first leverages statistical entropy to quantify and filter out textual noise, yielding a robust, fine-grained instruction relevance score. Subsequently, instead of naive Top-K selection, EADP casts…","authors":["Xuehui Wang","Xuankun Yang","Wei Shen"],"primaryCategory":"cs.CV","categories":["cs.CV","cs.AI"],"publishedAt":"2026-07-02T17:50:57Z","updatedAt":"2026-07-02T17:50:57Z","htmlUrl":"https://arxiv.org/abs/2607.02484v1","pdfUrl":"https://arxiv.org/pdf/2607.02484v1","doi":null},{"arxivId":"2607.02479","version":"v1","title":"EAGLE-360: Embodied Active Global-to-Local Exploration in 360$^\\circ$","abstract":"While Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in standard visual understanding, adapting them for active visual search in 360$^\\circ$ panoramic environments exposes fundamental limitations. Specifically, standard MLLMs struggle to effectively model inherent panoramic properties, such as severe polar distortion and continuous cylindrical topologies, which significantly degrades target detection accuracy. Consequently, existing panoramic search methods attempt to compensate by relying heavily on fragmented local viewpoints. Burdened by rigid initialization and a lack of global panoramic priors, these approaches suffer from myopic, inefficient exploration and struggle with robust error recovery when targets are out of view. 
To overcome these challe…","authors":["Jingtao Xu","Zizhuo Lin","Jianwen Sun","Yi Yang","Yawei Luo"],"primaryCategory":"cs.CV","categories":["cs.CV"],"publishedAt":"2026-07-02T17:47:27Z","updatedAt":"2026-07-02T17:47:27Z","htmlUrl":"https://arxiv.org/abs/2607.02479v1","pdfUrl":"https://arxiv.org/pdf/2607.02479v1","doi":null},{"arxivId":"2607.02473","version":"v1","title":"Audio-Based Understanding of Audiobook Narration Appeal","abstract":"Narration is central to the audiobook listening experience, shaping how listeners engage with and understand the content. This work explores how narration qualities shape an audiobook's appeal, noting that their effects can vary by genre, title, and audience. We extract vocal and acoustic features (e.g., tone, pace, loudness) from LibriVox using pre-trained audio models and analyse their relationship with consumption data (specifically, view-rate) and their interplay with genre and title. Despite limited consumption data, we find that acoustic information alone has a robust association with appeal, even after accounting for title effects. We further validate these findings using more nuanced proprietary engagement metrics. To our knowledge, this is the first systematic computational study…","authors":["Shahar Elisha","Mariano Beguerisse-Díaz","Emmanouil Benetos"],"primaryCategory":"cs.CL","categories":["cs.CL","cs.SD","eess.AS"],"publishedAt":"2026-07-02T17:43:05Z","updatedAt":"2026-07-02T17:43:05Z","htmlUrl":"https://arxiv.org/abs/2607.02473v1","pdfUrl":"https://arxiv.org/pdf/2607.02473v1","doi":null},{"arxivId":"2607.02471","version":"v1","title":"Interpretation-Oriented Cloud Removal via Observation-Anchored Residual Flow with Geo-Contextual Alignment","abstract":"Cloud removal (CR) is essential for optical remote sensing, serving as a prerequisite for reliable downstream interpretation, such as semantic segmentation and change detection. 
However, existing CR approaches often prioritize visual realism while overlooking their impact on subsequent analytical tasks, leading to semantic drift and degraded downstream performance. To address this issue, we propose Geo-Anchored Cloud Removal (GACR), a unified framework that jointly ensures faithful reconstruction and robust interpretability. At its core, GACR incorporates Observation-Anchored Residual Flow (OAR-Flow), which reformulates CR as a physically grounded residual inversion process. By anchoring the generative trajectory to the cloudy observation rather than pure noise, OAR-Flow enables fast, sta…","authors":["Ziyao Wang","Maonan Wang","Yucheng He","Xianping Ma","Ziyi Wang","Hongyang Zhang","Yirong Cheng","Man-on Pun"],"primaryCategory":"cs.CV","categories":["cs.CV"],"publishedAt":"2026-07-02T17:39:23Z","updatedAt":"2026-07-02T17:39:23Z","htmlUrl":"https://arxiv.org/abs/2607.02471v1","pdfUrl":"https://arxiv.org/pdf/2607.02471v1","doi":null},{"arxivId":"2607.02469","version":"v1","title":"TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution","abstract":"Software tests and code evolve together: a code change should be followed by new or updated tests that record the new software behavior. Yet existing test generation and update benchmarks often isolate the test from the code change, and rely on static metadata that does not verify whether a test is executable or semantically tied to the code change. This makes it difficult to evaluate whether a test automation agent understands how a code change should propagate into the test suite. 
We introduce TestEvo-Bench, a benchmark of test and code co-evolution tasks mined from software repositories, with two tracks: in test generation, the agent shall write new tests to capture the new software behavior; in test update, the agent shall adapt failing existing tests to the changed software behavior.…","authors":["Jiale Amber Wang","Kaiyuan Wang","Pengyu Nie"],"primaryCategory":"cs.SE","categories":["cs.SE","cs.AI","cs.CL"],"publishedAt":"2026-07-02T17:35:20Z","updatedAt":"2026-07-02T17:35:20Z","htmlUrl":"https://arxiv.org/abs/2607.02469v1","pdfUrl":"https://arxiv.org/pdf/2607.02469v1","doi":null},{"arxivId":"2607.02467","version":"v1","title":"Human Capital, Not Model Benchmarks, Predicts Hybrid Intelligence in Forecasting","abstract":"Whether pairing people with AI helps or hurts is usually reported as a single average effect. Using a real-money prediction market (Polymarket) as an objective, externally resolved benchmark, this pilot shows that the value of human-AI collaboration depends on a specific, measurable form of human capital. Analyzed at the level of the individual forecaster, hybrid performance is trimodal: most people either deferred to the model (matching it) or used it to rubber-stamp a prior guess (performing worse than the model alone), while a minority engaged in genuine complementary reasoning and reached accuracy matching or even exceeding (i.e., lower error than) the market itself. 
Collaborative traits (perspective-taking, intellectual humility, and curiosity) rather than raw cognitive ability or mo…","authors":["Vivienne Ming"],"primaryCategory":"cs.CY","categories":["cs.CY","cs.AI"],"publishedAt":"2026-07-02T17:34:37Z","updatedAt":"2026-07-02T17:34:37Z","htmlUrl":"https://arxiv.org/abs/2607.02467v1","pdfUrl":"https://arxiv.org/pdf/2607.02467v1","doi":null},{"arxivId":"2607.02466","version":"v1","title":"Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs","abstract":"Vision-Language-Action (VLA) models are fundamentally bottlenecked by the scarcity of expert demonstrations -- triplets of observations, instructions, and actions that are costly to collect at scale. We argue that this bottleneck stems from conflating two distinct learning objectives: acquiring physical competence (how to move) and acquiring semantic alignment (what to do). Crucially, only the latter requires language supervision. Building on this Decomposition Hypothesis, we propose Task-Agnostic Pretraining (TAP), a two-stage framework that first learns transferable motor priors from cheap, unlabeled interaction data -- including discarded off-task trajectories and autonomous robot play -- via a self-supervised Inverse Dynamics objective. A lightweight second stage then grounds these pr…","authors":["Junhao Shi","Siyin Wang","Xiaopeng Yu","Li Ji","Jingjing Gong","Xipeng Qiu"],"primaryCategory":"cs.RO","categories":["cs.RO","cs.AI"],"publishedAt":"2026-07-02T17:33:37Z","updatedAt":"2026-07-02T17:33:37Z","htmlUrl":"https://arxiv.org/abs/2607.02466v1","pdfUrl":"https://arxiv.org/pdf/2607.02466v1","doi":null},{"arxivId":"2607.02464","version":"v1","title":"Will Scaling Improve Social Simulation with LLMs?","abstract":"Large Language Model (LLM) social simulations are a promising research method, but they are not yet faithful enough to be adopted widely. 
In this work, we investigate whether the current scaling paradigm in language modeling is likely to close these gaps, or whether simulation fidelity is orthogonal to general capabilities and therefore deserving of more research attention. We use scaling laws to study the relationship between LLMs' compute scale, general capability benchmarks, and the fidelity of social simulation in three representative sub-domains: opinion modeling, behavioral simulation, and longitudinal forecasting. Surprisingly, we discover strong compute scaling in all three settings, using a suite of 85 transformer LLMs with the Qwen3 architecture pre-trained on the DCLM web text …","authors":["Caleb Ziems","William Held","Su Doga Karaca","David Grusky","Tatsunori Hashimoto","Diyi Yang"],"primaryCategory":"cs.CL","categories":["cs.CL"],"publishedAt":"2026-07-02T17:30:38Z","updatedAt":"2026-07-02T17:30:38Z","htmlUrl":"https://arxiv.org/abs/2607.02464v1","pdfUrl":"https://arxiv.org/pdf/2607.02464v1","doi":null},{"arxivId":"2607.02461","version":"v1","title":"OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers","abstract":"Diffusion transformers (DiTs) achieve state-of-the-art image and video generation, but their multi-step sampling and growing parameter count make inference expensive. Post-training quantization (PTQ) is the natural remedy, yet DiT activations shift across timesteps, prompts, and guidance branches, forcing prior methods to re-fit calibration data for every new checkpoint or modality. We present OrbitQuant, a data-agnostic weight-activation quantizer that bypasses range estimation by quantizing in a normalized, rotated basis. In this basis, a randomized permuted block-Hadamard (RPBH) rotation concentrates each coordinate around one fixed, known marginal regardless of the input, so a single Lloyd-Max codebook serves all timesteps, prompts, and layers of a given input dimension. 
We extend the…","authors":["Donghyun Lee","Jitesh Chavan","Duy Nguyen","Sam Huang","Liming Jiang","Priyadarshini Panda","Timo Mertens","Saurabh Shukla"],"primaryCategory":"cs.CV","categories":["cs.CV","cs.AI","cs.LG"],"publishedAt":"2026-07-02T17:27:34Z","updatedAt":"2026-07-02T17:27:34Z","htmlUrl":"https://arxiv.org/abs/2607.02461v1","pdfUrl":"https://arxiv.org/pdf/2607.02461v1","doi":null},{"arxivId":"2607.02460","version":"v1","title":"Neuron-Aware Data Selection for Annotation-Free LLM Self-Distillation","abstract":"Post-training large language models (LLMs) without real-world interaction feedback or human-labeled supervision remains challenging, particularly in specialized domains where expert annotations are costly to obtain. Recent annotation-free self-evolution methods address this by using the model's own outputs as supervision signals, constructing a teacher via additional context and aggregating predictions across multiple rollouts through majority voting to produce pseudo-labels. However, these approaches are not without drawbacks: SFT- and GRPO-based variants suffer out-of-domain performance degradation, while reward-based on-policy RL inflates calibration error. In this paper, we propose Neuron On-Policy Self-Distillation (Neuron-OPSD), a data-centric framework for annotation-free self-dist…","authors":["Zhuowei Chen","Xiang Lorraine Li"],"primaryCategory":"cs.LG","categories":["cs.LG","cs.AI"],"publishedAt":"2026-07-02T17:27:24Z","updatedAt":"2026-07-02T17:27:24Z","htmlUrl":"https://arxiv.org/abs/2607.02460v1","pdfUrl":"https://arxiv.org/pdf/2607.02460v1","doi":null},{"arxivId":"2607.02459","version":"v1","title":"Language Models as Measurement Apparatus for Culture","abstract":"Language models are increasingly used to quantify cultural phenomena, but what makes such measurement distinctively cultural? 
This paper argues that NLP work on culture is a material-discursive practice: the apparatus -- model, data, annotation, evaluation -- participates in constituting the cultural reality it measures, rather than passively recording it. Drawing on Karen Barad's concept of the agential cut -- the contingent boundary between phenomenon and instrument -- I show that the apparatus's substantive design choices draw such boundaries, and that the boundary is entangled from the start because language models have already internalized much of the cultural material they measure. I illustrate this through three case studies on television and film dialogue (measuring structure, int…","authors":["Kent K. Chang"],"primaryCategory":"cs.CL","categories":["cs.CL"],"publishedAt":"2026-07-02T17:25:55Z","updatedAt":"2026-07-02T17:25:55Z","htmlUrl":"https://arxiv.org/abs/2607.02459v1","pdfUrl":"https://arxiv.org/pdf/2607.02459v1","doi":"10.18653/v1/2026.bigpicture-main.11"},{"arxivId":"2607.02447","version":"v1","title":"Understanding the Robustness of Distributed Self-Supervised Learning Frameworks Against Non-IID Data","abstract":"Recent research has introduced distributed self-supervised learning (D-SSL) approaches to leverage vast amounts of unlabeled decentralized data. However, D-SSL faces the critical challenge of data heterogeneity, and there is limited theoretical understanding of how different D-SSL frameworks respond to this challenge. To fill this gap, we present a rigorous theoretical analysis of the robustness of D-SSL frameworks under non-IID (non-independent and identically distributed) settings. 
Our results show that pre-training with Masked Image Modeling (MIM) is inherently more robust to heterogeneous data than Contrastive Learning (CL), and that the robustness of decentralized SSL increases with average network connectivity, implying that federated learning (FL) is no less robust than decentraliz…","authors":["Xuanyu Chen","Nan Yang","Shuai Wang","Dong Yuan"],"primaryCategory":"cs.LG","categories":["cs.LG"],"publishedAt":"2026-07-02T17:17:05Z","updatedAt":"2026-07-02T17:17:05Z","htmlUrl":"https://arxiv.org/abs/2607.02447v1","pdfUrl":"https://arxiv.org/pdf/2607.02447v1","doi":null},{"arxivId":"2607.02444","version":"v1","title":"Optimal Stabilizer Testing and Learning with Limited Quantum Memory","abstract":"We study stabilizer state testing and learning with limited coherent quantum memory. Here an algorithm sequentially receives copies of an unknown $n$-qubit state, but may keep only $k$ qubits of coherent quantum memory between measurements. With unrestricted memory, seminal work of Gross, Nezami and Walter showed how to test $n$-qubit stabilizer states using $6$ copies, which is dimension independent, unlike the learning complexity of $Θ(n)$. We show that this testing-vs-learning separation is lost under memory constraints. More concretely we show that (1) The sample complexity of testing stabilizer states in the $k$-qubit memory framework is $Θ(n-k)$. 
Our upper bound goes via a novel connection to the hidden shift problem and the lower bound is proven using a novel approach to average ca…","authors":["Srinivasan Arunachalam","Louis Schatzki"],"primaryCategory":"quant-ph","categories":["quant-ph","cs.CC","cs.DS","cs.IT","cs.LG"],"publishedAt":"2026-07-02T17:11:38Z","updatedAt":"2026-07-02T17:11:38Z","htmlUrl":"https://arxiv.org/abs/2607.02444v1","pdfUrl":"https://arxiv.org/pdf/2607.02444v1","doi":null},{"arxivId":"2607.02440","version":"v1","title":"EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments","abstract":"Autonomous agents are increasingly expected to improve executable policies through feedback, yet existing evaluations often collapse this process into a final score or confound it with open-ended software-engineering progress. We introduce Autonomous Policy Evolution, a controlled evaluation setting in which a harness-model agent repeatedly edits an executable policy system under a fixed interaction budget. We instantiate this setting in EvoPolicyGym, a benchmark built from compact interactive RL environments that evaluates how agents iteratively improve explored policies. On the EvoPolicyGym suite, GPT-5.5 achieves the strongest aggregate rank score and top-two performance on all 16 environments. Beyond leaderboard results, EvoPolicyGym also provides trajectory-level diagnostics that dis…","authors":["Zhilin Wang","Han Song","Runzhe Zhan","Jusen Du","Jiacheng Chen","Tianle Li","Qingyu Yin","Yulun Wu"],"primaryCategory":"cs.AI","categories":["cs.AI","cs.CL"],"publishedAt":"2026-07-02T17:10:13Z","updatedAt":"2026-07-02T17:10:13Z","htmlUrl":"https://arxiv.org/abs/2607.02440v1","pdfUrl":"https://arxiv.org/pdf/2607.02440v1","doi":null},{"arxivId":"2607.02437","version":"v1","title":"Extreme Adaptive Transformer for Time Series Forecasting","abstract":"Time series forecasting remains challenging when the underlying data contain rare but critical extreme events. 
This issue is particularly important in hydrologic forecasting, where streamflow distributions are often highly skewed and extreme peaks can have substantial impacts on flood monitoring, water resource management, and early warning systems. Although Transformer-based forecasting models have achieved strong performance by modeling long-range temporal dependencies, they typically treat all time points uniformly and may therefore underrepresent rare extreme patterns. In this paper, we propose the Extreme-Adaptive Transformer (Exformer), a forecasting framework designed to explicitly model temporal dependencies involving both normal and extreme events. Exformer introduces an extreme-…","authors":["Sanjeev Shrestha","Hui Liu","Yifan Zhang"],"primaryCategory":"cs.LG","categories":["cs.LG"],"publishedAt":"2026-07-02T17:09:14Z","updatedAt":"2026-07-02T17:09:14Z","htmlUrl":"https://arxiv.org/abs/2607.02437v1","pdfUrl":"https://arxiv.org/pdf/2607.02437v1","doi":null},{"arxivId":"2607.02436","version":"v1","title":"Reasoning effort, not tool access, buys first-try reliability in agentic code generation: an observational study","abstract":"Agentic coding assistants are increasingly given extra capabilities, such as browser based testing tools and design oriented system prompts, on the assumption that more capability yields better software. This study tested that assumption directly. Ninety independent agent runs built the same application, a real time retrospective board, from one detailed specification, each scored on a fixed 14 criterion functional rubric (42 point maximum) and a visual quality review. The runs spanned several model generations, two agent harnesses, two reasoning effort levels, a testing tool, and two design oriented prompts. Capability tier dominated: frontier models clustered near the ceiling while a low cost local model fell to 24 to 37 points. 
A criterion level analysis revealed what run totals concea…","authors":["Achint Mehta"],"primaryCategory":"cs.SE","categories":["cs.SE","cs.AI"],"publishedAt":"2026-07-02T17:08:21Z","updatedAt":"2026-07-02T17:08:21Z","htmlUrl":"https://arxiv.org/abs/2607.02436v1","pdfUrl":"https://arxiv.org/pdf/2607.02436v1","doi":null},{"arxivId":"2607.02435","version":"v1","title":"MARVEL: Margin-Aware Robust von Mises-Fischer Expert Learning for Long-Tailed Out-of-Distribution Detection","abstract":"For clinical deployment, it is essential that automated diagnostic systems remain reliable when confronted with previously unseen cases, yet deep models routinely misclassify out-of-distribution (OOD) inputs with high confidence, underscoring the need for more robust OOD detection methods. Although substantial effort has been devoted to improving model robustness, most of the existing literature assumes balanced datasets, evaluates OOD detection on coarse or non-clinical OOD sources, or lacks comprehensive assessment across diverse OOD scenarios. To address the gaps, we propose a novel methodology trained on diverse and imbalanced medical datasets and evaluated across a clinically reflective OOD spectrum. Our framework comprises three key components: (1) a Nonlinear von Mises-Fisher (NvMF…","authors":["A. S. 
Anudeep","Vaanathi Sundaresan"],"primaryCategory":"cs.CV","categories":["cs.CV","eess.IV"],"publishedAt":"2026-07-02T17:06:31Z","updatedAt":"2026-07-02T17:06:31Z","htmlUrl":"https://arxiv.org/abs/2607.02435v1","pdfUrl":"https://arxiv.org/pdf/2607.02435v1","doi":null},{"arxivId":"2607.02432","version":"v1","title":"Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach","abstract":"Scalable and reliable grading of command-line examinations remains a challenge in computing education, where rising enrolments make manual marking difficult and rule-based autograders cannot handle partial credit, equivalent solutions, or syntactic variation. This paper evaluates whether four frontier Large Language Models (GPT, Claude Opus, Gemini, and GLM) can approximate expert judgment when grading short Linux/bash command responses. The study adopts a four-level cognitive taxonomy that combines cognitive complexity and operational impact, ranging from information retrieval (L1) and basic file manipulation (L2) to structural operations (L3) and advanced system management (L4). The models were tested with two prompt variants, a minimal baseline and a rubric-enhanced version, on 1200 re…","authors":["Manuel Alonso-Carracedo","Ruben Fernandez-Boullon","Pedro Celard","Francisco J. Rodriguez-Martinez","Lorena Otero-Cerdeira"],"primaryCategory":"cs.AI","categories":["cs.AI","cs.CL","cs.CY"],"publishedAt":"2026-07-02T17:01:47Z","updatedAt":"2026-07-02T17:01:47Z","htmlUrl":"https://arxiv.org/abs/2607.02432v1","pdfUrl":"https://arxiv.org/pdf/2607.02432v1","doi":null},{"arxivId":"2607.02431","version":"v1","title":"WorldSample: Closed-loop Real-robot RL with World Modelling","abstract":"Reinforcement learning (RL) can overcome the demonstration-coverage limitation of imitation learning (IL) by allowing robots to improve through trial-and-error interaction beyond the states observed in demonstrations. 
However, deploying RL on real robots remains constrained by high interaction costs, since each physical rollout is costly and reflects only one realized action-outcome path. To address this challenge, we propose WorldSample, a physically grounded data augmentation framework for real-robot RL that closes a real-synthetic loop between physical rollouts, world-model generation, and policy improvement. Grounded on real rollouts, WorldSample generates high-fidelity synthetic transitions through a post-trained world model, which greatly lowers the visual hallucination. Specificall…","authors":["Yuquan Xue","Le Xu","Zeyi Liu","Zhenyu Wu","Zhengyi Gu","Xinyang Song","Bofang Jia","Ziwei Wang"],"primaryCategory":"cs.RO","categories":["cs.RO","cs.AI"],"publishedAt":"2026-07-02T17:00:37Z","updatedAt":"2026-07-02T17:00:37Z","htmlUrl":"https://arxiv.org/abs/2607.02431v1","pdfUrl":"https://arxiv.org/pdf/2607.02431v1","doi":null},{"arxivId":"2607.02428","version":"v1","title":"Self-Auditing Residual Drifting for Pathology-Preserving Accelerated Knee MRI","abstract":"Accelerated magnetic resonance imaging reduces acquisition time, but reconstruction from undersampled k-space can blur diagnostically relevant structures or introduce failures that are not captured by global image metrics. We propose SA-RDM-DC, a Self-Auditing Residual generative Drifting Model with Data Consistency for accelerated knee MRI. The method adapts the newly proposed generative drifting paradigm to accelerated MRI by training a physics-conditioned drift field from the zero-filled reconstruction toward the fully sampled residual correction. It predicts image- and missing-k-space residual corrections, enforces data consistency with acquired k-space, uses frequency-aware and residual drifting supervision to recover fine detail, and produces dense error maps and slice-level risk sc…","authors":["Qing Lyu","Jianxu Wang","Mohammad Kawas","Ge Wang","Christopher T. 
Whitlow"],"primaryCategory":"eess.IV","categories":["eess.IV","cs.CV","physics.med-ph"],"publishedAt":"2026-07-02T16:59:03Z","updatedAt":"2026-07-02T16:59:03Z","htmlUrl":"https://arxiv.org/abs/2607.02428v1","pdfUrl":"https://arxiv.org/pdf/2607.02428v1","doi":null},{"arxivId":"2607.02426","version":"v1","title":"QFedAgent: Quantum-Enhanced Personalized Federated Learning for Multi-Agent Activity Recognition","abstract":"Federated learning (FL) enables collaborative model training across distributed devices without sharing raw data, making it suitable for privacy-sensitive robotic sensing applications. However, multi-agent systems generate heterogeneous and non-independent and identically distributed (non-IID) multimodal sensor streams that degrade conventional FL algorithms, while classical fusion modules introduce substantial parameter overhead and communication cost. This paper proposes QFedAgent, a hybrid quantum-classical personalized FL framework for multi-agent activity recognition. The approach integrates a variational quantum circuit fusion module that models accelerometer--gyroscope interactions through quantum state encoding and entanglement, requiring only 72 quantum rotation parameters versus…","authors":["Quoc Bao Phan","Tuy Tan Nguyen"],"primaryCategory":"cs.LG","categories":["cs.LG","cs.AI"],"publishedAt":"2026-07-02T16:54:35Z","updatedAt":"2026-07-02T16:54:35Z","htmlUrl":"https://arxiv.org/abs/2607.02426v1","pdfUrl":"https://arxiv.org/pdf/2607.02426v1","doi":null},{"arxivId":"2607.02425","version":"v1","title":"Learning to Evolve Scenes: Reasoning about Human Activities with Scene Graphs","abstract":"Understanding human behavior while interacting with the surrounding world is crucial for many applications of embodied AI. First-person videos are particularly informative for this problem, as they well capture how activities reshape the scene over time. 
However, existing approaches often rely on implicit visual or language-aligned representations, disregarding structured reasoning over the scene dynamic. We argue that explicit, compositional and editable representations of human-environment interactions can play a crucial role for rich grounded activity understanding. To this end, we introduce SG-Ego, a large scale annotation set extending Ego4D with spatio-temporal scene graphs, where relations triplets are consolidated over time into explicit time-evolving descriptions of the scene sta…","authors":["Francesca Pistilli","Simone Alberto Peirone","Giuseppe Averta"],"primaryCategory":"cs.CV","categories":["cs.CV"],"publishedAt":"2026-07-02T16:53:42Z","updatedAt":"2026-07-02T16:53:42Z","htmlUrl":"https://arxiv.org/abs/2607.02425v1","pdfUrl":"https://arxiv.org/pdf/2607.02425v1","doi":null},{"arxivId":"2607.02423","version":"v1","title":"Neuron-Aware Active Few-Shot Learning for LLMs","abstract":"Active Few-Shot Learning (AFSL) adapts LLMs to specialized domains by identifying the most valuable unlabeled samples for annotation and use as few-shot demonstrations, effectively reducing human annotation costs while promoting high performance. However, existing methods typically rely on output-level signals for sample identification, such as predictive entropy or semantic similarities with test-time data based on external embeddings, which often overlook models' internal dynamics, which could pinpoint specific knowledge gaps. To bridge this gap, we propose NeuFS, a Neuron-Aware Active Few-Shot Learning framework that shifts the selection paradigm from output-level proxies to models' internal dynamics. 
NeuFS utilizes neuron activation patterns to represent sample directly, and includes …","authors":["Zhuowei Chen","Liwei Chen","Christian Schunn","Raquel Coelho","Xiang Lorraine Li"],"primaryCategory":"cs.LG","categories":["cs.LG","cs.AI"],"publishedAt":"2026-07-02T16:51:11Z","updatedAt":"2026-07-02T16:51:11Z","htmlUrl":"https://arxiv.org/abs/2607.02423v1","pdfUrl":"https://arxiv.org/pdf/2607.02423v1","doi":null},{"arxivId":"2607.02421","version":"v1","title":"Wavelet-Guided Semantic Signal Compensation for Inversion-Free Image Editing","abstract":"Text-guided image editing aims to modify visual content according to a target prompt while preserving the background. Recent inversion-free image editing frameworks such as FlowEdit have demonstrated strong editing capability without requiring inversion. Empirically, FlowEdit can achieve substantial semantic changes under appropriate hyperparameter settings. However, we observe that under certain global attribute shifts, the editing trajectory may not effectively move away from the source distribution in the early timesteps. Our analysis suggests that in the high-noise regime, the dominant manifold-seeking flow toward the data manifold can reduce the influence of the text-conditioned direction, leading to limited global modification while background structures remain only moderately prese…","authors":["Anqi Tang","Wenhao Sun","Zhaoqiang Liu"],"primaryCategory":"cs.CV","categories":["cs.CV"],"publishedAt":"2026-07-02T16:50:26Z","updatedAt":"2026-07-02T16:50:26Z","htmlUrl":"https://arxiv.org/abs/2607.02421v1","pdfUrl":"https://arxiv.org/pdf/2607.02421v1","doi":null},{"arxivId":"2607.02417","version":"v1","title":"LIME: Learning Intent-aware Camera Motion from Egocentric Video","abstract":"Autonomous robots often need to move their camera before they can act: to inspect an object, reveal an occluded region, or obtain a view that responds to a user's intent. 
While vision-language navigation translates instructions to base motion and vision-language-action policies map instructions to manipulation actions, language-conditioned camera motion remains comparatively underexplored as a first-class action. We formulate language-conditioned camera motion generation: given a current RGB observation and a free-form natural-language intent, predict a relative target camera pose for the next observation. This task is inherently non-trivial: viewpoint changes are driven by latent perceptual intentions, and a valid motion may operate at different semantic granularity, from entering a room…","authors":["Boyang Sun","Jiajie Li","Yung-Hsu Yang","Chenyangguang Zhang","Tim Engelbracht","Sunghwan Hong","Cesar Cadena","Marc Pollefeys"],"primaryCategory":"cs.RO","categories":["cs.RO","cs.CV","cs.LG"],"publishedAt":"2026-07-02T16:48:43Z","updatedAt":"2026-07-02T16:48:43Z","htmlUrl":"https://arxiv.org/abs/2607.02417v1","pdfUrl":"https://arxiv.org/pdf/2607.02417v1","doi":null},{"arxivId":"2607.02416","version":"v1","title":"The Future of NLP may not be at NLP Conferences: Scholarly Migration Patterns in Natural Language Processing","abstract":"Natural Language Processing (NLP) has traditionally been published in its core disciplinary venues like ACL. However, advances in Large Language Models (LLMs) has led to a blurring of the disciplinary lines between NLP and general Machine Learning (ML), with authors regularly publishing in venues from both fields. Here, we ask whether the disciplinary center of gravity is shifting. Using NLP research published from 2010 to 2026 and studies of both established and new authors, we find that a migration is taking place. First, comparing the pre- and post-LLM eras, established authors lost 19.2pp of share at flagship *ACL main-conference tracks while gaining 14.8pp in the newer Findings tracks, and general ML venues rose 8.6pp, even when adjusting for parallel growth in the fields. 
Second, am…","authors":["David Jurgens"],"primaryCategory":"cs.CL","categories":["cs.CL"],"publishedAt":"2026-07-02T16:47:14Z","updatedAt":"2026-07-02T16:47:14Z","htmlUrl":"https://arxiv.org/abs/2607.02416v1","pdfUrl":"https://arxiv.org/pdf/2607.02416v1","doi":null},{"arxivId":"2607.02413","version":"v1","title":"Q-GAIN: A Python Package for Machine Learning and Physically Informed Analysis Applications","abstract":"Here we describe the quantum gas analysis and inference (Q-GAIN) Python package, which enables rapid deployment of machine learning (ML) and physics-informed analysis techniques for cold-atom experiments. Out of the box, Q-GAIN implements classification, object detection, and physics-informed metrics for feature detection in images of atomic Bose-Einstein condensates (BECs). Q-GAIN encourages a natural, module-based workflow: starting with data loading and preprocessing, followed by ML-based feature identification, and ending with conventional analysis techniques. We demonstrate this modularity by configuring Q-GAIN for three ML tasks. First, we demonstrate the basic workflow of the Q-GAIN framework by implementing the standard task of classifying handwritten digits from the MNIST dataset…","authors":["M. Doris","S. Guo","S. M. Koh","L. Ritter","A. R. Fritsch","S. Mukherjee","I. B. Spielman","J. P. Zwolak"],"primaryCategory":"cond-mat.quant-gas","categories":["cond-mat.quant-gas","cs.LG"],"publishedAt":"2026-07-02T16:45:34Z","updatedAt":"2026-07-02T16:45:34Z","htmlUrl":"https://arxiv.org/abs/2607.02413v1","pdfUrl":"https://arxiv.org/pdf/2607.02413v1","doi":null},{"arxivId":"2607.02407","version":"v1","title":"Text-Driven 3D Indoor Scene Synthesis in Non-Manhattan Environments","abstract":"Large Language Models (LLMs) have demonstrated remarkable capabilities in 3D indoor synthesis for Manhattan environments. 
However, existing methods often fail to capture plausible object layout patterns in non-Manhattan settings, primarily because they struggle to model non-orthogonal spatial relationships, leading to high geometric violations and low physical fidelity. To address this challenge, we propose SPG-Layout, a novel text-driven framework designed to generate physically plausible indoor scenes within complex non-Manhattan environments. Specifically, we first utilize statistical priors of object distributions to guide the training process, enhancing environmental understanding and fidelity. Furthermore, mirroring human design workflows, we adopt a hierarchical layout strategy tha…","authors":["Xianhui Meng","Zirui Song","Yuchen Zhang","Li Zhang","Yongxuan Lv","Xiuying Chen","Kun Wang","Yan Luo"],"primaryCategory":"cs.AI","categories":["cs.AI","cs.CV"],"publishedAt":"2026-07-02T16:40:08Z","updatedAt":"2026-07-02T16:40:08Z","htmlUrl":"https://arxiv.org/abs/2607.02407v1","pdfUrl":"https://arxiv.org/pdf/2607.02407v1","doi":null},{"arxivId":"2607.02404","version":"v1","title":"Object-centric LeJEPA","abstract":"Image encoders trained with LeJEPA can deliver strong features for downstream tasks, but, like other image-level self-supervised methods, typically require large training datasets. Aligning representations at the level of objects rather than whole scenes promises greater data efficiency, but doing this in a completely self-supervised way, effectively jointly partitioning a scene and representing its objects, is unstable: the two are locked in a cyclic dependency, partitioning requires meaningful representations, while meaningful representations require consistent partitioning. We sidestep this instability by taking object masks as given during training, using cheap, off-the-shelf SAM proposals. 
We extend LeJEPA - whose distributional anti-collapse objective ports naturally from whole imag…","authors":["Jakob Geusen","Ender Konukoglu"],"primaryCategory":"cs.CV","categories":["cs.CV","cs.LG"],"publishedAt":"2026-07-02T16:38:21Z","updatedAt":"2026-07-02T16:38:21Z","htmlUrl":"https://arxiv.org/abs/2607.02404v1","pdfUrl":"https://arxiv.org/pdf/2607.02404v1","doi":null},{"arxivId":"2607.02403","version":"v1","title":"ACID: Action Consistency via Inverse Dynamics for Planning with World Models","abstract":"Decision-time planning with action-conditioned world models has become a popular paradigm for embodied control. However, the standard planning cost judges a candidate solely by how close its predicted terminal state lies to the goal, leaving the realizability of the intermediate transitions unchecked -- a predicted trajectory can look convincing while the environment rollout drifts away from it. In this paper, we propose ACID, a decision-time planning framework that introduces cycle action consistency: the action inferred backward from a predicted transition by an inverse dynamics model should recover the one that was conditioned on. We fold this per-step residual into the planning cost via a scale-invariant adaptive weight. 
Across four action-conditioned world models and six tasks spanni…","authors":["Gawon Seo","Dongwon Kim","Suha Kwak"],"primaryCategory":"cs.RO","categories":["cs.RO","cs.AI","cs.CV"],"publishedAt":"2026-07-02T16:38:10Z","updatedAt":"2026-07-02T16:38:10Z","htmlUrl":"https://arxiv.org/abs/2607.02403v1","pdfUrl":"https://arxiv.org/pdf/2607.02403v1","doi":null}],"summary":{"by_primary_category":{"cs.CV":15,"cs.AI":8,"cs.CL":7,"cs.LG":8,"cs.RO":6,"cs.SE":2,"cs.CY":1,"quant-ph":1,"eess.IV":1,"cond-mat.quant-gas":1},"top_authors":[{"author":"Yawei Luo","count":2},{"author":"Yi Yang","count":2},{"author":"Marc Pollefeys","count":2},{"author":"Pengyu Nie","count":2},{"author":"Zhuowei Chen","count":2},{"author":"Xiang Lorraine Li","count":2},{"author":"Hanlin Wang","count":1},{"author":"Hao Ouyang","count":1},{"author":"Qiuyu Wang","count":1},{"author":"Wen Wang","count":1}]}}}