Tag: Multimodal
500 articles archived under #multimodal
Hugging Face Daily Papers research 29d ago SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks Abstract SCOPE is a self-play framework that trains language models on open-ended tasks through policy co-evolution, achieving superior performance on both targeted and held-out benchmarks without external supervision. AI-generated summary Self-play can train language models… 15
NVIDIA Developer Blog official-blog 29d ago How to Post-Train Autonomous Vehicle Models in Closed-Loop with NVIDIA Alpamayo Developing autonomous vehicle (AV) policies requires bridging an important gap between training and deployment. Vision-language-action (VLA) models that can... 26
Hugging Face Daily Papers research 29d ago AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling Abstract A unified multimodal framework for human motion generation that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer to enable high-quality synthesis across arbitrary modality combinations. AI-generated summary Conditional human… 8
arXiv — Machine Learning research 29d ago VeriGate: Verifier-Gated Step-Level Supervision for GRPO arXiv:2605.30451v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO) is an effective recipe for training reasoning models with verifier-based outcome rewards, but its supervision is sparse: when all sampled trajectories for a prompt receive the same verifier… 26
arXiv — Machine Learning research 29d ago LARK: Learnability-Grounded Trajectory Selection for Efficient Reasoning Distillation arXiv:2605.30651v1 Announce Type: new Abstract: We study trajectory selection for reasoning distillation, where teacher-generated reasoning trajectories are selectively used as supervision for a student model.
Existing methods rely on heuristics such as trajectory quality or… 15
arXiv — Machine Learning research 29d ago BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies arXiv:2605.30660v1 Announce Type: new Abstract: Test-time scaling for vision-language-action (VLA) policies, methods such as RoboMonkey, SEAL, MG-Select, and V-GPS, samples K candidate action chunks at inference and executes the verifier-best. When all K candidates are unsafe,… 10
arXiv — Machine Learning research 29d ago Diversity Matters: Revisiting Test-Time Compute in Vision-Language Models arXiv:2605.30713v1 Announce Type: new Abstract: Test-time compute (TTC) strategies have emerged as a lightweight approach to boost reasoning in large language models (LLMs). However, their application and benefits for vision-language models (VLMs) remain underexplored. We… 18
arXiv — NLP / Computation & Language research 29d ago Your Multimodal Speech Model Says I Have a Face for Radio arXiv:2605.30472v1 Announce Type: new Abstract: As large neural models have become better at language tasks, researchers are increasingly building multi- and omnimodal models that handle more modalities of data. One example is the expansion of speech recognition models to… 35
arXiv — NLP / Computation & Language research 29d ago TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation arXiv:2605.30673v1 Announce Type: new Abstract: Classroom videos contain observable teaching practices, but their pedagogical and visual signals are rarely organized in forms suitable for model evaluation.
We present TeachObs, a human-validated benchmark for multimodal… 26
arXiv — NLP / Computation & Language research 29d ago Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation arXiv:2605.30833v1 Announce Type: new Abstract: On-policy distillation transfers reasoning capabilities by training a student model on its own generated trajectories using token-level feedback from a teacher. However, we identify a critical bottleneck, Supervision… 33
arXiv — NLP / Computation & Language research 29d ago MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft arXiv:2605.30931v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and… 18
arXiv — NLP / Computation & Language research 29d ago FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection arXiv:2605.31349v1 Announce Type: new Abstract: Hateful meme detection remains a formidable challenge for vision-language models, as existing benchmarks are structurally observational - confounding rhetorical hate mechanisms with target community features and preventing causal… 7
arXiv — NLP / Computation & Language research 29d ago Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely arXiv:2605.31387v1 Announce Type: new Abstract: Robots operating in diverse environments rely on visual input to interpret objects and spatial layouts. In human-collaborative tasks, they are expected to communicate this understanding through language.
Vision-language models… 32
arXiv — NLP / Computation & Language research 29d ago "Înțelegi Românește?" A Recipe for Romanian Vision-Language Models arXiv:2605.31401v1 Announce Type: new Abstract: Vision-Language Models (VLMs) largely follow the text-only LLM trajectory, excelling on English benchmarks but sharply degrading on low-resource languages, where neither large-scale image-text corpora nor culturally grounded… 23
arXiv — NLP / Computation & Language research 29d ago SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks arXiv:2605.31433v1 Announce Type: new Abstract: Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a… 9
arXiv — NLP / Computation & Language research 29d ago Fine-grained Verification via Diagnostic Reasoning Supervision for Aspect Sentiment Triplet Extraction arXiv:2605.31446v1 Announce Type: new Abstract: Aspect Sentiment Triplet Extraction (ASTE) aims to identify aspect terms, opinion terms, and sentiment polarities as structured triplets, providing essential inputs for downstream information system applications such as opinion… 26
Hugging Face Daily Papers research 29d ago Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer Abstract SwanSphere presents a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts using causal autoregressive diffusion transformers and multimodal learning strategies.
AI-generated summary Real-time and accurate spatial… 25
Hugging Face Daily Papers research 29d ago Task-Focused Memorization for Multimodal Agents Abstract A reinforcement-learning-based framework called TaskMem is introduced to dynamically determine what information to store in long-term memory for multimodal agents, improving performance on streaming video benchmarks. AI-generated summary Long-term memory is essential… 9
Hugging Face Daily Papers research 29d ago GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration Abstract Generative multimodal foundation models are used to create high-quality training data for image restoration, improving model generalization across diverse real-world scenarios. AI-generated summary Real-world image restoration (IR) is bottlenecked by the scarcity of… 19
Hugging Face Daily Papers research 29d ago Linear Scaling Video VLMs for Long Video Understanding Abstract StateKV enables efficient long-video vision-language model inference by maintaining cross-frame context in a fixed-capacity recurrent state while using a full per-frame cache for decoding, achieving linear-time prefill with minimal accuracy loss compared to full… 7
Hugging Face Daily Papers research 29d ago Representation Forcing for Bottleneck-Free Unified Multimodal Models Abstract Representation Forcing enables unified multimodal models to perform both perception and generation tasks end-to-end without relying on external latent spaces, matching state-of-the-art performance in image generation while improving understanding capabilities.… 27
Hugging Face Daily Papers research 29d ago VLM3: Vision Language Models Are Native 3D Learners Abstract Vision Language Models can be adapted for 3D understanding tasks through simple architectural modifications and text-based training, achieving performance comparable to specialized vision models without requiring complex designs or extensive data augmentation.… 5
Hugging Face Daily Papers research 29d ago Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring Abstract Hide-and-Seek framework detects robot execution failures in vision-language-action models by localizing failure-indicative actions through contrastive learning from trajectory-level supervision without step-level annotations. AI-generated summary Vision-Language-Action… 18
r/LocalLLaMA community 29d ago MiniMax M3 - Coding & Agentic Frontier, 1M Context, Multimodal submitted by /u/dryadofelysium 14
r/MachineLearning community 1mo ago How would you model this "strand" clustering problem? [P] https://preview.redd.it/llqlupnwng4h1.png?width=2188&format=png&auto=webp&s=7fae5860babaffa1c8bfdcb1468b374eb38ac55d I'm currently building a computer vision application. I've managed to successfully train a YOLO model to detect the object I'm interested in for my videos. The… 33
r/LocalLLaMA community 1mo ago Stepfun 3.7 Flash is very good If you can fit Stepfun 3.7 Flash into RAM, try it! It's feeling close to GLM 5.1 quality in terms of aesthetics, and around 80% in terms of 3D world understanding. However since it's only 25% of the params of GLM 5.1, and it has built in vision, it's feeling like nothing else… 22
Vercel — AI dev-tools 1mo ago MiniMax M3 on AI Gateway MiniMax M3 is now available on Vercel AI Gateway. M3 is MiniMax's first model with a 1M-token context window and native multimodality, built around MiniMax Sparse Attention (MSA). M3 improves on software engineering, terminal-based tool use, and agentic web browsing, and is… 8
r/MachineLearning community 1mo ago Event like spiking neuron lib that fits into the CPU cache [P] I benchmarked it against PyTorch with a Wikipedia dataset.
I heavily used Gemini Flash 3.5 to build out my vision https://huggingface.co/etoxin/neuronguard-wikipedia-classifier submitted by /u/Logical_Prompt_3543 21
Hugging Face Daily Papers research 1mo ago Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning Abstract Training Vision-Language Models with geometric priors improves 3D spatial reasoning through deep supervision with contrastive loss and depth consistency, achieving better performance than standard fine-tuning approaches. AI-generated summary Vision-Language Models… 25
The Information — AI news-outlet 1mo ago Meta Plans an AI Pendant as Part of Ambitious Wearables Expansion Meta Platforms plans to start testing an AI pendant in the next year as part of an ambitious roadmap for wearable devices aimed at reversing the huge losses in its hardware division. An internal memo describing the roadmap, reviewed by The Information, also lays out plans to ... 12
The Information — AI news-outlet 1mo ago Meta Memo Outlines Ambitious Hardware Plans, Including New AI Pendant Meta Platforms plans to start testing an AI pendant in the next year as part of an ambitious roadmap for wearable devices aimed at reversing the huge losses in its hardware division. An internal memo describing the roadmap, reviewed by The Information, also lays out plans to… 20
Hugging Face Daily Papers research 1mo ago Reflective Prompt Tuning through Language Model Function-Calling Abstract Reflective Prompt Tuning (RPT) automates prompt optimization for large language models by simulating human iterative engineering through diagnostic feedback and memory-based revision cycles.
AI-generated summary Large language models (LLMs) have become increasingly… 26
Hugging Face Daily Papers research 1mo ago Why Far Looks Up: Probing Spatial Representation in Vision-Language Models Abstract Vision-language models exhibit entangled spatial representations that correlate vertical image position with distance, impacting reasoning robustness and performance across benchmarks. AI-generated summary Vision-language models (VLMs) achieve strong performance on… 15
Hugging Face Daily Papers research 1mo ago PANDO: Efficient Multimodal AI Agents via Online Skill Distillation Abstract PANDO is a web agent framework that improves efficiency through experience accumulation by reducing redundant actions, optimizing skill discovery, and enhancing prompt caching without sacrificing performance. AI-generated summary Recent advances in multimodal web agents… 29
Hugging Face Daily Papers research 1mo ago DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation Abstract DynaFLIP is a dynamics-aware multimodal pre-training framework that enhances robot manipulation by integrating motion understanding into visual perception through image-language-3D flow triplets and geometric regularization techniques. AI-generated summary Robot… 22
Hugging Face Daily Papers research 1mo ago Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection Abstract A parameter-efficient vision-language model is developed for time-series anomaly detection using a novel benchmark with natural-language rationales, achieving superior performance and generalization across multiple datasets.
AI-generated summary Recent advances in… 38
Hugging Face Daily Papers research 1mo ago Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation Abstract A novel single-shot 3D Gaussian head avatar generation method called MVCHead uses hierarchical state space models and multi-view consistency enforcement to create high-fidelity 3D assets from 2D images without requiring multi-view data or 3D supervision. AI-generated… 23
r/LocalLLaMA community 1mo ago Llama.cpp B9406 MTP mmproj fix B9406 Been waiting for this one. Building now. Report your results if you test! GGML_ASSERT(i01 >= 0 && i01 < ne01) crash in get_rows / mtmd_helper_decode_image_chunk when using MTP + MoE model + vision (Qwen3.6-35B-A3B) submitted by /u/Bulky-Priority6824 … 32
Hugging Face Daily Papers research 1mo ago EarlyTom: Early Token Compression Completes Fast Video Understanding Abstract EarlyTom is a training-free framework that compresses visual tokens early in the vision encoder to reduce time-to-first-token and computational costs while maintaining model accuracy. AI-generated summary Video large language models (Video-LLMs) have demonstrated strong… 9
Hugging Face Daily Papers research 1mo ago Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering Abstract CorVer, a corpus-grounded reward mechanism, enhances factual accuracy in question answering by providing efficient sentence-level feedback through Wikipedia co-occurrence statistics, outperforming neural verifiers while reducing training time. AI-generated summary… 13
Hugging Face Daily Papers research 1mo ago Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation Abstract Multi-agent system for generating reliable, visually informative multimodal reports by interleaving textual and visual evidence through specialized agents and verification mechanisms.
AI-generated summary Large Language Models (LLMs) have advanced autonomous agents from… 8
arXiv — Machine Learning research 1mo ago Emergent Semantic Representations in World Models through Physical Interaction without Linguistic Supervision arXiv:2605.28865v1 Announce Type: new Abstract: What does a world model learn from physical exploration, without any linguistic supervision? We argue the answer is organized by a single principle: the geometric structure of the physical world. Training a VAE-based world model on… 25
arXiv — Machine Learning research 1mo ago PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation arXiv:2605.28867v1 Announce Type: new Abstract: Generating high-quality time-series data is challenging because real-world signals often exhibit multimodal patterns and multiscale dynamics, including oscillations and high-frequency variations. Flow Matching (FM) offers an… 10
arXiv — Machine Learning research 1mo ago Balancing Multimodal Learning through Label Space Reshaping arXiv:2605.28869v1 Announce Type: new Abstract: Multimodal learning often suffers from modality imbalance, where modalities that converge faster dominate optimization while others remain undertrained. Existing approaches typically mitigate this issue by strengthening the weak… 4
arXiv — Machine Learning research 1mo ago TRACER: Persistent Regularization for Robust Multimodal Finetuning arXiv:2605.29380v1 Announce Type: new Abstract: Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting.
In this paper, we develop a theoretical framework for multimodal… 37
arXiv — Machine Learning research 1mo ago Rethinking Post-Training Recipes for Multimodal Time-Series Forecasting arXiv:2605.29401v1 Announce Type: new Abstract: Time-Series Foundation Models (TSFMs) excel at zero-shot unimodal forecasting using numerical data, but unlike LLMs they cannot consume multimodal, non-numerical context that often shape real-world trajectories. In this work, we… 16
arXiv — Machine Learning research 1mo ago AsymVLM: Asymmetric Token Pruning for Efficient Vision-Language Model Inference arXiv:2605.29535v1 Announce Type: new Abstract: Vision-Language Models (VLMs) process thousands of visual tokens per image alongside comparatively few text tokens, yet existing compression methods treat both modalities uniformly. We observe that the two modalities have… 28
arXiv — NLP / Computation & Language research 1mo ago Lightweight Multimodal LLM-Enabled Cost-Effective Defect Grading of Power Transmission Equipment arXiv:2605.28822v1 Announce Type: new Abstract: Defect grading of power transmission equipment (DGPTE) is crucial to the stability of electric energy transmission. Although existing machine learning methods exhibit strong capabilities in defect detection, they are plagued by… 30
arXiv — NLP / Computation & Language research 1mo ago Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception arXiv:2605.29064v1 Announce Type: new Abstract: We study how persona prompting shapes language generated by multimodal large language models in an urban perception setting.
Using 59,808 annotations from 1,200 persona-conditioned agents and two no-persona settings, we analyze… 16
arXiv — NLP / Computation & Language research 1mo ago Casual as an Anchor: Resolving Supervision Misalignment in Formality Transfer Dataset arXiv:2605.29365v1 Announce Type: new Abstract: Formality transfer is commonly framed as a symmetric bidirectional task between informal and formal registers. We argue that this framing conceals a supervision design flaw in existing benchmarks such as GYAFC: binary human… 20