Tag

Multimodal

500 articles archived under #multimodal

Hugging Face Daily Papers research 29d ago

SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

Abstract SCOPE is a self-play framework that trains language models on open-ended tasks through policy co-evolution, achieving superior performance on both targeted and held-out benchmarks without external supervision. AI-generated summary Self-play can train language models…

15
NVIDIA Developer Blog official-blog 29d ago

How to Post-Train Autonomous Vehicle Models in Closed-Loop with NVIDIA Alpamayo

Developing autonomous vehicle (AV) policies requires bridging an important gap between training and deployment. Vision-language-action (VLA) models that can...

26
Hugging Face Daily Papers research 29d ago

AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

Abstract A unified multimodal framework for human motion generation that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer to enable high-quality synthesis across arbitrary modality combinations. AI-generated summary Conditional human…

8
arXiv — Machine Learning research 29d ago

VeriGate: Verifier-Gated Step-Level Supervision for GRPO

arXiv:2605.30451v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO) is an effective recipe for training reasoning models with verifier-based outcome rewards, but its supervision is sparse: when all sampled trajectories for a prompt receive the same verifier…

26
arXiv — Machine Learning research 29d ago

LARK: Learnability-Grounded Trajectory Selection for Efficient Reasoning Distillation

arXiv:2605.30651v1 Announce Type: new Abstract: We study trajectory selection for reasoning distillation, where teacher-generated reasoning trajectories are selectively used as supervision for a student model. Existing methods rely on heuristics such as trajectory quality or…

15
arXiv — Machine Learning research 29d ago

BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies

arXiv:2605.30660v1 Announce Type: new Abstract: Test-time scaling for vision-language-action (VLA) policies, methods such as RoboMonkey, SEAL, MG-Select, and V-GPS, samples K candidate action chunks at inference and executes the verifier-best. When all K candidates are unsafe,…

10
arXiv — Machine Learning research 29d ago

Diversity Matters: Revisiting Test-Time Compute in Vision-Language Models

arXiv:2605.30713v1 Announce Type: new Abstract: Test-time compute (TTC) strategies have emerged as a lightweight approach to boost reasoning in large language models (LLMs). However, their application and benefits for vision-language models (VLMs) remain underexplored. We…

18
arXiv — NLP / Computation & Language research 29d ago

Your Multimodal Speech Model Says I Have a Face for Radio

arXiv:2605.30472v1 Announce Type: new Abstract: As large neural models have become better at language tasks, researchers are increasingly building multi- and omnimodal models that handle more modalities of data. One example is the expansion of speech recognition models to…

35
arXiv — NLP / Computation & Language research 29d ago

TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation

arXiv:2605.30673v1 Announce Type: new Abstract: Classroom videos contain observable teaching practices, but their pedagogical and visual signals are rarely organized in forms suitable for model evaluation. We present TeachObs, a human-validated benchmark for multimodal…

26
arXiv — NLP / Computation & Language research 29d ago

Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation

arXiv:2605.30833v1 Announce Type: new Abstract: On-policy distillation transfers reasoning capabilities by training a student model on its own generated trajectories using token-level feedback from a teacher. However, we identify a critical bottleneck, Supervision…

33
arXiv — NLP / Computation & Language research 29d ago

MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

arXiv:2605.30931v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and…

18
arXiv — NLP / Computation & Language research 29d ago

FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection

arXiv:2605.31349v1 Announce Type: new Abstract: Hateful meme detection remains a formidable challenge for vision-language models, as existing benchmarks are structurally observational - confounding rhetorical hate mechanisms with target community features and preventing causal…

7
arXiv — NLP / Computation & Language research 29d ago

Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely

arXiv:2605.31387v1 Announce Type: new Abstract: Robots operating in diverse environments rely on visual input to interpret objects and spatial layouts. In human-collaborative tasks, they are expected to communicate this understanding through language. Vision-language models…

32
arXiv — NLP / Computation & Language research 29d ago

"Înțelegi Românește?" A Recipe for Romanian Vision-Language Models

arXiv:2605.31401v1 Announce Type: new Abstract: Vision-Language Models (VLMs) largely follow the text-only LLM trajectory, excelling on English benchmarks but sharply degrading on low-resource languages, where neither large-scale image-text corpora nor culturally grounded…

23
arXiv — NLP / Computation & Language research 29d ago

SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

arXiv:2605.31433v1 Announce Type: new Abstract: Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a…

9
arXiv — NLP / Computation & Language research 29d ago

Fine-grained Verification via Diagnostic Reasoning Supervision for Aspect Sentiment Triplet Extraction

arXiv:2605.31446v1 Announce Type: new Abstract: Aspect Sentiment Triplet Extraction (ASTE) aims to identify aspect terms, opinion terms, and sentiment polarities as structured triplets, providing essential inputs for downstream information system applications such as opinion…

26
Hugging Face Daily Papers research 29d ago

Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

Abstract SwanSphere presents a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts using causal autoregressive diffusion transformers and multimodal learning strategies. AI-generated summary Real-time and accurate spatial…

25
Hugging Face Daily Papers research 29d ago

Task-Focused Memorization for Multimodal Agents

Abstract A reinforcement-learning-based framework called TaskMem is introduced to dynamically determine what information to store in long-term memory for multimodal agents, improving performance on streaming video benchmarks. AI-generated summary Long-term memory is essential…

9
Hugging Face Daily Papers research 29d ago

GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

Abstract Generative multimodal foundation models are used to create high-quality training data for image restoration, improving model generalization across diverse real-world scenarios. AI-generated summary Real-world image restoration (IR) is bottlenecked by the scarcity of…

19
Hugging Face Daily Papers research 29d ago

Linear Scaling Video VLMs for Long Video Understanding

Abstract StateKV enables efficient long-video vision-language model inference by maintaining cross-frame context in a fixed-capacity recurrent state while using a full per-frame cache for decoding, achieving linear-time prefill with minimal accuracy loss compared to full…

7
Hugging Face Daily Papers research 29d ago

Representation Forcing for Bottleneck-Free Unified Multimodal Models

Abstract Representation Forcing enables unified multimodal models to perform both perception and generation tasks end-to-end without relying on external latent spaces, matching state-of-the-art performance in image generation while improving understanding capabilities.…

27
Hugging Face Daily Papers research 29d ago

VLM3: Vision Language Models Are Native 3D Learners

Abstract Vision Language Models can be adapted for 3D understanding tasks through simple architectural modifications and text-based training, achieving performance comparable to specialized vision models without requiring complex designs or extensive data augmentation.…

5
Hugging Face Daily Papers research 29d ago

Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring

Abstract Hide-and-Seek framework detects robot execution failures in vision-language-action models by localizing failure-indicative actions through contrastive learning from trajectory-level supervision without step-level annotations. AI-generated summary Vision-Language-Action…

18
r/LocalLLaMA community 29d ago

MiniMax M3 - Coding & Agentic Frontier, 1M Context, Multimodal

submitted by /u/dryadofelysium

14
r/MachineLearning community 1mo ago

How would you model this "strand" clustering problem? [P]

https://preview.redd.it/llqlupnwng4h1.png?width=2188&format=png&auto=webp&s=7fae5860babaffa1c8bfdcb1468b374eb38ac55d I'm currently building a computer vision application. I've managed to successfully train a YOLO model to detect the object I'm interested in for my videos. The…

33
r/LocalLLaMA community 1mo ago

Stepfun 3.7 Flash is very good

If you can fit Stepfun 3.7 Flash into RAM, try it! It's feeling close to GLM 5.1 quality in terms of aesthetics, and around 80% in terms of 3D world understanding. However, since it's only 25% of the params of GLM 5.1, and it has built-in vision, it's feeling like nothing else…

22
Vercel — AI dev-tools 1mo ago

MiniMax M3 on AI Gateway

MiniMax M3 is now available on Vercel AI Gateway. M3 is MiniMax's first model with a 1M-token context window and native multimodality, built around MiniMax Sparse Attention (MSA). M3 improves on software engineering, terminal-based tool use, and agentic web browsing, and is…

8
r/MachineLearning community 1mo ago

Event like spiking neuron lib that fits into the CPU cache [P]

I benchmarked it against PyTorch with a Wikipedia dataset. I heavily used Gemini Flash 3.5 to build out my vision. https://huggingface.co/etoxin/neuronguard-wikipedia-classifier submitted by /u/Logical_Prompt_3543

21
Hugging Face Daily Papers research 1mo ago

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Abstract Training Vision-Language Models with geometric priors improves 3D spatial reasoning through deep supervision with contrastive loss and depth consistency, achieving better performance than standard fine-tuning approaches. AI-generated summary Vision-Language Models…

25
The Information — AI news-outlet 1mo ago

Meta Plans an AI Pendant as Part of Ambitious Wearables Expansion

Meta Platforms plans to start testing an AI pendant in the next year as part of an ambitious roadmap for wearable devices aimed at reversing the huge losses in its hardware division. An internal memo describing the roadmap, reviewed by The Information, also lays out plans to ...

12
The Information — AI news-outlet 1mo ago

Meta Memo Outlines Ambitious Hardware Plans, Including New AI Pendant

Meta Platforms plans to start testing an AI pendant in the next year as part of an ambitious roadmap for wearable devices aimed at reversing the huge losses in its hardware division. An internal memo describing the roadmap, reviewed by The Information, also lays out plans to…

20
Hugging Face Daily Papers research 1mo ago

Reflective Prompt Tuning through Language Model Function-Calling

Abstract Reflective Prompt Tuning (RPT) automates prompt optimization for large language models by simulating human iterative engineering through diagnostic feedback and memory-based revision cycles. AI-generated summary Large language models (LLMs) have become increasingly…

26
Hugging Face Daily Papers research 1mo ago

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

Abstract Vision-language models exhibit entangled spatial representations that correlate vertical image position with distance, impacting reasoning robustness and performance across benchmarks. AI-generated summary Vision-language models (VLMs) achieve strong performance on…

15
Hugging Face Daily Papers research 1mo ago

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

Abstract PANDO is a web agent framework that improves efficiency through experience accumulation by reducing redundant actions, optimizing skill discovery, and enhancing prompt caching without sacrificing performance. AI-generated summary Recent advances in multimodal web agents…

29
Hugging Face Daily Papers research 1mo ago

DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation

Abstract DynaFLIP is a dynamics-aware multimodal pre-training framework that enhances robot manipulation by integrating motion understanding into visual perception through image-language-3D flow triplets and geometric regularization techniques. AI-generated summary Robot…

22
Hugging Face Daily Papers research 1mo ago

Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection

Abstract A parameter-efficient vision-language model is developed for time-series anomaly detection using a novel benchmark with natural-language rationales, achieving superior performance and generalization across multiple datasets. AI-generated summary Recent advances in…

38
Hugging Face Daily Papers research 1mo ago

Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation

Abstract A novel single-shot 3D Gaussian head avatar generation method called MVCHead uses hierarchical state space models and multi-view consistency enforcement to create high-fidelity 3D assets from 2D images without requiring multi-view data or 3D supervision. AI-generated…

23
r/LocalLLaMA community 1mo ago

Llama.cpp B9406 MTP mmproj fix

B9406. Been waiting for this one. Building now. Report your results if you test! GGML_ASSERT(i01 >= 0 && i01 < ne01) crash in get_rows / mtmd_helper_decode_image_chunk when using MTP + MoE model + vision (Qwen3.6-35B-A3B). submitted by /u/Bulky-Priority6824

32
Hugging Face Daily Papers research 1mo ago

EarlyTom: Early Token Compression Completes Fast Video Understanding

Abstract EarlyTom is a training-free framework that compresses visual tokens early in the vision encoder to reduce time-to-first-token and computational costs while maintaining model accuracy. AI-generated summary Video large language models (Video-LLMs) have demonstrated strong…

9
Hugging Face Daily Papers research 1mo ago

Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

Abstract CorVer, a corpus-grounded reward mechanism, enhances factual accuracy in question answering by providing efficient sentence-level feedback through Wikipedia co-occurrence statistics, outperforming neural verifiers while reducing training time. AI-generated summary…

13
Hugging Face Daily Papers research 1mo ago

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Abstract Multi-agent system for generating reliable, visually informative multimodal reports by interleaving textual and visual evidence through specialized agents and verification mechanisms. AI-generated summary Large Language Models (LLMs) have advanced autonomous agents from…

8
arXiv — Machine Learning research 1mo ago

Emergent Semantic Representations in World Models through Physical Interaction without Linguistic Supervision

arXiv:2605.28865v1 Announce Type: new Abstract: What does a world model learn from physical exploration, without any linguistic supervision? We argue the answer is organized by a single principle: the geometric structure of the physical world. Training a VAE-based world model on…

25
arXiv — Machine Learning research 1mo ago

PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation

arXiv:2605.28867v1 Announce Type: new Abstract: Generating high-quality time-series data is challenging because real-world signals often exhibit multimodal patterns and multiscale dynamics, including oscillations and high-frequency variations. Flow Matching (FM) offers an…

10
arXiv — Machine Learning research 1mo ago

Balancing Multimodal Learning through Label Space Reshaping

arXiv:2605.28869v1 Announce Type: new Abstract: Multimodal learning often suffers from modality imbalance, where modalities that converge faster dominate optimization while others remain undertrained. Existing approaches typically mitigate this issue by strengthening the weak…

4
arXiv — Machine Learning research 1mo ago

TRACER: Persistent Regularization for Robust Multimodal Finetuning

arXiv:2605.29380v1 Announce Type: new Abstract: Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal…

37
arXiv — Machine Learning research 1mo ago

Rethinking Post-Training Recipes for Multimodal Time-Series Forecasting

arXiv:2605.29401v1 Announce Type: new Abstract: Time-Series Foundation Models (TSFMs) excel at zero-shot unimodal forecasting using numerical data, but unlike LLMs they cannot consume multimodal, non-numerical context that often shapes real-world trajectories. In this work, we…

16
arXiv — Machine Learning research 1mo ago

AsymVLM: Asymmetric Token Pruning for Efficient Vision-Language Model Inference

arXiv:2605.29535v1 Announce Type: new Abstract: Vision-Language Models (VLMs) process thousands of visual tokens per image alongside comparatively few text tokens, yet existing compression methods treat both modalities uniformly. We observe that the two modalities have…

28
arXiv — NLP / Computation & Language research 1mo ago

Lightweight Multimodal LLM-Enabled Cost-Effective Defect Grading of Power Transmission Equipment

arXiv:2605.28822v1 Announce Type: new Abstract: Defect grading of power transmission equipment (DGPTE) is crucial to the stability of electric energy transmission. Although existing machine learning methods exhibit strong capabilities in defect detection, they are plagued by…

30
arXiv — NLP / Computation & Language research 1mo ago

Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception

arXiv:2605.29064v1 Announce Type: new Abstract: We study how persona prompting shapes language generated by multimodal large language models in an urban perception setting. Using 59,808 annotations from 1,200 persona-conditioned agents and two no-persona settings, we analyze…

16
arXiv — NLP / Computation & Language research 1mo ago

Casual as an Anchor: Resolving Supervision Misalignment in Formality Transfer Dataset

arXiv:2605.29365v1 Announce Type: new Abstract: Formality transfer is commonly framed as a symmetric bidirectional task between informal and formal registers. We argue that this framing conceals a supervision design flaw in existing benchmarks such as GYAFC: binary human…

20
