Hugging Face Daily Papers


Hugging Face Daily Papers research 5d ago

Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do

Abstract Multimodal Chain-of-Thought reasoning shows selective effectiveness across different tasks, with limitations in maintaining visual introspection during reasoning processes. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Chain-of-Thought (CoT) has become a standard method…

17
Hugging Face Daily Papers research 5d ago

DomainShuttle: Freeform Open Domain Subject-driven Text-to-video Generation

Abstract DomainShuttle enables open domain subject-driven text-to-video generation with high fidelity and flexibility across in-domain and cross-domain scenarios through domain-aware modeling and dual RoPE schemes. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Open domain…

10
Hugging Face Daily Papers research 5d ago

RoPE-Aware Bit Allocation for KV-Cache Quantization

Abstract Block-GTQ introduces a RoPE-aware bit allocation method for key-cache quantization that improves attention accuracy and downstream performance through adaptive bit distribution and packed cache serving. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Existing low-bit…

22
Hugging Face Daily Papers research 5d ago

Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence

Abstract This survey explores multimodal code intelligence systems that generate and reason with code based on visual inputs, categorizing approaches across GUI, scientific visualization, structured graphics, and emerging frameworks while identifying verification-centered…

25
Hugging Face Daily Papers research 5d ago

IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

Abstract Implicit Visual Chain-of-Thought decomposes visual conditioning into structural and semantic cascades for improved structure-aware image generation with sketch supervision. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Unified multi-modal large language models (MLLMs)…

7
Hugging Face Daily Papers research 5d ago

Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods

Abstract A large-scale synthetic dataset and specialized model architecture are introduced to address the challenges of artistic text recognition by improving data diversity and model flexibility for irregular text layouts. Generated by Qwen/Qwen2.5-Coder-32B-Instruct WordArt…

9
Hugging Face Daily Papers research 5d ago

Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

Abstract Wan-Streamer is a unified, end-to-end multimodal model that enables real-time audio-visual interaction through causal attention mechanisms and integrated processing of visual, audio, and text modalities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We present…

20
Hugging Face Daily Papers research 5d ago

MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery

Abstract Long-term memory in LLM agents should be evaluated as an auditable post-interaction artifact by reconstructing structured user state from the agent's memory, as demonstrated by MEMPROBE, a benchmark testing memory recovery against synthetic ground truth across 50…

21
Hugging Face Daily Papers research 5d ago

Critique of Agent Model

Abstract True artificial agency requires internalized structures for goals, identity, decision-making, self-regulation, and learning, distinguishing autonomous systems from task-specific ones. Generated by Qwen/Qwen2.5-Coder-32B-Instruct What is an agent? What constitutes…

24
Hugging Face Daily Papers research 5d ago

InSight: Self-Guided Skill Acquisition via Steerable VLAs

Abstract InSight enables autonomous skill acquisition for vision-language-action models through primitive-action level steerability and automated demonstration generation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision-language-action (VLA) models can learn manipulation…

19
Hugging Face Daily Papers research 5d ago

Multi4D: High-Fidelity Dynamic Gaussian Splatting via Multi-Level Competitive Allocation

Abstract Multi4D addresses the trade-off between motion consistency and visual fidelity in dynamic 3D Gaussian splatting through a multi-level competitive allocation framework that enables adaptive specialization and efficient representation. Generated by…

21
Hugging Face Daily Papers research 5d ago

Semantic Browsing: Controllable Diversity for Image Generation

Abstract Text-to-image models are enhanced with controlled diversity through semantic browsing capabilities that enable structured navigation of image variations based on meaningful semantic decisions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Modern text-to-image models…

4
Hugging Face Daily Papers research 5d ago

AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning

Abstract Large language models face challenges in archive-grounded reasoning tasks involving evidence retrieval and synthesis across diverse document collections, with performance varying significantly across domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language…

26
Hugging Face Daily Papers research 5d ago

ChartWalker: Benchmarking the Cross-Chart RAG Task

Abstract ChartWalker presents a novel framework for cross-chart retrieval-augmented generation with hierarchical knowledge graph construction and structure-aware sampling for challenging multi-modal analytical tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Cross-Chart…

33
Hugging Face Daily Papers research 5d ago

QG-MIL: A Gated Transformer Aggregator for Domain-Agnostic Multiple Instance Learning in Medical Imaging

Abstract QG-MIL introduces a gated transformer aggregator for multiple instance learning in medical imaging that stabilizes attention distribution and improves prediction consistency across different medical domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Attention-based…

38
Hugging Face Daily Papers research 5d ago

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

Abstract EventVLA addresses long-horizon robotic manipulation challenges by introducing a sparse visual evidence memory framework with visual anchors and dynamic Keyframe Evidence Memory module for improved task performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Memory…

23
Hugging Face Daily Papers research 5d ago

OpenThoughts-Agent: Data Recipes for Agentic Models

Abstract An open-source data curation pipeline for training agentic language models is presented, demonstrating superior performance through systematic experimentation and scalable training data. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Agentic language models dramatically…

34
Hugging Face Daily Papers research 5d ago

DiffusionBench: On Holistic Evaluation of Diffusion Transformers

Abstract Researchers introduce NanoGen, a unified framework for training and evaluating diffusion transformers that demonstrates the need for comprehensive benchmarking beyond ImageNet class-conditional generation to assess true progress in generative modeling. Generated by…

25
Hugging Face Daily Papers research 5d ago

FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation

Abstract FLUX3D addresses limitations in image-to-3D Gaussian Splatting generation by improving representation learning and cross-modal alignment through specialized architectures and attention mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Sparse voxel representation…

34
Hugging Face Daily Papers research 5d ago

World Value Models for Robotic Manipulation

Abstract World Value Model combines world models with value estimation to provide accurate task progression assessment and improve robotic policy learning from mixed-quality data. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Generalist value models play a pivotal role in scaling…

6
Hugging Face Daily Papers research 6d ago

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

Abstract A large-scale multi-agent benchmark for evaluating LLMs in Chinese psychiatric diagnosis is introduced, highlighting challenges in dynamic consultation and the gap between consultation quality and diagnostic accuracy. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Mental…

36
Hugging Face Daily Papers research 6d ago

FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation

Abstract Video diffusion models are adapted to decode explicit surface primitives directly from latent space, enabling high-quality 3D scene generation with improved geometric accuracy and real-time rendering capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Generating…

26
Hugging Face Daily Papers research 6d ago

Escaping the Self-Confirmation Trap: An Execute-Distill-Verify Paradigm for Agentic Experience Learning

Abstract EDV is a three-stage framework that uses multiple heterogeneous agents to collaboratively construct reliable experiences for LLM agents, preventing self-confirmatory errors through execute-distill-verify processes. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

29
Hugging Face Daily Papers research 6d ago

FlowR2A: Learning Reward-to-Action Distribution for Multimodal Driving Planning

Abstract FlowR2A addresses the tension in multimodal driving planning by combining dense reward supervision with dynamic proposal generation through a flow-matching decoder that learns reward-conditioned action distributions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

35
Hugging Face Daily Papers research 6d ago

An Efficient Method for the Optimal Control of Microgrids Under Uncertainties using Local Reduction

Abstract Two mathematical formulations for robust microgrid sizing and power scheduling are proposed and compared, with one using binary variables and big-M constraints and the other using continuous nonlinear programming with smooth reformulation of logical constraints.…

6
Hugging Face Daily Papers research 6d ago

Qwen-AgentWorld: Language World Models for General Agents

Abstract Language-based world models enable agentic environment simulation across multiple domains and enhance general agent performance through scalable simulation and improved downstream task performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct A world model predicts…

16
Hugging Face Daily Papers research 6d ago

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

Abstract NatureBench presents a cross-disciplinary benchmark of 90 scientific tasks derived from Nature publications to assess AI coding agents' ability to achieve discovery rather than just reproduction, revealing that current agents primarily rely on methodological translation…

21
Hugging Face Daily Papers research 6d ago

DREAM: Dense Retrieval Embeddings via Autoregressive Modeling

Abstract DREAM trains dense retrieval embeddings using autoregressive language model attention mechanisms to supervise document-query similarity without requiring labeled examples. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Dense retrieval embedding models are a fundamental…

22
Hugging Face Daily Papers research 6d ago

FedOT: Ownership Verification and Leakage Tracing via Watermarks for Federated LDMs

Abstract FedOT is a novel framework that enables ownership verification and leakage tracing in federated latent diffusion models by introducing chunked watermarking and latent vector transformation to prevent watermark removal attacks. Generated by…

17
Hugging Face Daily Papers research 6d ago

ReMMD: Realistic Multilingual Multi-Image Agentic Verification for Multimodal Misinformation Detection

Abstract A comprehensive multimodal misinformation detection framework is introduced that handles complex, multilingual content with multiple images and diverse verification approaches, achieving superior performance while reducing computational costs. Generated by…

29
Hugging Face Daily Papers research 6d ago

Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning

Abstract A novel online data mixing framework called Holistic Data Scheduler uses reinforcement learning with a multi-objective reward function to optimize large language model pre-training efficiency and performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct The composition…

38
Hugging Face Daily Papers research 6d ago

Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning

Abstract Text-to-image models fail to generate counterfactual scenes because they rely on tightly coupled visual-textual patterns rather than causal reasoning, demonstrating limited understanding beyond pattern matching. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Text-to-image…

26
Hugging Face Daily Papers research 6d ago

MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

Abstract MemGUI-Agent addresses long-horizon mobile GUI task limitations through proactive context management using Context-as-Action (ConAct) to maintain critical information across extended sequences. Generated by Qwen/Qwen2.5-Coder-32B-Instruct MLLM-based mobile GUI agents…

32
Hugging Face Daily Papers research 6d ago

MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization

Abstract MobileForge enables efficient adaptation of mobile GUI agents through annotation-free learning by combining real app interaction grounding with hierarchical feedback-guided policy optimization. Generated by Qwen/Qwen2.5-Coder-32B-Instruct MLLM-based mobile GUI agents…

18
Hugging Face Daily Papers research 6d ago

Tapered Language Models

Abstract Tapered language models allocate more parameters to earlier layers and fewer to later layers, improving performance without increasing total parameters or compute costs. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Modern language models, including transformer,…

34
Hugging Face Daily Papers research 6d ago

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

Abstract A novel framework called VeriEvol is introduced that addresses the challenge of scaling reinforcement learning for visual mathematical reasoning by ensuring reliable reward labels through a two-axis approach that separates prompt difficulty from answer reliability,…

17
Hugging Face Daily Papers research 6d ago

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

Abstract A data-centric approach using curated datasets and a minimal GRPO setup significantly improves long-context reasoning in large language models, outperforming prior reinforcement learning methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Long-context reasoning is an…

15
Hugging Face Daily Papers research 6d ago

TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization

Abstract A unified open-source framework for discrete text-trigger optimization that standardizes the development and execution of optimization strategies across various domains and applications. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Discrete text-trigger optimization --…

18
Hugging Face Daily Papers research 6d ago

Lift4D: Harmonizing Single-View 3D Estimation for 4D Reconstruction In-the-Wild

Abstract Lift4D presents a test-time optimization framework that combines temporal consistency from single-view 3D reconstruction with deformable 3D Gaussian Splatting and view-conditioned diffusion priors to reconstruct dynamic non-rigid objects from monocular video. Generated…

15
Hugging Face Daily Papers research 6d ago

Comparing Linear Probes with Mahalanobis Cosine Similarity

Abstract The Mahalanobis cosine similarity provides a theoretically grounded method for comparing linear probes that correlates strongly with out-of-distribution performance metrics. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Linear probes are widely used in interpretability…

25
Hugging Face Daily Papers research 6d ago

ShotcreteDepth: A Bi-modal Dataset for Robust Robotic Depth Perception in Shotcrete Construction Environments

Abstract A bi-modal construction domain dataset combining stereo RGB and LiDAR data under challenging environmental conditions is introduced for autonomous system perception research. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce ShotcreteDepth, a bi-modal dataset…

22
Hugging Face Daily Papers research 6d ago

Self-Compacting Language Model Agents

Abstract SelfCompact is a scaffolding approach that enables models to autonomously determine optimal compaction timing and methods for managing long agent traces, achieving better performance with reduced token costs compared to fixed-interval methods. Generated by…

13
Hugging Face Daily Papers research 6d ago

When Agents Commit Too Soon: Diagnosing Premature Commitment in LLM Agents

Abstract Premature commitment in long-horizon LLM agents leads to silent failures where agents defend early interpretations without considering alternatives, and hidden-state convergence serves as an early diagnostic for trajectory consistency. Generated by…

24
Hugging Face Daily Papers research 6d ago

Go-with-the-Track: Video Compositing and Motion Control with Point Tracking

Abstract Go-with-the-Track unifies motion control and reference image compositing in video generation by using point-track embeddings with spatial-aware encoding and video diffusion transformers. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Filmmaking demands precise motion…

32
Hugging Face Daily Papers research 6d ago

Libretto: Giving LLM Agents a Sense of Musical Structure

Abstract Libretto provides a structured framework for symbolic music generation and revision using LLM-native grammar and statistical evaluation across musical dimensions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Generative music systems can now produce impressive audio from…

18
Hugging Face Daily Papers research 6d ago

A Verifiable Search Is Not a Learnable Chain-of-Thought

Abstract Training models on chain-of-thought demonstrations fails for tasks requiring backtracking search because the forward derivation cannot be faithfully imitated, demonstrating a fundamental limitation in learning search procedures through demonstration. Generated by…

11
Hugging Face Daily Papers research 6d ago

Vera: A Layered Diffusion Model for Content-Preserving Video Editing

Abstract Vera is a layered diffusion framework that preserves video content during editing by generating edit layers and alpha mattes through a Mixture-of-Transformers architecture. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video diffusion models have enabled remarkable…

10
Hugging Face Daily Papers research 6d ago

Robusto-2: Benchmarking Humans & VLMs for Autonomous Driving in Lima & New York City

Abstract Research examines how self-driving car systems and humans perform on visual question answering tasks across different geographic locations, revealing that both human and AI responses diverge based on question types but show similar performance regardless of location.…

5
Hugging Face Daily Papers research 6d ago

An Exploratory Case Study of LLM-Assisted Refactoring and Gameplay Feature Generation in an Endless Runner Game

Abstract Large language models demonstrate varying effectiveness in software development tasks, successfully completing localized refactoring but showing limitations in integrating new gameplay features within existing game systems. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

24
Hugging Face Daily Papers research 6d ago

AC-ODM: Actor-Critic Online Data Mixing for Sample-Efficient LLM Pretraining

Abstract AC-ODM optimizes pretraining data composition for LLMs using reinforcement learning to improve convergence speed and downstream accuracy while maintaining computational efficiency. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Optimizing pretraining data composition is…

8
