Hugging Face Daily Papers

Hugging Face Daily Papers research 21d ago

OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

Abstract OmniCap-IF is introduced as the first comprehensive benchmark for evaluating instruction-following capabilities in omni-modal captioning, revealing significant performance disparities and a format-content tradeoff in multi-modal reasoning. Generated by…

5
Hugging Face Daily Papers research 21d ago

SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

Abstract SlimSearcher is a framework that improves efficiency in deep research agents by combining Pareto-efficient trajectory filtering and adaptive reward shaping to reduce computational costs while maintaining accuracy. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Deep…

15
Hugging Face Daily Papers research 21d ago

Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

Abstract Researchers identify widespread vulnerabilities in agent benchmark verification systems and develop an automated iterative process using LLM agents to create robust verifiers that resist exploitation while maintaining legitimate task performance. Generated by…

20
Hugging Face Daily Papers research 21d ago

LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents

Abstract LatentSkill enables efficient deployment of textual skills in agent systems by converting them into LoRA adapters stored in weight space, reducing context overhead while maintaining modularity and composability. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Agent systems…

18
Hugging Face Daily Papers research 21d ago

Chiaroscuro Attention: Spending Compute in the Dark

Abstract CHIAR-Former uses spectral entropy-based routing to dynamically select between DCT, RBF, and self-attention operators, achieving improved efficiency on large text datasets while maintaining performance through hybrid attention mechanisms. Generated by…

27
Hugging Face Daily Papers research 21d ago

Text-to-Image Models Need Less from Text Encoders Than You Think

Abstract Text-to-image models primarily utilize basic text representation aspects like word merging and order rather than complex contextual information encoded in full text embeddings. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Text-to-image models rely on text prompts as…

36
Hugging Face Daily Papers research 21d ago

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Abstract Optical reasoning uses images as a standalone reasoning medium for language and multimodal tasks, achieving higher token efficiency than traditional text-based approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Chain-of-Thought (CoT) improves the performance of…

27
Hugging Face Daily Papers research 21d ago

Answer Presence Drives RAG Rewriting Gains

Abstract Controlled interventions reveal that gold answer presence in rewritten contexts significantly boosts QA performance, with removal causing substantial F1 drops and injection improving results, while conventional probing methods show fragility to sentinel changes.…

35
Hugging Face Daily Papers research 21d ago

SwiftVR: Real-Time One-Step Generative Video Restoration

Abstract SwiftVR enables real-time video restoration on consumer GPUs through efficient attention mechanisms and lightweight autoencoding, achieving high frame rates at 4K resolution. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Real-time video restoration (VR) for live streams…

33
Hugging Face Daily Papers research 21d ago

Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

Abstract Reasoning Arena improves reinforcement learning with verifiable rewards by using trace tournaments and Bradley-Terry models to generate meaningful gradients from non-diverse reward groups, resulting in faster training and better reasoning performance. Generated by…

15
Hugging Face Daily Papers research 21d ago

PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

Abstract Privileged Bayesian Self-Distillation enables fine-grained credit assignment in long-horizon tasks by converting sparse outcome rewards into calibrated turn-level signals through Bayesian evidence scoring and autoregressive decomposition. Generated by…

8
Hugging Face Daily Papers research 21d ago

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

Abstract SkeMex is a self-evolving framework that enhances medical agents through structured skill memory, improving long-term clinical reasoning by distinguishing useful experiences and governing memory retention based on contextual utility. Generated by…

32
Hugging Face Daily Papers research 21d ago

EMMA: Extracting Multiple physical parameters from Multimodal Data

Abstract EMMA is a physics-informed multimodal framework that directly recovers dynamical parameters from raw video, audio, and image data using a Liquid Time-Constant network and physics-constrained loss. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce EMMA, a…

33
Hugging Face Daily Papers research 21d ago

Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

Abstract Research challenges the conventional wisdom in latent visual reasoning by demonstrating that cosine alignment between supervised latents and visual targets negatively correlates with model accuracy, while revealing that answers are decoded downstream from latents rather…

24
Hugging Face Daily Papers research 21d ago

Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

Abstract Self-Evaluation Elicitation (SEE) method improves model calibration for quality assessment through calibration-coupled reinforcement learning and masked distillation, demonstrating transferable quality evaluation beyond specific judge preferences. Generated by…

37
Hugging Face Daily Papers research 21d ago

DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

Abstract A multi-agent framework for deep research tasks that addresses planning, evidence acquisition, and report synthesis through decoupled components and dynamic optimization mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Deep Research (DR) has emerged as a new…

38
Hugging Face Daily Papers research 21d ago

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Abstract Skill-RM presents a unified reward modeling framework that treats reward computation as a structured agentic task, enabling dynamic evidence aggregation and consistent evaluation across diverse applications. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reward models…

19
Hugging Face Daily Papers research 21d ago

PIPE-Cypher: Automatic Enterprise Benchmark Generation for Text-to-Cypher Systems

Abstract A local benchmark-generation pipeline transforms live property graphs and seed queries into balanced NL-to-Cypher datasets for enterprise knowledge graphs, incorporating schema profiling, reverse-query grounding, and execution validation. Generated by…

22
Hugging Face Daily Papers research 21d ago

Honest Lying: Understanding Memory Confabulation in Reflexive Agents

Abstract Agents relying on self-generated reflections can store confident but incorrect task interpretations, leading to persistent errors despite environment resets, which is identified through a new metric called Reflection Repetition Rate. Generated by…

10
Hugging Face Daily Papers research 21d ago

AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

Abstract AHA-WAM is an asynchronous world-action model that uses dual Diffusion Transformers to enable efficient long-horizon planning and real-time action execution in robotic manipulation tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct World-action models have emerged as a…

4
Hugging Face Daily Papers research 21d ago

Why Muon Outperforms Adam: A Curvature Perspective

Abstract Muon outperforms Adam in large language model training by reducing curvature penalties through lower normalized directional sharpness, particularly in middle and late training stages, with advantages amplified by data imbalance and heterogeneous curvature. Generated by…

30
Hugging Face Daily Papers research 21d ago

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

Abstract OmniGameArena presents a unified benchmark for evaluating vision-language model agents in diverse game settings with a reflection-based improvement protocol that tracks performance evolution and skill generalization. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

18
Hugging Face Daily Papers research 21d ago

Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory

Abstract Large language models can be equipped with formal verification frameworks using dependent-type languages to improve multi-step workflow reliability and performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Equipping Large Language Models (LLMs) to execute reliable…

9
Hugging Face Daily Papers research 21d ago

Trajectory-Refined Distillation

Abstract On-policy distillation suffers from prefix failure where dense token-level supervision creates fragmented gradients; trajectory-refined distillation addresses this by correcting student rollouts at the trajectory level before distillation. Generated by…

37
Hugging Face Daily Papers research 21d ago

Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses

Abstract Bayesian-Agent presents a framework that treats reusable skills and SOPs as hypotheses for model success, using Bayesian inference to guide agent behavior and improve task performance through posterior-guided harness optimization. Generated by…

10
Hugging Face Daily Papers research 21d ago

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Abstract Lookahead Sparse Attention with Neural Memory Indexer reduces GPU memory usage for long-context LLM inference while maintaining accuracy through proactive KV cache management and decoupled training. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Conventional LLMs keep the…

19
Hugging Face Daily Papers research 21d ago

Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

Abstract AI evaluation results suffer from inconsistent reporting across platforms, prompting the development of EvalCards, an operational framework that standardizes benchmark metadata, evaluation data, and model information into a unified, interpretable record with four key…

20
Hugging Face Daily Papers research 21d ago

End-to-End Context Compression at Scale

Abstract Encoder-decoder compression techniques are improved through architectural search and large-scale pretraining to create Latent Context Language Models that efficiently handle long contexts with better performance and memory usage compared to traditional KV cache methods.…

25
Hugging Face Daily Papers research 21d ago

OASIS: From Simulation Data Collection to Real-World Humanoid Loco-Manipulation

Abstract A simulation-data-driven framework for humanoid loco-manipulation that uses 3D generative models to create realistic assets and hierarchical visuomotor policies trained on simulated data achieves better zero-shot performance than real-robot training. Generated by…

24
Hugging Face Daily Papers research 21d ago

Echo-Memory: A Controlled Study of Memory in Action World Models

Abstract Controlled study of memory mechanisms in action-conditioned world models reveals that memory structure and capacity significantly impact open-domain return performance beyond simple replay fidelity measures. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We present…

29
Hugging Face Daily Papers research 21d ago

Latent Spatial Memory for Video World Models

Abstract Latent spatial memory for video world models stores 3D scene information directly in diffusion latent space, eliminating pixel-space reconstruction overhead and achieving faster generation with reduced memory usage. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video…

23
Hugging Face Daily Papers research 21d ago

On the Geometry of On-Policy Distillation

Abstract On-policy distillation exhibits distinct parameter space dynamics characterized by relaxed off-principal updates and subspace locking, forming a unique geometric pattern separate from supervised fine-tuning and reinforcement learning with verifiable rewards. Generated…

20
Hugging Face Daily Papers research 21d ago

Set-Based Transformer for Atmospheric Compensation in Standoff LWIR Hyperspectral Imaging

Abstract A lightweight deep learning framework is presented for atmospheric compensation in passive long-wave infrared hyperspectral imaging, enabling joint estimation of transmittance, atmospheric path radiance, and downwelling spectrum from multi-range radiance measurements.…

36
Hugging Face Daily Papers research 21d ago

Human Psychometric Questionnaires Mischaracterize LLM Behavior

Abstract Human psychometric questionnaires fail to reliably predict LLM behavior in real-world interactions, while generation-based profiling offers superior accuracy for understanding model responses to everyday user queries. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We…

38
Hugging Face Daily Papers research 21d ago

CoVEBench: Can Video Editing Models Handle Complex Instructions?

Abstract A new benchmark called CoVEBench is introduced to evaluate compositional video editing capabilities, addressing limitations of existing models in handling complex, multi-step editing tasks while preserving spatiotemporal content. Generated by…

19
Hugging Face Daily Papers research 21d ago

SWE-Explore: Benchmarking How Coding Agents Explore Repositories

Abstract SWE-Explore introduces a benchmark for evaluating coding agents' repository exploration capabilities by requiring ranked lists of relevant code regions within line budgets, demonstrating that agentic exploration outperforms traditional retrieval methods. Generated by…

11
Hugging Face Daily Papers research 21d ago

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Abstract SpatialWorld presents a unified benchmark for evaluating interactive spatial understanding in multimodal agents through diverse real-world tasks with partial observability and text-based actions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Spatial reasoning is a…

7
Hugging Face Daily Papers research 21d ago

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Abstract Imaginative Perception Tokens (IPT) enhance vision-language models' spatial reasoning by providing intermediate perceptual representations that externalize what the model would perceive from alternative viewpoints, outperforming traditional text-based reasoning methods.…

22
Hugging Face Daily Papers research 21d ago

A Cookbook of 3D Vision: Data, Learning Paradigms, and Application

Abstract 3D vision research is organized through a taxonomy connecting geometric representations, datasets, learning frameworks, and applications across reconstruction, generation, and video modeling tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct 3D vision has rapidly…

32
Hugging Face Daily Papers research 21d ago

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

Abstract UnpredictaBench evaluates large language models' capacity to sample from target distributions, revealing significant gaps in their ability to simulate unpredictable systems despite recent advances in output diversity. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We…

7
Hugging Face Daily Papers research 21d ago

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

Abstract Contrastive Reflection (CORE) improves language model reasoning by analyzing differences between successful and unsuccessful attempts to generate concise, interpretable insights that enable faster and more efficient self-improvement compared to traditional parametric…

21
Hugging Face Daily Papers research 21d ago

Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Abstract Autoregressive language models are transformed into diffusion language models through on-policy distillation that eliminates train-inference mismatch and reduces training token requirements. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We study the transformation of…

18
Hugging Face Daily Papers research 21d ago

Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms

Abstract A novel attack-agnostic robustness metric based on Fisher Information Matrix spectral norm is proposed, providing theoretical bounds and scalable evaluation methods for deep neural network robustness assessment. Generated by Qwen/Qwen2.5-Coder-32B-Instruct The…

12
Hugging Face Daily Papers research 21d ago

Reinforcement Learning from Rich Feedback with Distributional DAgger

Abstract Forward cross-entropy objective with distributional imitation learning enables monotonic policy improvement and better performance in reasoning tasks compared to traditional reinforcement learning methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reasoning models…

15
Hugging Face Daily Papers research 21d ago

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

Abstract Vision-language models struggle to genuinely understand spatial numerical concepts, relying instead on shallow visual cues rather than developing robust coordinate-aware representations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision-Language Models (VLMs) are…

19
Hugging Face Daily Papers research 21d ago

Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

Abstract RAT+ memory module enhances query-aware sparse inference methods by improving accuracy in long-context language models across various sparse budgets. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Efficient inference is critical for long-context language models, where…

28
Hugging Face Daily Papers research 21d ago

Towards Retrieving Interaction Spaces for Agentic Search

Abstract RISE framework constructs bounded interaction spaces for agentic search by combining BM25 retrieval with preprocessed document indexing to enable efficient corpus exploration while maintaining high accuracy at scale. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

18
Hugging Face Daily Papers research 21d ago

LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models

Abstract LayerRoute is a lightweight adapter that selectively skips transformer blocks during inference based on input type, achieving compute savings while maintaining or improving model quality through gated routing and LoRA adaptation. Generated by…

19
Hugging Face Daily Papers research 22d ago

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Abstract Interactive ASR framework integrates semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement, validated by a new sentence-level semantic error rate metric and interactive simulation system. Generated by…

35
Hugging Face Daily Papers research 22d ago

GENEB: Why Genomic Models Are Hard to Compare

Abstract GENEB presents a comprehensive benchmark for evaluating genomic foundation models across diverse tasks and architectures under a unified protocol. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Progress in genomic foundation models is difficult to assess due to fragmented…

25
