Reasoning: 500 articles archived under #reasoning

Hugging Face Daily Papers research 21d ago Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory Abstract SkeMex is a self-evolving framework that enhances medical agents through structured skill memory, improving long-term clinical reasoning by distinguishing useful experiences and governing memory retention based on contextual utility. Generated by… 32 Hugging Face Daily Papers research 21d ago Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents Abstract Research challenges the conventional wisdom in latent visual reasoning by demonstrating that cosine alignment between supervised latents and visual targets negatively correlates with model accuracy, while revealing that answers are decoded downstream from latents rather… 24 Hugging Face Daily Papers research 21d ago DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning Abstract A multi-agent framework for deep research tasks that addresses planning, evidence acquisition, and report synthesis through decoupled components and dynamic optimization mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Deep Research (DR) has emerged as a new… 38 arXiv — Machine Learning research 21d ago Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning arXiv:2606.07602v1 Announce Type: new Abstract: LLM-based LEGO assembly generation requires both semantic grounding and physical feasibility. 
We identify a data-induced failure mode, PhysHack, in which the assemblies satisfy physical-validity constraints while producing… 12 arXiv — Machine Learning research 21d ago MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution arXiv:2606.07603v1 Announce Type: new Abstract: Large language models (LLMs) exhibit strong reasoning capabilities, yet most LLM-based agents are statically deployed and unable to improve through task interactions. Existing experience-driven methods often rely on memory or… 32 arXiv — Machine Learning research 21d ago Adversarial Robustness of Activation Steering in Large Language Models arXiv:2606.07696v1 Announce Type: new Abstract: Activation steering has become a popular training-free method to control LLM behavior by injecting precomputed direction vectors into the model's residual stream at inference time. Yet its robustness to realistic input variation… 24 arXiv — Machine Learning research 21d ago Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories arXiv:2606.07889v1 Announce Type: new Abstract: LLM-based coding agents sometimes acknowledge a problem in their own reasoning and then proceed anyway. 
We call this pattern strained coherence: a safety-relevant failure mode in which an agent has information that should change… 31 arXiv — Machine Learning research 21d ago The Easy, the Hard, and the Learnable: Confidence and Difficulty-Adaptive Policy Optimization for LLM Reasoning arXiv:2606.07950v1 Announce Type: new Abstract: RL with verifiable rewards can substantially improve LLM reasoning, yet standard GRPO-style training often treats easy, hard, and learnable questions alike through uniform sampling and weighting, leading to inefficient compute… 31 arXiv — Machine Learning research 21d ago Enhancing AI Interpretability and Safety through Localised Architectures arXiv:2606.07998v1 Announce Type: new Abstract: Recent advances in generative AI, especially powerful Large Language Models (LLMs) and Large Reasoning Models (LRMs), raise concerns over the interpretability, safety and sustainability of these large and opaque AI models. The… 8 arXiv — Machine Learning research 21d ago ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning arXiv:2606.08088v1 Announce Type: new Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has recently become a key paradigm for improving the reasoning abilities of Large Language Models (LLMs), yet it remains limited by sparse binary rewards and its ignorance of… 28 Hugging Face Daily Papers research 21d ago SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks Abstract SpatialWorld presents a unified benchmark for evaluating interactive spatial understanding in multimodal agents through diverse real-world tasks with partial observability and text-based actions. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Spatial reasoning is a… 7 Hugging Face Daily Papers research 21d ago Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models Abstract Imaginative Perception Tokens (IPT) enhance vision-language models' spatial reasoning by providing intermediate perceptual representations that externalize what the model would perceive from alternative viewpoints, outperforming traditional text-based reasoning methods.… 22 Hugging Face Daily Papers research 21d ago CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning Abstract Contrastive Reflection (CORE) improves language model reasoning by analyzing differences between successful and unsuccessful attempts to generate concise, interpretable insights that enable faster and more efficient self-improvement compared to traditional parametric… 21 r/LocalLLaMA community 21d ago Nex N2 has a funny "few words do trick" reasoning I've been playing with Nex N2 Pro (Qwen 3.5 397B finetune) locally today. I noticed straight away that it has a pattern of reasoning that is distinct and uses simple words like "need" and "maybe" a lot. Here's a sample of reasoning. We need answer user asks "what is the theory… 16 Hugging Face Daily Papers research 21d ago Reinforcement Learning from Rich Feedback with Distributional DAgger Abstract Forward cross-entropy objective with distributional imitation learning enables monotonic policy improvement and better performance in reasoning tasks compared to traditional reinforcement learning methods. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reasoning models… 15 Hugging Face Daily Papers research 21d ago Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation Abstract Interactive ASR framework integrates semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement, validated by a new sentence-level semantic error rate metric and interactive simulation system. Generated by… 35 Hugging Face Daily Papers research 22d ago Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation Abstract Post-hoc compression of reasoning traces reduces computational costs and inference lengths while maintaining high accuracy, offering an accuracy-efficiency trade-off in knowledge distillation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reasoning models produce long… 24 Hugging Face Daily Papers research 22d ago Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback Abstract Critic-R framework enhances agentic search by closing the feedback loop between reasoning agents and retrieval models through critic evaluation and dual optimization mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Agentic search systems iteratively interact… 34 arXiv — Machine Learning research 22d ago TALAN: Task-Aligned Latent Adaptation Networks for Targeted Post-Training of Large Language Models arXiv:2606.06902v1 Announce Type: new Abstract: Targeted post-training aims to improve reasoning, math, and code without degrading strengths. 
Low-rank adapters are efficient but task-global; activation interventions are input-aware but often require separate probes, vectors, or… 21 arXiv — Machine Learning research 22d ago The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning arXiv:2606.06920v1 Announce Type: new Abstract: Deploying Small Language Models (SLMs) on edge devices requires efficient fine-tuning strategies that adapt models to new tasks without degrading their general capabilities. In this study, we benchmark five sub-1B models (135M-1B)… 17 arXiv — NLP / Computation & Language research 22d ago RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning arXiv:2606.07006v1 Announce Type: cross Abstract: Supervised fine-tuning (SFT) is a prevailing method for adapting large language models to reasoning tasks by imitating offline expert demonstrations, often treating a single expert trajectory as the target behavior. However,… 15 arXiv — Machine Learning research 22d ago On the Geometry of On-Policy Distillation arXiv:2606.07082v1 Announce Type: new Abstract: On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with… 10 arXiv — Machine Learning research 22d ago A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning arXiv:2606.07410v1 Announce Type: new Abstract: The emergence of "Aha moments" in large language models, particularly DeepSeek-R1-0120, has raised the question of whether these systems genuinely reason or merely imitate the appearance of reasoning. 
We conduct a comprehensive… 18 arXiv — NLP / Computation & Language research 22d ago How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures arXiv:2606.06635v1 Announce Type: new Abstract: Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We characterize these failures using token-level uncertainty signals, finding they arise through two… 24 arXiv — NLP / Computation & Language research 22d ago CAF-Gen: A Multi-Agent System for Enriching Argumentation Structures arXiv:2606.06646v1 Announce Type: new Abstract: Formalizing complex reasoning from natural text is one of the central challenges in computational linguistics. It requires systems to understand not just keywords but also the context and complex reasoning embedded in a text.… 10 arXiv — NLP / Computation & Language research 22d ago Signal-Driven Observation for Long-Horizon Web Agents arXiv:2606.06708v1 Announce Type: new Abstract: Web agents operating over long horizons ingest raw DOM and accessibility trees -- routinely tens of thousands of tokens -- at every action step, causing progressive context degradation that erodes reasoning well before tasks… 7 arXiv — NLP / Computation & Language research 22d ago When to Think Deeply: Inhibitory Deliberation for LLM Reasoning arXiv:2606.06745v1 Announce Type: new Abstract: Reasoning Large Language Models can improve problem-solving performance through deliberative inference, but invoking slow reasoning for every input is computationally expensive and often unnecessary. 
We propose IDPR, a framework… 25 arXiv — NLP / Computation & Language research 22d ago Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces arXiv:2606.06840v1 Announce Type: new Abstract: Modern reasoning models offer surprisingly strong zero-shot performance on challenging multi-label tasks that require selecting a small set of relevant options from hundreds of thousands to millions of candidate labels. We… 30 arXiv — NLP / Computation & Language research 22d ago CRAFT: A Unified Counterfactual Reasoning Framework for Tabular Question Answering and Fact Verification arXiv:2606.06842v1 Announce Type: new Abstract: Table reasoning remains challenging for large language models (LLMs), particularly in tasks that require multi-step inference over long and structured tables. Existing approaches predominantly rely on single-direction reasoning,… 34 arXiv — NLP / Computation & Language research 22d ago Are Large Language Models Suitable for Graph Computation? Progress and Prospects arXiv:2606.06865v1 Announce Type: new Abstract: Large language models (LLMs) have been increasingly explored for graph computation, where tasks require reasoning over structured relationships and algorithmic operations. 
Yet, it remains unclear when LLMs can reliably support such… 28 arXiv — NLP / Computation & Language research 22d ago ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning arXiv:2606.06915v1 Announce Type: new Abstract: Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based… 34 arXiv — NLP / Computation & Language research 22d ago TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents arXiv:2606.07054v1 Announce Type: new Abstract: Autonomous LLM agents can pursue hidden malicious objectives through sequences of individually benign actions, making sabotage difficult to detect using standard trajectory-level monitoring. Existing approaches either evaluate… 22 arXiv — NLP / Computation & Language research 22d ago mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages? arXiv:2606.07069v1 Announce Type: new Abstract: We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require… 19 arXiv — NLP / Computation & Language research 22d ago From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning arXiv:2606.07190v1 Announce Type: new Abstract: Reasoning prefixes shape the future trajectory of LLM problem solving, yet existing process reward models usually evaluate them through local step correctness. 
We argue that correctness is a useful but indirect proxy for the effect… 21 arXiv — NLP / Computation & Language research 22d ago M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions arXiv:2606.07402v1 Announce Type: new Abstract: Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human form with sparse visuals and straightforward content, evaluating neither reasoning over authentic… 19 arXiv — NLP / Computation & Language research 22d ago How reliable are LLMs when it comes to playing dice? arXiv:2606.07515v1 Announce Type: new Abstract: We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a… 33 arXiv — NLP / Computation & Language research 22d ago MADRAG: Multi-Agent Debate with Retrieval-Augmented Generation for Training-Free Analytic Essay Scoring arXiv:2606.06754v1 Announce Type: cross Abstract: We present MADRAG, a training-free framework for analytic essay scoring that combines multi-agent reasoning with retrieval-augmented grounding. Unlike standard LLM-as-judge approaches, which are prone to bias and unstable… 10 arXiv — NLP / Computation & Language research 22d ago Textual Supervision Enhances Geospatial Representations in Vision-Language Models arXiv:2606.07172v1 Announce Type: cross Abstract: Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. 
In this work, we analyze the geospatial representations… 18 arXiv — NLP / Computation & Language research 22d ago MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism arXiv:2606.07512v1 Announce Type: cross Abstract: Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple… 14 arXiv — NLP / Computation & Language research 22d ago AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning arXiv:2512.13278v2 Announce Type: replace Abstract: Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, which… 10 arXiv — NLP / Computation & Language research 22d ago SEEK: Steering LLM Reasoning for RAG via Internal Reasoning Sketches arXiv:2601.09402v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge into the generation process. 
Benefiting from the reasoning capabilities of LLMs, existing methods have leveraged… 8 arXiv — NLP / Computation & Language research 22d ago Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning arXiv:2602.11201v2 Announce Type: replace Abstract: Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer, or… 22 Hugging Face Daily Papers research 22d ago Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators Abstract Astra is an agentic spatial reasoning framework that enhances Vision-Language Models with action-conditioned visual imagination by coupling a reinforcement learning-trained policy with a world simulator for generating novel-view observations. Generated by… 22 Hugging Face Daily Papers research 22d ago Watch, Remember, Reason: Human-View Video Understanding with MLLMs Abstract Multimodal large language models for video understanding are structured around three core capabilities—watching, remembering, and reasoning—with applications spanning multiple video domains and addressing challenges in perception, memory, and reasoning. Generated by… 8 Hugging Face Daily Papers research 22d ago WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark Abstract WorldBench is introduced as a visually diverse reasoning benchmark for evaluating multimodal large language models, revealing significant limitations in current models' visual understanding capabilities. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct In real-world… 11 llama.cpp releases dev-tools 23d ago b9544 common/chat : fix LFM2/LFM2.5 reasoning round-trip and leak (#24234) common/chat : fix LFM2 reasoning round-trip and stray leak Gate by reasoning format and whether the template supports macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled)… 30 r/LocalLLaMA community 23d ago Z.ai, we need Air! GLM GGUF wen? First we never saw an upgraded Air model after 4.5. Then GLM 4.7 Turbo was great, but quickly surpassed for coding. Now GLM 5.1 is a coding beast, but too huge for most to run locally, and even slow on API. Will we ever get another Air model with frontier reasoning and… 23 r/LocalLLaMA community 24d ago I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising! Saw this post here yesterday: KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) Cheap KV cache with good precision? Sign me up! Oh, vLLM… 12 Hugging Face Daily Papers research 25d ago Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning Abstract Discrete-WAM introduces a unified discrete latent vision-action world policy that enables compositional causal reasoning and counterfactual reasoning in autonomous driving through aligned discrete tokens and a shared discrete diffusion framework. Generated by… 29 Hugging Face Daily Papers research 25d ago World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis Abstract World-language-action models combine textual instruction processing with robot state prediction through an autoregressive transformer backbone, enabling efficient long-horizon task execution and cross-embodiment learning. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We… 7