Tag

Reasoning

500 articles archived under #reasoning

Hugging Face Daily Papers research 21d ago

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

Abstract SkeMex is a self-evolving framework that enhances medical agents through structured skill memory, improving long-term clinical reasoning by distinguishing useful experiences and governing memory retention based on contextual utility. Generated by…

32
Hugging Face Daily Papers research 21d ago

Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

Abstract Research challenges the conventional wisdom in latent visual reasoning by demonstrating that cosine alignment between supervised latents and visual targets negatively correlates with model accuracy, while revealing that answers are decoded downstream from latents rather…

24
Hugging Face Daily Papers research 21d ago

DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

Abstract A multi-agent framework for deep research tasks that addresses planning, evidence acquisition, and report synthesis through decoupled components and dynamic optimization mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Deep Research (DR) has emerged as a new…

38
arXiv — Machine Learning research 21d ago

Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning

arXiv:2606.07602v1 Announce Type: new Abstract: LLM-based LEGO assembly generation requires both semantic grounding and physical feasibility. We identify a data-induced failure mode, PhysHack, in which the assemblies satisfy physical-validity constraints while producing…

12
arXiv — Machine Learning research 21d ago

MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution

arXiv:2606.07603v1 Announce Type: new Abstract: Large language models (LLMs) exhibit strong reasoning capabilities, yet most LLM-based agents are statically deployed and unable to improve through task interactions. Existing experience-driven methods often rely on memory or…

32
arXiv — Machine Learning research 21d ago

Adversarial Robustness of Activation Steering in Large Language Models

arXiv:2606.07696v1 Announce Type: new Abstract: Activation steering has become a popular training-free method to control LLM behavior by injecting precomputed direction vectors into the model's residual stream at inference time. Yet its robustness to realistic input variation…

24
arXiv — Machine Learning research 21d ago

Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories

arXiv:2606.07889v1 Announce Type: new Abstract: LLM-based coding agents sometimes acknowledge a problem in their own reasoning and then proceed anyway. We call this pattern strained coherence: a safety-relevant failure mode in which an agent has information that should change…

31
arXiv — Machine Learning research 21d ago

The Easy, the Hard, and the Learnable: Confidence and Difficulty-Adaptive Policy Optimization for LLM Reasoning

arXiv:2606.07950v1 Announce Type: new Abstract: RL with verifiable rewards can substantially improve LLM reasoning, yet standard GRPO-style training often treats easy, hard, and learnable questions alike through uniform sampling and weighting, leading to inefficient compute…

31
arXiv — Machine Learning research 21d ago

Enhancing AI Interpretability and Safety through Localised Architectures

arXiv:2606.07998v1 Announce Type: new Abstract: Recent advances in generative AI, especially powerful Large Language Models (LLMs) and Large Reasoning Models (LRMs), raise concerns over the interpretability, safety and sustainability of these large and opaque AI models. The…

8
arXiv — Machine Learning research 21d ago

ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning

arXiv:2606.08088v1 Announce Type: new Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has recently become a key paradigm for improving the reasoning abilities of Large Language Models (LLMs), yet it remains limited by sparse binary rewards and its ignorance of…

28
Hugging Face Daily Papers research 21d ago

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Abstract SpatialWorld presents a unified benchmark for evaluating interactive spatial understanding in multimodal agents through diverse real-world tasks with partial observability and text-based actions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Spatial reasoning is a…

7
Hugging Face Daily Papers research 21d ago

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Abstract Imaginative Perception Tokens (IPT) enhance vision-language models' spatial reasoning by providing intermediate perceptual representations that externalize what the model would perceive from alternative viewpoints, outperforming traditional text-based reasoning methods.…

22
Hugging Face Daily Papers research 21d ago

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

Abstract Contrastive Reflection (CORE) improves language model reasoning by analyzing differences between successful and unsuccessful attempts to generate concise, interpretable insights that enable faster and more efficient self-improvement compared to traditional parametric…

21
r/LocalLLaMA community 21d ago

Nex N2 has a funny "few words do trick" reasoning

I've been playing with Nex N2 Pro (Qwen 3.5 397B finetune) locally today. I noticed straight away that it has a pattern of reasoning that is distinct and uses simple words like "need" and "maybe" a lot. Here's a sample of reasoning. We need answer user asks "what is the theory…

16
Hugging Face Daily Papers research 21d ago

Reinforcement Learning from Rich Feedback with Distributional DAgger

Abstract Forward cross-entropy objective with distributional imitation learning enables monotonic policy improvement and better performance in reasoning tasks compared to traditional reinforcement learning methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Reasoning models…

15
Hugging Face Daily Papers research 21d ago

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Abstract Interactive ASR framework integrates semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement, validated by a new sentence-level semantic error rate metric and interactive simulation system. Generated by…

35
Hugging Face Daily Papers research 22d ago

Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

Abstract Post-hoc compression of reasoning traces reduces computational costs and inference lengths while maintaining high accuracy, offering an accuracy-efficiency trade-off in knowledge distillation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Reasoning models produce long…

24
Hugging Face Daily Papers research 22d ago

Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback

Abstract Critic-R framework enhances agentic search by closing the feedback loop between reasoning agents and retrieval models through critic evaluation and dual optimization mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Agentic search systems iteratively interact…

34
arXiv — Machine Learning research 22d ago

TALAN: Task-Aligned Latent Adaptation Networks for Targeted Post-Training of Large Language Models

arXiv:2606.06902v1 Announce Type: new Abstract: Targeted post-training aims to improve reasoning, math, and code without degrading strengths. Low-rank adapters are efficient but task-global; activation interventions are input-aware but often require separate probes, vectors, or…

21
arXiv — Machine Learning research 22d ago

The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning

arXiv:2606.06920v1 Announce Type: new Abstract: Deploying Small Language Models (SLMs) on edge devices requires efficient fine-tuning strategies that adapt models to new tasks without degrading their general capabilities. In this study, we benchmark five sub-1B models (135M-1B)…

17
arXiv — NLP / Computation & Language research 22d ago

RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

arXiv:2606.07006v1 Announce Type: cross Abstract: Supervised fine-tuning (SFT) is a prevailing method for adapting large language models to reasoning tasks by imitating offline expert demonstrations, often treating a single expert trajectory as the target behavior. However,…

15
arXiv — Machine Learning research 22d ago

On the Geometry of On-Policy Distillation

arXiv:2606.07082v1 Announce Type: new Abstract: On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with…

10
arXiv — Machine Learning research 22d ago

A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

arXiv:2606.07410v1 Announce Type: new Abstract: The emergence of "Aha moments" in large language models, particularly DeepSeek-R1-0120, has raised the question of whether these systems genuinely reason or merely imitate the appearance of reasoning. We conduct a comprehensive…

18
arXiv — NLP / Computation & Language research 22d ago

How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures

arXiv:2606.06635v1 Announce Type: new Abstract: Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We characterize these failures using token-level uncertainty signals, finding they arise through two…

24
arXiv — NLP / Computation & Language research 22d ago

CAF-Gen: A Multi-Agent System for Enriching Argumentation Structures

arXiv:2606.06646v1 Announce Type: new Abstract: Formalizing complex reasoning from natural text is one of the central challenges in computational linguistics. It requires systems to understand not just keywords but also the context and complex reasoning embedded in a text.…

10
arXiv — NLP / Computation & Language research 22d ago

Signal-Driven Observation for Long-Horizon Web Agents

arXiv:2606.06708v1 Announce Type: new Abstract: Web agents operating over long horizons ingest raw DOM and accessibility trees -- routinely tens of thousands of tokens -- at every action step, causing progressive context degradation that erodes reasoning well before tasks…

7
arXiv — NLP / Computation & Language research 22d ago

When to Think Deeply: Inhibitory Deliberation for LLM Reasoning

arXiv:2606.06745v1 Announce Type: new Abstract: Reasoning Large Language Models can improve problem-solving performance through deliberative inference, but invoking slow reasoning for every input is computationally expensive and often unnecessary. We propose IDPR, a framework…

25
arXiv — NLP / Computation & Language research 22d ago

Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

arXiv:2606.06840v1 Announce Type: new Abstract: Modern reasoning models offer surprisingly strong zero-shot performance on challenging multi-label tasks that require selecting a small set of relevant options from hundreds of thousands to millions of candidate labels. We…

30
arXiv — NLP / Computation & Language research 22d ago

CRAFT: A Unified Counterfactual Reasoning Framework for Tabular Question Answering and Fact Verification

arXiv:2606.06842v1 Announce Type: new Abstract: Table reasoning remains challenging for large language models (LLMs), particularly in tasks that require multi-step inference over long and structured tables. Existing approaches predominantly rely on single-direction reasoning,…

34
arXiv — NLP / Computation & Language research 22d ago

Are Large Language Models Suitable for Graph Computation? Progress and Prospects

arXiv:2606.06865v1 Announce Type: new Abstract: Large language models (LLMs) have been increasingly explored for graph computation, where tasks require reasoning over structured relationships and algorithmic operations. Yet, it remains unclear when LLMs can reliably support such…

28
arXiv — NLP / Computation & Language research 22d ago

ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

arXiv:2606.06915v1 Announce Type: new Abstract: Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based…

34
arXiv — NLP / Computation & Language research 22d ago

TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents

arXiv:2606.07054v1 Announce Type: new Abstract: Autonomous LLM agents can pursue hidden malicious objectives through sequences of individually benign actions, making sabotage difficult to detect using standard trajectory-level monitoring. Existing approaches either evaluate…

22
arXiv — NLP / Computation & Language research 22d ago

mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages?

arXiv:2606.07069v1 Announce Type: new Abstract: We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require…

19
arXiv — NLP / Computation & Language research 22d ago

From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning

arXiv:2606.07190v1 Announce Type: new Abstract: Reasoning prefixes shape the future trajectory of LLM problem solving, yet existing process reward models usually evaluate them through local step correctness. We argue that correctness is a useful but indirect proxy for the effect…

21
arXiv — NLP / Computation & Language research 22d ago

M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions

arXiv:2606.07402v1 Announce Type: new Abstract: Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human form with sparse visuals and straightforward content, evaluating neither reasoning over authentic…

19
arXiv — NLP / Computation & Language research 22d ago

How reliable are LLMs when it comes to playing dice?

arXiv:2606.07515v1 Announce Type: new Abstract: We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a…

33
arXiv — NLP / Computation & Language research 22d ago

MADRAG: Multi-Agent Debate with Retrieval-Augmented Generation for Training-Free Analytic Essay Scoring

arXiv:2606.06754v1 Announce Type: cross Abstract: We present MADRAG, a training-free framework for analytic essay scoring that combines multi-agent reasoning with retrieval-augmented grounding. Unlike standard LLM-as-judge approaches, which are prone to bias and unstable…

10
arXiv — NLP / Computation & Language research 22d ago

Textual Supervision Enhances Geospatial Representations in Vision-Language Models

arXiv:2606.07172v1 Announce Type: cross Abstract: Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations…

18
arXiv — NLP / Computation & Language research 22d ago

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

arXiv:2606.07512v1 Announce Type: cross Abstract: Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple…

14
arXiv — NLP / Computation & Language research 22d ago

AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning

arXiv:2512.13278v2 Announce Type: replace Abstract: Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, which…

10
arXiv — NLP / Computation & Language research 22d ago

SEEK: Steering LLM Reasoning for RAG via Internal Reasoning Sketches

arXiv:2601.09402v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge into the generation process. Benefiting from the reasoning capabilities of LLMs, existing methods have leveraged…

8
arXiv — NLP / Computation & Language research 22d ago

Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning

arXiv:2602.11201v2 Announce Type: replace Abstract: Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer, or…

22
Hugging Face Daily Papers research 22d ago

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

Abstract Astra is an agentic spatial reasoning framework that enhances Vision-Language Models with action-conditioned visual imagination by coupling a reinforcement learning-trained policy with a world simulator for generating novel-view observations. Generated by…

22
Hugging Face Daily Papers research 22d ago

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Abstract Multimodal large language models for video understanding are structured around three core capabilities—watching, remembering, and reasoning—with applications spanning multiple video domains and addressing challenges in perception, memory, and reasoning. Generated by…

8
Hugging Face Daily Papers research 22d ago

WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

Abstract WorldBench is introduced as a visually diverse reasoning benchmark for evaluating multimodal large language models, revealing significant limitations in current models' visual understanding capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. In real-world…

11
llama.cpp releases dev-tools 23d ago

b9544

common/chat: fix LFM2/LFM2.5 reasoning round-trip and leak (#24234). common/chat: fix LFM2 reasoning round-trip and stray leak. Gate by reasoning format and whether the template supports…

30
r/LocalLLaMA community 23d ago

Z.ai, we need Air! GLM GGUF wen?

First we never saw an upgraded Air model after 4.5. Then GLM 4.7 Turbo was great, but quickly surpassed for coding. Now GLM 5.1 is a coding beast, but too huge for most to run locally, and even slow on API. Will we ever get another Air model with frontier reasoning and…

23
r/LocalLLaMA community 24d ago

I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising!

Saw this post here yesterday: KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) Cheap KV cache with good precision? Sign me up! Oh, vLLM…

12
Hugging Face Daily Papers research 25d ago

Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning

Abstract Discrete-WAM introduces a unified discrete latent vision-action world policy that enables compositional causal reasoning and counterfactual reasoning in autonomous driving through aligned discrete tokens and a shared discrete diffusion framework. Generated by…

29
Hugging Face Daily Papers research 25d ago

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

Abstract World-language-action models combine textual instruction processing with robot state prediction through an autoregressive transformer backbone, enabling efficient long-horizon task execution and cross-embodiment learning. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. We…

7
