Tag

Reasoning

500 articles archived under #reasoning

arXiv — NLP / Computation & Language research 19d ago

When is Your LLM Steerable?

arXiv:2606.11599v1 Announce Type: new Abstract: Activation steering offers a lightweight approach to control language models' behavior at inference time, but whether it succeeds or fails heavily depends on the prompt, concept, model, and steering configuration. Finding the…

35
arXiv — NLP / Computation & Language research 19d ago

Multi-Agent Reasoning with Adaptive Worker Allocation for Stance Detection

arXiv:2606.11609v1 Announce Type: new Abstract: Stance detection requires identifying an author's position toward a target, often from short-form texts where stance is implicit, indirect, or rhetorically framed. Although large language models (LLMs) achieve strong performance on…

30
arXiv — NLP / Computation & Language research 19d ago

Automated Creativity Evaluation of Language Models Across Open-Ended Tasks

arXiv:2606.11762v1 Announce Type: new Abstract: Large language models (LLMs) have achieved remarkable progress in language understanding, reasoning, and generation, sparking growing interest in their creative potential. Realizing this potential requires systematic and scalable…

14
arXiv — NLP / Computation & Language research 19d ago

WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning

arXiv:2606.11816v1 Announce Type: new Abstract: Forecasting real-world events requires language-model agents to reason under uncertainty from incomplete, time-bounded information. Yet evaluating whether agents genuinely forecast requires more than final-answer accuracy: a model…

6
arXiv — NLP / Computation & Language research 19d ago

Agreement in Representation Space for Open-Ended Self-Consistency

arXiv:2606.12003v1 Announce Type: new Abstract: Self-consistency improves LLM reasoning by sampling multiple outputs and selecting the most consistent answer, but existing formulations largely rely on exact matching and therefore remain limited to tasks with categorical outputs.…

27
arXiv — NLP / Computation & Language research 19d ago

Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

arXiv:2606.12373v1 Announce Type: new Abstract: Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). While prior research demonstrates that scaling environment…

37
arXiv — NLP / Computation & Language research 19d ago

Doc-to-Atom: Learning to Compile and Compose Memory Atoms

arXiv:2606.12400v1 Announce Type: new Abstract: Long input sequences are central to document understanding and multi-step reasoning in Large Language Models, yet the quadratic cost of attention makes inference both memory-intensive and slow. Context distillation mitigates this…

26
Hugging Face Daily Papers research 19d ago

ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

Abstract A new benchmark called ComBench is introduced to evaluate large language models' combinatorial reasoning abilities through Olympiad-level problems that test both proof construction and explicit mathematical constructions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

37
Hugging Face Daily Papers research 19d ago

Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

Abstract Recursive automated composition framework enables scalable reinforcement learning for language models by automatically combining verifiable environments through compositional operators. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reinforcement Learning (RL) with…

11
Hugging Face Daily Papers research 19d ago

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

Abstract A teacher-student framework decouples complex reasoning from efficient reward deployment in text-to-image training, achieving superior preference accuracy and optimization performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reward models are central to…

22
Hugging Face Daily Papers research 19d ago

Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

Abstract Embodied-R1.5 is a unified embodied foundation model that integrates embodied reasoning capabilities and achieves state-of-the-art performance on embodied vision-language benchmarks through a multi-task balanced reinforcement learning approach. Generated by…

35
Hugging Face Daily Papers research 19d ago

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Abstract InternVideo3 enhances long-horizon multimodal tasks through Multimodal Contextual Reasoning and efficient attention mechanisms, demonstrating strong performance on video understanding benchmarks and video agent capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

18
Hugging Face Daily Papers research 19d ago

Decentralized Multi-Agent Systems with Shared Context

Abstract Decentralized Language Models (DeLM) framework enables scalable large language model reasoning through parallel agents that asynchronously coordinate via a shared verified context, improving performance and efficiency over centralized approaches. Generated by…

25
Hugging Face Daily Papers research 19d ago

The Role of Feedback Alignment in Self-Distillation

Abstract Self-distillation effectiveness depends on structural alignment between feedback and solver reasoning, with step-aligned critique outperforming binary rewards and reference solutions by targeting specific reasoning failures. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

32
Hugging Face Daily Papers research 20d ago

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Abstract Role-Agent framework enables LLM agents to function as both agent and environment through bootstrapped co-evolution, improving performance via environment-aware reasoning and targeted practice. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Although Large Language Model…

33
Hugging Face Daily Papers research 20d ago

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Abstract MemDreamer addresses long-video understanding challenges by decoupling perception and reasoning through hierarchical graph memory and agentic exploration, achieving state-of-the-art performance with reduced computational overhead. Generated by…

33
Hugging Face Daily Papers research 20d ago

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

Abstract Chain-of-thought supervised fine-tuning degrades long-context recall in hybrid linear-attention models by biasing attention gradients toward short-range patterns, but a training-free method called QK-Restore can restore long-context capabilities by reverting query-key…

8
arXiv — Machine Learning research 20d ago

Rotate2Think: Geometric Priming via Orthogonal Rotation to Improve Language Model Reasoning

arXiv:2606.09873v1 Announce Type: new Abstract: Reasoning models achieve strong performance on challenging tasks by generating explicit intermediate reasoning traces before producing a final answer. Yet the internal structure of representation space when reasoning remains poorly…

29
arXiv — Machine Learning research 20d ago

TD-Grokking: Learning from Zero-Reward Problems by Training-Time Decomposition

arXiv:2606.09883v1 Announce Type: new Abstract: Large language models (LLMs) have made remarkable progress in reasoning tasks, largely driven by post-training paradigms, especially reinforcement learning with verifiable rewards (RLVR). However, a critical bottleneck persists:…

25
arXiv — NLP / Computation & Language research 20d ago

SocraticPO: Policy Optimization via Interactive Guidance

arXiv:2606.09887v1 Announce Type: cross Abstract: Reinforcement learning (RL) for large language models usually supervises reasoning with scalar outcome rewards, such as binary correctness. Such rewards provide an optimization direction but rarely explain how a model should…

5
arXiv — Machine Learning research 20d ago

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

arXiv:2606.09916v1 Announce Type: new Abstract: Multi-turn LLM agents fan short queries into long trajectories of tool calls, search results, and intermediate reasoning. Both KV memory and KV read bandwidth grow by orders of magnitude across a single trajectory, making the…

24
arXiv — Machine Learning research 20d ago

Sample Where You Struggle: Sharpening Base Model Reasoning via Entropy-Guided Power Sampling

arXiv:2606.09926v1 Announce Type: new Abstract: Sampling from the sequence-level power distribution $p^\alpha$ elicits RL-level reasoning from base language models without any parameter updates, but the standard Metropolis-Hastings (MH), a Markov Chain Monte Carlo (MCMC)…

20
arXiv — NLP / Computation & Language research 20d ago

RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

arXiv:2606.09937v1 Announce Type: cross Abstract: We introduce RKSC (Reasoning-Aware KV Cache Sharing), a training-free inference framework that eliminates two structural redundancies in multi-branch LLM reasoning pipelines. ASKS (Attention-Similarity KV Sharing) computes the…

32
arXiv — Machine Learning research 20d ago

Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

arXiv:2606.10184v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO) relies on the diversity of $K$ rollouts within each group; otherwise, the group-mean advantage $A^{(k)} = r^{(k)} - \mu_r$ collapses to zero. This presents a structural challenge for…

9
arXiv — Machine Learning research 20d ago

Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation

arXiv:2606.10385v1 Announce Type: new Abstract: On-policy distillation (OPD) has demonstrated strong empirical gains in enhancing complex reasoning in LLMs by aligning a student model with a teacher's predictive distribution over the student's own trajectories. An emerging…

36
arXiv — NLP / Computation & Language research 20d ago

Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models

arXiv:2606.09856v1 Announce Type: new Abstract: Post-training Large Language Models (LLMs) for reasoning typically focuses on deductive tasks such as mathematics and coding where correctness is verifiable. Yet, many real-world reasoning problems are inductive: agents must infer…

36
arXiv — NLP / Computation & Language research 20d ago

The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge

arXiv:2606.10296v1 Announce Type: new Abstract: Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between…

16
arXiv — NLP / Computation & Language research 20d ago

Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate

arXiv:2606.10307v1 Announce Type: new Abstract: Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding,…

14
arXiv — NLP / Computation & Language research 20d ago

TabClaw: An Interactive and Self-Evolving Agent for Spreadsheet Manipulation and Table Reasoning

arXiv:2606.10316v1 Announce Type: new Abstract: Spreadsheets and tables are widely used representations for structured data analysis, but effective analysis still requires substantial manual effort and domain expertise. Recent large language model (LLM) agents can automate parts…

31
arXiv — NLP / Computation & Language research 20d ago

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

arXiv:2606.10403v1 Announce Type: new Abstract: Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung)…

34
arXiv — NLP / Computation & Language research 20d ago

WebChallenger: A Reliable and Efficient Generalist Web Agent

arXiv:2606.10423v1 Announce Type: new Abstract: Autonomous web navigation remains challenging for LLM agents, and the strongest generalist systems rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive tasks where such agents would be most…

31
arXiv — NLP / Computation & Language research 20d ago

REAL: A Reasoning-Enhanced Graph Framework for Long-Term Memory Management of LLMs

arXiv:2606.10694v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly expected to interact with users over long time horizons. However, due to their finite context window, LLMs cannot retain all past interactions, making long-term memory management…

22
arXiv — NLP / Computation & Language research 20d ago

Dep-LLM: Training-Free Depression Diagnosis via Evidence-Guided Structured Multi-factor with Reliable LLM Reasoning

arXiv:2606.10796v1 Announce Type: new Abstract: Automatic Depression Detection (ADD) from clinical interviews is a pivotal task in computational mental health, yet it remains challenging due to two critical obstacles: 1) difficulty in modeling complex but sparsely distributed…

5
arXiv — NLP / Computation & Language research 20d ago

Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

arXiv:2606.11046v1 Announce Type: new Abstract: Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the…

18
arXiv — NLP / Computation & Language research 20d ago

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

arXiv:2606.11052v1 Announce Type: new Abstract: Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including…

26
arXiv — NLP / Computation & Language research 20d ago

T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains

arXiv:2606.11070v1 Announce Type: new Abstract: Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing benchmarks remain limited in task complexity, realism, and domain…

15
arXiv — NLP / Computation & Language research 20d ago

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

arXiv:2606.10254v1 Announce Type: cross Abstract: While Large Language Models (LLMs) have achieved near-perfect performance in solving high-school mathematics, their ability to evaluate the diverse reasoning processes of real human students remains under-examined.…

19
arXiv — NLP / Computation & Language research 20d ago

Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation

arXiv:2606.10475v1 Announce Type: cross Abstract: Multi-agent debate frameworks have been shown to improve large language model performance in convergent tasks, but they are currently optimized in a way that heavily favors final output accuracy rather than stability of the…

18
arXiv — NLP / Computation & Language research 20d ago

How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs

arXiv:2606.10646v1 Announce Type: cross Abstract: Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs), where RL recipes typically treat all tokens equally, failing to distinguish decisive reasoning steps from…

5
Hugging Face Daily Papers research 20d ago

How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs

Abstract FlowTracer is an RL framework that uses attention-induced graphs to trace reasoning flows and assign token-level credit based on global information propagation structures. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Token-level credit assignment remains a key obstacle…

26
Hugging Face Daily Papers research 20d ago

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Abstract Multi-turn reasoning models exhibit hidden alignment failures that are masked by traditional evaluation methods, revealing vulnerabilities through a trace-level diagnostic framework that identifies distinct failure modes including context-injection failures. Generated…

12
r/LocalLLaMA community 20d ago

Fine-tuned Qwen2.5-7B to 96% of Claude Haiku on a domain-specific task using ~$3 of API calls and zero human labelers

Built a decision-reasoning engine (Orlog) and wanted to fine-tune a local model for it instead of paying per-call forever. The method (DV-DPO): run a 3-voice council on each question and produce a synthesis; cross-examine, with losing voices challenging the synthesis; if synthesis gets…

35
Hugging Face Daily Papers research 20d ago

Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense

Abstract SCOUT framework dynamically allocates prompt-injection detection by predicting detector reliability and latency, improving safety and efficiency over fixed single-detector approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Prompt-injection detectors are…

30
Hugging Face Daily Papers research 20d ago

SDR: Set-Distance Rewards for Radiology Report Generation

Abstract Set-based rewards using embedding distances improve chest X-ray report generation by enabling effective post-training and test-time selection without requiring causal reasoning structures. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reinforcement learning with…

14
Google DeepMind official-blog 20d ago

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Jun 03, 2026 · Gemma 4 12B is designed to bring high-performance multimodal intelligence directly to your laptop, combining mobile-first efficiency with advanced reasoning.…

17
Hugging Face Daily Papers research 20d ago

Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

Abstract Skill-3D framework enables agents to learn scene-aware skills through self-evolving memory and skill libraries, improving tool utilization in 3D spatial reasoning tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct This paper explores agentic 3D spatial understanding,…

22
Hugging Face Daily Papers research 20d ago

Reasoning over Grammar: Can Synthetic Linguistic Reasoning Traces Enhance Low-Resource Machine Translation?

Abstract Large language models can improve translation for low-resource languages through structured linguistic reasoning traces, with the most significant benefits occurring during inference rather than training. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language…

30
Hugging Face Daily Papers research 21d ago

OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

Abstract OmniCap-IF is introduced as the first comprehensive benchmark for evaluating instruction-following capabilities in omni-modal captioning, revealing significant performance disparities and a format-content tradeoff in multi-modal reasoning. Generated by…

5
Hugging Face Daily Papers research 21d ago

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Abstract Optical reasoning uses images as a standalone reasoning medium for language and multimodal tasks, achieving higher token efficiency than traditional text-based approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Chain-of-Thought (CoT) improves the performance of…

27
Hugging Face Daily Papers research 21d ago

Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

Abstract Reasoning Arena improves reinforcement learning with verifiable rewards by using trace tournaments and Bradley-Terry models to generate meaningful gradients from non-diverse reward groups, resulting in faster training and better reasoning performance. Generated by…

15
