Reasoning · 500 articles archived under #reasoning
arXiv — NLP / Computation & Language research 19d ago When is Your LLM Steerable? arXiv:2606.11599v1 Announce Type: new Abstract: Activation steering offers a lightweight approach to control language models' behavior at inference time, but whether it succeeds or fails heavily depends on the prompt, concept, model, and steering configuration. Finding the… 35 arXiv — NLP / Computation & Language research 19d ago Multi-Agent Reasoning with Adaptive Worker Allocation for Stance Detection arXiv:2606.11609v1 Announce Type: new Abstract: Stance detection requires identifying an author's position toward a target, often from short-form texts where stance is implicit, indirect, or rhetorically framed. Although large language models (LLMs) achieve strong performance on… 30 arXiv — NLP / Computation & Language research 19d ago Automated Creativity Evaluation of Language Models Across Open-Ended Tasks arXiv:2606.11762v1 Announce Type: new Abstract: Large language models (LLMs) have achieved remarkable progress in language understanding, reasoning, and generation, sparking growing interest in their creative potential. Realizing this potential requires systematic and scalable… 14 arXiv — NLP / Computation & Language research 19d ago WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning arXiv:2606.11816v1 Announce Type: new Abstract: Forecasting real-world events requires language-model agents to reason under uncertainty from incomplete, time-bounded information.
Yet evaluating whether agents genuinely forecast requires more than final-answer accuracy: a model… 6 arXiv — NLP / Computation & Language research 19d ago Agreement in Representation Space for Open-Ended Self-Consistency arXiv:2606.12003v1 Announce Type: new Abstract: Self-consistency improves LLM reasoning by sampling multiple outputs and selecting the most consistent answer, but existing formulations largely rely on exact matching and therefore remain limited to tasks with categorical outputs.… 27 arXiv — NLP / Computation & Language research 19d ago Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization arXiv:2606.12373v1 Announce Type: new Abstract: Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). While prior research demonstrates that scaling environment… 37 arXiv — NLP / Computation & Language research 19d ago Doc-to-Atom: Learning to Compile and Compose Memory Atoms arXiv:2606.12400v1 Announce Type: new Abstract: Long input sequences are central to document understanding and multi-step reasoning in Large Language Models, yet the quadratic cost of attention makes inference both memory-intensive and slow. Context distillation mitigates this… 26 Hugging Face Daily Papers research 19d ago ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics Abstract A new benchmark called ComBench is introduced to evaluate large language models' combinatorial reasoning abilities through Olympiad-level problems that test both proof construction and explicit mathematical constructions. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 37 Hugging Face Daily Papers research 19d ago Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization Abstract Recursive automated composition framework enables scalable reinforcement learning for language models by automatically combining verifiable environments through compositional operators. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reinforcement Learning (RL) with… 11 Hugging Face Daily Papers research 19d ago Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions Abstract A teacher-student framework decouples complex reasoning from efficient reward deployment in text-to-image training, achieving superior preference accuracy and optimization performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reward models are central to… 22 Hugging Face Daily Papers research 19d ago Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models Abstract Embodied-R1.5 is a unified embodied foundation model that integrates embodied reasoning capabilities and achieves state-of-the-art performance on embodied vision-language benchmarks through a multi-task balanced reinforcement learning approach. Generated by… 35 Hugging Face Daily Papers research 19d ago InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning Abstract InternVideo3 enhances long-horizon multimodal tasks through Multimodal Contextual Reasoning and efficient attention mechanisms, demonstrating strong performance on video understanding benchmarks and video agent capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 18 Hugging Face Daily Papers research 19d ago Decentralized Multi-Agent Systems with Shared Context Abstract Decentralized Language Models (DeLM) framework enables scalable large language model reasoning through parallel agents that asynchronously coordinate via a shared verified context, improving performance and efficiency over centralized approaches. 
Generated by… 25 Hugging Face Daily Papers research 19d ago The Role of Feedback Alignment in Self-Distillation Abstract Self-distillation effectiveness depends on structural alignment between feedback and solver reasoning, with step-aligned critique outperforming binary rewards and reference solutions by targeting specific reasoning failures. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 32 Hugging Face Daily Papers research 20d ago Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution Abstract Role-Agent framework enables LLM agents to function as both agent and environment through bootstrapped co-evolution, improving performance via environment-aware reasoning and targeted practice. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Although Large Language Model… 33 Hugging Face Daily Papers research 20d ago MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism Abstract MemDreamer addresses long-video understanding challenges by decoupling perception and reasoning through hierarchical graph memory and agentic exploration, achieving state-of-the-art performance with reduced computational overhead. Generated by… 33 Hugging Face Daily Papers research 20d ago Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It Abstract Chain-of-thought supervised fine-tuning degrades long-context recall in hybrid linear-attention models by biasing attention gradients toward short-range patterns, but a training-free method called QK-Restore can restore long-context capabilities by reverting query-key… 8 arXiv — Machine Learning research 20d ago Rotate2Think: Geometric Priming via Orthogonal Rotation to Improve Language Model Reasoning arXiv:2606.09873v1 Announce Type: new Abstract: Reasoning models achieve strong performance on challenging tasks by generating explicit intermediate reasoning traces before producing a final answer. 
Yet the internal structure of representation space when reasoning remains poorly… 29 arXiv — Machine Learning research 20d ago TD-Grokking: Learning from Zero-Reward Problems by Training-Time Decomposition arXiv:2606.09883v1 Announce Type: new Abstract: Large language models (LLMs) have made remarkable progress in reasoning tasks, largely driven by post-training paradigms, especially reinforcement learning with verifiable rewards (RLVR). However, a critical bottleneck persists:… 25 arXiv — NLP / Computation & Language research 20d ago SocraticPO: Policy Optimization via Interactive Guidance arXiv:2606.09887v1 Announce Type: cross Abstract: Reinforcement learning (RL) for large language models usually supervises reasoning with scalar outcome rewards, such as binary correctness. Such rewards provide an optimization direction but rarely explain how a model should… 5 arXiv — Machine Learning research 20d ago IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference arXiv:2606.09916v1 Announce Type: new Abstract: Multi-turn LLM agents fan short queries into long trajectories of tool calls, search results, and intermediate reasoning. 
Both KV memory and KV read bandwidth grow by orders of magnitude across a single trajectory, making the… 24 arXiv — Machine Learning research 20d ago Sample Where You Struggle: Sharpening Base Model Reasoning via Entropy-Guided Power Sampling arXiv:2606.09926v1 Announce Type: new Abstract: Sampling from the sequence-level power distribution $p^\alpha$ elicits RL-level reasoning from base language models without any parameter updates, but the standard Metropolis--Hastings (MH), a Markov Chain Monte Carlo (MCMC)… 20 arXiv — NLP / Computation & Language research 20d ago RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference arXiv:2606.09937v1 Announce Type: cross Abstract: We introduce RKSC (Reasoning-Aware KV Cache Sharing), a training-free inference framework that eliminates two structural redundancies in multi-branch LLM reasoning pipelines. ASKS (Attention-Similarity KV Sharing) computes the… 32 arXiv — Machine Learning research 20d ago Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning arXiv:2606.10184v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO) relies on the diversity of $K$ rollouts within each group; otherwise, the group-mean advantage $A^{(k)} = r^{(k)} - \mu_r$ collapses to zero. This presents a structural challenge for… 9 arXiv — Machine Learning research 20d ago Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation arXiv:2606.10385v1 Announce Type: new Abstract: On-policy distillation (OPD) has demonstrated strong empirical gains in enhancing complex reasoning in LLMs by aligning a student model with a teacher's predictive distribution over the student's own trajectories. 
An emerging… 36 arXiv — NLP / Computation & Language research 20d ago Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models arXiv:2606.09856v1 Announce Type: new Abstract: Post-training Large Language Models (LLMs) for reasoning typically focuses on deductive tasks such as mathematics and coding where correctness is verifiable. Yet, many real-world reasoning problems are inductive: agents must infer… 36 arXiv — NLP / Computation & Language research 20d ago The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge arXiv:2606.10296v1 Announce Type: new Abstract: Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between… 16 arXiv — NLP / Computation & Language research 20d ago Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate arXiv:2606.10307v1 Announce Type: new Abstract: Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding,… 14 arXiv — NLP / Computation & Language research 20d ago TabClaw: An Interactive and Self-Evolving Agent for Spreadsheet Manipulation and Table Reasoning arXiv:2606.10316v1 Announce Type: new Abstract: Spreadsheets and tables are widely used representations for structured data analysis, but effective analysis still requires substantial manual effort and domain expertise. 
Recent large language model (LLM) agents can automate parts… 31 arXiv — NLP / Computation & Language research 20d ago KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty arXiv:2606.10403v1 Announce Type: new Abstract: Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung)… 34 arXiv — NLP / Computation & Language research 20d ago WebChallenger: A Reliable and Efficient Generalist Web Agent arXiv:2606.10423v1 Announce Type: new Abstract: Autonomous web navigation remains challenging for LLM agents, and the strongest generalist systems rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive tasks where such agents would be most… 31 arXiv — NLP / Computation & Language research 20d ago REAL: A Reasoning-Enhanced Graph Framework for Long-Term Memory Management of LLMs arXiv:2606.10694v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly expected to interact with users over long time horizons. However, due to their finite context window, LLMs cannot retain all past interactions, making long-term memory management… 22 arXiv — NLP / Computation & Language research 20d ago Dep-LLM: Training-Free Depression Diagnosis via Evidence-Guided Structured Multi-factor with Reliable LLM Reasoning arXiv:2606.10796v1 Announce Type: new Abstract: Automatic Depression Detection (ADD) from clinical interviews is a pivotal task in computational mental health, yet it remains challenging due to two critical obstacles: 1) difficulty in modeling complex but sparsely distributed… 5 arXiv — NLP / Computation & Language research 20d ago Does Reasoning Preserve Alignment? 
On the Trustworthiness of Large Reasoning Models arXiv:2606.11046v1 Announce Type: new Abstract: Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the… 18 arXiv — NLP / Computation & Language research 20d ago Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It arXiv:2606.11052v1 Announce Type: new Abstract: Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including… 26 arXiv — NLP / Computation & Language research 20d ago T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains arXiv:2606.11070v1 Announce Type: new Abstract: Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. 
However, existing benchmarks remain limited in task complexity, realism, and domain… 15 arXiv — NLP / Computation & Language research 20d ago RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning arXiv:2606.10254v1 Announce Type: cross Abstract: While Large Language Models (LLMs) have achieved near-perfect performance in \emph{solving} high-school mathematics, their ability to \emph{evaluate} the diverse reasoning processes of real human students remains under-examined.… 19 arXiv — NLP / Computation & Language research 20d ago Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation arXiv:2606.10475v1 Announce Type: cross Abstract: Multi-agent debate frameworks have been shown to improve large language model performance in convergent tasks, but they are currently optimized in a way that heavily favors final output accuracy rather than stability of the… 18 arXiv — NLP / Computation & Language research 20d ago How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs arXiv:2606.10646v1 Announce Type: cross Abstract: Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs), where RL recipes typically treat all tokens equally, failing to distinguish decisive reasoning steps from… 5 Hugging Face Daily Papers research 20d ago How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs Abstract FlowTracer is an RL framework that uses attention-induced graphs to trace reasoning flows and assign token-level credit based on global information propagation structures. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Token-level credit assignment remains a key obstacle… 26 Hugging Face Daily Papers research 20d ago When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models Abstract Multi-turn reasoning models exhibit hidden alignment failures that are masked by traditional evaluation methods, revealing vulnerabilities through a trace-level diagnostic framework that identifies distinct failure modes including context-injection failures. Generated… 12 r/LocalLLaMA community 20d ago Fine-tuned Qwen2.5-7B to 96% of Claude Haiku on a domain-specific task using ~$3 of API calls and zero human labelers Built a decision-reasoning engine (Orlog) and wanted to fine-tune a local model for it instead of paying per-call forever. The method (DV-DPO): Run a 3-voice council on each question, produce a synthesis Cross-examine: losing voices challenge the synthesis If synthesis gets… 35 Hugging Face Daily Papers research 20d ago Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense Abstract SCOUT framework dynamically allocates prompt-injection detection by predicting detector reliability and latency, improving safety and efficiency over fixed single-detector approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Prompt-injection detectors are… 30 Hugging Face Daily Papers research 20d ago SDR: Set-Distance Rewards for Radiology Report Generation Abstract Set-based rewards using embedding distances improve chest X-ray report generation by enabling effective post-training and test-time selection without requiring causal reasoning structures. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reinforcement learning with… 14 Google DeepMind official-blog 20d ago Introducing Gemma 4 12B: a unified, encoder-free multimodal model Jun 03, 2026 Gemma 4 12B is designed to bring high-performance multimodal intelligence directly to your laptop, combining mobile-first efficiency with advanced reasoning.… 17 Hugging Face Daily Papers research 20d ago Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning Abstract Skill-3D framework enables agents to learn scene-aware skills through self-evolving memory and skill libraries, improving tool utilization in 3D spatial reasoning tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct This paper explores agentic 3D spatial understanding,… 22 Hugging Face Daily Papers research 20d ago Reasoning over Grammar: Can Synthetic Linguistic Reasoning Traces Enhance Low-Resource Machine Translation? Abstract Large language models can improve translation for low-resource languages through structured linguistic reasoning traces, with the most significant benefits occurring during inference rather than training. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language… 30 Hugging Face Daily Papers research 21d ago OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning Abstract OmniCap-IF is introduced as the first comprehensive benchmark for evaluating instruction-following capabilities in omni-modal captioning, revealing significant performance disparities and a format-content tradeoff in multi-modal reasoning. Generated by… 5 Hugging Face Daily Papers research 21d ago Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text Abstract Optical reasoning uses images as a standalone reasoning medium for language and multimodal tasks, achieving higher token efficiency than traditional text-based approaches.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Chain-of-Thought (CoT) improves the performance of… 27 Hugging Face Daily Papers research 21d ago Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short Abstract Reasoning Arena improves reinforcement learning with verifiable rewards by using trace tournaments and Bradley-Terry models to generate meaningful gradients from non-diverse reward groups, resulting in faster training and better reasoning performance. Generated by… 15