Tag

Reasoning

500 articles archived under #reasoning

arXiv — NLP / Computation & Language research 26d ago

Imbuing Large Language Models with Bidirectional Logic for Robust Chain Repair

arXiv:2606.05030v1 Announce Type: new Abstract: Autoregressive chain-of-thought (CoT) reasoning in large language models (LLMs) is fundamentally forward-directed: each step conditions only on prior tokens. This unidirectional inductive bias renders even capable models…

31
arXiv — NLP / Computation & Language research 26d ago

Boosting Self-Consistency with Ranking

arXiv:2606.05054v1 Announce Type: new Abstract: Self-consistency improves large language models by sampling multiple reasoning paths and selecting the most frequent answer, but majority voting often fails to recover correct answers that are already present among the samples. We…

33
arXiv — NLP / Computation & Language research 26d ago

Arithmetic Pedagogy for Language Models

arXiv:2606.05106v1 Announce Type: new Abstract: We investigate whether methods of human mathematics pedagogy can guide the training of language models toward arithmetic reasoning. Building on the GASING method -- an Indonesian pedagogy that solves basic arithmetic through a…

32
arXiv — NLP / Computation & Language research 26d ago

Streaming Communication in Multi-Agent Reasoning

arXiv:2606.05158v1 Announce Type: new Abstract: Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to…

8
arXiv — NLP / Computation & Language research 26d ago

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

arXiv:2606.04244v1 Announce Type: cross Abstract: Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when…

7
arXiv — NLP / Computation & Language research 26d ago

StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

arXiv:2606.04246v1 Announce Type: cross Abstract: Automatic generation of RTL code for digital hardware designs remains challenging due to long-horizon reasoning, multi-step dependencies, and strict correctness constraints in Verilog and VHDL. We present StepPRM-RTL, a novel…

8
arXiv — NLP / Computation & Language research 26d ago

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

arXiv:2606.04435v1 Announce Type: cross Abstract: Multi-step agentic retrieval-augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms…

25
r/MachineLearning community 26d ago

Best Visual Reasoning Model in 2026 (Including APIs) [D]

For example, suppose I have a one-hour video and I provide it to ChatGPT or another AI model. If I ask complex reasoning questions about the video, which models are best suited for long-horizon video understanding and reasoning? Which models can produce the most reliable answers…

38
Hugging Face Daily Papers research 26d ago

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

Abstract ThoughtFold addresses over-thinking in large reasoning models by using fine-grained preference learning to identify and eliminate redundant explorations in chain-of-thought reasoning processes. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Large Reasoning Models (LRMs)…

13
Hugging Face Daily Papers research 26d ago

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

Abstract MapAgent is an industrial-grade agentic architecture that combines vision-language processing with constraint-aware reasoning to produce specification-compliant lane maps, achieving high automation rates in large-scale urban mapping. Generated by…

21
Hugging Face Daily Papers research 26d ago

Streaming Communication in Multi-Agent Reasoning

Abstract StreamMA enables efficient multi-agent reasoning by streaming intermediate results and leveraging reliable early steps to improve both latency and effectiveness across various reasoning tasks. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Multi-agent reasoning systems…

12
Hugging Face Daily Papers research 26d ago

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Abstract Deep-research agents can be audited using a claim-centric framework that identifies error spans in their reasoning trajectories, improving reliability assessment beyond just final answer evaluation. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Deep-research agents solve…

20
Hugging Face Daily Papers research 26d ago

MemTrain: Self-Supervised Context Memory Training

Abstract A self-supervised training framework called MemTrain enhances long-horizon language model agents' memory capabilities through proxy tasks optimized via GRPO, improving downstream reasoning performance. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Memory is an…

4
Hugging Face Daily Papers research 26d ago

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Abstract Wide-baseline matching presents a challenging spatial reasoning testbed for multimodal large language models, requiring systematic evaluation and training frameworks that current models lack, prompting the introduction of ReasonMatch-Bench and Dynamic Correspondence…

28
Hugging Face Daily Papers research 26d ago

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

Abstract Agentic Chain-of-Thought Steering (ACTS) formulates reasoning steering as a Markov decision process to enable efficient, controllable chain-of-thought reasoning with token savings. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Large language models improve final-answer…

19
Hugging Face Daily Papers research 26d ago

KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

Abstract KVarN is a calibration-free KV-cache quantizer that uses Hadamard rotation and dual-scaling variance normalization to reduce error accumulation during autoregressive decoding in large language models. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Test-time scaling is a…

28
OpenAI official-blog 26d ago

Introducing new capabilities to GPT-Rosalind

GPT-Rosalind advances life sciences research with enhanced biological reasoning, medicinal chemistry expertise, genomics analysis, and experimental workflow capabilities.

38
Hugging Face Daily Papers research 27d ago

OCC-RAG: Optimal Cognitive Core for Faithful Question Answering

Abstract Compact task-specialized language models demonstrate superior performance in multi-hop reasoning and faithfulness compared to larger general-purpose models through a novel training pipeline and structured reasoning traces. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

32
Hugging Face Daily Papers research 27d ago

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

Abstract TRON enables scalable and controllable reinforcement learning for visual reasoning through an online environment substrate that generates unlimited diverse training instances with verifiable answers. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Reinforcement learning…

22
Hugging Face Daily Papers research 27d ago

Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces

Abstract Answer-correct long chain-of-thought traces can lead to different fine-tuning outcomes, with post-conclusion continuations identified as harmful to training, characterized by uncertainty-geometry mismatches and addressed through a lightweight boundary proxy method.…

26
Hugging Face Daily Papers research 27d ago

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

Abstract Controlled concrete reasoning combines visual simulation with abstract reasoning through a training method that uses privileged future information to improve prediction accuracy and robustness. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) World models and multimodal…

19
Hugging Face Daily Papers research 27d ago

Value-Aware Stochastic KV Cache Eviction for Reasoning Models

Abstract Value-aware stochastic KV cache eviction method improves reasoning model accuracy under compression by protecting large-magnitude states and promoting diverse eviction decisions. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Reasoning models improve accuracy through…

9
arXiv — Machine Learning research 27d ago

Spectral-Progressive Thought Flow for Lightweight Multimodal Reasoning

arXiv:2606.02842v1 Announce Type: new Abstract: Multimodal spatial reasoning often relies on long chains of intermediate textual and visual thoughts, where accumulating visual tokens and dense cross-modal attention incur substantial computation and memory overhead. To address…

6
arXiv — Machine Learning research 27d ago

Are we really tilting? The mechanics of reward guidance in flow and diffusion models

arXiv:2606.02884v1 Announce Type: new Abstract: Reward guidance algorithms steer a learned generative process toward the reward-tilted measure at inference time. While empirically powerful, these methods are prone to reward hacking: the guided model over-optimizes the reward at…

11
arXiv — Machine Learning research 27d ago

KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators

arXiv:2606.02963v1 Announce Type: new Abstract: Production inference increasingly targets a heterogeneous mix of accelerators. Agentic pipelines interleave reasoning, tool calls, and multi-agent coordination, each with distinct compute and memory profiles. For optimal…

19
arXiv — Machine Learning research 27d ago

MOSAIC: Efficient Mixture-of-Agent Scheduling via Adaptive Aggregation and Inference Concurrency

arXiv:2606.03014v1 Announce Type: new Abstract: Mixture-of-Agents (MoA) systems improve reasoning accuracy by routing each query to multiple expert LLMs and aggregating their outputs. Efficiently executing this workload on limited GPU resources has bottlenecks. Skill-based…

22
arXiv — Machine Learning research 27d ago

Libra: Efficient Resource Management for Agentic RL Post-Training

arXiv:2606.03077v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a standard post-training paradigm for large language models (LLMs), extending beyond preference alignment to complex reasoning and multi-turn agentic behaviors. In agentic RL, the rollout…

23
arXiv — Machine Learning research 27d ago

FGRPO: Federated GRPO with Adaptive Aggregation on Non-IID Data

arXiv:2606.03094v1 Announce Type: new Abstract: Recent advances in language models have established reinforcement learning as the primary paradigm for eliciting self-correction and long-chain reasoning. While group relative policy optimization (GRPO) offers superior scalability…

4
arXiv — Machine Learning research 27d ago

Right Makes Might: Aligning Verified Hidden States Empowers RL Reasoning

arXiv:2606.03234v1 Announce Type: new Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has become the dominant approach for improving mathematical reasoning in large language models, yet current methods reduce each correct rollout to a single reward bit, ignoring…

21
arXiv — Machine Learning research 27d ago

KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

arXiv:2606.03458v1 Announce Type: new Abstract: Test-time scaling is a powerful approach to obtain better reasoning in large language models, but it becomes memory-bottlenecked during long-horizon decoding, as the KV-cache grows. KV-cache quantization can help improve this, but…

27
arXiv — NLP / Computation & Language research 27d ago

Adaptive Latent Agentic Reasoning

arXiv:2606.02871v1 Announce Type: new Abstract: Large reasoning models improve performance by generating extended chain-of-thought (CoT) reasoning, but this behavior becomes inefficient when applied to LLM agents. Current LLM agents often generate verbose textual reasoning at…

19
arXiv — NLP / Computation & Language research 27d ago

Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States

arXiv:2606.02907v1 Announce Type: new Abstract: Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the…

21
arXiv — NLP / Computation & Language research 27d ago

Hint-Guided Diversified Policy Optimization for LLM Reasoning

arXiv:2606.03021v1 Announce Type: new Abstract: Recent developments in Large Language Models (LLMs) have showcased impressive reasoning capabilities, with Reinforcement Learning with Verifiable Rewards (RLVR) being a promising enhancement strategy. However, existing reward…

6
arXiv — NLP / Computation & Language research 27d ago

PhotoCraft: Agentic Reasoning with Hierarchical Self-Evolving Memory for Deep Image Search

arXiv:2606.03099v1 Announce Type: new Abstract: Deep Image Search requires multi-step reasoning over rich contextual cues, such as time, location, and event relations. However, most existing LLM-based agents are stateless and reactive, lacking persistent memory to maintain…

11
arXiv — NLP / Computation & Language research 27d ago

Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

arXiv:2606.03102v1 Announce Type: new Abstract: Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically…

21
arXiv — NLP / Computation & Language research 27d ago

SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series

arXiv:2606.03301v1 Announce Type: new Abstract: We introduce SagaQA, a long-form video benchmark for multi-hop reasoning over full-length TV series. Existing video reasoning benchmarks often emphasize local understanding of adjacent frames or clips. SagaQA addresses this gap by…

33
arXiv — NLP / Computation & Language research 27d ago

Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions

arXiv:2606.03331v1 Announce Type: new Abstract: Consumer device repair is an important but underexplored testbed for large language models (LLMs). Repair tasks require reasoning over incomplete problem descriptions, hardware-specific diagnostics, actionable troubleshooting, and…

38
arXiv — NLP / Computation & Language research 27d ago

The Unsampled Truth: Psychometrics in SLMs Measure Prompt Artifacts, Not Psychological Constructs

arXiv:2606.03357v1 Announce Type: new Abstract: When prompting SLMs for psychometric assessments, researchers assume the outputs reflect semantic reasoning. We evaluate this premise across 13 open-weights models (0.6B to 14B parameters) using a prompt variation framework that…

18
arXiv — NLP / Computation & Language research 27d ago

Framing Migration News with LLMs: Structured CoT as a Support for Human Interpretation

arXiv:2606.03761v1 Announce Type: new Abstract: Frame analysis of migration news is a socially consequential task: media scholars and researchers who study how migration is narrated need tools that are not only accurate, but transparent, auditable, and accessible within the…

25
arXiv — NLP / Computation & Language research 27d ago

HybridThinker: Efficient Chain-of-Thought Reasoning via Compressed Memory and Transient Thought Steps

arXiv:2606.03768v1 Announce Type: new Abstract: Extended chain-of-thought (CoT) traces improve LLM reasoning but incur substantial computational and memory costs. While existing CoT compression methods mitigate this by condensing thought steps into compact representations via…

26
arXiv — NLP / Computation & Language research 27d ago

Reasoning over Grammar: Can Synthetic Linguistic Reasoning Traces Enhance Low-Resource Machine Translation?

arXiv:2606.03782v1 Announce Type: new Abstract: Large language models (LLMs) offer a promising approach to machine translation (MT) for extremely low-resource languages by incorporating linguistic resources through in-context learning. However, LLMs often struggle to apply…

14
arXiv — NLP / Computation & Language research 27d ago

Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models

arXiv:2606.03793v1 Announce Type: new Abstract: Multimodal Large Language Models integrate visual perception into language reasoning, introducing a continuous attack surface susceptible to adversarial attacks. Prior work on MLLM robustness has focused largely on English-centric…

19
arXiv — NLP / Computation & Language research 27d ago

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

arXiv:2606.03965v1 Announce Type: new Abstract: Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking…

6
arXiv — NLP / Computation & Language research 27d ago

Quantifying Faithful Confidence Expression in Large Reasoning Models

arXiv:2606.03969v1 Announce Type: new Abstract: Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)--the alignment between models' intrinsic and (linguistically) expressed confidence--is a persistent failure mode. This…

35
arXiv — NLP / Computation & Language research 27d ago

Attention Calibration for Position-Fair Dense Information Retrieval

arXiv:2606.02737v1 Announce Type: cross Abstract: Dense retrieval models exhibit positional bias: retrieval effectiveness degrades when relevant information appears later in a passage (Zeng et al., 2025). We ask whether this bias can be reduced at inference time, without…

34
arXiv — NLP / Computation & Language research 27d ago

Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

arXiv:2606.02812v1 Announce Type: cross Abstract: Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context multimodal sequences. Existing LLM-based multi-agent systems address context length but…

38
Hugging Face Daily Papers research 27d ago

MindZero: Learning Online Mental Reasoning With Zero Annotations

Abstract MindZero presents a self-supervised reinforcement learning framework that enables multimodal large language models to perform efficient and robust online mental reasoning without requiring explicit mental state annotations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

35
Simon Willison community 27d ago

Microsoft's new MAI models

Microsoft announced two new text LLMs this morning - MAI-Thinking-1 (reasoning, 35B parameters, available to "select early partners") and MAI-Code-1-Flash (5B parameters, "purpose-built for GitHub Copilot and VS Code to deliver high performance and lower cost [...] rolling out…

17
r/LocalLLaMA community 27d ago

Weird issue with OpenCode and Qwen3.6

I’m using Qwen3.6-27B running on my server with llama-server for AI coding with OpenCode. Sometimes, for some reason, the response stops while it's reasoning, as if it had finished outputting the full response. I have to type “continue” and it continues working as if nothing…

30
Hugging Face Daily Papers research 27d ago

SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence

Abstract Strategic Video Intelligence requires understanding, causal reasoning, and planning capabilities that current benchmarks fail to evaluate adequately, leading to significant performance gaps in complex cognitive tasks. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) True…

10
