Tag

Reasoning

500 articles archived under #reasoning

arXiv — NLP / Computation & Language research 4d ago

Zero-shot Tweet-Level Stance Detection Enhanced by External Knowledge and Reflective Chain-of-Thought Reasoning

arXiv:2606.26571v1 Announce Type: new Abstract: Zero-shot tweet-level stance detection confronts two primary challenges: (1) mitigating the context sparsity inherent in short texts, and (2) establishing the relevance between implicit targets and textual content. While existing…

35
arXiv — NLP / Computation & Language research 4d ago

Beyond Logical Forms: LLM-Extracted Patterns for Fallacy Classification

arXiv:2606.26698v1 Announce Type: new Abstract: In today's fast-paced information era, logical fallacies, defined as defective patterns of reasoning, inevitably contribute to the growth of information disorder. However, often fallacies appear in nuanced forms that complicate…

37
arXiv — NLP / Computation & Language research 4d ago

Information-Aware KV Cache Compression for Long Reasoning

arXiv:2606.26875v1 Announce Type: new Abstract: Reasoning capability has advanced rapidly in large language models (LLMs), leading to an increasing size of key-value (KV) cache in both prefilling and decoding stages. Existing KV cache compression methods mainly rely on attention…

15
arXiv — NLP / Computation & Language research 4d ago

ReaORE: Reasoning-Guided Progressive Open Relation Extraction Empowered by Large Reasoning Models

arXiv:2606.26986v1 Announce Type: new Abstract: Open Relation Extraction (OpenRE) requires a model to extract unseen relations between head and tail entities from unstructured text for real-world applications. The core challenge of OpenRE lies in achieving reliable…

13
arXiv — NLP / Computation & Language research 4d ago

Improving General Role-Playing Agents via Psychology-Grounded Reasoning and Role-Aware Policy Optimization

arXiv:2606.27025v1 Announce Type: new Abstract: Building general-purpose role-playing agents that faithfully portray any character from a natural-language profile remains challenging. The dominant paradigm -- supervised fine-tuning -- encourages behavioral mimicry without deep,…

16
arXiv — NLP / Computation & Language research 4d ago

The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans

arXiv:2606.27103v1 Announce Type: new Abstract: Humans flexibly adapt their reasoning strategies to the requirements of a given problem. Large language models (LLMs) have performed well on many cognitive tasks; however, it is unclear whether this accuracy is a result of pattern…

9
arXiv — NLP / Computation & Language research 4d ago

Multilingual Reasoning Cascades Need More Context

arXiv:2606.27306v1 Announce Type: new Abstract: Translation cascades for reasoning translate the query from another language to English, reason in English, and translate the answer back to the original language. This is a competitive approach to multilingual reasoning, but…

7
arXiv — NLP / Computation & Language research 4d ago

The Verification Horizon: No Silver Bullet for Coding Agent Rewards

arXiv:2606.26300v1 Announce Type: cross Abstract: A classical intuition holds that verifying a solution is easier than producing one. For today's coding agents, this intuition is being inverted: as foundation models develop stronger reasoning capabilities and engineering…

24
arXiv — NLP / Computation & Language research 4d ago

Narration-of-Thought: Inference-Time Scaffolding for Defeasible Ethical Reasoning in Large Language Models

arXiv:2606.26366v1 Announce Type: cross Abstract: Standard chain-of-thought on moral dilemmas exhibits two failure modes: stakeholder collapse (the trace names at most one party with a stake in the outcome) and uncertainty suppression (no explicit unknowns or hedges before…

29
arXiv — NLP / Computation & Language research 4d ago

Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs

arXiv:2606.26387v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) extend large language models (LLMs) with visual perception, enabling joint reasoning over images and text. Despite inheriting strong reasoning capabilities from LLMs, they remain prone to…

19
arXiv — NLP / Computation & Language research 4d ago

Humans Disengage, Reasoning Models Persist: Separating Difficulty Registration from Deliberation Allocation

arXiv:2606.26502v1 Announce Type: cross Abstract: Large reasoning models (LRMs) take longer on harder problems, just as humans do. This surface similarity hides an opposite pattern within items. When an LRM gets a problem wrong, it spends more tokens than when it gets the same…

29
arXiv — NLP / Computation & Language research 4d ago

Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation

arXiv:2606.26686v1 Announce Type: cross Abstract: In order to screen a prompt or a response, the recent guardrail methods generate a chain-of-thought (CoT) before they issue a verdict. This design follows a common belief that step-by-step reasoning improves a decision. However,…

17
arXiv — NLP / Computation & Language research 4d ago

Vis-CoT: A Human-in-the-Loop Framework for Interactive Visualization and Intervention in LLM Chain-of-Thought Reasoning

arXiv:2509.01412v3 Announce Type: replace Abstract: Large language models (LLMs) show strong reasoning via chain-of-thought (CoT) prompting, but the process is opaque, which makes verification, debugging, and control difficult in high-stakes settings. We present Vis-CoT, a…

37
Hugging Face Daily Papers research 4d ago

OpenBioRQ: Unsolved Biomedical Research Questions for Agents

Abstract: A new biomedical benchmark evaluates agentic models' ability to verify sources and avoid false citations by testing unsolved research questions with no answer keys, revealing significant failures in retrieval-grounded reasoning and tool usage. Generated by…

9
Hugging Face Daily Papers research 4d ago

Confidence-Aware Tool Orchestration for Robust Video Understanding

Abstract: Robust-TO addresses the Blind Trust Problem in video reasoning by integrating per-frame trustworthiness into an agentic framework that improves accuracy under realistic perturbations through calibrated evidence weighting and reliability-aware reasoning. Generated by…

17
r/LocalLLaMA community 4d ago

Qwen 3.6 27b GLM 5.2 fine-tune?

Hi everyone, since both models are open weights and GLM seems to have found the secret to frontier-model reasoning, why don't we see any Qwen GLM fine-tune yet? Is it because GLM 5.2 is recent and fine-tunes and datasets take time, or is the community just not interested in the fine-tune?…

28
Hugging Face Daily Papers research 4d ago

Do Thinking Tokens Help with Safety?

Abstract: Research reveals that reasoning models' safety outcomes are predictable from early hidden representations, with deliberation appearing but not substantially influencing final responses, and current safety interventions inadvertently suppress genuine deliberation…

25
Hugging Face Daily Papers research 4d ago

Forecasting Future Behavior as a Learning Task

Abstract: Behavior Forecasters are trained to predict large reasoning model outputs from single trajectories, outperforming large language models while requiring significantly less computational cost. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Trust in an AI system is often…

24
Hugging Face Daily Papers research 4d ago

ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation

Abstract: ReNIO enhances on-policy distillation for language models by reweighting negative trajectories based on token-level probability ratios, improving reasoning performance in mathematical and code generation tasks. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) On-policy…

25
Smol AI News news-outlet 4d ago

not much happened today

**Z.ai's GLM-5.2** leads in coding and agent benchmarks with top scores like **1595** on Code Arena: Frontend and **34.29%** reasoning accuracy with zero failures. Databricks improved GLM-5.2 speed to **392 tok/s** using hardware and optimizations. **Ornith-1.0**, a new…

13
Hugging Face Daily Papers research 4d ago

RL-Index: Reinforcement Learning for Retrieval Index Reasoning

Abstract: RL-Index introduces an agentic indexing framework that shifts reasoning from query time to indexing stage by using LLM-generated rationales and reinforcement learning to improve retrieval effectiveness and reduce latency. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

25
Hugging Face Daily Papers research 5d ago

V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

Abstract: A novel label-free framework for visual reasoning called V-Zero is presented, which uses contrastive evidence gating to improve fine-grained visual reasoning without requiring annotated answer labels, achieving faster training than traditional methods. Generated by…

12
arXiv — Machine Learning research 5d ago

Holographic Memory for Zero-Shot Compositional Reasoning in Knowledge Graphs: A Mechanistic Study of Where and Why It Fails

arXiv:2606.24948v1 Announce Type: new Abstract: Knowledge graph embedding (KGE) models predict single-hop links well but have no mechanism for zero-shot compositional queries: multi-hop questions whose relation chains never appeared during training. Holographic Reduced…

31
arXiv — Machine Learning research 5d ago

MacroLens: A Multi-Task Benchmark for Contextual Financial Reasoning under Macroeconomic Scenarios

arXiv:2606.24950v1 Announce Type: new Abstract: Financial decision-making is contextual: forecasting prices, valuing companies, and assessing event exposure weigh price history, accounting fundamentals, macroeconomic regime, and contemporaneous text. A benchmark over these four…

25
arXiv — Machine Learning research 5d ago

ExTra: Exploratory Trajectory Optimization for Language Model Reinforcement Learning

arXiv:2606.24994v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) for language-model reasoning can fail at both extremes of task difficulty: easy prompts often produce all-correct, low-diversity rollout groups with little gradient signal,…

25
arXiv — Machine Learning research 5d ago

Geo-Strat-RL: Learning Geological Event Reasoning from Verifiable Tasks

arXiv:2606.25000v1 Announce Type: new Abstract: To evaluate whether vision-language models can reason about geological histories, it is necessary to construct observations for which the underlying process history is known. Furthermore, reasoning over geological histories is not…

6
arXiv — Machine Learning research 5d ago

Multi-Stream Temporal Fusion for Financial Fraud Detection

arXiv:2606.25007v1 Announce Type: new Abstract: Financial fraud detection in digital banking requires reasoning over multiple heterogeneous event streams -- transactions, login sessions, risk signals -- that individually appear benign but collectively reveal fraudulent patterns.…

14
arXiv — NLP / Computation & Language research 5d ago

Do Thinking Tokens Help with Safety?

arXiv:2606.25013v1 Announce Type: cross Abstract: Today's reasoning models use thinking tokens to attain stronger performance on benchmarks than their instruction-tuned counterparts. It is also generally believed that this more "deliberative" mode should improve alignment and…

37
arXiv — NLP / Computation & Language research 5d ago

Hybrid-IR: Dual-Path Hybrid Retrieval with Iterative Reasoning for Complex Medical Question Answering

arXiv:2606.25338v1 Announce Type: new Abstract: Large language models (LLMs) have shown promising performance across a wide range of biomedical applications, including medical question answering (QA), yet they remain prone to hallucinations and outdated knowledge. Although…

9
arXiv — NLP / Computation & Language research 5d ago

Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing

arXiv:2606.25354v1 Announce Type: new Abstract: Test-time scaling improves language-model reasoning, but existing approaches often face a difficult trade-off: long chain-of-thought sampling remains single-threaded, while sentence- or solution-level search can be computationally…

22
arXiv — NLP / Computation & Language research 5d ago

Riazi-8B: An Urdu Large Language Model for Mathematical Reasoning

arXiv:2606.25568v1 Announce Type: new Abstract: Recent LLMs demonstrate strong mathematical reasoning capabilities, but existing gains rely heavily on English-centric training resources and benchmarks. As a result, reasoning performance degrades substantially in low-resource…

27
arXiv — NLP / Computation & Language research 5d ago

OPERA: Aligning Open-Ended Reasoning via Objective Perplexity-based Reinforcement Learning

arXiv:2606.25757v1 Announce Type: new Abstract: Reinforcement Learning (RL) has enabled LLMs to excel in objective reasoning tasks such as mathematics and code generation. However, applying RL to open-ended tasks, such as creative writing, remains challenging because…

22
arXiv — NLP / Computation & Language research 5d ago

RAVEN: Long-Horizon Reasoning & Navigation with a Visuo-Spatio-Temporal Memory

arXiv:2606.25206v1 Announce Type: cross Abstract: Long-term robot deployment requires a compact and scalable memory that preserves fine-grained visual semantics, grounds observations in space and time, and enables efficient storage and retrieval. In this paper, we propose RAVEN,…

21
arXiv — NLP / Computation & Language research 5d ago

Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning

arXiv:2606.25524v1 Announce Type: cross Abstract: Large language models (LLMs) reach high accuracy in mathematical reasoning, but individual traces on the same problem diverge; some arrive at the correct answer while others fail. Prior work analyzes failure at the step, chunk,…

32
arXiv — NLP / Computation & Language research 5d ago

How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations

arXiv:2606.26041v1 Announce Type: cross Abstract: Vision-language models (VLMs) have achieved strong performance on OCR-based benchmarks and are increasingly focused on text-rich understanding, but their robustness under controlled visual degradation remains insufficiently…

29
arXiv — NLP / Computation & Language research 5d ago

Scale or Reason? A Compute-Equivalent Analysis of Reasoning Distillation

arXiv:2509.22193v2 Announce Type: replace Abstract: Distilling reasoning traces from strong teacher models has become the standard recipe for building capable small language models. Yet reasoning traces are 5-20× longer than standard instruction fine-tuning (IFT) outputs,…

19
arXiv — NLP / Computation & Language research 5d ago

Reinforcement Learning Improves Traversal of Parametric Knowledge in LLMs

arXiv:2511.05933v2 Announce Type: replace Abstract: Reinforcement learning (RL) is often credited with improving language model reasoning at the expense of knowledge. We challenge this narrative by showing that reasoning models consistently outperform their instruction-tuned…

11
arXiv — NLP / Computation & Language research 5d ago

ConPress: Learning Efficient Reasoning from Multi-Question Contextual Pressure

arXiv:2602.01472v2 Announce Type: replace Abstract: Large reasoning models (LRMs) typically solve reasoning-intensive tasks by generating long chain-of-thought (CoT) traces, leading to substantial inference overhead. We identify a reproducible inference-time phenomenon, termed…

5
Hugging Face Daily Papers research 5d ago

Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do

Abstract: Multimodal Chain-of-Thought reasoning shows selective effectiveness across different tasks, with limitations in maintaining visual introspection during reasoning processes. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Chain-of-Thought (CoT) has become a standard method…

17
Hugging Face Daily Papers research 5d ago

IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

Abstract: Implicit Visual Chain-of-Thought decomposes visual conditioning into structural and semantic cascades for improved structure-aware image generation with sketch supervision. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Unified multi-modal large language models (MLLMs)…

7
Hugging Face Daily Papers research 5d ago

AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning

Abstract: Large language models face challenges in archive-grounded reasoning tasks involving evidence retrieval and synthesis across diverse document collections, with performance varying significantly across domains. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Large language…

26
arXiv — Machine Learning research 6d ago

Weight-Space Geometry of Offline Reasoning Training

arXiv:2606.23740v1 Announce Type: new Abstract: Offline reinforcement-learning losses (RFT, RIFT, DFT, Offline GRPO, DPO) are widely used to distill reasoning from large teachers into smaller students, and are typically compared on downstream accuracy alone. We ask whether they…

6
arXiv — Machine Learning research 6d ago

A Survey on Federated Causal Discovery and Inference

arXiv:2606.23741v1 Announce Type: new Abstract: Causal reasoning, which encompasses the discovery of causal structures and the inference of causal effects, is fundamental to data-driven decision making. In practice, data for reliable causal analysis are often distributed across…

7
arXiv — NLP / Computation & Language research 6d ago

Blockwise Policy-Drift Gating for On-Policy Distillation

arXiv:2606.24084v1 Announce Type: cross Abstract: On-policy distillation (OPD) trains a student policy using teacher signals computed on trajectories sampled by the student itself. Recent work shows that sampled-token OPD can be fragile on long-horizon reasoning tasks and that…

30
arXiv — Machine Learning research 6d ago

Reasoning as Attractor Dynamics: Latent Memory Retrieval via Gibbs-Weighted Energy Minimization

arXiv:2606.24543v1 Announce Type: new Abstract: Large Language Models (LLMs) are traditionally viewed as autoregressive generators. However, from the perspective of collective computation, they function as high-dimensional Dense Associative Memories that store complex reasoning…

24
arXiv — NLP / Computation & Language research 6d ago

CALIBER: Calibrating Confidence Before and After Reasoning in Language Models

arXiv:2606.24281v1 Announce Type: new Abstract: Reasoning language models are increasingly asked not only to answer difficult questions, but also to estimate their likelihood of success. Existing methods typically elicit confidence only once: either before thinking or after…

4
arXiv — NLP / Computation & Language research 6d ago

AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning

arXiv:2606.24526v1 Announce Type: new Abstract: Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of…

36
arXiv — NLP / Computation & Language research 6d ago

Qwen-AgentWorld: Language World Models for General Agents

arXiv:2606.24597v1 Announce Type: new Abstract: A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can…

8
arXiv — NLP / Computation & Language research 6d ago

Neuro-Symbolic Drive: Rule-Grounded Faithful Reasoning for Driving VLAs

arXiv:2606.23938v1 Announce Type: cross Abstract: Driving VLA models incorporating Chain-of-Thought (CoT) reasoning are attractive because they leverage pretrained VLM representations and expose intermediate decisions in natural language, yet current rationales often lack the…

4
arXiv — NLP / Computation & Language research 6d ago

Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War

arXiv:2606.24391v1 Announce Type: cross Abstract: We introduce Age of LLM, a turn-based 1v1 benchmark in which two LLMs face off on a 13x7 grid to destroy the enemy base. Three stressors are deliberate: fog of war, full diplomacy (messages, ceasefires, ultimatums; uranium kept…

29
