Tag

Reasoning

500 articles archived under #reasoning

Hugging Face Daily Papers research 13d ago

Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

Abstract Prompt-Level Distillation extracts reasoning patterns from teacher models to enhance student model performance while maintaining interpretability and reducing latency. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Advanced reasoning typically requires Chain-of-Thought…

18
Smol AI News news-outlet 14d ago

GLM 5.2: the top Frontend Coding model in the world, IndexShare reduces costs

**Z.ai released GLM-5.2**, an MIT-licensed open-weight frontier model targeting **coding and long-horizon agentic tasks** with a **1M-token context window** and **two reasoning-effort modes**. It features a **744B-parameter mixture-of-experts architecture** with **40B active…

14
Hugging Face Daily Papers research 14d ago

Implicit Reasoning for Large Language Model-based Generative Recommendation

Abstract Large Language Models for generative recommendation face challenges with semantic IDs disrupting natural-language reasoning, prompting a lightweight implicit reasoning approach that outperforms explicit methods while reducing computational costs. Generated by…

16
arXiv — Machine Learning research 14d ago

Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation

arXiv:2606.15127v1 Announce Type: new Abstract: Reasoning models are increasingly used in settings where the final answer is not the only object of review: educational tools may show students intermediate steps, decision-support systems may require human oversight, and audit…

11
arXiv — Machine Learning research 14d ago

Semantic Reasoning in Medicine: The Role of Knowledge Graphs Across Five Key Domains

arXiv:2606.15155v1 Announce Type: new Abstract: Knowledge graphs (KGs) have emerged as a promising solution for integrating and reasoning over complex biomedical and clinical data in healthcare. By representing structured relationships among entities such as diseases, drugs,…

17
arXiv — Machine Learning research 14d ago

Understanding Diversity Collapse in RLVR via the Lens of Overtraining

arXiv:2606.15455v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a key approach for enhancing the reasoning abilities of large language models. However, RLVR often suffers from diversity collapse: Pass@1 improves while…

6
arXiv — Machine Learning research 14d ago

Localizing Credit at the Divergence: Path-Conditioned Self-Distillation for LLM Reasoning

arXiv:2606.15576v1 Announce Type: new Abstract: Reinforcement learning from verifiable rewards assigns a single scalar to each rollout, leaving token-level credit assignment underspecified in long reasoning traces. On-policy self-distillation addresses this by letting the same…

6
arXiv — Machine Learning research 14d ago

Is Code Better Than Language for Algorithmic Reasoning

arXiv:2606.15589v1 Announce Type: new Abstract: For tool-augmented language models, comparing natural-language reasoning with code-execution pipelines is difficult because the comparison changes both the intermediate representation and the execution mechanism. We separate these…

29
arXiv — Machine Learning research 14d ago

Formalizing and Mitigating Structural Distortion in LLM Attention for Zero-Shot Graph Reasoning

arXiv:2606.15633v1 Announce Type: new Abstract: Large Language Models (LLMs) have shown promise for reasoning over Text-Attributed Graphs (TAGs). However, applying LLMs to graphs requires linearizing their structure into sequences, introducing distortion rooted in the graph…

9
arXiv — Machine Learning research 14d ago

ReQAT: Achieving Full-Precision Reasoning Accuracy with 4-bit Floating-Point Quantization-Aware Training

arXiv:2606.15682v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) achieve strong problem-solving through long chain-of-thought, but their deployment is constrained by the high cost of full-precision inference and growing KV cache footprints. Microscaled FP4 formats…

35
arXiv — NLP / Computation & Language research 14d ago

CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning

arXiv:2606.14961v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence-rationale…

21
arXiv — NLP / Computation & Language research 14d ago

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

arXiv:2606.15007v1 Announce Type: new Abstract: We introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybrid Mamba-Attention language model. We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context…

19
arXiv — NLP / Computation & Language research 14d ago

Stop When Further Reasoning Won't Help: Attention-State Adaptive Generation in Reasoning Models

arXiv:2606.15070v1 Announce Type: new Abstract: By incorporating test-time compute scaling, large reasoning models (LRMs) can solve complex problems through explicit chain-of-thought (CoT) reasoning processes. However, they often suffer from overthinking, resulting in redundant…

23
arXiv — NLP / Computation & Language research 14d ago

Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

arXiv:2606.15079v1 Announce Type: new Abstract: Efficient and scalable agentic intelligence requires models that can deliver both low-latency responses and strong reasoning capabilities while remaining practical to train, serve, and deploy. In this report, we present Ling-2.6…

16
arXiv — NLP / Computation & Language research 14d ago

AdaMame: A Training Recipe for Adaptive Multilingual Reasoning

arXiv:2606.15080v1 Announce Type: new Abstract: While Large Reasoning Models (LRMs) show strong performance in English, they often fail to reason in the language of the query, a phenomenon known as language collapse. Existing RL-based fixes typically add a binary language…

31
arXiv — NLP / Computation & Language research 14d ago

Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

arXiv:2606.15307v1 Announce Type: new Abstract: Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced…

21
arXiv — NLP / Computation & Language research 14d ago

Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering

arXiv:2606.15419v1 Announce Type: new Abstract: Objective: To enhance the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA). Method: We designed a multi-agent peer-reviewed reasoning method in which multiple LLM…

22
arXiv — NLP / Computation & Language research 14d ago

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

arXiv:2606.15733v1 Announce Type: new Abstract: Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are…

21
arXiv — NLP / Computation & Language research 14d ago

ttda704 at SemEval-2026 Task 6: Structured Chain-of-Thought Prompting for Political Evasion Detection

arXiv:2606.15770v1 Announce Type: new Abstract: This paper describes our system for SemEval-2026 Task 6, which addresses the classification of political evasion strategies in English question-answer pairs extracted from U.S. presidential interviews. We systematically compare two…

31
arXiv — NLP / Computation & Language research 14d ago

When Correct Edges Cannot Be Verified: A Provenance Gap in Incomplete KGQA and a Provenance-Favoring Completion Policy

arXiv:2606.15833v1 Announce Type: new Abstract: Incomplete Knowledge Graph Question Answering (IKGQA) requires completing missing edges to continue reasoning. A growing line of work verifies completed edges against retrieved text, treating textual support as a proxy for edge…

10
arXiv — NLP / Computation & Language research 14d ago

SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

arXiv:2606.15872v1 Announce Type: new Abstract: Frontier scientific reasoning remains a major challenge for large language models (LLMs), where even the strongest commercial systems fall short of expert-level performance. A closer look at model behavior reveals substantial…

27
arXiv — NLP / Computation & Language research 14d ago

Free Energy Heuristics: Fast-And-Frugal Cognition as Active Inference Under Uncertain Precision

arXiv:2606.15877v1 Announce Type: new Abstract: Chain-of-thought (CoT) improves large language models' performance in math and symbolic reasoning. But on planning, contested ethics, and tasks where the model cannot check itself, more reasoning makes things worse. Both effects…

8
arXiv — NLP / Computation & Language research 14d ago

Neuron Level Analysis of Large Language Model in Legal Domain Reasoning

arXiv:2606.15884v1 Announce Type: new Abstract: We presented a neuron-level analysis of legal-domain reasoning in LLMs, comparing it with other applied domain tasks across seven open-weight models. Using neuron attribution scores to rank and suppress influential neurons, we…

4
arXiv — NLP / Computation & Language research 14d ago

Formalize Once, Edit the Rest: Efficient Lean-Based Answer Selection for Math Reasoning

arXiv:2606.15972v1 Announce Type: new Abstract: With large language models (LLMs) increasingly applied to mathematical reasoning, formal proof assistants such as Lean can be leveraged to verify reasoning outputs with machine-checkable rigor, enabling use cases such as answer…

30
arXiv — NLP / Computation & Language research 14d ago

A Large-Scale Multi-Dimensional Empirical Study of LLMs for Conversation Summarization

arXiv:2606.15974v1 Announce Type: new Abstract: Despite the significant advancement of LLMs in conversation summarization, their evaluation remains limited by insufficient scenarios, input lengths, and sample sizes. Furthermore, existing benchmarks often omit frontier reasoning…

30
arXiv — NLP / Computation & Language research 14d ago

From Argument Components to Graphs: A Multi-Agent Debate with Confidence Gating for Argument Relations

arXiv:2606.16047v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly assessed and utilized in the field of Argument Mining (AM), thanks to their strong general reasoning capabilities. However, standard training-free models often miss sophisticated…

7
arXiv — NLP / Computation & Language research 14d ago

Towards Pareto-Optimal Tool-Integrated Agents with Pareto Ranking Policy Optimization

arXiv:2606.16111v1 Announce Type: new Abstract: Recent advances in tool-integrated language agents have significantly improved their ability to solve complex reasoning tasks. However, existing alignment methods predominantly focus on maximizing task accuracy, while overlooking…

10
arXiv — NLP / Computation & Language research 14d ago

GRACE: Step-Level Benchmark for Faithful Reasoning over Context

arXiv:2606.16151v1 Announce Type: new Abstract: Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can…

15
arXiv — NLP / Computation & Language research 14d ago

Weaving Multi-Source Evidence for Biomedical Reasoning: The BioMedHop Benchmark and BioWeave Framework

arXiv:2606.16211v1 Announce Type: new Abstract: Biomedical question answering (QA) increasingly requires reasoning over interacting entities, where supporting evidence is scattered across biomedical knowledge graphs, literature documents, and web-accessible resources. However,…

36
arXiv — NLP / Computation & Language research 14d ago

Creative Collision: Directorial Persona Steering and Competition in Large Language Models

arXiv:2606.16240v1 Announce Type: new Abstract: Activation steering has emerged as a powerful tool for shaping the behaviour of large language models at inference time, yet most prior work injects a single semantic direction into the residual stream. We study the richer…

38
arXiv — NLP / Computation & Language research 14d ago

Tyler: Typed Latent Reasoning for Language Models -- When to Think, What to Compute, and How Much to Allocate

arXiv:2606.16360v1 Announce Type: new Abstract: Chain-of-thought (CoT) prompting improves reasoning in large language models (LLMs) by externalizing intermediate computation as discrete text tokens, but this textual interface also introduces redundancy and inference overhead.…

16
arXiv — NLP / Computation & Language research 14d ago

A Mechanistic Understanding of Pronoun Fidelity in LLMs

arXiv:2606.16407v1 Announce Type: new Abstract: Faithful and robust pronoun use is important for fair and coherent generations, yet large language models largely fail when multiple referents use different pronouns. To study the interplay of reasoning, repetition, and bias in…

35
Hugging Face Daily Papers research 14d ago

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Abstract Nemotron 3 Ultra is a large-scale language model featuring hybrid Mamba-Attention architecture with 550 billion parameters, achieving high inference throughput and extended context length through specialized training techniques. Generated by…

5
Hugging Face Daily Papers research 14d ago

VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Abstract VibeThinker-3B demonstrates that compact models can achieve state-of-the-art performance on verifiable reasoning tasks through specialized training techniques, challenging conventional scaling assumptions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct This technical…

16
arXiv — NLP / Computation & Language research 15d ago

SuperThoughts: Reasoning Tokens in Superposition

arXiv:2606.13862v1 Announce Type: cross Abstract: Long Chain-of-Thought (CoT) reasoning improves LLM problem-solving but is computationally expensive due to sequential token generation. While recent works explore reasoning in continuous latent spaces to bypass discrete token…

12
arXiv — Machine Learning research 15d ago

EM-NeSy: Expectation Maximization for Neurosymbolic Learning

arXiv:2606.14463v1 Announce Type: new Abstract: Neurosymbolic (NeSy) models integrate neural networks and symbolic reasoning for robust and interpretable AI. State-of-the-art NeSy models require that the symbolic component is expressed in a differentiable way, often complicating…

38
arXiv — Machine Learning research 15d ago

When to Write and When to Suppress: Route-Specialized Dual Adapters for Memory-Assisted Knowledge Editing

arXiv:2606.14668v1 Announce Type: new Abstract: Knowledge editing systems must update selected facts while preserving nearby but irrelevant behavior. This paper studies this problem in a memory-assisted setting where an edit memory is retrieved at inference time and a…

36
arXiv — NLP / Computation & Language research 15d ago

Which Models Perform Better in Inheritance Reasoning?

arXiv:2606.13751v1 Announce Type: new Abstract: This paper presents the participation of team PSL in the QIAS 2026 Shared Task on Arabic Islamic inheritance reasoning. The task evaluates the ability of large language models to solve inheritance cases that require legal…

6
arXiv — NLP / Computation & Language research 15d ago

QIAS 2026: Overview of the Shared Task on Islamic Inheritance Reasoning

arXiv:2606.13756v1 Announce Type: new Abstract: This paper presents a comprehensive overview of the QIAS 2026 shared task, organized as part of the OSACT7 Workshop and co-located with LREC 2026. The shared task was designed to evaluate the ability of large language models to…

35
arXiv — NLP / Computation & Language research 15d ago

Implicit Reasoning for Large Language Model-based Generative Recommendation

arXiv:2606.14142v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly adopted as backbones for Generative Recommendation (GR), promising access to pretrained world knowledge. Yet reliably invoking this knowledge for GR remains poorly understood. A key…

10
arXiv — NLP / Computation & Language research 15d ago

AgentSpec: Understanding Embodied Agent Scaffolds Through Controlled Composition

arXiv:2606.14674v1 Announce Type: new Abstract: LLM agents are increasingly built not as single model calls, but as scaffolded systems that combine reasoning, memory, reflection, action execution, and learning. While such scaffolds often improve performance, they are often…

13
arXiv — NLP / Computation & Language research 15d ago

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

arXiv:2606.14691v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving…

34
arXiv — NLP / Computation & Language research 15d ago

AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization

arXiv:2606.14694v1 Announce Type: new Abstract: Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and…

4
arXiv — NLP / Computation & Language research 15d ago

Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs

arXiv:2606.13815v1 Announce Type: cross Abstract: Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the…

37
arXiv — NLP / Computation & Language research 15d ago

GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

arXiv:2606.14470v1 Announce Type: cross Abstract: Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software…

37
arXiv — NLP / Computation & Language research 15d ago

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

arXiv:2606.14697v1 Announce Type: cross Abstract: Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where…

4
arXiv — NLP / Computation & Language research 15d ago

MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models

arXiv:2502.10886v3 Announce Type: replace Abstract: Entity state tracking is a necessary component of world modeling that requires maintaining coherent representations of entities over time. Previous work has benchmarked entity tracking performance in purely text-based tasks. We…

23
arXiv — NLP / Computation & Language research 15d ago

Reward-SQL: Boosting Text-to-SQL via Stepwise Execution-Aware Reasoning and Process-Supervised Rewards

arXiv:2505.04671v3 Announce Type: replace Abstract: Recent advances in large language models (LLMs) trained with reinforcement learning (RL) have improved Text-to-SQL performance. However, RL-based approaches still struggle with complex queries due to two key limitations:…

18
arXiv — NLP / Computation & Language research 15d ago

Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Metapragmatic Links

arXiv:2509.24102v5 Announce Type: replace Abstract: While moral reasoning has emerged as a promising research direction for large language models (LLMs), achieving robust generalization remains a critical challenge. This challenge arises from the gap between what is said and…

27
arXiv — NLP / Computation & Language research 15d ago

C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

arXiv:2603.05167v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, yet it remains unclear whether they can reliably assess process faithfulness rather than merely answer plausibility. We introduce…

20
