Tag

Reasoning

500 articles archived under #reasoning · RSS

arXiv — NLP / Computation & Language research 6d ago

A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial

arXiv:2606.24510v1 Announce Type: cross Abstract: Rare diseases affect millions of individuals worldwide, yet timely diagnosis remains a major public health challenge due to scarcity of specialized clinical expertise. While large language models (LLMs) show promise to support…

28
arXiv — NLP / Computation & Language research 6d ago

Few shot chain-of-thought driven reasoning to prompt LLMs for open ended medical question answering

arXiv:2403.04890v4 Announce Type: replace Abstract: In this paper, we propose a modified version of the MedQA-USMLE dataset, named MEDQA-OPEN, which contains open-ended medical questions without options to mimic clinical scenarios, along with clinician-approved reasoned answers.…

7
arXiv — NLP / Computation & Language research 6d ago

Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions

arXiv:2501.11790v5 Announce Type: replace Abstract: Recent studies have raised significant concerns regarding the reliability of current mathematics benchmarks, highlighting issues such as simplistic design and potential data contamination. Consequently, developing a reliable…

29
Hugging Face Daily Papers research 6d ago

Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning

Abstract Text-to-image models fail to generate counterfactual scenes because they rely on tightly coupled visual-textual patterns rather than causal reasoning, demonstrating limited understanding beyond pattern matching. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Text-to-image…

26
Hugging Face Daily Papers research 6d ago

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

Abstract A novel framework called VeriEvol is introduced that addresses the challenge of scaling reinforcement learning for visual mathematical reasoning by ensuring reliable reward labels through a two-axis approach that separates prompt difficulty from answer reliability,…

17
Hugging Face Daily Papers research 6d ago

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

Abstract Data-centric approach using curated datasets and minimal GRPO setup significantly improves long-context reasoning in large language models, outperforming prior reinforcement learning methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Long-context reasoning is an…

15
Hugging Face Daily Papers research 6d ago

A Verifiable Search Is Not a Learnable Chain-of-Thought

Abstract Training models on chain-of-thought demonstrations fails for tasks requiring backtracking search because the forward derivation cannot be faithfully imitated, demonstrating a fundamental limitation in learning search procedures through demonstration. Generated by…

11
Hugging Face Daily Papers research 6d ago

Deeper is Not Always Better: Mitigating the Alignment Tax via Confident Layer Decoding

Abstract Autoregressive generation in large language models traditionally uses the final layer for token prediction, but a new decoding strategy dynamically selects more reliable intermediate layers based on entropy-guided search, improving reasoning performance with minimal…

34
r/LocalLLaMA community 7d ago

Training a Qwen 3.5 4B/9B agent for multi-tool use: SFT first or go directly to RL?

I want to train Qwen 3.5 4B or 9B for a custom multi-tool agent workflow and would appreciate guidance from people who have done this successfully. A few questions: SFT → RL or RL-only? - Is it still recommended to first do supervised fine-tuning (tool-calling traces, reasoning…

15
Hugging Face Daily Papers research 7d ago

Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views

Abstract DR-MV3D presents a map-grounded learning framework with dense rewards to improve multi-view 3D visual question answering through global map construction, view-trajectory planning, and egocentric grounding. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multi-view 3D…

15
Hugging Face Daily Papers research 7d ago

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

Abstract Trajectory-Augmented Policy Optimization (TAPO) enhances large language model reasoning by creating explicit corrective trajectories that preserve erroneous reasoning while incorporating natural-language diagnoses and corrections, outperforming traditional…

31
Hugging Face Daily Papers research 7d ago

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

Abstract Reinforcement learning approaches for improving LLM reasoning capabilities are enhanced by a Bayesian Manifold Curriculum framework that structures problem sampling based on task manifold relationships and endogenous non-stationarity. Generated by…

20
Hacker News — AI on Front Page community 7d ago

VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO

Article URL: https://arxiv.org/abs/2606.16140 Comments URL: https://news.ycombinator.com/item?id=48639240 Points: 211 # Comments: 85

26
r/LocalLLaMA community 7d ago

NEX-N2-mini: "There is no Pareto frontier. I am Pareto". This Qwen3.5-MoE fine tune fixed 3.5 and 3.6 overthinking apparently on my tests.

I have been testing all popular MoE for my Mac and it seems I just found gold: 3.5/3.6 level of reasoning (if not slightly superior) at a fraction of the reasoning tokens used (wasted). Dynamic plot with other benchmarks here: https://benchmark-yourself.streamlit.app/…

4
Hugging Face Daily Papers research 7d ago

Multi-Turn Reflective Masking Elicits Reasoning in Mask Diffusion Models

Abstract Reflective Masking enables iterative local refinement in Mask Diffusion Models through lightweight post-training, supporting multi-turn reasoning without architectural changes. Generated by Qwen/Qwen2.5-Coder-32B-Instruct While reasoning on autoregressive (AR) models is…

26
r/LocalLLaMA community 8d ago

8-16 MI50s Minimax M3 @19 tps TG (peak)

TL;DR Speeds are not too ugly for this old 2018 hardware but, imo, not very usable for agentic coding (if you compare with qwen3.6 27B on 8 MI50 @ 50 tps TG, 800 tps PP). More concerning, the reasoning output is very, very long, and I still haven't checked the quality of…

27
r/LocalLLaMA community 9d ago

GLM 5.2: 98% of max level intelligence with less than half of tokens usage

According to this, the number of reasoning tokens more than doubled from GLM 5.1 to GLM 5.2, from 16.7k to 36.7k. For me, as a local user with an old junk Xeon setup, this makes GLM 5.2 unusable, to the point where I had to shut down the model after 12h of waiting for it to respond to my math…

37
r/LocalLLaMA community 10d ago

How do I set the right llama.cpp parameters?

--n-gpu-layers all --ctx-size 0 --reasoning-budget 0 --presence-penalty 1.1 --repeat-penalty 1.1 How do I figure out the optimal llama.cpp parameters for my setup? llama.cpp + Open WebUI in Docker with an AMD GPU (16GB VRAM) running gemma 4 12b and 26b models. Is it all about…

13
Hugging Face Daily Papers research 10d ago

Context-Aware RL for Agentic and Multimodal LLMs

Abstract ContextRL enhances long-horizon reasoning and multimodal performance through reinforcement learning that rewards context selection for supporting query-answer pairs, achieving improvements over standard methods on diverse benchmarks. Generated by…

21
r/LocalLLaMA community 10d ago

Watching a local AI voice assistant get dumber (A 9B to 0.8B agent experiment on my RTX 5060 Ti)

I wanted to find the exact floor for running an intelligent, local voice assistant agent on consumer hardware. I kept the environment, tools, and prompts identical and stepped the model sizes down through Qwen 3.5 9B, 4B, 2B, and 0.8B to see how agentic reasoning degrades. The…

12
r/LocalLLaMA community 10d ago

Has anyone here used VibeThinker-3B outside benchmarks?

Just curious, given the hype and benchmark numbers, about real-world behavior: debugging, coding assistance, reasoning over messy prompts, local latency, failure modes, and whether it actually feels useful versus just optimized for verifiable evals.…

23
arXiv — NLP / Computation & Language research 11d ago

Thermodynamic Signatures of Reasoning: Free-Energy and Spectral-Form-Factor Diagnostics for Hallucination Detection in Large Language Models

arXiv:2606.19404v1 Announce Type: cross Abstract: Hallucination detection in large language models (LLMs) is deployment-critical, and recent work shows that the spectrum of attention-derived graph Laplacians carries strong signal about reasoning quality. Prior spectral…

15
arXiv — Machine Learning research 11d ago

Concept Flow Models: Anchoring Concept-Based Reasoning with Hierarchical Bottlenecks

arXiv:2606.19489v1 Announce Type: new Abstract: Concept Bottleneck Models (CBMs) enhance interpretability by projecting learned features into a human-understandable concept space. Recent approaches leverage vision-language models to generate concept embeddings, reducing the need…

8
arXiv — Machine Learning research 11d ago

Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation

arXiv:2606.19636v1 Announce Type: new Abstract: Math and science reasoning benchmarks rely on pass@k, the fraction of sampled chains that reach gold, as the canonical per-example difficulty signal. The same signal drives RL with verifiable rewards, math data curation, synthetic…

20
arXiv — NLP / Computation & Language research 11d ago

Efficiently Representing Algorithms With Chain-of-Thought Transformers

arXiv:2606.19697v1 Announce Type: cross Abstract: The increasing popularity of reasoning models (language models that output a series of reasoning or thought tokens before producing an answer) is justified, in part, by theoretical results showing that chain-of-thought…

9
arXiv — NLP / Computation & Language research 11d ago

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

arXiv:2606.19750v1 Announce Type: cross Abstract: Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing…

15
arXiv — Machine Learning research 11d ago

ADaPT: Token-Level Decoupling for Efficient Large Reasoning Models

arXiv:2606.19919v1 Announce Type: new Abstract: Large reasoning models rely on long chain-of-thought to achieve strong performance, but applying such reasoning uniformly incurs high computational cost. Existing efficiency-oriented methods attempt to shorten or mix reasoning…

11
arXiv — Machine Learning research 11d ago

VIMPO: Value-Implicit Policy Optimization for LLMs

arXiv:2606.20008v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards has become a central tool for improving the reasoning ability of large language models, but current methods face a trade-off between simplicity and credit assignment. Group-relative…

6
arXiv — NLP / Computation & Language research 11d ago

What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

arXiv:2606.20075v1 Announce Type: cross Abstract: Latent Chain-of-Thought (CoT) internalizes reasoning within continuous hidden states, offering a promising alternative to verbose discrete reasoning traces. However, robust latent reasoning remains difficult because outcome…

36
arXiv — NLP / Computation & Language research 11d ago

Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models

arXiv:2606.19350v1 Announce Type: new Abstract: Large language models (LLMs) excel at multi-step reasoning but incur substantial inference cost. We introduce Causal Attribution Pruning (CAP), a training-free method that identifies critical attention heads by measuring their…

34
arXiv — NLP / Computation & Language research 11d ago

Detecting Hallucinations for Large Language Model-based Knowledge Graph Reasoning

arXiv:2606.19351v1 Announce Type: new Abstract: Knowledge graph (KG) reasoning infers new knowledge from existing facts and is widely applied in question answering, recommendation, and decision support. With the rapid development of large language models (LLMs), LLM-based KG…

25
arXiv — NLP / Computation & Language research 11d ago

Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling

arXiv:2606.19354v1 Announce Type: new Abstract: Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. A central component of TTS is the…

5
arXiv — NLP / Computation & Language research 11d ago

Where Does Social Reasoning Come From? Capability Provenance in Language Models

arXiv:2606.19625v1 Announce Type: new Abstract: We use training-data attribution as an interpretable tool for capability discovery, mapping which regions of the pretraining corpus support social-reasoning versus STEM-reasoning in OLMo3-7B. Training-data attribution measures how…

9
arXiv — NLP / Computation & Language research 11d ago

Clusters are All You Need: Pre-Training the Tsetlin Machine with Semantic Clusters from Language Models for Interpretability

arXiv:2606.19815v1 Announce Type: new Abstract: Pre-trained language models such as BERT achieve strong text classification performance but lack transparency, limiting their use in high-stakes settings. The Tsetlin Machine (TM) offers fully interpretable, clause-based reasoning…

25
arXiv — NLP / Computation & Language research 11d ago

AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts

arXiv:2606.19847v1 Announce Type: new Abstract: Large language models (LLMs) demonstrate strong reasoning and generation abilities, but their fixed context windows limit long-term information accumulation and reuse across multi-session interactions. Existing memory-augmented…

32
arXiv — NLP / Computation & Language research 11d ago

GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs

arXiv:2606.19946v1 Announce Type: new Abstract: Activation steering controls model behavior by modifying intermediate hidden states at inference time without retraining. Existing methods handle only single-direction injection; when multiple semantic directions are superposed…

16
arXiv — NLP / Computation & Language research 11d ago

MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization

arXiv:2606.20164v1 Announce Type: new Abstract: Real-world clinical decision support requires reasoning over heterogeneous and longitudinal patient information rather than answering isolated medical questions. However, current medical large language models and…

29
arXiv — NLP / Computation & Language research 11d ago

Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

arXiv:2606.19808v1 Announce Type: cross Abstract: Test-time reasoning is increasingly used as a serving-time control knob, but extra reasoning is not uniformly valuable: it can repair failed attempts, waste compute on already-correct answers, or introduce harmful answer changes.…

25
arXiv — NLP / Computation & Language research 11d ago

Med-R2: Perception and Reflection-driven Complex Reasoning for Medical Report Generation

arXiv:2504.02885v2 Announce Type: replace Abstract: Automated medical report generation (MRG) is increasingly used to reduce the burden of manual reporting and for decision support. Large vision-language models (LVLMs) hold great promise for automated MRG due to their…

29
Hugging Face Daily Papers research 11d ago

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Abstract S-Agent is a spatial reasoning framework that enhances visual language models with temporal memory and hierarchical spatial tools to enable continuous 3D world understanding from multi-view imagery. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Real-world spatial…

28
Hugging Face Daily Papers research 11d ago

Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

Abstract A lightweight image inpainting framework achieves high-fidelity results with significantly reduced parameters and inference time through novel local-global interaction blocks and adaptive distillation strategies. Generated by Qwen/Qwen2.5-Coder-32B-Instruct While…

35
Hugging Face Daily Papers research 11d ago

Thinking with Visual Grounding

Abstract Visually grounded thinking integrates natural-language reasoning with explicit visual evidence grounding in vision-language models, improving reasoning accuracy through scalable synthesis and reinforcement learning techniques. Generated by…

34
Hugging Face Daily Papers research 11d ago

REVES: REvision and VErification-Augmented Training for Test-Time Scaling

Abstract A two-stage iterative framework alternates between data augmentation and policy optimization to improve LLM reasoning by leveraging intermediate correction steps, achieving superior performance on coding benchmarks and constraint satisfaction problems. Generated by…

23
TechCrunch — AI news-outlet 11d ago

General Intuition in talks to raise $300M at around $2B valuation

General Intuition is in talks to raise around $300 million at a roughly $2 billion valuation from backers including Jeff Bezos. The startup trains AI agents on spatial-temporal reasoning.

14
Hugging Face Daily Papers research 11d ago

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

Abstract A framework automates environment redesign in reinforcement learning for large language models by having the policy analyze failures and suggest configuration changes, achieving superior performance over larger proprietary models and fixed-environment baselines.…

6
OpenAI official-blog 11d ago

Improving health intelligence in ChatGPT

Learn how GPT-5.5 Instant improves ChatGPT’s health and wellness responses with stronger reasoning, better context, clearer communication, and physician-informed evaluations.

7
OpenAI official-blog 11d ago

Using AI to help physicians diagnose rare genetic diseases affecting children

Researchers used an OpenAI reasoning model to help diagnose rare diseases, identifying 18 new diagnoses in previously unsolved cases.

17
Hugging Face Daily Papers research 11d ago

SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

Abstract SciOrch is a framework that uses a lightweight orchestrator model to coordinate multiple frontier LLMs for scientific reasoning, achieving superior performance through MCTS-based training and GRPO-style optimization while reducing API costs. Generated by…

31
Hugging Face Daily Papers research 12d ago

Native Active Perception as Reasoning for Omni-Modal Understanding

Abstract OmniAgent is a novel omni-modal agent that addresses long video understanding by using an iterative observation-thought-action cycle with active perception, achieving superior performance compared to larger models through efficient selective processing. Generated by…

24
Hugging Face Daily Papers research 12d ago

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

Abstract A unified framework for spatial vision-language models that combines linguistic deduction and 3D geometric reasoning through reinforcement learning, enabling robust spatial reasoning across diverse tasks and domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Spatial…

9
