Tag

Inference

340 articles archived under #inference · RSS

arXiv — Machine Learning research 1mo ago

Architecture-driven Shift: towards a lightweight selector for capturing the trends of logit shift

arXiv:2605.27469v1 Announce Type: new Abstract: Continual Learning (CL) is a practical paradigm to utilize the power of deep pre-trained neural networks, but which pre-trained model has a better ability to balance "Plasticity-Stability", deserving to be chosen? The logit shift…

35
arXiv — Machine Learning research 1mo ago

A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving

arXiv:2605.27763v1 Announce Type: new Abstract: Safety evaluations of language models often treat serving configuration as fixed background infrastructure, but batch condition is an untested treatment variable whenever the same prompt may be evaluated alone, in a synchronized…

17
arXiv — Machine Learning research 1mo ago

SPAR: Support-Preserving Action Rectification

arXiv:2605.27877v1 Announce Type: new Abstract: Offline policy improvement faces an inherent conflict between maximizing value and fitting the data distribution. While in-sample weighted regression is stable, it suffers from over-conservatism that suppresses high-value actions…

5
arXiv — Machine Learning research 1mo ago

RW-TTT: Batched Serving for Request-Owned Test-Time Training State

arXiv:2605.28053v1 Announce Type: new Abstract: Test-time training (TTT) adapts an LLM during generation by reading and updating request-owned state, such as fast weights, low-rank deltas, or streaming learner state. This breaks batched LLM serving, which assumes shared static…

8
arXiv — Machine Learning research 1mo ago

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

arXiv:2605.28302v1 Announce Type: new Abstract: Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight TTFT and TPOT service-level objectives: from chunked-prefill aggregation, to prefill-decode (P/D)…

9
arXiv — NLP / Computation & Language research 1mo ago

StoryLens: Preference-Aligned Story Rewriting via Context-Aware Narrative Enrichment

arXiv:2605.28073v1 Announce Type: new Abstract: Story rewriting aims to adapt existing narratives to diverse reader preferences while preserving plot consistency and narrative coherence. Unlike conventional work on style transfer, we argue that effective story rewriting demands…

15
arXiv — NLP / Computation & Language research 1mo ago

Why We Need Speech to Evaluate Speech Translation

arXiv:2605.28227v1 Announce Type: new Abstract: Speech translation models are increasingly capable of preserving speech-specific information (e.g., speaker gender, prosody, and emphasis), yet evaluation metrics remain blind to such phenomena. We meta-evaluate both text- and…

35
r/LocalLLaMA community 1mo ago

Vulnerability found in framework used by vLLM, many MCP servers, and other LLM tools

Worth taking a look to see if this affects any of you. Surprised nobody has posted it yet.

4
The Information — AI news-outlet 1mo ago

Crypto-Friendly United Texas Bank Switches Regulator to OCC

United Texas Bank, a crypto-friendly bank, said it successfully switched its regulator through a charter conversion, despite being under a consent order. The move will help it grow its business serving digital asset firms and foreign banks. The Dallas-based bank, which has about…

26
r/LocalLLaMA community 1mo ago

Is there any use case for large models with very slow token output for batch processing?

Maybe I'm influenced by the sci-fi story "The Last Question" by Isaac Asimov but I've always got a tickle imagining a huge model like Kimi running on, say, disk. Even if it is 0.001 tok/sec to ask complex questions and get an answer in a week. Is there any use or community…

17
Hugging Face Daily Papers research 1mo ago

ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models

Abstract ZeroUnlearn addresses privacy concerns in large language models by reformulating machine unlearning as precise knowledge re-mapping through model editing, enabling efficient and targeted removal of sensitive information while preserving general model utility.…

38
Hugging Face Daily Papers research 1mo ago

Rethinking VLM Representation for VLA Initialization

Abstract Effective vision-language-action model initialization requires balancing pretrained vision-language model representations with embodied task-specific adaptations and robot-data pretraining while preserving core action-relevant features. AI-generated summary…

22
Hugging Face Daily Papers research 1mo ago

SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent

Abstract Long-horizon agentic reasoning is enhanced through a state-adaptive memory framework that dynamically manages interaction histories by creating compact memory cues while preserving detailed trajectories for targeted retrieval. AI-generated summary Long-horizon agentic…

20
arXiv — Machine Learning research 1mo ago

Provably Communication-Efficient and Privacy-Preserving Federated Graph Neural Networks

arXiv:2605.26243v1 Announce Type: new Abstract: Graph neural networks (GNNs) achieve strong performance on relational data, but real-world graphs are often distributed across organizations that cannot share raw data due to privacy and policy constraints. Existing federated GNN…

30
Hugging Face Daily Papers research 1mo ago

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Abstract Parallel Box Decoding enables efficient and accurate unified visual grounding and detection by decoding geometric elements as atomic units, improving both throughput and localization quality. AI-generated summary Vision-language models (VLMs) commonly formulate visual…

8
r/LocalLLaMA community 1mo ago

Fast little local memory retriever for Hermes

As title says. Looking for suggestions of a good memory retriever (for use with hindsight/hermes) ideally that can run on a strix halo NPU. GPT OSS 20B would be good based on their outdated rankings but it’s slow on the NPU for this type of task — needs very high throughput to…

4
r/LocalLLaMA community 1mo ago

Looking for Suggestions — Single 5090 & 64gb DDR5

Hi Reddit, I am planning on running Qwen 3.6 27b NVFP4 via vLLM on my 5090 but was wondering if something like 35b a3b at Q8 on Llama would produce better results for agentic coding and utilize the system memory. My research says no but if that’s the case what would yall do to…

10
r/LocalLLaMA community 1mo ago

Harbor v0.4.19 - vllm/sglang/llama.cpp launch codex/claude/pi/opencode

I'm usually not posting about Harbor releases out of the respect for the community here, but I think v0.4.19 might save a lot of people some time. Harbor can now launch your local agentic coding tools with local inference backends. For example, to run pi + vllm: # model…

26
Smol AI News news-outlet 1mo ago

not much happened today

**Inference optimization** is increasingly architectural, with **EAGLE 3.1** improving speculative decoding and long-context handling, collaborating with **vLLM** and **TorchSpec**. **Perplexity** open-sourced a rebuilt **Unigram tokenizer** cutting CPU use by **5–6×** and…

15
arXiv — Machine Learning research 1mo ago

PrivFusion: A Privacy-preserving Multi-Agent Framework for Harmonizing Distributed Datasets

arXiv:2605.24249v1 Announce Type: new Abstract: The growing availability of clinical data has increased the use of machine learning, yet centralized data aggregation is often infeasible for sensitive health information. Federated Learning (FL) offers a distributed alternative,…

19
arXiv — Machine Learning research 1mo ago

Hardware-Aware Federated Learning for Speech Emotion Recognition

arXiv:2605.24712v1 Announce Type: new Abstract: Federated learning (FL) enables privacy-preserving collaborative training across distributed edge devices, but real deployments involve heterogeneous clients with different processing power, memory capacity, and communication…

16
arXiv — NLP / Computation & Language research 1mo ago

DTO: a Differentiable Training Objective for Effective Counterfactual Story Rewriting

arXiv:2605.24885v1 Announce Type: new Abstract: Counterfactual story rewriting is a natural language processing task that requires updating an existing story to reflect a chosen alternative event, yet preserving all the unaffected storyline elements and overall coherence. While…

30
r/LocalLLaMA community 1mo ago

Qwen 3.6 benchmarks on 2x RTX PRO 6000

Got a chance to play around with a 2x RTX PRO 6000 setup, so sharing some numbers for Qwen 3.6. All these were run using the latest stable vLLM backend. This was for a personal project. Qwen 3.6 27B BF16 (Original without any quantization) ------ MTP - Off | 64 concurrency | 1600 tps…

8
arXiv — Machine Learning research 1mo ago

FIRMA: FIbonacci Ring Model Aggregation for Privacy-preserving Federated Learning

arXiv:2605.22898v1 Announce Type: new Abstract: Federated learning protocols face a structural trilemma: canonical server-based aggregation (McMahan et al., 2017) creates a single point of failure and gradient inversion risk; decentralised ring-gossip…

23
arXiv — Machine Learning research 1mo ago

Building a privacy-preserving Federated Recommender system for mobile devices

arXiv:2605.22924v1 Announce Type: new Abstract: Serving personalized content on mobile devices has traditionally required pooling sensitive user data on centralized servers, a practice increasingly at odds with modern privacy expectations and geographical regulations. We present…

25
arXiv — Machine Learning research 1mo ago

ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU

arXiv:2605.23057v1 Announce Type: new Abstract: ModeSwitch-LLM is a lightweight request-boundary controller for improving single-GPU large language model inference efficiency by routing each request to an appropriate fixed inference mode. Instead of relying on one static serving…

15
arXiv — NLP / Computation & Language research 1mo ago

DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling

arXiv:2605.23605v1 Announce Type: cross Abstract: Diffusion language models intrinsically fail to capture correlations between decoded tokens, which leads to a harsh trade-off between sampling quality and throughput. To solve this issue, we propose DiLaDiff, a variant of masked…

18
r/LocalLLaMA community 1mo ago

Could someone please help explain these results?

I'm running Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf on 12 GB VRAM and 32 GB RAM via the TurboQuant variant of llama.cpp. I increased the --n-cpu-moe value from 8 to 30, and my inference rate doubled! (17 to 34 tok/s). Shouldn't it have slowed down from the CPU having to do so much more…

22
r/LocalLLaMA community 1mo ago

How are you all handling agents and sub agents?

Currently got it set up in LibreChat to use DeepSeek v4 pro via OpenRouter to be the master planner, then have my PC running Qwen 35B @ 160ish tok/sec locally, and my mini PC running Gemma E2B locally for smaller tasks. I'm wondering if there are setups out there to effectively…

10
r/LocalLLaMA community 1mo ago

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM

Hello everyone! I want to share the result of my experiment to make Qwen3.6 27B Q4_K_M fit into my RTX 5060 Ti 16 GB. Inspired by u/Due-Project-7507's work on Ununnilium/Qwen3.6-27B-IQ4_XS-pure-GGUF. Using the same pure quantization method, I was able to create a Q4_K_M…

19
llama.cpp releases dev-tools 1mo ago

b9291

SYCL: improve MoE prefill throughput (#23142): change k_copy_src1_to_contiguous so that it uses a precomputed contiguous mapping where all rows "owned" by an expert are in one slice with known starts and ends; switch the O(n_as * n_routed_rows) contraption to a counting sort-based…

27
r/LocalLLaMA community 1mo ago

Cannot get NCCL test to run in docker with 2 x 6000 Pro connected x8 to AM4 CPU

nvidia-smi topo -m is showing both GPUs as PHB (i.e. via CPU) connected as expected, but I cannot get NCCL all_reduce_perf to run at all; it always hangs after starting up. It seems that vLLM won't work with TP=2 until I can fix this. Is there any reason why this setup would…

5
arXiv — NLP / Computation & Language research 1mo ago

HyLoVQA: Dynamic Hypernetwork-Generated Low-Rank Adaptation for Continual Visual Question Answering

arXiv:2605.22035v1 Announce Type: cross Abstract: Continual Visual Question Answering (VQA) requires learning from non-stationary streams of visual inputs and questions while preserving past knowledge. Most prior methods adapt by updating a largely shared parameter set. This…

33
Hugging Face Daily Papers research 1mo ago

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

Abstract KVServe is a service-aware and adaptive framework for optimizing key-value communication compression in disaggregated large language model serving, achieving significant improvements in job completion time and time-to-first-token reduction through dynamic optimization.…

30
Hugging Face Daily Papers research 1mo ago

WorldKV: Efficient World Memory with World Retrieval and Compression

Abstract WorldKV enables persistent world generation in video diffusion models by retrieving and compressing key-value cache chunks to maintain consistency while improving throughput. AI-generated summary Autoregressive video diffusion models have enabled real-time,…

22
Hugging Face Daily Papers research 1mo ago

Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation

Abstract Research investigates subword tokenization's impact on LLM training efficiency and performance through controlled byte-level pretraining experiments, revealing key factors in training throughput and linguistic priors. AI-generated summary Subword tokenization is an…

23
r/LocalLLaMA community 1mo ago

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp

Had been getting great MTP performance with llama.cpp on my RTX 4070 Super 12GB, until they actually merged the MTP PR. Then, performance tanked and was barely above non-MTP. So, I decided to try out ik_llama.cpp since it also supports MTP and is apparently better optimized for…

35
r/LocalLLaMA community 1mo ago

'Am I OpenAI compatible' - a tool and documentation for unified api signatures in open source AI.

This has turned out to be useful to many of my friends so I thought I'd share here as well. I created a tool and documentation page for most major open-source projects' adherence to 'OpenAI compatibility' after seeing inconsistencies between engines like vLLM and llama.cpp. Now…

18
r/MachineLearning community 1mo ago

High E2E latency on fine-tuned Gemma 4 26B despite low TTFT [R]

Recently fine-tuned a Gemma 4 26B model, and I’m seeing surprisingly high end-to-end latency despite the effective inference footprint being much smaller (~4B-ish behavior during serving). Current setup: Model: Gemma 4 26B (fine-tuned) Engine: vLLM Quantization: FP8 Hardware:…

27
Hugging Face Daily Papers research 1mo ago

Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

Abstract Mix-Quant is a phase-aware quantization framework that accelerates long-context, multi-turn LLM inference by applying high-throughput NVFP4 quantization to the prefilling phase while maintaining BF16 precision for decoding. AI-generated summary LLM agents have recently…

30
Hugging Face Daily Papers research 1mo ago

Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

Abstract Orthogonal Gradient Projection for Safety Alignment (OGPSA) addresses the safety-utility trade-off in LLM alignment by preserving general capabilities during sequential safety training through low-rank gradient projection. AI-generated summary Safety post-training can…

32
arXiv — Machine Learning research 1mo ago

CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning

arXiv:2605.20247v1 Announce Type: new Abstract: Catastrophic forgetting remains a major obstacle to continual learning in large language models (LLMs) and vision-language models (VLMs). Although Mixture-of-Experts (MoE) architectures offer an efficient path to scaling, existing…

38
arXiv — Machine Learning research 1mo ago

Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing

arXiv:2605.20262v1 Announce Type: new Abstract: We study selective refusal editing as a three-way control problem: induce non-refusal on designated edit prompts while preserving benign behavior and harmful refusals outside the edit set. We introduce Residual Paving, a routed…

7
arXiv — Machine Learning research 1mo ago

Consistently Informative Soft-Label Temperature for Knowledge Distillation

arXiv:2605.20357v1 Announce Type: new Abstract: Knowledge distillation (KD) transfers knowledge from a high-capacity teacher to a compact student by matching their predictive distributions, with temperature scaling serving as a central mechanism for smoothing teacher predictions…

28
arXiv — NLP / Computation & Language research 1mo ago

Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models

arXiv:2605.20915v1 Announce Type: new Abstract: Machine unlearning aims to remove the influence of specific training data from a model while preserving reliable behavior on the remaining data, making reliable prediction and uncertainty estimation essential for evaluation.…

36
arXiv — NLP / Computation & Language research 1mo ago

DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU

arXiv:2605.20936v1 Announce Type: cross Abstract: Hybrid attention architectures are becoming an increasingly important paradigm for improving LLM inference efficiency while preserving model quality, making hybrid architecture design a central problem. Existing designs often…

27
r/LocalLLaMA community 1mo ago

Opinions/improvements for my Qwen3.6-35B-A3B-FP8 + Hermes Agent setup on NVIDIA DGX Spark?

I’m running Hermes Agent on a single NVIDIA DGX Spark using vLLM with: docker run --gpus all \ --name qwen36-aggressive \ --restart unless-stopped \ -p 8000:8000 \ --ipc=host \ --ulimit memlock=-1 \ --ulimit stack=67108864 \ --shm-size=32g \ -v…

13
r/LocalLLaMA community 1mo ago

Try ik_llama.cpp with MTP if you have limited VRAM. You will be pleasantly surprised!

Had been getting great MTP performance with llama.cpp on my RTX 4070 Super 12GB (75-80 tok/s), until they actually merged the MTP PR. Then, performance tanked (65-70 tok/s) and was barely above non-MTP. I then decided to try out ik_llama.cpp since it also supports MTP. I did not…

14
Hugging Face Daily Papers research 1mo ago

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Abstract POW3R is a policy-aware framework for reinforcement learning with rubric-based rewards that adapts criterion weights during training to improve policy optimization while preserving human-defined criteria importance. AI-generated summary Reinforcement learning with…

12
Hugging Face Daily Papers research 1mo ago

Base Models Look Human To AI Detectors

Abstract Instruction-tuned language models produce text that commercial detectors identify as non-human, prompting the development of a paraphrasing pipeline that improves human-likeness while preserving semantics across different model sizes. AI-generated summary As…

37
