Tag Gpu
500 articles archived under #gpu

r/LocalLLaMA community 3d ago What are people using for multi-model backends? What about swapping configs? I am trying to plan and deploy a machine that serves models for coding, Hermes, and whatever else. It's got multiple GPUs in it, and I want the flexibility to run different configurations (i.e. I might want to run two smaller models when I'm using Hermes and doing some… 23

Ollama releases dev-tools 3d ago v0.30.11 What's Changed launch: add thinking capability detection to opencode by @hoyyeva in #15434 launch: auto-install Claude Code by @hoyyeva in #16802 launch: auto-install opencode when missing by @hoyyeva in #16806 discover: fix inverted iGPU/dGPU Vulkan classification on Windows… 28

NVIDIA Developer Blog official-blog 3d ago Deploy a Production-Ready NVIDIA AI-Q Blueprint on Oracle Cloud Infrastructure AI agents have changed a lot in the last two years. The first could only answer one question at a time. Then came multi-turn chat, where the model could keep... 7

llama.cpp releases dev-tools 3d ago b9820 sched : reintroduce less synchronizations during split compute (#20793) CUDA: Improve performance via less synchronizations between token (#17795) Adds CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async() Adds function to relax sync requirements between input… 26

TechCrunch — AI news-outlet 3d ago Why everyone from OpenAI to SpaceX is building their own chips (and turning up the heat on Nvidia) Nvidia has dominated the AI chip market for years, but the era of total dependence might be ending. OpenAI just shared its plans to spice things up with Jalapeño, its custom inference chip built with Broadcom, joining Google, Apple, and SpaceX in a growing list… 35

r/LocalLLaMA community 3d ago Why do people keep investing in Intel for AI?
If you get a good deal on some Xeons with a lot of memory bandwidth, or a cheap GPU for home inference, that's cool, no disrespect. But how in the hell are Wall Street types considering Intel part of the "AI picks and shovels" play? Who's buying Intel for their AI data centers?… 17

NVIDIA Developer Blog official-blog 3d ago Creating the NVIDIA Nemotron 3 Ultra NVFP4 Checkpoint with NVIDIA Model Optimizer As context windows grow longer, moving large model weights efficiently becomes critical to performance. A common way to address this is quantization, an... 37

TechCrunch — AI news-outlet 3d ago OpenAI’s Jalapeño chip is Big Tech’s spiciest move away from Nvidia Nvidia has dominated the AI chip market for years, but the era of total dependence might be ending. OpenAI just shared its plans to spice things up with Jalapeño, its custom inference chip built with Broadcom, joining Google, Apple, and SpaceX in a growing list… 25

llama.cpp releases dev-tools 3d ago b9810 CUDA: add cublasSgemmBatched mapping for HIP/MUSA vendor headers (#25033) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64… 31

arXiv — Machine Learning research 4d ago \chisao{}: A GPU-Native Parallel Optimizer for Multimodal Black-Box Functions via Convergence-Anticonvergence Oscillation arXiv:2606.26164v1 Announce Type: new Abstract: Finding all modes of a multimodal black-box function is a fundamental challenge in optimization, Bayesian inference, and scientific computing.
Existing approaches -- basin-hopping, CMA-ES, multistart gradient descent -- operate… 26

arXiv — Machine Learning research 4d ago Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization arXiv:2606.26453v1 Announce Type: new Abstract: We present KernelPro, a closed-loop multi-agent system that automatically generates, profiles, and iteratively optimizes GPU kernel code by integrating large language model (LLM) code generation with hardware profiler feedback and… 21

arXiv — Machine Learning research 4d ago PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs arXiv:2606.26666v1 Announce Type: new Abstract: Autoregressive large language model (LLM) serving is increasingly limited by key-value (KV) cache movement rather than dense matrix multiplication. Modern paged-attention systems reduce KV-cache fragmentation and mature kernels… 20

arXiv — NLP / Computation & Language research 4d ago Structure Before Collapse: Transient semantic geometry in next-token prediction arXiv:2606.26749v1 Announce Type: cross Abstract: Neural Collapse predicts that balanced one-hot classification pushes model representations to be equally far from each other; a symmetric configuration that depends only on the output label and ignores any semantic similarity in… 29

arXiv — NLP / Computation & Language research 4d ago Just how sure are you?
Improving Verbalized Uncertainty Calibration in Medical VQA arXiv:2606.27023v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) applied to Medical Visual Question Answering (VQA) tend to produce overconfident outputs regardless of actual correctness, and existing verbalized confidence calibration methods, developed… 15

arXiv — Machine Learning research 4d ago Finding Stationary Points by Comparisons arXiv:2606.27082v1 Announce Type: new Abstract: We study the problem of finding stationary points of non-convex functions when access to the objective is provided only through a comparison oracle that, given two points, outputs which has the larger function value. For a twice… 17

arXiv — NLP / Computation & Language research 4d ago ProvenAI: Provenance-Native Traces of Evidence in Generated Answers arXiv:2606.26449v1 Announce Type: new Abstract: Retrieval-augmented systems routinely present citations alongside generated answers, yet a citation does not confirm that the corresponding source meaningfully shaped the output. This paper introduces ProvenAI, a framework that… 17

arXiv — NLP / Computation & Language research 4d ago DiARC: Distinguishing Positive and Negative Samples Helps Improving ARC-like Reasoning Ability of Large Language Models arXiv:2606.26530v1 Announce Type: new Abstract: The Abstraction and Reasoning Corpus (ARC; Chollet, 2019) contains tasks that require summarizing patterns from limited grid samples and predicting output grids. Recently, many large language model based approaches… 22

arXiv — NLP / Computation & Language research 4d ago GAVEL: Grounded Caption Error Verification and Localization arXiv:2606.26923v1 Announce Type: new Abstract: Vision-language models (VLMs) often produce hallucinated or inconsistent outputs, where text and images are not properly aligned.
Addressing this issue requires not only detecting misalignment but also explaining the discrepancy… 24

arXiv — NLP / Computation & Language research 4d ago KARLA: Knowledge-base Augmented Retrieval for Language Models arXiv:2606.26807v1 Announce Type: cross Abstract: We propose a new method that allows an LLM to automatically pull in factual knowledge from a knowledge base during token generation. This means that (1) factual knowledge in the LLM output can be updated without retraining the… 12

arXiv — NLP / Computation & Language research 4d ago Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement arXiv:2606.27226v1 Announce Type: cross Abstract: Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores… 14

r/LocalLLaMA community 4d ago For dual GPUs, will there be any big impact to inference speeds when running in PCIe 5.0 x8/x4 vs x8/x8? I bought the Biostar Z890 Valkyrie because it was on sale and had three PCIe 5.0 slots connected to the CPU (x16 or x8/x8 or x8/x4/x4), which I thought would be great for running dual GPUs for LLM inference. The problem is that now I want to add a SATA expansion card to the… 25

r/LocalLLaMA community 4d ago When you don't have a data center GPU Please don't tell me someone is going to (yet again) reply with the longest finetune-merge name in eternity... submitted by /u/Iwaku_Real 4

Latent.Space news-outlet 4d ago [AINews] OpenAI reports median internal Codex output tokens grew 56x in Research, 32x in Customer Support, 27x in Engineering, and 13x in Legal since November 2025. It's happening.
37

r/LocalLLaMA community 4d ago Built an open source local first Kanban workflow for running AI coding agents without babysitting every step I’ve been building BatonBot, a local first app for running AI coding workflows with less babysitting. The problem I kept running into, especially with local models, is that coding agents can be useful but the workflow gets slow: start task → wait → check output → fix next issue… 10

r/LocalLLaMA community 4d ago audio.cpp: 12 audio models (Qwen3-TTS, PocketTTS, VeVo2 etc) in 1 C++/ggml runtime — TTS up to 5x faster than Python on CUDA I’ve been working on audio.cpp, a native C++ inference framework for audio models built on top of ggml. The framework currently has 25 model families, but I want to be precise about its state: 12 are released in the repo now and ready for normal use. I’m not counting anything… 24

NVIDIA Developer Blog official-blog 4d ago Streamlining Resource Binding with End-to-End Support for Vulkan Descriptor Heaps Shaders are GPU programs that process visual data—such as rays, pixels, geometry, and textures—to produce specific rendering effects. Shaders find necessary... 32

r/MachineLearning community 4d ago Kuma: compiling PyTorch models into self-contained WebGPU executables [P] I've been experimenting with a compiler/runtime project that I'm not entirely sure is a good idea, so I'd love some feedback from people who've worked on deployment systems. The idea is to compile an exported PyTorch model into a self-contained package that contains: graph… 33

r/LocalLLaMA community 4d ago DGX Spark OS lifetime? I think of purchasing 2 DGX Sparks for my office (because a 700+W workstation would be intolerable) for LLM-centric work (inference only, no fine-tuning). I know the OS is based on Ubuntu 24.04. Has Nvidia ever disclosed what is the lifetime of the OS?
Meaning, is there a chance… 17

r/LocalLLaMA community 4d ago LFM2.5 230M running in-browser at 1,400 tok/s using custom WebGPU kernels Everything runs locally in your browser using custom WebGPU kernels written by Fable 5 (before it was shut down) and Opus 4.8. The video was recorded on my M4 Max. Model: LiquidAI/LFM2.5-230M (GGUF) Demo: https://huggingface.co/spaces/webml-community/lfm2-webgpu-kernels … 37

Hugging Face Daily Papers research 4d ago Forecasting Future Behavior as a Learning Task Abstract Behavior Forecasters are trained to predict large reasoning model outputs from single trajectories, outperforming large language models while requiring significantly less computational cost. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Trust in an AI system is often… 24

NVIDIA Developer Blog official-blog 4d ago Scaling AI Inference Across Multiple GPUs Using NVIDIA TensorRT with Multi-Device Inference Support Generative AI workloads are rapidly outgrowing the memory and compute budget of single GPUs. For inference developers building media generation pipelines, the... 38

NVIDIA Developer Blog official-blog 4d ago How KRAFTON Built PUBG Ally, a Co-Playable Character Powered by NVIDIA ACE AI companions in games have long been constrained by scripted behavior trees and fixed dialogue. PUBG Ally is a different kind of system. Built by KRAFTON for...
26

r/LocalLLaMA community 4d ago Tensor Split Fix for intel GPU's llama.cpp release b9788 sycl : support --split-mode tensor #24152 I'd like to see some numbers if anyone has 2xintel gpus and tries this out submitted by /u/Bulky-Priority6824 10

r/LocalLLaMA community 4d ago siq1 on kebab bench tested my model on kebab bench and it performs very well: https://huggingface.co/spaces/AlexWortega/hermes-agent-zerogpu submitted by /u/Mysterious_Hearing14 29

Hugging Face Daily Papers research 4d ago What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics Abstract Jailbreak attacks expose vulnerabilities in aligned large language models, revealing that harmful intent is encoded in structured intermediate uncertainty dynamics rather than output representations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Jailbreak attacks reveal… 23

Hugging Face Daily Papers research 4d ago Constraint Tax in Open-Weight LLMs: An Empirical Study of Tool Calling Suppression Under Structured Output Constraints Abstract Tool Suppression occurs when JSON Schema constraints and tool calling are jointly enabled, preventing open-weight models from invoking tools despite maintaining schema compliance, with the issue stemming from grammar-based token masking that makes tool-call tokens… 5

llama.cpp releases dev-tools 4d ago b9788 sycl : support --split-mode tensor (#24152) Sycl tp stage1 (#1) SYCL: tensor parallelism (--split-mode tensor) for dual-GPU Adds the comm_init/comm_free/comm_allreduce_tensor trio that the meta-backend queries via get_proc_address to enable backend-specific all-reduce,… 33

r/LocalLLaMA community 4d ago NVIDIA has released Nemotron-TwoTower-30B-A3B-Base-BF16, an unusual diffusion-based language model built from the Nemotron 3 Nano 30B-A3B backbone.
Instead of generating strictly one token at a time, it uses a frozen autoregressive context tower plus a diffusion denoiser tower… 38

r/LocalLLaMA community 4d ago Worse quality with MTP - Qwen 3.6, Gemma 4 Hi. I am self-hosting Qwen 3.6 27B Q8_K_XL with Llama.cpp on 4x5070ti. (All 4 cards are on single x16 slot bifurcated to 4x4 with risers). I've been testing it on several work repos with Opencode CLI and in like 8/10 situations the output of non-MTP model is far superior to the… 8

Hugging Face Daily Papers research 4d ago CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression Abstract Two-channel evaluation shows output compression reduces costs while input compression increases costs and degrades accuracy across models and datasets. Generated by Qwen/Qwen2.5-Coder-32B-Instruct "Talk short. Drop grammar. Save token." This caveman style is widely… 28

r/LocalLLaMA community 5d ago If LLMs are so good at coding… How come things like ROCm and the intel stack aren’t able to rapidly improve their software ecosystems to be a match for CUDA? Until the software from other vendors catches up with NVIDIA, they’re always going to get away with charging a massive premium on their “it just works”… 38

arXiv — Machine Learning research 5d ago Quantifying Explainable AI-introduced signal noise on ECG data with Spectral Entropy arXiv:2606.24974v1 Announce Type: new Abstract: Explainability techniques are used to assess the output of various deep learning models. This is especially true in healthcare, where models need to be trusted and decisions justified. Explainability (XAI) tools use heuristics… 22

arXiv — Machine Learning research 5d ago Erased, but Not Gone: Output Forgetting Is Not True Forgetting arXiv:2606.25001v1 Announce Type: new Abstract: Machine unlearning (MU) is commonly judged by output forgetting, such as low forget-set accuracy or reduced logit-level membership inference.
But if output-level success can coexist with retraining-inconsistent residuals in… 26

arXiv — Machine Learning research 5d ago Speculative Decoding at Temperature Zero: A Scoped Safety-Invariance Screen with a 48,072-Sample Expansion arXiv:2606.25097v1 Announce Type: new Abstract: Speculative decoding accelerates inference by letting a draft model propose tokens for a target model to verify, raising a concrete safety question: at temperature zero, can draft-side behavior leak into safety-scored outputs? We… 7

arXiv — Machine Learning research 5d ago Brevity is the Soul of Inference Efficiency: Inducing Concision in VLMs via Data Curation arXiv:2606.25432v1 Announce Type: new Abstract: Inference efficiency is typically pursued by shrinking the model: distillation, pruning, quantization, and sparse routing each lower per-token cost while treating token count as fixed. But output length has been inflating, and it… 28

arXiv — NLP / Computation & Language research 5d ago What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics arXiv:2606.25182v1 Announce Type: new Abstract: Jailbreak attacks reveal a persistent weakness in aligned Large Language Models: carefully crafted prompts can elicit policy-violating responses despite safety training. While most defenses operate at the prompt or output level, it… 5

arXiv — NLP / Computation & Language research 5d ago Constraint Tax in Open-Weight LLMs: An Empirical Study of Tool Calling Suppression Under Structured Output Constraints arXiv:2606.25605v1 Announce Type: new Abstract: Tool Calling and Structured Output are two core capabilities of modern Agent systems, yet their interaction under joint deployment conditions remains insufficiently understood.
This paper reports a reproducible phenomenon observed… 10

arXiv — NLP / Computation & Language research 5d ago Weave of Formal Thought arXiv:2606.25987v1 Announce Type: new Abstract: Large language models (LLMs) attain remarkable surface fluency on code, yet they neither formally guarantee the syntactic validity of their output nor leverage the hierarchical structure defining the target language. While existing… 18

arXiv — NLP / Computation & Language research 5d ago Tracing Target Answers in Poisoned Retrieval Corpora via Token Influence Attribution arXiv:2606.25721v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) systems are vulnerable to corpus poisoning attacks that manipulate model outputs through malicious retrieved documents. Existing detection methods typically rely on auxiliary classifiers or… 30

arXiv — NLP / Computation & Language research 5d ago RAS: Measuring LLM Safety Through Refusal Alignment arXiv:2606.25750v1 Announce Type: cross Abstract: Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. Although useful, output-level evaluation is… 27

Page 2 of 10