500 articles archived under #gpu
r/LocalLLaMA community 18d ago Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo What it is, in plain words. Your GPU keeps two float vectors for every token of your conversation. That’s the KV cache, and it’s why long contexts eat VRAM: Llama-3.1-8B needs about 0.12 MB per token, so 100k tokens costs 12 GB and a million tokens costs 122 GB. No consumer card… 33 Hugging Face Daily Papers research 18d ago MiniMax Sparse Attention Abstract MiniMax Sparse Attention enables efficient processing of ultra-long contexts in large language models through blockwise sparsity and optimized GPU execution, achieving significant speedups while maintaining performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 20 arXiv — NLP / Computation & Language research 18d ago Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU arXiv:2606.12765v1 Announce Type: new Abstract: Apple's Metal 4.1 exposes a tensor compute path: the Metal Performance Primitives (MPP) matmul2d operation over cooperative_tensor fragments, whose interface is documented but whose hardware behavior is deliberately hidden. The… 18 arXiv — NLP / Computation & Language research 18d ago GENIE: A Fine-Grained Measure for Novelty arXiv:2606.12790v1 Announce Type: new Abstract: Large Language Models have consistently demonstrated a lack of creativity and diversity across tasks. Prior work has focused on addressing whether models are capable of generating creative outputs. 
Here, we aim to consider novelty… 38 arXiv — NLP / Computation & Language research 18d ago PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation arXiv:2606.12911v1 Announce Type: new Abstract: Cascaded speech translation (ST) systems suffer from error propagation when Automatic Speech Recognition (ASR) outputs incorrect transcripts. We present the first systematic categorization of ASR errors for Vietnamese ST,… 8 arXiv — NLP / Computation & Language research 18d ago S-GBT: Smooth Growth Bound Tensor for Certified Robustness Against Word Substitution Attacks in NLP arXiv:2606.13439v1 Announce Type: new Abstract: Despite recent progress in Natural Language Processing (NLP), models remain vulnerable to word substitution attacks. Most existing defenses focus on first order sensitivity and measure how much the output changes when the input is… 9 arXiv — NLP / Computation & Language research 18d ago Detecting Functional Memorization in Code Language Models arXiv:2606.12764v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly used to generate code at scale. Meanwhile, prior work has investigated whether training data may be recoverable from model outputs, by auditing the textual overlap between training… 7 Hugging Face Daily Papers research 18d ago MuJoCo-Drones-Gym: A GPU-Accelerated Multi-Drone Simulator for Control and Reinforcement Learning Abstract A Gymnasium-compatible multi-drone simulation environment built on MuJoCo physics engine that supports flexible physics models, action interfaces, and observation spaces for reinforcement learning applications. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Robotic… 35 NVIDIA Developer Blog official-blog 18d ago One-Click Multi-Tenant Security with NVIDIA Quantum InfiniBand NVIDIA Quantum InfiniBand now offers intent-based security profiles in Unified Fabric Manager (UFM) that enable multi-tenant fabric security in a single... 
33 r/LocalLLaMA community 18d ago xdna-top: unified NPU+iGPU terminal monitor for Strix Halo (Ryzen AI Max) — finally see the NPU work If you're running local models on a Ryzen AI Max / Strix Halo box, you've probably noticed it's hard to see what the NPU is actually doing. amd-smi is still broken on gfx1151 (ROCm #6035, https://github.com/ROCm/ROCm/issues/6035), and while GNOME Resources has a GUI view, I… 21 The Information — AI news-outlet 18d ago KKR, Nvidia, Others Launch $10 Billion Data Center Company Private equity firm KKR, the Kuwait Investment Authority, Nvidia and power generation company Vistra launched a new company on Thursday to finance and help build AI data centers. Nvidia’s role as an anchor investor in Helix signifies another extension of the AI giant’s growing… 29 r/LocalLLaMA community 18d ago Reviewing speed optimizations on llamacpp for large MoE models on multiGPU rigs? (fitparams vs -ngl/-ncmoe vs other flags, P2P, overclocking) In anticipation of MiniMax's reported upcoming open-weight release of M3, I wanted to do a comprehensive review of what I’m aware of regarding speed optimizations. Hopefully it can be a helpful reference for some people too. I outlined my understanding of currently available speed… 24 Hugging Face Daily Papers research 18d ago DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models Abstract DRIFT is a framework that adapts pretrained vision-language models for continuous decoding tasks by combining coarse prediction with iterative refinement through flow matching, improving performance across perception and planning tasks. Generated by… 12 r/LocalLLaMA community 18d ago DiffusionGemma 4 on 4x7900xtx Just got 100 tps on generation, but over the total time it is around 45-60 t/s once the wait for prompt processing is included. 
Available memory shows: GPU KV cache size: 152,671 tokens. Maximum concurrency for 131,072 tokens per request: 1.16x. amd-smi monitor for this GPU: GPU XCP POWER GPU_T MEM_T… 6 r/LocalLLaMA community 18d ago Any chances for a 12B diffusion Gemma? Currently recompiling my llama.cpp with support for diffusion Gemma, but I know on my hardware it won't likely be all that viable. I feel like if the goal was to take better advantage of consumer GPUs for fast, intelligent generation, building a diffusion model off the biggest… 16 r/LocalLLaMA community 18d ago DiffusionGemma 26B A4B results on my 5090 # DiffusionGemma 26B A4B — Tuning Results (note: these are my tuning results but Deepseek assisted in generation of testing scripts and reports) https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF System - **GPU**: RTX 5090 (32 GB VRAM), CUDA 13.3 - **Build**:… 17 r/LocalLLaMA community 18d ago DiffusionGemma under real workloads feels very different from benchmark demos okay after testing DiffusionGemma a bit more internally we genuinely can’t tell if this is the start of something big or if everyone’s just getting distracted by crazy TPS numbers again lol but one thing that stood out REALLY fast for us was how different the H100 vs A100… 29 r/LocalLLaMA community 18d ago Are older Titan cards still viable? Looking at older Nvidia cards under £200 for Gemma/Qwen MOE coding. Is there any reason to avoid older Titan 12GB cards other than being power hungry? They have more memory bandwidth than the newer consumer cards: Titan X 12GB, 480GB/s; Titan XP 12GB, 547GB/s; Titan V 12GB, 652GB/s… 38 r/LocalLLaMA community 18d ago "How NVIDIA Built Nemotron 3 Open Model" by "Caleb Writes Code" x "Joey Conway" submitted by /u/Jeidoz 28 r/LocalLLaMA community 19d ago AMD R9700 vs GB10 I have a budget of 5K and want to buy some GPUs. My requirement is 48GB+ VRAM, because I fine-tune small language models and perform DPO; in general, tinkering/development is my use case. 
If you were in my shoes, which among these would you get? On one hand, AMD is better bang for the buck,… 4 arXiv — Machine Learning research 19d ago Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs arXiv:2606.12280v1 Announce Type: new Abstract: Post-training quantization lets large text-to-image diffusion transformers run on consumer GPUs, yet the hardware-specific trade-offs are seldom measured directly. We quantize Ideogram 4.0 - a 9.3B flow-matching diffusion… 17 arXiv — NLP / Computation & Language research 19d ago PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference arXiv:2606.11196v1 Announce Type: new Abstract: Decentralized LLM inference networks need lightweight, reference-free quality evaluation for Proof of Quality (PoQ). We present PoQ-Judge, a framework that trains dedicated judge models to score query-output pairs without… 20 arXiv — NLP / Computation & Language research 19d ago The Structural Attention Tax: How Retrieval Format Hijacks In-Context Learning Independent of Content arXiv:2606.11198v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) systems inject external knowledge to improve LLM outputs, yet the format of injected content -- distinct from its semantic relevance -- can independently distort the model's attention… 6 arXiv — NLP / Computation & Language research 19d ago LatticeBridge: Rare-Event Sequential Inference for Faithful Structured Sequence Synthesis arXiv:2606.11203v1 Announce Type: new Abstract: Structured sequence generation often requires a model to satisfy several input-derived constraints in a single output. 
Standard decoding methods may assign high probability to fluent continuations while placing low mass on… 22 arXiv — NLP / Computation & Language research 19d ago Substrate Asymmetry in User-Side Memory: A Diagnostic Framework arXiv:2606.11712v1 Announce Type: new Abstract: User-side memory in LLMs is typically scored as a single "personalization" capability: given a user's history, is the output more user-aware? We show this aggregate metric hides opposite-direction failures. Memory factorises into… 22 arXiv — NLP / Computation & Language research 19d ago Agreement in Representation Space for Open-Ended Self-Consistency arXiv:2606.12003v1 Announce Type: new Abstract: Self-consistency improves LLM reasoning by sampling multiple outputs and selecting the most consistent answer, but existing formulations largely rely on exact matching and therefore remain limited to tasks with categorical outputs.… 27 arXiv — NLP / Computation & Language research 19d ago On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study arXiv:2606.12234v1 Announce Type: new Abstract: Controlling the output of Large Language Models (LLMs) is a central challenge for their reliable deployment, yet a clear understanding of the involved trade-offs remains elusive. Current approaches to conditioning are often… 26 arXiv — NLP / Computation & Language research 19d ago Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs arXiv:2606.12385v1 Announce Type: new Abstract: Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. 
These dependencies are recursive: a model may depend on an upstream artifact whose… 7 r/LocalLLaMA community 19d ago nvidia/diffusiongemma-26B-A4B-it-NVFP4 · Hugging Face Model Overview Description: DiffusionGemma 26B A4B IT is an open-weights multimodal generative model developed by Google DeepMind that processes text, image, and video inputs to produce text output via discrete diffusion. Built on the Gemma 4 26B A4B Mixture-of-Experts (MoE)… 12 LangChain releases dev-tools 19d ago langchain-core==1.4.5 Changes since langchain-core==1.4.4 release(core): 1.4.5 ( #38056 ) feat(standard-tests): validate tool call chunks during streaming ( #34707 ) fix(core): async tracer on_chat_model_start fallback in sync context ( #35233 ) fix(langchain): tighten structured output model… 13 r/LocalLLaMA community 19d ago I wired a fully offline voice loop to Ollama + LM Studio — 100% CPU, no GPU, nothing leaves your machine (Silero VAD + Parakeet STT + Supertonic TTS 3) I kept wanting to talk to my local models instead of typing, but every voice setup wanted a GPU, shipped my audio to the cloud, or was macOS-only. So I built one that's none of those — and I benchmarked it, so these are real measured numbers, not vibes. One command installs the… 12 Ars Technica — AI news-outlet 19d ago Google DeepMind releases DiffusionGemma, a model that runs local AI 4x faster Diffusion AI is most common in image generation, but it can make text outputs much faster. 29 r/LocalLLaMA community 19d ago Best Open-Source AI coding model for my specs? hello everyone! im looking for the most powerful open-source coding ai while still fitting my system my specs: CPU: AMD ryzen 7 7700 GPU: RTX 5070 RAM: 32 gb DDR5 OS: windows 11 use case: Writing, Coding, debugging. any recommendations would be great. 
thanks in advance  … 4 Hugging Face Daily Papers research 19d ago When Behavioral Safety Evaluation Fails: A Representation-Level Perspective Abstract Behavioral safety evaluations of large language models provide incomplete insights into internal robustness, as demonstrated by the audit gap between observable outputs and latent space vulnerabilities revealed through intervention-based testing. Generated by… 38 r/LocalLLaMA community 19d ago FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the… 25 NVIDIA Developer Blog official-blog 19d ago Run DiffusionGemma on NVIDIA for Developer-Ready, High-Throughput Text Generation Developers building real-time AI—such as chat assistants, copilots, and agentic workflows—are often constrained by token-by-token generation speed. This... 6 r/LocalLLaMA community 19d ago SenseNova U1 dropped an infographic-specific finetune it's the same U1-8B-MoT base with an extended MT (multi-task) training phase focused on structured visual output. 
the benchmark jumps are significant: IGenBench I-ACC (infographic accuracy): 4.2👉17.0 (4x); Chart Understanding: 51.3👉69.5; Text Rendering: 39.8👉46.6; Overall… 32 llama.cpp releases dev-tools 19d ago b9589 CUDA: Fix ssm_scan_f32 data-races ( #24360 ) Add missing syncthreads before reusing cub_temp_storage __syncthreads() is required before being allowed to reuse TempStorage smem:… 32 r/LocalLLaMA community 19d ago 1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM Bonsai LM (1-bit and 1.58-bit LLMs) benchmark on Jetson Orin Nano Super Just released a deep benchmark of 5 Bonsai LM models (1.7B → ~8B) on a $250 Jetson Orin Nano Super 8GB using llama.cpp CUDA - across all 4 power modes: 7W, 15W, 25W, and MAXN A thread! So, Bonsai LM models… 29 r/LocalLLaMA community 19d ago How long do you think it will take for the stock market to notice that Apple and Microsoft announced at the same time that they're all-in for local AI? Microsoft's Surface with the crappy old Nvidia chip won't keep up with anything from Apple, but Microsoft wouldn't be on board if Nvidia didn't have a roadmap for more and better laptop chips. And Apple can crash the market on a whim by just announcing a line of products that… 12 r/LocalLLaMA community 19d ago Qwen3.6-MTP-27B on Tesla V100 @ 55 TPS (llama.cpp) — Any way to push this higher without quality loss? Hey everyone, I'm running Qwen3.6-MTP-27B-MTP (Q4_K_M) with llama.cpp server on a Tesla V100, and I'm currently getting around 55 tokens/sec. I'm trying to find out whether there are any configuration changes that could increase throughput further without reducing output… 31 arXiv — NLP / Computation & Language research 20d ago Where You Inject Diversity Matters: A Unified Framework for Diverse Generation arXiv:2606.10302v1 Announce Type: new Abstract: Open-ended generation tasks often require a set of meaningfully different outputs, yet large language models often produce similar generations. 
Existing test-time diversity methods operate at different stages of generation with… 23 arXiv — NLP / Computation & Language research 20d ago MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents arXiv:2606.10304v1 Announce Type: new Abstract: When LLM agents are coerced into covertly encoding sensitive data (Base64, ROT13, acrostic, synonym chains, and beyond), the resulting outputs evade output-side detection but the underlying computation does not. Across nine… 37 arXiv — NLP / Computation & Language research 20d ago Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation arXiv:2606.10475v1 Announce Type: cross Abstract: Multi-agent debate frameworks have been shown to improve large language model performance in convergent tasks, but they are currently optimized in a way that heavily favors final output accuracy rather than stability of the… 18 arXiv — NLP / Computation & Language research 20d ago Representation-Aware Advantage Estimation: Your Reward Model Provides More Than A Scalar Output arXiv:2606.10528v1 Announce Type: cross Abstract: Current reinforcement learning from human feedback (RLHF) methods primarily rely on scalar rewards from a trained reward model (RM). While effective, scalar rewards are often noisy and fail to capture fine-grained preference… 32 The Information — AI news-outlet 20d ago OpenAI in Talks to Lease 10 Gigawatt Ohio Data Center with Backing From Nvidia OpenAI is in advanced negotiations to lease a proposed 10 gigawatt data center campus on federal land in Ohio as part of a deal that could include financial backing from Nvidia , according to two people with direct knowledge of the discussions. 
The campus under discussion would… 12 r/LocalLLaMA community 20d ago Furiosa AI selling inference chip to consumer market will be a game changer to local llm This is a South Korean startup going all-in on inference chips: https://furiosa.ai/renegade-spec TSMC 5nm node, Hynix HBM3, 1.5TB/s, 48GB VRAM, TDP 180W. Already tested on LG LLM. If they opened their programming interface the way NVIDIA opens PTX and Intel opens SPIR-V, and team up… 12 r/LocalLLaMA community 20d ago Since when the RTX 6000 PRO is priced at 13250USD on the official NVIDIA Page? https://marketplace.nvidia.com/en-us/enterprise/laptops-workstations/nvidia-rtx-pro-6000-blackwell-workstation-edition/ submitted by /u/panchovix 7 Hugging Face Daily Papers research 20d ago Phase Marginalization for Patch-Grid Instability in Vision Transformers Abstract Phase Marginalization is a post-hoc method that addresses phase-dependent instability in Vision Transformers by evaluating structured patch-grid phases and aggregating outputs in the original image coordinate system. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision… 32 NVIDIA Developer Blog official-blog 20d ago Delivering Lifecycle Control for AI Infrastructure at Scale with NVIDIA DGX Spark Enterprise Manageability As AI infrastructure scales, enterprise expectations for operational maturity are increasing. Organizations expect these systems to be provisionable,... 38