Tag

Gpu

500 articles archived under #gpu · RSS

r/LocalLLaMA community 18d ago

Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo

What it is, in plain words. Your GPU keeps two float vectors for every token of your conversation. That’s the KV cache, and it’s why long contexts eat VRAM: Llama-3.1-8B needs about 0.12 MB per token, so 100k tokens costs 12 GB and a million tokens costs 122 GB. No consumer card…
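The per-token figure in the snippet above can be reproduced with a short calculation. This is a minimal sketch, not the post author's code: it assumes Llama-3.1-8B's published configuration (32 layers, 8 key/value heads, head dimension 128), 16-bit cache entries, and that the quoted "MB" and "GB" are binary units (MiB, GiB). The function and variable names are illustrative.

```python
# Sketch: KV cache size estimate under the assumptions stated above.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: a K and a V vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

MIB = 1024 ** 2
GIB = 1024 ** 3

# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16/bf16.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)

print(f"per token:   {per_token / MIB:.3f} MiB")
print(f"100k tokens: {100_000 * per_token / GIB:.1f} GiB")
print(f"1M tokens:   {1_000_000 * per_token / GIB:.1f} GiB")
```

Under these assumptions the results line up with the snippet's roughly 0.12 MB per token, 12 GB at 100k tokens, and 122 GB at a million tokens. A quantized cache or a different model shape changes the totals proportionally.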

33
Hugging Face Daily Papers research 18d ago

MiniMax Sparse Attention

Abstract MiniMax Sparse Attention enables efficient processing of ultra-long contexts in large language models through blockwise sparsity and optimized GPU execution, achieving significant speedups while maintaining performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

20
arXiv — NLP / Computation & Language research 18d ago

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

arXiv:2606.12765v1 Announce Type: new Abstract: Apple's Metal 4.1 exposes a tensor compute path: the Metal Performance Primitives (MPP) matmul2d operation over cooperative_tensor fragments, whose interface is documented but whose hardware behavior is deliberately hidden. The…

18
arXiv — NLP / Computation & Language research 18d ago

GENIE: A Fine-Grained Measure for Novelty

arXiv:2606.12790v1 Announce Type: new Abstract: Large Language Models have consistently demonstrated a lack of creativity and diversity across tasks. Prior work has focused on addressing whether models are capable of generating creative outputs. Here, we aim to consider novelty…

38
arXiv — NLP / Computation & Language research 18d ago

PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation

arXiv:2606.12911v1 Announce Type: new Abstract: Cascaded speech translation (ST) systems suffer from error propagation when Automatic Speech Recognition (ASR) outputs incorrect transcripts. We present the first systematic categorization of ASR errors for Vietnamese ST,…

8
arXiv — NLP / Computation & Language research 18d ago

S-GBT: Smooth Growth Bound Tensor for Certified Robustness Against Word Substitution Attacks in NLP

arXiv:2606.13439v1 Announce Type: new Abstract: Despite recent progress in Natural Language Processing (NLP), models remain vulnerable to word substitution attacks. Most existing defenses focus on first order sensitivity and measure how much the output changes when the input is…

9
arXiv — NLP / Computation & Language research 18d ago

Detecting Functional Memorization in Code Language Models

arXiv:2606.12764v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly used to generate code at scale. Meanwhile, prior work has investigated whether training data may be recoverable from model outputs, by auditing the textual overlap between training…

7
Hugging Face Daily Papers research 18d ago

MuJoCo-Drones-Gym: A GPU-Accelerated Multi-Drone Simulator for Control and Reinforcement Learning

Abstract A Gymnasium-compatible multi-drone simulation environment built on MuJoCo physics engine that supports flexible physics models, action interfaces, and observation spaces for reinforcement learning applications. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Robotic…

35
NVIDIA Developer Blog official-blog 18d ago

One-Click Multi-Tenant Security with NVIDIA Quantum InfiniBand

NVIDIA Quantum InfiniBand now offers intent-based security profiles in Unified Fabric Manager (UFM) that enable multi-tenant fabric security in a single...

33
r/LocalLLaMA community 18d ago

xdna-top: unified NPU+iGPU terminal monitor for Strix Halo (Ryzen AI Max) — finally see the NPU work

If you're running local models on a Ryzen AI Max / Strix Halo box, you've probably noticed it's hard to see what the NPU is actually doing. amd-smi is still broken on gfx1151 (ROCm #6035 ( https://github.com/ROCm/ROCm/issues/6035 )), and while GNOME Resources has a GUI view, I…

21
The Information — AI news-outlet 18d ago

KKR, Nvidia, Others Launch $10 Billion Data Center Company

Private equity firm KKR, the Kuwait Investment Authority, Nvidia and power generation company Vistra launched a new company on Thursday to finance and help build AI data centers. Nvidia’s role as an anchor investor in Helix signifies another extension of the AI giant’s growing…

29
r/LocalLLaMA community 18d ago

Reviewing speed optimizations on llamacpp for large MoE models on multiGPU rigs? (fitparams vs -ngl/-ncmoe vs other flags, P2P, overclocking)

In anticipation of MiniMax's reported upcoming open-weight release of M3, I wanted to do a comprehensive review of what I’m aware of regarding speed optimizations. Hopefully it can be a helpful reference for some people too. I outlined my understanding of currently available speed…

24
Hugging Face Daily Papers research 18d ago

DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

Abstract DRIFT is a framework that adapts pretrained vision-language models for continuous decoding tasks by combining coarse prediction with iterative refinement through flow matching, improving performance across perception and planning tasks. Generated by…

12
r/LocalLLaMA community 18d ago

DiffusionGemma 4 on 4x7900xtx

Just got 100 t/s on generation, but over total time it is around 45-60 t/s once the wait for prompt processing is included. Available memory shows: GPU KV cache size: 152,671 tokens; maximum concurrency for 131,072 tokens per request: 1.16x. amd-smi monitor for this GPU: GPU XCP POWER GPU_T MEM_T…

6
r/LocalLLaMA community 18d ago

Any chances for a 12B diffusion Gemma?

Currently recompiling my llama.cpp with support for diffusion Gemma, but I know on my hardware it likely won't be all that viable. I feel like if the goal was to take better advantage of consumer GPUs for fast, intelligent generation, building a diffusion model off the biggest…

16
r/LocalLLaMA community 18d ago

DiffusionGemma 26B A4B results on my 5090

# DiffusionGemma 26B A4B — Tuning Results (note: these are my tuning results but Deepseek assisted in generation of testing scripts and reports) https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF System - **GPU**: RTX 5090 (32 GB VRAM), CUDA 13.3 - **Build**:…

17
r/LocalLLaMA community 18d ago

DiffusionGemma under real workloads feels very different from benchmark demos

okay after testing DiffusionGemma a bit more internally we genuinely can’t tell if this is the start of something big or if everyone’s just getting distracted by crazy TPS numbers again lol but one thing that stood out REALLY fast for us was how different the H100 vs A100…

29
r/LocalLLaMA community 18d ago

Are older Titan cards still viable?

Looking at older Nvidia cards under £200 for Gemma/Qwen MoE coding. Is there any reason to avoid older Titan 12GB cards other than being power hungry? They have more memory bandwidth than the newer consumer cards: Titan X 12GB 480GB/s; Titan XP 12GB 547GB/s; Titan V 12GB 652GB/s…

38
r/LocalLLaMA community 18d ago

"How NVIDIA Built Nemotron 3 Open Model" by "Caleb Writes Code" x "Joey Conway"

submitted by /u/Jeidoz

28
r/LocalLLaMA community 19d ago

AMD R9700 vs GB10

I have a budget of 5K and want to buy some GPUs. My requirement is 48GB+ VRAM, because I finetune small language models and perform DPO; in general, tinkering/development is my use case. If you were in my shoes, which among these would you get? On one hand AMD is better bang for buck,…

4
arXiv — Machine Learning research 19d ago

Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs

arXiv:2606.12280v1 Announce Type: new Abstract: Post-training quantization lets large text-to-image diffusion transformers run on consumer GPUs, yet the hardware-specific trade-offs are seldom measured directly. We quantize Ideogram 4.0 - a 9.3B flow-matching diffusion…

17
arXiv — NLP / Computation & Language research 19d ago

PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference

arXiv:2606.11196v1 Announce Type: new Abstract: Decentralized LLM inference networks need lightweight, reference-free quality evaluation for Proof of Quality (PoQ). We present PoQ-Judge, a framework that trains dedicated judge models to score query-output pairs without…

20
arXiv — NLP / Computation & Language research 19d ago

The Structural Attention Tax: How Retrieval Format Hijacks In-Context Learning Independent of Content

arXiv:2606.11198v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) systems inject external knowledge to improve LLM outputs, yet the format of injected content -- distinct from its semantic relevance -- can independently distort the model's attention…

6
arXiv — NLP / Computation & Language research 19d ago

LatticeBridge: Rare-Event Sequential Inference for Faithful Structured Sequence Synthesis

arXiv:2606.11203v1 Announce Type: new Abstract: Structured sequence generation often requires a model to satisfy several input-derived constraints in a single output. Standard decoding methods may assign high probability to fluent continuations while placing low mass on…

22
arXiv — NLP / Computation & Language research 19d ago

Substrate Asymmetry in User-Side Memory: A Diagnostic Framework

arXiv:2606.11712v1 Announce Type: new Abstract: User-side memory in LLMs is typically scored as a single "personalization" capability: given a user's history, is the output more user-aware? We show this aggregate metric hides opposite-direction failures. Memory factorises into…

22
arXiv — NLP / Computation & Language research 19d ago

Agreement in Representation Space for Open-Ended Self-Consistency

arXiv:2606.12003v1 Announce Type: new Abstract: Self-consistency improves LLM reasoning by sampling multiple outputs and selecting the most consistent answer, but existing formulations largely rely on exact matching and therefore remain limited to tasks with categorical outputs.…

27
arXiv — NLP / Computation & Language research 19d ago

On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study

arXiv:2606.12234v1 Announce Type: new Abstract: Controlling the output of Large Language Models (LLMs) is a central challenge for their reliable deployment, yet a clear understanding of the involved trade-offs remains elusive. Current approaches to conditioning are often…

26
arXiv — NLP / Computation & Language research 19d ago

Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

arXiv:2606.12385v1 Announce Type: new Abstract: Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose…

7
r/LocalLLaMA community 19d ago

nvidia/diffusiongemma-26B-A4B-it-NVFP4 · Hugging Face

Model Overview Description: DiffusionGemma 26B A4B IT is an open-weights multimodal generative model developed by Google DeepMind that processes text, image, and video inputs to produce text output via discrete diffusion. Built on the Gemma 4 26B A4B Mixture-of-Experts (MoE)…

12
LangChain releases dev-tools 19d ago

langchain-core==1.4.5

Changes since langchain-core==1.4.4 release(core): 1.4.5 ( #38056 ) feat(standard-tests): validate tool call chunks during streaming ( #34707 ) fix(core): async tracer on_chat_model_start fallback in sync context ( #35233 ) fix(langchain): tighten structured output model…

13
r/LocalLLaMA community 19d ago

I wired a fully offline voice loop to Ollama + LM Studio — 100% CPU, no GPU, nothing leaves your machine (Silero VAD + Parakeet STT + Supertonic TTS 3)

I kept wanting to talk to my local models instead of typing, but every voice setup wanted a GPU, shipped my audio to the cloud, or was macOS-only. So I built one that's none of those — and I benchmarked it, so these are real measured numbers, not vibes. One command installs the…

12
Ars Technica — AI news-outlet 19d ago

Google DeepMind releases DiffusionGemma, a model that runs local AI 4x faster

Diffusion AI is most common in image generation, but it can make text outputs much faster.

29
r/LocalLLaMA community 19d ago

Best Open-Source AI coding model for my specs?

hello everyone! im looking for the most powerful open-source coding ai while still fitting my system my specs: CPU: AMD ryzen 7 7700 GPU: RTX 5070 RAM: 32 gb DDR5 OS: windows 11 use case: Writing, Coding, debugging. any recommendations would be great. thanks in advance  …

4
Hugging Face Daily Papers research 19d ago

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Abstract Behavioral safety evaluations of large language models provide incomplete insights into internal robustness, as demonstrated by the audit gap between observable outputs and latent space vulnerabilities revealed through intervention-based testing. Generated by…

38
r/LocalLLaMA community 19d ago

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the…

25
NVIDIA Developer Blog official-blog 19d ago

Run DiffusionGemma on NVIDIA for Developer-Ready, High-Throughput Text Generation

Developers building real-time AI—such as chat assistants, copilots, and agentic workflows—are often constrained by token-by-token generation speed. This...

6
r/LocalLLaMA community 19d ago

SenseNova U1 dropped an infographic-specific finetune

it's the same U1-8B-MoT base with an extended MT (multi-task) training phase focused on structured visual output. the benchmark jumps are significant: IGenBench I-ACC (infographic accuracy): 4.2👉17.0 (4x); Chart Understanding: 51.3👉69.5; Text Rendering: 39.8👉46.6; Overall…

32
llama.cpp releases dev-tools 19d ago

b9589

CUDA: Fix ssm_scan_f32 data-races ( #24360 ) Add missing syncthreads before reusing cub_temp_storage. __syncthreads() is required before being allowed to reuse TempStorage smem:…

32
r/LocalLLaMA community 19d ago

1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM

Bonsai LM (1-bit and 1.58-bit LLMs) benchmark on Jetson Orin Nano Super. Just released a deep benchmark of 5 Bonsai LM models (1.7B → ~8B) on a $250 Jetson Orin Nano Super 8GB using llama.cpp CUDA, across all 4 power modes: 7W, 15W, 25W, and MAXN. A thread! So, Bonsai LM models…

29
r/LocalLLaMA community 19d ago

How long do you think it will take for the stock market to notice that Apple and Microsoft announced at the same time that they're all-in for local AI?

Microsoft's Surface with the crappy old Nvidia chip won't keep up with anything from Apple, but Microsoft wouldn't be on board if Nvidia didn't have a roadmap for more and better laptop chips. And Apple can crash the market on a whim by just announcing a line of products that…

12
r/LocalLLaMA community 19d ago

Qwen3.6-MTP-27B on Tesla V100 @ 55 TPS (llama.cpp) — Any way to push this higher without quality loss?

Hey everyone, I'm running Qwen3.6-MTP-27B-MTP (Q4_K_M) with llama.cpp server on a Tesla V100 , and I'm currently getting around 55 tokens/sec . I'm trying to find out whether there are any configuration changes that could increase throughput further without reducing output…

31
arXiv — NLP / Computation & Language research 20d ago

Where You Inject Diversity Matters: A Unified Framework for Diverse Generation

arXiv:2606.10302v1 Announce Type: new Abstract: Open-ended generation tasks often require a set of meaningfully different outputs, yet large language models often produce similar generations. Existing test-time diversity methods operate at different stages of generation with…

23
arXiv — NLP / Computation & Language research 20d ago

MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents

arXiv:2606.10304v1 Announce Type: new Abstract: When LLM agents are coerced into covertly encoding sensitive data (Base64, ROT13, acrostic, synonym chains, and beyond), the resulting outputs evade output-side detection but the underlying computation does not. Across nine…

37
arXiv — NLP / Computation & Language research 20d ago

Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation

arXiv:2606.10475v1 Announce Type: cross Abstract: Multi-agent debate frameworks have been shown to improve large language model performance in convergent tasks, but they are currently optimized in a way that heavily favors final output accuracy rather than stability of the…

18
arXiv — NLP / Computation & Language research 20d ago

Representation-Aware Advantage Estimation: Your Reward Model Provides More Than A Scalar Output

arXiv:2606.10528v1 Announce Type: cross Abstract: Current reinforcement learning from human feedback (RLHF) methods primarily rely on scalar rewards from a trained reward model (RM). While effective, scalar rewards are often noisy and fail to capture fine-grained preference…

32
The Information — AI news-outlet 20d ago

OpenAI in Talks to Lease 10 Gigawatt Ohio Data Center with Backing From Nvidia

OpenAI is in advanced negotiations to lease a proposed 10 gigawatt data center campus on federal land in Ohio as part of a deal that could include financial backing from Nvidia , according to two people with direct knowledge of the discussions. The campus under discussion would…

12
r/LocalLLaMA community 20d ago

Furiosa AI selling inference chip to consumer market will be a game changer to local llm

This is a South Korean startup that is all-in on inference chips: https://furiosa.ai/renegade-spec TSMC 5nm node, Hynix HBM3, 1.5TB/s, 48GB VRAM, TDP 180W. Already tested on LG LLM. If they opened their programming interface the way NVIDIA opens PTX and Intel opens SPIR-V, and team up…

12
r/LocalLLaMA community 20d ago

Since when the RTX 6000 PRO is priced at 13250USD on the official NVIDIA Page?

https://marketplace.nvidia.com/en-us/enterprise/laptops-workstations/nvidia-rtx-pro-6000-blackwell-workstation-edition/ submitted by /u/panchovix

7
Hugging Face Daily Papers research 20d ago

Phase Marginalization for Patch-Grid Instability in Vision Transformers

Abstract Phase Marginalization is a post-hoc method that addresses phase-dependent instability in Vision Transformers by evaluating structured patch-grid phases and aggregating outputs in the original image coordinate system. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision…

32
NVIDIA Developer Blog official-blog 20d ago

Delivering Lifecycle Control for AI Infrastructure at Scale with NVIDIA DGX Spark Enterprise Manageability

As AI infrastructure scales, enterprise expectations for operational maturity are increasing. Organizations expect these systems to be provisionable,...

38
