Tag: GPU

500 articles archived under #gpu

NVIDIA Developer Blog official-blog 20d ago

Delivering Lifecycle Control for AI Infrastructure at Scale with NVIDIA DGX Spark Enterprise Manageability

As AI infrastructure scales, enterprise expectations for operational maturity are increasing. Organizations expect these systems to be provisionable,...

38
NVIDIA Developer Blog official-blog 20d ago

Model Quantization: Turn FP8 Checkpoints into High-Performance Inference Engines with NVIDIA TensorRT

Converting a quantized checkpoint into an NVIDIA TensorRT engine bridges the gap between model optimization and production deployment, enabling faster...

6
NVIDIA Developer Blog official-blog 20d ago

Accelerating Federated Learning Research with AI Agents and NVIDIA FLARE Auto-FL

Federated learning (FL) research often begins with a deceptively simple question: What should we try next? A new aggregation rule, a FedProx coefficient, a...

16
NVIDIA Developer Blog official-blog 20d ago

Evaluate Clinical ASR Models Faster with Agent Skills and NVIDIA Nemotron Speech

Training a speech AI model to correctly recognize or synthesize clinical terminology is surprisingly difficult. Drug names like Acetaminophen, Amlodipine,...

9
r/LocalLLaMA community 20d ago

PSA: Throttle GPU power limits, with minor performance deficits

I just feel I need to post this here again so more people see it: experiment with throttling the power limits of your GPUs, and you will often find that you can save tons of power with only minor performance deficits. On my dual Radeon VII setup, I went from 250 to 100 watts per card,…

11
Hugging Face Daily Papers research 20d ago

Liberating LLM Capabilities in Full-Duplex Speech Models

Abstract: A text-first tri-channel speech interface enables real-time interaction with visible text output alongside spoken responses, demonstrating superior performance in full-duplex conversational tasks. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Speech-based large language…

21
llama.cpp releases dev-tools 21d ago

b9572

ggml-cpu : fix rms_norm_back wrong output under in-place aliasing ( #24305 ) ggml-cpu : fix rms_norm_back wrong output under in-place aliasing cont : clean-up comment Co-authored-by: Georgi Gerganov macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon…

27
Hugging Face Daily Papers research 21d ago

SwiftVR: Real-Time One-Step Generative Video Restoration

Abstract: SwiftVR enables real-time video restoration on consumer GPUs through efficient attention mechanisms and lightweight autoencoding, achieving high frame rates at 4K resolution. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Real-time video restoration (VR) for live streams…

33
llama.cpp releases dev-tools 21d ago

b9570

ggml-webgpu: Add clang-format job ( #24308 ) Add clang-format job try local formatting macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU)…

34
arXiv — Machine Learning research 21d ago

Teacher-Free Self-Training Amplifies but Does Not Compound: A Pass@$K$ Crossover on a Free-Verifier Domain

arXiv:2606.07856v1 Announce Type: new Abstract: When a language model trains on its own verified outputs, does it acquire capability beyond its base, or merely get better at expressing capability the base already had? We make the question decidable with a teacher-free…

15
arXiv — Machine Learning research 21d ago

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

arXiv:2606.08044v1 Announce Type: new Abstract: Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under…

33
Hugging Face Daily Papers research 21d ago

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Abstract: Lookahead Sparse Attention with Neural Memory Indexer reduces GPU memory usage for long-context LLM inference while maintaining accuracy through proactive KV cache management and decoupled training. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Conventional LLMs keep the…

19
r/LocalLLaMA community 21d ago

ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul for Q4/Q5/Q8 and k-quants by yomaytk · Pull Request #24225 · ggml-org/llama.cpp

This PR improves matmul performance for k-quants. The following table shows the improvement on the pp512 test on an M2 Pro.
quant | model | master (t/s) | PR (t/s) | speedup
Q2_K | qwen3 0.6B Q2_K - Medium | 817.86 ± 6.14 | 1991.81 ± 6.87 | 2.44x
Q3_K | qwen35 4B Q3_K - Medium | 92.54 ± 0.13 | 302.24 ±…

38
r/LocalLLaMA community 21d ago

I fine-tuned Parakeet 0.6B for medical ASR — open weights, local Mac/CUDA/CPU

I fine-tuned NVIDIA's Parakeet TDT 0.6B v2 for clinical speech and am releasing the weights as Omi Med STT v1 (CC-BY-4.0). Disclosure: I'm the founder of Omi Health and built this. Happy to dig into the training mix, benchmark, failure cases, quantization, or anything else. The…

14
NVIDIA Developer Blog official-blog 21d ago

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

Pre-training frontier LLMs comes down to throughput. When training spans trillions of tokens across thousands of accelerators, every percentage point of step...

34
llama.cpp releases dev-tools 21d ago

b9565

[ggml-webgpu] Handle buffer overlap / buffer aliasing for concat operator ( #24000 ) Only run webgpu CI on my fork Add webgpu only workflow handle buffer overlap case for concat operator restore build-webgpu.yml Co-Authored-By: Claude Sonnet 4.6 Run…

14
Hugging Face Daily Papers research 21d ago

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

Abstract: UnpredictaBench evaluates large language models' capacity to sample from target distributions, revealing significant gaps in their ability to simulate unpredictable systems despite recent advances in output diversity. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) We…

7
r/LocalLLaMA community 21d ago

Friends from the localllama community, if you love local llm, don't participate in the IPO (spaceX, OpenAI, Anthropic)

I'm not going to. And you shouldn't either. The frontier labs are the ones who are harming our community. They are jacking the hardware prices up. First it was NVIDIA GPUs. And then it was RAM. And then SSDs. And now HDD prices are 3x compared to last year. Even NAS prices are…

35
llama.cpp releases dev-tools 21d ago

b9564

[ggml-webgpu] Implement 2D workgroups for scale, binary, and unary ops ( #24044 ) Only run webgpu CI on my fork Add webgpu only workflow Implement 2d workgroups for more operations fix Fix type Move back to global_invocation_id macOS/iOS: macOS Apple Silicon (arm64) macOS Apple…

24
r/LocalLLaMA community 21d ago

Xiaomi just claimed 1,000+ tps on a 1T model using a standard 8-GPU server

Just saw Xiaomi MiMo announce MiMo-V2.5-Pro UltraSpeed, claiming they broke the 1,000 tokens/sec output barrier on a 1 trillion parameter MoE model. According to them, they’re doing it on a single standard 8-GPU node, not custom wafer-scale hardware like Cerebras and not…

34
r/LocalLLaMA community 21d ago

Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax

Hey fellow Llamas, your time is precious, so I'll keep it short (while trying to explain everything lol). TL;DR: 33-35B MoE on a 16 GB GPU. Qwen3.6 35B-A3B: 13.3 GiB (was ~20.5). Laguna XS.2 33B-A3B: 14.6 GiB (was 18.8). Both measured on an RTX 3090, both under 16 GiB. Only the…

33
llama.cpp releases dev-tools 21d ago

b9557

cuda: reset cuda context after reading memory size ( #23935 ) cuda: reset device in get_memory function if no backend is active also count device and host buffers exclude hip and musa from counting and device reset use device mutex instead of atomic undo backend_free function…

34
r/LocalLLaMA community 21d ago

[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better]

These last few weeks have been a godsend for 24GB (and below) GPU-poor peeps: killer models released (Gemma 4 / Qwen 3.6), free intelligence via QAT, and bonus speed via MTP. We're at the tipping point where GPU-poor (24GB and below) people are actually NOT poor any more. I was already…

20
The Information — AI news-outlet 21d ago

Google and Nvidia Consider Intel as Backup Chip Manufacturer

TSMC’s capacity struggles are turning into a boon for Intel. As the Taiwanese chip making giant struggles to meet overwhelming demand for its chip manufacturing capacity, several major AI chip design companies, including Google and Nvidia, are quietly turning to Intel as a…

36
The Information — AI news-outlet 21d ago

Nvidia, SK Hynix Sign Multi-Year Deal for Next-gen AI Memory

Nvidia and SK Hynix have signed a multiyear deal to work together on advanced memory chips, as AI demand strains global memory supply. The agreement covers chip design and manufacturing, and includes memory for Nvidia’s Vera Rubin platform, its next major AI system. The deal,…

23
Stratechery (Ben Thompson) community 21d ago

Google Buys Compute From SpaceX, Broadcom’s Outlook, Apple’s AI Politics

Google's deal with SpaceX, and Broadcom's earnings, both seem bullish for Nvidia. Then, what I'm looking for at WWDC.

36
Hugging Face Daily Papers research 22d ago

SIA: Self Improving AI with Harness & Weight Updates

Abstract: A self-improving AI framework simultaneously updates both model weights and task-specific agent architecture through a language-model feedback agent across legal classification, GPU optimization, and biological data denoising tasks. Generated by…

20
arXiv — Machine Learning research 22d ago

Gaussian Process Latent Factor Regression for Low-Data, High-Dimensional Output Problems

arXiv:2606.06576v1 Announce Type: new Abstract: In the sciences, regression tasks often require predicting high-dimensional outputs from few training examples. Multi-output Gaussian processes excel in low-data regimes but typically struggle with high-dimensional outputs.…

14
arXiv — Machine Learning research 22d ago

TorchKM: A GPU-Oriented Library for Kernel Learning and Model Selection

arXiv:2606.06742v1 Announce Type: new Abstract: TorchKM is an open-source library for kernel machines, including support vector machines, kernel logistic regression, and kernel quantile regression, with GPU acceleration. The library features a scikit-learn-style API and is…

36
arXiv — Machine Learning research 22d ago

Trio: Learning Time-Series Forecasting with Temporal-Spatial-Sample Attention and Structural Causal Priors

arXiv:2606.07291v1 Announce Type: new Abstract: Multivariate time-series forecasting requires models to reason over temporal dynamics, cross-variable dependencies, and historical input-output correspondences. Recent Prior-Data Fitted Networks (PFNs) suggest that synthetic tasks…

8
arXiv — Machine Learning research 22d ago

Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling

arXiv:2606.07404v1 Announce Type: new Abstract: This paper reports on training a hundred-billion-parameter sparse mixture of experts on a single eight-GPU node, end to end. LightningLM 0.1V is a recurrence-backbone language model family grown in four stages from a small dense…

28
arXiv — NLP / Computation & Language research 22d ago

TA-RAG: Tone-Aware Retrieval-Augmented Generation for Peer-Support Health Communication

arXiv:2606.06794v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) successfully grounds large language model (LLM) outputs in trusted documents, but factual grounding alone is insufficient for sensitive peer-support health communication. In domains such as HIV…

25
arXiv — NLP / Computation & Language research 22d ago

Korean Culture into LLM Alignment: Toward Cultural Coherence

arXiv:2606.06797v1 Announce Type: new Abstract: Cultural-aspect work on large language models is dominated by a negative target: which outputs to suppress. We argue that a constructive counterpart is also needed, a working definition of what a culturally coherent response is…

15
arXiv — NLP / Computation & Language research 22d ago

Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

arXiv:2606.06840v1 Announce Type: new Abstract: Modern reasoning models offer surprisingly strong zero-shot performance on challenging multi-label tasks that require selecting a small set of relevant options from hundreds of thousands to millions of candidate labels. We…

30
llama.cpp releases dev-tools 22d ago

b9554: [SYCL] Update compute runtime version to 26.x in docker (#24070)

update compute runtime from 25 to 26 in docker add comment with old driver for multiple GPUs

12
r/LocalLLaMA community 22d ago

llama-server router: a model pinned to one GPU still grabs a CUDA context on every card, so it OOMs when my others are full. Am I missing a flag or is this just how it is?

Running into something annoying with llama-server in router mode (`--models-preset`) and I can't tell if I'm missing a flag or if this is just how it works. My rig is 2x 3090, 2x 4060 Ti (one's unplugged at the moment, riser got repurposed) and a 5060 Ti. I run a single…

28
r/LocalLLaMA community 22d ago

Qwen 3.6 27B on DeepSWE

Overview: It scored 2% (1.79% rounded up). It is in 18/20th place, scoring above Haiku 4.5 and Minimax M2.7. Full benchmark took 70 hours. Average time per task: 32m. Average output tokens per task: 44k. Perspectives: It scored suspiciously similar to 3.6 Plus and it really gets me…

21
r/LocalLLaMA community 22d ago

Clustering 3x Jetson Nano Orin Supers

Hey everyone! Recently, I released a blog post on how to set up a cluster out of your Raspberry Pi 4Bs and Mac minis for distributed training and inference. Now it's time to do the same with Jetson Nano Orin Super! Why? - 1024 CUDA Cores (Ampere) - 8GB unified memory LPDDR5 - 6x ARM…

26
r/LocalLLaMA community 23d ago

Gemma 4 31B QAT GGUF loads with MTP branch, but outputs repeated <unused49> - any working recipe?

Update: you were right to suggest checking the hash. My cached GGUF blob was corrupt. HF expected SHA256: 9188a71055550f1e60b875d02b7abb63625ac11b4a6f148d6b22b3b28ba3d335 My old local blob hashed to: 20e9ffda0c1a0fb5b6ed9cc445834e5c3e98a1f9ffe4a64edf319cbd0aa85fba I moved the…

9
r/LocalLLaMA community 23d ago

You don't need a GPU to run gemma-4-26B-A4B

I've been running LLMs on my old potato i5-8500 with 32GB of RAM and *no GPU* for a while now, running up to 12B dense models, which run slow but are perfectly usable. But this Gemma-4-26B-A4B simply flies on this CPU-only machine using Koboldcpp on Linux. That's right, an old used…

15
r/LocalLLaMA community 23d ago

Cool stuff to do with NVIDIA RTX 6000 PRO 96GB VRAM

I have been a C++ dev for 3 years and have done PyTorch in my free time for just as long (not that good at the latter). Now, I was lucky enough to get a brand new GPU from a colleague. What are some cool side projects I can build to learn tons about ML and inference/infra? Please don't…

33
r/LocalLLaMA community 23d ago

dvlt.cu: inference engine written from scratch in CUDA/C++ for NVIDIA's DVLT 3D transformer model

I'm into both HPC and 3D reconstruction, so I built this as a side project. dvlt.cu is a single 5MB binary: - No python, torch, TF, ONNX, llama.cpp, vLLM, or huggingface runtime - Nearly no dependencies: only cuBLASLt (shipped with libcuda) + cuTLASS (header only lib) - mmap'd…

21
r/LocalLLaMA community 23d ago

Best Coding Harness for Qwen3.6 35B?

I've been happily using GitHub Copilot for 7-8 months, primarily in Visual Studio and VS Code, mostly with the built-in flagship models and have felt like the output is worth the cost. Lately I've been playing with a lot of different local LLM models and decided to try using…

32
r/LocalLLaMA community 23d ago

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

Google just released the QAT (Quantization-Aware Training) variant of their Gemma 4 models, including 12B, so it was only natural for me to benchmark it on my 12GB GPU since it fits entirely in VRAM. I was pleasantly surprised with the result! By using llama.cpp patched with the…

17
Hacker News — AI on Front Page community 23d ago

Nvidia is proposing a beast of a CPU system for Windows PCs

Article URL: https://twitter.com/lemire/status/2062880075117113739 Comments URL: https://news.ycombinator.com/item?id=48424605 Points: 218 # Comments: 405

22
r/LocalLLaMA community 23d ago

Has there been any recent new development on which quant is considered optimal?

I recall that in earlier days, q4 was said to be optimal. That is to say, if you have a small q8 model, a medium q4 model, and a large q2 model, all using the same amount of GPU VRAM, the medium q4 would be the best-performing model. I also know that Apple (crazy that I am citing Apple here,…

23
r/LocalLLaMA community 23d ago

Qwen 3.6 27B MTP - Adding spec-type and spec-draft-n-max is dropping tps and reducing GPU utilization

I have a 5090 power limited to 475W. When I run the following command, it barely hits 300W and I get something like 30 t/s: bash ./llama-server \ -m ~/myp/models/unsloth_mtp_Qwen3.6-27B-UD-Q5_K_XL.gguf \ --host 0.0.0.0 \ --port 8080 \ --chat-template-kwargs…

13
llama.cpp releases dev-tools 24d ago

b9537

context : fix off-by-one comparisons to n_gpu_layers ( #24208 ) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64 (Vulkan)…

37
The Information — AI news-outlet 24d ago

Google Agrees to Pay SpaceX $920 Million Monthly for Compute Access

Google has agreed to pay $920 million a month to purchase compute capacity from SpaceX, SpaceX said in a filing. The deal will run from October 2026 through June 2029, and Google will be able to access approximately 110,000 Nvidia GPUs, CPUs and related components, according to…

25
r/LocalLLaMA community 24d ago

Running Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB) — what worked, what didn't, and a surprising speculative-decoding result

TL;DR: I spent a long session tuning a 35B MoE on a tiny 8GB laptop GPU. Three things mattered a lot (--no-mmap, VRAM headroom, closing CPU-hungry apps). Several "obvious" optimizations did nothing because of this model's hybrid architecture (TurboQuant, Flash Attention, even…

19
