#gpu

500 articles archived under #gpu.

NVIDIA Developer Blog (official-blog) · 13d ago
Building AI Agents for AR Glasses and XR Devices with NVIDIA XR AI
Developers building for AR glasses and wearable devices face an infrastructure gap. The hardware is ready, but creating AI experiences requires integrating live...

r/LocalLLaMA (community) · 13d ago
I didn't know it was possible to compile llama.cpp to run CUDA + Vulkan at the same time
cmake -B build -G "Visual Studio 17 2022" -A x64 -DCUDAToolkit_ROOT="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" -DGGML_CUDA=ON -DGGML_VULKAN=ON -DGGML_FLASH_ATTN=ON -DGGML_BLAS=OFF -DGGML_NATIVE=OFF -DGGML_RPC=ON -DGGML_BACKEND_DL=ON…

NVIDIA Developer Blog (official-blog) · 13d ago
Build On-Device AI Companions with the NVIDIA ACE Game Agent SDK and Unreal Engine 5 Plugins
NVIDIA RTX technologies are deeply integrated into Unreal Engine 5 through the NVIDIA RTX Branch of Unreal Engine and the NVIDIA DLSS Unreal Engine plugin. This...

NVIDIA Developer Blog (official-blog) · 13d ago
How to Optimize Transformer-Based Models for Low-Precision Training
Transformer architectures are the backbone of many modern large language and generative AI models. As these models grow in size, training runs consume more GPU...

NVIDIA Developer Blog (official-blog) · 13d ago
NVIDIA Blackwell Tops MLPerf Training 6.0 with Industry-Leading Scale and Performance
NVIDIA delivered a clean sweep in MLPerf Training v6.0, the latest edition of industry-standard AI training benchmarks developed by the MLCommons consortium...

r/LocalLLaMA (community) · 13d ago
Joining all GPUs to train a community model
This sub controls an insane amount of collective VRAM. Why aren't we pooling our GPUs to train a massive community model? Are there any active distributed volunteer computing projects actually doing this right now?
I know the bottlenecks (latency, weight poisoning, nodes…

r/LocalLLaMA (community) · 14d ago
How are you running DeepSeek V4 Flash or Pro locally, for non-Mac users?
Seems all the Mac users are having fun with ds4. For those of us on non-Metal platforms who are running this locally, how are you running it: CPU, CUDA, ROCm, others?
Submitted by /u/segmond.

arXiv — Machine Learning (research) · 14d ago
TriAdReview: Triangular Adversarial Review Architecture for Multi-Model Technical Document Generation
arXiv:2606.15074v1 · Abstract: Large language models (LLMs) are increasingly used for technical document generation, yet single-model outputs often suffer from over-engineering, security blind spots, and incomplete coverage. We propose TriAdReview, a triangular…

arXiv — Machine Learning (research) · 14d ago
Discovering Lattice Reduction Strategies via Self-Play
arXiv:2606.15301v1 · Abstract: The Lenstra-Lenstra-Lovász (LLL) algorithm is a seminal contribution to computer science used for lattice basis reduction, yet its polynomial-time outputs produce bases that are far from optimal as the dimension grows. We show…

arXiv — Machine Learning (research) · 14d ago
Constitutional Value Potentials: reading and steering internal priority margins in language models
arXiv:2606.15420v1 · Abstract: A constitution tells a language model what to value, but little tells us whether it does. Adherence is judged from outputs, and output evidence is most fragile on value conflicts, where what matters is not which value a model…

arXiv — NLP / Computation & Language (research) · 14d ago
Contaminated Collaboration: Measuring Gender Bias Transfer in LLM-Assisted Student Writing
arXiv:2606.15914v1 · Abstract: Gender bias in LLMs has been studied extensively in model outputs, with biased prompts shown to amplify stereotyped generations.
Whether such bias propagates into text produced by humans who use these systems, however, remains…

arXiv — NLP / Computation & Language (research) · 14d ago
Formalize Once, Edit the Rest: Efficient Lean-Based Answer Selection for Math Reasoning
arXiv:2606.15972v1 · Abstract: With large language models (LLMs) increasingly applied to mathematical reasoning, formal proof assistants such as Lean can be leveraged to verify reasoning outputs with machine-checkable rigor, enabling use cases such as answer…

r/LocalLLaMA (community) · 14d ago
Finally: 4x RTX 5060 Ti nvtop showing clocks and PCIe speed while running gpu_burn
I wrote a while ago about my plans to put together a quad 5060 Ti 16 GB system after finding them nicely discounted. Everything got delayed due to issues with CPU seating (damn re-used stock cooler with plastic push…

Ars Technica — AI (news-outlet) · 14d ago
Chipmaker Nvidia seeks to raise over $25B in first bond deal since 2021
Debt sale set to test investor appetite for further exposure to AI sector amid a deluge of borrowing.

NVIDIA Developer Blog (official-blog) · 14d ago
Fine-Tuning Biological Foundation Models with LoRA Using NVIDIA BioNeMo Recipes
Foundation models are reshaping computational biology. Pretrained on massive corpora of protein or genomic sequences, models such as ESM2 (a protein language...

The Information — AI (news-outlet) · 14d ago
Nvidia Plans To Raise At Least $20 Billion In Bonds
Nvidia said Monday it plans to raise new debt even as the AI chip leader keeps generating tens of billions of dollars in cash every quarter. It will be the company’s first corporate bond sale since 2021, when it raised $5 billion.
Bloomberg earlier reported that Nvidia would…

The Information — AI (news-outlet) · 14d ago
Nvidia’s Share of AI Inference Chip Market Appears to Be Rising
As AI developers and cloud providers have launched server chips to lessen their dependence on Nvidia’s, some analysts and executives at these firms expected the chips to eat into Nvidia’s market share. That doesn’t seem to be happening. Nvidia has actually increased its share of…

r/LocalLLaMA (community) · 14d ago
Buying AI accelerators/GPUs in China...
Bit of a long shot, this, but it happens I'll be in China next week. Just wondering if there are any Chinese graphics cards/AI accelerators I should be trying to buy when I'm there? :-) I would be looking for something that lets me run inference on big models (so, lots of (V?)RAM), but…

r/LocalLLaMA (community) · 14d ago
React Native ExecuTorch now runs Gemma 4 (Vulkan and MLX accelerated)
We've integrated Gemma 4 into react-native-executorch. You can now run it fully offline in your React Native app, with GPU acceleration via the Vulkan delegate on Android and the MLX delegate on Apple Silicon.

r/MachineLearning (community) · 14d ago
Recent CS graduate looking for GPU compute collaborators for LLM/VLM research [D]
Hi everyone, I’m a recent CS graduate working mainly on NLP/LLMs and VLM failures. I’m currently in a phase where I can dedicate a lot of focused time to research, but the main bottleneck holding me back is compute. I know “asking for GPUs” can sound vague or unserious, so I…

The Information — AI (news-outlet) · 14d ago
Exclusive: Nvidia Server Marketplace Startup Raises $100 Million at $800 Million Valuation
Data center software startup and AI-server broker Hydra Host has raised $100 million at a valuation of close to $800 million, led by Kindred Ventures.
Nvidia, Cathie Wood’s ARK Invest, early CoreWeave backer Magnetar, and existing investors Founders Fund and Flume Ventures also…

llama.cpp releases (dev-tools) · 14d ago
b9642: CUDA: only support F32/F16 for GGML_OP_REPEAT (#24533)
macOS/iOS: macOS Apple Silicon (arm64), macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED, macOS Intel (x64), iOS XCFramework. Linux: Ubuntu x64 (CPU), Ubuntu arm64 (CPU), Ubuntu s390x (CPU), Ubuntu x64 (Vulkan), Ubuntu arm64…

arXiv — Machine Learning (research) · 15d ago
A fully GPU-based workflow for building physics emulators of hypersonic flows
arXiv:2606.13742v1 · Abstract: The ability to resolve complex physical phenomena with high fidelity and at low computational cost is central to addressing key challenges in modern engineering. A prime example lies in hypersonic flows, where the precise…

arXiv — Machine Learning (research) · 15d ago
Realizing Native INT8 Compute for Diffusion Transformers on Consumer GPUs: A Fused INT8 GEMM Kernel for Ideogram 4.0
arXiv:2606.14598v1 · Abstract: Post-training INT8 (W8A8) quantization of diffusion transformers is widely deployed as a speed optimization, yet on consumer Ampere GPUs it is frequently slower than the FP8 and NF4 alternatives it is meant to beat. We trace this…

arXiv — NLP / Computation & Language (research) · 15d ago
The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation
arXiv:2606.13685v1 · Abstract: LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks…

arXiv — NLP / Computation & Language (research) · 15d ago
Harsher on Male?
Evaluating LLMs on Gender-Asymmetric Moral Framing Across Diverse Conflict Scenarios
arXiv:2606.14068v1 · Abstract: Existing studies on gender bias in LLMs have largely focused on stereotypes, occupational associations, or explicit harmful outputs. In this work, we ask whether LLMs apply consistent response standards to the same negative…

llama.cpp releases (dev-tools) · 15d ago
b9641: ggml-webgpu: improve i-quants mul_mat performance and speed up prefil…

r/MachineLearning (community) · 15d ago
Coherent Context Can Silently Shift LLMs Into a Different Internal Regime — And Current Safety Systems Are Blind To It [D]
I’m an independent researcher currently exploring what I believe is an important phenomenon for both mechanistic interpretability and AI safety. Core idea: A strong, coherent target text can move the model into a different internal regime — before the final output is produced…

r/LocalLLaMA (community) · 15d ago
Qwen 3.6 35B-A3B @ Q4 or Gemma 4 12B @ Q8?
Wondering how much model quantization matters here. My daily driver on my 32 GB unified memory setup is the Qwen model, outputting ~15 tokens a second. Heard good things about the 12B Gemma 4 model, so I'm interested in trying it against my codebase. Given its size I can very comfortably…

r/LocalLLaMA (community) · 15d ago
Gemma 4 models benchmarked on a triple-GPU setup
Hearing good things about Gemma 4. Ran a few models across my llama box. Kubuntu 26.04 OS. AMD Ryzen 5 3600 6-core CPU. 48 GiB of DDR4 3600 MHz RAM. Nvidia GTX 1070 with 8 GiB VRAM (x3), 24 GiB VRAM total. GPUs have power limits set to 120, 121, and 122 watts using: sudo…

r/LocalLLaMA (community) · 15d ago
Strange pp and tg numbers on RX 7900 XTX with ROCm and Vulkan, Qwen3.6-27b non-MTP and MTP
So I'm getting very unsatisfactory results running this model locally.
OS: Ubuntu 24.04.4 LTS; Linux kernel: 6.8.0-124-generic; GPU: RX 7900 XTX / gfx1100; llama.cpp: b9630 / 8ed274ef4; ROCm: 7.2.4; AMD driver: 6.16.13; Vulkan: API 1.4.330, Mesa 26.0.0-devel; Raw Backend…

r/LocalLLaMA (community) · 15d ago
Qwen 27B Q6/Q8 KV + MTP at 256K on DGX Spark / GB10, tok/s?
Has anyone tested Qwen3.6-27B on NVIDIA DGX Spark / GB10 or similar systems at 256K context? I know it's a dense model, but I'm curious how it performs with MTP enabled. Looking for real numbers with: Q6/Q8 quant, Q8 KV cache, MTP/speculative decoding, 256K context. Mainly…

r/LocalLLaMA (community) · 15d ago
Dual DGX Sparks: 40 tk/s single 1M; 350 tk/s agg.; DeepSeek V4 Flash (vs RTX Pro 6000 vs Mac M2 Ultra 192)
First of all, shout out to Aiden/Antirez and the geniuses at the Nvidia community threads. I'm merely claude-vibing off of their works. That said, I thought I'd share recipes, learnings, and benchmarks so far on running big MoE models on two DGX Sparks at a reasonable speed for agent…

r/LocalLLaMA (community) · 15d ago
Build for local LLM with 2 separate GPUs
I want to build a headless compute machine to run an RTX Ada 4000 (20GB) with an RTX Pro 5000 (48GB) or RTX PRO 4500 (32GB) in parallel for inference. The goal is not running one large model using 2x GPUs, but rather running separate models on each GPU. Why this GPU config?…

r/LocalLLaMA (community) · 16d ago
Dual R9700 AI Pro for training LLMs?
I am a developer and need a high-VRAM machine to fine-tune LLMs. How has your experience been with fine-tuning/training on multi-GPU setups with 2x R9700 AMD AI Pro GPUs?
Submitted by /u/AppropriatePush6262.

r/LocalLLaMA (community) · 16d ago
Yay, got Gemma 12B QAT working on old 1080 Ti (maybe with speculative decoding?)
Pretty happy with 50 tok/sec on this 9-year-old GPU. Suggestions to improve anything (speed or quality) very welcome! I'm not 100% sure how to tell if the speculative decoding "model-draft" is helping or not.
But hey, it is fast and seems coherent, I'm happy. bash…

r/LocalLLaMA (community) · 17d ago
3090 died, good night sweet prince
Feelsbadman.jpeg. Once you've tasted 4x GPUs and almost-BF16 models with BF16 KV cache, you can't go back 😞. AND IT'S THE WEEKEND OH MAN.
Submitted by /u/fragment_me.

r/LocalLLaMA (community) · 17d ago
Diffusion Gemma is 4x faster, but makes 6x more mistakes!
Benchmarked the new Gemma diffusion model against its autoregressive twin on a single H100 (FP8). We gave each the same three tasks: write a Steve Jobs biography, the history of Tetris, and the story of BeOS, each topic less popular than the previous one. Then we…

NVIDIA Developer Blog (official-blog) · 17d ago
NVIDIA Achieves Leading Agentic Coding Performance on First Agentic AI Benchmark
AI agents have fundamentally changed the complexity of inference workloads. Until now, the industry has struggled to define a standard for measuring how...

r/LocalLLaMA (community) · 17d ago
Local LLMs aren't democratic anymore... the hardware barrier has gotten out of hand.
When we first started experimenting with local LLMs, it was a completely different story! We were using gaming GPUs to tinker around. 8GB or 16GB of VRAM (which wasn't even a given for everyone) was the norm, and so many people could actually get their hands dirty and…

r/LocalLLaMA (community) · 17d ago
Do you ever see a post so bad that you ask yourself what the prompt was, if this is the output, and what model wrote it?
I'm talking about posts that read as schizophrenic, absolute nonsense, but it's clear from the structure and the em dashes that they were pasted from AI.
Submitted by /u/George__Roid.

r/LocalLLaMA (community) · 17d ago
Comparing dual-GPU inference speed between llama.cpp row/tensor split and ik_llama graph split
Setup: NVIDIA-SMI 610.43.02, KMD Version: 610.43.02, CUDA UMD Version: 13.3…

TechCrunch — AI (news-outlet) · 17d ago
SpaceX, Anthropic, and OpenAI’s hot IPO summer
The IPO market is back, and it’s not the same companies leading the charge. FAANG had a good run, but a new acronym is taking over: MANGOS — Meta (or Microsoft, depending on who you ask), Anthropic, Nvidia, Google, OpenAI, and SpaceX…

TechCrunch — AI (news-outlet) · 17d ago
It’s hot IPO summer, and the MANGOS are ripe
The IPO market is back, and it’s not the same companies leading the charge. FAANG had a good run, but a new acronym is taking over: MANGOS — Meta (or Microsoft, depending on who you ask), Anthropic, Nvidia, Google, OpenAI, and SpaceX…

NVIDIA Developer Blog (official-blog) · 17d ago
Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure
As enterprise AI adoption scales, developers are increasingly forced to stitch together fragmented pipelines—separate models for text, vision, and...

The Information — AI (news-outlet) · 17d ago
Meta Bought Rivos to Accelerate Its AI Chip Push. It Isn’t Working.
Meta Platforms bought semiconductor startup Rivos last year to accelerate development of in-house chips and reduce its reliance on Nvidia as it pours cash into data centers for its AI ambitions.
Now six months since the acquisition closed, Meta is struggling to make it work,…

Hugging Face Daily Papers (research) · 17d ago
Surflo: Consistent 3D Surface Flow Model with Global State
Abstract: Surflo compresses unposed RGB views into latent tokens and decodes 3D surface points through flow matching, enabling flexible resolution output and efficient processing compared to existing methods. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Geometry is invariant to…

The Information — AI (news-outlet) · 17d ago
Nvidia Pitches Vera CPU to Chinese Customers
Nvidia is pitching Chinese customers on its new Vera central processing units for AI data centers, telling them the chips could be available as soon as August and that orders can begin now, Reuters reported, citing three people familiar with the matter. The push gives Nvidia…

Hugging Face Daily Papers (research) · 17d ago
Flash-GMM: A Memory-Efficient Kernel for Scalable Soft Clustering
Abstract: Flash-GMM introduces an efficient fused Triton kernel for Gaussian Mixture Models that achieves significant speedup and enables processing much larger datasets on a single GPU. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) We present Flash-GMM, a fused Triton kernel for…

llama.cpp releases (dev-tools) · 17d ago
b9605: ggml: support concat for scalar types at cuda backend (#24011)
cuda: support concat for scalar types; Update concat.cu; fix metal ci issue.
macOS/iOS: macOS Apple Silicon (arm64), macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED, macOS Intel (x64), iOS XCFramework. Linux:…

Page 5 of 10 · 500 articles