Tag

Gpu

500 articles archived under #gpu · RSS

The Information — AI news-outlet 28d ago

OpenAI Could Release Internal Tool That Would Weaken Nvidia’s Software Advantage

Morning! Anissa here. OpenAI is open to the idea of publicly sharing software it’s been developing to make its AI run on chips from different providers, a senior executive said, a move that could weaken one of Nvidia’s biggest advantages. The revealing comments came from Sachin…

24
r/LocalLLaMA community 28d ago

llama: limit max outputs of `llama_context` by am17an · Pull Request #23861 · ggml-org/llama.cpp

Overview: continue #23764, this PR only reserves logits space for n_seqs when possible. With -ub 2048 and MTP, this saves another 1.2GB of VRAM for me. I've tested llama-perplexity also and it seems to work fine. But maybe there is a better API, putting up as a draft for now…

22
llama.cpp releases dev-tools 28d ago

b9452

vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints ( #23056 ) Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even though they're only 2-byte aligned, and Q3_K still wins on NVIDIA as well. mesa isn't all that great at coalescing back-to-back loads from…

4
r/LocalLLaMA community 28d ago

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100

Hey all! I’ve been working on CUDA performance in mistral.rs, and v0.8.2 is focused on CUDA throughput. The result: on Gemma 4 (dense & MoE), mistral.rs is faster than llama.cpp at every point in my release sweep on GB10/H100/B200. See some results below on GB10 and B200:…

24
r/MachineLearning community 28d ago

5060 Ti 16GB or Cloud: Which makes more sense for DL, RL, and LLM studies/research? [D]

Hi everyone, If you have purchased (at least one) GPU(s) for ML/DL studies and research: How is your experience and is it worth it? What do you use it for and how is the ROI? I have a MacBook Pro with M4 from some years ago, while MPS is useful in many occasions, it's no…

29
Hugging Face Daily Papers research 28d ago

Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

Abstract Batch-1 autoregressive decoding in physical AI systems shows that memory bandwidth alone doesn't fully explain latency, with GPU speedup limited by launch overheads and quantization efficiency varying significantly across hardware platforms. AI-generated summary…

16
Ars Technica — AI news-outlet 28d ago

Intel: Our upcoming AI chip will be cheaper, run cooler than Nvidia, AMD options

Crescent Island is an air-cooled chip that uses LPDDR5 memory.

30
r/LocalLLaMA community 28d ago

MTP is nice and all, but what about PP speeds?

I don't know for the rest of you, but with my setup, as soon as i enable MTP, the PP performance and GPU usage drops significantly for some reason. It's not as much a memory issue for me as it is declining performance. My setup is: 2x Radeon VII 16gb on ROCm, 1x Rtx3080 8gb Max…

28
Hugging Face Daily Papers research 28d ago

Mellum2 Technical Report

Abstract Mellum 2 is an open-weight 12B-parameter Mixture-of-Experts language model with 2.5B active parameters per token, specialized in software engineering tasks and optimized for inference efficiency on commodity GPUs. AI-generated summary We present Mellum 2, an open-weight…

33
r/LocalLLaMA community 28d ago

Cheap V100 32gb

Mod remove if violation. But I thought some GPU poor folks would be interested V100 32GB $526 - $60 (SSUS60)*- $35 (PayPal) + $71 shipping YMMV *I might have used another for $75, not showing up on my list of coupons after use. Update: I think it was USAFF75 : $499-$75 Ordered…

11
Hacker News — AI on Front Page community 28d ago

Microsoft builds MacBook Pro rival with NVIDIA-powered Surface Laptop Ultra

https://www.microsoft.com/en-us/surface/devices/surface-lapt... https://blogs.windows.com/devices/2026/05/31/introducing-sur... Comments URL: https://news.ycombinator.com/item?id=48355720 Points: 206 # Comments: 430

6
r/LocalLLaMA community 29d ago

Entire world: We need more GPUs. Meanwhile, Jensen Huang:

submitted by /u/Nunki08

16
The Information — AI news-outlet 29d ago

Nvidia Unveils New Chip for PCs

Nvidia unveiled a new chip for personal computers alongside Microsoft on Monday, a major step into the PC chip market long led by Intel, Advanced Micro Devices and Apple. The new chip, called N1X, will power a new line of Windows computers starting this fall, Nvidia CEO Jensen…

33
r/LocalLLaMA community 29d ago

NVIDIA RTX Spark — Slim Laptops & Small Desktops

submitted by /u/zxyzyxz

9
The Information — AI news-outlet 29d ago

Intel to Ship New AI Chip This Year to Challenge Nvidia

Intel plans to ship a new AI chip by the end of this year, betting that a cheaper, simpler processor can give it a new foothold in a market dominated by Nvidia, according to a Financial Times interview with Kevork Kechichian, who leads Intel’s data centre group. The chip,…

28
Smol AI News news-outlet 29d ago

not much happened today

**NVIDIA** led open-source AI model releases with **Cosmos 3**, a comprehensive omnimodal world model unifying language, image, video, audio, and action using a Mixture-of-Transformers design, and **Nemotron 3 Ultra**, a **550B** parameter open-weight model noted for high…

33
Hacker News — AI on Front Page community 29d ago

Nvidia RTX Spark

Article URL: https://www.nvidia.com/en-us/products/rtx-spark/ Comments URL: https://news.ycombinator.com/item?id=48352939 Points: 204 # Comments: 164

20
NVIDIA Developer Blog official-blog 29d ago

How to Post-Train Autonomous Vehicle Models in Closed-Loop with NVIDIA Alpamayo

Developing autonomous vehicle (AV) policies requires bridging an important gap between training and deployment. Vision-language-action (VLA) models that can...

26
Hugging Face official-blog 29d ago

Welcome NVIDIA Cosmos 3: The First Open Omni-model for Physical AI Reasoning and Action

Published June 1, 2026. Authors: Asawaree (asawareeb, nvidia), Atharva Joshi (atharvajoshi10, nvidia). NVIDIA Cosmos 3 is here - and it's available on Hugging…

23
NVIDIA Developer Blog official-blog 29d ago

Develop Physical AI Reasoning, World, and Action Models with NVIDIA Cosmos 3

Physical AI systems must understand the real world before they can act within it. Robots, autonomous vehicles, and smart spaces need to understand what's...

21
r/LocalLLaMA community 29d ago

NVIDIA announces Nemotron 3 Ultra

submitted by /u/themixtergames

22
NVIDIA Developer Blog official-blog 29d ago

Advancing AI Infrastructure for Agentic AI with NVIDIA DOCA In-Silicon Security

The AI era is driving a new class of infrastructure: AI factories that transform data into intelligence for autonomous AI agents operating at unprecedented...

4
arXiv — Machine Learning research 29d ago

When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

arXiv:2605.30381v1 Announce Type: new Abstract: Deceptive alignment, in which models maintain accurate internal representations while deliberately producing false outputs, remains a central challenge in AI safety. While strategic deception is the primary long-term concern,…

24
arXiv — Machine Learning research 29d ago

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

arXiv:2605.30448v1 Announce Type: new Abstract: Black-box LLM distillation is usually evaluated as an output-matching problem: a student is considered successful when its responses are semantically similar to, or task-consistent with, those of a teacher. However, output…

19
arXiv — Machine Learning research 29d ago

Discovering a Zeta Map Algorithm on Dyck Paths via Mechanistic Interpretability

arXiv:2605.30482v1 Announce Type: new Abstract: Machine learning is increasingly used in mathematical discovery, but in mathematics the desired output is often not a prediction itself, but an explicit construction that can be checked independently. We study this setting through…

24
arXiv — Machine Learning research 29d ago

Reducing the GPU Memory Bottleneck with Lossless Compression for ML -- Extended

arXiv:2605.30728v1 Announce Type: new Abstract: Machine learning (ML) training and inference often process data sets far exceeding GPU memory capacity, forcing them to rely on PCIe for on-demand tensor transfers, causing critical transfer bottlenecks. Lossy compression has been…

32
arXiv — NLP / Computation & Language research 29d ago

Fine-Tuning Improves Information Conveyance in Language Models

arXiv:2605.30844v1 Announce Type: new Abstract: Fine-tuning is often believed to reduce uncertainty and diversity in large language models, but existing analyses overlook output length, a key confounder, and therefore fail to capture how uncertainty is distributed across an…

27
arXiv — NLP / Computation & Language research 29d ago

Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines

arXiv:2605.31183v1 Announce Type: new Abstract: Sparse Autoencoders (SAEs) have been seen as a promising avenue for exploring the internals of Large Language Models (LLMs) and for steering model output generation. When AxBench - a model steering benchmark - was introduced in Wu…

10
NVIDIA Developer Blog official-blog 29d ago

NVIDIA Vera CPU Sets a New Standard for Agentic Workloads in AI Factories

Each wave of AI has created a new scaling law. Pretraining scaled intelligence through larger datasets, more parameters, and massively parallel GPU systems....

4
NVIDIA Developer Blog official-blog 29d ago

NVIDIA DSX OS Delivers Open, Modular Software for Operating AI Factories at Scale

AI is now essential infrastructure, powered by AI factories that generate intelligence in the form of tokens. As demand grows, these factories must scale...

13
llama.cpp releases dev-tools 29d ago

b9445: ci: remove redundant or duplicate jobs (#23927)

remove redundant apple job openvino gpu and cpu test can share the same build and machine Update build-rpc.yml Update build-openvino.yml cpu any doesnt make sense as we have an arm job already, so do high perf on both x86 and arm remove duplicate x86 vulkan combine backend…

31
Hugging Face Daily Papers research 29d ago

FRAPPE: Full Input, Residual Output Autoencoding with Projection Pursuit Encoder

Abstract A novel autoencoding framework called FRAPPE uses a projection pursuit encoder to predict residuals from full input, enabling efficient variable-rate image compression with fast CPU-based encoding. AI-generated summary Media compression standards have reached a plateau…

32
Hugging Face Daily Papers research 29d ago

SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

Abstract SANA-Streaming enables real-time high-resolution video-to-video editing through a hybrid diffusion transformer architecture, cycle-reverse regularization, and efficient system co-design optimized for consumer GPUs. AI-generated summary Real-time streaming video-to-video…

27
r/LocalLLaMA community 29d ago

Use HTML as the primary chat language of your LLM's so they can make interactive content

A day or two back I posted about how you can use HTML directly as the output for your agent's chat. Many people mentioned that there was no point as with mermaid or graphviz agents can already draw diagrams, or that markdown was technically a superset of HTML (not that I've…

25
r/LocalLLaMA community 29d ago

Get you some GPUs, it's not worth the hacks around lack of RAM

If you can, get you some GPUs, all the hacks…

12
r/LocalLLaMA community 29d ago

GPU Prices. Buy now, or buy later?

If the Community could sound off on this, I'd be grateful. Do you think GPU prices are going to stop skyrocketing? Is this FOMO and hype driving the adoption of local inference? I wonder if this mass-market adoption will last for years? Is it a long-term trend? If I wait 6…

31
r/LocalLLaMA community 29d ago

I ported NVIDIA Parakeet (speech-to-text) to ggml: same output as NeMo, faster, GGUF-quantized, no Python

I ported NVIDIA's Parakeet speech-to-text models to pure C++/ggml (the engine behind llama.cpp and whisper.cpp). It runs the FastConformer TDT / CTC / RNNT / hybrid models with no Python and no PyTorch, on CPU and GPU (CUDA, HIP, Vulkan, Metal). The goal was to match NeMo…

30
r/LocalLLaMA community 29d ago

Whats actually happening when a model spills out of VRAM into system memory?

So as far as I understand it, llama.cpp can run models across multiple different sources of compute (multiple GPU, multi-core cpu, cpu+gpu, etc). However, what I'm not understanding is how that split occurs so that I can better optimize my settings and flags and whatnot. For…

31
r/LocalLLaMA community 29d ago

Qwen3.6-35B vs Gemma4-26B on 7900 XTX

Ran a fair comparison between Qwen3.6-35B-A3B and Gemma4-26B-A4B on my Radeon 7900 XTX. Both reasoning-enabled at matching 32K budgets, no output caps, six generic real-world prompts (meeting notes, incident postmortem, log triage to JSON, code review, a build-vs-buy decision, a…

9
r/LocalLLaMA community 29d ago

We might have a winner with the upcoming N1X

https://www.notebookcheck.net/Nvidia-s-N1X-and-N1-processors-leak-in-full-ahead-of-launch.1311497.0.html 16-channel DDR5 memory is going to give us the best of both worlds; the memory bandwidth is going to be greater than 500 GB/s. Edit: didn’t realize lpddr5 is 16-bit wide per…

23
Hacker News — AI on Front Page community 29d ago

I put a datacenter GPU in my gaming PC

Article URL: https://blog.tymscar.com/posts/v100localllm/ Comments URL: https://news.ycombinator.com/item?id=48345694 Points: 241 # Comments: 154

5
r/LocalLLaMA community 29d ago

13 abliterated Gemma 4 E2B variants, 44 GPU hours, Benchmark and Comparison - Abliterlitics

I compared 13 abliterated variants of Gemma 4 E2B across weight analysis, KL divergence, HarmBench safety, and 8 benchmark tasks. 44 GPU hours on a single RTX 5090. Here is what actually works and what destroys capabilities. coder3101's variant achieves 96% ASR with capability…

17
r/LocalLLaMA community 1mo ago

Flash Attention for llama.cpp on RDNA3: 47% less KV VRAM than Vulkan f16 K, KLD almost lossless on F16 K / q4_0 V. Part 1.

The normal tradeoff in llama.cpp attention is: quantize your KV cache and lose quality, or keep fp16 and burn VRAM. On RDNA3 there's a third option (from now on)! Pack four 8-bit K values into a single 32-bit and feed them directly to the GPU's native `sudot4` dot-product…

13
llama.cpp releases dev-tools 1mo ago

b9439

llama: only use one iGPU device by default ( #23897 ) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64 (Vulkan) Ubuntu arm64…

4
r/LocalLLaMA community 1mo ago

mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF just released !

Description of the module: I host 30+ free APEX MoE quantizations as independent research. My only local hardware is an NVIDIA DGX Spark (122 GB unified memory) — enough for ~30-50B-class MoEs, but bigger ones (200B+) require rented compute on H100/H200/Blackwell, typically…

26
r/LocalLLaMA community 1mo ago

Dell confirms XPS laptop with NVIDIA N1X at Computex ( basically a DGX Spark GB10 for consumers with Windows )

submitted by /u/fallingdowndizzyvr

24
r/LocalLLaMA community 1mo ago

All DGX Station GB300 OEM systems side-by-side in one image (roughly actual size)

Except for HP which I had to guesstimate from some Chinese guy's pic at a showcase because the ZGX Fury AI Station G1N's official page is locked down. Same reason why I didn't include Nvidia's Most underrated LLM system of 2026 and nothing comes close Assuming budget is…

23
r/LocalLLaMA community 1mo ago

nvidia/Qwen3.6-35B-A3B-NVFP4 · Hugging Face

The NVIDIA Qwen3.6-35B-A3B-NVFP4 model is the quantized version of Alibaba's Qwen3.6-35B-A3B model, which is an auto-regressive language model that uses an optimized transformer architecture. For more information, please check here. The NVIDIA Qwen3.6-35B-A3B-NVFP4 model is…

27
r/LocalLLaMA community 1mo ago

Fine tuning on DGX spark vs 4x 3090?

hey, my research direction don’t focus on inference or eval benchmarks. specifically, it’s mech interp research direction, analyzing how models do computation etc i dont have GPU, mostly using cloud GPUs loaned by third parties. i saved up some scholarship money by spending less…

32
r/LocalLLaMA community 1mo ago

Why does Thinking Output More Tokens Than a Response?

I was too lazy to use a vector DB + Embedding + Clustering for this list of 1000 items I wanted to categorize. I was hoping to use a local LLM to do it, but it would only respond with a list of about 100 items or so and their categories. It confused me because when I saw the…

22
