Tag

Gpu

500 articles archived under #gpu · RSS

arXiv — NLP / Computation & Language research 5d ago

Scale or Reason? A Compute-Equivalent Analysis of Reasoning Distillation

arXiv:2509.22193v2 Announce Type: replace Abstract: Distilling reasoning traces from strong teacher models has become the standard recipe for building capable small language models. Yet reasoning traces are 5-20$\times$ longer than standard instruction fine-tuning (IFT) outputs,…

19
r/LocalLLaMA community 5d ago

Any chance I could cluster my DGX Spark (128GB unified memory) and my AMD Ryzen AI Max 395 (128GM unified memory) together to run 1 model?

Hey all, So I have a Nvidia DGX Spark and an AMD Strix 395, both have 128GB of unified memory. The Spark has 200Gbit network and the AMD Strix has 5Gbit ethernet (but it has a pcie gen 4x4 slot). Is there any chance I can cluster the 2 together to run a larger model that can fit…

30
r/LocalLLaMA community 5d ago

SDXL running locally in the browser on WebGPU, open-source

I needed simple local image generation without the usual setup. No virtual environments, no ComfyUI with a complex graph and installation as an exe. So i tried to push the whole thing into the browser and run it on WebGPU. It's a browser extension. You install it, then it loads…

13
r/MachineLearning community 5d ago

MuJoCo derived Simulator for High Fidelity Vision RL training natively on GPU [D]

Hi everyone, For the past couple of weeks I have been working on a simulator project considering the shortcomings of MuJoCo. There are things that people like and also don't like about MuJoCo, like the CPU dependency on MuJoCo which makes the simulation not parallelizable beyond…

31
r/LocalLLaMA community 5d ago

MINISFORUM DEG1 Oculink eGPU Dock Refurbished - $59

I got one of these refurbished units last year. I have nothing but good things to say about it. It works great. It has heft to it to keep the GPU secure. And unlike some cheaper Oculink docks, it has redrivers for signal integrity.   submitted by   /u/fallingdowndizzyvr…

20
r/LocalLLaMA community 5d ago

Has anyone else found vLLM outputs noticeably worse than llama.cpp for the same model?

I'm wondering if anyone else has come across this. I've tested the same model on llama.cpp and vLLM with similar settings and quantizations. The performance and concurrency in vLLM are much noticeably better, but sometimes the model feels less reliable. Some things I've noticed:…

27
NVIDIA Developer Blog official-blog 5d ago

Accelerating BEV Pooling on NVIDIA GPUs for Physical AI Applications

An increasingly common design pattern for autonomous vehicles (AVs), robotics, and spatial AI systems is bird's-eye-view (BEV) perception. BEV models project...

31
Hugging Face official-blog 5d ago

Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel

Back to Articles a]:hidden"> Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel Enterprise + Article Published June 24, 2026 Upvote - Adil Asif adil-asif nvidia Alexandros Koumparoulis akoumpa nvidia Wenwen Gao wgao2021 nvidia Sylendran Arunagiri Sylendran95 nvidia…

29
Hacker News — AI on Front Page community 5d ago

45°C cooling design cuts data center water use to near zero

Article URL: https://blogs.nvidia.com/blog/liquid-cooling-ai-factories/ Comments URL: https://news.ycombinator.com/item?id=48660178 Points: 206 # Comments: 157

22
r/LocalLLaMA community 5d ago

I did some model hacks, and got GLM5.2 from about 2.5 tok/s to >50 tok/s on my GH200 system.

G'day. This is part 3 on my Local LLM adventures. I have a crazy system hacked server-to-desktop system : Component Spec GPUs 2x Hopper H100, 96 GB HBM3 each CPUs 2x Grace, 72 cores each Host memory 480 GB LPDDR5X per Grace, 960 GB total So I can run technically run GLM5.2.…

34
r/MachineLearning community 5d ago

I compiled LLM inference pricing across 7 providers — the caching numbers are surprising(spreadsheet included) [R]

I've been comparing GPU/LLM providers for a side project and ended up with way too many browser tabs and spreadsheets. So I decided to pull the public pricing data into one sheet and compare it side by side. A quick disclaimer: this is not benchmark data . I didn't run latency…

32
r/LocalLLaMA community 6d ago

Unlimited-OCR is now on ModelScope! A 3.3B multilingual OCR model for one-shot parsing across single images, multi-page documents, and PDFs. License: MIT

Full-document parsing instead of cropped-region OCR 32K output length for long OCR sequences Base and gundam image modes for different document layouts Transformers inference + SGLang serving with OpenAI-compatible streaming requests Built to push DeepSeek-OCR-style document…

22
arXiv — NLP / Computation & Language research 6d ago

CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression

arXiv:2606.24083v1 Announce Type: new Abstract: "Talk short. Drop grammar. Save token." This caveman style is widely promoted as a way to cut inference cost, but whether it actually saves anything depends on which channel (the user's prompt or the model's response) is being…

25
arXiv — Machine Learning research 6d ago

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

arXiv:2606.24506v1 Announce Type: cross Abstract: Emerging LLM services increasingly host many sparse MoE models, yet most models receive sparse requests and remain cold. This creates a GPU memory problem: model weights are stable and model-determined, while KV-cache is…

8
arXiv — NLP / Computation & Language research 6d ago

The Warrant Gap: Claim-Conditioned Re-scoring for Fact-Checking

arXiv:2606.24627v1 Announce Type: new Abstract: Fact-checking systems built on LLMs achieve high verdict accuracy on standard benchmarks, yet routinely output Supports labels whose cited evidence does not license the claim. Structured decomposition is the natural way to inspect…

4
arXiv — NLP / Computation & Language research 6d ago

EvidenceLens: A Claim-Evidence Matrix for Auditing Financial Question Answering

arXiv:2606.23724v1 Announce Type: cross Abstract: Large language models are increasingly used to answer questions over annual reports, earnings decks, and analyst notes, yet their outputs remain difficult to verify in high-stakes financial workflows. A fluent answer can blend…

32
arXiv — NLP / Computation & Language research 6d ago

CORE-BREW: LLR-Based Soft Decoding for Robust Multi-Bit LLM Watermarking

arXiv:2606.24163v1 Announce Type: cross Abstract: Reliable provenance for LLM outputs requires multi-bit watermarks that remain robust under editing while maintaining strict false-positive control. Existing ECC-based LLM watermarks rely largely on hard-decision decoding,…

8
arXiv — NLP / Computation & Language research 6d ago

Ensemble Learning for Large Language Models in Text and Code Generation: A Survey

arXiv:2503.13505v3 Announce Type: replace Abstract: Generative Pretrained Transformers (GPTs) are foundational Large Language Models (LLMs) for text generation. However, individual LLMs often produce inconsistent outputs and exhibit biases, limiting their representation of…

10
r/LocalLLaMA community 6d ago

What are the top Chinese GPU rental platforms?

This post has me intrigued ... but not to buy, I want to rent/lease one of these FRANKNVIDIA GPUs. I'll learn Chinese. I'll VPN in through the great firewall on the backs of carrier pigeons if I have to. I don't care. Where's the vast.ai of China at?   submitted by  …

10
r/LocalLLaMA community 6d ago

I want to add a second 7900XTX, question about pcie2/3/4

I've got a 7900XTX in my old gaming PC, now I want more vram and if I stay on one GPU I can only reasonably get 32GB and that just doesn't sound good enough. Using two slots, 48GB sounds way better and is much cheaper. I think 48GB is the minimum I want to have after going to…

26
r/LocalLLaMA community 6d ago

Tmax-27b - a Qwen3.6-27b terminal agent for small GPUs trained with DPPO (RL)

Hey everyone, wanted to share some work on making the new Tmax-27B terminal agent actually runnable on consumer hardware. What is Tmax-27B? Ai2 just released Tmax, a family of terminal-agent LLMs trained with DPPO (RL) on top of Qwen3.6. The 27B model hits ~43% on Terminal Bench…

32
r/LocalLLaMA community 6d ago

I'm eager for a 15x speedup on my strix halo

Nvidia says 15x speed up possible with diffusion model. Entire block of text generated at once. https://x.com/NVIDIAAI/status/2069465510790545761   submitted by   /u/Terminator857 [link]   [comments]

22
r/LocalLLaMA community 6d ago

650+ Apache-2.0 biomedical NER/de-id models that run on-device in MLX. Same fp32 weights, identical outputs: the clinical NER models run 30-40x faster than PyTorch-CPU on a 3-year-old M3 Max. Repro inside.

Disclosure first: I maintain OpenMed, so read this with that bias. I'm posting the numbers with the full methodology and a runnable script so you can reproduce or tear it apart. I'm here for the next couple of hours to answer methodology questions. What it is: an open-source…

25
r/LocalLLaMA community 6d ago

UPDATE: Qwen-27B-IQ4_KS and Qwen-27B-IQ_KS_KT for ik_llama.cpp, especially for NVIDIA with 16GB VRAM

Continuing 16GB VRAM Optimizations: New Qwen3.6-27B GGUF Quants (Experimental Trellis/iq4_kt & MTP) Hi everyone, I'm continuing my optimization efforts for 16GB VRAM and Nvidia GPUs from this post:…

7
r/LocalLLaMA community 6d ago

Is it possible to run a giant model like GLM5.2 on this cluster (4x servers with 512GB RAM + dual AMD Epyc)? 16 channel memory should hit 409GB/s per node.

Hey all, I have a piece of hardware laying around which is pretty fast from a traditional (non-GPU) server viewpoint. The hardware is the following: Dell C6525 Server with Quad Node (4x server blades) with the following: 2x AMD EPYC 7702 64-Core Processors 8 memory channels per…

30
r/LocalLLaMA community 6d ago

7 Chinese companies are already shipping H100/H200-class AI chips, most IPO'd in the last 6 months. I mapped all of them.

https://preview.redd.it/8e85oakbz19h1.png?width=3000&format=png&auto=webp&s=258039cb277daa2572d1ba4eec8cc488357c62d0 I run Chinese open models on a 4×3090 rig every day. The more I watched these models get tuned for domestic hardware, the more I wanted to know what that hardware…

17
NVIDIA Developer Blog official-blog 6d ago

Boost Inference Performance up to 15x on NVIDIA Blackwell Using DFlash Speculative Decoding

As AI systems move from single-turn interactions to coordinated multiagent workflows, low-latency inference becomes increasingly important. Autoregressive LLMs...

33
NVIDIA Developer Blog official-blog 6d ago

Build an AI Scientist for Life Science Discovery with NVIDIA BioNeMo Agent Toolkit

AI scientists are emerging as a new interface for scientific computing. These agents can read papers, write code, generate hypotheses, call APIs, inspect files,...

12
r/LocalLLaMA community 6d ago

CPU-only TTS benchmark: Kokoro 82M vs Supertonic 3 vs Inflect-Nano-v1 (4.6M params), with UTMOS scoring on every sample

Ran three open-weight TTS models head to head on CPU. Intel Xeon, 4 cores, 15.6GB RAM, no GPU. Five configs, six text lengths from 12 to 1712 chars, 5 timed reps per cell after warmup, 150 timed runs total. Every audio output scored with UTMOS (utmos22_strong) so quality isn't…

19
r/MachineLearning community 6d ago

What's your biggest pain point when choosing between cloud GPU providers for LLM inference?[R]

Trying to understand how other people make this decision. Do you compare $/hr, $/token, throughput, reliability? Is there a tool or resource you rely on, or are you just doing the math manually? Asking because I'm an ML engineer who's been doing this in spreadsheets and…

14
llama.cpp releases dev-tools 6d ago

b9767

ggml-webgpu: improve MTP inference by using mat-vec path for small batches ( #24811 ) ggml-webgpu: improve small batches decoding Add barrier to the NUM_COLS loop in mul-mat-vec macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED macOS…

21
Hugging Face Daily Papers research 7d ago

Safe Few-Step Generation via Velocity Editing

Abstract VESFlow is a training-free safety method for flow matching-based text-to-image generation that edits velocity fields to ensure safe output while maintaining prompt integrity. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Flow matching has recently emerged as a strong…

16
r/LocalLLaMA community 7d ago

100+ t/s on Qwen3.6-27B Q8 across a 5090 + 3090 Ti — switching to tensor split-mode got me from 70 to 100+

Wanted to share a setup that's been working great for me. Running Qwen3.6-27B at Q8_0 across two GPUs (RTX 5090 + RTX 3090 Ti) and getting ~100 t/s. The big jump came from switching --split-mode to tensor . I was sitting at 70+ t/s on layer split before that. Tensor split keeps…

22
TechCrunch — AI news-outlet 7d ago

AI chipmaker Groq confirms $650M raise, re-staffs after Nvidia’s $20B not-acqui-hire deal

What does an AI company do after one of those not-acqui-hire deals? Groq raised money, is leaning into its neocloud business, and is hiring new execs.

5
TechCrunch — AI news-outlet 7d ago

Nvidia wants to cut data center water use, but that’s not the same as fixing AI’s water problem

Nvidia announced a new cooling system that cuts water use inside the data center. But it does nothing to address AI's biggest water use — fossil fuel power plants.

5
TechCrunch — AI news-outlet 7d ago

SpaceX inks compute deal with Reflection AI, an open-source AI lab

Reflection AI will pay $150 million a month beginning July 1, 2026 through 2029 for immediate access to Nvidia's latest GB300 AI chips and supporting hardware across SpaceX's Colossus 2 data center near Memphis, Tennessee.

33
NVIDIA Developer Blog official-blog 7d ago

CCCL Runtime: A Modern C++ Runtime for CUDA

The NVIDIA CUDA Core Compute Libraries (CCCL) provides delightful and efficient abstractions for CUDA developers in C++ and Python. It features: Parallel...

8
r/LocalLLaMA community 7d ago

Chinese Hackers Latest Masterpiece with NVIDIA

They reverse-engineered the Tesla v100's pinout definition, soldered it onto a half height PCB, then naming it Tesla v100 v4. Price (with 3 years warranty): 16G version: 1499 rmb (220 usd) 32G version: 3999 rmb (590 usd) The hacker's op:…

10
Hacker News — AI on Front Page community 7d ago

The text in Claude Code’s “Extended Thinking” output

Article URL: https://patrickmccanna.net/the-text-in-claude-codes-extended-thinking-output-is-not-authentic/ Comments URL: https://news.ycombinator.com/item?id=48630535 Points: 210 # Comments: 151

19
r/LocalLLaMA community 7d ago

GLM-5.2 UD-IQ1_M on llama.cpp — 5090 + 3090 Ti speed test (~ 579 t/s prefill @ 8k ctx, ~324 t/s prefill @ 57k ctx, ~10.6 t/s decode)

Just sharing some speed test numbers for GLM-5.2 running on llama.cpp. Setup: Model: unsloth/GLM-5.2-GGUF, UD-IQ1_M quant GPUs: RTX 5090 + RTX 3090 Ti 186 GB DDR5 used Debian 13 CUDA 13.3 128k context, q8_0 KV cache Prefill (prompt processing): n_tokens tokens/s 8,201 579.75…

4
NVIDIA Developer Blog official-blog 7d ago

Inside NVIDIA Halos for Robotics: A Full-Stack Functional Safety System for Physical AI

Physical AI—robots working autonomously alongside people in factories, warehouses, hospitals, and homes—is arriving faster than most expected. Traditional...

12
Vercel — AI dev-tools 8d ago

Workflow SDK now compresses run and step payloads

The Workflow SDK 5 beta now compresses all run, hook, and step inputs and outputs with zstd . Compression kicks in automatically, but only when it helps. Small payloads stay as-is, larger ones get compressed before they're persisted. Compressed payloads use less storage and are…

16
r/LocalLLaMA community 8d ago

2× Radeon R9700 — Qwen 3.6 27B Q8 MTP on llama.cpp

There isn't much information around about multi-GPU setups with the R9700, so I'm writing this up in case it helps anyone in the same situation. Here's my setup, the tests I ran, and the numbers from the server logs. Setup ThinkStation P7, Xeon w7-3455, 128 GB RDIMM 2× Gigabyte…

11
r/LocalLLaMA community 8d ago

Can I realistically get close to Claude/Codex capabilities locally?

For context, I have a modest 32Gb rig running Nvidia GPUs (5070 Ti + 5060 Ti, the latter over an adapted x4 NVME slot so not as fast as if I had a motherboard with multiple proper CPU connected PCIe lanes). I can run the 27B models on it nicely enough, but the bottleneck is…

31
r/LocalLLaMA community 8d ago

8-16 MI50s Minimax M3 @19 tps TG (peak)

TL;DR Speeds are not too ugly for this old 2018 hardware but imo, not very usable for agentic coding (if you compare with qwen3.6 27B on 8 MI50 @ 50 tps TG 800 tps PP). More concerning is that the reasoning output is very very long and still didn’t check about the quality of…

27
r/LocalLLaMA community 8d ago

R9700 abysmal performance, getting desparate

I've been trying to get my 2x R9700 setup to work for the past two weeks. This has been such a time sink I wish I had just gone with nvidia. At this point I'm close to selling the cards. I need vLLM. This is a dedicated setup for multi-user serving. I've tried the…

17
r/LocalLLaMA community 9d ago

Bought 2x r9700, 5090 is now 7k and 6000 pro is at 13.5k, best option for 64 gb vram under 4k

after being frustrated with nvidia proces, I went with asrock r9700, not even dgx spark even they are at 7k now, did I make a mistake?   submitted by   /u/AppropriatePush6262 [link]   [comments]

20
r/LocalLLaMA community 9d ago

You can now convert EXL3 quants on Apple Silicon Mac

Hi, I'm here with an update. But this time it's quite a bigger news on local llm. Normally accessing the high fidelity quant like EXL3 is CUDA gated, and imagine you need 96GB-128GB with RTX cards, they are very specialized and expensive. But now on a more general basis, MacOS…

38
r/MachineLearning community 9d ago

An open handbook on LLM inference at scale (GPU internals, KV cache, batching, vLLM/SGLang/TensorRT-LLM) [P]

I've been working through the internals of LLM inference and writing up what I learn as an open, in-progress handbook. Just wrapped another chapter on GPU execution and memory internals: why a GPU sits mostly idle during inference, how the memory hierarchy gates throughput, and…

13
r/LocalLLaMA community 9d ago

7900XTX 24GB vram, can finally fit Q6K+MTP with Qwen 3.6 27B at 131k context

OS: CatchyOS Instructions: Connect monitor to iGPU directly so when you boot Linux your dGPU vram is 100% free since by default when you use your dGPU it consumes about 700mb~1.2gb of lost context space, yes you can still game normally using this approach. Setup kvcache at…

30

Scale or Reason? A Compute-Equivalent Analysis of Reasoning Distillation

Any chance I could cluster my DGX Spark (128GB unified memory) and my AMD Ryzen AI Max 395 (128GM unified memory) together to run 1 model?

SDXL running locally in the browser on WebGPU, open-source

MuJoCo derived Simulator for High Fidelity Vision RL training natively on GPU [D]

MINISFORUM DEG1 Oculink eGPU Dock Refurbished - $59

Has anyone else found vLLM outputs noticeably worse than llama.cpp for the same model?

Accelerating BEV Pooling on NVIDIA GPUs for Physical AI Applications

Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel

45°C cooling design cuts data center water use to near zero

I did some model hacks, and got GLM5.2 from about 2.5 tok/s to >50 tok/s on my GH200 system.

I compiled LLM inference pricing across 7 providers — the caching numbers are surprising(spreadsheet included) [R]

Unlimited-OCR is now on ModelScope! A 3.3B multilingual OCR model for one-shot parsing across single images, multi-page documents, and PDFs. License: MIT

CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

The Warrant Gap: Claim-Conditioned Re-scoring for Fact-Checking

EvidenceLens: A Claim-Evidence Matrix for Auditing Financial Question Answering

CORE-BREW: LLR-Based Soft Decoding for Robust Multi-Bit LLM Watermarking

Ensemble Learning for Large Language Models in Text and Code Generation: A Survey

What are the top Chinese GPU rental platforms?

I want to add a second 7900XTX, question about pcie2/3/4

Tmax-27b - a Qwen3.6-27b terminal agent for small GPUs trained with DPPO (RL)

I'm eager for a 15x speedup on my strix halo

650+ Apache-2.0 biomedical NER/de-id models that run on-device in MLX. Same fp32 weights, identical outputs: the clinical NER models run 30-40x faster than PyTorch-CPU on a 3-year-old M3 Max. Repro inside.

UPDATE: Qwen-27B-IQ4_KS and Qwen-27B-IQ_KS_KT for ik_llama.cpp, especially for NVIDIA with 16GB VRAM

Is it possible to run a giant model like GLM5.2 on this cluster (4x servers with 512GB RAM + dual AMD Epyc)? 16 channel memory should hit 409GB/s per node.

7 Chinese companies are already shipping H100/H200-class AI chips, most IPO'd in the last 6 months. I mapped all of them.

Boost Inference Performance up to 15x on NVIDIA Blackwell Using DFlash Speculative Decoding

Build an AI Scientist for Life Science Discovery with NVIDIA BioNeMo Agent Toolkit

CPU-only TTS benchmark: Kokoro 82M vs Supertonic 3 vs Inflect-Nano-v1 (4.6M params), with UTMOS scoring on every sample

What's your biggest pain point when choosing between cloud GPU providers for LLM inference?[R]

b9767

Safe Few-Step Generation via Velocity Editing

100+ t/s on Qwen3.6-27B Q8 across a 5090 + 3090 Ti — switching to tensor split-mode got me from 70 to 100+

AI chipmaker Groq confirms $650M raise, re-staffs after Nvidia&#8217;s $20B not-acqui-hire deal

Nvidia wants to cut data center water use, but that&#8217;s not the same as fixing AI&#8217;s water problem

SpaceX inks compute deal with Reflection AI, an open-source AI lab

CCCL Runtime: A Modern C++ Runtime for CUDA

Chinese Hackers Latest Masterpiece with NVIDIA

The text in Claude Code’s “Extended Thinking” output

GLM-5.2 UD-IQ1_M on llama.cpp — 5090 + 3090 Ti speed test (~ 579 t/s prefill @ 8k ctx, ~324 t/s prefill @ 57k ctx, ~10.6 t/s decode)

Inside NVIDIA Halos for Robotics: A Full-Stack Functional Safety System for Physical AI

Workflow SDK now compresses run and step payloads

2× Radeon R9700 — Qwen 3.6 27B Q8 MTP on llama.cpp

Can I realistically get close to Claude/Codex capabilities locally?

8-16 MI50s Minimax M3 @19 tps TG (peak)

R9700 abysmal performance, getting desparate

Bought 2x r9700, 5090 is now 7k and 6000 pro is at 13.5k, best option for 64 gb vram under 4k

You can now convert EXL3 quants on Apple Silicon Mac

An open handbook on LLM inference at scale (GPU internals, KV cache, batching, vLLM/SGLang/TensorRT-LLM) [P]

7900XTX 24GB vram, can finally fit Q6K+MTP with Qwen 3.6 27B at 131k context

AI chipmaker Groq confirms $650M raise, re-staffs after Nvidia’s $20B not-acqui-hire deal

Nvidia wants to cut data center water use, but that’s not the same as fixing AI’s water problem