Tag #gpu · 500 articles archived under #gpu

r/MachineLearning community 24d ago TinyTPU: SystemVerilog systolic array compiled to WASM, running live in browser - RTL golden-verified against numpy [P] Most explanations of TPUs and systolic arrays are either hand-wavy diagrams or papers. I wanted to see the thing actually run, so I built it. TinyTPU is a 4×4 weight-stationary systolic array in real SystemVerilog, compiled to WebAssembly, with a step-by-step browser… 32

r/LocalLLaMA community 24d ago What exactly is quantization aware training? First time hearing it. I also heard about the gemma 4 qat quants and if any one of them is good for 4gb vram and 16gb ram. I can run gemma 4 26b moe iq2 nl at 8.5 to 9 tps(kv cache unquantized on gpu) with 9 layers offloaded to gpu submitted by /u/JournalistLucky5124… 31

r/LocalLLaMA community 24d ago sycl : port multi-column MMVQ from CUDA backend (~45% speculative decoding speedup on Intel Arc) by masonmilby · Pull Request #21845 · ggml-org/llama.cpp Saw this on other sub so posting here. For Intel ARC card holders. Big boost so update llama.cpp version( b9519 onwards) submitted by /u/pmttyji 15

llama.cpp releases dev-tools 24d ago b9529 model : fix llama_model::n_gpu_layers() (#24188) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64 (Vulkan) Ubuntu arm64… 36

r/MachineLearning community 24d ago Benchmark: ONNX Runtime vs HF Transformers vs GGUF for Parakeet TDT 0.6B on CPU-only hardware [D] Sharing a small CPU inference benchmark for nvidia/parakeet-tdt-0.6b-v3 that turned up a result I didn't expect going in. Setup: 2 x86-64 vCPUs (AVX2/FMA), 7.7GB RAM, no GPU. Test audio: 16.78s Harvard sentences at 16kHz mono.
Results: Inference path RTF Peak Memory CPU… 26

llama.cpp releases dev-tools 25d ago b9521 CUDA: enroll mul_mat_vec_q_moe into pdl (#24087) Enroll mul_mat_vec_q_moe into PDL, boosting MTP performance on BW Data collected on a B4500: Before (llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=202.8… 10

r/MachineLearning community 25d ago Is it allowed to use OpenAI API outputs to create a silver code dataset or benchmark for a specific Python library? [d] Hello everyone, Is it allowed to use OpenAI API outputs to create a silver code dataset or benchmark for a specific Python library? I am working on a project idea related to library-specific code generation. The concrete case is a specific Python library used in a… 18

llama.cpp releases dev-tools 25d ago b9519 sycl : port multi-column MMVQ from CUDA backend (#21845) mmvq: Port the ncols_dst optimization from ggml-cuda/mmvq.cu to SYCL. Read weights once per dispatch instead of once per column. Covers all standard quant types + reorder paths for Q4_0, Q8_0, Q3_K, Q4_K, Q5_K, Q6_K. IQ… 4

The Information — AI news-outlet 25d ago Nvidia CEO Returns to South Korea as AI Memory Runs Short Nvidia CEO Jensen Huang is making his second trip to South Korea in seven months, underscoring the country’s growing importance to the AI chip giant, according to Reuters. Demand for AI systems has strained supply of the chips and memory needed to build them, making South Korea… 37

arXiv — Machine Learning research 25d ago DiffSlack: Learning under Nonlinear Inequality Constraints via Learnable Slack Variables arXiv:2606.05247v1 Announce Type: new Abstract: Enforcing nonlinear inequality constraints in neural networks remains challenging, especially when the output is subject to many coupled constraints.
Existing hard constraint methods often impose structural restrictions on the… 8

arXiv — Machine Learning research 25d ago AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents arXiv:2606.05597v1 Announce Type: new Abstract: Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present… 16

arXiv — Machine Learning research 25d ago Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation arXiv:2606.05988v1 Announce Type: new Abstract: Reasoning models produce long chain-of-thought traces that are costly to distill and encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B and… 30

arXiv — Machine Learning research 25d ago OPRD: On-Policy Representation Distillation arXiv:2606.06021v1 Announce Type: new Abstract: On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies… 34

arXiv — Machine Learning research 25d ago On the training of physics-informed neural operators for solving parametric partial differential equations arXiv:2606.06164v1 Announce Type: new Abstract: Physics-informed neural operators (PINOs) aim to learn solution operators for partial differential equations by using the governing physics as supervision, rather than relying solely on paired input-output simulation data.
By… 6

arXiv — NLP / Computation & Language research 25d ago Localizing Prompt Ambiguity in Large Language Models with Probe-Targeted Attribution arXiv:2606.05486v1 Announce Type: new Abstract: Prompt ambiguity is a common source of failure in large language models, but is difficult to localize because it is a latent property of the prompt, while existing attribution methods are designed to explain observable outputs such… 22

arXiv — NLP / Computation & Language research 25d ago Epistemic Injustice in Language Models: An Audit of Pretraining Filters and Guardrails arXiv:2606.05936v1 Announce Type: new Abstract: Modern language models rely on pretraining filters to remove undesirable content from training corpora and inference-time guardrails to suppress undesirable outputs during deployment. In this paper, we examine how these filtering… 25

arXiv — NLP / Computation & Language research 25d ago Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries arXiv:2606.05970v1 Announce Type: new Abstract: Large language models are increasingly used for structured extraction from clinical free-text notes, but the sensitivity of their output to upstream configuration choices is less understood than their accuracy on fixed benchmarks.… 23

arXiv — NLP / Computation & Language research 25d ago Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition arXiv:2606.06065v1 Announce Type: new Abstract: Second-language (L2) speech recognition often requires transcriptions of pronunciations and intended meanings.
Multi-task learning (MTL) is a natural approach because it assumes that shared representations benefit both outputs.… 9

Hugging Face Daily Papers research 25d ago Video2LoRA: Parametric Video Internalization for Vision-Language Models Abstract Video2LoRA enables efficient video processing in vision-language models by predicting Low-Rank Adaptation weights from video representations, reducing computational costs while maintaining video-faithful outputs. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Processing… 7

Hugging Face Daily Papers research 25d ago OPRD: On-Policy Representation Distillation Abstract On-Policy Representation Distillation (OPRD) improves upon traditional on-policy distillation by aligning student and teacher representations in hidden-state space rather than just output space, resulting in reduced variance and improved training efficiency. Generated… 16

ThursdAI news-outlet 25d ago 📅 ThursdAI - Jun 4 - NVIDIA drops Nemotron 3 Ultra (550B open), Microsoft becomes a frontier lab, Ideogram 4 goes open, Agent Arena & more From CoreWeave: This week was kind of nuts, tons of new OpenSource goodness, 3 guests on the show (Arena, Nous Research and NVIDIA) and image gen SOTA models racing to the top. 10

r/MachineLearning community 25d ago Scrap the LLMs. Scoring 4.76% on the brand new ARC-3 using pure code, a 2012 AMD CPU, and zero AI tokens.[P] Hey everyone, The ARC Prize 2026 just launched the interactive ARC-AGI-3 track, and the collective AI world is panic-renting massive H100 clusters trying to get multi-billion parameter LLMs to navigate these dynamic environments. Predictably, out-of-the-box LLMs are faceplanting… 31

r/LocalLLaMA community 25d ago BeeLlama v0.3.1 – latest llama.cpp with extras! DFlash, MTP, q6_0 cache, TurboQuant. Single RTX 3090: Qwen 3.6 27B & Gemma 4 31B up to 177.8 tps (4.93x over baseline) BeeLlama v0.3.0 and v0.3.1 are here!
Big architectural update to align the fork with upstream llama.cpp and integrate all its additions like MTP and Gemma 4 12B support, while also updating DFlash to handle complex configurations like multi-slot and multi-GPU. Now also… 5

Hugging Face official-blog 25d ago Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI Published June 4, 2026 Varun Singh varunsingh nvidia Isabel Hulseman ihulseman0220 nvidia Anuj Doshi andoshi nvidia Shyamala Prayaga sprayaga25… 6

r/LocalLLaMA community 25d ago I accidentally crippled my 4x RTX 3090 LLM rig with a hidden PCIe 2.0 x4 slot and fixing it doubled Mistral 128B performance I’m posting this as a warning for anyone building multi-GPU local LLM rigs with older workstation/HEDT boards. My setup (Node #04) Gigabyte X399 Designare EX Threadripper 1950X 128GB DDR4 4x RTX 3090 10GbE TP-Link/Aquantia NIC llama.cpp NCCL build vLLM for safetensors models I… 15

r/LocalLLaMA community 25d ago NVIDIA Nemotron 3 Ultra is out. Not sure how much this is in the "local" world but interesting what they are putting out. https://developer.nvidia.com/blog/nvidia-nemotron-3-ultra-powers-faster-more-efficient-reasoning-for-long-running-agents/ submitted by /u/justdoitanddont 33

r/LocalLLaMA community 25d ago Nvidia's been paying shills on LinkedIn 3 different accounts, some even with LinkedIn Gold, made the above posts all on the same day. And clearly all of them followed the marketing team's pointers without even understanding how locally hosted AI works, no way a $249 8GB machine can replace frontier models. … 6

r/LocalLLaMA community 25d ago AMD & Intel, now onwards it's your turn to release your own models What are you doing AMD & Intel? NVIDIA just released a 550B model after so many tiny/small/medium/big models. Models are becoming(or already?)
the commodity for NVIDIA. https://huggingface.co/nvidia/models?sort=created https://huggingface.co/amd/models?sort=created… 38

NVIDIA Developer Blog official-blog 25d ago NVIDIA Nemotron 3 Ultra Powers Faster, More Efficient Reasoning for Long-Running Agents Single-turn chatbots are evolving into long-running agents that can reason, maintain context, use tools, and run efficiently across many turns to complete... 33

Hugging Face official-blog 25d ago How to Fine-Tune Nemotron 3.5 ASR for Your Language, Domain, or Accent Published June 4, 2026 Maryam Motamedi maryameee nvidia Adi- margolin Amargolin nvidia Francesco fciannella nvidia Myungjong Kim Myungjong nvidia Enas Albasiri… 4

r/LocalLLaMA community 25d ago nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 · Hugging Face Model Summary Total Parameters 550B (55B active) Architecture LatentMoE - Mamba-2 + MoE + Attention hybrid with Multi-Token Prediction (MTP) Context Length Up to 1M tokens Minimum GPU Requirement 8x GB200/B200/GB300/B300, 16x H100, 8x H200 Supported Languages English, French,… 21

Hugging Face official-blog 25d ago Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining Published June 4, 2026 Markus Kliegl mkliegl-nv nvidia Dan Su sudandandansu1 nvidia Author: Dan Su In large-scale LLM development, the question is no longer simply how… 29

Hugging Face Daily Papers research 26d ago Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs Abstract Research reveals significant disparities between text and image generation capabilities in multimodal models, with effective textual knowledge editing not transferring reliably to visual output, necessitating modality-aware editing approaches.
Generated by… 9

Vercel — AI dev-tools 26d ago Nemotron 3 Ultra now available on AI Gateway Nemotron 3 Ultra from Nvidia is now available on Vercel AI Gateway. Nemotron 3 Ultra is an open Mixture-of-Experts reasoning model built for orchestrating long-running agent workflows, with a 1M token context window. The model targets multi-turn agent workflows: planning, tool… 37

llama.cpp releases dev-tools 26d ago b9499 ggml-webgpu: FlashAttention refactor + standardize quantization support (#23834) Start work on flash_attn refactor Refactor Split k/v quantization Refactor and abstract quantization logic for flash_attn and mul_mat Add quantization support to tile path formatting Move to… 23

Smol AI News news-outlet 26d ago not much happened today **NVIDIA** released **Nemotron 3 Ultra**, a fully open **550B MoE** model with **55B active parameters** and **1M context**, optimized for long-running agent tasks with up to **5x speedup** and **30% cost reduction**. It features hybrid Mamba/attention, LatentMoE, native MTP,… 7

arXiv — Machine Learning research 26d ago When Autoregressive Consistency Hurts Safety Alignment arXiv:2606.04168v1 Announce Type: new Abstract: Safety alignment in large language models (LLMs) is fragile in part because it is often shallow: fine-tuning mainly reshapes the model's behavior near the first few output tokens.
We argue that this phenomenon can be understood… 21

arXiv — Machine Learning research 26d ago Exact Unlearning in Reinforcement Learning arXiv:2606.04182v1 Announce Type: new Abstract: We formulate the problem of \emph{exact unlearning} in reinforcement learning, where the goal is to design an efficient framework that enables the removal of any user's data upon deletion request, i.e., the online learner's output… 26

arXiv — NLP / Computation & Language research 26d ago ACAT: A Collaborative Platform for Efficient Aspect-Based Sentiment Dataset Annotation arXiv:2606.04189v1 Announce Type: new Abstract: Aspect-Based Sentiment Analysis (ABSA) requires high-quality datasets to train reliable models. However, existing annotation tools treat output as flat files, leaving researchers to manually consolidate multi-annotator data,… 7

arXiv — NLP / Computation & Language research 26d ago VCIFBench: Evaluating Complex Instruction Following for Video Understanding arXiv:2606.04588v1 Announce Type: new Abstract: Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We… 27

arXiv — NLP / Computation & Language research 26d ago Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game arXiv:2606.04978v1 Announce Type: new Abstract: LLMs can appear cautious in risk decision-making tasks, yet cautious-looking outputs do not necessarily indicate alignment with human decision-making mechanisms. We investigate this distinction using the St.
Petersburg game as a… 26

arXiv — NLP / Computation & Language research 26d ago Depth-Attention: Cross-Layer Value Mixing for Language Models arXiv:2606.05014v1 Announce Type: new Abstract: Self-attention selects information freely across the sequence, but across depth, Transformers merely add each layer's output to the residual stream, so later layers cannot selectively reuse earlier-layer representations. Recent… 20

arXiv — NLP / Computation & Language research 26d ago Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data arXiv:2606.05122v1 Announce Type: new Abstract: Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training:… 22

arXiv — NLP / Computation & Language research 26d ago Covert Influence Between Language Models arXiv:2606.04071v1 Announce Type: cross Abstract: As language models increasingly consume one another's outputs, covert influence -- a phenomenon where a sender's payload (the behavioral disposition it is conditioned to propagate) transfers to a receiver through carriers… 30

arXiv — NLP / Computation & Language research 26d ago VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark arXiv:2606.04244v1 Announce Type: cross Abstract: Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when… 7

arXiv — NLP / Computation & Language research 26d ago Token Rankings are Unforgeable Language Model Signatures arXiv:2606.04459v1 Announce Type: cross Abstract: Language model parameters are known to impose unique (to each model) geometric constraints on their logit outputs, which serves as a signature that identifies the model, but also leaks the model's final layer
parameters when an… 27

r/LocalLLaMA community 26d ago New Google Gemma 4 12B Claims Near-26B Performance - We Tested Both! We ran both models locally on one RTX 4090 and gave each the same task: write a self-contained HTML5 canvas animation with real physics in one file without libraries. Three scenes - a Galton board, two blocks colliding off a wall, and a chaotic triple pendulum Outputs: Gemma 4… 12

The Information — AI news-outlet 26d ago Apple to Launch New Siri in September With Help of Google, Nvidia Apple is currently on track to launch its overhauled Siri in September, to run in part on Google’s cloud computing servers using Nvidia chips, according to people familiar with the matter. While Apple will try to run as much as possible of the new Siri on devices such as… 31

The Information — AI news-outlet 26d ago Nvidia Buys Enterprise Model-Maker Kumo AI for at Least $400 Million Nvidia has bought Kumo AI, a five-year-old startup that sells predictive AI software to enterprises, for more than $400 million, said a person with knowledge of the deal. The acquisition, first revealed by an Nvidia executive in a LinkedIn post on Tuesday, should expand Nvidia’s… 26

r/LocalLLaMA community 26d ago google/gemma-4-12B · Hugging Face Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned… 29