Tag

Gpu

500 articles archived under #gpu · RSS

arXiv — Machine Learning research 1h ago

What LLMs explain is not what they believe: Evaluating explanation sufficiency under models' own input beliefs

arXiv:2606.28615v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains, where free-text explanations such as chain-of-thought and post-hoc rationales are used to justify model outputs. Yet it remains unclear whether these…

31
arXiv — Machine Learning research 1h ago

The Contagion Tensor: A Framework for Measuring Output-Distribution Coupling in Multi-Agent LLM Systems -- and Auditing the Claims It Enables

arXiv:2606.28839v1 Announce Type: new Abstract: We introduce the Contagion Tensor, a measurement framework for quantifying how large language model (LLM) output distributions couple across modalities, agents, and time steps. From the tensor we derive the Coupling Amplification…

38
arXiv — Machine Learning research 1h ago

When Can Conformal Risk Control Certify LLM Outputs? Bounds, Impossibility, and Adaptation for Structured Generation

arXiv:2606.29054v1 Announce Type: new Abstract: Large language models (LLMs) deployed for structured generation (NER, JSON extraction, QA, and classification) lack formal reliability guarantees, and standard heuristic abstention policies miss user-specified risk targets by…

4
arXiv — Machine Learning research 1h ago

Few-Step Boltzmann Generators via Scalable Likelihood Flow Maps

arXiv:2606.29110v1 Announce Type: new Abstract: Recent progress in flow-based generative modeling has led to models that output high-quality samples while using only a small number of function evaluations. However, at present, there is a lack of similar advances in estimating…

32
arXiv — NLP / Computation & Language research 1h ago

Beyond the Mean: Three-Axis Fidelity for Aligning LLM-Based Survey Simulators from Small Pilot Data

arXiv:2606.28963v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to simulate social survey responses, yet their outputs exhibit systematic biases: marginal distributions are skewed, response variance is poorly calibrated, and predictor-outcome…

20
arXiv — NLP / Computation & Language research 1h ago

Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks

arXiv:2606.29082v1 Announce Type: new Abstract: Would experience designing faster GPU kernels also help close in on a long-standing open mathematical conjecture? Large Language Models (LLMs) integrated into evolutionary search have recently produced state-of-the-art solutions on…

4
arXiv — NLP / Computation & Language research 1h ago

AURORA: Asymmetry and Update-Induced Rotation for Robust Hallucination Detection in Large Language Models

arXiv:2606.29545v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. However, their tendency to generate hallucinations, namely factually incorrect or unfaithful outputs,…

27
arXiv — NLP / Computation & Language research 1h ago

Why Struggle with Continuous Latents? Interpretable Discrete Latent Reasoning via Rendered Compression

arXiv:2606.29712v1 Announce Type: new Abstract: Large language models achieve high reasoning performance via explicit chain-of-thought and reinforcement learning, but require long output sequences and extended inference time. Latent reasoning reduces this cost by shifting…

22
arXiv — NLP / Computation & Language research 1h ago

How Far Can You Get Without a GPU? A Systematic Benchmark of Lightweight Hallucination Detection Across Question Answering, Dialogue, and Summarisation

arXiv:2606.29809v1 Announce Type: new Abstract: Hallucination detection has become a pressing requirement for trustworthy AI deployment at scale. The most accurate detection methods depend on GPU-intensive inference, proprietary API calls, or white-box access to the generating…

27
r/LocalLLaMA community 5h ago

Been running Qwen3.6-27B through a 3-critic harness. The harness matters more than I thought

Been running Qwen3.6-27B (8-bit) through my coding harness for a few days, alongside GLM5.2. The harness uses 3 critics — code review, test review, Playwright e2e — each with fresh context before accepting output. Qwen3.6 is legit for a 27B dense model. Benchmarks weren't lying.…

19
r/LocalLLaMA community 11h ago

Qwen3-tts.cpp + Compose Desktop GUI

I improved my qwen3-tts.cpp implementation to be about 5x realtime on my RTX 5080. It is GGML based, so it should compile and run anywhere - however I only tested it with CPU & CUDA under Windows & Linux: https://github.com/Danmoreng/qwen3-tts.cpp Additionally I made a Desktop…

13
r/LocalLLaMA community 14h ago

Going from single GPU to dual GPU is nice but not in the way I expected

I was expecting that when doubling my VRAM from 24gb to 2x24gb I'd use higher quants with more context, and thus get smarter LLMs, but that's not what ended up happening. At least for coding, I found that the difference in quality from, say, qwen 27B UD-Q4-XL to a Q6 or Q8 is…

21
Anthropic SDK (Python) releases dev-tools 14h ago

v0.113.0

0.113.0 (2026-06-29) Full Changelog: v0.112.0...v0.113.0 Features api: add support for 20260318 web fetch and support tools ( 88dbfb1 ) Bug Fixes async count_tokens missing output_format/output_config merge block ( #162 ) ( 122c958 ) Chores api: accept user profile ID's when…

17
r/LocalLLaMA community 14h ago

Bolt Graphics GPU will have 2 DDR5 laptop DIMM slots

They have a few working prototypes, & are aiming for pre-production examples made by end of this year, & full production by Christmas 2027. Interesting specs: 5nm GPU; "High performance CPU in GPU"; on-card LPDDR5X as primary memory pool; 2 DDR5 SODIMM slots for 'spill over'…

38
r/LocalLLaMA community 15h ago

Mellum2 local deployments

Hey local community, I work at JetBrains with the team that trained Mellum2 models — 12B-2.5A LLMs. Those models are trained completely from scratch, targeting fast inference: our primary goal was H100/H200 prod deployments, but local deployments are good as well. We…

37
Hacker News — AI on Front Page community 16h ago

What happens when you run a CUDA kernel?

Article URL: https://fergusfinn.com/blog/what-happens-when-you-run-a-gpu-kernel/ Comments URL: https://news.ycombinator.com/item?id=48718863 Points: 215 # Comments: 28

37
Import AI (Jack Clark) community 16h ago

Import AI 463: Self-improving robots; a 10k Chinese GPU cluster; and an elegiac essay for the human era

What eras bookend our interregnum?

36
r/LocalLLaMA community 17h ago

Any good uses for a 192 GB DDR3 Server in the LLM world?

I've been gifted this old IBM System X V4 with a dual Xeon E5-2640 [6c12t @ 2.7 GHz] and a whopping 192 GB of DDR3 1666 ECC RAM. There's a gen 2 x16 PCIe port in there as well, so it can take a single GPU... Does anyone have some fun ideas on what to do with this system? It's…

15
r/LocalLLaMA community 22h ago

AMD MI210 64GB vs DCU K100 64GB

On the Chinese eBay there are many DCU K100 64 GB GPUs available for a very attractive price, between 6000 RMB and 19 000 (air- or water-cooled versions, new or second-hand), and 15 000 to 20 000 for the AMD MI210 (4000-6000 RMB for the PCIe bridge). There is very little…

25
arXiv — Machine Learning research 1d ago

OperatorSHAP: Fast and Accurate Shapley Value Estimation for Neural Operators

arXiv:2606.28065v1 Announce Type: new Abstract: Understanding model predictions is essential for physical applications, where outputs often inform safety-critical decisions, such as structural load assessment, weather warnings, and clinical diagnosis. Shapley values satisfy many…

20
arXiv — Machine Learning research 1d ago

VGB for Masked Diffusion Model: Efficient Test-time Scaling for Reward Satisfaction and Sample Editing

arXiv:2606.28301v1 Announce Type: new Abstract: Inference-time scaling is a promising paradigm to improve generative models, especially when outputs must satisfy structural constraints or optimize downstream rewards. We consider Masked Diffusion Model (MDM) and introduce…

37
arXiv — Machine Learning research 1d ago

Test-Input Generation for Tensor Programs: What Actually Finds Kernel Bugs

arXiv:2606.27396v1 Announce Type: cross Abstract: Test-input generation for tensor kernels is folkloric. Most projects pick a representative shape and dtype, run a fixed-shape allclose-style check, and ship. We make the choices explicit and measure them. Using the gpuemu…

18
arXiv — Machine Learning research 1d ago

Directed Graph Topology Inference via Graph Filter Identification

arXiv:2606.27455v1 Announce Type: cross Abstract: We address the problem of inferring a directed network from nodal measurements generated by linear diffusion dynamics on the sought graph. Observations are modeled as the outputs of a graph convolutional filter, i.e., a…

5
arXiv — NLP / Computation & Language research 1d ago

Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety

arXiv:2606.27632v1 Announce Type: new Abstract: As large language models are increasingly deployed in real-world systems, safety failures can still lead to harmful outputs and dangerous misuse. We argue that the essence of safety is adversarial: many failures arise not from…

29
arXiv — NLP / Computation & Language research 1d ago

Mitigating LLM-based p-Hacking by Preregistering for the Next LLM

arXiv:2606.27687v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to generate, classify, and annotate data whose outputs feed downstream hypothesis tests. However, LLM-based research is easy to p-hack: a researcher can tune the prompts, decoding…

32
arXiv — NLP / Computation & Language research 1d ago

Enhancing Numerical Prediction in LLMs via Smooth MMD Alignment

arXiv:2606.27731v1 Announce Type: new Abstract: Despite their strong general capabilities, large language models (LLMs) often remain unreliable when outputs must be numerically precise. A key reason is the training objective: standard cross-entropy treats numeric tokens as…

31
arXiv — NLP / Computation & Language research 1d ago

Output-Space Allocation Costs for Calibration-Guided LLM Compression: An Empirical Study

arXiv:2606.27785v1 Announce Type: new Abstract: Training-free compression methods for large language models (LLMs) often use calibration data to guide compression decisions. ROCKET, a recent method combining sparse-dictionary factorization with multi-choice knapsack problem…

30
arXiv — NLP / Computation & Language research 1d ago

Multimodal Evaluator Preference Collapse: Cross-Modal Coupling in Self-Evolving Agents

arXiv:2606.16682v3 Announce Type: replace-cross Abstract: When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using…

4
r/LocalLLaMA community 1d ago

Locally running mode turns an Image into a Cute Controllable Character you can Play as

This is a sequel to my last post here !! It meant a lot to have such positive feedback last time. This is the 800M version of the previous model. It still has a LOT of issues but the promise is the same. Working comfortably on consumer GPUs. The context is increased to 12 latent…

32
r/LocalLLaMA community 1d ago

NPC Engine Using Local Models

I’ve been working on a game-agnostic NPC engine/backend based pretty heavily on SillyTavern-style architecture, and with smaller local models getting better and better, I honestly think this kind of thing could be the future of RPGs. Right now I’m using NVIDIA Parakeet 0.6 for…

22
r/LocalLLaMA community 1d ago

Tensor split performance on low-bandwidth (TB3) eGPUs, and a question

Hey everyone! I've got a pair of Morefine G1 4090M 16gb eGPUs connected at 40Gbps via TB3 (daisy-chained). I normally run them in layer split mode as it doesn't seem to need much bandwidth; I'm seeing around 1300t/s PP and 26t/s TG (35-40 with MTP), qwen3.6-27B @ Q4. Which is…

20
r/LocalLLaMA community 1d ago

Ornith-1.0-35B GGUF update: native MTP speculative-decode graft + full serving/TTFT/long-context numbers (llama.cpp, tp=1)

Follow-up to my previous Ornith-1.0-35B Q3_K_M post. I grafted a native MTP draft head onto the IQ4_XS body (head at Q6) for self-speculative decode, single GPU, llama.cpp: 1.3-1.35x single-stream decode (172.6 -> 233.8 tok/s). Next-token distribution is byte-identical to…

11
TechCrunch — AI news-outlet 1d ago

Why Wall Street thinks US memory maker Micron is the next Nvidia

Eager to find more public AI-related companies that may do as well as Nvidia, Wall Street investors think they've found a winner with Micron.

5
r/LocalLLaMA community 1d ago

Best case for dual RTX 3090 (250W each) on Crosshair VIII Hero?

I'm building a local LLM workstation and would appreciate some advice from people already running 2×3090s. Current hardware: ASUS Crosshair VIII Hero (X570); one Gainward Phoenix RTX 3090; looking for a second used 3090 (not necessarily the same model). Both GPUs will be…

9
r/LocalLLaMA community 1d ago

How many of you do use Q1 or Q2 of Big models(100-250B)? How's it?

Sharing popular (also recent) models for reference: 151-250B: DeepSeek-V4-Flash, Step-3.X-Flash, Command-a-plus-05-2026, Laguna-M.1, MiniMax-M2.X, Qwen3-235B-A22B. 100-150B: GLM-4.5-Air, Qwen3.5-122B-A10B, NVIDIA-Nemotron-3-Super-120B-A12B, Mistral-Small-4-119B-2603…

34
r/LocalLLaMA community 1d ago

Is Qwen3-VL-2B the only viable VLM for JSON extraction on a "potato"?

After spending countless hours testing on 3 "potato" laptops (Intel i3, 8GB RAM, Win11, integrated GPU), that's my conclusion. For reliably extracting data from images to JSON on low-end hardware, nothing else even comes close. Yet, it’s completely missing from major benchmarks…

23
r/LocalLLaMA community 2d ago

Finally.. my rig is maxed out

Got all the parts before the crazy price increase except for the rtx pro 5k! Was saving up to order rtx pro 6000 in US and i did, but wanted to join nvidia inception program for the discount. It was around $8.5k during that time, less 1k if I succeeded. It took around 3 months…

7
r/MachineLearning community 2d ago

Benchmarking Self-Hosted Gemma 2 9B vs. Frontier APIs: The FP8 Quantization Prefill Tax and VRAM Realities on an NVIDIA L4 [P]

When evaluating migrating production LLM workloads off commercial cloud APIs, the conversation usually gets oversimplified into a trade-off between quality and infrastructure cost. To look past clean, isolated averages, I built a repeatable evaluation matrix using a real-world…

29
r/MachineLearning community 2d ago

Built an LLM training framework that actually runs on older GPUs without crashing [P]

Hey guys, I was playing around with Nanotron recently and got super frustrated by how many heavy, hardware-specific dependencies it imports at the module level (flash-attn, triton, functorch, etc.). If you try to run it on older or budget GPUs like a T4 or V100, it just…

30
TechCrunch — AI news-outlet 2d ago

The fittest founder in the room got cancer. Here’s how he used AI to fight back.

When confronted with cancer, Connor Christou fed everything tied to his regime (blood results, scan data, wearable output, journal entries) into Claude.

30
llama.cpp releases dev-tools 2d ago

b9827

[CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy (#25057). Add a CUDA ggml_cpy fast path for same-type, same-shape strided copies that are just 2D pitched block copies. When tensors are not fully contiguous…

14
r/LocalLLaMA community 2d ago

[NEW MODEL] - SupraSafety-18M · Tiny Content-Moderation Model

Hey r/LocalLLaMA! SupraLabs is back with a new model: SupraSafety-18M. It's a BERT-style 18M params model trained from scratch on 2 T4 GPUs in Kaggle on the nvidia/Nemotron-3.5-Content-Safety-Dataset dataset for 7 epochs. It's built to run on edge devices, mobile phones, or…

13
r/LocalLLaMA community 2d ago

96gb+ 4090's and 5090 are literally a scam. I mods these cards myself

I run a small GPU lab in the USA and work closely with two factories in China designing/producing 48gb 4090 PCBs. The only recent card we've gotten was the 32gb 4080 super. PSA: 96gb 4090's and 5090's are a SCAM (as of Jun 2026) - you will not get the card, they do not exist.…

32
r/LocalLLaMA community 2d ago

Dear poor people of this subreddit

I see people with multi-gpu setups but I'm sure there's a potato LLM runner out there somewhere. I have an old macbook pro (i5 8th gen, 8GB RAM) that I want to turn into a homelab. I want to run a small local model for experimenting and if possible, agentic tasks (like say…

22
r/MachineLearning community 2d ago

Kicking off GPU Mode [D]

Hey! I’m starting a series to document my work on GPU infrastructure, LLMs, and CV. Stop #1 is up: A brief look at why GPUs are the center of the industry, the CPU/GPU divide, and why nvidia-smi is the first place you check when things break. We’ll move past the basics quickly…

27
r/LocalLLaMA community 3d ago

Another big tensor fix b9820

sched: reintroduce less synchronizations during split compute (#20793). CUDA: Improve performance via less synchronizations between token (#17795). Adds CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async(). Adds function to relax sync requirements between input…

15
r/LocalLLaMA community 3d ago

Ornith-1.0-35B Q3_K_M: ~17 GB VRAM, KLD-checked against BF16

I quantized deepreinforce-ai/Ornith-1.0-35B down to Q3_K_M so it fits comfortably on a single GPU. Produced locally with llama-quantize from the upstream BF16 GGUF; the quantizer took it from 16.01 BPW down to 3.87 BPW, landing at 16.8 GB on disk / ~17 GiB loaded VRAM, about…

15
r/LocalLLaMA community 3d ago

Upgraded my budget build to multi-GPU for inference

I added: 1x RTX 3090 - 610 USD; 1x Arc A770 - 222 USD; 1x PCIe x1 to 4x USB 3.0 PCIe riser; new CPU cooler. Specs: Modified Zalman Z9 Plus Case; 2x Zotac RTX 3090 24 GB; 1x Intel Arc A770 16 GB; 48 GB DDR4 RAM; AMD Ryzen 5 1600X; MSI X370 SLI Plus. All parts were purchased second hand…

37
r/LocalLLaMA community 3d ago

Nemotron-3-Super-120B-A12B (hybrid Mamba+MoE) holds perfect needle retrieval to 504K tokens on 4×3090

TLDR: The Mamba/SSM layers keep a constant-size recurrent state instead of a growing KV cache, so context is nearly free. Full needle retrieval at half a million tokens, fully on-GPU, ~71GB. The new imatrix gguf here…

29
r/LocalLLaMA community 3d ago

Hello there! (again) i ported my kokoro enhancements so you can use them in your projects.

i made a web based and python based version of the enhancements i made to kokoro's controls. both are, of course, fully client side. if you have hardware acceleration turned on in your browser, kokoro runs on webgpu at about 40ms per generation. it's really fast. note: the…

36
