[Benchmark] 5090RTX: Prompt Processing, Token Generation and Power Level
Inspired by https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/stop_wasting_electricity/, I decided to put my 5090 to the test, see what the curves look like for this card, and check whether there are any obvious sweet spots (apart from setting it to the 400 W minimum). Graphs and outcomes follow (charts are in the original post).

Inputs:
- Backend: llama.cpp in a Docker container, Flash Attention on, batch 2048, max context 122k
- Model: https://huggingface.co/HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Balanced
- Quant: Q6_K_P
- Hardware: Threadripper 6970, 2-channel 64 GB RAM, 5090RTX
- Prompt: a 30k-token prompt built from three copies of the same 10k-token benchmark covering heavy reasoning, math, and computation (can share on request). It was generated by Qwen 3.6 specifically for benchmarking.

Methodology (a sketch of the sweep harness follows the notes):
- Generation was stopped after 2 minutes to keep sessions short, and because the Token Generation (TG) metric only changes asymptotically beyond that.
- Measurements were taken on a warm card; waiting for the card to cool between sessions would have taken too long.
- Between measurements the server was restarted completely to reset the KV cache, so Prompt Processing (PP) was measured properly on the same input every run.
- Power Level range: 400 W to 600 W in 25 W steps.

Notes:
- The highest power draw registered was 592 W with the PL set to 600 W. Sustained load never reached 600 W, stabilizing around 580 W even when uncapped.
- In all other runs, peak readings went 10-12 W beyond the set PL, reflecting the sharp transient spikes the 5090RTX is already famous for (see the sampler sketch below).
- A cold card is 2-3% faster than a warm one, so sustained workloads are naturally slower than interactive, human-paced ones.
- PP is much more sensitive to the power limit, while TG is almost linear at these power levels (a back-of-envelope for why is sketched below).
- Not exactly apples to apples against the setup in the https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/stop_wasting_electricity/ post, but the difference between the 4090RTX and the 5090RTX seems to go beyond extra power, and it is not applied equally to PP and TG.
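For anyone who wants to rerun this, here is a minimal sketch of the kind of sweep harness the methodology describes, not the exact script used. Assumptions that are not from the post: the container is named "llamacpp", the server listens on localhost:8080, the GPU is index 0, the 30k prompt sits in prompt.txt, and generation is capped by token count instead of the 2-minute wall clock. `nvidia-smi -pl` requires root.

```python
import json
import subprocess
import time
import urllib.request

BASE = "http://localhost:8080"   # assumed llama.cpp server address
PROMPT = open("prompt.txt").read()

def wait_ready(timeout=300):
    """Poll llama.cpp's /health endpoint until the model has reloaded."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if urllib.request.urlopen(BASE + "/health").status == 200:
                return
        except OSError:
            pass
        time.sleep(2)
    raise TimeoutError("server did not come back up")

for pl in range(400, 625, 25):   # 400 W .. 600 W in 25 W steps
    # Set the power limit for GPU 0 (persists across the restart).
    subprocess.run(["nvidia-smi", "-i", "0", "-pl", str(pl)], check=True)

    # Full restart resets the KV cache, so PP of the same 30k input
    # is measured from scratch on every run.
    subprocess.run(["docker", "restart", "llamacpp"], check=True)
    wait_ready()

    # Token cap instead of the 2-minute wall clock (assumption).
    body = json.dumps({"prompt": PROMPT, "n_predict": 2048}).encode()
    req = urllib.request.Request(BASE + "/completion", body,
                                 {"Content-Type": "application/json"})
    t = json.load(urllib.request.urlopen(req))["timings"]
    print(f"{pl} W: PP {t['prompt_per_second']:.1f} t/s, "
          f"TG {t['predicted_per_second']:.1f} t/s")
```

Reading PP and TG rates from the timings block of the /completion response saves parsing them out of the server logs.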
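On the 10-12 W overshoot: a quick way to watch for it is to sample the reported draw during a run. A minimal sketch, assuming GPU index 0; nvidia-smi's power.draw is a driver-side reading, so the millisecond-scale transients the card is famous for can still slip between samples, making the observed peak a lower bound.

```python
import subprocess
import time

peak = 0.0
end = time.time() + 120          # cover the 2-minute generation window
while time.time() < end:
    # power.draw in watts for GPU 0, as reported by the driver
    out = subprocess.check_output(
        ["nvidia-smi", "-i", "0", "--query-gpu=power.draw",
         "--format=csv,noheader,nounits"], text=True)
    peak = max(peak, float(out.strip()))
    time.sleep(0.1)              # ~10 Hz sampling
print(f"peak observed draw: {peak:.0f} W")
```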
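A plausible reason (my reading, not stated in the post) for PP reacting strongly to the cap while TG reacts only mildly: single-stream decoding is bound by streaming the weights from VRAM for every token, which a power cap hardly throttles, whereas PP is a batched compute workload whose throughput tracks the clocks the cap cuts. A back-of-envelope, assuming Q6_K_P averages the standard Q6_K figure of ~6.5625 bits per weight and using the 5090's spec memory bandwidth:

```python
params = 27e9                          # 27B model from the post
bytes_per_token = params * 6.5625 / 8  # ~22 GB of weights read per token
bandwidth = 1792e9                     # 5090 spec GDDR7 bandwidth, bytes/s
print(f"TG ceiling ~ {bandwidth / bytes_per_token:.0f} t/s")  # ~81 t/s
```

That ceiling is set by memory traffic rather than the power budget, which would explain why TG is far less sensitive to the cap than PP across the 400-600 W range.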