r/LocalLLaMA · 2 min read

[Benchmark] RTX 5090: Prompt Processing, Token Generation and Power Levels



Inspired by https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/stop_wasting_electricity/ I decided to put my 5090 to the test and see what the curves look like for this device, and whether there were any obvious sweet spots (apart from setting it to the minimum of 400 W).

Graphs and outcomes:

https://preview.redd.it/t0icb8j7831h1.png?width=1700&format=png&auto=webp&s=f787b987c14ff1670d26171304dbdfc6e9fc3a69

https://preview.redd.it/6pe7k7j7831h1.png?width=1700&format=png&auto=webp&s=62b08ebab967f7af6dc8a7a865b2d22856d54a0c

https://preview.redd.it/vya398j7831h1.png?width=1700&format=png&auto=webp&s=d7f4330159964e5373266c717a1cde7c491df3f3

https://preview.redd.it/o7inv8j7831h1.png?width=1700&format=png&auto=webp&s=0baced5e3ffd1b33558bf9085d7ffea0622ce3f2

Inputs:

Backend: llama.cpp in a Docker container, flash attention on, batch size 2048, max context 122k.
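For reference, llama.cpp's server reports the PP and TG rates directly in the "timings" object of a /completion response, which is a convenient way to collect numbers like the ones below. A minimal parsing sketch (the sample values are illustrative, and the server address is not part of this post):

```python
# Extract prompt-processing and token-generation rates from a llama.cpp
# server "timings" dict, as returned in a /completion response.
def parse_timings(timings: dict) -> tuple[float, float]:
    """Return (prompt tokens/s, generated tokens/s)."""
    return timings["prompt_per_second"], timings["predicted_per_second"]

# Illustrative shape of the relevant fields (values are made up for the example):
sample = {"prompt_per_second": 2135.0, "predicted_per_second": 48.7}
pp, tg = parse_timings(sample)
print(f"PP {pp:.0f} t/s, TG {tg:.1f} t/s")
```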

Model: https://huggingface.co/HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Balanced

Quant: Q6_K_P

Hardware: Threadripper 6970, 2-channel 64 GB RAM, RTX 5090

Prompt: a 30k-token prompt composed of 3 × 10k copies of the same heavy reasoning, math, and computation benchmark (can share upon request); it was generated by Qwen 3.6 specifically for benchmarking.

Methodology:

Generation was stopped after 2 minutes to keep sessions short, and because the TG metric is asymptotic beyond that point. Measurements were taken on a warm card, as cold measurements would have required too much time between sessions. Between measurements the server was restarted completely to reset the KV cache, so that PP was measured properly on the same input each time.

Power Level Range: 400 W – 600 W, in 25 W steps
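The sweep above works out to 9 power levels. A small sketch that builds the corresponding `nvidia-smi -pl` commands (the standard way to set a per-GPU power limit; running them requires root and an NVIDIA GPU, so this only constructs the command strings):

```python
# Build the nvidia-smi commands for a 400-600 W power-limit sweep in 25 W steps.
def sweep_commands(start_w: int = 400, stop_w: int = 600,
                   step_w: int = 25, gpu: int = 0) -> list[str]:
    levels = range(start_w, stop_w + step_w, step_w)  # inclusive of stop_w
    return [f"nvidia-smi -i {gpu} -pl {w}" for w in levels]

for cmd in sweep_commands():
    print(cmd)
```

Between each step, the methodology above would restart the server before re-running the benchmark prompt.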

Notes:

The maximum power draw registered was 592 W with the PL set to 600 W; sustained load never reached 600 W, stabilizing at 580 W even when uncapped.

In all other runs there was a visible trend of peak readings exceeding the set PL by 10–12 W, reflecting the sharp transient spikes the RTX 5090 is already famous for.

A cold card is 2–3% faster than a warm card, making sustained workloads naturally slower than interactive, hand-driven ones.

Prompt Processing is much more sensitive to the power limit, while Token Generation is almost linear across these power levels.

Not exactly apples to apples when compared with the setup used in the https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/stop_wasting_electricity/ post, but the difference between the RTX 4090 and RTX 5090 seems to go beyond just more power, and it is not applied equally to PP and TG:

PL     PP 5090  PP 4090  Ratio   TG 5090  TG 4090  Ratio
450 W  2273     2113     1.076   49.3     41.0     1.202
425 W  2248     2093     1.074   48.9     41.6     1.175
400 W  2135     2061     1.036   48.7     42.5     1.146
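The ratio columns can be recomputed from the raw measurements, and dividing TG by the power cap gives a rough efficiency proxy (my own derived metric, not from the original post; actual draw can sit below the cap). It shows the 400 W setting generating the most tokens per watt of cap:

```python
# Measured numbers from the table above: PL -> (PP 5090, PP 4090, TG 5090, TG 4090).
rows = {
    450: (2273, 2113, 49.3, 41.0),
    425: (2248, 2093, 48.9, 41.6),
    400: (2135, 2061, 48.7, 42.5),
}

for pl, (pp5090, pp4090, tg5090, tg4090) in rows.items():
    print(f"{pl} W: PP x{pp5090 / pp4090:.3f}, TG x{tg5090 / tg4090:.3f}, "
          f"5090 TG {tg5090 / pl:.4f} t/s per W of cap")
```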
submitted by /u/Opening-Broccoli9190
