Oh Hey Folks,
I took the Mellum 2 model for a spin, so I wanted to share my impressions here.
Disclaimer: the tests presented here are not cientific nor have those nice names like perplexity,etc. These tests are somewhat more akin to what Im working in a daily basis or how useful a model is helping me on a given task. Just saying.
First of all, being a 12b moe model with 2.5b params activated is somewhat uncommon but look at the speed:
| Model | JetBrains/Mellum2-12B-A2.5B-Thinking |
| Prompt eval | 492.7 t/s |
| Generation | 111.2 t/s |
| ms / token | 9.0 ms |
| Context | 131 072 tokens |
| KV cache | bf16 |
| Backend | llama.cpp Vulkan b9544 |
| GPU | AMD Radeon RX 7900 XT 20 GB |
An even at ~130k context it never dropped bellow 100t/s.
Tool calls by session:
Tools call made by Mellum 2 Model
Like I said, I used some tasks to do the test, so here more information about it:
- tool_test: this one is simple in theory, but gemma4 -12b and gpt-oss-20b that are bigger models fails at least in the write/part V. The prompt is here: https://gist.github.com/gcavalcante8808/e5b4173dab2d66fd8c9c18d2e04d4742
- test_report: this one scores the model on those tasks that are part of
tool_test, so this one has somewhat tricky stuff like checking the prometheus metrics, reconstruct the TransactionLog, etc. The prompt is here: https://gist.github.com/gcavalcante8808/969c071b872d8677211f836febcbfdcf - Sometimes I also need to call the
session-debugger to pinpoint where the model had some difficulties, this one is not so simple for a model of this weight on my opinion: https://gist.github.com/gcavalcante8808/7be2c5e9220fd6ecb7106100b8a4cb93
For a quick comparison, the legendary qwen3.5-9b which also oneshots the same tasks, gets roughly 30t/s token generation in the same hardware!
TLDR: Jetbrains rocked! I'm really impressed!
Setup
I have an AMD XT7900 (20GB Card) and 128GB of DD4 RAM and I tested using vulkan.
PS: I tried to test with ROCM, but my gpu was having hard locks, so I postponed rocm tests.
lscpu:
❯ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 43 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 24 On-line CPU(s) list: 0-23 Vendor ID: AuthenticAMD Model name: AMD Ryzen 9 3900X 12-Core Processor CPU family: 23 Model: 113 Thread(s) per core: 2 Core(s) per socket: 12 Socket(s): 1 Stepping: 0 Frequency boost: enabled CPU(s) scaling MHz: 81% CPU max MHz: 4672.0698 CPU min MHz: 2200.0000 BogoMIPS: 7585.71
docker-compose.yaml:
services: llama: image: ghcr.io/ggml-org/llama.cpp:server-vulkan-b9544 # image: ghcr.io/anbeeld/beellama.cpp:server-vulkan-v0.3.1 ports: - "8080:8080" volumes: - huggingface_cache:/root/.cache - ./templates:/templates - ./models.ini:/config/models.ini:ro - ./models:/models devices: - /dev/kfd - /dev/dri command: - --models-preset - /config/models.ini - --models-max - "1" environment: LLAMA_ARG_HOST: "0.0.0.0" ulimits: nofile: soft: 65536 hard: 65536 nproc: soft: 65536 hard: 65536 sysctls: - net.ipv4.tcp_keepalive_time=600 - net.ipv4.tcp_keepalive_intvl=30 - net.core.somaxconn=8192
models.ini:
[*] flash-attn = on ctx-size = 131072 [mellum2-12b-thinking] alias = mellum2, mellum hf-repo = JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q8_0:Q8_0 temp = 0.6 top-p = 0.95 top-k = 20 no-mmproj = true cache-type-k = bf16 cache-type-v = bf16 n-gpu-layers = 99 no-cache-prompt = true cache-ram = 0 [qwen3.5-9b] alias = qwen35-9b, qwopus hf-repo = unsloth/Qwen3.5-9B-MTP-GGUF:UD-Q6_K_XL temp = 1.0 top-p = 0.95 top-k = 20 min-p = 0.00 repeat-penalty = 1.0 presence-penalty = 1.5 chat-template-file = /templates/qwen.jinja chat-template-kwargs = {"preserve_thinking":true} no-mmproj = true n-gpu-layers = 99 no-cache-prompt = true cache-ram = 0 cache-type-k = bf16 cache-type-v = bf16
submitted by
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.