r/LocalLLaMA · · 3 min read

Jetbrains Mellum 2: a really good and performant model

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Jetbrains Mellum 2: a really good and performant model

Oh Hey Folks,

I took the Mellum 2 model for a spin, so I wanted to share my impressions here.

Disclaimer: the tests presented here are not cientific nor have those nice names like perplexity,etc. These tests are somewhat more akin to what Im working in a daily basis or how useful a model is helping me on a given task. Just saying.

First of all, being a 12b moe model with 2.5b params activated is somewhat uncommon but look at the speed:

Model JetBrains/Mellum2-12B-A2.5B-Thinking
Prompt eval 492.7 t/s
Generation 111.2 t/s
ms / token 9.0 ms
Context 131 072 tokens
KV cache bf16
Backend llama.cpp Vulkan b9544
GPU AMD Radeon RX 7900 XT 20 GB

An even at ~130k context it never dropped bellow 100t/s.

Tool calls by session:

Tools call made by Mellum 2 Model

Like I said, I used some tasks to do the test, so here more information about it:

  1. tool_test: this one is simple in theory, but gemma4 -12b and gpt-oss-20b that are bigger models fails at least in the write/part V. The prompt is here: https://gist.github.com/gcavalcante8808/e5b4173dab2d66fd8c9c18d2e04d4742
  2. test_report: this one scores the model on those tasks that are part of tool_test, so this one has somewhat tricky stuff like checking the prometheus metrics, reconstruct the TransactionLog, etc. The prompt is here: https://gist.github.com/gcavalcante8808/969c071b872d8677211f836febcbfdcf
  3. Sometimes I also need to call the session-debugger to pinpoint where the model had some difficulties, this one is not so simple for a model of this weight on my opinion: https://gist.github.com/gcavalcante8808/7be2c5e9220fd6ecb7106100b8a4cb93

For a quick comparison, the legendary qwen3.5-9b which also oneshots the same tasks, gets roughly 30t/s token generation in the same hardware!

TLDR: Jetbrains rocked! I'm really impressed!

Setup

I have an AMD XT7900 (20GB Card) and 128GB of DD4 RAM and I tested using vulkan.

PS: I tried to test with ROCM, but my gpu was having hard locks, so I postponed rocm tests.

lscpu:

❯ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 43 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 24 On-line CPU(s) list: 0-23 Vendor ID: AuthenticAMD Model name: AMD Ryzen 9 3900X 12-Core Processor CPU family: 23 Model: 113 Thread(s) per core: 2 Core(s) per socket: 12 Socket(s): 1 Stepping: 0 Frequency boost: enabled CPU(s) scaling MHz: 81% CPU max MHz: 4672.0698 CPU min MHz: 2200.0000 BogoMIPS: 7585.71 

docker-compose.yaml:

services: llama: image: ghcr.io/ggml-org/llama.cpp:server-vulkan-b9544 # image: ghcr.io/anbeeld/beellama.cpp:server-vulkan-v0.3.1 ports: - "8080:8080" volumes: - huggingface_cache:/root/.cache - ./templates:/templates - ./models.ini:/config/models.ini:ro - ./models:/models devices: - /dev/kfd - /dev/dri command: - --models-preset - /config/models.ini - --models-max - "1" environment: LLAMA_ARG_HOST: "0.0.0.0" ulimits: nofile: soft: 65536 hard: 65536 nproc: soft: 65536 hard: 65536 sysctls: - net.ipv4.tcp_keepalive_time=600 - net.ipv4.tcp_keepalive_intvl=30 - net.core.somaxconn=8192 

models.ini:

[*] flash-attn = on ctx-size = 131072 [mellum2-12b-thinking] alias = mellum2, mellum hf-repo = JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q8_0:Q8_0 temp = 0.6 top-p = 0.95 top-k = 20 no-mmproj = true cache-type-k = bf16 cache-type-v = bf16 n-gpu-layers = 99 no-cache-prompt = true cache-ram = 0 [qwen3.5-9b] alias = qwen35-9b, qwopus hf-repo = unsloth/Qwen3.5-9B-MTP-GGUF:UD-Q6_K_XL temp = 1.0 top-p = 0.95 top-k = 20 min-p = 0.00 repeat-penalty = 1.0 presence-penalty = 1.5 chat-template-file = /templates/qwen.jinja chat-template-kwargs = {"preserve_thinking":true} no-mmproj = true n-gpu-layers = 99 no-cache-prompt = true cache-ram = 0 cache-type-k = bf16 cache-type-v = bf16 
submitted by /u/gcavalcante8808
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA