r/LocalLLaMA · May 25, 2026 · 5 min read

Update on 12x32gb sxm v100 cluster / local AI for legal drafting

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Update on 12x32gb sxm v100 cluster / local AI for legal drafting

Update from the lawyer with the V100 server. A few of you asked what I actually ended up running once the dust settled, so here it is. Still just a lawyer, still driving the whole thing through Claude Code, still not fully sure what I'm doing — but it works now, which is more than I could say last time.

First, the hardware caught up to the plan. The last two V100s are in, so the "final form" I promised is real: twelve V100-SXM2 32GB on the Threadripper Pro. It's Board A on GPUs {4,5,8,9}, Board B on {6,7,10,11}, an NVLink pair on {0,1}, and a mixed pair on {2,3} where one card is a 16GB. Split a model across two different NVLink boards and throughput falls off a cliff (the cross-board hop is PCIe/NUMA, not NVLink), so I keep every model inside one board. Learned that one the expensive way.

And yeah, I caved and built the second box. EPYC 7302P, 512gb RAM, 4x RTX 3090 + 2x V100-PCIe. The mid-life crisis remains on schedule.

The bigger change: I gave up on vLLM for the local models. Not because vLLM is bad — because the models I actually want are MoE GGUFs, and vLLM on Volta is a dead end for those (FP8/AWQ/Marlin all want SM75+, the GPTQ kernels are broken on 7.0). I moved the whole thing to llama.cpp (mainline — a recent build finally fixed a Gemma chat-parser bug that had been mangling my long prompts).

Here's the part that's the opposite of what my first post implied: on V100, dense models are a trap. Only MoE clears a usable speed. Rough decode numbers — Q8 GGUF, Q4 KV cache, flash-attn on, one 4-card board, on real drafting prompts (several thousand tokens of context, not a 5-token "hello"):

Model	Type	tok/s (decode)
Gemma-4-26B-A4B	MoE	~113
Qwen3.6-35B-A3B	MoE	~82
Qwen3.5-122B-A10B	MoE	~50
any dense 27-32B	dense	~20-28 (under my 40 floor, not worth it)
dense ~128B	dense	~9 (forget it)

So a 122B/10B-active reasoning model runs at ~50 tok/s on four V100s — faster than the dense 32B managed on vLLM in my first post — and it holds that at long context (I've pushed Gemma past 25k tokens without it falling apart, where the dense models choked). That reframed everything: I stopped chasing big dense weights and built the system around MoE.

What's actually running (the stack you asked for):
It isn't one model answering chat — it's an orchestrator that routes a legal task across several local models, each pinned to its own board so they don't fight over GPUs. When it runs the heaviest job (a full affidavit or motion, intake-to-document), it lights up 16 GPUs across both boxes:

- Workhorse drafting — Qwen3.6-35B-A3B on Board A {4,5,8,9}
- Heavy reasoning + high-stakes drafting — Qwen3.5-122B-A10B on Board B {6,7,10,11}
- A small "does this even have grounds" gate model on the {0,1} pair
- An adversarial reviewer whose entire job is to attack my own draft, on the {2,3} pair
- Gemma-4-26B for financial/extraction + a small Qwen as the router, on the 3090s on the second box via Ollama

It's a sequential pipeline so they don't all hammer at once, but all 16 stay resident. Lighter work uses far less — combining and Bates-stamping exhibits is pure CPU (PyMuPDF + Tesseract, no GPU at all); a plain summary mostly just hits Gemma and the router.

The honest part, since this sub kept me honest last time:
- The local models hallucinate citations and dates. Confidently. I had to build a verifier that checks every cite, date, and Bates number in a draft against the actual source material and blocks anything it can't ground, on top of the adversarial reviewer. Local drafting is bimodal — sometimes it correctly refuses to invent, sometimes it fabricates a whole dated chronology and swears in the same breath that it invented nothing. It does not touch a final document without that gate and without me.
- The dumbest bug I found: my own pipeline was ~79% poisoned. The thing that builds the evidence bundle was scooping up its OWN prior outputs as if they were client evidence, so the models were "grounding" on slop they'd written earlier — at one point it cited an RTX 3060 as a Bates number, which, fair. Fixed the builder to stop eating its own tail and scrubbed it out. If you run any RAG/agent pipeline, go look at what's literally in your context window — mine was a hall of mirrors and I had no idea.
- I also made it refuse to quietly fall back to a cloud model when I tell it to run local-only. If it can't do a step locally it says so, by name, instead of phoning Anthropic behind my back.

Still want the exact thing I wanted in the first post — a model that writes like me and handles the boring form-filling and pattern stuff. I'm closer: the system now captures my edits as correction data, which is the start of a real fine-tune set. Haven't pulled the QLoRA trigger yet. So the same questions stand, and I'd genuinely take advice:
- For QLoRA on this hardware (V100, no bf16, no FA2): do you reach for a 35B-A3B MoE base, or am I smarter to fine-tune a dense ~14B I can actually train and keep the MoE for the heavy serving?
- Anyone serving MoE on Volta found anything faster than llama.cpp — ik_llama, something else? And is there a better long-context KV story than Q4?
- Am I an idiot keeping 122B-A10B around at 50 tok/s when I could just run the 35B for everything?

Tell me what I'm doing wrong.

submitted by /u/TumbleweedNew6515
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.

Discussion (0)

More from r/LocalLLaMA