
MLX 16/8/4/2-bit quants of nvidia/llama-embed-nemotron-8b


I converted nvidia/llama-embed-nemotron-8b to MLX fp16, 8-bit, 4-bit, and 2-bit (for my OCD) and put it on HuggingFace:

ncorder/llama-embed-nemotron-8b-mlx-fp16

ncorder/llama-embed-nemotron-8b-mlx-8bit

ncorder/llama-embed-nemotron-8b-mlx-4bit

ncorder/llama-embed-nemotron-8b-mlx-2bit

I was running this model using GGUFs + llama-server for local semantic search over an Obsidian vault and some other projects. It worked fine, but I got tired of managing a whole HTTP server just for embeddings, and I also wanted the Apple Silicon optimizations. The MLX version loads in-process via mlx-embeddings, no server needed.

from mlx_embeddings import load_model, encode

model, tokenizer = load_model("ncorder/llama-embed-nemotron-8b-mlx-4bit")
embeddings = encode(model, tokenizer, ["your text here"])
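
For the semantic-search side, ranking notes against a query is just cosine similarity over the embeddings. Here is a minimal sketch of that step, assuming encode returns one vector per input text that can be converted to a numpy array (the exact return type from mlx-embeddings may differ and might need unwrapping first):

import numpy as np

def top_k_notes(query_emb, note_embs, k=5):
    # Normalize the query and note vectors, then rank notes by cosine similarity.
    q = np.asarray(query_emb, dtype=np.float32)
    n = np.asarray(note_embs, dtype=np.float32)
    q = q / np.linalg.norm(q)
    n = n / np.linalg.norm(n, axis=1, keepdims=True)
    scores = n @ q
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]

You would call this with the embedding of your query string and the stacked embeddings of your vault notes; the returned indices point back into whatever list of note paths you embedded.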

Enjoy!

submitted by /u/kexxty