MLX 16/8/4/2-bit quants of nvidia/llama-embed-nemotron-8b
I converted nvidia/llama-embed-nemotron-8b to MLX fp16, 8-bit, 4-bit, and 2-bit (for my OCD) and put them on HuggingFace; a rough conversion sketch follows the links:
ncorder/llama-embed-nemotron-8b-mlx-fp16
ncorder/llama-embed-nemotron-8b-mlx-8bit
ncorder/llama-embed-nemotron-8b-mlx-4bit
ncorder/llama-embed-nemotron-8b-mlx-2bit
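For reference, here is a rough sketch of how a quantized conversion like these can be produced with mlx-lm's convert utility. This is a sketch, not the exact command I ran: it assumes mlx-lm accepts this Llama-based embedding checkpoint as-is, and the output path and bit width are illustrative.

    # Hedged sketch: quantize a Hugging Face checkpoint to MLX with mlx-lm.
    # Assumption: mlx-lm's convert() handles this Llama-architecture embedding
    # model; the mlx_path and q_bits values below are illustrative.
    from mlx_lm import convert

    convert(
        hf_path="nvidia/llama-embed-nemotron-8b",
        mlx_path="llama-embed-nemotron-8b-mlx-4bit",
        quantize=True,
        q_bits=4,  # 2, 4, or 8 for the other quantized variants
    )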
I had been running this model as GGUFs via llama-server for local semantic search over an Obsidian vault and a few other projects. That worked fine, but I got tired of managing a whole HTTP server just for embeddings, and I also wanted the Apple Silicon optimizations. The MLX version loads in-process via mlx-embeddings, so there is no server to run:
    from mlx_embeddings import load_model, encode

    model, tokenizer = load_model("ncorder/llama-embed-nemotron-8b-mlx-4bit")
    embeddings = encode(model, tokenizer, ["your text here"])

Enjoy!
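Bonus: since the whole point for me was semantic search over the vault, here is a minimal sketch of the search loop itself, reusing the load_model/encode calls from the snippet above. The vault path and query are placeholders, and it assumes encode returns an (n, d) array-like that numpy can convert; adjust for whatever your mlx-embeddings version actually returns.

    # Hedged sketch of semantic search over markdown notes.
    # Assumptions: the "vault" path and the query string are placeholders,
    # and encode() returns an (n, d) array-like convertible with numpy.
    from pathlib import Path

    import numpy as np
    from mlx_embeddings import load_model, encode

    model, tokenizer = load_model("ncorder/llama-embed-nemotron-8b-mlx-4bit")

    notes = sorted(Path("vault").glob("**/*.md"))
    texts = [p.read_text(encoding="utf-8") for p in notes]

    doc_emb = np.asarray(encode(model, tokenizer, texts), dtype=np.float32)
    doc_emb /= np.linalg.norm(doc_emb, axis=1, keepdims=True)  # L2-normalize

    query = "notes about quantization quality"
    q_emb = np.asarray(encode(model, tokenizer, [query]), dtype=np.float32)[0]
    q_emb /= np.linalg.norm(q_emb)

    scores = doc_emb @ q_emb  # cosine similarity after normalization
    for i in np.argsort(scores)[::-1][:5]:
        print(f"{scores[i]:.3f}  {notes[i]}")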