r/LocalLLaMA

ztok — a fast multithreaded tokenizer in Zig that loads tiktoken / HF / SentencePiece and is 2–5× faster

I built ztok, a tokenizer library focused on being fast and format-agnostic for local pipelines.

- Loads what you already have — .tiktoken, HF tokenizer.json, SentencePiece .model, TokenMonster, Mistral Tekken. Auto-detected.

- Output is bit-identical to tiktoken / HF / SentencePiece on the equivalence gate, so it's a drop-in replacement.

- Faster on the same vocab + same bytes (cl100k vs tiktoken, EPYC 24c/48t): ~2× single-thread, 3.8–5.5× batched (~291–425 MB/s vs ~78 MB/s). Also faster than HF tokenizers and SentencePiece on their own vocabs.

- 8 language bindings over one C ABI — Python, Node, Ruby, Go, Rust, .NET, Java, Swift.

- Built for the boring-but-useful jobs: RAG chunking with token-cap windows + byte-accurate offsets, and dataset tokenization straight to .bin/.npy for training.
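
For readers unsure what "token-cap windows + byte-accurate offsets" looks like in practice, here is a rough sketch in plain Python. It is not ztok's API: `Chunk`, `toy_tokenize`, and `chunk_by_token_cap` are hypothetical names, and the whitespace splitter is only a stand-in for a real tokenizer. The assumption is that the tokenizer is lossless, i.e. the token byte strings concatenate back to the input, so a running sum of token lengths gives exact byte offsets.

```python
import re
from typing import Iterator, List, NamedTuple


class Chunk(NamedTuple):
    token_start: int  # index of the first token in the window
    token_end: int    # one past the last token in the window
    byte_start: int   # offset of the window's first byte in the source
    byte_end: int     # one past the window's last byte in the source


def toy_tokenize(data: bytes) -> List[bytes]:
    """Hypothetical stand-in tokenizer (NOT ztok): optional ASCII whitespace
    plus the following word. Concatenating the tokens reproduces `data`."""
    return re.findall(rb"\s*\S+|\s+", data)


def chunk_by_token_cap(
    tokens: List[bytes], cap: int, overlap: int = 0
) -> Iterator[Chunk]:
    """Yield windows of at most `cap` tokens, each with its byte range.
    Consecutive windows share `overlap` tokens."""
    if cap <= 0 or not 0 <= overlap < cap:
        raise ValueError("need cap > 0 and 0 <= overlap < cap")
    # prefix[i] is the byte offset where token i starts;
    # prefix[len(tokens)] is the total byte length.
    prefix = [0]
    for tok in tokens:
        prefix.append(prefix[-1] + len(tok))
    step = cap - overlap
    start = 0
    while start < len(tokens):
        end = min(start + cap, len(tokens))
        yield Chunk(start, end, prefix[start], prefix[end])
        if end == len(tokens):
            break
        start += step


if __name__ == "__main__":
    text = "naïve café tokenizers chunk text".encode("utf-8")
    tokens = toy_tokenize(text)
    for c in chunk_by_token_cap(tokens, cap=2):
        # Offsets index into the raw bytes, so multi-byte characters
        # do not shift later chunks.
        print(c, text[c.byte_start:c.byte_end].decode("utf-8"))
```

The offsets are in bytes rather than characters, so they can be used to slice the original file directly. With a byte-level BPE vocab a token boundary can fall inside a multi-byte character, so a real pipeline would likely need to decode a slice with care or snap chunk edges to character boundaries.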

Zig 0.16, AGPL-3.0, ~1100 tests. Feedback welcome, especially on vocab formats I'm missing.

https://github.com/sirus20x6/ztok

submitted by /u/FaustAg