r/LocalLLaMA

A barebones CPU-only inference engine for Qwen 3, written from scratch in pure C


TL;DR: The (very messy) code and writeups can be found at https://github.com/jakint0sh/qwen3-engine

Read the README for instructions on how to get started.

And for those who just want a bulleted list:

- Inference engine for Qwen 3 sizes 4B and below
- Written from scratch in pure C
- No dependencies except libc, libm, and cJSON (and OpenMP if compiled with parallelization)
- Loads directly from HF safetensors, does 4-bit affine quant on the fly
- Does KV caching
- Built-in chat interface
- Very slow, but the code is readable and tractable, and would be good to learn from

And now for the blab-fest...

So, as the title would suggest, I wrote my own LLM inference engine, specifically targeting the smaller Qwen 3 models, from scratch in pure C. Now, you may very well ask why anyone would do such a thing. It was partly a learning experience for me, since I didn't know how LLMs worked and I wanted to learn, and partly it was that I was challenged to write my own inference engine, and I decided I wasn't going to take the easy way out and glue Python libraries together. I'm a decent C programmer, and figured that C would be a good choice to attack the problem with since you need speed in inference anyway.

So, I ended up spending about a week and a half in a loop of eat, read, write code, sleep, repeat, and in that time, I went from knowing nothing about how transformer models work to having implemented all of inference in my own code from scratch. It was quite the experience.

I relied heavily on ChatGPT to explain all of the core LLM concepts to me (tokenization, the transformer math, KV caching, quantization, etc.) as I had no machine learning, numerics, or HPC background. I had a math background, so the linear algebra and general math concepts weren't an issue for me. But I definitely would have run into a number of issues surrounding quantization, softmax, and similar had I not had the robot overlords helping me.

I made a number of choices while writing the code. Firstly, I heavily prioritized representational correctness and clarity over performance. I was moving FAST, and learning and implementing such a massive amount of machinery that I knew I would just get mired in implementation details and bugs if I didn't put guardrails in place to save me from some of C's sharp edges early on. I took the easy way out and put asserts everywhere, and oddly enough, I don't remember a single time where one of those asserts tripped, but I felt better knowing that if I ever did anything idiotic I'd get the runtime complaining about it.

Unfortunately, because I deprioritized performance and didn't have a good sense of what is fast or slow on modern computers (my experience was mostly in the realm of vintage computing and assembly programming, where lookup tables are ALWAYS faster than computing values inline and there is no such thing as cache), I made some design decisions that were pretty awful for compiler optimization and cache locality. In the end, even when compiled with OpenMP parallelization, my engine is awfully slow, only being able to spit out 1 token per second on my laptop (an i5-1240P with 16 threads, roughly performance-comparable to an Apple M1 on CPU compute).

Secondly, I wanted to maintain as much authorship of the implementation as I could. This meant no external code or libraries (as much as was reasonable) and also, no LLM-written code. ChatGPT did give me some implementation ideas that weren't strictly related to the math or LLM structure, but I personally wrote all of the code, and the majority of the implementation ideas and concepts are my own. It's not like I invented inference or any of the math therein, but I can confidently say that I did it all myself. I did use the C standard library and math functions, and I also used cJSON because I didn't want to write a JSON parser just to load configuration and deal with the safetensors file format. I could have, but I figured that it would be a huge time sink and a big potential source of bugs.

Thirdly, related to the above, I wanted as few external dependencies as possible. That meant no Python script with a ton of runtime libs required to process the model weights into something that the C engine could ingest. That conversion approach is the one taken by the inference engine at https://github.com/adriancable/qwen3.c, where you have to convert the weights into a special binary format before the engine can load them. But apparently I prefer pain and suffering, so I decided I would ingest the weights as they were distributed (which, in the case of Qwen3-4B, was BF16 safetensors), and quantize them on the fly while loading. This means that you don't have to do anything fancy to get going with the engine. Just download the weights, compile it, and go. So that's nice at least.
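For anyone wondering what ingesting BF16 weights involves, here is a minimal sketch of the widening step. It is not code from the repository: the function names are hypothetical, and it assumes the tensor bytes are stored little-endian, which is my understanding of the safetensors format. A BF16 value is just the upper 16 bits of an IEEE-754 float32, so converting it is a shift into the high half of a 32-bit word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Widen one BF16 bit pattern to float32 (hypothetical helper).
 * BF16 keeps the sign, the 8 exponent bits, and the top 7 mantissa bits
 * of a float32, so the low 16 bits are simply filled with zeros. */
static float bf16_to_f32(uint16_t bits)
{
    uint32_t wide = (uint32_t)bits << 16;
    float out;
    memcpy(&out, &wide, sizeof out); /* reinterpret bits without aliasing UB */
    return out;
}

/* Read a little-endian 16-bit value regardless of host byte order. */
static uint16_t read_u16_le(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Convert n BF16 values from a raw tensor byte buffer into floats. */
static void bf16_buffer_to_f32(const unsigned char *src, float *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = bf16_to_f32(read_u16_le(src + 2 * i));
}
```

A loader could call `bf16_buffer_to_f32` on each tensor's byte range as it streams through the file, then hand the resulting floats straight to the quantizer.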

Fourthly, related to the above points, I wanted the code to be readable and tractable. The qwen3.c implementation is fast, but it's dense in most of the important areas (the pointer math in the parallelized for loop for the MHA is... not easy to understand), and is an absolute bear to try to brain out unless you're very fluent in C, and very domain-knowledgeable on inference as well. That's fine for a compact, performant runtime, and in fact you kind of have to do it that way if you actually want good performance because you have to write code in ways that the compiler can optimize, but it makes for a poor educational example. I wanted my engine to be easier to go through and understand, and be something that someone could actually learn from.

Also I wanted to be able to understand my own code as I was writing it, because again, I was moving really fast through a lot of stuff, and I didn't want to get dragged down in implementation difficulties. I would have been shooting myself in the foot big time had I not made the code easy for very-sleep-deprived future-me to understand later.

And fifthly, I wanted my engine to be reasonably comparable to modern implementations in terms of architecture, so I needed to implement quantization and KV caching at the very least.

There's a lot more I could say, but that's the gist of it. It runs Qwen 3 4B in a reasonable memory footprint by doing a simple affine 4-bit quant of most of the weights, and it has a little terminal-based chat interface built in, and it's usable as-is.
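To make "simple affine 4-bit quant" concrete, here is a minimal sketch of one common way to do it: split the weights into small groups, store a per-group minimum and scale, and map each weight to a 4-bit code packed two per byte. Everything here is an assumption for illustration rather than the repository's actual layout: the group size of 32, the `QuantGroup` struct, and the function names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GROUP_SIZE 32 /* hypothetical group size; must be even */

typedef struct {
    float scale;                    /* step between adjacent 4-bit codes */
    float min;                      /* value represented by code 0 */
    uint8_t packed[GROUP_SIZE / 2]; /* two 4-bit codes per byte */
} QuantGroup;

/* Map one value to its nearest code in [0, 15]. */
static uint8_t quantize_value(float x, float lo, float inv_scale)
{
    /* (x - lo) is non-negative, so adding 0.5 and truncating rounds to nearest. */
    int q = (int)((x - lo) * inv_scale + 0.5f);
    if (q < 0) q = 0;
    if (q > 15) q = 15;
    return (uint8_t)q;
}

/* Quantize GROUP_SIZE floats: x is approximated by min + scale * code. */
static void quantize_group(const float *x, QuantGroup *g)
{
    assert(x != NULL && g != NULL);

    float lo = x[0], hi = x[0];
    for (size_t i = 1; i < GROUP_SIZE; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    g->min = lo;
    g->scale = (hi - lo) / 15.0f; /* 16 codes span 15 steps */

    /* A constant group has scale 0; every code is then 0 and decodes to min. */
    float inv_scale = g->scale > 0.0f ? 1.0f / g->scale : 0.0f;

    for (size_t i = 0; i < GROUP_SIZE; i += 2) {
        uint8_t q0 = quantize_value(x[i], lo, inv_scale);
        uint8_t q1 = quantize_value(x[i + 1], lo, inv_scale);
        g->packed[i / 2] = (uint8_t)(q0 | (q1 << 4)); /* even index in low nibble */
    }
}

/* Recover an approximation of element i of the group. */
static float dequantize_at(const QuantGroup *g, size_t i)
{
    assert(g != NULL && i < GROUP_SIZE);
    uint8_t byte = g->packed[i / 2];
    uint8_t q = (i % 2 == 0) ? (uint8_t)(byte & 0x0F) : (uint8_t)(byte >> 4);
    return g->min + g->scale * (float)q;
}
```

With this layout, each group of 32 weights costs 16 bytes of codes plus two floats of metadata, and the worst-case rounding error per weight is about half of `scale`. Smaller groups track the weight distribution more closely at the cost of more metadata.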

The code is messy, and there's a lot of unfinished stuff, and a couple of bugs too. I wanted to clean those up before sharing my code... but now it's been 2 months short of a year since I really touched it, and evidently, I'm never going to get around to it. So, it's messy, but I'm sharing it anyway. If you feel so inclined to clean up some of the mess and rough edges, pull requests are welcome.

I also wrote some technical writeup documents about this project, and those are in the repository as well, in the "writeups" dir. They're mostly just historical artifacts at this point, but I think they're good to include nonetheless. Maybe you'll get a laugh out of reading them. I was also considering writing a document that could be read alongside the code, and explain the whole implementation bottom to top, and I can absolutely put that together if any of you would be interested in it.

Comments and feedback are welcome, and if you have any questions at all about the code, I'll do my best to answer them promptly! I hope those of you who're just trying to get a foot in the door in learning about how this stuff works under the hood can learn from it.

Edit: formatting goof

submitted by /u/jakint0sh