r/LocalLLaMA

260K-param LLM running on an emulated 90s CPU inside an 18-year-old RTOS

I know this sub loves absurd LLM projects, so sharing my contribution while we wait for the new Qwen 3.7 models to drop!

I got a tiny LLM running inside an RTOS, which itself runs inside a custom-built JavaScript emulator for the Freescale ColdFire MCF5307, a derivative of the legendary Motorola 68K that powered the original Mac and the Sega Genesis.

I wrote the RTOS back in 2008 with three classmates for our university embedded systems course. It was lost to time, with the hardware and original ROM long gone. A few months ago, I decided to use Claude and Qwen to revive it, writing the CPU emulator from scratch and reverse-engineering the missing ROM from the calls the kernel makes into it. Once the original 2008 binary was booting, I wanted to go full inception and try running an LLM on the emulated stack.

As the starting point, I took Karpathy's llama2.c with the stories260K model trained on TinyStories. It's about half a megabyte of weights, which is tight but fits in the 16 MB of emulated memory after shrinking the kernel stack to free up room. The ColdFire has no FPU, so every float operation goes through libgcc's software emulation. A forward pass would need millions of emulated instructions per token, which is a non-starter.

To get around this, I quantized the model to INT8 with a per-row scale factor, turning the critical matmuls into pure integer math and dropping the inner loop to a handful of instructions. For floats outside of matmul, I went old school and used the fast inverse square root trick from Quake III (commonly attributed to Carmack) and a whole bunch of lookup tables for RoPE to avoid trig calculations. The only parts that stayed in emulated floating point are softmax and RMSNorm, but those run infrequently enough that it's still relatively fast.
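For anyone who wants to see the shape of these three tricks, here is a rough sketch in Python rather than the C that actually runs on the ColdFire. It is not the project's code: the function names (quantize_row, int8_matvec, fast_inv_sqrt, build_rope_table) are made up for illustration, the symmetric max-abs/127 quantization scheme is an assumption, and so is quantizing the activation vector with a single scale so the inner loop can stay integer-only. The RoPE base of 10000 is the usual default, not something confirmed for this build.

```python
import math
import struct


def quantize_row(row):
    """Quantize one float row to INT8 values plus one float scale.

    Assumed scheme: symmetric, scale = max|v| / 127.
    """
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale


def int8_matvec(q_rows, row_scales, q_x, x_scale):
    """Compute y = W @ x with an integer-only inner loop.

    Floats are touched once per output row, when the integer
    accumulator is rescaled by the row scale and the input scale.
    """
    out = []
    for q_row, w_scale in zip(q_rows, row_scales):
        acc = 0  # would be a 32-bit integer accumulator on the target
        for w, x in zip(q_row, q_x):
            acc += w * x
        out.append(acc * w_scale * x_scale)
    return out


def fast_inv_sqrt(x):
    """Approximate 1/sqrt(x) for positive x, Quake III style."""
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    i = (0x5F3759DF - (i >> 1)) & 0xFFFFFFFF
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step to refine the initial guess.
    return y * (1.5 - 0.5 * x * y * y)


def build_rope_table(max_pos, head_dim, base=10000.0):
    """Precompute (cos, sin) for every position and frequency pair.

    Built once up front, so inference only does table lookups.
    """
    table = []
    for pos in range(max_pos):
        row = []
        for i in range(0, head_dim, 2):
            angle = pos / (base ** (i / head_dim))
            row.append((math.cos(angle), math.sin(angle)))
        table.append(row)
    return table
```

To use it, quantize each weight row once ahead of time with quantize_row, quantize the activation vector the same way at each step, and hand both to int8_matvec. On real hardware without an FPU the tables and quantized weights would presumably be generated offline and shipped as data, so none of the float math in the setup functions would run on the target.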

The model generates at a blistering 2-4 seconds per token, producing mostly coherent (and sometimes weird) TinyStories-style English!

You can try it directly in your browser: just type %a to run the model. For the curious, I have a longer write-up on my whole RTOS archeology project here.

Obviously, this is not useful for anything practical, but it's neat to see LLMs running on potato-level stacks. My next step is putting the whole stack on an FPGA that re-implements the original hardware, which should bring it up to actually usable speeds.

submitted by /u/MironV