MagicQuant (v2.0) - Hybrid Mixed GGUF Models + Unsloth Dynamic Learned Quant Configurations + Benchmark table with collapsed winners and more
I spent the past 5+ months building a pipeline that creates hybrid GGUF quant mixes. I also built it to learn from Unsloth (or other) models by utilizing their quant-to-tensor assignments. And some architectures, like Qwen3.6 27B, have super weird patterns that can get genuinely lower KLD while dropping the model size meaningfully. Totally depends on the architecture though! This has been incredibly fun for me to build. I call my project "MagicQuant", and I'd love to show you what it's currently producing, alongside the published repos as a showcase.
The hybrid aspect is super fun and mostly what I'll talk about. But the final results table doesn't just include hybrids; it includes Unsloth, llama.cpp, or anything else it learns from. It only shows the survivors of the pipeline's gauntlet, though.
MagicQuant has dominance, premium, and nonlinear sub-space winners, plus collapse logic. Instead of a quant-dump repo that says, "I don't know if IQ4_XS or Q4_K_S is better than the other even though they're the same size. Nor do I know if this model is allergic to IQ4_NL, but good luck!", MagicQuant aims to actually test what's the best bang for your buck based on the VRAM you have.
Some models are very predictable, boring, and don't really have crazy improvements to be made, though maybe some nice optional sub zones, great collapse spaces, etc. Some models are weird, have quirks, and the system recognizes this and optimizes the living hell out of them.
MagicQuant aims to solve a few key issues I personally have with the space:
1.) Everyone posts Q8/Q6/Q5 and so on. But there are no benchmarks. Was there a dramatic dip in KLD going from one quant to another? If so, why are you showing me an obviously bad trade?
2.) What if I need to be in the Q4 size range, but am willing to sacrifice a few more bytes for a nonlinearly better KLD win? AKA, find good nonlinear KLD trade points that deserve to exist as an option.
3.) When downloading a model, I want to know only what quants matter. Not every quant currently available. Which is worth it? Which did better on this model? Hint hint, model architectures are weird, some like certain quants, some like weird quants but only in certain bit ranges where noise gets beneficial, some are allergic, some favor weird ones massively. Some LOVE MXFP4, though most hate it lol.
4.) Detect anomalies, hunt them down, validate their existence, and abuse the learned pattern. This is rare, but when it exists, USE IT! Qwen3.6 27B falls under this category of weirdness that can be abused.
This post is long. Here's the 'what to skip':
- Example Section - showcases the actual results. Should read to understand.
- Please Understand - I understand it's weird to have a section on the topic of, "This is more grounded than you think, please understand X". But without this section you may misjudge or misunderstand MagicQuant. You can skip it if you want, but I think it's important.
- Cloning Section - Optional read, but like, it's cool.
- Nonlinear Wins Section - If you don't care how winners are picked or what this means, skip.
- What Is MagicQuant Section - Probably should read, but like if you just want to see the results, click download, and play, skip this too.
Basically the Examples and final sections are really all that's necessary. The rest is just the sauce for those who want to understand, have questions, etc. Again, I apologize for the length, but it was a ton of work, a lot of fun and a lot of after work hours effort hammering away at this.
MagicQuant Repo Examples
Let's start with my favorite and best results thus far, because it's the most fun. Most models are way more tame, but Qwen3.6 27B had a lot of room to flex what MagicQuant found.
Qwen3.6 27B
There were many more models that were eliminated from the running and not uploaded, but I showcased a couple (crossed out) to help give reference points. Utilized for learning config patterns were the Unsloth Dynamic XL models (they only had the XL models) and llama.cpp default configurations.
| Name | KLD | Size (GB) |
|---|---|---|
| LM-Q8_0 | 0.003768 | 28.60 |
| MQ-Q6_K_1 | 0.002845 | 27.25 |
| MQ-Q6_K_2 | 0.003884 | 25.23 |
| MQ-Q6_K_3 | 0.004914 | 23.66 |
| LM-Q6_K | 0.007249 | 22.08 |
| MQ-Q5_K_S_1 | 0.006477 | 21.90 |
| MQ-Q5_K_S_2 | 0.007617 | 20.86 |
| LM-Q5_K_S | 0.010790 | 18.68 |
| UD-Q4_K_XL | 0.023521 | 17.61 |
| MQ-IQ4_NL_1 | 0.019687 | 17.59 |
| LM-IQ4_NL | 0.025714 | 15.80 |
| LM-IQ4_XS | 0.027015 | 15.08 |
| MQ-IQ3_M_1 | 0.043802 | 14.49 |
| LM-IQ3_S | 0.064393 | 12.42 |
| LM-IQ3_XXS | 0.093578 | 11.19 |
| LM-IQ2_M | 0.163117 | 10.00 |
| LM-IQ2_S | 0.210251 | 9.36 |
| LM-IQ2_XXS | 0.302597 | 8.43 |
Smaller than Q8 but lower KLD?
So, let me point out the elephant in the room. How in the world did MagicQuant build a model that was 1.35 GB smaller than Q8_0 but drop the KLD damage by nearly 25%?
Well, because Q6_K in ffn_down resulted in a KLD that was lower than Q8_0 would have given! This was not a detectable pattern in the isolated environment; it was an emergent behavior when quantization occurred with much less BF16 across the board.
MagicQuant has ways to smoke-test anomalies. In this scenario it found multiple, but I'm pointing out one: MQ-Q6_K_1 simply rebuilt the same model with Q8_0 on every group except ffn_down.
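In the group notation used later in this post, that pattern looks roughly like this. This is a reconstruction from the description above, not the published config itself (the exact one lives in the repo's manifest):

```python
# Reconstructed from the description above (Q8_0 everywhere except ffn_down);
# check the repo's manifest for the exact published config.
mq_q6_k_1 = {
    "embeddings":  "Q8_0",
    "lm_head":     "Q8_0",
    "attn_q":      "Q8_0",
    "attn_kv":     "Q8_0",
    "attn_output": "Q8_0",
    "ffn_up_gate": "Q8_0",
    "ffn_down":    "Q6_K",  # the anomaly: this group scored lower KLD than Q8_0
}
```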
What's going on?
Well, MagicQuant builds winners that are nonlinearly better on the KLD-to-size trade, or just what I call free lunch (aka smaller or the same size, and better).
When it comes to dominance winners or nonlinear winners, those are just great winning models found. Things like "premium" winners or "collapse" winners are more spicy logic, not always necessarily "purely better", but still very sound logic in general.
MQ-Q6_K_1 was a simple showcase of a hybrid pattern, but once you start hitting lower bits, you start seeing super fun patterns like the following.
This was MQ-Q5_K_S_1:
{ "embeddings": "IQ4_NL", "lm_head": "Q6_K", "attn_q": "IQ4_XS", "attn_kv": "Q8_0", "attn_output": "Q8_0", "ffn_up_gate": "UD-Q6_K_XL", "ffn_down": "Q5_K" } Or sometimes it's less crazy like that and you get wins like the UD-Q4_K_XL that was eliminated by MQ-IQ4_NL_1 with this pattern:
{ "embeddings": "IQ4_NL", "lm_head": "UD-Q4_K_XL", "attn_q": "IQ4_XS", "attn_kv": "Q5_K_S", "attn_output": "UD-Q4_K_XL", "ffn_up_gate": "UD-Q4_K_XL", "ffn_down": "UD-Q4_K_XL" } I love that it literally just used Unsloths Q4_K_XL and said, "oh if I just change these 2 groups, it's free lunch." This is actually how UD-Q3_K_XL got eliminated too, though it was eliminated by "premium" logic not due to it being purely "better".
"premium" winners means it's maximum 1% bigger than the baseline we're comparing too, and the KLD is nonlinearly better than going to the next bit anchor point. So it's a more bias spicy winner in my pipeline but it's also a very high bar imo.
But MagicQuant was able to hit really hard on this model, as you can see. Anomaly detection is rare, but when it occurs you see madness like this. And it's how 7 hybrids were decisively chosen as the final survivors.
The 27B model was extra spicy; normally, from what I've observed, more tame and normal results look like Qwen3 4B.
Qwen3 4B 2507 Instruct
The following Qwen3-4B-Instruct-2507 is more what I'd call "normal" for MagicQuant. No anomalies, no craziness, just what I consider straight value.
| Name | Quant Family | KLD | Size (GiB) |
|---|---|---|---|
| LM-Q8_0 | Q8_0 | 0.001339 | 3.99 |
| MQ-Q6_K_1 | Q6_K | 0.001817 | 3.58 |
| UD-Q6_K_XL | UD-Q6_K_XL | 0.002111 | 3.41 |
| LM-Q6_K | Q6_K | 0.004640 | 3.08 |
| MQ-Q5_K_1 | Q5_K | 0.006632 | 2.88 |
| UD-Q5_K_XL | UD-Q5_K_XL | 0.009839 | 2.73 |
| MQ-Q4_K_M_1 | Q4_K_M | 0.020346 | 2.44 |
| LM-Q4_K_S | Q4_K_S | 0.029803 | 2.22 |
| LM-IQ4_XS | IQ4_XS | 0.031300 | 2.11 |
| UD-Q3_K_XL | UD-Q3_K_XL | 0.072278 | 1.98 |
A cool win for a hybrid GGUF was MQ-Q4_K_M_1. It was what MagicQuant calls a "nonlinear" winner, and it ended up collapsing and removing UD-Q4_K_XL and LM-Q4_K_M.
Here's a side-by-side:
| Model | KLD | PPL Δ | Size (GiB) |
|---|---|---|---|
| MQ-Q4_K_M_1 | 0.020346 | 0.8312% | 2.439 |
| UD-Q4_K_XL | 0.022351 | 1.2805% | 2.413 |
| LM-Q4_K_M | 0.025432 | 1.6528% | 2.326 |
This does NOT mean it was the same size or smaller than those it collapsed. Sometimes it is, sometimes it isn't. But the system values nonlinear winners. Basically, the difference in size was considered too small to keep all 3, especially when there were additional smaller quants under LM-Q4_K_M. The system has lots of smart configurable logic that said, in this scenario, "Do we really need 3 separate models within a 113 MB size range of each other?" And there was a nonlinear winning hybrid, an Unsloth model, and a llama.cpp model within a collapsible range. A rough sketch of this collapse idea follows below.
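This is my own illustration of the described behavior, not the pipeline's actual logic; the real system weighs nonlinear wins and uses configurable thresholds, while this only shows the basic "do we need 3 models ~113 MB apart?" pruning:

```python
def collapse_window(models, window_gb=0.12):
    """Sketch: within a small size window, keep only the best-KLD model.

    `models` is a list of (name, kld, size_gb) sorted by size, descending.
    """
    survivors = []
    for model in models:
        _, kld, size = model
        if survivors and abs(survivors[-1][2] - size) <= window_gb:
            if kld < survivors[-1][1]:   # better KLD at ~the same size
                survivors[-1] = model    # replace the previous keeper
            continue                     # worse at ~the same size -> drop
        survivors.append(model)
    return survivors

# The three Q4-range models from the table above collapse to one:
print(collapse_window([
    ("MQ-Q4_K_M_1", 0.020346, 2.439),
    ("UD-Q4_K_XL",  0.022351, 2.413),
    ("LM-Q4_K_M",   0.025432, 2.326),
]))
# -> [('MQ-Q4_K_M_1', 0.020346, 2.439)]
```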
But this is just a showcase of one of many decisions MagicQuant will make to create a clean quant table of clear winners that are worth their salt and actually pay rent.
Mind you, hybrids, llama.cpp, and Unsloth models are treated identically. Each can win collapsed spaces; UD-Q5_K_XL and LM-Q4_K_S both collapsed other models and became winners, for example.
Also shown in the table, 2 of the hybrids, MQ-Q6_K_1 and MQ-Q5_K_1, were both discovered as nonlinear trade wins within their quant family's bit space. Meaning it's not just a Q6.5 or Q5.5, but a genuinely good KLD trade for the increase in size. Thus the system decided they were worthy of existing.
The 3 shown MagicQuant Hybrids actually utilized the following configurations:
| Name | embeddings | attn_q | attn_kv | attn_output | ffn_up_gate | ffn_down |
|---|---|---|---|---|---|---|
| MQ-Q6_K_1 | Q8_0 | Q8_0 | Q8_0 | Q8_0 | Q6_K | Q8_0 |
| MQ-Q5_K_1 | Q8_0 | Q5_K | Q8_0 | Q6_K | UD-Q5_K_XL | Q5_K_S |
| MQ-Q4_K_M_1 | Q8_0 | Q5_K | Q8_0 | Q6_K | IQ4_XS | IQ4_XS |
The goal is not to light up the map with hybrid models only. It's to find the best KLD-to-file-size trades you can make.
Qwen3.6 35B A3B - MOE Example
So, how does this system handle MOE? Well, about as well as whatever quants it learns from. Here's a more fun and recent example from the new Qwen3.6 series, which has more Unsloth Dynamic models to showcase too. You'll notice a lot of MagicQuant hybrids, and fewer options overall.
The reason is that tons of stuff was dominated and collapsed. This was actually less because of gnarly hybrids. The funkiest one was this:
{ "embeddings": "UD-IQ3_S", "lm_head": "Q8_0", "attn_q": "Q6_K", "attn_kv": "Q8_0", "attn_output": "Q8_0", "ffn_up_gate": "UD-IQ4_NL", "ffn_down": "UD-Q3_K_XL", "moe_router": "Q8_0" } But in reality, most were like this:
{ "embeddings": "UD-IQ3_S", "lm_head": "UD-Q6_K", "attn_q": "UD-Q6_K", "attn_kv": "UD-Q6_K", "attn_output": "UD-Q6_K", "ffn_up_gate": "UD-Q6_K", "ffn_down": "UD-Q6_K", "moe_router": "UD-Q6_K" } This MOE model mostly comes down to the experts, and Unsloth dominated freaking EVERYWHERE. I mean of course they did! But UD-IQ3_S was basically a free lunch cheat code. Why a Q3 you may ask? Well, remember Unloth Dynamic feels out tensor sensitivity and at the UD-IQ3_S and a variety of others that matched (this is just what my system latched onto even though others tied it). Unsloth made the embeddings really really strong on UD-IQ3_S because their system obviously found out it was sensitive and worth protecting. The size of that tensor group is actually larger than Q5 mind you but it demolished Q6 and Q8 in that category because it was both smaller and lower KLD!
Which is how the following table was born:
| Name | KLD | Size (GB) |
|---|---|---|
| LM-Q8_0 | 0.004654 | 36.90 |
| MQ-Q6_K_1 | 0.005149 | 31.59 |
| MQ-Q5_K_1 | 0.005523 | 29.19 |
| MQ-Q5_K_S_1 | 0.006730 | 26.33 |
| MQ-Q4_K_M_1 | 0.007799 | 24.82 |
| MQ-Q4_K_M_2 | 0.011007 | 22.32 |
| MQ-IQ4_NL_1 | 0.013277 | 20.89 |
| MQ-IQ3_M_1 | 0.026330 | 17.60 |
| UD-IQ3_S | 0.068376 | 13.68 |
| MQ-IQ2_XXS_1 | 0.275130 | 9.59 |
This has been a pretty clear pattern I've noticed, mind you: the more Unsloth Dynamic models there are to learn from, the better it can do. Which again... that makes tons of sense. But this is how MagicQuant works. Sometimes wins are really weird combinations, sometimes it's anomalies, sometimes it's cool sub zones, and sometimes it's just honestly noticing a few tweaks that could be made here or there to effectively get a bit of a boost.
Please Understand
I want to stress that MagicQuant can't "guarantee" anything. I can't say, "give me an optimized Q4." It instead checks the search space and tries to find IF any such spaces exist at all. They may or may not exist. That's the point. Some MagicQuant tables will light up hybrids on the map like a Christmas tree. Some go, "Unsloth killed it, go use them. Here's maybe 2 sub zones for nonlinear wins if you're in this VRAM size."
Additionally, KLD is the primary metric, though there are other PPL metrics behind the scenes, showcased in the manifest files on the repos. I use PPL as a secondary smoke signal. But I'm also sampling hundreds of isolated probes, so physics (raw compute time) is an issue. I'd love to add more benchmarks, but KLD is very effective at testing tensor configurations and thus a very good, cheap benchmark that's heavily utilized throughout the process. Plus it lets me finish the pipeline before my great great great grand-babies are born. But I'm always open to ideas, improvements, etc.! The goal isn't to produce a model that claims it's universally better in every single situation. It's to test and find the best tensor configurations!
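For anyone unfamiliar with the metric: the KLD numbers here are, as is standard for GGUF quant evals, the KL divergence between the original model's next-token distribution $P$ and the quantized model's distribution $Q$, averaged over the evaluation tokens (this is the standard definition; see the manifest files for exactly how it's computed in this pipeline):

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x \in \text{vocab}} P(x)\,\log \frac{P(x)}{Q(x)}$$

Lower is better; 0 would mean the quant's output distribution is indistinguishable from the original's.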
If you see metrics showcasing MagicQuant beating an Unsloth model, please keep in mind I never once benchmark the original Unsloth Dynamic artifact. I grab an Unsloth model, just like I would a llama.cpp or Bartowski GGUF model. I then strip the model of all its special sauce and normalize it with my derivative model, my imatrix, etc. Finally I begin grouping tensors, isolating them, probing, and building hybrid models.
So when you see something beating an Unsloth model, for example, I am NOT saying this version beats Unsloth's original artifact. I never ran that benchmark. They use their own imatrix, their own stuff; the only thing I benchmarked was their tensor config in a fair and isolated environment. What it is saying is that, under my isolated environment, the tensor configuration pattern of X beat Y.
I hope that makes sense 😄
That's also why, when an Unsloth Dynamic model wins, I literally link to their repo instead of re-hosting their quants. Plus, whenever an Unsloth model is beaten, it's usually literally a MagicQuant model using Unsloth's very tensor configurations but with a more optimized group pattern. So I didn't quantize jack diddly! You say don't quantize 1-dimensional tensors? I say, "That's not my responsibility to care about. Unsloth already protected that, thus so did MagicQuant."
Quantization is very hard. I leave that to the smart people working on that frontier. Think of MagicQuant like a meta level above quantization.
It's also why, in the past with MagicQuant v1.0, when I was asked, "Does this beat Unsloth Dynamic?", I didn't realize the misunderstanding: MagicQuant isn't a quantizer that makes tensor-by-tensor decisions like Unsloth Dynamic. I literally use their Unsloth Dynamic configurations. To me, asking if I beat Unsloth is kind of like asking me, after I overclocked a CPU, "Did you beat Intel/Ryzen?"
Whereas it's more like, "I mean, I got some good silicon and was able to overclock it to X. But it's still the same CPU."
Cloning
Another cool feature of MagicQuant is cloning. MagicQuant repositories are automatically generated, and one of the generated files is called "magicquant.clone-configs.json". The system doesn't strictly need this file to clone, but it makes cloning incredibly easier and faster versus downloading every model and learning the configs again when that work was already done once before.
This lets me upload a repository, then look at an uncensored variant of Qwen3.6 35B A3B, for example the model llmfan46/Qwen3.6-35B-A3B-uncensored-heretic, which utilized Heretic.
I can target that uncensored repository and the Qwen3.6 35B MagicQuant repository, and the system will bake a clone of the finalists, including the Unsloth Dynamic models too, since Unsloth doesn't host the uncensored models.
MagicQuant will rebuild the finalists without requiring the entire process to run again from scratch. It'll link back to the original MagicQuant repo in the README and properly re-run benchmarks as well. It checks tensor patterns too, so there are no accidental clones of things that don't match.
I have a cloned repo of the Qwen3.6 35B A3B for an uncensored version. Though, at least as of when I posted this, it's a clone of the old Qwen3.6 35B results, not the newest and more refined ones. It's still baking the clone and should hopefully be done within 24 hours of this post, with the newest MagicQuant hybrids for the uncensored model.
Importance Of Nonlinear Wins
MagicQuant does not look for simple "winners" in the sub-space between baselines. Instead, it only allows nonlinear trade wins. TLDR:
Imagine a graph like this:

```
Size →
|
|      Q6
|     /
|    /
|  Q5
|   /
| Q4
+----------------
```

A nonlinear win looks like:

```
|      Q6
|     /
|    * ← MQ-Q5_K_1 (above the line)
|  Q5
|   /
| Q4
```

That hybrid sits above the straight line drawn between the baseline quants on either side of it.
Meaning: it's a more efficient trade than the normal step-up.
This is what MagicQuant calls a "nonlinear trade/win" whenever that wordage is used.
Because anyone could just bump up a tensor or 2, see the KLD drop slightly, say, "look, it's better", and then light up the repo table with all MagicQuant models. That's not the point of MagicQuant. And "nonlinear winner" is an important distinction for understanding why a winner deserves to exist or is picked.
For a hybrid to be presented within a bit space, it must be genuinely better than just going to the next quant bit up. A concrete sketch of that check, using numbers from the tables above, is below.
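Here's a minimal sketch of the interpolation check — my own illustration of the idea, not MagicQuant's actual code — plugging in MQ-Q4_K_M_1 from the Qwen3-4B table above and its two size-neighbors:

```python
def is_nonlinear_win(hybrid, lower, upper):
    """Does the hybrid beat straight-line interpolation between the
    two baselines that bracket it? Points are (size_gib, kld)."""
    h_size, h_kld = hybrid
    lo_size, lo_kld = lower
    up_size, up_kld = upper
    t = (h_size - lo_size) / (up_size - lo_size)  # where the hybrid sits
    line_kld = lo_kld + t * (up_kld - lo_kld)     # expected KLD on the line
    return h_kld < line_kld                       # lower KLD -> above the line

# MQ-Q4_K_M_1 vs the LM-Q4_K_S -> UD-Q5_K_XL line (Qwen3-4B table above):
print(is_nonlinear_win((2.439, 0.020346), (2.22, 0.029803), (2.73, 0.009839)))
# -> True: 0.020346 beats the ~0.0212 the straight line predicts at 2.439 GiB
```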
What Is MagicQuant?
From previous posts, and for those who followed MagicQuant v1.0: a common misconception about MagicQuant is that it's a quantization algorithm. It does not make tensor-by-tensor decisions like Unsloth Dynamic or llama.cpp.
Here's a very simple explanation.
1.) The pipeline quantizes a model using llama.cpp, or downloads the Unsloth model.
2.) Each tensor is read and categorized into upwards of ~10 dynamically activated tensor group categories. This is simple regex-level finding to match them into their slots (see the sketch after this list).
3.) Store what quant was assigned to each tensor, and its assigned tensor group, within a database.
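Here's a rough sketch of what that regex-level grouping could look like. The group names are the ones used throughout this post; the patterns themselves are my assumption based on llama.cpp's GGUF tensor naming (token_embd.weight, blk.N.attn_q.weight, blk.N.ffn_down_exps.weight, ...), not MagicQuant's actual code:

```python
import re

# Hypothetical patterns -- illustrative only. Order matters: the router
# ("ffn_gate_inp") must be checked before the ffn_up/gate experts.
GROUP_PATTERNS = [
    ("embeddings",  re.compile(r"token_embd\.")),
    ("lm_head",     re.compile(r"^output\.weight")),
    ("moe_router",  re.compile(r"\.ffn_gate_inp\.")),
    ("attn_q",      re.compile(r"\.attn_q\.")),
    ("attn_kv",     re.compile(r"\.attn_[kv]\.")),
    ("attn_output", re.compile(r"\.attn_output\.")),
    ("ffn_up_gate", re.compile(r"\.ffn_(up|gate)")),
    ("ffn_down",    re.compile(r"\.ffn_down")),
]

def tensor_group(name: str) -> str:
    """Slot a tensor name into its group via simple regex matching."""
    for group, pattern in GROUP_PATTERNS:
        if pattern.search(name):
            return group
    return "other"

assert tensor_group("blk.3.ffn_down.weight") == "ffn_down"
assert tensor_group("token_embd.weight") == "embeddings"
```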
Unless you want more details, you can skip this part. It gets a bit more quant-heavy, and is really just me dumping knowledge sauce for those who want to deeply understand what's going on:
A bit more explanation for those unfamiliar. When you want to quantize a model, for example to Q3, and let's say you hypothetically have 400 tensors in your model: neither llama.cpp nor Unsloth just tells every single one of the 400 tensors to be some Q3-bit quant. That'd destroy the AI's brains.
Instead, real quantization algorithms like Unsloth Dynamic 2.0 feel out the tensors, which are sensitive, which are not, and do lots of fancy things. (Using loose language here to explain, mind you.)
What MagicQuant then does is look at Unsloth's model and see, for example, something like 100 tensors in the ffn_up_gate group, with 10 of them as F32, 30 as Q6_K, 20 as Q4_K, and 40 as IQ3_XXS. This knowledge gets recorded for re-use by MagicQuant's pipeline when recreating the baseline, building hybrids, isolating samples, and more.
For the Qwen3 4B Instruct 2507 model shown earlier, here's the actual range of quantizations used within each tensor group, as I recorded when reviewing Unsloth's UD-Q3_K_XL GGUF model:
| Tensor Group | Unique Final Quant Types |
|---|---|
| embeddings | Q6_K |
| attn_q | IQ3_XXS, IQ4_XS, Q3_K, F32 |
| attn_kv | IQ3_XXS, IQ4_XS, Q3_K, Q4_K, Q5_K, Q6_K, F32 |
| attn_output | Q4_K, F32 |
| ffn_up_gate | IQ3_S, IQ4_XS, Q3_K, F32 |
| ffn_down | Q4_K, Q5_K, Q6_K, F32 |
Cool, right? This is understood by real ML researchers obviously, but I consider myself a mere mortal, and this was just cool for me to fully realize.
Now MagicQuant remembers this kind of information. It's not trying to be architecture-aware necessarily or do some fancy thing. It simply remembers each tensor assignment and its assigned group (e.g. ffn_down, attn_q, etc.). Then if I want to use UD-Q3_K_XL on, let's say, attn_output in a future hybrid, I can just re-apply what was learned.
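As a toy illustration of what gets remembered, using the ffn_up_gate example above — the structure and field names here are hypothetical, since the actual database schema isn't shown:

```python
# Hypothetical shape of a learned record -- the real schema isn't published.
learned = {
    "source": "UD-Q3_K_XL",
    "groups": {
        "ffn_up_gate": {
            # exact tensor-name -> quant assignments, e.g.:
            "blk.0.ffn_up.weight": "IQ3_XXS",
            "blk.0.ffn_gate.weight": "IQ3_XXS",
            # ...and the rest of the ~100 tensors in this group...
        },
    },
}

# Re-applying "UD-Q3_K_XL on ffn_up_gate" to a future hybrid is then just
# replaying these per-tensor assignments instead of re-deriving sensitivity.
```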
This prevents MagicQuant from having to figure out what exact tensors to touch, not touch, which are sensitive, etc. I just stand on the shoulders of giants. I leave that hard part to the smart people pushing that frontier.
This is how hybrids are born, mind you. I simply digest these mappings and then build isolated samples of every tensor group to quantization configuration.
Then not only can I re-apply the learned config, but I also have a prediction engine that very practically (it's not omniscient) uses the probed isolated-sample knowledge to predict, then build, then validate, and find potential hybrids utilizing the mixed tensor-to-group knowledge that was extracted and isolated in samples.
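To give a feel for the predict-then-validate loop, here's the most naive sketch of such a prediction: treat each group's isolated KLD delta as additive. The real engine is more involved than this (it's not omniscient, and anomalies like the ffn_down case are exactly where additivity breaks), and the numbers below are made up:

```python
def predict_hybrid_kld(base_kld, isolated_deltas, config):
    """Naive additive prediction (a sketch, not the actual engine):
    start from a baseline KLD and add each group's delta as measured
    in its isolated probe. Promising candidates get built and
    benchmarked for real afterwards.

    isolated_deltas[(group, quant)] = KLD change when only that group
    was swapped to that quant against the baseline.
    """
    return base_kld + sum(isolated_deltas[(g, q)] for g, q in config.items())

# Hypothetical numbers, purely for illustration:
deltas = {("ffn_down", "Q5_K"): 0.0031, ("attn_kv", "Q8_0"): -0.0004}
print(predict_hybrid_kld(0.0038, deltas,
                         {"ffn_down": "Q5_K", "attn_kv": "Q8_0"}))
# -> 0.0065
```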
Final example. Sorry to beat this in, but it was a big misunderstanding with v1.0: think of MagicQuant like a wine critic and taster. I didn't make the wine, but I've tasted enough to tell you which pairs with what meals, and when.
Finally
I spent the last 5+ months working on MagicQuant v2.0 and it was a lot of work. I had to learn a lot. I had a lot of failures. I had to go back to the drawing board multiple times. I swear I would have physically chucked the code out the window at some points if it were possible.
But thank you so much to those who helped me along the way. I've wanted something like this for myself, because I feel like I can finally just look at a repo, know what trades I'm getting at what sizes, and not guess whether IQ4_NL or IQ4_XS is going to be barely any different, or find out an architecture is allergic to one and not the other (because yeah, that happens).
It was also a ton of fun building the hybrid aspect. Sometimes there are hybrid winners, sometimes there aren't. Totally depends. That's the point. If the space exists for nonlinearly good trades, that's great. If not, a repo still gets posted with just the baselines. Heck, if only Unsloth wins, then it's just a ton of links to Unsloth.
If you used v1.0 (MXFP4 era), check the docs for why it was deprecated and what changed:
https://github.com/magiccodingman/MagicQuant-Wiki/blob/main/wiki/archival/version_1/README.md
That v1.0 doc reads more like a postmortem, to be honest. I felt it was important to document why it failed, why it was wrong, and what I learned, even when the results looked deceptively successful.
If anyone notices flaws in the methodology, has disagreements, or anything else, I'm more than open to such a discussion. I'm not really trying to prove one thing or another. I'm just trying to build a pipeline that produces results I myself can trust, so I finally know what in the world is worth it.
If you test the models, I always love feedback. Did a MagicQuant hybrid compress part of the muscle in a way that causes you issues you wouldn't have with a non-hybrid? Or is the hybrid doing pretty well for you? Do you see flaws in how I'm operating, or ways it could be improved?
I've literally dumped all my logs in a magicquant-manifest folder on every repo so you can fully reproduce and trace everything that is occurring. And the wiki documents every detail to showcase how I build isolated samples, try to make fair comparisons, and more. I'm not really wanting to prove anything, I just want to trust my own system. Feedback helps me with that. And hopefully this interests someone enough to give it a test and validate or poke holes.
I've spent way too much time on this project. Like, I literally had to make an entire benchmarking and quantization queue system to speed up results massively. Right now MagicQuant actually has a system that leases out NVMe drives as scratch disks, because disk IO/latency becomes a bottleneck. I regret both everything and nothing. Thank you!
GitHub:
GitHub Wiki - Where you can make requests, provide more feedback, etc
Just a note on the wiki: I did have AI help me write it. I'm going to be rewriting a ton of it to be less AI-manifesto talk; I apologize for that. Right now it's helping me document things as they change, because it's a lot, and it's very helpful. I have been reviewing what it writes; I just haven't gone back to refine and humanize it yet.
Huggingface Collection:
Huggingface Collection For Current MagicQuant Repo's
Funnily MagicQuant makes me look at quants now like I look at my quails. I see one even slightly causing a ruckus. That's bad for the flock. I guess my dog Orzo is about to get some more quail jerky.
Quick FAQ:
Q: Will the code/pipeline be released?
A: Yes. I'm going to finish refining it first before posting it on my GitHub, where the current wiki will also become the source code location and be renamed to drop the "-wiki" in the name. But I'd spend more time bug fixing if I released it right now. It's mostly usable in my IDE in debug mode, and the code is a mess at the moment, as it has evolved so many times. But I do plan to release the code, especially because I don't have the hardware to run a lot of the bigger models! If others found this project interesting and helped post MagicQuants, that'd be amazing! Also, I'm going through 1 more refactor where I'm highly debating making it a small web app running locally instead of a CLI. Honestly, it'd make my life easier. I'm quite tempted.
Q: Is there an Imatrix?
A: Yes, I use my own imatrix. It's ~1.5M tokens dispersed over multiple domains. If you're interested, I documented what I'm currently using on the wiki's imatrix-dataset page. If you have suggestions to improve it, please lay it on me!
Q: Am I going to add more benchmarks?
A: Very unlikely. I'm not trying to make a benchmark suite covering every angle; it's more to answer the question of what's best within real practical use. This isn't to say proper benchmarks with harnesses and so on aren't amazing. But when sampling hundreds of models, physics becomes the biggest slowdown. Within all practical reality, KLD is great for this imo, with PPL as a secondary smoke alarm. But again, if you have ideas, I'm not just willing to listen; it's the advice, support, and idea building with others that helped me get this far in the first place. I do have a line in the sand, though: new tests can't push the build time toward heat-death-of-the-universe territory.