r/LocalLLaMA · 1 min read

Human Evaluation of GLM-5.2

I've seen plenty of benchmarks that put GLM-5.2 below many of the closed-source alternatives, but right on their heels. I thought to myself: by the next version, GLM will totally be where the best frontier models are now.

For the last few days I've been testing it on a real-world project, and it's basically Goated in my view. I wish I could run it locally, and I've seen some madlads around here with the hardware that could.

Today I ran into Design Arena's leaderboard for the first time. This is what OpenRouter bases its benchmark numbers on, and it's based on human voting! You can plug in that Doner kebab test there and vote on the most delicious-looking one 🍢

Game Dev, GLM-5.2 one step below Fable 5

In almost every category, GLM-5.2 is kicking tokens and taking names. In some of the tests it's right below Fable, which for all intents and purposes is MIA.

Therefore GLM-5.2, the MIT-licensed open-weights model, is in my view equivalent to the best models Claude has today 😳👏 I think we just won.

So I guess most standardized benchmarks really don't reflect real-world performance anymore, either because they're based on old assumptions/expectations or simply because they're being blatantly gamed.

submitted by /u/Alternative-Cat-1347