r/LocalLLaMA

I had 55 LLMs blind-grade each other (22k judgments, all open). Every model family with enough data shows a bias toward or against its own siblings. Qwen judges favor Qwen by ~0.9 points. Mistral penalizes its own by ~1.0.

I have been running an open evaluation setup where N models answer the same prompt, then blind-grade each other in an N x N matrix with self-judgments excluded. No single privileged judge. So far: 286 evaluations, 198 hand-written questions, 22,254 valid judgments across 55 models from 11 developer families. Code, dataset, and all prompts are MIT licensed.

The finding I did not expect: same-family rating bias is statistically significant in all 8 families with enough data (p < 0.05, and 7 of 8 survive Bonferroni correction). Gaps on a 0-10 scale:

  • Qwen judges rate other Qwen models +0.91
  • xAI +0.75, Anthropic +0.62, MiniMax +0.31, OpenAI +0.23
  • Google -0.59, Meta -0.68, Mistral -1.02
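For anyone who wants to sanity-check numbers like these against the open data, here is a minimal sketch of a raw same-family gap plus a Bonferroni filter. It assumes each judgment is a (judge_family, author_family, score) tuple with self-judgments already dropped. The tuple layout, the function names, and the exact definition of the gap are my assumptions for illustration, not necessarily how the repo computes it.

```python
from collections import defaultdict
from statistics import mean

def same_family_gaps(judgments):
    """judgments: iterable of (judge_family, author_family, score) tuples.

    Returns {family: mean score its judges gave same-family answers
                     minus mean score they gave other-family answers}.
    Assumes self-judgments were removed upstream.
    """
    same, other = defaultdict(list), defaultdict(list)
    for judge_family, author_family, score in judgments:
        bucket = same if judge_family == author_family else other
        bucket[judge_family].append(score)
    # Only report families that have both kinds of judgments.
    return {fam: mean(same[fam]) - mean(other[fam])
            for fam in same if fam in other}

def bonferroni_survivors(p_values, alpha=0.05):
    """Keep families whose p-value is below alpha / number of tests."""
    threshold = alpha / len(p_values)
    return {fam for fam, p in p_values.items() if p < threshold}

# Toy data for illustration only, not taken from the dataset.
toy = [
    ("qwen", "qwen", 8.0), ("qwen", "qwen", 9.0),
    ("qwen", "mistral", 7.0), ("qwen", "mistral", 8.0),
    ("mistral", "mistral", 6.0), ("mistral", "qwen", 7.0),
]
gaps = same_family_gaps(toy)
```

With 8 families tested at alpha = 0.05, the Bonferroni threshold is 0.05 / 8 = 0.00625 per family.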

The positive in-group bias is the expected story. The negative ones are the interesting part. Mistral judges systematically rate other Mistral models a full point lower, the largest absolute bias in the set. I have not seen that reported before and I do not have a clean explanation. Could be training data, RLHF preference data, or stylistic self-penalty.

Two other things fell out of it. Aggregate leaderboards hide a lot: six different models hold the top spot across nine category pools, so "best model" is the wrong question. And code is where judges disagree most, with nearly double the disagreement seen in meta-alignment, which makes single-judge code eval especially shaky.
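One way to put a number on judge disagreement per category is the spread of scores that different judges gave the same response, averaged over responses. The sketch below uses the population standard deviation for that spread. The metric choice and the (category, response_id, score) layout are assumptions on my part, since the post does not say which disagreement measure was used.

```python
from collections import defaultdict
from statistics import mean, pstdev

def disagreement_by_category(judgments):
    """judgments: iterable of (category, response_id, score) tuples,
    one tuple per judge that graded the response.

    Returns {category: mean over responses of the std dev across judges}.
    """
    scores = defaultdict(list)
    for category, response_id, score in judgments:
        scores[(category, response_id)].append(score)
    spreads = defaultdict(list)
    for (category, _), vals in scores.items():
        if len(vals) >= 2:  # spread is meaningless for a single judge
            spreads[category].append(pstdev(vals))
    return {cat: mean(v) for cat, v in spreads.items()}

# Toy data for illustration only.
toy = [
    ("code", "r1", 2.0), ("code", "r1", 8.0),
    ("meta", "r2", 6.0), ("meta", "r2", 8.0),
]
spread = disagreement_by_category(toy)
```

In this toy example the code response has three times the judge spread of the meta response, which is the kind of ratio the comparison above is about.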

Repo and data: github.com/themultivac/multivac-evaluation
Paper: themultivac.com/papers/blind-peer-matrix.pdf

Where I think this needs to go next, and where I would welcome pushback:

  • Anchor to ground truth where it exists. The fair criticism of any peer setup is that it is LLMs judging LLMs. For code and math that is fixable: grade with a test suite or a verifier and use the judges only where execution cannot decide. In a recent code run the judges actually contradicted execution on a concurrency test, preferring an answer the tests failed, so this is not hypothetical.
  • Control the bias number for response quality. Right now the same-family bias is a raw score gap, which conflates real bias with the possibility that some families just produce better answers. The cleaner version holds the response fixed and compares same-family judges against other-family judges on the exact same output, via a within-response mixed-effects model. That isolates the judge effect from the answer's quality. This is the result I most want to harden.
  • Better aggregation than averaging. Means treat a lenient judge and a strict judge as equal. A Bradley-Terry or item-response model that estimates judge leniency and item difficulty jointly would give more honest rankings, and I would run it alongside the current numbers to see how much moves.
  • Test the mechanism behind same-family bias. If it is stylistic self-recognition, then paraphrasing a response to strip surface style should shrink the bias. That is a clean counterfactual and I have not seen it run.
  • Validate against humans, and fix the question monoculture. A human correlation study on a subset is the obvious gold-standard check, and I wrote all 198 questions myself, so multi-author or held-out real-world prompts would remove my fingerprints from the question design.
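The "hold the response fixed" idea from the second bullet can be sketched without a full mixed-effects model: for each response, compare the mean score from sibling judges with the mean score from other-family judges, then average those per-response differences by family. This is a simpler paired comparison that I am swapping in for illustration, not the mixed-effects model itself, and the tuple layout and function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def within_response_gap(judgments):
    """judgments: iterable of
    (response_id, author_family, judge_family, score) tuples,
    with self-judgments already excluded.

    For each response graded by both sibling and other-family judges,
    take mean(sibling scores) - mean(other-family scores), then average
    those differences within each author family.
    """
    by_response = defaultdict(lambda: {"family": None, "same": [], "other": []})
    for response_id, author_family, judge_family, score in judgments:
        entry = by_response[response_id]
        entry["family"] = author_family
        key = "same" if judge_family == author_family else "other"
        entry[key].append(score)
    diffs = defaultdict(list)
    for entry in by_response.values():
        # Responses missing either judge group cannot be compared.
        if entry["same"] and entry["other"]:
            diffs[entry["family"]].append(
                mean(entry["same"]) - mean(entry["other"]))
    return {fam: mean(d) for fam, d in diffs.items()}

# Toy data for illustration only.
toy = [
    ("r1", "qwen", "qwen", 9.0), ("r1", "qwen", "mistral", 8.0),
    ("r1", "qwen", "google", 7.0),
    ("r2", "qwen", "qwen", 6.0), ("r2", "qwen", "google", 5.5),
    ("r3", "mistral", "google", 7.0),  # no sibling judge, so skipped
]
paired = within_response_gap(toy)
```

Because every difference is taken on the exact same output, a family that simply writes better answers does not inflate the gap. A proper mixed-effects fit would additionally account for per-judge leniency, which this sketch ignores.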

The honest weak spots are that it is still LLMs judging LLMs, and I wrote every question. I would rather hear the methodology critique now than after I submit it. What would you want to see before trusting these numbers?

submitted by /u/Silver_Raspberry_811