Eugene Yan · 2 min read

Weights & Biases LLM-Evaluator Hackathon - Hackathon Judge

Mirrored from Eugene Yan for archival readability. Support the source by reading on the original site.

[ llm eval ]

This weekend, I had the opportunity to judge the Weights & Biases LLM-Evaluator Hackathon. Over two days, more than 100 people took part, with 15 teams demoing their work on day two. The teams built creative and practical projects such as constructing and validating knowledge graphs from documents, evaluating LLMs on MBTI traits and creativity, optimizing evaluation prompts, evaluating multi-turn conversations, and more.

I was invited to kick off the hackathon with a short talk, and took the chance to discuss:

  • Things to consider when using LLM-evaluators: What is our baseline? How will the LLM-evaluator score responses? Which metrics should we evaluate LLM-evaluators on?
  • A decision tree to decide on scoring methods, metrics, and evaluator vs. guardrail
  • Open questions on LLM-evaluator performance, alignment, and integration
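On that last point, one common way to evaluate an LLM-evaluator is to measure its agreement with human labels, treating the human annotations as the baseline. Here's a minimal sketch using Cohen's kappa (chance-corrected agreement); the function name and the sample pass/fail labels are hypothetical, not from the talk:

```python
from collections import Counter

def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between human labels and LLM-evaluator labels."""
    assert len(human) == len(judge) and human, "need equal-length, non-empty label lists"
    n = len(human)
    # Observed agreement: fraction of items where judge matches human
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement by chance, from each rater's label distribution
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(h_counts[l] * j_counts[l] for l in labels) / (n * n)
    if expected == 1.0:  # degenerate case: both raters always emit the same label
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical labels: human annotations vs. an LLM-evaluator's verdicts
human = ["pass", "pass", "fail", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(human, judge), 3))  # → 0.333
```

A kappa near 0 means the evaluator agrees with humans no better than chance, even if raw agreement looks high — which is why chance-corrected metrics beat raw accuracy when label distributions are skewed.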


I was impressed by the level of effort and care that went into the demos, with some teams hacking all the way till 10pm on Saturday night (and having to be kicked out of the building). From the demos, the teams accomplished A LOT in the span of one and a half days. The top team won Meta Ray-Bans for each member of the team.

congrats to the winners! pic.twitter.com/BYfT9prkDK

— eugene (@eugeneyalt) September 23, 2024


Overall, everyone had a great time hacking and giving demos. I also hacked on something of my own and hope to share it soon. Yes, it’s also LLM-evaluator related, focused on the UX/UI with the goal of making labeling and evaluation more effective and fun.


If you found this useful, please cite this write-up as:

Yan, Ziyou. (Sep 2024). Weights & Biases LLM-Evaluator Hackathon - Hackathon Judge. eugeneyan.com. https://eugeneyan.com/speaking/hackathon-judge/.

or

@article{yan2024judge,
  title   = {Weights \& Biases LLM-Evaluator Hackathon - Hackathon Judge},
  author  = {Yan, Ziyou},
  journal = {eugeneyan.com},
  year    = {2024},
  month   = {Sep},
  url     = {https://eugeneyan.com/speaking/hackathon-judge/}
}



