In basic chatbots, errors are annoyances. In medical LLMs, errors can have life-threatening consequences 🩸
It's therefore vital to benchmark and track advances in medical LLMs before even thinking about deployment.
This is why a small research team introduced a medical LLM leaderboard: it produces reproducible, comparable results across LLMs and lets everyone follow advances in the field.
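Reproducible comparisons like this usually come down to scoring every model on the same fixed multiple-choice medical QA benchmarks. As a minimal sketch of that idea (not the leaderboard's actual pipeline), here's how one could score a model on a multiple-choice question by comparing the log-likelihood it assigns to each answer option; the model name, question, and options below are all placeholders:

```python
# Minimal sketch of multiple-choice QA scoring via log-likelihoods.
# Assumptions: any causal LM from the Hugging Face Hub works here; the
# question and options are placeholders, not real leaderboard data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in the medical LLM you want to test
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def option_loglikelihood(question: str, option: str) -> float:
    """Sum of log-probs the model assigns to the option tokens given the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    option_ids = tokenizer(" " + option, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probs for each token, predicted from the preceding position.
    log_probs = logits[:, :-1].log_softmax(-1)
    targets = input_ids[:, 1:]
    token_lps = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the log-probs of the option's own tokens.
    return token_lps[0, -option_ids.shape[1]:].sum().item()

question = "Q: Which vitamin deficiency causes scurvy?\nA:"
options = ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"]
scores = {o: option_loglikelihood(question, o) for o in options}
print(max(scores, key=scores.get))  # the model's pick
```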
Contamination-free code evaluations with LiveCodeBench! 🖥️
LiveCodeBench is a new leaderboard, which contains:
- complete code evaluations (code generation, self-repair, code execution, tests)
- my favorite feature: problem selection by publication date 📅
This feature means you can average model scores over only the problems published after a given model's training cutoff, so they can't have appeared in its training data. In other words... contamination-free code evals! 🚀
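Here's a minimal sketch of that filtering idea (the problem records and cutoff date are hypothetical, and LiveCodeBench's real pipeline differs, but the principle is the same):

```python
# Minimal sketch of date-filtered scoring for contamination-free evals.
from datetime import date

# Each record: (publication date, did the model solve it?) -- hypothetical data.
results = [
    (date(2023, 6, 1), True),
    (date(2023, 11, 20), False),
    (date(2024, 2, 14), True),
    (date(2024, 4, 2), True),
]

def pass_rate_after(cutoff: date, results) -> float:
    """Average solve rate over problems published strictly after `cutoff`."""
    fresh = [solved for published, solved in results if published > cutoff]
    return sum(fresh) / len(fresh) if fresh else float("nan")

# Only problems released after the model's training cutoff count.
training_cutoff = date(2023, 9, 1)  # hypothetical cutoff for the model
print(f"contamination-free pass rate: {pass_rate_after(training_cutoff, results):.2f}")
```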
The new RL leaderboard evaluates agents in 87 environments (from Atari 🎮 to motion-control simulations 🚶 and more)!
When you submit your model, it's run and evaluated in real time, and the leaderboard displays short videos of the best models' runs, which is super fun to watch! ✨
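Under the hood, an evaluation like this boils down to rolling out the submitted agent in an environment and recording episode returns (and, for the videos, rendered frames). Here's a minimal sketch with Gymnasium, using a random policy as a stand-in for a submitted agent; the environment and episode count are arbitrary examples, not the leaderboard's actual setup:

```python
# Minimal sketch of an agent evaluation rollout with Gymnasium.
import gymnasium as gym

env = gym.make("CartPole-v1")  # swap in Atari, motion-control envs, etc.

def evaluate(env, n_episodes: int = 10) -> float:
    """Average undiscounted return over n_episodes with a random policy."""
    returns = []
    for episode in range(n_episodes):
        obs, info = env.reset(seed=episode)
        done, total = False, 0.0
        while not done:
            action = env.action_space.sample()  # replace with agent.act(obs)
            obs, reward, terminated, truncated, info = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return sum(returns) / len(returns)

print(f"mean return: {evaluate(env):.1f}")
env.close()
```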
Kudos to @qgallouedec for creating and maintaining the leaderboard! Let's find out which agent is the best at games! 🚀