🏆 Leaderboards & Arenas

zh-ai-community's Collections

updated Jan 23

Open Agent Leaderboard 🥇
Status: Running. Likes: 14.
Note: By Om AI Lab.

CompassJudger Subjective Evaluation Leaderboard 🌎
Status: Running. Likes: 4.
Note: By Shanghai AI Lab.

Open VLM Leaderboard 🌎
VLMEvalKit Evaluation Results Collection
Status: Running on CPU Upgrade. Likes: 633.
Note: By OpenMMLab. The OpenVLM Leaderboard evaluates and ranks 62 Vision-Language Models (VLMs) across 23 multi-modal benchmarks using VLMEvalKit. It features only open-source models or models with a publicly available API.

Open Chinese LLM Leaderboard 🏆
Display and filter LLM benchmark results
Status: Running on CPU Upgrade. Likes: 109.
Note: By BAAI. The Open Chinese LLM Leaderboard aims to track, rank, and evaluate open Chinese large language models (LLMs). It is powered by the FlagEval platform, which provides the computational resources and runtime environment. The evaluation dataset consists entirely of Chinese data, so it assesses Chinese language proficiency.

FlagEval-Arena 🐢
Arena
Status: Running. Likes: 4.
Note: By BAAI. Featuring 50 popular closed-source models from China and beyond!

OpenCompass LLM Leaderboard 🚀
Display a web page
Status: Running. Likes: 91.
Note: By Shanghai AI Lab. An LLM leaderboard for Chinese models that reports many metric axes and is very comprehensive.

EvalCrafter ⚡
Display leaderboard data for video generation models
Status: Running. Likes: 32.
Note: By Tencent AI. Text-to-video generation leaderboard.

GenAI Arena 📈
Realtime Image/Video Gen AI Arena
Status: Running on Zero. Likes: 268.
Note: By Tiger Lab. An arena for image generation!

LLM Leaderboard for SEA 🥇
Browse leaderboard of language models
Status: Running. Likes: 18.
Note: By Alibaba DAMO. Southeast Asian (SEA) languages leaderboard.

AIR-Bench Leaderboard 🥇
Explore benchmark results for QA and long doc models
Status: Running on CPU Upgrade. Likes: 66.
Note: By Jina AI and BAAI. A new benchmark focused on fair out-of-domain evaluation for RAG and neural IR.

Science Leaderboard 👁
Leaderboard for LLM for Science Reasoning
Status: Running. Likes: 9.
Note: By Tiger Lab. Leaderboard for science reasoning.

VBench Leaderboard 📊
Upload and evaluate video models
Status: Running. Likes: 183.
Note: By Shanghai AI Lab. Leaderboard for video generative models.

CompassArena 🏢
Explore and interact with AI assistant capabilities
Status: Running. Likes: 20.

JudgerBench Leaderboard 🌎
Status: Running. Likes: 4.

ChronoMagic Bench 🥇
A Benchmark for Metamorphic Evaluation of T2V Generation
Status: Running. Likes: 22.
Note: By PKU-Yuan group. ChronoMagic-Bench is the first benchmark dedicated to assessing how well text-to-video (T2V) models generate time-lapse videos with significant metamorphic amplitude and temporal coherence. It probes T2V models' grasp of physics, biology, and chemistry under free-form text control.

TempCompass 🥇
Submit and view model evaluation data
Status: Running. Likes: 11.
Note: TempCompass is a benchmark to evaluate the temporal perception ability of Video LLMs.

K-Sort Arena 📈
Efficient Image/Video K-Sort Arena
Status: Running on Zero. Likes: 45.
Note: K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences.

Salad Bench Leaderboard 🏢
Display model leaderboard from Excel data
Status: Running. Likes: 8.
Note: Leaderboard for LLM safety.
