M-RewardBench: Evaluating Reward Models in Multilingual Settings Paper • 2410.15522 • Published Oct 2024 • 10
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content Paper • 2410.10783 • Published Oct 2024 • 25
Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation Paper • 2407.13696 • Published Jul 18 • 5
Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP Paper • 2407.00402 • Published Jun 29 • 22
BM25 for Python: Achieving high performance while simplifying dependencies with *BM25S*⚡ Article • By xhluca • Jul 9 • 36 (usage sketch after this list)
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild Paper • 2406.04770 • Published Jun 7 • 26
Mixture-of-Agents Enhances Large Language Model Capabilities Paper • 2406.04692 • Published Jun 7 • 55 (toy sketch after this list)
Large Language Model Confidence Estimation via Black-Box Access Paper • 2406.04370 • Published Jun 2024 • 19
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Article • Mar 20 • 65
Granite Code Models: A Family of Open Foundation Models for Code Intelligence Paper • 2405.04324 • Published May 7 • 22
Releasing Common Corpus: the largest public domain dataset for training LLMs Article • By Pclanglais • Mar 20 • 17
Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation Paper • 2402.10210 • Published Feb 15 • 29
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning Paper • 2402.06619 • Published Feb 9 • 54
Genie: Achieving Human Parity in Content-Grounded Datasets Generation Paper • 2401.14367 • Published Jan 25 • 6
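The BM25S article above describes a pure-Python BM25 library that keeps dependencies minimal while scoring quickly. A minimal usage sketch, adapted from the library's published quick-start (the corpus and query strings are illustrative; check the bm25s repository for the current API):

```python
import bm25s  # pip install bm25s

# Illustrative corpus; any list of strings works.
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

# Tokenize and index the corpus once.
retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(corpus))

# Retrieve the top-k documents for a query; results and scores
# come back as (n_queries, k) arrays.
query = "does the fish purr like a cat?"
results, scores = retriever.retrieve(bm25s.tokenize(query), corpus=corpus, k=2)

for i in range(results.shape[1]):
    print(f"Rank {i + 1} (score {scores[0, i]:.2f}): {results[0, i]}")
```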
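The Mixture-of-Agents paper proposes layering several LLMs so that each layer's models answer the query while conditioning on the previous layer's responses, with a final aggregator model synthesizing one reply. A toy sketch of that layered pattern, not the paper's implementation: the `LLM` callable, the prompt templates, and the echo models below are all illustrative assumptions.

```python
from typing import Callable, List

# Stand-in for any chat-completion call: prompt in, text out.
LLM = Callable[[str], str]

def mixture_of_agents(query: str, layers: List[List[LLM]], aggregator: LLM) -> str:
    answers: List[str] = []
    for models in layers:
        # Each layer sees the previous layer's answers as auxiliary context.
        prompt = query if not answers else (
            query + "\n\nPrevious responses to consider:\n" + "\n---\n".join(answers)
        )
        answers = [model(prompt) for model in models]
    # A final aggregator fuses the last layer's answers into one response.
    final_prompt = query + "\n\nSynthesize these responses:\n" + "\n---\n".join(answers)
    return aggregator(final_prompt)

if __name__ == "__main__":
    # Dummy "models" that just echo, to show the data flow.
    echo = lambda tag: (lambda p: f"[{tag}] answer to: {p.splitlines()[0]}")
    print(mixture_of_agents(
        "What is BM25?",
        layers=[[echo("A"), echo("B")], [echo("C")]],
        aggregator=echo("agg"),
    ))
```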