OpenCoder Collection OpenCoder is an open and reproducible code LLM family which matches the performance of top-tier code LLMs. • 8 items • Updated Nov 23 • 78
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages Paper • 2410.23825 • Published Oct 31 • 3
LLM Reasoning Papers Collection Papers to improve reasoning capabilities of LLMs • 17 items • Updated 2 days ago • 89
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment Paper • 2410.05873 • Published Oct 8 • 3
MaskLID: Code-Switching Language Identification through Iterative Masking Paper • 2406.06263 • Published Jun 10 • 5
view article Article DuckDB: run SQL queries on 50,000+ datasets on the Hugging Face Hub Jun 7, 2023 • 4
CommonCatalog Collection Common Catalog, a dataset with Creative Commons licensed images and machine-generated caption pairs • 8 items • Updated May 16 • 14
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only Paper • 2306.01116 • Published Jun 1, 2023 • 32
LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons Paper • 2402.14086 • Published Feb 21 • 9
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model Paper • 2402.07827 • Published Feb 12 • 45