Wow, impressive 340B model by nvidia with a nice permissive license! 🚀 The technical report is full of insights and seems to use a different learning rate schedule than cosine, probably a variant of WSD. Hope to get more info on that! 👀
nvidia/nemotron-4-340b-666b7ebaf1b3867caf2f1911
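For readers unfamiliar with the schedule mentioned in the post, here is a minimal sketch of how a WSD schedule differs from cosine. It assumes WSD means warmup-stable-decay, with a linear warmup, a constant plateau, and a linear decay at the end. The function names, parameters, and the linear decay shape are illustrative assumptions; the post does not say which variant the Nemotron report actually uses.

```python
import math


def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Hypothetical warmup-stable-decay schedule (linear warmup, flat, linear decay)."""
    if decay_steps <= 0 or warmup_steps + decay_steps > total_steps:
        raise ValueError("need 0 < decay_steps and warmup + decay <= total_steps")
    if step < warmup_steps:
        # Linear warmup that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase: the learning rate stays at its peak.
        return peak_lr
    # Decay phase: linear interpolation from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


def cosine_lr(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Cosine schedule with the same linear warmup, for comparison."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print both schedules at a few steps of a 1000-step run.
    for s in (0, 99, 500, 800, 900, 1000):
        print(s, wsd_lr(s, 1000, 1e-3, 100, 200), cosine_lr(s, 1000, 1e-3, 100))
```

With cosine, the learning rate starts falling right after warmup and the curve depends on the total step count chosen up front. In this sketch of WSD, the rate stays at its peak until the final decay window, so the plateau can in principle be extended before the decay is applied.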
Cool papers
Efficient Streaming Language Models with Attention Sinks • Paper • 2309.17453 • Published Sep 29, 2023 • 13
Simple and Controllable Music Generation • Paper • 2306.05284 • Published Jun 8, 2023 • 147
FinGPT: Large Generative Models for a Small Language • Paper • 2311.05640 • Published Nov 3, 2023 • 28
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers • Paper • 2305.07185 • Published May 12, 2023 • 9
LLM.C Fineweb vs Edu-Fineweb
eliebak/wsd_124M_150B_edu • Text Generation • Updated Jun 11, 2024 • 19
eliebak/wsd_124M_150B_fw • Text Generation • Updated Jun 11, 2024 • 19
eliebak/wsd_124M_300B_edu • Text Generation • Updated Jun 11, 2024 • 18
eliebak/wsd_124M_300B_fw • Text Generation • Updated Jun 11, 2024 • 19