OpenELM: An Efficient Language Model Family with Open Training and Inference Framework
Paper: arXiv 2404.14619
Note: the attached Appendix A screenshot shows the pre-training and instruction fine-tuning hyper-parameters, comparing PyTorch FSDP (used for pre-training) with DeepSpeed ZeRO-3 (used for fine-tuning). Infra cost summary:
- Pre-training GPU time: 13 days for the 3B model with PyTorch FSDP on 128 H100 GPUs (80 GB VRAM each)
- Fine-tuning GPU time: 14.2 hours for the 3B model with DeepSpeed ZeRO-3 on 8 A100 GPUs (80 GB VRAM each)
Minimal sketches of the two setups follow below.
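Since the note contrasts the two distributed-training stacks, here is a minimal, hypothetical sketch of an FSDP pre-training setup of the kind described. The tiny stand-in model, learning rate, and dtype choices are assumptions for illustration, not OpenELM's actual training code.

```python
# Hypothetical sketch of a PyTorch FSDP pre-training setup (not OpenELM's code).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

dist.init_process_group("nccl")  # one process per GPU, e.g. 128 H100s
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Tiny stand-in for the ~3B-parameter model; sizes are illustrative only.
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
    num_layers=4,
).cuda()

# FULL_SHARD splits parameters, gradients, and optimizer state across ranks,
# which is what keeps a 3B pre-training run within 80 GB of VRAM per GPU.
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                   reduce_dtype=torch.bfloat16),
    device_id=torch.cuda.current_device(),
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # assumed LR
```

And a comparable sketch of the DeepSpeed ZeRO-3 fine-tuning side. The config keys are standard DeepSpeed options, but the specific values (batch size, learning rate, accumulation steps) are assumptions, not the paper's settings.

```python
# Hypothetical DeepSpeed ZeRO-3 fine-tuning setup (values are illustrative).
import deepspeed
import torch

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,  # assumed
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # ZeRO-3: shard params, grads, and optimizer state
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},  # assumed LR
}

model = torch.nn.Linear(1024, 1024)  # stand-in for the 3B model
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

Either sketch would typically be launched with one process per GPU, e.g. via torchrun for FSDP or the deepspeed launcher for ZeRO-3.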