Ermakov Petr

ermakovpetr

AI & ML interests

LLM, Search, Diffusion

ermakovpetr's activity

liked a model about 15 hours ago

yandex/YandexGPT-5-Lite-8B-pretrain

Updated about 8 hours ago • 22 • 90

reacted to artnitolog's post with 🤝🤗🚀🔥❤️👍 9 days ago

Post

2558

Recently, we open-sourced YaFSDP, Yandex’s tool for efficient distributed training of LLMs.

Here are some of the key ideas used in YaFSDP to provide speedup and memory savings over FSDP:
• Allocate and utilize just two buffers throughout the transformer for all collected weights to circumvent the torch memory allocator;
• Gather small normalization layers at the beginning of the iteration and average the gradients only at the end;
• Move gradient division to the very end of the backward pass.

To learn more about how YaFSDP works, check out our latest blog post: https://medium.com/yandex/yafsdp-a-tool-for-faster-llm-training-and-optimized-gpu-utilization-is-no-632b7539f5b3
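
The third idea in the post, moving gradient division to the end, rests on simple arithmetic: averaging gradients by dividing each contribution as it arrives gives the same result as summing everything and dividing once. Below is a minimal sketch of that equivalence in plain Python. It is not YaFSDP code; the function names and the toy gradient values are hypothetical, and lists stand in for per-worker gradient tensors.

```python
def average_divide_per_step(per_worker_grads):
    """Divide each worker's contribution by the worker count before accumulating."""
    world_size = len(per_worker_grads)
    total = [0.0] * len(per_worker_grads[0])
    for grads in per_worker_grads:
        for i, g in enumerate(grads):
            total[i] += g / world_size  # one division per element per worker
    return total


def average_divide_at_end(per_worker_grads):
    """Accumulate raw sums first, then divide once per element at the end."""
    world_size = len(per_worker_grads)
    total = [0.0] * len(per_worker_grads[0])
    for grads in per_worker_grads:
        for i, g in enumerate(grads):
            total[i] += g
    return [t / world_size for t in total]  # one division per element overall


# Hypothetical toy data: two workers, two gradient elements each.
toy_grads = [[1.0, 2.0], [3.0, 4.0]]
early = average_divide_per_step(toy_grads)
late = average_divide_at_end(toy_grads)
```

Both functions return the same averages for this input. The second performs the division once per element instead of once per element per worker, which is the kind of saving the post alludes to; how YaFSDP schedules this inside the backward pass is described in the linked blog post, not here.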

upvoted a paper 3 months ago

Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

Paper • 2412.01819 • Published Dec 2, 2024 • 35

liked a model 5 months ago

ISTA-DASLab/Meta-Llama-3.1-70B-Instruct-AQLM-PV-2Bit-1x16

Text Generation • Updated Sep 17, 2024 • 120 • 46

upvoted a paper 7 months ago

Does Diffusion Beat GAN in Image Super Resolution?

Paper • 2405.17261 • Published May 27, 2024 • 20

upvoted a paper 9 months ago

SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

Paper • 2406.02532 • Published Jun 4, 2024 • 13