A Few Errors

#86
by gordicaleksa - opened

Awesome work! And thanks for linking my Flash Attention blog post :)

Caught a few errors while reading (WIP: I'll add more as I go through the whole thing):

Typos:

  1. Cheatsheet glossary: ep -> "expert parallelism degree" not "context parallelism degree"
  2. "PROFILING THE MEMORY USAGE" -> "througho ut training" -> "throughout training"
  3. "extremely usefull" -> "extremely useful"
  4. "attention module will requires" -> "require"
  5. "the memory savings in activations when using TP with SP helps us fit far bigger batches than TP alone" mentioned twice (in succession) in the summarization section of the TP/SP chapter, i.e. bullet points 2 & 3 are the same
  6. "As you can see, ZeRO-3 and PP sove" -> "solve"
  7. "need to be balanced in Pipaline Parallelism," -> "Pipeline"
  8. "that are actually used to distribute and training larger" -> "train larger"
  9. "Efficiently accessing data from global memory can improve a lot the performance." -> "can improve performance by a lot"
  10. "Let's briefly mentionned" -> "Let's briefly go through"
  11. "For float16 it is ..." -> there is a weird tilda (~) over 10^-3 here
  12. "and when you should should be ready to follow the blog post easily." -> "and you should now be ready to follow the blog post easily."
    (note: maybe just run it once through Grammarly's free tier :) you can Ctrl+F the strings on the left side to find matches for the errors I found)

Logic:

  1. "Throughput Scaling with TP/SP (3B Model)" -> for TP=32 you get 41.4% whereas for TP=16 you get 43.4% (so it gets better :) despite the chart & logic showing the opposite)
  2. In general I'm a bit suspicious of the TP vs TP/SP throughput scaling / maximum batch size plots; it seems like for TP=32 you can have 5x the batch size just due to SP?
  3. "Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This is because each GPU still needs to process the full batch of data, just with different layers" <- pipeline parallelism, this doesn't make sense? Activations for only a subset of layers now need to be kept on the GPU. Or if assuming act checkpointing it's the same conclusion, assuming we keep 4 layers per GPU now you need 4 @ X memory (assuming simplistically that you store activations at the beginning of each transformer layer) vs 4 @ X @ PP where PP is the number of stages in pipeline parallelism (note: using @ bc of rendering issues with asterisk).
  4. The final table in the "5D parallelism in a nutshell" section has errors in the "Disadvantage" and "Parallel/sharding dimension" columns for ZeRO-1, ZeRO-2, and ZeRO-3.
  5. (A2: "Typical scales in LLM training" section): "So total optimizer states will be around (6 x h^2) per weight matrix" -> this should be 12 x h^2 given that we need fp32, right? (see the arithmetic check after this list)
  6. In the A3 section it might be worth mentioning (since the book is meant even for those who lack background) that attention FLOPs are dropped because they're (usually, assuming shorter context lengths) negligible. E.g. you set the FLOPs for a single transformer layer to 32 x seq x mbs x h^2, so per token you have 32 x h^2: that's 4 x (2 x h^2) [due to the 4 matrices Q, K, V, O, with each matmul op taking 2 FLOPs per multiply-add] plus 3 x (4 x 2 x h^2) [assuming a gated unit, otherwise it would be 2, and assuming the intermediate dim is 4x; that's a lot of assumptions for someone new, heh] (see the arithmetic check after this list).
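
To make item 3 concrete, here is a minimal sketch under made-up numbers (hypothetical layer count, one activation checkpoint per layer, and ignoring any extra in-flight microbatches the pipeline schedule keeps alive):

```python
# Rough illustration of item 3: with pipeline parallelism each GPU only keeps
# activations for its own stage's layers, not for the whole model.
total_layers = 16        # layers in the full model (made-up number)
pp = 4                   # number of pipeline stages
X = 1.0                  # activation memory of one layer, arbitrary units

act_without_pp = total_layers * X        # one GPU holding all layers' activations: 4 * X * PP
act_with_pp = (total_layers // pp) * X   # 4 layers per GPU: 4 * X

print(act_without_pp, act_with_pp)       # 16.0 vs 4.0 -> not the same per GPU
```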
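
And a tiny arithmetic check for items 5 and 6 (the hidden size is a made-up value; this assumes Adam with fp32 momentum, variance, and master weights, a gated MLP with 4h intermediate dim, and attention score/value FLOPs dropped as negligible):

```python
# Quick arithmetic behind items 5 and 6.
h = 4096  # hypothetical hidden size

# Item 5: Adam state per (h, h) weight matrix with fp32 master weights:
# fp32 momentum + fp32 variance + fp32 master copy = 3 values * 4 bytes = 12 bytes/param.
optimizer_bytes = 12 * h**2

# Item 6: forward FLOPs per token for one transformer layer (2 FLOPs per multiply-add).
attn_projections = 4 * (2 * h**2)     # Q, K, V, O projections -> 8 h^2
gated_mlp = 3 * (4 * 2 * h**2)        # gate/up/down matrices with 4h intermediate -> 24 h^2
flops_per_token = attn_projections + gated_mlp

assert flops_per_token == 32 * h**2   # matches the book's 32 * seq * mbs * h^2 per layer
print(optimizer_bytes, flops_per_token)
```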
Nanotron Research org

Thanks for the feedback! Corrected the typos (except the one in the cheatsheet, which I don't have access to). Will let @nouamanetazi correct/answer the logic part :)