---
title: "Scaling Model Training with More Compute, How Do They Do It?"
format:
  revealjs:
    theme: moon
    fig-format: png
---
## Who am I?
- Zachary Mueller
- Technical Lead for the π€ Accelerate project
- API design geek
## Understanding GPU Usage
- We can roughly estimate the memory usage of vanilla full fine-tuning of a model
- Requires certain assumptions (that I'll be covering):
  - Adam optimizer
  - Batch size of 1
## Understanding GPU Usage
General estimate (`bert-base-cased`, 108M params):
- Each parameter is 4 bytes
- Backward ~= 2x the model size
- The optimizer step ~= 4x the model size (1x model, 1x gradients, 2x optimizer states); the arithmetic is worked through in the sketch after the table:
::: {style="font-size: 50%;"}
| dtype | Model | Gradients | Backward pass | Optimizer step | Highest |
|---------|:-----|:------:|:------:|:------:|:------:|
| float32 | 413.18 MB | 413.18 MB | 826.36 MB | 1.61 GB | 1.61 GB |
| float16 | 413.18 MB* | 619.77 MB | 826.36 MB | 826.36 MB | 826.36 MB |
*All estimations were based on the [Model Estimator Tool](https://huggingface.co/spaces/hf-accelerate/model-memory-usage)
:::
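A quick back-of-the-envelope version of that arithmetic (a rough sketch; the parameter count below is approximate, the real numbers come from the estimator tool):

```python
# Rough fp32 estimate for a ~108M-parameter model (e.g. bert-base-cased)
num_params = 108_310_000        # approximate parameter count
bytes_per_param = 4             # float32

model = num_params * bytes_per_param
gradients = model               # same size as the model
backward = 2 * model            # model + gradients resident together
optimizer_step = 4 * model      # model + gradients + 2x Adam states

for name, n_bytes in [("Model", model), ("Gradients", gradients),
                      ("Backward pass", backward), ("Optimizer step", optimizer_step)]:
    print(f"{name}: {n_bytes / 2**20:,.2f} MB")  # ~413 MB, 413 MB, 826 MB, ~1.61 GB
```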
## Understanding GPU Usage
This works fine for small models; we have cards with anywhere from 12-24GB of GPU memory (on the GPU-poor side).
But what happens as we scale?
Here's `llama-3-8B` (8.03B parameters)
::: {style="font-size: 50%;"}
| dtype | Model | Gradients | Backward pass | Optimizer step | Highest |
|---------|:-----|:------:|:------:|:------:|:------:|
| float32 | 28.21 GB | 28.21 GB | 56.43 GB | 112.84 GB | 112.84 GB |
| float16 | 28.21 GB* | 42.32 GB | 56.43 GB | 56.43 GB | 56.43 GB |
:::
Well, *I* don't have 56GB of GPU memory in a single card, let alone 112GB.
What can we do?
# Distributed Training
## Kinds of Training
* Single GPU:
  * No distributed techniques at play
* Distributed Data Parallelism (DDP):
  * A full copy of the model exists on each device, but the data is chunked between the GPUs
* Fully Sharded Data Parallelism (FSDP) & DeepSpeed (DS):
  * Split chunks of the model and optimizer states across GPUs, allowing you to train bigger models on smaller (multiple) GPUs (see the quick arithmetic below)
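A quick worked example of why sharding helps, using the `llama-3-8B` fp32 numbers from the earlier table (a rough sketch that ignores activations and communication overhead):

```python
# Total fp32 training state (model + gradients + optimizer) for llama-3-8B, from the table
total_state_gb = 112.84

for n_gpus in (1, 2, 4, 8):
    per_gpu = total_state_gb / n_gpus   # fully sharded: every piece is split across ranks
    print(f"{n_gpus} GPU(s): ~{per_gpu:.1f} GB of sharded state per GPU")
# 1 -> ~112.8 GB, 2 -> ~56.4 GB, 4 -> ~28.2 GB, 8 -> ~14.1 GB
```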
# Fully Sharded Data Parallelism
## Fully Sharded Data Parallelism
![](fsdp.png)
:::{.notes}
* Take the model and split it across `n` GPUs
* Each GPU computes the shard's gradients
* At the end, all gradients are synchronized and the final full model gradient is calculated
* The backward pass can then be performed
:::
## FSDP: Getting parameter specific
* Different parameters dictate how much memory is needed in total when training across multiple GPUs
* These include how model weights are sharded, how gradients are handled, and more.
* I'll cover some important ones I needed when doing a full fine-tune of Llama-3-8B *without PEFT* on 2x 4090s
## `sharding_strategy`
* Dictates how resources are divvied up across the GPUs (the snippet below shows the underlying PyTorch enum)
* `FULL_SHARD`: Includes optimizer states, gradients, and parameters
* `SHARD_GRAD_OP`: Includes optimizer states and gradients
* `NO_SHARD`: Normal DDP
* `HYBRID_SHARD`: Includes optimizer states, gradients, and parameters, but shards only within a node and replicates the model across nodes
:::{.notes}
FULL_SHARD:
Parameters, Gradients, Optimizer States: All are sharded.
Parameters Handling: Unshard before forward pass, reshard after forward pass, unshard before backward pass, reshard after backward pass.
Gradients Handling: Synchronize and shard after backward pass.
Optimizer States: Updated locally per rank.
SHARD_GRAD_OP:
Gradients and Optimizer States: Sharded during computation.
Parameters: Unshard before forward pass, remain unsharded during forward pass, reshard after backward pass.
Inside no_sync(): Parameters are not resharded after backward computation.
Optimizer States: Updated locally per rank.
NO_SHARD:
Parameters, Gradients, Optimizer States: Not sharded, replicated across ranks.
Gradients Handling: Synchronized via all-reduce after backward pass.
Optimizer States: Updated locally per rank.
HYBRID_SHARD:
Parameters, Gradients, Optimizer States: Combines FULL_SHARD within a node and replicates parameters across nodes.
Communication: Expensive operations like all-gathers and reduce-scatters are limited to within a node, enhancing performance for medium-sized models.
:::
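These names come straight from PyTorch; a tiny illustrative snippet to see the enum that `fsdp_sharding_strategy` maps onto:

```python
from torch.distributed.fsdp import ShardingStrategy

# The accelerate config value corresponds to PyTorch's own enum
for strategy in ShardingStrategy:
    print(strategy.name)
# FULL_SHARD, SHARD_GRAD_OP, NO_SHARD, HYBRID_SHARD, ...
```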
## `auto_wrap_policy`:
* How the model should be split
* Can be either `TRANSFORMER_BASED_WRAP` or `SIZE_BASED_WRAP`
* `TRANSFORMER`/`fsdp_transformers_layer_cls_to_wrap`:
  * Need to declare the layer
  * Generally `transformers` has good defaults
* `SIZE`/`fsdp_min_num_param`:
  * Minimum number of parameters a module needs before it gets its own shard (see the sketch below)
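Roughly what those two options correspond to in PyTorch's FSDP wrap utilities (a sketch; `MyTransformerBlock` is a hypothetical layer class, and in practice `transformers`/Accelerate pick it for you):

```python
import functools
import torch.nn as nn
from torch.distributed.fsdp.wrap import (
    size_based_auto_wrap_policy,
    transformer_auto_wrap_policy,
)

class MyTransformerBlock(nn.Module):  # hypothetical transformer layer class
    pass

# TRANSFORMER_BASED_WRAP: shard on a declared layer class
transformer_policy = functools.partial(
    transformer_auto_wrap_policy, transformer_layer_cls={MyTransformerBlock}
)

# SIZE_BASED_WRAP: shard once a module holds enough parameters
size_policy = functools.partial(
    size_based_auto_wrap_policy, min_num_params=100_000_000
)
```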
## `offload_params`:
* Offloads the parameters and gradients to the CPU if they can't fit into memory
* Allows you to train much larger models locally, but will be much slower
> Case: A full fine-tune (FFT) of Llama-3-8B with `fsdp_offload_params` on 2x 4090 GPUs took ~72 hrs, vs. an hour or two on 1x H100
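At the PyTorch level this is the `CPUOffload` flag that gets set for you (a minimal sketch):

```python
from torch.distributed.fsdp import CPUOffload

# fsdp_offload_params: true roughly corresponds to handing this to FSDP
cpu_offload = CPUOffload(offload_params=True)
print(cpu_offload)
```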
## `cpu_ram_efficient_loading` and `sync_module_states`
* Uses the idea behind Big Model Inference (the `meta` device) to load the model onto the GPUs in a low-RAM scenario
* Rather than needing `model_size` * `n_gpus` of RAM, we load the model once and then send the weights directly to each shard when the time is right via `sync_module_states` (sketched below)
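A rough sketch of the loading trick with a toy model (single process shown; Accelerate and `transformers` wire this up for you when `fsdp_cpu_ram_efficient_loading` is enabled):

```python
import torch
import torch.nn as nn

def build_model() -> nn.Module:
    # Toy stand-in for a large model
    return nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

rank = 0  # in a real job this comes from the launcher environment

if rank == 0:
    model = build_model()          # one real copy of the weights in CPU RAM
else:
    with torch.device("meta"):
        model = build_model()      # shape-only skeleton: no memory spent on weights

# With sync_module_states enabled, rank 0's weights are then broadcast into every shard
```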
# Tying this to π€ Accelerate
## Tying this to π€ Accelerate
* So far we've covered the theory, but how do we put it into practice?
* By using a library that's at the heart of the entire open-source ecosystem
::: {style="font-size: 60%;padding-left:10%;padding-top:0%;"}
* Nearly all of π€
* `axolotl`
* `fastai`
* `FastChat`
* `lucidrains`
* `kornia`
:::
Are you using it without even knowing it?
## What is π€ Accelerate?
```{mermaid}
%%| fig-height: 6
graph LR
A(("π€ Accelerate#32;"))
A --> B["CLI Interface#32;"]
A --> C["Training Library#32;"]
A --> D["Big Model<br>Inference#32;"]
```
## A CLI Interface
* `accelerate config`
  * Configure the environment
* `accelerate estimate-memory`
  * How to guess vRAM requirements
* `accelerate launch`
  * How to run your script
## Launching distributed training is hard
- ```bash
python script.py
```
- ```bash
torchrun --nnodes=1 --nproc_per_node=2 script.py
```
- ```bash
deepspeed --num_gpus=2 script.py
```
How can we make this better?
## `accelerate launch`
```bash
accelerate launch script.py
```
## `accelerate config`
* Rely on `config.yaml` files
* Either run `accelerate config` to generate one, or write your own:
:::: {.columns style="font-size: 50%;padding-left:10%;"}
::: {.column width="40%"}
```{.yaml filename=ddp_config.yaml}
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
:::
::: {.column width="40%"}
```{.yaml filename=fsdp_config.yaml}
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
:::
::::
# A Training Library
## A Training Library: The Code
:::: {.columns style="font-size: 50%;"}
::: {.column}
<br><br><br>
```{.python code-line-numbers="5-6,9"}
# For alignment purposes
for batch in dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    inputs = inputs.to(device)
    targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    loss.backward()
    optimizer.step()
    scheduler.step()
```
:::
::: {.column}
```{.python code-line-numbers="1-7,12-13,16"}
from accelerate import Accelerator
accelerator = Accelerator()
dataloader, model, optimizer, scheduler = (
    accelerator.prepare(
        dataloader, model, optimizer, scheduler
    )
)
for batch in dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    # inputs = inputs.to(device)
    # targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    accelerator.backward(loss)  # loss.backward()
    optimizer.step()
    scheduler.step()
```
:::
::::
## A Training Library: How Scaling Works
* Accelerate's DataLoaders and schedulers work off of a sharding mindset
* Rather than repeating the same data across `n` nodes, we instead split it
* Speeds up training linearly
* Given a batch size of 16 on a single GPU, to recreate this across 8 GPUs you would use a batch size of 2
* This also means the scheduler is stepped `n` times (once per GPU) per "global step" (see the arithmetic sketch below)
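The arithmetic from the example above, spelled out (illustrative numbers):

```python
observed_batch_size = 16      # what you tuned on a single GPU
num_processes = 8             # GPUs now participating in training

per_device_batch_size = observed_batch_size // num_processes
print(per_device_batch_size)  # 2

# Each "global step" consumes 8 shards of 2 samples,
# and the scheduler is stepped once per process (8 times per global step)
```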
## A Training Library: Mixed Precision
* This may be a bit different from your "normal" idea of mixed precision.
* We do **not** convert the model weights to BF16/FP16
* Instead we **wrap the forward pass** with `autocast`, which handles the precision conversion automatically (see the sketch after this list)
* This preserves the original precision of the weights, which leads to stable training and better fine-tuning later on.
* **If you use `.bf16()` weights, you are STUCK in bf16 permanently**
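A minimal sketch of what "wrapping the forward pass" means in plain PyTorch (CPU autocast shown so it runs anywhere; on a GPU you would use `device_type="cuda"`):

```python
import torch

model = torch.nn.Linear(8, 8)  # fp32 weights, and they stay fp32
x = torch.randn(4, 8)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)               # the computation runs in bf16

print(model.weight.dtype)      # torch.float32  -> weights untouched
print(y.dtype)                 # torch.bfloat16 -> activations in lower precision
```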
## A Training Library: Mixed Precision
* Let's tie that back to the model estimator with neat tools like NVIDIA's TransformerEngine
::: {style="font-size: 60%;"}
| Optimization Level | Computation (GEMM) | Comm | Weight | Master Weight | Weight Gradient | Optimizer States |
| -- | -- | -- | -- | -- | -- | -- |
| FP16 AMP | FP16 | FP32 | FP32 | N/A | FP32 | FP32+FP32 |
| Nvidia TE | FP8 | FP32 | FP32 | N/A | FP32 | FP32+FP32 |
| MS-AMP O1 | FP8 | FP8 | FP16 | N/A | FP8 | FP32+FP32 |
| MS-AMP O2 | FP8 | FP8 | FP16 | N/A | FP8 | FP8+FP16 |
| MS-AMP O3 | FP8 | FP8 | FP8 | FP16 | FP8 | FP8+FP16 |
:::
:::{.notes}
What is actually happening:
* Linear layers and certain other compatible layers are wrapped in a special version that allows for FP8 computation
* The general forward pass is wrapped in BF16
* This means most of the memory savings come from the gradients of the model, *not* the model itself.
* With tools like `MS-AMP` we can convert more chunks into lower precision, but as before, training is most stable when the model's weights stay in full precision and the backprop happens in full precision too.
:::
## DeepSpeed vs Fully Sharded Data Parallelism
* Extremely similar; they mostly use different naming conventions and have slight tweaks in their implementations
::: {style="font-size: 50%;"}
Framework | Model Loading (`torch_dtype`) | Mixed Precision | Preparation (Local) | Training | Optimizer (Local)
--|--|--|--|--|--
FSDP | bf16 | default (none) | bf16 | bf16 | bf16
FSDP | bf16 | bf16 | fp32 | bf16 | fp32
DeepSpeed | bf16 | bf16 | fp32 | bf16 | fp32
:::
To learn more, check out the [documentation](https://huggingface.co/docs/accelerate/concept_guides/fsdp_and_deepspeed) or join my office hours
## Key Takeaways:
* You can scale out training with `accelerate`, FSDP, and DeepSpeed across multiple GPUs to train bigger models
* Techniques like `FP8` can help speed up training and reduce computational overhead
* This comes at a cost of end precision and can lock the model weights for further fine-tunes if you're not careful
## Some Handy Resources
- [π€ Accelerate documentation](https://hf.co/docs/accelerate)
- [Launching distributed code](https://huggingface.co/docs/accelerate/basic_tutorials/launch)
- [Distributed code and Jupyter Notebooks](https://huggingface.co/docs/accelerate/basic_tutorials/notebook)
- [Migrating to π€ Accelerate easily](https://huggingface.co/docs/accelerate/basic_tutorials/migration)
- [Big Model Inference tutorial](https://huggingface.co/docs/accelerate/usage_guides/big_modeling)
- [DeepSpeed and π€ Accelerate](https://huggingface.co/docs/accelerate/usage_guides/deepspeed)
- [Fully Sharded Data Parallelism and π€ Accelerate](https://huggingface.co/docs/accelerate/usage_guides/fsdp)
- [FSDP vs DeepSpeed In-Depth](https://huggingface.co/docs/accelerate/concept_guides/fsdp_and_deepspeed)