```mermaid
graph LR
    A(("🤗 Accelerate"))
    A --> B["CLI Interface"]
    A --> C["Training Library"]
    A --> D["Big Model<br>Inference"]
```
General estimate (bert-base-cased, 108M params):
| dtype | Model | Gradients | Backward pass | Optimizer step | Highest |
|---|---|---|---|---|---|
| float32 | 413.18 MB | 413.18 MB | 826.36 MB | 1.61 GB | 1.61 GB |
| float16 | 413.18 MB* | 619.77 MB | 826.36 MB | 826.36 MB | 826.36 MB |
*All estimations were based on the Model Estimator Tool.
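Where do these numbers come from? A rough back-of-the-envelope sketch (my own accounting, not the estimator tool's exact method) reproduces the float32 row from the parameter count alone, assuming 4 bytes per parameter, gradients the same size as the model, and an Adam-style optimizer that keeps two extra float32 states per parameter:

```python
# Back-of-the-envelope check of the float32 row above (binary units, as in the table).
# Assumptions (mine): 4 bytes/param, gradients the same size as the model,
# and an Adam-style optimizer with two extra float32 states per parameter.
N_PARAMS = 108_310_272  # bert-base-cased, ~108M parameters


def fmt(num_bytes: float) -> str:
    gib = num_bytes / 2**30
    return f"{gib:.2f} GiB" if gib >= 1 else f"{num_bytes / 2**20:.2f} MiB"


model_bytes = N_PARAMS * 4              # float32 weights
grad_bytes = model_bytes                # gradients match the weights
backward = model_bytes + grad_bytes     # weights + gradients held together
optimizer = backward + 2 * model_bytes  # + Adam's exp_avg and exp_avg_sq

print(fmt(model_bytes))  # ~413 MiB -> "Model" / "Gradients"
print(fmt(backward))     # ~826 MiB -> "Backward pass"
print(fmt(optimizer))    # ~1.61 GiB -> "Optimizer step" / "Highest"
```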
This works fine for small models; most of us have cards with anywhere from 12–24 GB of GPU memory (on the GPU-poor side).
But what happens as we scale?
Here’s llama-3-8B (8.03B parameters):
| dtype | Model | Gradients | Backward pass | Optimizer step | Highest |
|---|---|---|---|---|---|
| float32 | 28.21 GB | 28.21 GB | 56.43 GB | 112.84 GB | 112.84 GB |
| float16 | 28.21 GB* | 42.32 GB | 56.43 GB | 56.43 GB | 56.43 GB |
Well, I don’t have 56GB of GPU memory in a single card, let alone 112GB.
What can we do?
Enter FSDP (Fully Sharded Data Parallel), which shards these components across your GPUs. Its key options:

- `sharding_strategy`:
  - `FULL_SHARD`: Includes optimizer states, gradients, and parameters
  - `SHARD_GRAD_OP`: Includes optimizer states and gradients
  - `NO_SHARD`: Normal DDP
  - `HYBRID_SHARD`: Includes optimizer states, gradients, and parameters, but each node keeps the full model
- `auto_wrap_policy`: `TRANSFORMER_BASED_WRAP` or `SIZE_BASED_WRAP`
  - `TRANSFORMER`/`fsdp_transformers_layer_cls_to_wrap`: `transformers` has good defaults
  - `SIZE`/`fsdp_min_num_param`: wraps layers once they exceed a minimum parameter count
- `offload_params`: Offloads the parameters and gradients to the CPU when they can't fit in GPU memory
  - Case: a full fine-tune (FFT) of Llama-3-8B with `fsdp_offload_params` on 2x 4090 GPUs took 72 hrs, vs. roughly an hour or two when using 1x H100
- `cpu_ram_efficient_loading` and `sync_module_states`: Use the `meta` device to load the model onto the GPU in a low-RAM scenario. Rather than needing `model_size * n_gpus` of CPU RAM, we can load the model on a single node and then send the weights directly to each shard when the time is right via `sync_module_states` (see the code sketch after this list).
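To make the mapping from options to code concrete, here is a minimal sketch using Accelerate's `FullyShardedDataParallelPlugin`. Field names and accepted values differ a bit across Accelerate versions, and in practice most of this is usually set through the config file shown later rather than in code:

```python
from torch.distributed.fsdp import CPUOffload, ShardingStrategy
from accelerate import Accelerator, FullyShardedDataParallelPlugin

# Illustrative only -- check your Accelerate version's plugin signature;
# run this under `accelerate launch` on multiple GPUs for it to take effect.
fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, and optimizer states
    cpu_offload=CPUOffload(offload_params=False),   # True trades a lot of speed for memory
    sync_module_states=True,                        # broadcast rank 0's weights to the other shards
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```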
Plenty of projects already build on top of Accelerate, including:

- `axolotl`
- `fastai`
- `FastChat`
- `lucidrains`
- `kornia`

Are you using it without even knowing?
```mermaid
graph LR
    A(("🤗 Accelerate"))
    A --> B["CLI Interface"]
    A --> C["Training Library"]
    A --> D["Big Model<br>Inference"]
```
The `accelerate` CLI provides three core commands:

- `accelerate config`: configure your training environment
- `accelerate estimate-memory`: estimate a model's memory usage (the tool behind the tables above)
- `accelerate launch`: run your training script
How can we make this better?
By configuring `accelerate launch` with a config file: `accelerate config` creates `config.yaml` files for you, or you can write your own, e.g. `fsdp_config.yaml`:
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
```python
from accelerate import Accelerator

accelerator = Accelerator()
dataloader, model, optimizer, scheduler = (
    accelerator.prepare(
        dataloader, model, optimizer, scheduler
    )
)

for batch in dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    # inputs = inputs.to(device)   # device placement handled by prepare()
    # targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    accelerator.backward(loss)     # replaces loss.backward()
    optimizer.step()
    scheduler.step()
```
Under the hood, rather than each of the n nodes iterating over the full dataset, we instead split it, processing batches n GPUs at a time per “global step”.

Mixed precision does not cast the model weights themselves; it uses autocast to convert the gradients automatically while keeping the master weights in float32. If you cast the weights to bf16 (e.g. call `.bf16()` on the model), you are STUCK in bf16 permanently.

The same idea extends to FP8: NVIDIA Transformer Engine and MS-AMP keep higher-precision copies where needed while computing in FP8.

| Optimization Level | Computation (GEMM) | Comm | Weight | Master Weight | Weight Gradient | Optimizer States |
|---|---|---|---|---|---|---|
| FP16 AMP | FP16 | FP32 | FP32 | N/A | FP32 | FP32+FP32 |
| Nvidia TE | FP8 | FP32 | FP32 | N/A | FP32 | FP32+FP32 |
| MS-AMP O1 | FP8 | FP8 | FP16 | N/A | FP8 | FP32+FP32 |
| MS-AMP O2 | FP8 | FP8 | FP16 | N/A | FP8 | FP8+FP16 |
| MS-AMP O3 | FP8 | FP8 | FP8 | FP16 | FP8 | FP8+FP16 |
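As a concrete illustration of the autocast approach (a minimal sketch with a toy model and random data, assuming a PyTorch build with bf16 support), note that the stored weights stay in float32 even though the compute runs in bf16:

```python
import torch
from accelerate import Accelerator

# Request bf16 mixed precision: Accelerate keeps the weights in float32 and
# uses autocast so the forward/backward math runs in bf16 where it is safe.
accelerator = Accelerator(mixed_precision="bf16")

model = torch.nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)

batch = torch.randn(8, 16, device=accelerator.device)
with accelerator.autocast():           # same machinery as torch.autocast
    loss = model(batch).pow(2).mean()
accelerator.backward(loss)
optimizer.step()

print(next(model.parameters()).dtype)  # torch.float32 -- the weights were never cast
```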
How FSDP and DeepSpeed handle precision differs slightly:

| Framework | Model Loading (`torch_dtype`) | Mixed Precision | Preparation (Local) | Training | Optimizer (Local) |
|---|---|---|---|---|---|
| FSDP | bf16 | default (none) | bf16 | bf16 | bf16 |
| FSDP | bf16 | bf16 | fp32 | bf16 | fp32 |
| DeepSpeed | bf16 | bf16 | fp32 | bf16 | fp32 |
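To tie the table back to code: the `torch_dtype` column is the dtype the checkpoint is loaded in before `prepare()` is called. Here is a hedged sketch of the second FSDP row (load in bf16, train with bf16 mixed precision, optimizer on locally upcast fp32 shards); the checkpoint name is just an example:

```python
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# Load the checkpoint directly in bf16 to halve the RAM needed just to hold the weights.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",   # example checkpoint; any causal LM works
    torch_dtype=torch.bfloat16,
)

# Launched with the FSDP config above and bf16 mixed precision, prepare()
# upcasts each local shard to fp32 (the "Preparation (Local)" column) so the
# optimizer steps on full-precision weights while compute stays in bf16.
accelerator = Accelerator(mixed_precision="bf16")
model = accelerator.prepare(model)
```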
To learn more, check out the documentation or join my office hours
Key takeaways:

- Use `accelerate`, FSDP, and DeepSpeed across multiple GPUs to train bigger models
- FP8 can help speed up training somewhat and reduce computational overhead