Successful inference using a 24GB GPU
#2 by mike-ravkine
This model is a little too large to run inference on an A10G (24GB). For anyone in a similar situation, try this:
from accelerate import infer_auto_device_map, load_checkpoint_and_dispatch

# Cap GPU 0 at 18GiB and allow the overflow to spill into CPU RAM
max_memory = {0: "18GiB", "cpu": "99GiB"}
device_map = infer_auto_device_map(model,
                                   no_split_module_classes=["DecoderLayer"],
                                   max_memory=max_memory)
# infer_auto_device_map may place lm_head on CPU; force it back onto the GPU
if device_map["lm_head"] == "cpu":
    device_map["lm_head"] = 0
model = load_checkpoint_and_dispatch(model, load_quant, device_map=device_map)
This configuration leaves the last 12 of the model's 60 decoder layers on the CPU to bring GPU memory usage under the 24GB limit, and it works around an issue where the model crashes if lm_head is not on the GPU.
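As a quick sanity check, here is a minimal sketch to confirm which modules landed on the CPU and that generation still runs. Note that model_path and the prompt are placeholders, not from the original setup:

from collections import Counter
from transformers import AutoTokenizer

# Count how many modules were assigned to each device
print(Counter(device_map.values()))  # e.g. Counter({0: 49, 'cpu': 12})

# Run a short generation to verify the dispatched model works
tokenizer = AutoTokenizer.from_pretrained(model_path)  # model_path is a placeholder
inputs = tokenizer("Hello, world", return_tensors="pt").to(0)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Expect generation to be noticeably slower than an all-GPU setup, since activations for the offloaded layers round-trip through CPU RAM.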