Successful inference using a 24GB GPU
#2 by mike-ravkine
This model is a little too large to run inference on an A10G (24GB). For anyone in a similar situation, try this:
from accelerate import infer_auto_device_map, load_checkpoint_and_dispatch

# Cap GPU 0 at 18GiB and allow the overflow to spill into CPU RAM
max_memory = {0: "18GiB", "cpu": "99GiB"}
device_map = infer_auto_device_map(model,
                                   no_split_module_classes=["DecoderLayer"],
                                   max_memory=max_memory)
# infer_auto_device_map may place lm_head on CPU; force it back onto the GPU
if device_map["lm_head"] == "cpu":
    device_map["lm_head"] = 0
model = load_checkpoint_and_dispatch(model, load_quant, device_map=device_map)
This configuration leaves the last 12 of the model's 60 decoder layers on the CPU to bring GPU memory usage under the 24GB limit, and it works around an issue where the model crashes if lm_head is not on the GPU.
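As a quick sanity check, here is a minimal sketch to confirm which modules landed on the CPU and that generation still runs. Note that model_path and the prompt are placeholders, not from the original setup:

from collections import Counter
from transformers import AutoTokenizer

# Count how many modules were assigned to each device
print(Counter(device_map.values()))  # e.g. Counter({0: 49, 'cpu': 12})

# Run a short generation to verify the dispatched model works
tokenizer = AutoTokenizer.from_pretrained(model_path)  # model_path is a placeholder
inputs = tokenizer("Hello, world", return_tensors="pt").to(0)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Expect generation to be noticeably slower than an all-GPU setup, since activations for the offloaded layers round-trip through CPU RAM.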