A100-80GB of memory but still getting an out-of-memory error
File "/root/.cache/huggingface/modules/transformers_modules/falcon-40b/modelling_RW.py", line 93, in forward
return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 79.35 GiB total capacity; 77.18 GiB already allocated; 3.19 MiB free; 78.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Here is my code:
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = "falcon-40b"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=5,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
A 40B-parameter model will not fit on an A100-80GB in bf16 or fp16. In 16-bit precision, the VRAM needed just to hold a model's weights is at least 2GB per 1B parameters, and some models are closer to 3GB per 1B parameters; that does not include the memory needed to actually run any kind of inference. Two easy options: 1) run it on a node with multiple A100-80GB GPUs, or 2) load the model in 8-bit precision, which requires the "bitsandbytes" package and reduces the necessary VRAM to about 45GB. I have successfully loaded and run inference with the falcon-40b-instruct model on a system with 4 A4500s (20GB VRAM each) using this method.
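For option 2, here is a minimal sketch of 8-bit loading, assuming bitsandbytes and accelerate are installed and using the same local "falcon-40b" checkpoint as the original post (the prompt is just a placeholder):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "falcon-40b"  # same local checkpoint as in the original snippet

tokenizer = AutoTokenizer.from_pretrained(model_path)

# load_in_8bit quantizes the linear layers with bitsandbytes at load time,
# roughly halving the bf16 footprint (~80GB of weights -> ~45GB)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,
    device_map="auto",       # shards the quantized weights across available GPUs
    trust_remote_code=True,
)

inputs = tokenizer("Girafatron is obsessed with giraffes.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))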
I got it working on a single 48GB-VRAM A6000. As masonbraysx said, you need the bitsandbytes library; I prefer to install it from the GitHub repo for the latest dev version, together with transformers and accelerate.
The trick is a bitsandbytes config that enables load_in_4bit with the nf4 quant type (the QLoRA variant) and bfloat16 compute, as in the sketch below.
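A minimal sketch of that config, assuming a transformers version recent enough to ship BitsAndBytesConfig with 4-bit support:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization (the QLoRA data type) with bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "falcon-40b",                    # same local checkpoint as above
    quantization_config=bnb_config,  # weights stored in 4-bit, matmuls run in bf16
    device_map="auto",
    trust_remote_code=True,
)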
For the base model in bfloat16, we recommend 85-100GB of memory.
There have been some efforts, such as FalconTune, to run the model in 4 bits (~20-30GB only).
Can anyone share the inference speeds on each setup? Knowing the speed is as important as being able to load it.
Mine, in a cloud environment with a single RTX A6000 (48GB VRAM), gets 1-2 tokens/second. Pretty slow, but okay, it's running.
Running it on 4x A100-80GB, it takes between 23 and 24 ms per token (using https://github.com/huggingface/text-generation-inference to serve it).
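For reference, a hedged sketch of querying a text-generation-inference server from Python: the /generate endpoint and payload shape follow TGI's REST API, while the host, port, and generation parameters here are placeholder assumptions.

import requests

# assumes a TGI server for falcon-40b is already running and listening locally
resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "Girafatron is obsessed with giraffes.",
        "parameters": {"max_new_tokens": 200, "do_sample": True, "top_k": 5},
    },
)
print(resp.json()["generated_text"])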