CUDA out of memory on RTX A5000 inference.

#57
by RoberyanL - opened

I am running the model on an RTX A5000 with 24 GB of memory, which should be enough, yet when I run the code below it still fails with a CUDA out-of-memory error. How can I fix this?

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, pipeline
import torch

# Function to clean CUDA memory
def clean_cuda_memory():
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()

# Clean CUDA memory before starting
clean_cuda_memory()

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model configuration and modify the rope_scaling parameter
model_config = AutoConfig.from_pretrained(model_id)
model_config.rope_scaling = {"type": "linear", "factor": 8.0}  # Adjust to the required format

# Load the model with the modified configuration
model = AutoModelForCausalLM.from_pretrained(model_id, config=model_config, torch_dtype=torch.float32)

# Initialize the text generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=1
)

system_role = '''
MY DESIGN FOR SYSTEM ROLE
'''

user_input = '''
MY INPUT
'''

# Define the messages
messages = [
    {"role": "system", "content": system_role},
    {"role": "user", "content": user_input},
]

# Concatenate messages into a single prompt
prompt = ""
for message in messages:
    if message["role"] == "system":
        prompt += f"System: {message['content']}\n"
    elif message["role"] == "user":
        prompt += f"User: {message['content']}\n"

# Generate text based on the prompt
output = pipe(
    prompt,
    max_new_tokens=512,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# Extract the generated text
generated_text = output[0]["generated_text"]

# Print the response
print(generated_text)

The error message:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB (GPU 1; 23.69 GiB total capacity; 22.86 GiB already allocated; 128.06 MiB free; 22.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Maybe try running the model in float16 or bfloat16, if you aren't already.
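
For example, keeping the rest of your script unchanged, the only edit needed is the torch_dtype passed to from_pretrained. A minimal sketch:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Loading in bfloat16 roughly halves the weight footprint vs. float32
# (~16 GB instead of ~32 GB for an 8B-parameter model), so it fits in 24 GB.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=1)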

Thanks, this helps. But I find the 8B model does not generate well when given long task instructions; does anyone have better-practice suggestions? I also see that the 70B model needs about 70 GB of VRAM to run in FP8, or about 35 GB with INT4. Can I use 4 x 24 GB GPUs for the FP8 version?

@RoberyanL no, joining 4 GPUs doesn't work the way it used to; the memory is not shared among them. If you can handle the slow inference speeds, pick up 96 GB of RAM and run it from the CPU.

You can also consider using the Inference API to call the model without having to download it.

from huggingface_hub import InferenceClient

client = InferenceClient(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    token="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
)

for message in client.chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=500,
    stream=True,
):
    print(message.choices[0].delta.content, end="")

Or, if you are running it locally, use TGI.
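
For instance, once a TGI server is running locally, the same InferenceClient code works by pointing it at that endpoint instead of a model id (the URL below is an assumption; adjust it to wherever your server listens):

from huggingface_hub import InferenceClient

# Assumes a TGI server is already serving the model locally,
# e.g. at http://localhost:8080 (adjust to your deployment).
client = InferenceClient("http://localhost:8080")

for message in client.chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=500,
    stream=True,
):
    print(message.choices[0].delta.content, end="")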


@info-int do you have any links where I could read about this? I am having the same problem as @RoberyanL.

This link: https://huggingface.co/docs/accelerate/en/usage_guides/big_modeling suggests you can run a model that doesn't fit completely in CUDA memory.
And this link: https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference suggests that with pipeline parallelism you can split a model across multiple GPUs.
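
As a rough sketch of the first approach (assuming accelerate is installed and the bfloat16 weights fit across your combined GPU and CPU memory), device_map="auto" lets transformers spread the layers over the available devices:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" places layers on all visible GPUs,
# overflowing to CPU RAM when they don't fit on the GPUs alone.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))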

You can use TGI, which will shard the model across multiple devices and will be far faster than anything you can do with the transformers library.
