CUDA OOM from a single forward pass on an A6000 (48GB VRAM)
Hi,
I am hitting CUDA OOM from a single forward pass with a ~4k-token text sequence and no images as input.
Below is an example script that runs one forward pass with a 4-bit quantized version of the model:
import torch
from torch.utils.data import DataLoader
from datasets import load_from_disk
from peft import LoraConfig
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    Idefics2ForConditionalGeneration,
)

DEVICE = "cuda:0"
USE_LORA = False
USE_QLORA = True

processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False
)

if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=32,
        lora_alpha=16,
        lora_dropout=0.1,
        target_modules='.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$',
        use_dora=False if USE_QLORA else True,
        init_lora_weights="gaussian"
    )
    if USE_QLORA:
        # 4-bit NF4 quantization of the base weights, bf16 compute
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )
    model = Idefics2ForConditionalGeneration.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.bfloat16,
        quantization_config=bnb_config if USE_QLORA else None,
    )
    model.add_adapter(lora_config)
    model.enable_adapters()
else:
    model = Idefics2ForConditionalGeneration.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.bfloat16,
        _attn_implementation="flash_attention_2",
    ).to(DEVICE)

train_dataset = load_from_disk("./custom_datasets/idefics/train.hf")
test_dataset = load_from_disk("./custom_datasets/idefics/test.hf")

class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor
        self.image_token_id = processor.tokenizer.additional_special_tokens_ids[
            processor.tokenizer.additional_special_tokens.index("<image>")
        ]

    def __call__(self, examples):
        texts = []
        for example in examples:
            messages = example["messages"]
            text = processor.apply_chat_template(messages, add_generation_prompt=False)
            texts.append(text.strip())
        batch = processor(text=texts, return_tensors="pt", padding=True)
        labels = batch["input_ids"].clone()
        labels[labels == processor.tokenizer.pad_token_id] = self.image_token_id
        batch["labels"] = labels
        return batch

data_collator = MyDataCollator(processor)
data_loader = DataLoader(train_dataset, batch_size=1, collate_fn=data_collator)

for batch in data_loader:
    for k, v in batch.items():
        print(k, "->", v.shape)
    out = model(**batch)
    break
Running it produces the following error:
output = lora_B(lora_A(dropout(x))) * scaling
~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 234.00 MiB. GPU 0 has a total capacity of 47.30 GiB of which 11.44 MiB is free. Including non-PyTorch memory, this process has 47.27 GiB memory in use. Of the allocated memory 46.30 GiB is allocated by PyTorch, and 484.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
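The error message itself suggests the expandable_segments allocator option; for completeness, this is how it would be set (I have not verified that it actually helps here):

import os

# Allocator hint from the error message above; it must be set before CUDA is
# initialized (top of the script), or in the shell:
#   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python run_forward.py  # (hypothetical script name)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"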
These are the shapes of the input tensors being printed:
input_ids -> torch.Size([1, 4273])
attention_mask -> torch.Size([1, 4273])
labels -> torch.Size([1, 4273])
Why do I get OOM with a ~4k sequence length, without even using images as input?
I have fine-tuned Mistral-7B with QLoRA before without any problem.
So why is the text backbone here unable to complete even a forward pass at half the context length I used for that earlier fine-tune?
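(One sanity check I can think of, not part of my original script: since this is only a forward pass, wrapping it in torch.no_grad() would stop activations from being kept for a backward pass, which is usually most of the memory at this sequence length.)

with torch.no_grad():  # or torch.inference_mode()
    out = model(**batch)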
Weirdly, if I run it without LoRA/QLoRA (both flags set to False), I don't get OOM. Here is the code I am running:
DEVICE = "cuda:0"
USE_LORA = False
USE_QLORA = False

processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False
)

if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=32,
        lora_alpha=16,
        lora_dropout=0.1,
        target_modules='.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$',
        use_dora=False if USE_QLORA else True,
        init_lora_weights="gaussian"
    )
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )
    model = AutoModelForVision2Seq.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.bfloat16,
        quantization_config=bnb_config if USE_QLORA else None,
    )
    model.add_adapter(lora_config)
    model.enable_adapters()
else:
    model = AutoModelForVision2Seq.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.bfloat16,
        _attn_implementation="flash_attention_2",
    ).to(DEVICE)

train_dataset = load_from_disk("./custom_datasets/idefics/train.hf")
test_dataset = load_from_disk("./custom_datasets/idefics/test.hf")

class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor
        self.image_token_id = processor.tokenizer.additional_special_tokens_ids[
            processor.tokenizer.additional_special_tokens.index("<image>")
        ]

    def __call__(self, examples):
        texts = []
        for example in examples:
            messages = example["messages"]
            text = processor.apply_chat_template(messages, add_generation_prompt=False)
            texts.append(text.strip())
        batch = processor(text=texts, return_tensors="pt", padding=True)
        labels = batch["input_ids"].clone()
        labels[labels == processor.tokenizer.pad_token_id] = self.image_token_id
        batch["labels"] = labels
        return batch

data_collator = MyDataCollator(processor)
data_loader = DataLoader(train_dataset, batch_size=1, collate_fn=data_collator)

for batch in data_loader:
    batch = {k: v.to(DEVICE) for k, v in batch.items()}
    for k, v in batch.items():
        print(k, '->', v.shape)
    out = model(**batch)
    break
Note: the output shapes are the same:
input_ids -> torch.Size([1, 4273])
attention_mask -> torch.Size([1, 4273])
labels -> torch.Size([1, 4273])
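(To quantify the difference between the two runs, a small diagnostic I could add around the forward pass is to print the peak CUDA allocation; this is just a sketch, not part of the scripts above:)

torch.cuda.reset_peak_memory_stats(DEVICE)
out = model(**batch)
print(f"peak allocated: {torch.cuda.max_memory_allocated(DEVICE) / 1024**3:.2f} GiB")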
While we are at it, I should note that I cannot even load the model with LoRA adapters alone. It gets stuck at model.add_adapter(lora_config) for several minutes (and counting...). Here is the code to reproduce:
DEVICE = "cuda:0"
USE_LORA = True
USE_QLORA = False

processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False
)

if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=32,
        lora_alpha=16,
        lora_dropout=0.1,
        target_modules='.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$',
        use_dora=False if USE_QLORA else True,
        init_lora_weights="gaussian"
    )
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )
    model = AutoModelForVision2Seq.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.bfloat16,
        quantization_config=bnb_config if USE_QLORA else None,
    )
    print('---> loaded model')
    model.add_adapter(lora_config)
    print('---> added adapter')
    model.enable_adapters()
    print('---> enabled adapters')
else:
    model = AutoModelForVision2Seq.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.bfloat16,
        _attn_implementation="flash_attention_2",
    ).to(DEVICE)

train_dataset = load_from_disk("./custom_datasets/idefics/train.hf")
test_dataset = load_from_disk("./custom_datasets/idefics/test.hf")

class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor
        self.image_token_id = processor.tokenizer.additional_special_tokens_ids[
            processor.tokenizer.additional_special_tokens.index("<image>")
        ]

    def __call__(self, examples):
        texts = []
        for example in examples:
            messages = example["messages"]
            text = processor.apply_chat_template(messages, add_generation_prompt=False)
            texts.append(text.strip())
        batch = processor(text=texts, return_tensors="pt", padding=True)
        labels = batch["input_ids"].clone()
        labels[labels == processor.tokenizer.pad_token_id] = self.image_token_id
        batch["labels"] = labels
        return batch

data_collator = MyDataCollator(processor)
data_loader = DataLoader(train_dataset, batch_size=1, collate_fn=data_collator)

for batch in data_loader:
    batch = {k: v.to(DEVICE) for k, v in batch.items()}
    for k, v in batch.items():
        print(k, '->', v.shape)
    out = model(**batch)
    break
The console only prints:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 7/7 [00:02<00:00, 2.93it/s]
---> loaded model
Hi @starzmustdie, can you say more about your setup?
I just ran inference on a 4k text sequence (no image) with 4-bit quantization and torch.float16 on a 16GB V100.
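Roughly, my setup looked like this (a sketch, not the exact notebook; the prompt placeholder is illustrative):

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
)

# ~4k tokens of plain text, no images
inputs = processor(text=["<your long text prompt here>"], return_tensors="pt").to("cuda:0")
with torch.no_grad():
    out = model(**inputs)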
OS:
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
Installed packages:
Package Version
------------------------ -----------
accelerate 0.29.2
aiohttp 3.9.4
aiosignal 1.3.1
annotated-types 0.6.0
anyio 4.3.0
appdirs 1.4.4
asttokens 2.4.1
attrs 23.2.0
beautifulsoup4 4.12.2
bitsandbytes 0.43.1
certifi 2024.2.2
cfgv 3.4.0
charset-normalizer 3.3.2
click 8.1.7
comm 0.2.2
contourpy 1.2.1
cycler 0.12.1
datasets 2.18.0
debugpy 1.8.1
decorator 5.1.1
dill 0.3.8
distlib 0.3.8
distro 1.9.0
docker-pycreds 0.4.0
einops 0.7.0
executing 2.0.1
filelock 3.13.4
flash-attn 2.5.7
fonttools 4.51.0
frozenlist 1.4.1
fsspec 2024.2.0
gitdb 4.0.11
GitPython 3.1.43
h11 0.14.0
httpcore 1.0.5
httpx 0.27.0
huggingface-hub 0.22.2
identify 2.5.35
idna 3.7
iniconfig 2.0.0
ipykernel 6.29.4
ipython 8.23.0
jedi 0.19.1
Jinja2 3.1.3
joblib 1.4.0
jupyter_client 8.6.1
jupyter_core 5.7.2
kiwisolver 1.4.5
lxml 5.1.0
MarkupSafe 2.1.5
matplotlib 3.8.4
matplotlib-inline 0.1.7
more-itertools 10.2.0
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.3
ninja 1.11.1.1
nltk 3.8.1
nodeenv 1.8.0
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.19.3
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.1.105
openai 1.7.2
packaging 24.0
pandas 2.2.2
parso 0.8.4
peft 0.10.0
pexpect 4.9.0
pillow 10.3.0
pip 24.0
platformdirs 4.2.0
pluggy 1.4.0
pre-commit 3.4.0
prompt-toolkit 3.0.43
protobuf 4.25.3
psutil 5.9.8
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 15.0.2
pyarrow-hotfix 0.6
pydantic 2.7.0
pydantic_core 2.18.1
Pygments 2.17.2
pyparsing 3.1.2
pytest 7.4.4
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
pyzmq 26.0.0
regex 2023.12.25
requests 2.31.0
safetensors 0.4.3
sentry-sdk 1.45.0
setproctitle 1.3.3
setuptools 65.5.0
six 1.16.0
smmap 5.0.1
sniffio 1.3.1
soupsieve 2.5
stack-data 0.6.3
sympy 1.12
tiktoken 0.5.2
tokenizers 0.15.2
torch 2.2.2
tornado 6.4
tqdm 4.66.2
traitlets 5.14.2
transformers 4.40.0.dev0
triton 2.2.0
typing_extensions 4.11.0
tzdata 2024.1
urllib3 2.2.1
virtualenv 20.25.1
wandb 0.16.6
wcwidth 0.2.13
wheel 0.43.0
xxhash 3.4.1
yarl 1.9.4
zss 1.1.4
Nvidia-smi:
Wed Apr 17 18:54:00 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX 6000 Ada Gene... On | 00000000:01:00.0 Off | Off |
| 30% 33C P8 29W / 300W | 5MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX 6000 Ada Gene... On | 00000000:2D:00.0 Off | Off |
| 30% 49C P8 34W / 300W | 5MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX 6000 Ada Gene... On | 00000000:41:00.0 Off | Off |
| 30% 41C P8 26W / 300W | 5MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX 6000 Ada Gene... On | 00000000:61:00.0 Off | Off |
| 30% 45C P8 25W / 300W | 5MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Also, here is a flame graph of running the forward pass (I can upload the pickles to Drive if they are of any use to you).
If I can provide any more details, please let me know.
Thank you
Thanks for the details @starzmustdie. I will circle back if I need more info; digging in is on my to-do list for today.
In the meantime, perhaps the Colab where I ran inference in less than 10GB of GPU memory will be useful: https://colab.research.google.com/drive/1P8goWEyrScceBEMp4dD3eh2aCLMud_Su?usp=sharing
Regarding the extremely slow instantiation of the LoRA version of the model (but not the QLoRA one), I did some benchmarking of the adapter-injection code in the peft library (a simplified version of the measurement is sketched after the logs below). Here is what I found:
- Loading Idefics2 with QLoRA is almost instantaneous. Here is the timing output:
(truncated output...)
Time for injecting adapter to model.text_model.layers.28.mlp.up_proj: 0.004160881042480469
Time for injecting adapter to model.text_model.layers.28.mlp.down_proj: 0.005509138107299805
Time for injecting adapter to model.text_model.layers.29.self_attn.q_proj: 0.002628326416015625
Time for injecting adapter to model.text_model.layers.29.self_attn.k_proj: 0.0020885467529296875
Time for injecting adapter to model.text_model.layers.29.self_attn.v_proj: 0.0025033950805664062
Time for injecting adapter to model.text_model.layers.29.self_attn.o_proj: 0.0025649070739746094
Time for injecting adapter to model.text_model.layers.29.mlp.gate_proj: 0.003892183303833008
Time for injecting adapter to model.text_model.layers.29.mlp.up_proj: 0.0038487911224365234
Time for injecting adapter to model.text_model.layers.29.mlp.down_proj: 0.0054149627685546875
Time for injecting adapter to model.text_model.layers.30.self_attn.q_proj: 0.0026171207427978516
Time for injecting adapter to model.text_model.layers.30.self_attn.k_proj: 0.002309083938598633
Time for injecting adapter to model.text_model.layers.30.self_attn.v_proj: 0.002630949020385742
Time for injecting adapter to model.text_model.layers.30.self_attn.o_proj: 0.002721071243286133
Time for injecting adapter to model.text_model.layers.30.mlp.gate_proj: 0.0038025379180908203
Time for injecting adapter to model.text_model.layers.30.mlp.up_proj: 0.0038831233978271484
Time for injecting adapter to model.text_model.layers.30.mlp.down_proj: 0.005049228668212891
Time for injecting adapter to model.text_model.layers.31.self_attn.q_proj: 0.002608776092529297
Time for injecting adapter to model.text_model.layers.31.self_attn.k_proj: 0.002328157424926758
Time for injecting adapter to model.text_model.layers.31.self_attn.v_proj: 0.0026826858520507812
Time for injecting adapter to model.text_model.layers.31.self_attn.o_proj: 0.0026001930236816406
Time for injecting adapter to model.text_model.layers.31.mlp.gate_proj: 0.003985166549682617
Time for injecting adapter to model.text_model.layers.31.mlp.up_proj: 0.0039288997650146484
Time for injecting adapter to model.text_model.layers.31.mlp.down_proj: 0.0049343109130859375
Average time for injecting adapter: 0.0027877162142497737
- Loading Idefics2 with LoRA is extremely slow. More precisely, injecting the adapter takes ~5 seconds for some layers, which means loading the model can take many minutes:
Time for injecting adapter to model.vision_model.encoder.layers.23.self_attn.k_proj: 0.025675296783447266
Time for injecting adapter to model.vision_model.encoder.layers.23.self_attn.v_proj: 0.0254061222076416
Time for injecting adapter to model.vision_model.encoder.layers.23.self_attn.q_proj: 0.024999380111694336
Time for injecting adapter to model.vision_model.encoder.layers.23.self_attn.out_proj: 0.025428056716918945
Time for injecting adapter to model.vision_model.encoder.layers.23.mlp.fc1: 0.08384990692138672
Time for injecting adapter to model.vision_model.encoder.layers.23.mlp.fc2: 0.0839688777923584
Time for injecting adapter to model.vision_model.encoder.layers.24.self_attn.k_proj: 0.025103330612182617
Time for injecting adapter to model.vision_model.encoder.layers.24.self_attn.v_proj: 0.025132179260253906
Time for injecting adapter to model.vision_model.encoder.layers.24.self_attn.q_proj: 0.025215625762939453
Time for injecting adapter to model.vision_model.encoder.layers.24.self_attn.out_proj: 0.02485203742980957
Time for injecting adapter to model.vision_model.encoder.layers.24.mlp.fc1: 0.08424067497253418
Time for injecting adapter to model.vision_model.encoder.layers.24.mlp.fc2: 0.08370113372802734
Time for injecting adapter to model.vision_model.encoder.layers.25.self_attn.k_proj: 0.02559804916381836
Time for injecting adapter to model.vision_model.encoder.layers.25.self_attn.v_proj: 0.025536537170410156
Time for injecting adapter to model.vision_model.encoder.layers.25.self_attn.q_proj: 0.02513265609741211
Time for injecting adapter to model.vision_model.encoder.layers.25.self_attn.out_proj: 0.024949312210083008
Time for injecting adapter to model.vision_model.encoder.layers.25.mlp.fc1: 0.08578777313232422
Time for injecting adapter to model.vision_model.encoder.layers.25.mlp.fc2: 0.08488917350769043
Time for injecting adapter to model.vision_model.encoder.layers.26.self_attn.k_proj: 0.025366783142089844
Time for injecting adapter to model.vision_model.encoder.layers.26.self_attn.v_proj: 0.025299787521362305
Time for injecting adapter to model.vision_model.encoder.layers.26.self_attn.q_proj: 0.025112152099609375
Time for injecting adapter to model.vision_model.encoder.layers.26.self_attn.out_proj: 0.024931669235229492
Time for injecting adapter to model.vision_model.encoder.layers.26.mlp.fc1: 0.08368563652038574
Time for injecting adapter to model.vision_model.encoder.layers.26.mlp.fc2: 0.08527112007141113
Time for injecting adapter to model.connector.modality_projection.gate_proj: 0.2910897731781006
Time for injecting adapter to model.connector.modality_projection.up_proj: 0.28179073333740234
Time for injecting adapter to model.connector.modality_projection.down_proj: 4.413707256317139
Time for injecting adapter to model.connector.perceiver_resampler.layers.0.self_attn.q_proj: 0.4621694087982178
Time for injecting adapter to model.connector.perceiver_resampler.layers.0.self_attn.k_proj: 0.11468839645385742
Time for injecting adapter to model.connector.perceiver_resampler.layers.0.self_attn.v_proj: 0.11388301849365234
Time for injecting adapter to model.connector.perceiver_resampler.layers.0.self_attn.o_proj: 0.1406879425048828
Time for injecting adapter to model.connector.perceiver_resampler.layers.0.mlp.gate_proj: 4.9388508796691895 <---------
Time for injecting adapter to model.connector.perceiver_resampler.layers.0.mlp.up_proj: 4.939562559127808 <---------
Time for injecting adapter to model.connector.perceiver_resampler.layers.0.mlp.down_proj: 4.912921190261841 <---------
Time for injecting adapter to model.connector.perceiver_resampler.layers.1.self_attn.q_proj: 0.4630849361419678
Time for injecting adapter to model.connector.perceiver_resampler.layers.1.self_attn.k_proj: 0.11741971969604492
Time for injecting adapter to model.connector.perceiver_resampler.layers.1.self_attn.v_proj: 0.11777377128601074
Time for injecting adapter to model.connector.perceiver_resampler.layers.1.self_attn.o_proj: 0.14248085021972656
Time for injecting adapter to model.connector.perceiver_resampler.layers.1.mlp.gate_proj: 4.919850587844849 <---------
Time for injecting adapter to model.connector.perceiver_resampler.layers.1.mlp.up_proj: 4.9399824142456055 <---------
Time for injecting adapter to model.connector.perceiver_resampler.layers.1.mlp.down_proj: 4.886034965515137 <---------
Time for injecting adapter to model.connector.perceiver_resampler.layers.2.self_attn.q_proj: 0.45731663703918457
Time for injecting adapter to model.connector.perceiver_resampler.layers.2.self_attn.k_proj: 0.11633014678955078
Time for injecting adapter to model.connector.perceiver_resampler.layers.2.self_attn.v_proj: 0.11535906791687012
Time for injecting adapter to model.connector.perceiver_resampler.layers.2.self_attn.o_proj: 0.13594913482666016
Time for injecting adapter to model.connector.perceiver_resampler.layers.2.mlp.gate_proj: 4.919170618057251 <---------
Time for injecting adapter to model.connector.perceiver_resampler.layers.2.mlp.up_proj: 4.907344579696655 <---------
Time for injecting adapter to model.connector.perceiver_resampler.layers.2.mlp.down_proj: 5.401084661483765 <---------
(continues, but truncated)
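For reference, a simplified end-to-end version of the measurement (my per-module numbers above come from adding timers inside peft's injection code, but the overall stall is already visible by just timing the add_adapter call):

import time

start = time.perf_counter()
model.add_adapter(lora_config)   # the per-module "Time for injecting adapter" lines come from inside this call
model.enable_adapters()
print(f"adapter injection took {time.perf_counter() - start:.2f}s")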
I have done the same benchmarking when adding LoRA/QLoRA adapters to the base mistralai/Mistral-7B-v0.1 model, and the results are consistent with the above.
Why is this the case? When I did LoRA fine-tuning of Mistral in axolotl, it definitely didn't take this long. Has a new bug been introduced?
With respect to the last comment on LoRA loading, do you have an idea of what could be happening @smangrul? 🙏
@VictorSanh any luck reproducing the OOM issue? :)
@VictorSanh @smangrul
Update: the reason loading the LoRA model took so much longer than the QLoRA one is the flag use_dora=True in LoraConfig. 🤦♂️
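With USE_LORA=True and USE_QLORA=False, the expression use_dora=False if USE_QLORA else True evaluates to True, so DoRA initialization runs for every target module, which is presumably where the time goes. Setting the flag explicitly avoids the stall (only the changed config is shown):

lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules='.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$',
    use_dora=False,  # was: False if USE_QLORA else True
    init_lora_weights="gaussian"
)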
@VictorSanh any luck reproducing the OOM issue? :)
No luck so far. I am seeing memory usage significantly lower than what you are observing (these numbers are computed with the default example in the model card for idefics2-8b).