Error: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED

#21
by nicolaschaillan - opened

Hello,

We are running the code:

import torch
from transformers import pipeline, AutoModelForCausalLM
print('got here')
generate_text = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
print('got here2')
generate_text("Who is Nic Chaillan?")
print('got here3')

This is on an Azure NV48s v3 (24 vCPUs, 224 GiB memory).

We get the error:

got here
got here2
Traceback (most recent call last):
  File "/datadrive/dolly-v2-12b/test.py", line 8, in <module>
    generate_text("Who is Nic Chaillan?")
  File "/usr/local/lib/python3.9/dist-packages/transformers/pipelines/base.py", line 1074, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/usr/local/lib/python3.9/dist-packages/transformers/pipelines/base.py", line 1081, in run_single
    model_outputs = self.forward(model_inputs, **forward_params)
  File "/usr/local/lib/python3.9/dist-packages/transformers/pipelines/base.py", line 990, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/home/nicos/.cache/huggingface/modules/transformers_modules/databricks/dolly-v2-12b/f8adc425f3ce69a26d57c89c1b69429a74e2ec0e/instruct_pipeline.py", line 103, in _forward
    generated_sequence = self.model.generate(
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/generation/utils.py", line 1571, in generate
    return self.sample(
  File "/usr/local/lib/python3.9/dist-packages/transformers/generation/utils.py", line 2534, in sample
    outputs = self(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 654, in forward
    outputs = self.gpt_neox(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 546, in forward
    outputs = layer(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 319, in forward
    attention_layer_outputs = self.attention(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 153, in forward
    attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 233, in _attn
    attn_output = torch.matmul(attn_weights, value)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmStridedBatchedExFix(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

Any clue what to do to fix this?

Databricks org

This means you don't have all the NVIDIA libraries installed; here it's complaining about cuBLAS. You can see what you have to add to a standard Databricks runtime, for example, here: https://github.com/databrickslabs/dolly/blob/master/train_dolly.py#L27 That might be a clue.
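
For anyone debugging this, a minimal smoke test (my sketch, not from the Dolly repo) reproduces the failing bf16 batched matmul without loading the model at all:

import torch

# Minimal repro of the failing cuBLAS call, independent of Dolly.
# If this also raises CUBLAS_STATUS_NOT_SUPPORTED, the problem is the
# GPU / driver / cuBLAS stack rather than the model code.
print("torch CUDA:", torch.version.cuda, "| available:", torch.cuda.is_available())
print("bf16 supported:", torch.cuda.is_bf16_supported())
a = torch.randn(2, 4, 8, dtype=torch.bfloat16, device="cuda")
b = torch.randn(2, 8, 4, dtype=torch.bfloat16, device="cuda")
print(torch.matmul(a, b).shape)  # batched bf16 GEMM, same path as _attn above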

I have the same error. Any luck on solving this?

Databricks org

I think this can also arise as an "out of memory" error. Please say how you are running this and whether you've ruled out what's covered in the previous comments; that makes it much easier to help!
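
If you suspect OOM, a quick pre-flight check (a sketch) before loading the model:

import torch

# Free vs. total GPU memory; dolly-v2-12b needs roughly 24 GB in bf16
# (12B params x 2 bytes) plus headroom for activations and the KV cache.
free, total = torch.cuda.mem_get_info()
print(f"free: {free / 1e9:.1f} GB / total: {total / 1e9:.1f} GB")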

My Code:

import torch
from transformers import pipeline
generate_text = pipeline(model="databricks/dolly-v2-3b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map='auto')
edu_prompt = "Extract the universities from the following text: My name is Hamza and I have a bachelor's degree from the university of toronto and a master's degree from the university of waterloo."
edu = generate_text(edu_prompt)

12 GB GPU
torch 1.13.1 with CUDA 11.7

I don't think a 6GB model should give me an "out of memory" error.

Databricks org

Yeah that's not it, but do you have cublas installed? See above
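
One thing worth ruling out first: bfloat16 matmuls are only supported in hardware on Ampere-or-newer GPUs, which is exactly the kind of setup where this cuBLAS error appears. A hedged workaround sketch that falls back to float16 on older cards:

import torch
from transformers import pipeline

# Use bf16 only where the GPU supports it; pre-Ampere cards fail bf16
# GEMMs with CUBLAS_STATUS_NOT_SUPPORTED, so fall back to fp16 there.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
generate_text = pipeline(model="databricks/dolly-v2-3b", torch_dtype=dtype,
                         trust_remote_code=True, device_map="auto")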

Hi.
I have the same problem on an Ubuntu 20.04 server with plenty of memory. Have you had any success fixing this error?
/Tomas

Databricks org

Do you have the right cublas installed? What lib version vs what CUDA?

> Do you have the right cublas installed? What lib version vs what CUDA?

Which version should I have? I have CUDA 11.7.

Databricks org

This is all covered in the provided training scripts.
https://github.com/databrickslabs/dolly/blob/master/train_dolly.py#L53
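
To compare the CUDA toolkit your PyTorch build expects against what the machine actually provides, a quick sketch:

import torch

# The cuBLAS matching this toolkit version is the one the bf16 GEMM
# in the traceback goes through.
print("torch built for CUDA:", torch.version.cuda)  # e.g. '11.7'
print("CUDA available:", torch.cuda.is_available())
print("device:", torch.cuda.get_device_name(0))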

srowen changed discussion status to closed

Sorry, since I am a new user I could not reply any more last week. This problem is not solved. I have created a Dockerfile with what should be the correct cuBLAS version, but it does not work; it ends with the same error:
------ Dockerfile ------
FROM pytorch/pytorch:1.11.0-cuda11.3-cudnn8-devel
WORKDIR /app/dolly

RUN apt-get update && apt-get upgrade -y
ADD https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcusparse-dev-11-3_11.5.0.58-1_amd64.deb /tmp
RUN dpkg -i /tmp/libcusparse-dev-11-3_11.5.0.58-1_amd64.deb
ADD https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcublas-11-3_11.5.1.109-1_amd64.deb /tmp
ADD https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcublas-dev-11-3_11.5.1.109-1_amd64.deb /tmp
RUN dpkg -i /tmp/libcublas-11-3_11.5.1.109-1_amd64.deb
RUN dpkg -i /tmp/libcublas-dev-11-3_11.5.1.109-1_amd64.deb
ADD https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcusolver-dev-11-3_11.1.2.109-1_amd64.deb /tmp
RUN dpkg -i /tmp/libcusolver-dev-11-3_11.1.2.109-1_amd64.deb
ADD https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcurand-dev-11-3_10.2.4.109-1_amd64.deb /tmp
RUN dpkg -i /tmp/libcurand-dev-11-3_10.2.4.109-1_amd64.deb
RUN pip install "accelerate>=0.12.0" "transformers[torch]==4.25.1"
RUN pip install ipython
ADD https://huggingface.co/databricks/dolly-v2-3b/raw/main/instruct_pipeline.py .
COPY ./init_dolly.py .

CMD DISABLE_ADDMM_CUDA_LT=1 ipython -i init_dolly.py

------ init_dolly.py ------
import torch
from instruct_pipeline import InstructionTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b", device_map="auto", torch_dtype=torch.bfloat16)

generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
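
With that, the pipeline is invoked the same way as at the top of the thread:

# Same prompt as in the original report, for a like-for-like repro.
print(generate_text("Who is Nic Chaillan?"))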

Databricks org

What hardware? This would only run on an A100 as you've written it.

> What hardware? This would only run on an A100 as you've written it.

OK, then that is why it doesn't work. How do I change the used hardware?

Databricks org

I suspect OOM or something, but what error are you getting? Maybe this should be a separate thread with more info.
You control the hardware by, well, choosing where you run it?
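
To see which GPU you actually landed on (and whether it can do bfloat16 at all), a short sketch:

import torch

# bf16 GEMMs need compute capability >= 8.0 (Ampere, e.g. A100/A10);
# older cards hit the cuBLAS error from this thread.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))  # e.g. (8, 0) for A100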
