Zero GPU does not support 4-bit quantization with bitsandbytes?

#71
by tanyuzhou - opened

Hi there, I just tried to deploy an LLM, 0-roleplay, on a new Zero GPU Space.

The Space built successfully, but I encountered the following errors when trying to run it.

UPDATE: Found a similar issue here, but the provided solution does not seem to work for me.

UPDATE AGAIN: I got the Space running by calling AutoModelForCausalLM.from_pretrained() inside the method decorated with @spaces.GPU. But I'm still wondering why this happens.

===== Application Startup at 2024-06-10 17:21:51 =====

The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
↑ Those bitsandbytes warnings are expected on ZeroGPU ↑
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
/usr/local/lib/python3.10/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.
  warnings.warn(warning_msg)
`low_cpu_mem_usage` was None, now set to True since model is quantized.

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]
Downloading shards:  50%|█████     | 1/2 [00:04<00:04,  4.08s/it]
Downloading shards: 100%|██████████| 2/2 [00:10<00:00,  5.48s/it]
Downloading shards: 100%|██████████| 2/2 [00:10<00:00,  5.27s/it]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  2.96it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/spaces/zero/wrappers.py", line 116, in worker_init
    torch.move(nvidia_uuid)
  File "/usr/local/lib/python3.10/site-packages/spaces/zero/torch.py", line 254, in _move
    bitsandbytes.move()
  File "/usr/local/lib/python3.10/site-packages/spaces/zero/bitsandbytes.py", line 120, in _move
    tensor.data = _param_to_4bit(tensor,
  File "/usr/local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 324, in to
    return self._quantize(device)
  File "/usr/local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 289, in _quantize
    w_4bit, quant_state = bnb.functional.quantize_4bit(
  File "/usr/local/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1234, in quantize_4bit
    raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/gradio/queueing.py", line 532, in process_events
    response = await route_utils.call_process_api(
  File "/usr/local/lib/python3.10/site-packages/gradio/route_utils.py", line 276, in call_process_api
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.10/site-packages/gradio/blocks.py", line 1928, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.10/site-packages/gradio/blocks.py", line 1512, in call_function
    prediction = await fn(*processed_input)
  File "/usr/local/lib/python3.10/site-packages/gradio/utils.py", line 799, in async_wrapper
    response = await f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/gradio/chat_interface.py", line 546, in _submit_fn
    response = await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 859, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/site-packages/spaces/zero/wrappers.py", line 177, in gradio_handler
    raise res.value
ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8

@tanyuzhou interesting. I think what it's trying to do is to somehow quantize an already int8 weight? Can you use quanto instead? AFAIK it's more up-to-date in terms of maintenance/transformers compatibility https://huggingface.co/docs/transformers/main/en/quantization/quanto

Hi @merve, thanks for looking into this issue! I don't know much about quantization. Is it possible to use quanto on a model that has already been quantized to 4-bit?

BTW, I've been working on this issue for the past few hours, and I think it might be related to how Hugging Face wraps the GPU inference method with @spaces.GPU.

First, I switched my Space to the A10G small hardware, and everything worked fine. So I think the model is quantized properly.

Then I moved the AutoModelForCausalLM.from_pretrained("Rorical/0-roleplay", return_dict=True, trust_remote_code=True) call inside the response() method, which is decorated with @spaces.GPU, so that the pretrained model is loaded in a GPU environment. That worked too. (here is the code)

But I still think my original code should work, since there is another Space with a 4-bit quantized model that calls AutoModelForCausalLM.from_pretrained() before __main__ and works fine.

tanyuzhou changed discussion title from Zero GPU does not support 8-bit quantization with bitsandbytes? to Zero GPU does not support 4-bit quantization with bitsandbytes?
