A little leg-up for getting it running please?
Can you please give me a little leg up here? The Chameleon PR is now merged in Transformers; I pulled the merged version, built it, and it all seems to run without errors. The only problem is that it doesn't give me a result: instead of an answer it just gives me back the question I asked. Have you got any idea, please? I am relatively new to all of this and I'm using code that I found in the Transformers examples. I'm sure I'm just being stupid somewhere, but I'd appreciate some help. The code I use is this:
[CODE]
from transformers import ChameleonProcessor, ChameleonForCausalLM
import torch
from PIL import Image
import requests
processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
model = ChameleonForCausalLM.from_pretrained("facebook/chameleon-7b")
prompt = "What color is the belt in this image?"
image_url = "scene_pics/jiu_jitsu_belt_white_1.jpg"
image = Image.open(image_url)
inputs = processor(prompt, image, return_tensors="pt").to(model.device)
# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=50)
print("aout : " + str(len(output)))
print(processor.decode(output[0], skip_special_tokens=True))
[/CODE]
And what I get back is this:
[CODE]
/mount/ae_cvm/transformers/src/transformers/utils/hub.py:127: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
Some kwargs in processor config are unused and will not have any effect: image_seq_length, image_token.
Loading checkpoint shards: 100%|████████████████████████████████████████| 3/3 [00:20<00:00,  6.98s/it]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
aout : 1
What color is the belt in this image?
[/CODE]
Pinging @RaushanTurganbay here
@archie1983 Can you verify that you're running the exact same code as here?
Interesting that you didn't get an error when generating with that prompt, because you have to add the special "<image>" token to the prompt in the place where the image should be, like "What color is the belt in this image?<image>".
If you already did that, can you try loading in fp16 by adding torch_dtype=torch.float16 to the model loading code? My guess is that full precision doesn't work well when the weights are in bf16. Note that loading in bf16 currently seems to be failing, which is why I recommended fp16.
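For reference, here's a minimal sketch of that loading call, matching the imports used in your code above (it assumes the same ChameleonForCausalLM class from that snippet):
[CODE]
from transformers import ChameleonProcessor, ChameleonForCausalLM
import torch

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
# load the weights in half precision (fp16) instead of the default fp32
model = ChameleonForCausalLM.from_pretrained(
    "facebook/chameleon-7b",
    torch_dtype=torch.float16,
)
[/CODE]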
Thanks for this, @RaushanTurganbay. Yes, I have the special image token in there; it's just that this discussion forum doesn't display code very well. The reason I tried to run it in fp16 was that when I run it in bf16, as you showed, I get this problem:
[CODE]
/mount/ae_cvm/transformers/src/transformers/utils/hub.py:127: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
Some kwargs in processor config are unused and will not have any effect: image_seq_length, image_token.
Loading checkpoint shards: 100%|████████████████████████████████████████| 3/3 [00:11<00:00,  3.85s/it]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Traceback (most recent call last):
File "test2.py", line 36, in
output = model.generate(**inputs, max_new_tokens=50)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/mount/ae_cvm/transformers/src/transformers/generation/utils.py", line 1969, in generate
result = self._sample(
File "/mount/ae_cvm/transformers/src/transformers/generation/utils.py", line 2912, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/ae_cvm/transformers/src/transformers/models/chameleon/modeling_chameleon.py", line 1531, in forward
outputs = self.model(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/ae_cvm/transformers/src/transformers/models/chameleon/modeling_chameleon.py", line 1280, in forward
image_tokens = self.get_image_tokens(pixel_values)
File "/mount/ae_cvm/transformers/src/transformers/models/chameleon/modeling_chameleon.py", line 1230, in get_image_tokens
_, _, image_toks = self.vqmodel.encode(pixel_values)
File "/mount/ae_cvm/transformers/src/transformers/models/chameleon/modeling_chameleon.py", line 1014, in encode
hidden_states = self.encoder(pixel_values)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/ae_cvm/transformers/src/transformers/models/chameleon/modeling_chameleon.py", line 939, in forward
hidden_states = [self.conv_in(pixel_values)]
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (float) and bias type (c10::Half) should be the same
[/CODE]
Any ideas on how I can get the bias type to match? Thanks.
@archie1983
The issue with dtype was fixed in the main branch; you can pull the changes with !pip install --upgrade git+https://github.com/huggingface/transformers.git
I cannot say for sure why the model is not replying; I asked the same question with a generic image from the web and got a reply. See the code below.
I can recommend checking the following:
- whether the model is stopping generation because it has generated an 'eos' token
- experiment with other prompts and check how often you see this pattern; there was an issue where the model explicitly refused to reply because of safety alignment
- use these params in generation: do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.2 (see the sketch after the code example below); the model repo doesn't have them as defaults yet. I made a PR to the repo and am waiting for the merge
- if nothing works and it fails to generate way too often, share the image you used as an HF dataset or web link so I can also try it
[CODE]
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
import torch
import requests
from PIL import Image

model = ChameleonForConditionalGeneration.from_pretrained("facebook/chameleon-7b", torch_dtype=torch.bfloat16, device_map="cuda:0")
processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")

prompt = "What color is the belt in this image?<image>"
image = Image.open(requests.get("https://fujisports.com/cdn/shop/products/FUJIADULTBJJBELTS_0003s_0000_fuji__bjjadultbelt__beltbjjwhite__white__1_1_1_3_1600x1600.jpg?v=1644942404", stream=True).raw)

inputs = processor(prompt, images=image, return_tensors="pt").to(model.device, torch.bfloat16)
generated_ids = model.generate(**inputs, max_new_tokens=100)
out = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(out)
[/CODE]
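To try the sampling parameters mentioned above, they can be passed straight to generate(); a minimal sketch using the suggested values (these are not defaults shipped with the model):
[CODE]
# sample instead of greedy decoding, with the parameters suggested above
generated_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
[/CODE]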
Awesome, @RaushanTurganbay, I pulled the latest Transformers commits, reinstalled, and then ran your code. That worked. This is the result I got:
[CODE]
/mount/ae_cvm/transformers/src/transformers/utils/hub.py:127: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
Loading checkpoint shards: 100%|████████████████████████████████████████| 3/3 [00:09<00:00,  3.15s/it]
Some kwargs in processor config are unused and will not have any effect: image_seq_length, image_token.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
What color is the belt in this image?The belt in this image is white.
[/CODE]
It wasn't quick though. The whole inference took about 4 hours, but it did get there in the end. I am using an Nvidia Jetson Xavier AGX for this, so my PyTorch is a special build from Nvidia, available through their docker container: l4t-text-generation. That's probably also the reason why I had to slightly change your code and remove the device_map parameter where you load the model. With that parameter I was getting this:
[CODE]
Traceback (most recent call last):
File "test3.py", line 18, in <module>
generated_ids = model.generate(**inputs, max_new_tokens=100)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/mount/ae_cvm/transformers/src/transformers/generation/utils.py", line 1980, in generate
result = self._sample(
File "/mount/ae_cvm/transformers/src/transformers/generation/utils.py", line 2923, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/ae_cvm/transformers/src/transformers/models/chameleon/modeling_chameleon.py", line 1533, in forward
outputs = self.model(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/ae_cvm/transformers/src/transformers/models/chameleon/modeling_chameleon.py", line 1298, in forward
causal_mask = self._update_causal_mask(
File "/mount/ae_cvm/transformers/src/transformers/models/chameleon/modeling_chameleon.py", line 1421, in _update_causal_mask
causal_mask = torch.triu(causal_mask, diagonal=1)
RuntimeError: "triu_tril_cuda_template" not implemented for 'BFloat16'
Anyway, I'm glad that it worked. Now it would be great to speed it up. Any chance for a quantized version?
Thanks
@archie1983 glad that it worked. It is super slow because, by removing device_map, you're running on CPU. I guess the error you saw is related to the installed torch version; can you try updating it?
@RaushanTurganbay Torch in my case comes from Nvidia: they build and maintain what works on Jetson. The latest torch they had available for my hardware (Jetson AGX Xavier) and my JetPack (5.1.2) was torch 2.1.0, and it was still giving the same problem. But I did find a solution, and I'm posting it here for posterity so that anybody who hits this problem in the future has it:
https://github.com/meta-llama/llama3/issues/110#issuecomment-2178078985
After using their method #1 my inference now runs in mere seconds.
Thanks
Can I run this on my CPU device?
@Pollecode yes, you can run on CPU by simply removing "device_map" when loading the model. But the generation will be very slow, around 4 hours according to the comments above.
In case you can't fit the model in fp16 on your GPU, try out quantization (https://huggingface.co/docs/peft/en/developer_guides/quantization#quantize-a-model); see the sketch below.
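For example, a minimal sketch of 4-bit loading with bitsandbytes (this assumes bitsandbytes is installed and a CUDA GPU is available; not tested on Chameleon specifically):
[CODE]
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration, BitsAndBytesConfig
import torch

# 4-bit quantization config: reduces memory footprint at some cost in quality
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = ChameleonForConditionalGeneration.from_pretrained(
    "facebook/chameleon-7b",
    quantization_config=quantization_config,
    device_map="auto",
)
processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
[/CODE]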
Thanks