What is it made for?
#1, opened by Onix22
As far as I can see, it has pretty much lost its vision capability: it just relies on the prompt and hallucinates.
Sorry, but the vision functionality is working fine for me?
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "nbeerbower/Dumpling-Qwen2.5-VL-7B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    "nbeerbower/Dumpling-Qwen2.5-VL-7B",
)

# One user turn containing an image and a question about it.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://raw.githubusercontent.com/vikhyat/moondream/main/assets/demo-1.jpg",
            },
            {"type": "text", "text": "What is the girl doing?"},
        ],
    }
]

# Render the chat template and extract the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to wherever device_map="auto" placed the model.
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens from each output sequence.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
The girl appears to be holding a large sandwich or burger in front of her face, possibly pretending to eat it or playfully posing with it.
Well, yes, it still has some vision capability, but the performance is badly degraded.
Ask it for a longer description or an analysis of the image.
When you train a vision model, you need to train it with images.
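For anyone who wants to try the longer-description check suggested above, here is a rough sketch. It assumes `model` and `processor` are already loaded as in the earlier snippet; `build_messages`, `describe`, and `LONG_PROMPT` are hypothetical names made up for illustration, and the prompt wording and the 512-token budget are arbitrary choices, not anything from the model card.

```python
def build_messages(image_url, prompt):
    """Build a single-turn chat with one image and one text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def describe(model, processor, image_url, prompt, max_new_tokens=512):
    """Run one image + prompt through the model and return the decoded reply."""
    # Imported lazily so this file can be loaded without qwen-vl-utils installed.
    from qwen_vl_utils import process_vision_info

    messages = build_messages(image_url, prompt)
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the newly generated tokens, dropping the prompt.
    trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    return processor.batch_decode(
        trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]


# Hypothetical prompt that forces a longer, detail-heavy answer.
LONG_PROMPT = (
    "Describe this image in detail: the people, their clothing, "
    "the objects they hold, the background, and any visible text."
)
```

Calling `describe(model, processor, image_url, LONG_PROMPT)` with the same demo image and reading the answer against the picture should show whether the details are grounded or invented. Running the identical call on the base checkpoint (presumably `Qwen/Qwen2.5-VL-7B-Instruct`, which is an assumption on my part) would give a point of comparison.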
Ok, thanks for the feedback