What is it made for?

#1
by Onix22 - opened

As far as I can see, it has pretty much lost its vision function; it just relies on the prompt and hallucinates.

Sorry, but the vision functionality is working fine for me?

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "nbeerbower/Dumpling-Qwen2.5-VL-7B", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained(
    "nbeerbower/Dumpling-Qwen2.5-VL-7B", 
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://raw.githubusercontent.com/vikhyat/moondream/main/assets/demo-1.jpg",
            },
            {"type": "text", "text": "What is the girl doing?"},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to wherever device_map="auto" placed the model
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

The girl appears to be holding a large sandwich or burger in front of her face, possibly pretending to eat it or playfully posing with it.

Well, yes, it still has some of it, but performance is badly degraded.
Ask it for a longer description or analysis of the image.
When you train a vision model, you need to train it with images.
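
For anyone who wants to try the suggestion above, here is a minimal sketch of a longer-description request. The helper name `build_messages`, the prompt wording, and the `max_new_tokens=512` value are illustrative assumptions, not something from the model card; the message layout just mirrors the snippet earlier in this thread.

```python
# Hypothetical helper (not from the thread or the model card): builds a chat
# message in the same layout as the earlier snippet, but with a prompt that
# asks for a long, detailed description instead of a one-line answer.
def build_messages(image_url, prompt):
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": prompt},
            ],
        }
    ]


IMAGE_URL = "https://raw.githubusercontent.com/vikhyat/moondream/main/assets/demo-1.jpg"

# Assumed prompt wording; any request for a detailed description would do.
LONG_PROMPT = (
    "Describe this image in detail: the people, objects, background, "
    "colors, and any visible text."
)

messages = build_messages(IMAGE_URL, LONG_PROMPT)

# Feed `messages` through the same processor/model calls as the earlier
# snippet, leaving room for a longer answer, for example:
#   generated_ids = model.generate(**inputs, max_new_tokens=512)
```

Running the same prompt through the base Qwen2.5-VL model would give a side-by-side comparison of how much detail each one gets right.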

Ok, thanks for the feedback
