nbeerbower/Dumpling-Qwen2.5-VL-7B · What is it made for?

10 days ago

As far as I can see, it has pretty much lost its vision function; it just relies on the prompt and hallucinates.

Owner 10 days ago

Sorry, but the vision functionality is working fine for me?

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "nbeerbower/Dumpling-Qwen2.5-VL-7B", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("nbeerbower/Dumpling-Qwen2.5-VL-7B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://raw.githubusercontent.com/vikhyat/moondream/main/assets/demo-1.jpg",
            },
            {"type": "text", "text": "What is the girl doing?"},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to wherever device_map="auto" placed the model
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

The girl appears to be holding a large sandwich or burger in front of her face, possibly pretending to eat it or playfully posing with it.

Onix22

9 days ago

edited 9 days ago

Well, yes, it still has some of it, but the performance is badly degraded.
Ask it for a longer description or analysis of the image.
When you train a vision model, you need to train it with images.
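
A minimal sketch of how such a check could be set up, assuming the same message format as the snippet earlier in this thread. The helper name `build_messages` and the second prompt are hypothetical, chosen only for illustration; the idea is to run one short question and one long-form request against the same image and compare the answers.

```python
# Hypothetical helper: builds single-turn chat messages in the same format
# used earlier in this thread, so short and long prompts can be compared.
IMAGE_URL = "https://raw.githubusercontent.com/vikhyat/moondream/main/assets/demo-1.jpg"

# Illustrative prompts; the second one is an assumption, not from the thread.
PROMPTS = [
    "What is the girl doing?",
    "Describe this image in detail: the setting, the people, and any objects.",
]


def build_messages(image_url, prompt):
    """Return a conversation with one image and one text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": prompt},
            ],
        }
    ]


conversations = [build_messages(IMAGE_URL, p) for p in PROMPTS]
for conv in conversations:
    # Print the text prompt of each conversation as a quick sanity check.
    print(conv[0]["content"][1]["text"])
```

Each entry in `conversations` can then be passed through `processor.apply_chat_template` and `model.generate` as in the earlier snippet, probably with a larger `max_new_tokens` for the long-form prompt so the answer is not cut off.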

nbeerbower

Owner 9 days ago

Ok, thanks for the feedback