Several questions on the same video
Hello,
How can I ask several questions about the same video?
I tried this:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": "Describe this video"}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": "How many people are in the video?"}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": 'Is there any violence in the video? (answer "yes" or "no")'}
        ]
    }
]
But I get an error:
(smolvlm2-env) F:\CODE\smolvlm2>python smol4.py
Loading checkpoint shards: 100%|██████████████████████████████████████| 2/2 [00:03<00:00, 1.54s/it]
----- Start of iteration 1 -----
Message: {'role': 'user', 'content': [{'type': 'video', 'path': 'P:\1.mp4'}, {'type': 'text', 'text': 'Describe this video'}]}
Error: string indices must be integers
(it works with a single question)
Thanks in advance
Hi @delphijb ,
I have been looking into how this can be supported. For batch inference, you can do the following (UPDATE: this is already supported; you just need to add padding=True):
import torch

# Assumes `processor` and `model` have already been loaded.
video_path = "example.mp4"

conversation1 = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": "Describe this video"}
        ]
    }
]

conversation2 = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": "Summarize this video"}
        ]
    }
]

conversations = [conversation1, conversation2]

# padding=True pads the shorter prompt so the batch can be stacked into one tensor.
batched_inputs = processor.apply_chat_template(
    conversations,
    add_generation_prompt=True,
    tokenize=True,
    padding=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

torch.cuda.reset_peak_memory_stats()
generated_ids = model.generate(**batched_inputs, do_sample=True, max_new_tokens=1024)
peak_mem = torch.cuda.max_memory_allocated()

generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0], '\n\n', generated_texts[1])
print(f"Peak GPU memory: {peak_mem / 1024**3:.2f} GB")
The main issue with asking multiple questions about one video is that the video is part of the user prompt. What you want is to swap out the text part of the prompt while keeping the video part the same, and the prompt format does not support that. So batch inference gives you essentially the same thing: the video has to be duplicated into each conversation either way.
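For instance, here is a minimal sketch of that duplication, assuming the same processor and model as above (the questions list is just illustrative): one conversation per question, all pointing at the same video, batched together.

questions = [
    "Describe this video",
    "How many people are in the video?",
    'Is there any violence in the video? (answer "yes" or "no")'
]

# One conversation per question, each referencing the same video file.
conversations = [
    [{
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": q}
        ]
    }]
    for q in questions
]

batched_inputs = processor.apply_chat_template(
    conversations,
    add_generation_prompt=True,
    tokenize=True,
    padding=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**batched_inputs, do_sample=True, max_new_tokens=1024)
answers = processor.batch_decode(generated_ids, skip_special_tokens=True)
for q, a in zip(questions, answers):
    print(q, "->", a, "\n")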
You could also use the model's multi-turn conversation capability, but that runs sequentially, not in parallel.
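A rough sketch of that multi-turn approach (again assuming the processor and model from above; I have not verified the memory behavior): attach the video only on the first user turn, then feed each answer back as an assistant turn before asking the next question.

conversation = []
questions = ["Describe this video", "How many people are in the video?"]

for q in questions:
    content = [{"type": "text", "text": q}]
    if not conversation:
        # Attach the video only on the first turn.
        content.insert(0, {"type": "video", "path": video_path})
    conversation.append({"role": "user", "content": content})

    inputs = processor.apply_chat_template(
        conversation,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt"
    ).to(model.device, dtype=torch.bfloat16)

    generated_ids = model.generate(**inputs, do_sample=True, max_new_tokens=512)
    # Keep only the newly generated tokens, not the echoed prompt.
    answer = processor.batch_decode(
        generated_ids[:, inputs["input_ids"].shape[1]:],
        skip_special_tokens=True
    )[0]
    print(q, "->", answer)

    # Feed the answer back in so the next question has the full context.
    conversation.append({"role": "assistant",
                         "content": [{"type": "text", "text": answer}]})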