mistral-community/pixtral-12b · pixtral-12b video support

I have a question about the mistral-community/pixtral-12b model. Does it support video input, i.e. a sequence of frames plus text, and generate a text response, similar to the video-llava-7b-hf model?

Pixtral-12b works well with single images. However, when I send 8 frames from a 40-second video to the model, I get a CUDA out-of-memory (OOM) error. With 5 frames, I hit the model's token limit. With just 1 frame for the whole 40-second video, I do get a response, but the quality isn’t as good.

Can anyone clarify whether this behavior is expected with Pixtral-12b, and how I can improve it?
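
For context on the frame counts above, here is a rough sketch of how the image-token budget might grow with the number of frames and their resolution. It assumes the vision encoder splits each image into 16x16-pixel patches and adds one break/end token per patch row. The helper names (`sample_frame_indices`, `fit_within`, `estimate_image_tokens`) are hypothetical, and the estimate ignores any resizing the processor applies itself, so treat the numbers as ballpark figures only.

```python
import math

PATCH_SIZE = 16  # assumption: 16x16-pixel patches in the vision encoder


def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick up to num_samples frame indices spread evenly across the clip."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    # Take the frame at the middle of each equal-length segment.
    return [int(step * i + step / 2) for i in range(num_samples)]


def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Downscale (never upscale) so the longest side is at most max_side."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def estimate_image_tokens(width: int, height: int,
                          patch_size: int = PATCH_SIZE) -> int:
    """Rough token count: one token per patch plus one per patch row."""
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    # Assumption: each row of patches is followed by one break/end token.
    return rows * (cols + 1)


if __name__ == "__main__":
    # A 40-second clip at 30 fps has 1200 frames; sample 8 of them.
    indices = sample_frame_indices(total_frames=1200, num_samples=8)
    print("frame indices:", indices)

    for max_side in (1024, 512):
        w, h = fit_within(1024, 576, max_side)
        per_frame = estimate_image_tokens(w, h)
        print(f"{w}x{h}: {per_frame} tokens/frame, "
              f"{per_frame * len(indices)} tokens for {len(indices)} frames")
```

Under these assumptions, 8 frames at 1024x576 come to about 18,720 image tokens, while the same frames downscaled to 512x288 come to about 4,752. That would be consistent with both the token-limit error and the memory pressure described above, and it suggests that reducing frame resolution before passing frames to the processor may allow more frames per request.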