Does MiniCPM support multi-image input?
I want to process 4-6 images at a time; what is the best practice?
A common practice is to structure the input messages like this: 'system prompt, image 1 ... image n, question'.
That said, any kind of sequence can be fed to the model in the way shown below, so you can experiment to find what works best.
from PIL import Image

system_prompt = 'Answer in detail.'
prompt = 'Caption these two images'
tgt_path = ['path/to/image1', 'path/to/image2']

# Build the sequence: system prompt, then the images, then the question.
msgs = []
if system_prompt:
    msgs.append(dict(type='text', value=system_prompt))
if isinstance(tgt_path, list):
    msgs.extend([dict(type='image', value=p) for p in tgt_path])
else:
    msgs.append(dict(type='image', value=tgt_path))
msgs.append(dict(type='text', value=prompt))

# Flatten into a single user turn: text stays as strings, images become RGB PIL images.
content = []
for x in msgs:
    if x['type'] == 'text':
        content.append(x['value'])
    elif x['type'] == 'image':
        image = Image.open(x['value']).convert('RGB')
        content.append(image)
msgs = [{'role': 'user', 'content': content}]

res = model.chat(
    msgs=msgs,
    context=None,
    image=None,  # the images are passed inside msgs
    tokenizer=tokenizer,
    **default_kwargs  # your generation kwargs (e.g. sampling, max_new_tokens)
)
If you have more questions, feel free to continue the discussion.
@Cuiunbo
My environment: python==3.8, sentencepiece==0.1.99, torch==2.2.0, Pillow==10.1.0, torchvision==0.16.2, transformers==4.40.2, CUDA Version: 12.2
Solved. I added torch.backends.cudnn.enabled = False to the code.
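For reference, one place that flag could go, assuming the standard transformers loading flow (the checkpoint name and dtype below are just illustrative):

import torch
from transformers import AutoModel, AutoTokenizer

# Workaround: disable cuDNN before loading / running the model.
torch.backends.cudnn.enabled = False

model_path = 'openbmb/MiniCPM-Llama3-V-2_5'  # illustrative checkpoint name
model = AutoModel.from_pretrained(model_path, trust_remote_code=True,
                                  torch_dtype=torch.float16).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)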
Nice! Glad you got this working, and feel free to ask if you have more questions!
We'll get back to you as soon as we can.
@Cuiunbo
It looks like the model works with 2 images at most. I tried it with 2 images and it worked perfectly fine, but with more than 2 images it just gives a
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 21 but got size 20 for tensor number 1 in the list.
@ma-korotkov
Thanks for providing the implementation. If you don't modify the model file, the max context length for Llama 3 is 2048, but I remember Llama 3 supports 4096, so you can try modifying it.
Also, the image resolution affects how many images fit into the model's context, so you can also try resizing the images beforehand.
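For example, here is a rough sketch of downscaling with PIL before building msgs (the 448-pixel cap is just an illustrative value, not a requirement of the model):

from PIL import Image

def load_resized(path, max_side=448):
    # Shrink the image so its longer side is at most max_side, keeping the
    # aspect ratio; smaller images take up fewer tokens in the context.
    img = Image.open(path).convert('RGB')
    img.thumbnail((max_side, max_side), Image.LANCZOS)
    return img

content = [load_resized(p) for p in tgt_path]
content.append(prompt)
msgs = [{'role': 'user', 'content': content}]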
@Cuiunbo Thanks a lot! You are right, I'm using quite big images, so resizing them helped them fit into the context length.
@ma-korotkov Nice! I hope to hear your feedback on our video capabilities. Since we didn't train on multi-image data, it's amazing that we can do some simple video tasks now!
May I ask whether you have added interleaved image-text training data? I found that the model does not perform well when I use multiple image-text pairs as the context.
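(To be concrete, by interleaved context I mean something like the sketch below, reusing the msgs format from above; the paths and captions are just placeholders.)

from PIL import Image

img1 = Image.open('path/to/image1').convert('RGB')
img2 = Image.open('path/to/image2').convert('RGB')
query = Image.open('path/to/query_image').convert('RGB')  # placeholder path

# Image-text pairs as in-context examples, followed by a query image.
content = [img1, 'Caption: a red car parked on a street.',
           img2, 'Caption: a dog running on a beach.',
           query, 'Caption:']
msgs = [{'role': 'user', 'content': content}]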
Hi
What is the preferred way to continue the "chat" about a previously loaded image without reloading it?
Is it possible to use the model just as a language model, i.e. without any image?
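For example, would something like the following be the intended usage? (An unverified sketch based on the calling pattern above; I'm assuming earlier turns, including the image, can simply be kept in msgs.)

from PIL import Image

image = Image.open('path/to/image1').convert('RGB')
msgs = [{'role': 'user', 'content': [image, 'Describe this image.']}]
res = model.chat(msgs=msgs, context=None, image=None,
                 tokenizer=tokenizer, **default_kwargs)

# Keep the history and ask a follow-up about the same image without reloading it;
# a turn containing only text would be passed the same way.
msgs.append({'role': 'assistant', 'content': [res]})
msgs.append({'role': 'user', 'content': ['What colors stand out the most?']})
res = model.chat(msgs=msgs, context=None, image=None,
                 tokenizer=tokenizer, **default_kwargs)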
In the snippet above, image=None? Is the image really passed as None?
The images are passed inside msgs, so image can be left as None.