Trying to reproduce VQA-v2 results from Table 2 here: https://arxiv.org/pdf/2306.16527.pdf
Hi, I'm trying to reproduce the 0-shot results in Table 2 of the IDEFICS paper for the VQA-v2, VizWiz, and TextVQA datasets. I started with VQA-v2: it seems from the paper that idefics-9b-instruct gets 50.9% accuracy on VQA-v2, but I am only able to get around 30%. I suspect there is a prompting problem. I am following the evaluation protocol here exactly to steer IDEFICS towards VQA: https://huggingface.co/docs/transformers/main/en/tasks/idefics#visual-question-answering
This is an example output I am getting as a result:
Prompt:
['Instruction: Provide an answer to the question. Use the image to answer.\n',
<PIL.Image.Image image mode=RGB size=224x224 at 0x7FC0940A0430>,
'Question: what is the boy listening to? Answer:']
Model output:
'Instruction: Provide an answer to the question. Use the image to answer.\n Question: what is the boy listening to? Answer: music\n\nQuestion: what is the boy listening to? Answer: music\n\nQuestion: what is the boy listening to? Answer: music\n\nQuestion: what is the boy listening to? Answer: music\n\nQuestion: what is the boy listening to? Answer: music\n\nQuestion: what is the'
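For reference, this is roughly the generation code I'm using (a minimal sketch following the docs linked above; the image path and `max_new_tokens` value are placeholders, not what the paper used):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, IdeficsForVisionText2Text

device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint = "HuggingFaceM4/idefics-9b-instruct"

processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16
).to(device)

# Placeholder image path; in practice this is a VQA-v2 validation image.
image = Image.open("example_vqa_image.jpg")
prompt = [
    "Instruction: Provide an answer to the question. Use the image to answer.\n",
    image,
    "Question: what is the boy listening to? Answer:",
]

inputs = processor(prompt, return_tensors="pt").to(device)
# Prevent the model from emitting raw image placeholder tokens, as in the docs example.
bad_words_ids = processor.tokenizer(
    ["<image>", "<fake_token_around_image>"], add_special_tokens=False
).input_ids

generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```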
For the VQA-v2, VizWiz, and TextVQA evals, would you be able to provide the exact prompt/evaluation code used for IDEFICS, or point me to any additional pre-processing beyond the example in the link above that might be needed to reproduce the results in the paper? Thanks a bunch for the help!
@HugoLaurencon could you answer that question?
Thanks @VictorSanh , @HugoLaurencon any thoughts you have on this would be super appreciated :) Thanks!
Hi @abalakrishnaTRI, are you using the instruct model or the base model?
The number 50.9% you are trying to reproduce is for the base model, for the instruct one it's 65.8%.
The prompts we used for the base model are in the attached image (from the appendix of https://arxiv.org/pdf/2306.16527.pdf).
For VizWiz, note that the prompt changes from the one used for VQAv2 or TextVQA.
It is slightly different from what you are using: you should add an "Image:" before the image.
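Concretely, your prompt above would become something like this (just a sketch; the exact instruction wording for each task is in the appendix screenshot):

```python
# Keep the instruction, but prefix the PIL image with an "Image:" string.
prompt = [
    "Instruction: Provide an answer to the question. Use the image to answer.\n",
    "Image:",
    image,  # PIL image, as in your example
    "Question: what is the boy listening to? Answer:",
]
```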
We also used stop words: whenever a stop word is generated, we cut the generation at that point (without including the stop word), strip the result (to remove extra white space), and use it as the answer.
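A minimal sketch of that post-processing (the stop-word list below is only illustrative, not the exact list we used):

```python
def cut_at_stop_words(generation: str, stop_words) -> str:
    """Truncate at the first stop word (excluded) and strip surrounding whitespace."""
    cut = len(generation)
    for stop_word in stop_words:
        idx = generation.find(stop_word)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut].strip()

# Illustrative stop words; applied to the text generated after "Answer:".
stop_words = ["\n", "Question:", "Image:"]
answer = cut_at_stop_words("music\n\nQuestion: what is the boy listening to?", stop_words)
# answer == "music"
```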
I believe 0-shot evaluation, especially for VQA tasks, is not a good way to evaluate the models because it depends too much on how the answer is phrased. Saying "On the left" when the correct answer is "Left" will be scored 0. Using the stop words can help.
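For context, a simplified version of the standard VQA accuracy metric looks like this, which is why phrasing matters so much (the official metric additionally averages over leave-one-out annotator subsets and applies answer normalization, omitted here):

```python
def vqa_accuracy(prediction: str, annotator_answers) -> float:
    """Simplified VQA accuracy: min(#annotators giving the exact same answer / 3, 1)."""
    pred = prediction.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in annotator_answers)
    return min(matches / 3.0, 1.0)

# "On the left" against ten annotators who all answered "left" scores 0.0.
print(vqa_accuracy("On the left", ["left"] * 10))  # 0.0
print(vqa_accuracy("Left", ["left"] * 10))         # 1.0
```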
We don't provide access to the code base at the moment.
Can you try with these changes to see if you can improve your performance? If not, can you try switching to the base model to see if you can reproduce the result? It will help identify the problem.
Following your suggestion in the previous comment, I am trying to reproduce the VizWiz zero-shot results on the validation set, but my zero-shot outputs do not seem quite right, and the quantitative numbers are not looking good either.
Colab - https://colab.research.google.com/drive/1KNg3q1YFk5aLux4eii17COnUbsuYEFJ0?usp=sharing
Could you take a look and suggest any changes to the prompt?