Error with finetuning

@liuhaotian Thanks for this great model!
I am trying to finetune the HF version of LLaVA instead of the original version provided in haotian-liu/LLaVA.

The script roughly looks like this:

from transformers import pipeline
from PIL import Image    
import requests
from datasets import load_dataset
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from transformers import TrainingArguments, Trainer


model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(

processor = AutoProcessor.from_pretrained(model_id)

def preprocess_data(examples):
    images = examples['image']
    texts = ['USER: <image>\n'+x+'\nASSISTANT:' for x in examples['text']]

    outputs = [x for x in examples['answer']]
    encoding = processor(texts,images, padding=True, truncation=True, return_tensors="pt")

    for k, v in encoding.items():
        encoding[k] = v.squeeze()

    targets = [processor.tokenizer.encode(x, add_special_tokens=False)+[processor.tokenizer.eos_token_id] for x in outputs]

    encoding["labels"] = targets
    return encoding

dataset = load_dataset('.....', split='train')
processed_dataset = dataset.map(preprocess_data, batched=True, remove_columns=['image','text','answer'])

training_args = TrainingArguments(

trainer = Trainer(


I keep running into this error:

ValueError: The input provided to the model are wrong. The number of image tokens is 1 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.
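
A quick way to rule out the simplest cause is to count the image placeholder tokens in each row of input_ids and compare that with the number of images passed for that row. The sketch below does this on plain Python lists. The helper names and the token id 32000 are assumptions for illustration only; the real id would come from the model config. Note that the error above reports both counts as 1, so this check may well pass here and the cause may lie elsewhere.

```python
def count_image_tokens(input_ids, image_token_id):
    """Return the number of image placeholder tokens in each sequence."""
    return [sum(1 for tok in seq if tok == image_token_id) for seq in input_ids]


def find_mismatched_rows(input_ids, images_per_row, image_token_id):
    """Return indices of rows whose placeholder count differs from the image count."""
    counts = count_image_tokens(input_ids, image_token_id)
    return [i for i, (c, n) in enumerate(zip(counts, images_per_row)) if c != n]


# Hypothetical token ids: 32000 stands in for the <image> placeholder.
IMAGE_TOKEN_ID = 32000
batch = [
    [1, 3148, 32000, 29871, 2],  # one placeholder, one image: consistent
    [1, 3148, 29871, 2, 0],      # placeholder missing (e.g. lost to truncation)
]
mismatched = find_mismatched_rows(batch, [1, 1], IMAGE_TOKEN_ID)
print(mismatched)
```

With real data, input_ids would be encoding["input_ids"].tolist() and images_per_row would be 1 for every example in the script above.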

Is anyone else having similar issues?

This might be an Apple Silicon specific issue, since I get the same error with the generation script when I move the model and tensors with .to('mps').
Running on a Linux machine now gives me this message instead:
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the call method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.

This has been fixed. For now, you will need to install Transformers from source to use it: pip install git+https://github.com/huggingface/transformers.git

Thanks! Is there an example of how to process that dataset? I still run into shape mismatch errors for the labels, probably due to this method:

from torch.nn.utils.rnn import pad_sequence

def preprocess_data(examples):
    images = examples['image']
    texts = ['USER: <image>\n'+x+'\nASSISTANT:' for x in examples['text']]
    outputs = [x for x in examples['answer']]
    encoding = processor(texts,images, padding=True, truncation=True, return_tensors="pt")
    for k, v in encoding.items():
        encoding[k] = v.squeeze()
    # Answers are tokenized separately from the prompts and padded on their own
    targets = [torch.tensor(processor.tokenizer.encode(x, add_special_tokens=False)+[processor.tokenizer.eos_token_id]) for x in outputs]
    targets = pad_sequence(targets, batch_first=True, padding_value=model.config.ignore_index)
    encoding["labels"] = targets
    return encoding
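
For what it's worth, a causal language model generally expects labels to have the same shape as input_ids, so answers that are tokenized and padded separately from the prompts will usually not line up. A common pattern is to append the answer tokens to the prompt tokens and mask the prompt and padding positions with the ignore index. Below is a minimal sketch on plain token-id lists, assuming an ignore index of -100 (the value PyTorch's cross-entropy loss skips by default) and a hypothetical helper name build_inputs_and_labels. Whether this resolves the error in this thread is not confirmed here.

```python
IGNORE_INDEX = -100  # assumed; positions with this label are skipped by the loss


def build_inputs_and_labels(prompt_ids, answer_ids, eos_id, pad_id):
    """Concatenate prompt and answer per example, then right-pad to a common length.

    Labels mirror input_ids, except prompt and padding positions are masked.
    """
    rows = []
    for p, a in zip(prompt_ids, answer_ids):
        ids = p + a + [eos_id]
        labels = [IGNORE_INDEX] * len(p) + a + [eos_id]
        rows.append((ids, labels))

    max_len = max(len(ids) for ids, _ in rows)
    input_ids, attention_mask, all_labels = [], [], []
    for ids, labels in rows:
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
        all_labels.append(labels + [IGNORE_INDEX] * pad)
    return input_ids, attention_mask, all_labels


# Toy token ids, purely illustrative.
ids, mask, labels = build_inputs_and_labels(
    prompt_ids=[[1, 10, 11], [1, 10]],
    answer_ids=[[20], [21, 22, 23]],
    eos_id=2,
    pad_id=0,
)
```

In the real script the prompt ids would come from the processor output and the answer ids from processor.tokenizer.encode(x, add_special_tokens=False), as in the method above; the pixel values are unaffected by this step.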

I am able to get an output when I use:

processed_dataset = dataset.map(preprocess_data, batched=True, remove_columns=['image','text','answer'])
examples = processed_dataset[:2]
model.generate(**inputs, max_new_tokens=200, do_sample=False)

But it still fails when I use:

training_args = TrainingArguments(

trainer = Trainer(


I am launching the script with accelerate.

@nielsr Even with the latest code from git I run into an issue.
I have more information on this issue.
It looks like there is a difference in how the same input is used during the generate call vs the forward call.
model.generate(**inputs) works without any issue but
model.forward(**inputs) throws this error

Hi Kshetrajna,
Were you able to solve the problem and run the forward without issues? I have the same problem...

