Edit model card

Description

UForm-Gen2-dpo is a small generative vision-language model alined for Image Captioning and Visual Question Answering on preference datasets VLFeedback and LLaVA-Human-Preference-10K using Direct Preference Optimization (DPO).

The model consists of two parts:

  1. CLIP-like ViT-H/14
  2. Qwen1.5-0.5B-Chat

The model took less than one day to train on a DGX-H100 with 8x H100 GPUs. Thanks to Nebius.ai for providing the compute 🤗

Usage

The generative model can be used to caption images, answer questions about them. Also it is suitable for a multimodal chat.

from transformers import AutoModel, AutoProcessor
model = AutoModel.from_pretrained("unum-cloud/uform-gen2-dpo", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("unum-cloud/uform-gen2-dpo", trust_remote_code=True)
prompt = "Question or Instruction"
image = Image.open("image.jpg")
inputs = processor(text=[prompt], images=[image], return_tensors="pt")
with torch.inference_mode():
     output = model.generate(
        **inputs,
        do_sample=False,
        use_cache=True,
        max_new_tokens=256,
        eos_token_id=151645,
        pad_token_id=processor.tokenizer.pad_token_id
    )
prompt_len = inputs["input_ids"].shape[1]
decoded_text = processor.batch_decode(output[:, prompt_len:])[0]

You can check examples of different prompts in our demo space.

Evaluation

perception reasoning OCR artwork celebrity code_reasoning color commonsense_reasoning count existence landmark numerical_calculation position posters scene text_translation

MME Benchmark

Model perception reasoning OCR artwork celebrity code_reasoning color commonsense_reasoning count existence landmark numerical_calculation position posters scene text_translation
uform-gen2-dpo 1,048.75 224.64 72.50 97.25 62.65 67.50 123.33 57.14 136.67 195.00 104.00 50.00 51.67 59.18 146.50 50.00
uform-gen2-qwen-500m 863.40 236.43 57.50 93.00 67.06 57.50 78.33 81.43 53.33 150.00 98.00 50.00 50.00 62.93 153.25 47.50
Downloads last month
2,985
Safetensors
Model size
1.27B params
Tensor type
F32
·
Inference Examples
Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Datasets used to train unum-cloud/uform-gen2-dpo

Spaces using unum-cloud/uform-gen2-dpo 62

Collection including unum-cloud/uform-gen2-dpo