Lavender: Diffusion Instruction Tuning
Boosting SoTA vision-language models with Stable Diffusion
Project page: https://astrazeneca.github.io/vlm/
LoRA fine-tuned model based on meta-llama/Llama-3.2-11B-Vision-Instruct
This is a Lavender fine-tuned version of Llama-3.2-11B-Vision-Instruct. Lavender is a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. This model retains the core capabilities of Llama-3.2 while incorporating Stable Diffusion's visual expertise.
Base model: meta-llama/Llama-3.2-11B-Vision-Instruct
Adapter: lxasqjc/lavender-llama-3.2-11b-lora
License: LICENSE.txt
This model contains only LoRA weights. To use it, you must first load the base model and then apply the LoRA adapter.
pip install torch transformers accelerate peft pillow
from transformers import MllamaForConditionalGeneration, MllamaProcessor
from peft import PeftModel
import requests
import torch
from PIL import Image
# Define model paths
base_model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
lora_model_name = "lxasqjc/lavender-llama-3.2-11b-lora"
# Load base model
base_model = MllamaForConditionalGeneration.from_pretrained(
base_model_name,
torch_dtype=torch.bfloat16,
device_map="auto",
)
processor = MllamaProcessor.from_pretrained(base_model_name)
# Load LoRA adapter
lora_model = PeftModel.from_pretrained(base_model, lora_model_name)
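# Download an example image to run inference on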
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)
messages = [
{"role": "user", "content": [
{"type": "image"},
{"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
]}
]
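# Apply the chat template and preprocess the image and text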
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
image,
input_text,
add_special_tokens=False,
return_tensors="pt"
).to(lora_model.device)
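# Generate a short completion and decode it back to text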
output = lora_model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))
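Optionally, the LoRA weights can be merged into the base model so the result can be saved and served without the PEFT wrapper. The snippet below is a minimal sketch using PEFT's merge_and_unload(); the output directory name is an arbitrary example, not an official artifact.

# Merge the LoRA adapter into the base weights (optional)
merged_model = lora_model.merge_and_unload()
# Save the merged model and processor to a local directory (example path)
merged_model.save_pretrained("lavender-llama-3.2-11b-merged")
processor.save_pretrained("lavender-llama-3.2-11b-merged")

The merged checkpoint can then be reloaded directly with MllamaForConditionalGeneration.from_pretrained, with no separate adapter-loading step.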
Base model: meta-llama/Llama-3.2-11B-Vision-Instruct
Precision: bfloat16
If you use this model or this work in your research, please cite:
@misc{jin2025diffusioninstructiontuning,
title={Diffusion Instruction Tuning},
author={Chen Jin and Ryutaro Tanno and Amrutha Saseendran and Tom Diethe and Philip Teare},
year={2025},
eprint={2502.06814},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.06814},
}