Lavender: Diffusion Instruction Tuning
Boosting SoTA vision-language models with Stable Diffusion
Project page: https://astrazeneca.github.io/vlm/
LoRA fine-tuned model based on meta-llama/Llama-3.2-11B-Vision-Instruct
This is a Lavender fine-tuned version of Llama-3.2-11B-Vision-Instruct. Lavender is a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. This model retains the core capabilities of Llama-3.2 while incorporating Stable Diffusion's visual expertise.
Base model: meta-llama/Llama-3.2-11B-Vision-Instruct
Adapter: lxasqjc/lavender-llama-3.2-11b-lora
License: LICENSE.txt
This model contains only LoRA weights. To use it, you must first load the base model and then apply the LoRA adapter.
pip install torch transformers accelerate peft pillow
from transformers import MllamaForConditionalGeneration, MllamaProcessor
from peft import PeftModel
import requests
import torch
from PIL import Image
# Define model paths
base_model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
lora_model_name = "lxasqjc/lavender-llama-3.2-11b-lora"
# Load base model
base_model = MllamaForConditionalGeneration.from_pretrained(
base_model_name,
torch_dtype=torch.bfloat16,
device_map="auto",
)
processor = MllamaProcessor.from_pretrained(base_model_name)
# Load LoRA adapter
lora_model = PeftModel.from_pretrained(base_model, lora_model_name)
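# Download an example image to run inference on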
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)
messages = [
{"role": "user", "content": [
{"type": "image"},
{"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
]}
]
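# Apply the chat template and preprocess the image and text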
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
image,
input_text,
add_special_tokens=False,
return_tensors="pt"
).to(lora_model.device)
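# Generate a short completion and decode it back to text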
output = lora_model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))
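Optionally, the LoRA weights can be merged into the base model so the result can be saved and served without the PEFT wrapper. The snippet below is a minimal sketch using PEFT's merge_and_unload(); the output directory name is an arbitrary example, not an official artifact.

# Merge the LoRA adapter into the base weights (optional)
merged_model = lora_model.merge_and_unload()
# Save the merged model and processor to a local directory (example path)
merged_model.save_pretrained("lavender-llama-3.2-11b-merged")
processor.save_pretrained("lavender-llama-3.2-11b-merged")

The merged checkpoint can then be reloaded directly with MllamaForConditionalGeneration.from_pretrained, with no separate adapter-loading step.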
Base model: meta-llama/Llama-3.2-11B-Vision-Instruct
Precision: bfloat16
If you use this model or this work in your research, please cite:
@misc{jin2025diffusioninstructiontuning,
title={Diffusion Instruction Tuning},
author={Chen Jin and Ryutaro Tanno and Amrutha Saseendran and Tom Diethe and Philip Teare},
year={2025},
eprint={2502.06814},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.06814},
}