lavender-llama-3.2-11b-lora

🚀 LoRA fine-tuned model based on meta-llama/Llama-3.2-11B-Vision-Instruct

Model Overview

This is a Lavender fine-tuned version of Llama-3.2-11B-Vision-Instruct. Lavender is a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. This model retains the core capabilities of Llama-3.2 while incorporating Stable Diffusion's visual expertise.


πŸ› οΈ How to Use This Model

This model contains only LoRA weights. To use it, you must first load the base model and then apply the LoRA adapter.

Install Required Packages

pip install torch transformers accelerate peft pillow

Load and Use the LoRA Model

from transformers import MllamaForConditionalGeneration, MllamaProcessor
from peft import PeftModel
import requests
import torch
from PIL import Image

# Define model paths
base_model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
lora_model_name = "lxasqjc/lavender-llama-3.2-11b-lora"

# Load base model
base_model = MllamaForConditionalGeneration.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = MllamaProcessor.from_pretrained(base_model_name)

# Load LoRA adapter
lora_model = PeftModel.from_pretrained(base_model, lora_model_name)

# Load an example image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a multimodal chat prompt and format it with the chat template
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(lora_model.device)

# Generate a continuation and decode it
output = lora_model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))
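
If you prefer a single, standalone set of weights at inference time, the LoRA adapter can be folded into the base model with PEFT's merge_and_unload. A minimal sketch, assuming the example above has already run (the output directory name is only an illustration):

# Fold the LoRA weights into the base model and save a standalone checkpoint
merged_model = lora_model.merge_and_unload()
merged_model.save_pretrained("lavender-llama-3.2-11b-merged")  # example output path
processor.save_pretrained("lavender-llama-3.2-11b-merged")

The merged checkpoint can then be loaded with MllamaForConditionalGeneration.from_pretrained like any other model, without going through peft at inference time.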

💡 Applications

  • Vision-Language Tasks: describing images and answering visual questions (see the prompt sketch after this list).
  • Instruction-Following: improved instruction compliance compared to the base Llama-3.2.
  • Multimodal Reasoning: can process and reason about images together with text.
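
For example, visual question answering only requires swapping the messages in the usage example above; the loading and generation code is unchanged (the question below is just an illustration):

# Ask a direct question about the image instead of requesting a haiku
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What animal is shown in this picture, and what is it doing?"}
    ]}
]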

📊 Training Details

  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • Base Model: meta-llama/Llama-3.2-11B-Vision-Instruct
  • PEFT Framework: Hugging Face PEFT (the adapter configuration can be inspected as sketched below)
  • Precision: bfloat16
  • Hyperparameters: Coming soon
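
Until the full hyperparameters are published, the adapter's stored LoRA settings (rank, alpha, target modules) can be read directly from the uploaded adapter config:

from peft import PeftConfig

# Load only the adapter configuration (no weights) and print the LoRA settings
config = PeftConfig.from_pretrained("lxasqjc/lavender-llama-3.2-11b-lora")
print(config)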

πŸ“ Limitations & Considerations

  • Not a standalone model: the LoRA weights must be applied on top of the base Llama-3.2-11B-Vision-Instruct model.
  • Biases & Ethical Use: like all large models, it may exhibit biases present in the pretraining data.
  • Hardware Requirements: minimum 24GB VRAM (A100 40GB recommended) for inference; a quantized-loading sketch follows this list.
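
If 24GB of VRAM is not available, one common workaround is to load the base model in 4-bit precision via bitsandbytes before attaching the adapter. A minimal sketch, not validated for this specific model (quantization may affect output quality), requiring pip install bitsandbytes:

from transformers import MllamaForConditionalGeneration, BitsAndBytesConfig
from peft import PeftModel
import torch

# 4-bit NF4 quantization to shrink the memory footprint of the 11B base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = MllamaForConditionalGeneration.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
lora_model = PeftModel.from_pretrained(base_model, "lxasqjc/lavender-llama-3.2-11b-lora")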

🔗 References

Citation

If you use this model or the Lavender method in your research, please cite:

@misc{jin2025diffusioninstructiontuning,
  title={Diffusion Instruction Tuning},
  author={Chen Jin and Ryutaro Tanno and Amrutha Saseendran and Tom Diethe and Philip Teare},
  year={2025},
  eprint={2502.06814},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.06814},
}