---
language:
- en
tags:
- vision-language
- phi
- llava
- clip
- qlora
- multimodal
license: mit
datasets:
- laion/instructional-image-caption-data
base_model: microsoft/phi-1_5
library_name: transformers
pipeline_tag: image-to-text
---

# LLaVA-Phi Model

LLaVA-Phi is a vision-language model that pairs Microsoft's Phi-1.5 language model with a CLIP vision encoder for image understanding and captioning.

## Model Description

- **Base Model**: Microsoft Phi-1.5
- **Vision Encoder**: CLIP ViT-B/32
- **Training**: QLoRA fine-tuning
- **Dataset**: Instruct 150K

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
import torch
from PIL import Image

# Load the language model, its tokenizer, and the CLIP image processor
model = AutoModelForCausalLM.from_pretrained("sagar007/Lava_phi")
tokenizer = AutoTokenizer.from_pretrained("sagar007/Lava_phi")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Text-only generation
def generate_text(prompt):
    inputs = tokenizer(f"human: {prompt}\ngpt:", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Image + text generation
def process_image_and_prompt(image_path, prompt):
    image = Image.open(image_path).convert("RGB")
    image_tensor = processor(images=image, return_tensors="pt").pixel_values
    # LLaVA-style prompt with an <image> placeholder for the visual features
    inputs = tokenizer(f"human: <image>\n{prompt}\ngpt:", return_tensors="pt")
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        images=image_tensor,
        max_new_tokens=128
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

## Training Details

- Trained using QLoRA (Quantized Low-Rank Adaptation); an illustrative configuration sketch is included at the end of this card
- 4-bit quantization for efficiency
- Gradient checkpointing enabled
- Mixed precision training (bfloat16)

## License

MIT License

## Citation

```bibtex
@software{llava_phi_2024,
  author    = {sagar007},
  title     = {LLaVA-Phi: Vision-Language Model},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/sagar007/Lava_phi}
}
```
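
## Example: QLoRA Fine-Tuning Setup (Illustrative)

The training details above list QLoRA with 4-bit quantization, gradient checkpointing, and bfloat16 mixed precision. The snippet below is a minimal sketch of what that setup looks like with `transformers`, `bitsandbytes`, and `peft`; the LoRA hyperparameters and `target_modules` are illustrative assumptions, not the exact values used to train this checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with bfloat16 compute, matching the training details above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1_5",
    quantization_config=bnb_config,
    device_map="auto",
)

# Gradient checkpointing + preparation of the quantized weights for training
base.gradient_checkpointing_enable()
base = prepare_model_for_kbit_training(base)

# LoRA adapters on the attention/projection layers (target modules are an assumption)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable
```

With this setup the quantized base model stays frozen and only the LoRA adapters are updated, which is what keeps the memory footprint small enough for single-GPU fine-tuning.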