metadata
license: cc-by-4.0
datasets:
- FreedomIntelligence/ALLaVA-4V
pipeline_tag: image-text-to-text
library_name: prismcaptioner
PrismCaptioner Model Card
Model details
PrismCaptioners are open-source captioners built on the LLaVA architecture and fine-tuned on ALLaVA, a GPT-4V-assisted dataset. We have released PrismCaptioner-7B and PrismCaptioner-2B.
PrismCaptioner-2B details
- Vision Backbone: google/siglip-so400m-patch14-384
- Language Backbone: internlm/internlm2-1_8b
- Dataset: 1x ALLaVA-Caption-[LAION/VFLAN], 2x Evol-Instruct-GPT4-Turbo-143K
For more information, see the paper and codebase: [Paper] [Code]
Intended uses
- Perception Module: The model can be integrated into Prism as a perception module, so that vision-language tasks are solved together with an external LLM.
- Effective Captioner: The model can produce high-quality captions for given images.
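The perception-module use can be pictured as a two-stage pipeline: the captioner turns the image into text, and an external LLM answers the question from that text alone. The sketch below is a minimal illustration under assumptions, not the Prism implementation: `answer_with_caption`, `caption_fn`, `reason_fn`, and both stub functions are hypothetical names that stand in for a real captioner and LLM.

```python
from typing import Callable


def answer_with_caption(
    image_path: str,
    question: str,
    caption_fn: Callable[[str, str], str],
    reason_fn: Callable[[str], str],
) -> str:
    """Two-stage sketch: perceive (caption) first, then reason over text only."""
    # Stage 1: ask the captioner for a description of the image.
    instruction = "Given the image below, please provide a detailed description of what you see."
    caption = caption_fn(image_path, instruction)

    # Stage 2: the LLM never sees the image, only the caption and the question.
    prompt = (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        "Answer the question using only the description."
    )
    return reason_fn(prompt)


# Stub components (hypothetical) so the sketch runs without any model weights.
def stub_captioner(image_path: str, instruction: str) -> str:
    return f"A placeholder caption for {image_path}."


def stub_llm(prompt: str) -> str:
    return f"(stub answer based on {len(prompt)} prompt characters)"


print(answer_with_caption("assets/case1.png", "What is shown?", stub_captioner, stub_llm))
```

Passing the captioner and the LLM as plain callables keeps the two stages independent, so either one can be swapped without touching the other.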
Model usage
Clone the Prism repo and complete the preparation steps. You can then use PrismCaptioners by following the usage example below or the demo.
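# Run from inside the Prism repo folder
from decouple import supported_VLM

# Instantiate the 2B captioner from the registry of supported models
model = supported_VLM['prismcaptioner-2b']()

# The input is a list holding the image path followed by the text instruction
res = model.generate(['assets/case1.png', 'Given the image below, please provide a detailed description of what you see.'])
print(res)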
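To caption several images, the same `generate` call can be wrapped in a small loop. The following is a sketch under assumptions: `caption_images` and `EchoModel` are hypothetical helpers, not part of the Prism repo, and the code assumes `generate` takes an `[image_path, prompt]` list and returns the caption, as in the snippet above.

```python
from typing import Dict, Iterable

PROMPT = "Given the image below, please provide a detailed description of what you see."


def caption_images(model, image_paths: Iterable[str], prompt: str = PROMPT) -> Dict[str, str]:
    """Caption each image by calling model.generate([image_path, prompt])."""
    captions = {}
    for path in image_paths:
        captions[path] = model.generate([path, prompt])
    return captions


class EchoModel:
    """Hypothetical stand-in so the sketch runs without the Prism repo."""

    def generate(self, message):
        image_path, _prompt = message
        return f"caption for {image_path}"


print(caption_images(EchoModel(), ["assets/case1.png"]))
```

Inside the Prism repo, the stub would be replaced by the model object created with `supported_VLM['prismcaptioner-2b']()`.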