---
license: cc-by-4.0
datasets:
- FreedomIntelligence/ALLaVA-4V
pipeline_tag: image-text-to-text
library_name: prismcaptioner
---

<br>

# PrismCaptioner Model Card

**Model details**

PrismCaptioners are open-source captioners with the LLaVA architecture, finetuned on the GPT4V-assisted dataset [ALLaVA](https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V). We have released [PrismCaptioner-7B](https://huggingface.co/Yuxuan-Qiao/PrismCaptioner-7B) and [PrismCaptioner-2B](https://huggingface.co/Yuxuan-Qiao/PrismCaptioner-2B).

PrismCaptioner-7B details

- **Vision Backbone:** google/siglip-so400m-patch14-384
- **Language Backbone:** internlm/internlm2-7b
- **Dataset:** 1x ALLaVA-Caption-[LAION/VFLAN]
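
For a sense of how the backbones listed above fit together, the snippet below is a minimal conceptual sketch of the LLaVA-style composition: the SigLIP vision encoder feeds an MLP projector that maps image features into the InternLM2 embedding space. It only reads the published configs via `transformers`; the two-layer projector is an illustrative assumption, not a description of the released checkpoint's exact module, and it is not how Prism loads the model.

```python
# Conceptual sketch only: a LLaVA-style captioner couples a vision encoder and
# an LLM through a small projector that maps image features into the LLM's
# embedding space. Dimensions are read from the published configs.
import torch
from transformers import AutoConfig

vision_cfg = AutoConfig.from_pretrained("google/siglip-so400m-patch14-384")
llm_cfg = AutoConfig.from_pretrained("internlm/internlm2-7b", trust_remote_code=True)

vision_dim = vision_cfg.vision_config.hidden_size  # width of the SigLIP vision tower
llm_dim = llm_cfg.hidden_size                      # width of the InternLM2-7B embeddings

# Two-layer MLP projector in the LLaVA-1.5 style (an assumption for illustration,
# not a claim about the released checkpoint's projector).
projector = torch.nn.Sequential(
    torch.nn.Linear(vision_dim, llm_dim),
    torch.nn.GELU(),
    torch.nn.Linear(llm_dim, llm_dim),
)
print(vision_dim, llm_dim, projector)
```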

**Paper and codebase for more information:**

[[Paper](https://arxiv.org/abs/2406.14544)] [[Code](https://github.com/SparksJoe/Prism)]

**Intended uses**

- **Perception Module:** The model can be integrated into [Prism](https://github.com/SparksJoe/Prism) as a perception module to solve vision-language tasks by utilizing an external LLM (see the sketch after the demo below).
- **Effective Captioner:** The model can produce high-quality captions for given images.

**Model usage**

Clone the [Prism](https://github.com/SparksJoe/Prism) repo and complete the [preparation](https://github.com/SparksJoe/Prism/tree/main?tab=readme-ov-file#preparation) steps. You can then use PrismCaptioners by following the [usage](https://github.com/SparksJoe/Prism/blob/main/README.md#usage) instructions or the demo below.

```python
# In the Prism repo folder
from decouple import supported_VLM

model = supported_VLM['prismcaptioner-7b']()
res = model.generate(['assets/case1.png', 'Given the image below, please provide a detailed description of what you see.'])
```
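
As noted under Intended uses, the captioner can serve as the perception stage of a Prism-style decoupled pipeline: produce a caption first, then hand it to an external LLM for reasoning. The sketch below illustrates that flow; it assumes `generate` returns the caption as a string, and `answer_with_external_llm` is a hypothetical placeholder for whichever LLM client you prefer, not part of the Prism API (the Prism framework ships its own reasoning stage).

```python
# Minimal sketch of the decoupled perception-reasoning flow (assumptions noted above).
from decouple import supported_VLM

def answer_with_external_llm(prompt: str) -> str:
    # Hypothetical placeholder: swap in a call to your preferred LLM (API or local).
    raise NotImplementedError("plug in an external LLM here")

captioner = supported_VLM['prismcaptioner-7b']()
caption = captioner.generate(['assets/case1.png',
                              'Given the image below, please provide a detailed description of what you see.'])

question = 'What stands out in this image?'
answer = answer_with_external_llm(
    f'Image description:\n{caption}\n\nBased on the description, answer the question: {question}'
)
```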