---
license: mit
language:
- en
library_name: transformers
---
|
|
|
# OpenFlamingo-9B-HF
|
|
|
This is a Hugging Face version of OpenFlamingo. To make the model compatible with Hugging Face's Accelerate features, we wrapped the original OpenFlamingo in a Hugging Face model class.
|
|
|
You can find a detailed description in the [luodian/otter](https://github.com/Luodian/otter) repository.
|
|
|
The current OF-9B-HF model supports fully sharded training and can be loaded on consumer GPUs (e.g., a 24 GB RTX 3090).
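
A minimal loading sketch is shown below. The wrapper's import path and the checkpoint id are assumptions (the actual names are defined in the luodian/otter repository); the rest uses the standard `from_pretrained` interface with Accelerate's `device_map` placement.

```python
# Minimal sketch: loading the HF-wrapped OpenFlamingo checkpoint onto a single
# consumer GPU. The import path and checkpoint id below are assumptions based
# on the luodian/otter repository layout; adjust them to the actual names.
import torch
from flamingo.modeling_flamingo import FlamingoForConditionalGeneration  # hypothetical import path

model = FlamingoForConditionalGeneration.from_pretrained(
    "luodian/openflamingo-9b-hf",   # illustrative checkpoint id
    torch_dtype=torch.bfloat16,     # reduced precision to fit in 24 GB
    device_map="auto",              # let Accelerate place the weights
)
model.eval()
```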
|
|
|
The following is the original OpenFlamingo model description.
|
|
|
[Blog post](https://laion.ai/blog/open-flamingo/) | [Code](https://github.com/mlfoundations/open_flamingo) | [Demo](https://7164d2142d11.ngrok.app)
|
|
|
OpenFlamingo is an open-source implementation of DeepMind's [Flamingo](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model) models.

OpenFlamingo-9B is built on [CLIP ViT-L/14](https://huggingface.co/openai/clip-vit-large-patch14) and [LLaMA-7B](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/). Before using this model, please familiarize yourself with our [terms and conditions](https://github.com/mlfoundations/open_flamingo/blob/main/TERMS_AND_CONDITIONS.md).
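
For reference, the upstream `open_flamingo` package assembles the model from these two backbones roughly as follows. This is a sketch based on the open_flamingo README; the LLaMA paths are placeholders, and the exact arguments may differ between releases.

```python
# Sketch: building OpenFlamingo-9B from the CLIP ViT-L/14 vision encoder and
# LLaMA-7B, roughly following the open_flamingo README. LLaMA weights must be
# obtained separately; the paths below are placeholders.
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="<path/to/llama-7b-hf>",
    tokenizer_path="<path/to/llama-7b-hf>",
    cross_attn_every_n_layers=4,  # cross-attention inserted every 4 LM layers
)
```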
|
|
|
## Model Details

We freeze the pretrained vision encoder and language model, and train only the connecting Perceiver modules and cross-attention layers, following the original Flamingo paper.
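
As a rough illustration of this setup (the module name patterns here are hypothetical; the real ones are defined in the open_flamingo code), the trainable parameters can be selected like this:

```python
# Rough illustration: freeze the pretrained backbones and train only the newly
# added connector modules (Perceiver resampler and gated cross-attention).
# The name patterns below are hypothetical and depend on the implementation.
def select_trainable_parameters(model):
    for param in model.parameters():
        param.requires_grad = False  # freeze vision encoder and language model

    for name, param in model.named_parameters():
        if "perceiver" in name or "gated_cross_attn" in name:
            param.requires_grad = True  # train only the connector modules

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable parameters: {trainable / 1e6:.1f}M")
```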
|
|
|
Our training data is a mixture of [LAION 2B](https://huggingface.co/datasets/laion/laion2B-en) and a large interleaved image-text dataset called Multimodal C4, which will be released soon.

The current model is an early checkpoint of an ongoing effort. This checkpoint has seen 5 million interleaved image-text examples from Multimodal C4.
|
|
|
## Uses

OpenFlamingo-9B is intended to be used **for academic research purposes only.** Commercial use is prohibited, in line with LLaMA's non-commercial license.
|
|
|
### Bias, Risks, and Limitations

This model may generate inaccurate or offensive outputs, reflecting biases in its training data and pretrained priors.

To mitigate potential biases and harms, we have deployed a content filter on model outputs in the OpenFlamingo demo. We continue to red-team the model to understand and improve its safety.
|
|
|
## Evaluation

We've evaluated this checkpoint and report validation performance for two vision-language tasks: COCO captioning and VQAv2. Results are displayed below.
|
|
|
**COCO (CIDEr)**

|0-shot|4-shot|8-shot|16-shot|32-shot|
|--|--|--|--|--|
|65.52|74.28|79.26|81.84|84.52|
|
|
|
|
|
**VQAv2 (VQA accuracy)**

|0-shot|4-shot|8-shot|16-shot|32-shot|
|---|---|---|---|---|
|43.55|44.05|47.5|48.87|50.34|
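
The few-shot numbers above come from interleaving in-context image-text examples before the query, roughly as in the open_flamingo README. The sketch below assumes the `model`, `image_processor`, and `tokenizer` from the upstream package shown earlier; the `<image>`/`<|endofchunk|>` markers and the `generate` arguments follow that README and may differ in the HF wrapper.

```python
# Sketch: 2-shot captioning with interleaved image-text context, roughly
# following the open_flamingo README. Image paths are placeholders.
import torch
from PIL import Image

images = [Image.open(p) for p in ["demo1.jpg", "demo2.jpg", "query.jpg"]]

# vision_x shape: (batch, num_images, num_frames, channels, height, width)
vision_x = torch.stack([image_processor(im) for im in images]).unsqueeze(0).unsqueeze(2)

# Two in-context captioning examples followed by the query image.
prompt = (
    "<image>An image of two cats.<|endofchunk|>"
    "<image>An image of a bathroom sink.<|endofchunk|>"
    "<image>An image of"
)
tokenizer.padding_side = "left"
lang_x = tokenizer([prompt], return_tensors="pt")

generated = model.generate(
    vision_x=vision_x,
    lang_x=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    max_new_tokens=20,
    num_beams=3,
)
print(tokenizer.decode(generated[0]))
```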