---
license: gemma
language:
- en
pipeline_tag: image-text-to-text
---
# Cerule - A Tiny Mighty Vision Model

### Based on Google's Gemma-2b + SigLIP
We train and release "Cerule", a tiny yet powerful Vision Language Model built on Google's newly released Gemma-2b and Google's SigLIP.
- Pretraining stage: 650K images (a LAION subset)
- Finetuning stage: 695K images (SVIT-mix-665K, the Bunny mix modified by BAAI)
Training ran on 4x A100 (80GB) GPUs and took ~6 hours for pretraining and ~13 hours for finetuning. We modified and adapted the training code from Bunny.
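For intuition, the sketch below shows the LLaVA/Bunny-style composition such models typically use: SigLIP patch features are projected into the language model's embedding space by a small MLP connector and consumed as soft tokens. This is an illustrative assumption about Cerule's internals, not its exact implementation; the class name, projector shape, and hidden sizes (1152 for SigLIP-SO400M, 2048 for Gemma-2b) are assumptions based on those models' public configurations.

```python
import torch
import torch.nn as nn

# Illustrative LLaVA/Bunny-style connector (assumption, not Cerule's exact code):
# projects SigLIP patch features into the Gemma-2b embedding space.
# Hidden sizes are assumptions: SigLIP-SO400M ~1152, Gemma-2b ~2048.
class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim: int = 1152, text_dim: int = 2048):
        super().__init__()
        # Two-layer MLP projector, as popularized by LLaVA-1.5-style models.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the SigLIP encoder
        return self.proj(image_features)

# Quick shape check with a dummy batch of patch features.
connector = VisionLanguageConnector()
dummy_features = torch.randn(1, 729, 1152)  # 729 patch tokens is an assumption
print(connector(dummy_features).shape)      # torch.Size([1, 729, 2048])
```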
## Training

The training code is released at https://github.com/tensoic/Cerule.
## Inference

Clone the repository and follow the instructions there for CLI-based inference: https://github.com/tensoic/Cerule
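If you would rather experiment from Python, the snippet below is a minimal sketch of a Bunny/LLaVA-style round-trip via `trust_remote_code`. The model id `Tensoic/Cerule-v0.1`, the `<image>` placeholder splicing with the -200 image-token id, and the `process_images` helper are assumptions carried over from Bunny's published usage (the codebase Cerule adapts), not a confirmed Cerule API; the CLI in the repository above is the supported path.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model id; check the Hugging Face Hub for the exact repository name.
MODEL_ID = "Tensoic/Cerule-v0.1"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # loads the Bunny-style modeling code shipped with the weights
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

prompt = "Describe the image.\n<image>"
# Bunny-style prompt handling (assumption): split on the <image> placeholder and
# splice the special image token id (-200) between the text chunks.
chunks = [tokenizer(c).input_ids for c in prompt.split("<image>")]
input_ids = (
    torch.tensor(chunks[0] + [-200] + chunks[1], dtype=torch.long)
    .unsqueeze(0)
    .to(model.device)
)

image = Image.open("example.jpg")
# process_images is a helper defined by the Bunny remote code (assumption).
image_tensor = model.process_images([image], model.config).to(
    dtype=model.dtype, device=model.device
)

output_ids = model.generate(
    input_ids, images=image_tensor, max_new_tokens=100, use_cache=True
)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True))
```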
## License

The model is subject to the Gemma terms of use (the base model license), and the underlying datasets (LAION and SVIT) are subject to their respective licenses. All code is released under Apache 2.0.
## Acknowledgments

We sincerely thank the amazing teams at Google, LLaVA, and BAAI, without whom this project would not have been possible!