---
language:
- en
tags:
- deepspeed
- visualchat
- multi-image
- causal
- chat
license: apache-2.0
datasets:
- openai/clip-vit-large-patch14
---

# Llama-2-13b-deepspeed-visualchat

> **ATTENTION**: This encoder requires the QwenCLIP model.

DeepSpeed-VisualChat is a scalable, efficient, and user-friendly multi-modal training pipeline. It uses a novel multi-modal causal attention mechanism to better align visual and text features, and it applies data blending techniques to address the scarcity of interleaved text-and-image inputs in existing datasets.

The framework trains with a 2B visual encoder from Qwen-VL and a 13B-70B language decoder from LLaMA-2, which demonstrates its scalability across model sizes. DeepSpeed-VisualChat is open source, and community contributions and collaborations are welcome. Visit the GitHub page to get started.