import streamlit as st
from streamlit_extras.switch_page_button import switch_page
translations = {
'en': {'title': 'LLaVA-NeXT',
'original_tweet':
"""
[Original tweet](https://twitter.com/mervenoyann/status/1770832875551682563) (March 21, 2024)
""",
'tweet_1':
"""
LLaVA-NeXT was recently merged into 🤗 Transformers and it outperforms many proprietary models like Gemini on various benchmarks! 🤩
For those who don't know LLaVA, it's a language model that can also take images as input 💬
Let's take a look, with a demo and more below.
""",
'tweet_2':
"""
LLaVA is essentially a vision-language model that consists of a ViT-based CLIP encoder, an MLP projection and Vicuna as the decoder ✨
LLaVA 1.5 was released with Vicuna, but LLaVA NeXT (1.6) was released with four different LLMs:
- Nous-Hermes-Yi-34B
- Mistral-7B
- Vicuna 7B & 13B
""",
'tweet_3':
"""
Thanks to the 🤗 Transformers integration, it is very easy to use LLaVA NeXT, not only standalone but also with 4-bit loading and Flash Attention 2 💜
See below for standalone usage 👇
""",
'tweet_4':
"""
To fit large models and make inference even faster and more memory efficient, you can enable Flash Attention 2 and load the model in 4-bit using bitsandbytes ⚡️ Transformers makes this very easy! See below 👇
""",
'tweet_5':
"""
If you want to try the code right away, here's the [notebook](https://t.co/NvoxvY9z1u).
Lastly, you can directly play with the Mistral-7B-based LLaVA-NeXT through the demo [here](https://t.co/JTDlqMUwEh) 🤗
""",
'ressources':
"""
Resources:
[LLaVA-NeXT: Improved reasoning, OCR, and world knowledge](https://llava-vl.github.io/blog/2024-01-30-llava-next/)
by Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee (2024)
[GitHub](https://github.com/haotian-liu/LLaVA/tree/main)
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/llava_next)
"""
},
'fr': {
'title': 'LLaVA-NeXT',
'original_tweet':
"""
[Tweet de base](https://twitter.com/mervenoyann/status/1770832875551682563) (en anglais) (21 mars 2024)
""",
'tweet_1':
"""
LLaVA-NeXT a récemment été intégré à 🤗 Transformers et surpasse de nombreux modèles propriétaires comme Gemini sur différents benchmarks !🤩
Pour ceux qui ne connaissent pas LLaVA, il s'agit d'un modèle de langage qui peut aussi prendre des images en entrée 💬
Jetons-y un coup d'œil : démo et plus encore ci-dessous.
""",
'tweet_2':
"""
LLaVA est essentiellement un modèle langage/vision qui se compose d'un encodeur CLIP basé sur ViT, d'une projection MLP et de Vicuna en tant que décodeur ✨.
LLaVA 1.5 a été publié avec Vicuna, mais LLaVA NeXT (1.6) est publié avec quatre LLM différents :
- Nous-Hermes-Yi-34B
- Mistral-7B
- Vicuna 7B & 13B
""",
'tweet_3':
"""
Grâce à l'intégration dans 🤗 Transformers, il est très facile d'utiliser LLaVA NeXT, non seulement en mode autonome mais aussi avec un chargement 4 bits et Flash Attention 2 💜.
Voir ci-dessous pour l'utilisation autonome 👇
""",
'tweet_4':
"""
Pour faire tenir de grands modèles et les rendre encore plus rapides et plus efficaces en mémoire, vous pouvez activer Flash Attention 2 et charger le modèle en 4 bits à l'aide de bitsandbytes ⚡️ Transformers rend cela très facile ! Voir ci-dessous 👇
""",
'tweet_5':
"""
Si vous voulez essayer le code tout de suite, voici le [notebook](https://t.co/NvoxvY9z1u).
Enfin, vous pouvez directement jouer avec le LLaVA-NeXT reposant sur Mistral-7B grâce à cette [démo](https://t.co/JTDlqMUwEh) 🤗
""",
'ressources':
"""
Ressources :
[LLaVA-NeXT: Improved reasoning, OCR, and world knowledge](https://llava-vl.github.io/blog/2024-01-30-llava-next/)
de Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee (2024)
[GitHub](https://github.com/haotian-liu/LLaVA/tree/main)
[Documentation d'Hugging Face](https://huggingface.co/docs/transformers/model_doc/llava_next)
"""
}
}
def language_selector():
    languages = {'EN': '🇬🇧', 'FR': '🇫🇷'}
    selected_lang = st.selectbox('', options=list(languages.keys()), format_func=lambda x: languages[x], key='lang_selector')
    return 'en' if selected_lang == 'EN' else 'fr'
left_column, right_column = st.columns([5, 1])
# Add a selector to the right column
with right_column:
    lang = language_selector()
# Add a title to the left column
with left_column:
    st.title(translations[lang]["title"])
st.success(translations[lang]["original_tweet"], icon="ℹ️")
st.markdown(""" """)
st.markdown(translations[lang]["tweet_1"], unsafe_allow_html=True)
st.markdown(""" """)
st.image("pages/LLaVA-NeXT/image_1.jpeg", use_column_width=True)
st.markdown(""" """)
st.markdown(translations[lang]["tweet_2"], unsafe_allow_html=True)
st.markdown(""" """)
st.image("pages/LLaVA-NeXT/image_2.jpeg", use_column_width=True)
st.markdown(""" """)
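# Illustrative sketch (not from the original thread): the three building blocks described above can
# be inspected directly on the loaded model; the attribute names below assume the 🤗 Transformers
# LLaVA-NeXT implementation.
with st.expander("Code"):
    st.code("""
from transformers import LlavaNextForConditionalGeneration

model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

print(type(model.vision_tower).__name__)           # ViT-based CLIP image encoder
print(type(model.multi_modal_projector).__name__)  # MLP projection
print(type(model.language_model).__name__)         # Mistral LLM used as the decoder
""")
st.markdown(""" """)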
st.markdown(translations[lang]["tweet_3"], unsafe_allow_html=True)
st.markdown(""" """)
st.image("pages/LLaVA-NeXT/image_3.jpeg", use_column_width=True)
st.markdown(""" """)
with st.expander("Code"):
    st.code("""
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from PIL import Image
import requests
import torch

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True)
model.to("cuda:0")

# Example image and prompt (replace with your own); the Mistral-based checkpoint uses the [INST] ... [/INST] format
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
prompt = "[INST] <image>\\nWhat is shown in this image? [/INST]"

inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
""")
st.markdown(""" """)
st.markdown(translations[lang]["tweet_4"], unsafe_allow_html=True)
st.markdown(""" """)
st.image("pages/LLaVA-NeXT/image_4.jpeg", use_column_width=True)
st.markdown(""" """)
with st.expander("Code"):
    st.code("""
from transformers import LlavaNextForConditionalGeneration, BitsAndBytesConfig
import torch

# 4-bit quantization with bitsandbytes
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16)
model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", quantization_config=quantization_config, device_map="auto")

# Flash Attention 2
model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True, use_flash_attention_2=True).to(0)
""")
st.markdown(""" """)
st.markdown(translations[lang]["tweet_5"], unsafe_allow_html=True)
st.markdown(""" """)
st.video("pages/LLaVA-NeXT/video_1.mp4", format="video/mp4")
st.markdown(""" """)
st.info(translations[lang]["ressources"], icon="📚")
st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if lang == "en":
        if st.button('Previous paper', use_container_width=True):
            switch_page("UDOP")
    else:
        if st.button('Papier précédent', use_container_width=True):
            switch_page("UDOP")
with col2:
    if lang == "en":
        if st.button("Home", use_container_width=True):
            switch_page("Home")
    else:
        if st.button("Accueil", use_container_width=True):
            switch_page("Home")
with col3:
    if lang == "en":
        if st.button("Next paper", use_container_width=True):
            switch_page("Painter")
    else:
        if st.button("Papier suivant", use_container_width=True):
            switch_page("Painter")