import streamlit as st
from streamlit_extras.switch_page_button import switch_page


translations = {
'en': {
    'title': 'LLaVA-NeXT',
    'original_tweet': 
       """
       [Original tweet](https://twitter.com/mervenoyann/status/1770832875551682563) (March 21, 2024)
       """,
    'tweet_1':
        """
        LLaVA-NeXT was recently merged into 🤗 Transformers and it outperforms many proprietary models like Gemini on various benchmarks! 🤩  
        For those who don't know LLaVA, it's a language model that can also take images as input 💬  
        Let's take a look at the model, with a demo and more below.
        """,
    'tweet_2':
        """
        LLaVA is essentially a vision-language model that consists of a ViT-based CLIP encoder, an MLP projection and Vicuna as the decoder ✨  
        LLaVA 1.5 was released with Vicuna, but LLaVA-NeXT (1.6) was released with four different LLMs:  
        - Nous-Hermes-Yi-34B  
        - Mistral-7B  
        - Vicuna 7B & 13B 
        """,
    'tweet_3':
        """
        Thanks to the 🤗 Transformers integration, it is very easy to use LLaVA-NeXT, not only standalone but also with 4-bit loading and Flash Attention 2 💜  
        See below for standalone usage 👇 
        """,
    'tweet_4':
        """
        To fit large models in memory and make inference even faster and more memory-efficient, you can enable Flash Attention 2 and load the model in 4-bit using bitsandbytes ⚡️ Transformers makes this very easy! See below 👇 
        """,
    'tweet_5':
        """
        If you want to try the code right away, here's the [notebook](https://t.co/NvoxvY9z1u).  
        Lastly, you can directly play with the Mistral-7B-based LLaVA-NeXT through the demo [here](https://t.co/JTDlqMUwEh) 🤗 
        """,
    'ressources':
        """
        Resources:  
        [LLaVA-NeXT: Improved reasoning, OCR, and world knowledge](https://llava-vl.github.io/blog/2024-01-30-llava-next/) 
        by Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee (2024)   
        [GitHub](https://github.com/haotian-liu/LLaVA/tree/main)   
        [Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/llava_next)
        """
      },
'fr': {
    'title': 'LLaVA-NeXT',
    'original_tweet': 
       """
       [Tweet de base](https://twitter.com/mervenoyann/status/1770832875551682563) (en anglais) (21 mars 2024)
       """,
    'tweet_1':
        """
        LLaVA-NeXT a récemment été intégré à 🤗 Transformers et surpasse de nombreux modèles propriétaires comme Gemini sur différents benchmarks ! 🤩  
        Pour ceux qui ne connaissent pas LLaVA, il s'agit d'un modèle de langage qui peut aussi prendre des images en entrée 💬 
        """,
    'tweet_2':
        """
        LLaVA est essentiellement un modèle langage/vision qui se compose d'un encodeur CLIP basé sur ViT, d'une projection MLP et de Vicuna en tant que décodeur ✨.  
        LLaVA 1.5 a été publié avec Vicuna, mais LLaVA-NeXT (1.6) a été publié avec quatre LLM différents :  
        - Nous-Hermes-Yi-34B  
        - Mistral-7B  
        - Vicuna 7B & 13B 
        """,
    'tweet_3':
        """
        Grâce à l'intégration dans 🤗 Transformers, il est très facile d'utiliser LLaVA NeXT, non seulement en mode autonome mais aussi avec un chargement 4 bits et Flash Attention 2 💜.  
        Voir ci-dessous pour l'utilisation autonome 👇 
        """,
    'tweet_4':
        """
        Pour faire tenir de grands modèles en mémoire et les rendre encore plus rapides et efficaces, vous pouvez activer Flash Attention 2 et charger le modèle en 4 bits à l'aide de bitsandbytes ⚡️ Transformers rend cela très facile ! Voir ci-dessous 👇
        """,
    'tweet_5':
        """
        Si vous voulez essayer le code tout de suite, voici le [notebook](https://t.co/NvoxvY9z1u).  
        Enfin, vous pouvez directement jouer avec LLaVA-NeXT reposant sur Mistral-7B grâce à cette [démo](https://t.co/JTDlqMUwEh) 🤗 
        """,
    'ressources':
        """
        Ressources :  
        [LLaVA-NeXT: Improved reasoning, OCR, and world knowledge](https://llava-vl.github.io/blog/2024-01-30-llava-next/) 
        de Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee (2024)   
        [GitHub](https://github.com/haotian-liu/LLaVA/tree/main)   
        [Documentation d'Hugging Face](https://huggingface.co/docs/transformers/model_doc/llava_next)
        """
    }
}


def language_selector():
    languages = {'EN': '🇬🇧', 'FR': '🇫🇷'}
    selected_lang = st.selectbox('', options=list(languages.keys()), format_func=lambda x: languages[x], key='lang_selector')
    return 'en' if selected_lang == 'EN' else 'fr'

left_column, right_column = st.columns([5, 1])

# Add a selector to the right column
with right_column:
    lang = language_selector()

# Add a title to the left column
with left_column:
    st.title(translations[lang]["title"])
    
st.success(translations[lang]["original_tweet"], icon="ℹ️")
st.markdown(""" """)

st.markdown(translations[lang]["tweet_1"], unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/LLaVA-NeXT/image_1.jpeg", use_column_width=True)
st.markdown(""" """)

st.markdown(translations[lang]["tweet_2"], unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/LLaVA-NeXT/image_2.jpeg", use_column_width=True)
st.markdown(""" """)
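# Illustrative only: a minimal sketch of the LLaVA-style forward pass described
# in tweet_2 above (CLIP ViT encoder -> MLP projection -> LLM decoder). The
# names inside the snippet are simplified pseudocode, not the actual
# transformers implementation (where these pieces live at model.vision_tower,
# model.multi_modal_projector and model.language_model).
with st.expander("Architecture sketch (illustrative pseudocode)"):
    st.code("""
# A LLaVA-style model glues three pieces together:
# 1. a ViT-based CLIP vision tower encodes the image into patch features,
# 2. an MLP projects those features into the LLM's embedding space,
# 3. the projected image tokens are concatenated with the text embeddings
#    and decoded autoregressively by the LLM (Vicuna, Mistral, ...).
image_features = vision_tower(pixel_values)          # CLIP ViT encoder
image_embeds = mlp_projector(image_features)         # MLP projection
text_embeds = llm.embed_tokens(input_ids)            # text token embeddings
inputs_embeds = concat([image_embeds, text_embeds])  # merged multimodal sequence
logits = llm(inputs_embeds=inputs_embeds)            # next-token prediction
    """)
st.markdown(""" """)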

st.markdown(translations[lang]["tweet_3"], unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/LLaVA-NeXT/image_3.jpeg", use_column_width=True)
st.markdown(""" """)

with st.expander("Code"):
    st.code("""
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from PIL import Image
import requests
import torch

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True)
model.to("cuda:0")

# Example inputs: any RGB image works; the Mistral-7B checkpoint expects the [INST] chat format
url = "https://github.com/haotian-liu/LLaVA/blob/main/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "[INST] <image>\\nWhat is shown in this image? [/INST]"

inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
    """)
st.markdown(""" """)

st.markdown(translations[lang]["tweet_4"], unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/LLaVA-NeXT/image_4.jpeg", use_column_width=True)
st.markdown(""" """)

with st.expander("Code"):
    st.code("""
from transformers import LlavaNextForConditionalGeneration, BitsAndBytesConfig
import torch

# 4-bit quantization with bitsandbytes
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16)
model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", quantization_config=quantization_config, device_map="auto")

# Flash Attention 2 (requires the flash-attn package and a compatible GPU)
model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True, use_flash_attention_2=True).to("cuda:0")

# Preprocessing and generation then work exactly as in the standalone example above
    """)
st.markdown(""" """)

st.markdown(translations[lang]["tweet_5"], unsafe_allow_html=True)
st.markdown(""" """)

st.video("pages/LLaVA-NeXT/video_1.mp4", format="video/mp4")
st.markdown(""" """)

st.info(translations[lang]["ressources"], icon="📚")  

st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if lang == "en":
        if st.button('Previous paper', use_container_width=True):
            switch_page("UDOP")
    else:
        if st.button('Papier précédent', use_container_width=True):
            switch_page("UDOP")
with col2:
    if lang == "en":
        if st.button("Home", use_container_width=True):
            switch_page("Home")
    else:
        if st.button("Accueil", use_container_width=True):
            switch_page("Home")
with col3:
    if lang == "en":
        if st.button("Next paper", use_container_width=True):
            switch_page("Painter")
    else:
        if st.button("Papier suivant", use_container_width=True):
            switch_page("Painter")