IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1

@@ -45,19 +45,19 @@ pip install torch==1.12.1 tokenizers==0.13.3 git+https://github.com/huggingface/
 This example demonstrates the model's image understanding, its knowledge, and its writing ability. In the first question, the model identifies the picture as a scene from the movie Titanic and gives the film's director, release date, and awards; in the second question, it writes a modern love poem at the user's request.
-<img src="https://huggingface.co/IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1/blob/main/assets/taitanic.png">
 This example demonstrates Ziya-Visual's ability to recognise and understand traditional Chinese culture. The model identifies the content of the Chinese painting and, once given the hint 'Qingming Shanghe Tu' (《清明上河图》), also provides the historical background of the painter Zhang Zeduan and the Northern Song Dynasty.
-<img src="https://huggingface.co/IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1/blob/main/assets/song_dynasty.png">
 What about question answering over multiple images? Ziya-Visual can handle that as well. In this example, Ziya-Visual shows strong multi-image and multi-turn interaction ability: based on three images supplied by the user, it narrates a short story of a woman who meets a mother cat and her kitten in a city at night, talks with them, and then parts ways with them.
-<img src="https://huggingface.co/IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1/blob/main/assets/long_story.png">
 ### 训练 Training
@@ -73,7 +73,7 @@ In the training of Chinese visual quiz model, the biggest problem is the small a
 To better combine the capabilities of the pre-trained vision model and the LLM, as in the Mini-GPT4 and LLaVA work, the training of Ziya-Visual-v1 follows the classic network structure and the two-stage training paradigm proposed by BLIP2. In our experiments we also found that whether or not the parameters of the Vision Encoder are trained has very little impact on the final generation results. Therefore, for the overall model, we inherit the ViT + QFormer parameters from BLIP2 for the vision part and the Ziya-v1 weights for the LLM part, and both are frozen during training. The main component we train is the visual mapping layer (Projection Layer). In the first stage, we use image caption data to train the mapping layer so that the image features extracted by the Vision Encoder are aligned with the text feature space of the LLM; in the second stage, we use image question-answering datasets to further fine-tune the vision-language capabilities of Ziya-Visual.
-<img src="https://huggingface.co/IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1/blob/main/assets/Ziya-Visual.drawio.svg">
 ### 效果评估 Performance
@@ -82,15 +82,15 @@ In order to better combine the capabilities of the vision pre-training model and
 First, the VQA evaluation shows that Ziya-Visual outperforms VisualGLM on most metrics on both the Chinese and English GQA test sets, while scoring lower on BLEU-4. This indicates that Ziya-Visual gives more general and accurate answers to most open-domain multimodal questions, but that on some more open-ended questions its answers are freer in form. For mPLUG-Owl, we used the mPLUG-Owl 7B Instruction tuning (LoRA) version for English and the multilingual mPLUG-Owl 7B (Multilingual) Instruction tuning (LoRA) version for Chinese. Ziya-Visual's LLaMA, by contrast, has better multilingual comprehension and generation capabilities, and a multilingual multimodal training corpus, produced with a translation tool, was introduced in the second stage of Ziya-Visual training, so it has an advantage on the Chinese data.
-<img src="https://huggingface.co/IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1/blob/main/assets/gqa.png">
 Second, following LLaVA [2], we use GPT-4 to score the models. In this method, the captions and object detection boxes from the COCO dataset are given to GPT-4 as context; the answers of Ziya-Visual and VisualGLM to the image questions are then given to GPT-4, which is asked to rate them for usefulness, relevance, accuracy, and level of detail on a scale of 1-10. LLaVA divides the dialogue tasks into conv (simple dialogue), detail (detailed dialogue), and complex (complex reasoning); all is the combined average score over the three tasks. The final results are shown below: Ziya-Visual outperforms VisualGLM on simple and detailed dialogue, is slightly behind VisualGLM on complex reasoning, and outperforms VisualGLM on the overall average.
 Comparing with mPLUG-Owl leads to a similar conclusion: Ziya-Visual outperforms mPLUG-Owl on the overall average.
-<img src="https://huggingface.co/IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1/blob/main/assets/visualglm.png">
-<img src="https://huggingface.co/IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1/blob/main/assets/mplug.png">
 ## 使用 Usage

 This example demonstrates the model's image understanding, its knowledge, and its writing ability. In the first question, the model identifies the picture as a scene from the movie Titanic and gives the film's director, release date, and awards; in the second question, it writes a modern love poem at the user's request.
+![](assets/taitanic.png)
 This example demonstrates Ziya-Visual's ability to recognise and understand traditional Chinese culture. The model identifies the content of the Chinese painting and, once given the hint 'Qingming Shanghe Tu' (《清明上河图》), also provides the historical background of the painter Zhang Zeduan and the Northern Song Dynasty.
+![](assets/song_dynasty.png)
 What about question answering over multiple images? Ziya-Visual can handle that as well. In this example, Ziya-Visual shows strong multi-image and multi-turn interaction ability: based on three images supplied by the user, it narrates a short story of a woman who meets a mother cat and her kitten in a city at night, talks with them, and then parts ways with them.
+![](assets/long_story.png)
 ### 训练 Training
 To better combine the capabilities of the pre-trained vision model and the LLM, as in the Mini-GPT4 and LLaVA work, the training of Ziya-Visual-v1 follows the classic network structure and the two-stage training paradigm proposed by BLIP2. In our experiments we also found that whether or not the parameters of the Vision Encoder are trained has very little impact on the final generation results. Therefore, for the overall model, we inherit the ViT + QFormer parameters from BLIP2 for the vision part and the Ziya-v1 weights for the LLM part, and both are frozen during training. The main component we train is the visual mapping layer (Projection Layer). In the first stage, we use image caption data to train the mapping layer so that the image features extracted by the Vision Encoder are aligned with the text feature space of the LLM; in the second stage, we use image question-answering datasets to further fine-tune the vision-language capabilities of Ziya-Visual.
+![](assets/Ziya-Visual.drawio.svg)
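
The training setup described above can be summarised as a small data structure. This is only a sketch: the component names are illustrative labels rather than identifiers from the Ziya-Visual code, the initialisation of the projection layer is not stated in the text, and the sketch assumes the projection layer is the only trainable part in both stages.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str        # illustrative label, not an identifier from the real code
    source: str      # where the initial weights come from
    trainable: bool  # False means the weights stay frozen


COMPONENTS = [
    Component("vision_encoder", "BLIP2 (ViT)", trainable=False),
    Component("qformer", "BLIP2", trainable=False),
    Component("projection_layer", "not stated in the text", trainable=True),
    Component("llm", "Ziya-v1", trainable=False),
]

# Each stage: (name, training data, goal). Assumption: the same component
# (the projection layer) is trained in both stages; only the data changes.
STAGES = [
    ("stage 1", "image caption data", "align image features with the LLM text space"),
    ("stage 2", "image question-answering data", "fine-tune vision-language ability"),
]


def trainable_names(components):
    """Return the names of the components whose weights are updated."""
    return [c.name for c in components if c.trainable]


if __name__ == "__main__":
    print("trainable:", trainable_names(COMPONENTS))
    for name, data, goal in STAGES:
        print(f"{name}: train on {data} to {goal}")
```

In a real training script the `trainable` flag would typically translate into disabling gradients for the frozen modules, so that only the projection layer receives updates.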
 ### 效果评估 Performance
 First, the VQA evaluation shows that Ziya-Visual outperforms VisualGLM on most metrics on both the Chinese and English GQA test sets, while scoring lower on BLEU-4. This indicates that Ziya-Visual gives more general and accurate answers to most open-domain multimodal questions, but that on some more open-ended questions its answers are freer in form. For mPLUG-Owl, we used the mPLUG-Owl 7B Instruction tuning (LoRA) version for English and the multilingual mPLUG-Owl 7B (Multilingual) Instruction tuning (LoRA) version for Chinese. Ziya-Visual's LLaMA, by contrast, has better multilingual comprehension and generation capabilities, and a multilingual multimodal training corpus, produced with a translation tool, was introduced in the second stage of Ziya-Visual training, so it has an advantage on the Chinese data.
+![](assets/gqa.png)
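
To see why a free-form answer can be correct and still score low on BLEU-4, a minimal sentence-level BLEU-4 sketch helps. It assumes whitespace tokenisation, a single reference, and no smoothing; the evaluation behind the figures above may use a different tokeniser (Chinese text in particular needs one) and a corpus-level variant.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu4(candidate, reference):
    """Unsmoothed sentence-level BLEU-4 against a single reference."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: candidates shorter than the reference are penalised.
    if len(cand) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / 4)


if __name__ == "__main__":
    reference = "the cat sits on the red mat"
    print(bleu4("the cat sits on the red mat", reference))  # exact match
    print(bleu4("the cat sits on the mat", reference))      # close wording
    print(bleu4("a cat is sitting on a mat", reference))    # paraphrase
```

The paraphrase shares no bigram with the reference, so its score collapses to zero even though the answer is reasonable. That is the kind of effect that can lower BLEU-4 for freer answers.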
 Second, following LLaVA [2], we use GPT-4 to score the models. In this method, the captions and object detection boxes from the COCO dataset are given to GPT-4 as context; the answers of Ziya-Visual and VisualGLM to the image questions are then given to GPT-4, which is asked to rate them for usefulness, relevance, accuracy, and level of detail on a scale of 1-10. LLaVA divides the dialogue tasks into conv (simple dialogue), detail (detailed dialogue), and complex (complex reasoning); all is the combined average score over the three tasks. The final results are shown below: Ziya-Visual outperforms VisualGLM on simple and detailed dialogue, is slightly behind VisualGLM on complex reasoning, and outperforms VisualGLM on the overall average.
 Comparing with mPLUG-Owl leads to a similar conclusion: Ziya-Visual outperforms mPLUG-Owl on the overall average.
+![](assets/visualglm.png)
+![](assets/mplug.png)
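
The aggregation of GPT-4 ratings into conv, detail, complex, and all can be sketched as follows. The ratings in the example are made up for illustration, and taking all as the mean of the three category means is an assumption; it coincides with the mean over all questions when each category has the same number of questions.

```python
from statistics import mean

CATEGORIES = ("conv", "detail", "complex")


def summarize(ratings):
    """Average 1-10 ratings per dialogue category and overall.

    ratings: iterable of (category, score) pairs.
    Raises statistics.StatisticsError if a category has no ratings.
    """
    by_category = {name: [] for name in CATEGORIES}
    for category, score in ratings:
        by_category[category].append(score)
    summary = {name: mean(scores) for name, scores in by_category.items()}
    # Assumption: "all" is the mean of the three category means.
    summary["all"] = mean(summary[name] for name in CATEGORIES)
    return summary


if __name__ == "__main__":
    # Made-up ratings, not results from the evaluation above.
    example = [
        ("conv", 8), ("conv", 6),
        ("detail", 7), ("detail", 5),
        ("complex", 9), ("complex", 7),
    ]
    print(summarize(example))
```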
 ## 使用 Usage