# Official Repository for the paper "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling"
<a href='https://junzhan2000.github.io/AnyGPT.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/pdf/2402.12226.pdf'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> [![](https://img.shields.io/badge/Datasets-AnyInstruct-yellow)](https://huggingface.co/datasets/fnlp/AnyInstruct)

<p align="center">
    <img src="static/images/logo.png" width="16%"> <br>
</p>

## Introduction
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. The [base model](https://huggingface.co/fnlp/AnyGPT-base) aligns the four modalities, allowing for conversions between any of these modalities and text. Furthermore, using various generative models, we constructed the [AnyInstruct](https://huggingface.co/datasets/fnlp/AnyInstruct) dataset, which contains instructions for arbitrary conversions between modalities. Trained on this dataset, our [chat model](https://huggingface.co/fnlp/AnyGPT-chat) can engage in free multimodal conversations, where multimodal data can be inserted at will.

AnyGPT proposes a generative training scheme that converts data from all modalities into a unified discrete representation and trains a large language model (LLM) uniformly with the next-token-prediction task. From the perspective of "compression is intelligence": when the quality of the tokenizer is high enough and the perplexity (PPL) of the LLM is low enough, it is possible to compress the vast amount of multimodal data on the internet into the same model, giving rise to capabilities not present in a pure text-based LLM.
Demos are shown on the [project page](https://junzhan2000.github.io/AnyGPT.github.io).

## Example Demonstrations
[![Demo video](http://img.youtube.com/vi/oW3E3pIsaRg/0.jpg)](https://www.youtube.com/watch?v=oW3E3pIsaRg)

## Open-Source Checklist
- [x] Base Model
- [ ] Chat Model
- [x] Inference Code
- [x] Instruction Dataset

## Inference

### Installation

```bash
git clone https://github.com/OpenMOSS/AnyGPT.git
cd AnyGPT
conda create --name AnyGPT python=3.9
conda activate AnyGPT
pip install -r requirements.txt
```

### Model Weights
* Check the AnyGPT-base weights in [fnlp/AnyGPT-base](https://huggingface.co/fnlp/AnyGPT-base)
* Check the AnyGPT-chat weights in [fnlp/AnyGPT-chat](https://huggingface.co/fnlp/AnyGPT-chat)
* Check the SpeechTokenizer and Soundstorm weights in [fnlp/AnyGPT-speech-modules](https://huggingface.co/fnlp/AnyGPT-speech-modules)
* Check the SEED tokenizer weights in [AILab-CVC/seed-tokenizer-2](https://huggingface.co/AILab-CVC/seed-tokenizer-2)

The SpeechTokenizer is used for tokenizing and reconstructing speech, Soundstorm is responsible for completing paralinguistic information, and the SEED tokenizer is used for tokenizing images.

The model weights of unCLIP SD-UNet, which is used to reconstruct images, and Encodec-32k, which is used to tokenize and reconstruct music, will be downloaded automatically.
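
One way to fetch the released weights is with `git lfs`. The sketch below is only an illustration: the local directory names are assumptions chosen to roughly match the example command in the next section, and the layout inside each repository may differ, so point the CLI flags at wherever the files actually land.

```bash
# Sketch only: download the released weights from Hugging Face.
# Target directories are assumptions, not required by the code.
git lfs install
git clone https://huggingface.co/fnlp/AnyGPT-base models/anygpt/base
git clone https://huggingface.co/AILab-CVC/seed-tokenizer-2 models/seed-tokenizer-2
# SpeechTokenizer and Soundstorm checkpoints live in this repo; pass the
# paths of the files inside it to the corresponding CLI flags.
git clone https://huggingface.co/fnlp/AnyGPT-speech-modules models/speech-modules
```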

### Base Model CLI Inference
```bash
python anygpt/src/infer/cli_infer_base_model.py \
--model-name-or-path "path/to/AnyGPT-7B-base" \
--image-tokenizer-path models/seed-tokenizer-2/seed_quantizer.pt \
--speech-tokenizer-path "path/to/model" \
--speech-tokenizer-config "path/to/config" \
--soundstorm-path "path/to/model" \
--output-dir "infer_output/base"
```

For example:
```bash
python anygpt/src/infer/cli_infer_base_model.py \
--model-name-or-path models/anygpt/base \
--image-tokenizer-path models/seed-tokenizer-2/seed_quantizer.pt \
--speech-tokenizer-path models/speechtokenizer/ckpt.dev \
--speech-tokenizer-config models/speechtokenizer/config.json \
--soundstorm-path models/soundstorm/speechtokenizer_soundstorm_mls.pt \
--output-dir "infer_output/base"
```

#### Interaction
The Base Model can perform various tasks, including text-to-image generation, image captioning, automatic speech recognition (ASR), zero-shot text-to-speech (TTS), text-to-music generation, and music captioning.

We can perform inference following a specific instruction format; a scripted batch example is sketched after the list below.

* Text-to-Image
  * ```text|image|{caption}```
  * example:
    ```text|image|A bustling medieval market scene with vendors selling exotic goods under colorful tents```
* Image Caption
  * ```image|text|{image file path}```
  * example:
    ```image|text|static/infer/image/cat.jpg```
* TTS (random voice)
  * ```text|speech|{speech content}```
  * example:
    ```text|speech|I could be bounded in a nutshell and count myself a king of infinite space.```
* Zero-shot TTS
  * ```text|speech|{speech content}|{voice prompt}```
  * example:
    ```text|speech|I could be bounded in a nutshell and count myself a king of infinite space.|static/infer/speech/voice_prompt1.wav/voice_prompt3.wav```
* ASR
  * ```speech|text|{speech file path}```
  * example: ```speech|text|AnyGPT/static/infer/speech/voice_prompt2.wav```
* Text-to-Music
  * ```text|music|{caption}```
  * example:
    ```text|music|features an indie rock sound with distinct elements that evoke a dreamy, soothing atmosphere```
* Music Caption
  * ```music|text|{music file path}```
  * example: ```music|text|static/infer/music/features an indie rock sound with distinct element.wav```
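
The prompts above are entered at the CLI's prompt. If `cli_infer_base_model.py` also reads prompts line by line from standard input (an assumption; it may only support interactive use), several requests could be batched roughly like this:

```bash
# Rough sketch: pipe several instruction-format prompts into the CLI.
# Assumption: the script reads prompts from stdin; verify before relying on it.
printf '%s\n' \
  'text|image|A bustling medieval market scene with vendors selling exotic goods under colorful tents' \
  'speech|text|AnyGPT/static/infer/speech/voice_prompt2.wav' \
  'text|music|features an indie rock sound with distinct elements that evoke a dreamy, soothing atmosphere' \
  | python anygpt/src/infer/cli_infer_base_model.py \
      --model-name-or-path models/anygpt/base \
      --image-tokenizer-path models/seed-tokenizer-2/seed_quantizer.pt \
      --speech-tokenizer-path models/speechtokenizer/ckpt.dev \
      --speech-tokenizer-config models/speechtokenizer/config.json \
      --soundstorm-path models/soundstorm/speechtokenizer_soundstorm_mls.pt \
      --output-dir "infer_output/base"
```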

**Notes**

For different tasks, we use different language-model decoding strategies. The decoding configuration files for image, speech, and music generation are located in ```config/image_generate_config.json```, ```config/speech_generate_config.json```, and ```config/music_generate_config.json```, respectively. The decoding configuration file for other-modality-to-text tasks is ```config/text_generate_config.json```. You can directly modify or add parameters to change the decoding strategy.
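
As a concrete (hypothetical) example of changing a decoding strategy, assuming the config uses a common generation key such as `temperature` (an assumption; check the JSON for its actual keys), you could edit it from the shell:

```bash
# Sketch: inspect a decoding config and lower the sampling temperature.
# "temperature" is an assumed key name; check the file for the real keys.
cat config/image_generate_config.json
jq '.temperature = 0.7' config/image_generate_config.json > /tmp/image_generate_config.json
mv /tmp/image_generate_config.json config/image_generate_config.json
```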

Due to limitations in data and training resources, the model's generation may still be unstable. You can generate multiple times or try different decoding strategies.

Speech and music responses are saved to ```.wav``` files, and image responses are saved to ```.jpg``` files. The filename is a concatenation of the prompt and the generation time, and the paths to these files are indicated in the response.

### Training
#### Pretraining

* Install dependencies
```bash
cd FastChat
pip3 install -e ".[train]"
```
* Run (the `srun` launcher assumes a Slurm cluster; a non-Slurm sketch follows this block)
```bash
srun --partition=llm_h --job-name=pretrain --gres=gpu:8 --quotatype=spot --ntasks=1 --ntasks-per-node=1 --cpus-per-task 100 --kill-on-bad-exit=1 bash scripts/stage1_pretrain.sh
```
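
The `srun` invocation above targets a Slurm cluster with cluster-specific options (`--partition=llm_h`, `--quotatype=spot`). On a single machine without Slurm, the underlying script can presumably be launched directly; this is only a sketch under the assumption that `scripts/stage1_pretrain.sh` does not depend on Slurm-provided environment variables:

```bash
# Non-Slurm sketch (assumption: the script needs no Slurm environment).
bash scripts/stage1_pretrain.sh
```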

We have provided some sample data in the "data" folder. To download the complete dataset, please refer to the following:

* Image data: https://huggingface.co/datasets/zhanjun/AnyGPT-data-image
  * The two datasets in the t2i folder are high-quality image datasets, used for fine-tuning text-to-image generation.
* Speech data: https://huggingface.co/datasets/zhanjun/AnyGPT-data-speech
* Music data: None
* Instruction data: https://huggingface.co/datasets/zhanjun/Anygpt_data_instruction

These data have been preprocessed by the multimodal tokenizers.
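
The datasets can be fetched the same way as the model weights, for example with `git lfs`. This is a sketch only: the local target directories are assumptions and may not match the paths the training scripts expect, so check the script or config for the paths it actually reads.

```bash
# Sketch: download the preprocessed pretraining data from Hugging Face.
# Target directories are assumptions; adjust to match the training config.
git lfs install
git clone https://huggingface.co/datasets/zhanjun/AnyGPT-data-image data/image
git clone https://huggingface.co/datasets/zhanjun/AnyGPT-data-speech data/speech
git clone https://huggingface.co/datasets/zhanjun/Anygpt_data_instruction data/instruction
```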

## Acknowledgements
- [SpeechGPT](https://github.com/0nutation/SpeechGPT/tree/main/speechgpt), [Vicuna](https://github.com/lm-sys/FastChat): the codebases we built upon.
- We thank the great work of [SpeechTokenizer](https://github.com/ZhangXInFD/SpeechTokenizer), [soundstorm-speechtokenizer](https://github.com/ZhangXInFD/soundstorm-speechtokenizer), and [SEED-tokenizer](https://github.com/AILab-CVC/SEED).

## License
`AnyGPT` is released under the original [license](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) of [LLaMA2](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf).

## Citation
If you find AnyGPT and AnyInstruct useful in your research or applications, please kindly cite:
```bibtex
@article{zhan2024anygpt,
  title={AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling},
  author={Zhan, Jun and Dai, Junqi and Ye, Jiasheng and Zhou, Yunhua and Zhang, Dong and Liu, Zhigeng and Zhang, Xin and Yuan, Ruibin and Zhang, Ge and Li, Linyang and others},
  journal={arXiv preprint arXiv:2402.12226},
  year={2024}
}
```