Text Generation
Transformers
Chinese
llama
weiren119 committed on
Commit
4be4ab7
1 Parent(s): fc5b9c2

Update README.md

  1. README.md +206 -1
README.md CHANGED
@@ -1,5 +1,210 @@
1
  ---
2
- license: apache-2.0
3
  ---
4
  ## Intro
5
 
 
1
  ---
2
+ datasets:
3
+ - yentinglin/zh_TW_c4
4
+ - yentinglin/traditional_chinese_instructions
5
+ inference: false
6
+ license: llama2
7
+ language:
8
+ - zh
9
+ model_creator: Yen-Ting Lin
10
+ model_link: https://huggingface.co/yentinglin/Taiwan-LLaMa-v1.0
11
+ model_name: Language Models for Taiwanese Culture 1.0
12
+ model_type: llama
13
+ quantized_by: weiren119
14
+ ---
15
+
16
+ <!-- header start -->
17
+ <!-- header end -->
18
+
19
+ # Taiwan-LLaMa-v1.0 - GGML
20
+ - Model creator: [Yen-Ting Lin](https://huggingface.co/yentinglin)
21
+ - Original model: [Language Models for Taiwanese Culture v1.0](https://huggingface.co/yentinglin/Taiwan-LLaMa-v1.0)
22
+
23
+ ## Description
24
+
25
+ This repo contains GGML format model files for [Yen-Ting Lin's Language Models for Taiwanese Culture v1.0](https://huggingface.co/yentinglin/Taiwan-LLaMa-v1.0).
26
+
27
+ They are known to work with:
28
+ * [llama.cpp](https://github.com/ggerganov/llama.cpp), commit `e76d630` and later.
29
+
30
+ ...and probably work with these too, but I have not tested them personally:
31
+ * [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
32
+ * [KoboldCpp](https://github.com/LostRuins/koboldcpp), version 1.37 and later.
33
+ * [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), version 0.1.77 and later.
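As a rough illustration of the llama-cpp-python route, the sketch below loads a GGML file and runs one completion. It assumes a llama-cpp-python release that still reads GGML files is installed; the helper name `complete`, the context size, and the stop strings are illustrative choices rather than settings taken from this repo, and the model path is whatever file you downloaded.

```python
import os


def complete(model_path, prompt, max_tokens=256, n_ctx=2048):
    """Run a single completion against a local GGML model file.

    Hypothetical helper: the defaults here are assumptions, not values
    recommended by the model authors.
    """
    if not os.path.isfile(model_path):
        raise FileNotFoundError(f"model file not found: {model_path}")

    # Imported lazily so the module can be loaded without the dependency.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=n_ctx)
    # Stop when the model starts a new user turn or emits end-of-sequence.
    result = llm(prompt, max_tokens=max_tokens, stop=["USER:", "</s>"])
    return result["choices"][0]["text"].strip()
```

A call would look like `complete("path/to/model.bin", prompt)`, where the path is a placeholder and `prompt` follows the vicuna-v1.1 template described in the original model card.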
34
+
35
+ ## Repositories available
36
+
37
+ * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/audreyt/Taiwan-LLaMa-v1.0-GGML)
38
+ * [Yen-Ting Lin's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/yentinglin/Taiwan-LLaMa-v1.0)
39
+
40
+
41
+ <!-- footer start -->
42
+ <!-- footer end -->
43
+
44
+ # Original model card: Yen-Ting Lin's Language Models for Taiwanese Culture v1.0
45
+ # Language Models for Taiwanese Culture
46
+
47
+
48
+ <p align="center">
49
+ ✍️ <a href="https://huggingface.co/spaces/yentinglin/Taiwan-LLaMa2" target="_blank">Online Demo</a>
50
+
51
+ 🤗 <a href="https://huggingface.co/yentinglin" target="_blank">HF Repo</a> • 🐦 <a href="https://twitter.com/yentinglin56" target="_blank">Twitter</a> • 📃 <a href="https://arxiv.org/pdf/2305.13711.pdf" target="_blank">[Paper Coming Soon]</a>
52
+ • 👨️ <a href="https://yentingl.com/" target="_blank">Yen-Ting Lin</a>
53
+ <br/><br/>
54
+ <img src="https://www.csie.ntu.edu.tw/~miulab/taiwan-llama/logo-v2.png" width="100"> <br/>
55
+ <a href="https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE">
56
+ <img src="https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg"></a>
57
+ <a href="https://github.com/tatsu-lab/stanford_alpaca/blob/main/DATA_LICENSE">
58
+ <img src="https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg"></a>
59
+ <br/>
60
+
61
+ </p>
62
+
63
+
64
+
65
+
66
+ ## Overview
67
+ Taiwan-LLaMa is a full-parameter fine-tuned model based on LLaMa 2 for Traditional Chinese applications.
68
+
69
+ **Taiwan-LLaMa v1.0** is pretrained on over 5 billion tokens and instruction-tuned on over 490k conversations, both in Traditional Chinese.
70
+
71
+ ## Demo
72
+ A live demonstration of the model can be accessed at [Hugging Face Spaces](https://huggingface.co/spaces/yentinglin/Taiwan-LLaMa2).
73
+
74
+ ## Key Features
75
+
76
+ 1. **Traditional Chinese Support**: The model is fine-tuned to understand and generate text in Traditional Chinese, making it suitable for Taiwanese culture and related applications.
77
+
78
+ 2. **Instruction-Tuned**: Further fine-tuned on conversational data to offer context-aware and instruction-following responses.
79
+
80
+ 3. **Performance on Vicuna Benchmark**: Taiwan-LLaMa's relative performance on Vicuna Benchmark is measured against models like GPT-4 and ChatGPT. It's particularly optimized for Taiwanese culture.
81
+
82
+ 4. **Flexible Customization**: Advanced options for controlling the model's behavior like system prompt, temperature, top-p, and top-k are available in the demo.
83
+
84
+
85
+ ## Work in progress
86
+ - [ ] **Improved pretraining**: A refined pretraining process (e.g. more data from Taiwan, improved training strategies) is under development, aiming to improve the model's coverage of Taiwanese culture.
87
+ - [ ] **Extend max length**: Using the rotary position embedding (RoPE) mechanism described in [the paper](https://arxiv.org/abs/2104.09864), the model's maximum context length will be extended from 4k to 8k tokens.
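To make the RoPE item more concrete, here is a minimal pure-Python sketch of rotary position embeddings with an optional position scale. The card does not say how the 4k to 8k extension will be done; linear position scaling (a scale of 0.5 so that positions up to 8k fall inside the trained 4k range) is only one common approach and is an assumption here, as are the function names and the default base of 10000.

```python
import math


def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angle for each pair of dimensions at a given position.

    scale < 1 compresses positions (assumed linear scaling, not
    confirmed as the authors' method).
    """
    return [position * scale * base ** (-2 * i / head_dim)
            for i in range(head_dim // 2)]


def apply_rope(vec, position, base=10000.0, scale=1.0):
    """Rotate consecutive pairs (x, y) of vec by the position-dependent angles."""
    out = []
    for i, angle in enumerate(rope_angles(position, len(vec), base, scale)):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out
```

With `scale=0.5`, position 8000 receives the same angles as position 4000 does without scaling, which is the intuition behind stretching a 4k model to 8k. Because each pair is only rotated, vector norms are unchanged.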
88
+
89
+
90
+ ## Taiwanese Culture Examples
91
+
92
+ | Question | ChatGPT | **Taiwan LLaMa v1.0** | **Taiwan LLaMa v0.0 (w/o pretraining)** | Llama2_Traditional_Chinese_13b_Chat |
93
+ |:-|:-|:-|:-|:-|
94
+ | 以惡地景觀著名的是高雄哪一個景點? | 以惡地地景著名的是高雄市的旗津。 | 高雄以擁有惡地景觀著名的景點是田寮月世界。✅ | 以惡地景觀著名的是高雄市的旗津島。旗津島是一個小島,... | 高雄市內有多個以惡地景觀著名的景點,其中以下面幾個最為著名: 1. 高雄捷運紅線沿途景點:高雄市政府為了推廣高雄 |
95
+ | 唐鳳是誰? | 唐鳳(Audrey Tang)是台灣的政治家、社會活動家和軟體工程師。她於2016年擔任台灣行政院數位政務委員,是台灣政府歷史上首位沒有政黨背景的政務官。唐鳳在軟體工程領域有豐富的經驗。 ✅ | 唐鳳是政治評論家、資訊社會學研究者,也是台灣第一位擔任政府數位政委的人。 ✅ | 唐鳳(Tang Feng)是一位著名的中國演員和歌手。他出生於1974年12月17日,出生地點在中國北京。唐鳳在1995年開始他的演藝生涯,並在同年發行了他的首張專輯《我不孤單》。此後,他發行了許多受歡迎的歌曲,並在多部電視劇和電影中擔任主角。唐鳳因其精湛的演技和迷人的歌聲而被譽為“亞洲魅力男王”。他的音樂獲得了多個獎項,包括“華表最佳新男演唱獎”、“華表最佳歌曲獎”和“華表最佳舞蹈獎”等。 | 唐鳳是唐朝時期的一位著名文學家,他的代表作品包括《陰騭》、《狂凡子》等。他的文學風格具有強烈的個性和創意,被譽為「唐代文學之父」。 |
96
+ | NTU 在哪? | NTU位於新加坡,具體地址是50 Nanyang Avenue, Singapore 639798。 | NTU 是國立臺灣大學的縮寫。它是臺灣的一所頂尖研究型大學,位於臺北市。 ✅| NTU 是台灣的一所大學,位於台北市。✅ | NTU 的總部位於台灣新北市板橋區,佔地面積約為1,000公頃。|
97
+ | 甚麼是22K? | "22K" 可能指的是金的純度,尤其是在珠寶和金飾方面。 | "22K"是指台灣的大學畢業生的起薪,大約是新台幣22,000元。這個詞被用來形容大學畢業生的起薪偏低,並且在就業市場上對於大學畢業生的需求不足。✅| 22K是指溫度尺度上的22,000K(千度)。1K代表一度,即溫度差距,是攝氏溫標(攝氏度)的一百萬分之一。因此,22K等於-253.15°C(攝氏溫標)或-423.67°F(華氏溫標)。 | 22K是一種簡寫,即22,000。它是一個數字,表示22,000個單位。|
98
+ | 肉圓要不要加香菜? | 這是個個人口味的問題,有些人喜歡在肉圓中加入香菜,因為香菜的清香可以增添風味;而有些人則不喜歡香菜的味道,覺得它會蓋過其他食材的味道。所以要不要加香菜,可以根據個人喜好來決定。 | 這是一個個人喜好的問題,有些人喜歡在肉圓裡加香菜,有些人不喜歡。**在台灣**,肉圓加香菜是一種常見的做法,這是一種在肉圓上加香菜的做法。然而,在其他國這是一個文化差異的問題。😊 | 肉圓加香菜可以增添一些芳香和美味的口感,如果你喜歡香菜的味道,可以試試加入肉圓中。香菜通常在製作過程中炒香,可以增添一些芳香和美味的味道。如果你想要加入香菜,可以在製作過程中炒香一些香菜,然後將其混入肉圓中。肉圓加香菜可以增添一些芳香和美味的口感,如果你喜歡香菜的味道,可以試試加入肉圓中。| 加香菜的肉圓味道更好,但是不加香菜的肉圓也很好吃。如果你喜歡香菜的味道,那麼加上香菜就好了。如果你不喜歡香菜,那麼不加也不會影響肉 |
99
+
100
+
101
+ ## Model
102
+
103
+ We provide a number of model checkpoints that we trained. Please find them on Hugging Face [here](https://huggingface.co/models?search=taiwan-llama). Here are some quick links to the checkpoints that are finetuned from LLaMa 2:
104
+
105
+ | **Model** | **13B** |
106
+ |--------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|
107
+ | **Taiwan-LLaMa v1.0** (_better for Taiwanese Culture_) | 🤗 <a href="https://huggingface.co/yentinglin/Taiwan-LLaMa-v1.0" target="_blank">yentinglin/Taiwan-LLaMa-v1.0</a> |
108
+ | Taiwan-LLaMa v0.9 (partial instruction set) | 🤗 <a href="https://huggingface.co/yentinglin/Taiwan-LLaMa-v0.9" target="_blank">yentinglin/Taiwan-LLaMa-v0.9</a> |
109
+ | Taiwan-LLaMa v0.0 (no Traditional Chinese pretraining) | 🤗 <a href="https://huggingface.co/yentinglin/Taiwan-LLaMa-v0.0" target="_blank">yentinglin/Taiwan-LLaMa-v0.0</a> |
110
+
111
+ ## Data
112
+
113
+ Here are some quick links to the datasets that we used to train the models:
114
+
115
+ | **Dataset** | **Link** |
116
+ |---------------------------------|-------------------------------------------------------------------------------------------------------------------------------|
117
+ | **Instruction-tuning** | 🤗 <a href="https://huggingface.co/datasets/yentinglin/traditional_chinese_instructions" target="_blank">yentinglin/traditional_chinese_instructions</a> |
118
+ | Traditional Chinese Pretraining | 🤗 <a href="https://huggingface.co/datasets/yentinglin/zh_TW_c4" target="_blank">yentinglin/zh_TW_c4</a> |
119
+
120
+
121
+ ## Architecture
122
+ Taiwan-LLaMa is based on LLaMa 2, leveraging transformer architecture, <a href="https://github.com/Dao-AILab/flash-attention" target="_blank">flash attention 2</a>, and bfloat16.
123
+
124
+ It includes:
125
+
126
+ * Pretraining Phase: Pretrained on a corpus of over 5 billion tokens, extracted from Common Crawl in Traditional Chinese.
127
+ * Fine-tuning Phase: Further instruction-tuned on over 490k multi-turn conversations to produce more instruction-following and context-aware responses.
128
+
129
+ ## Generic Capabilities on Vicuna Benchmark
130
+
131
+ The benchmark data is translated into Traditional Chinese to evaluate general capability.
132
+
133
+
134
+ <img src="./images/zhtw_vicuna_bench_chatgptbaseline.png" width="700">
135
+
136
+ The scores are calculated with ChatGPT as the baseline, represented as 100%. The other values show the relative performance of different models compared to ChatGPT.
137
+
138
+ | Language Model | Relative Score (%) |
139
+ |-------------------------------------|--------------------|
140
+ | GPT-4 | 102.59% |
141
+ | ChatGPT | 100.00% |
142
+ | **Taiwan-LLaMa v1.0** | 76.76% |
143
+ | Claude-Instant-1.2 | 74.04% |
144
+ | Llama2_Traditional_Chinese_13b_Chat | 56.21% |
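The normalization described above (baseline fixed at 100%) can be sketched as follows. The card does not publish the raw judge scores, so the function name and any numbers passed to it are purely hypothetical; only the ratio-to-baseline arithmetic is taken from the text.

```python
def relative_scores(raw_scores, baseline):
    """Express each model's raw score as a percentage of the baseline's score."""
    base = raw_scores[baseline]
    return {name: round(100.0 * score / base, 2)
            for name, score in raw_scores.items()}
```

For example, with made-up raw totals `{"baseline": 8.0, "other": 4.0}` and `baseline="baseline"`, the other model comes out at half the baseline.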
145
+
146
+
147
+
148
+
149
+ ## How to deploy the model on my own machine?
150
+ We recommend hosting models with [🤗 Text Generation Inference](https://github.com/huggingface/text-generation-inference). Please see their [license](https://github.com/huggingface/text-generation-inference/blob/main/LICENSE) for details on usage and limitations.
151
+ ```bash
152
+ bash run_text_generation_inference.sh "yentinglin/Taiwan-LLaMa" NUM_GPUS DIR_TO_SAVE_MODEL PORT MAX_INPUT_LEN MODEL_MAX_LEN
153
+ ```
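Once a Text Generation Inference server is running, it can be queried over HTTP. The sketch below uses only the Python standard library and targets TGI's `/generate` endpoint; the base URL, the helper names, and the sampling values are assumptions for illustration, not settings from this project.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_new_tokens=256, temperature=0.7):
    """Build a POST request for a TGI server's /generate endpoint."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "do_sample": True,
            "temperature": temperature,
            "stop": ["</s>"],
        },
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, prompt, **kwargs):
    """Send the request and return the generated text from the JSON response."""
    request = build_generate_request(base_url, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["generated_text"]
```

Here `base_url` would be something like `http://localhost:PORT`, using the same `PORT` that was passed to the launch script.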
154
+
155
+ Prompt format follows vicuna-v1.1 template:
156
+ ```
157
+ A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {user} ASSISTANT:
158
+ ```
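The template above covers a single turn. A small helper for building it, including multi-turn history, might look like the sketch below. The single-turn output matches the template verbatim; the multi-turn layout (each finished assistant reply followed by `</s>`) follows my understanding of the vicuna-v1.1 convention and should be treated as an assumption, as should the function name.

```python
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(turns, system=SYSTEM_PROMPT):
    """Format a conversation as a vicuna-v1.1 style prompt.

    turns is a list of (user_message, assistant_reply) pairs; the last
    pair should use None as the reply so the model continues from there.
    """
    prompt = system + " "
    for user_message, assistant_reply in turns:
        prompt += f"USER: {user_message} "
        if assistant_reply is None:
            prompt += "ASSISTANT:"
        else:
            # Assumed convention: completed replies end with the EOS marker.
            prompt += f"ASSISTANT: {assistant_reply}</s>"
    return prompt
```

For a single question, `build_prompt([("NTU 在哪?", None)])` yields the system text followed by ` USER: NTU 在哪? ASSISTANT:`.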
159
+
160
+ ## Setup development environment
161
+ ```bash
162
+ conda create -n taiwan-llama python=3.10 -y
163
+ conda activate taiwan-llama
164
+ pip install -r requirements.txt
165
+ ```
166
+
167
+
168
+ ## Citations
169
+ If you use our code, data, or models in your research, please cite this repository. You can use the following BibTeX entries:
170
+
171
+ ```bibtex
172
+ @inproceedings{lin-chen-2023-llm,
173
+ title = "{LLM}-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models",
174
+ author = "Lin, Yen-Ting and Chen, Yun-Nung",
175
+ booktitle = "Proceedings of the 5th Workshop on NLP for Conversational AI (NLP4ConvAI 2023)",
176
+ month = jul,
177
+ year = "2023",
178
+ address = "Toronto, Canada",
179
+ publisher = "Association for Computational Linguistics",
180
+ url = "https://aclanthology.org/2023.nlp4convai-1.5",
181
+ pages = "47--58"
182
+ }
183
+
184
+ @misc{taiwanllama,
185
+ author={Lin, Yen-Ting and Chen, Yun-Nung},
186
+ title={Taiwanese-Aligned Language Models based on Meta-Llama2},
187
+ year={2023},
188
+ url={https://github.com/adamlin120/Taiwan-LLaMa},
189
+ note={Code and models available at https://github.com/adamlin120/Taiwan-LLaMa},
190
+ }
191
+ ```
192
+
193
+ ## Collaborate With Us
194
+ If you are interested in contributing to the development of Traditional Chinese language models, exploring new applications, or leveraging Taiwan-LLaMa for your specific needs, please don't hesitate to contact us. We welcome collaborations from academia, industry, and individual contributors.
195
+
196
+ ## License
197
+ The code in this project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.
198
+
199
+ The models included in this project are licensed under the LLAMA 2 Community License. See the [LLAMA2 License](https://github.com/facebookresearch/llama/blob/main/LICENSE) for full details.
200
+
201
+ ## OpenAI Data Acknowledgment
202
+ The data included in this project were generated using OpenAI's models and are subject to OpenAI's Terms of Use. Please review [OpenAI's Terms of Use](https://openai.com/policies/terms-of-use) for details on usage and limitations.
203
+
204
+
205
+ ## Acknowledgements
206
+
207
+ We thank the [Meta LLaMA team](https://github.com/facebookresearch/llama) and the [Vicuna team](https://github.com/lm-sys/FastChat) for their open-source efforts in democratizing large language models.
208
  ---
209
  ## Intro
210