---
datasets:
- HuggingFaceFW/fineweb
- erhwenkuo/c4-chinese-zhtw
- erhwenkuo/wikipedia-zhtw
- p208p2002/wudao
- p208p2002/NDLTD-T10-90-111
- codeparrot/github-code-clean
language:
- en
- zh
---

# Llama 3 zhtw

An experiment with Chinese continued pretraining (CP) on Llama 3, trained on a total of 800M tokens.

Because the quality of available Chinese pretraining corpora still leaves room for improvement, performance after CP did not surpass the original Llama 3; we observed a similar pattern when comparing several Chinese Llama 3 models trained by the open-source community.

For English, LLaMA 3 zhtw uses FineWeb, which keeps its MMLU score above the other Chinese CP models and on par with the original LLaMA 3.

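## Usage

A minimal inference sketch (not part of the original card), assuming the checkpoint loads through the standard `transformers` causal-LM API; the generation settings are illustrative only:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "p208p2002/llama3-zhtw-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed dtype; adjust to your hardware
    device_map="auto",
)

# This is a base (continued-pretraining) model, so prompt it with plain text
# rather than a chat template.
prompt = "台灣最高的山是"  # "The highest mountain in Taiwan is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
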
## Benchmarks

| Models                       | Size | ↑ TMMLU+ (ACC) | CMMLU (ACC)   | MMLU (ACC)    |
| ---------------------------- | ---- | -------------- | ------------- | ------------- |
|                              |      | TC, Knowledge  | CN, Knowledge | EN, Knowledge |
|                              |      | 5-shot         | 5-shot        | 5-shot        |
| Yi-6B                        | 6B   | 49.63          | 75.53         | 65.35         |
| Qwen-7B                      | 7B   | 42.84          | 73.1          | 61.00         |
| Meta-Llama-3-8B              | 8B   | 41.97          | 50.8          | 65.17         |
| **p208p2002/llama3-zhtw-8B** | 8B   | 41.84          | 50.6          | 65.31         |
| Breeze-7B-Base-v0_1          | 7B   | 40.35          | 44.05         | 61.63         |
| hfl/llama-3-chinese-8b       | 8B   | 39.64          | 50.9          | 61.1          |

## Recipe

### Datasets

| Dataset        | Lang  | Weight |
| -------------- | ----- | ------ |
| FineWeb        | en    | 0.35   |
| Wudao          | zh-cn | 0.1    |
| C4Tw           | zh-tw | 0.1    |
| WikiZhTw       | zh-tw | 0.15   |
| NdltdT10       | zh-tw | 0.1    |
| GitHubMarkDown | code  | 0.1    |
| GitHubPython   | code  | 0.1    |

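The sampling mixture above can be sketched roughly as follows. Repository names come from the card's metadata; the split names, streaming mode, and the single `codeparrot/github-code-clean` repo standing in for both GitHub subsets (Markdown and Python, 0.1 each) are assumptions rather than the authors' actual preprocessing:

```python
import random
from datasets import load_dataset

# Mixture weights from the Datasets table; repos from the card metadata.
weights = {
    "HuggingFaceFW/fineweb":        0.35,  # FineWeb (en)
    "p208p2002/wudao":              0.10,  # Wudao (zh-cn)
    "erhwenkuo/c4-chinese-zhtw":    0.10,  # C4Tw (zh-tw)
    "erhwenkuo/wikipedia-zhtw":     0.15,  # WikiZhTw (zh-tw)
    "p208p2002/NDLTD-T10-90-111":   0.10,  # NdltdT10 (zh-tw)
    "codeparrot/github-code-clean": 0.20,  # GitHubMarkDown + GitHubPython
}

# Assumed: every source exposes a "train" split and can be streamed.
streams = {
    name: iter(load_dataset(name, split="train", streaming=True))
    for name in weights
}

def sample_mixture(num_examples, seed=42):
    """Pick a source per example according to its weight, then draw from it."""
    rng = random.Random(seed)
    names, probs = zip(*weights.items())
    for _ in range(num_examples):
        source = rng.choices(names, weights=probs, k=1)[0]
        yield source, next(streams[source])

for source, example in sample_mixture(3):
    print(source, list(example)[:3])  # show which fields each source provides
```
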
### Hyper Parameters

- Learning Rate: 1e-7
- Global Batch Size: 45
- Sequence Length: 8192

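A minimal sketch of how these values could map onto `transformers.TrainingArguments`; the per-device batch size / gradient-accumulation split, the precision, and every value not listed above are assumptions:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama3-zhtw-cp",     # hypothetical output path
    learning_rate=1e-7,              # from the card
    per_device_train_batch_size=1,   # assumed split: 1 sample x 45 accumulation
    gradient_accumulation_steps=45,  # steps on one device = global batch size 45
    bf16=True,                       # assumed precision
    logging_steps=10,
    save_steps=500,
)

# The 8192-token sequence length is applied when tokenizing/packing the data,
# e.g. tokenizer(batch["text"], truncation=True, max_length=8192).
```
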