Update README.md
---
language:
- en
- zh
datasets:
- survivi/Llama-3-SynE-Dataset
library_name: transformers
pipeline_tag: text-generation
---

<p align="center">
  <img src="https://github.com/RUC-GSAI/Llama-3-SynE/blob/main/assets/llama-3-syne-logo.png" width="400"/>
</p>

<!-- <p align="center">
  📄 <a href="https://arxiv.org/abs/2407.18743"> Report </a> | 🤗 <a href="https://huggingface.co/survivi/Llama-3-SynE">Model on Hugging Face</a> | 📊 <a href="https://huggingface.co/datasets/survivi/Llama-3-SynE-Dataset">CPT Dataset</a>
</p>

<p align="center">
  🌐 <a href="https://github.com/RUC-GSAI/Llama-3-SynE/blob/main/README.md">English</a> | <a href="https://github.com/RUC-GSAI/Llama-3-SynE/blob/main/README_zh.md">简体中文</a>
</p> -->

<p align="center">
  📄 <a href="https://arxiv.org/abs/2407.18743"> Report </a> | 💻 <a href="https://github.com/RUC-GSAI/Llama-3-SynE">GitHub Repo</a>
</p>

<p align="center">
  🌐 <a href="https://huggingface.co/survivi/Llama-3-SynE/blob/main/README.md">English</a> | <a href="https://huggingface.co/survivi/Llama-3-SynE/blob/main/README_zh.md">简体中文</a>
</p>

> This is the Llama-3-SynE model. The continual pre-training dataset is also available [here](https://huggingface.co/datasets/survivi/Llama-3-SynE-Dataset).

<!-- <p align="center">
  📄 <a href="https://arxiv.org/abs/2407.18743"> Report </a> | 💻 <a href="https://github.com/RUC-GSAI/Llama-3-SynE">GitHub Repo</a>
</p>

<p align="center">
  🌐 <a href="https://huggingface.co/datasets/survivi/Llama-3-SynE-Dataset/blob/main/README.md">English</a> | <a href="https://huggingface.co/datasets/survivi/Llama-3-SynE-Dataset/blob/main/README_zh.md">简体中文</a>
</p>

> This is the continual pre-training dataset. The Llama-3-SynE model is available [here](https://huggingface.co/survivi/Llama-3-SynE). -->

---

## News

- ✨✨ `2024/08/12`: We released the [continual pre-training dataset](https://huggingface.co/datasets/survivi/Llama-3-SynE-Dataset).
- ✨✨ `2024/08/10`: We released the [Llama-3-SynE model](https://huggingface.co/survivi/Llama-3-SynE).
- ✨ `2024/07/26`: We released the [technical report](https://arxiv.org/abs/2407.18743); feel free to check it out!

## Model Introduction

**Llama-3-SynE** (<ins>Syn</ins>thetic data <ins>E</ins>nhanced Llama-3) is a significantly enhanced version of [Llama-3 (8B)](https://github.com/meta-llama/llama3), obtained through continual pre-training (CPT) to improve its **Chinese language ability and scientific reasoning capability**. By employing a meticulously designed data mixture and curriculum strategy, Llama-3-SynE gains new capabilities while maintaining the original model's performance. The enhancement process combines existing datasets with newly synthesized, high-quality datasets designed for the targeted tasks.

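As a rough, purely illustrative sketch of what a topic-weighted CPT data mixture involves (the topic names and weights below are invented for illustration and are not the mixture used for Llama-3-SynE), one could sample each training batch like this:

```python
import random

# Hypothetical topic weights for one CPT batch. The actual Llama-3-SynE
# mixture comes from topic-based analysis in the report, not from
# hand-picked constants like these.
MIXTURE = {
    "english_web": 0.45,
    "chinese_web": 0.30,
    "synthetic_science": 0.15,
    "code": 0.10,
}

def sample_batch(corpora: dict, batch_size: int, seed: int = 0) -> list:
    """Draw documents for one training batch according to MIXTURE weights."""
    rng = random.Random(seed)
    topics = list(MIXTURE)
    weights = [MIXTURE[t] for t in topics]
    return [
        rng.choice(corpora[rng.choices(topics, weights=weights, k=1)[0]])
        for _ in range(batch_size)
    ]

# Toy corpora: each topic holds placeholder document IDs.
corpora = {t: [f"{t}-doc-{i}" for i in range(100)] for t in MIXTURE}
batch = sample_batch(corpora, batch_size=8)
```

In a real pipeline the weights would come from the topic analysis and would shift across curriculum stages rather than staying fixed.
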
Key features of Llama-3-SynE include:

- **Enhanced Chinese Language Capabilities**: Achieved through a topic-based data mixture and a perplexity-based data curriculum.
- **Improved Scientific Reasoning**: Uses synthetic datasets to inject multi-disciplinary scientific knowledge.
- **Efficient CPT**: Consumes only around 100 billion tokens, making it a cost-effective solution.

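The perplexity-based data curriculum in the first bullet can be sketched as follows: each candidate document is scored by a reference model's perplexity, and training proceeds from low-perplexity (easier) to high-perplexity (harder) data. This is a minimal sketch under that assumption; the function names and toy scores are hypothetical, and the precise scheduling is described in the report.

```python
import math

def perplexity(token_logprobs: list) -> float:
    """Perplexity of one document from its per-token natural-log
    probabilities, as assigned by a reference language model."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def curriculum_stages(docs: dict, n_stages: int) -> list:
    """Rank documents from low to high perplexity and split the ranking
    into equally sized curriculum stages (easier stages first)."""
    ranked = sorted(docs, key=lambda name: perplexity(docs[name]))
    stage_size = math.ceil(len(ranked) / n_stages)
    return [ranked[i:i + stage_size] for i in range(0, len(ranked), stage_size)]

# Toy per-token log-probs: doc_b is the "easiest", doc_c the "hardest".
docs = {
    "doc_a": [-1.2, -0.8, -1.0],
    "doc_b": [-0.3, -0.5, -0.4],
    "doc_c": [-2.5, -2.1, -2.9],
    "doc_d": [-1.6, -1.4, -1.5],
}
stages = curriculum_stages(docs, n_stages=2)  # [["doc_b", "doc_a"], ["doc_d", "doc_c"]]
```

Scheduling easy-to-hard data lets the model absorb fluent in-distribution text before the rarer, harder material, which is the intuition behind a perplexity curriculum.
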
## Model List

| Model        | Type | Seq Length | Download                                                       |
| :----------- | :--- | :--------- | :------------------------------------------------------------- |
| Llama-3-SynE | Base | 8K         | [🤗 Hugging Face](https://huggingface.co/survivi/Llama-3-SynE) |

## Benchmark

For HumanEval and ARC, we report the zero-shot evaluation performance.

### Major Benchmarks

| **Models**              | **MMLU**         | **C-Eval**       | **CMMLU**        | **MATH**         | **GSM8K**        | **ASDiv**        | **MAWPS**        | **SAT-Math**     | **HumanEval**    | **MBPP**         |
| :---------------------- | :--------------- | :--------------- | :--------------- | :--------------- | :--------------- | :--------------- | :--------------- | :--------------- | :--------------- | :--------------- |
| Llama-3-8B              | **66.60**        | 49.43            | 51.03            | 16.20            | 54.40            | 72.10            | 89.30            | 38.64            | <ins>36.59</ins> | **47.00**        |
| DCLM-7B                 | 64.01            | 41.24            | 40.89            | 14.10            | 39.20            | 67.10            | 83.40            | <ins>41.36</ins> | 21.95            | 32.60            |
| Mistral-7B-v0.3         | 63.54            | 42.74            | 43.72            | 12.30            | 40.50            | 67.50            | 87.50            | 40.45            | 25.61            | 36.00            |
| Llama-3-Chinese-8B      | 64.10            | <ins>50.14</ins> | <ins>51.20</ins> | 3.60             | 0.80             | 1.90             | 0.60             | 36.82            | 9.76             | 14.80            |
| MAmmoTH2-8B             | 64.89            | 46.56            | 45.90            | **34.10**        | **61.70**        | **82.80**        | <ins>91.50</ins> | <ins>41.36</ins> | 17.68            | 38.80            |
| Galactica-6.7B          | 37.13            | 26.72            | 25.53            | 5.30             | 9.60             | 40.90            | 51.70            | 23.18            | 7.31             | 2.00             |
| **Llama-3-SynE (ours)** | <ins>65.19</ins> | **58.24**        | **57.34**        | <ins>28.20</ins> | <ins>60.80</ins> | <ins>81.00</ins> | **94.10**        | **43.64**        | **42.07**        | <ins>45.60</ins> |

> On **Chinese evaluation benchmarks** (such as C-Eval and CMMLU), Llama-3-SynE significantly outperforms the base model Llama-3 (8B), indicating that our method is very effective in improving Chinese language capabilities.

"PHY", "CHE", and "BIO" denote the physics, chemistry, and biology sub-tasks of the corresponding benchmarks.

| **Models**              | **SciEval PHY**  | **SciEval CHE**  | **SciEval BIO**  | **SciEval Avg.** | **SciQ**         | **GaoKao MathQA** | **GaoKao CHE**   | **GaoKao BIO**   | **ARC Easy**     | **ARC Challenge** | **ARC Avg.**     | **AQUA-RAT**     |
| :---------------------- | :--------------- | :--------------- | :--------------- | :--------------- | :--------------- | :---------------- | :--------------- | :--------------- | :--------------- | :---------------- | :--------------- | :--------------- |
| Llama-3-8B              | 46.95            | 63.45            | 74.53            | 65.47            | 90.90            | 27.92             | 32.85            | 43.81            | 91.37            | 77.73             | 84.51            | <ins>27.95</ins> |
| DCLM-7B                 | **56.71**        | 64.39            | 72.03            | 66.25            | **92.50**        | 29.06             | 31.40            | 37.14            | 89.52            | 76.37             | 82.94            | 20.08            |
| Mistral-7B-v0.3         | 48.17            | 59.41            | 68.89            | 61.51            | 89.40            | 30.48             | 30.92            | 41.43            | 87.33            | 74.74             | 81.04            | 23.23            |
| Llama-3-Chinese-8B      | 48.17            | 67.34            | 73.90            | <ins>67.34</ins> | 89.20            | 27.64             | 30.43            | 38.57            | 88.22            | 70.48             | 79.35            | 27.56            |
| MAmmoTH2-8B             | 49.39            | **69.36**        | <ins>76.83</ins> | **69.60**        | 90.20            | **32.19**         | <ins>36.23</ins> | <ins>49.05</ins> | **92.85**        | **84.30**         | **88.57**        | 27.17            |
| Galactica-6.7B          | 34.76            | 43.39            | 54.07            | 46.27            | 71.50            | 23.65             | 27.05            | 24.76            | 65.91            | 46.76             | 56.33            | 20.87            |
| **Llama-3-SynE (ours)** | <ins>53.66</ins> | <ins>67.81</ins> | **77.45**        | **69.60**        | <ins>91.20</ins> | <ins>31.05</ins>  | **51.21**        | **69.52**        | <ins>91.58</ins> | <ins>80.97</ins>  | <ins>86.28</ins> | **28.74**        |

> On **scientific evaluation benchmarks** (such as SciEval, GaoKao, and ARC), Llama-3-SynE significantly outperforms the base model, particularly showing remarkable improvement in Chinese scientific benchmarks (for example, a 25.71% improvement in the GaoKao biology subtest).

## License

This project is built upon Meta's Llama-3 model. Use of the Llama-3-SynE model weights must follow the Llama-3 [license agreement](https://github.com/meta-llama/llama3/blob/main/LICENSE). The code in this open-source repository is released under the [Apache 2.0](LICENSE) license.

## Citation