Commit eba1e27 by survivi (parent: 2269080): Update README.md. Files changed: README.md (+134 -30).

---
language:
- en
- zh
---

# Llama-3-SynE

<p align="center">
📄<a href="https://arxiv.org/abs/2407.18743" target="_blank"> Report </a> • 💻 <a href="https://github.com/RUC-GSAI/Llama-3-SynE" target="_blank">GitHub Repo</a>
</p>

<p align="center">
🔍<a href="README_zh.md" target="_blank">中文</a>
</p>

## News
- ✨✨ ``2024/08/10``: We released the [Llama-3-SynE model](https://huggingface.co/survivi/Llama-3-SynE).
- ✨ ``2024/07/26``: We released the [technical report](https://arxiv.org/abs/2407.18743); feel free to check it out!

## Model Introduction

**Llama-3-SynE** (<ins>Syn</ins>thetic data <ins>E</ins>nhanced Llama-3) is a significantly enhanced version of [Llama-3 (8B)](https://github.com/meta-llama/llama3), obtained through continual pre-training (CPT) to improve its **Chinese language ability and scientific reasoning capability**. By employing a carefully designed data mixture and curriculum strategy, Llama-3-SynE acquires new abilities while preserving the original model's performance. The enhancement process combines existing datasets with high-quality synthetic datasets designed specifically for the targeted tasks.

Key features of Llama-3-SynE include:
- **Enhanced Chinese Language Capabilities**: Achieved through a topic-based data mixture and a perplexity-based data curriculum (a minimal sketch of the curriculum idea follows this list).
- **Improved Scientific Reasoning**: Synthetic datasets are used to strengthen multi-disciplinary scientific knowledge.
- **Efficient CPT**: Only around 100 billion tokens are consumed, making the approach cost-effective.
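
The report describes a perplexity-based data curriculum for the CPT stage. Purely as an illustration (this is a hedged sketch, not the authors' released pipeline), the core idea of scoring candidate documents by perplexity under a reference model and ordering them from easy to hard could look like this; the reference model path and the document pool are placeholders:

```python
# Illustrative sketch of a perplexity-based data curriculum (an assumption, not
# the official Llama-3-SynE data pipeline): score each candidate document by its
# perplexity under a reference model, then order documents from low to high
# perplexity so that "easier" text is seen earlier during continual pre-training.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

ref_model_path = "meta-llama/Meta-Llama-3-8B"  # hypothetical reference model
tokenizer = AutoTokenizer.from_pretrained(ref_model_path)
model = AutoModelForCausalLM.from_pretrained(
    ref_model_path, torch_dtype=torch.bfloat16
).to("cuda").eval()


def doc_perplexity(text: str) -> float:
    # Perplexity = exp(mean token-level negative log-likelihood).
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096).to("cuda")
    with torch.no_grad():
        loss = model(**enc, labels=enc.input_ids).loss
    return torch.exp(loss).item()


candidate_docs = ["..."]  # placeholder pool of CPT documents
curriculum = sorted(candidate_docs, key=doc_perplexity)  # low-perplexity documents first
```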

## Model List

| Model | Type | Seq Length | Download |
|-----------------|-------|------------|----------------------------------------------------------------|
| Llama-3-SynE | Base | 8K | [🤗 Huggingface](https://huggingface.co/survivi/Llama-3-SynE) |

## Benchmarks

We divide all evaluation benchmarks into two groups. The first group is _major benchmarks_, which aim to evaluate the comprehensive capacities of LLMs. Note that we include commonly used math and code benchmarks in this group because it is standard practice to use these benchmarks for evaluating general-purpose LLMs.

The second group is _scientific benchmarks_, which cover a broader range of multidisciplinary scientific knowledge.

We report eight-shot performance on GSM8K, ASDiv, and MAWPS, five-shot on C-Eval, CMMLU, MMLU, MATH, GaoKao, SciQ, SciEval, SAT-Math, and AQUA-RAT, and three-shot on MBPP.
For HumanEval and ARC, we report zero-shot performance. The best and second-best results are in **bold** and <ins>underlined</ins>, respectively.
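
As a hedged illustration of how such few-shot settings can be reproduced (the report does not state which evaluation harness was used; this sketch assumes EleutherAI's lm-evaluation-harness, installed with `pip install lm-eval`):

```python
# Hypothetical reproduction of the eight-shot GSM8K setting with
# lm-evaluation-harness; the harness choice is our assumption, not the
# authors' documented setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=survivi/Llama-3-SynE,dtype=bfloat16,trust_remote_code=True",
    tasks=["gsm8k"],
    num_fewshot=8,
    batch_size=8,
)
print(results["results"]["gsm8k"])
```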

### Major Benchmarks

| **Models** | **MMLU** | **C-Eval** | **CMMLU** | **MATH** | **GSM8K** | **ASDiv** | **MAWPS** | **SAT-Math** | **HumanEval** | **MBPP** |
|---------------------------|---------------|----------|---------|---------------|---------|---------|---------|-----------|----------------|--------|
| Llama-3-8B | **66.60** | 49.43 | 51.03 | 16.20 | 54.40 | 72.10 | 89.30 | 38.64 | <ins>36.59</ins> | **47.00** |
| DCLM-7B | 64.01 | 41.24 | 40.89 | 14.10 | 39.20 | 67.10 | 83.40 | <ins>41.36</ins> | 21.95 | 32.60 |
| Mistral-7B-v0.3 | 63.54 | 42.74 | 43.72 | 12.30 | 40.50 | 67.50 | 87.50 | 40.45 | 25.61 | 36.00 |
| Llama-3-Chinese-8B | 64.10 | <ins>50.14</ins> | <ins>51.20</ins> | 3.60 | 0.80 | 1.90 | 0.60 | 36.82 | 9.76 | 14.80 |
| MAmmoTH2-8B | 64.89 | 46.56 | 45.90 | **34.10** | **61.70** | **82.80** | <ins>91.50</ins> | <ins>41.36</ins> | 17.68 | 38.80 |
| Galactica-6.7B | 37.13 | 26.72 | 25.53 | 5.30 | 9.60 | 40.90 | 51.70 | 23.18 | 7.31 | 2.00 |
| **Llama-3-SynE (ours)** | <ins>65.19</ins> | **58.24** | **57.34** | <ins>28.20</ins> | <ins>60.80</ins> | <ins>81.00</ins> | **94.10** | **43.64** | **42.07** | <ins>45.60</ins> |

> On **Chinese evaluation benchmarks** (such as C-Eval and CMMLU), Llama-3-SynE significantly outperforms the base model Llama-3 (8B), indicating that our method is very effective at improving Chinese language capabilities.

> On **English evaluation benchmarks** (such as MMLU, MATH, and the code benchmarks), Llama-3-SynE performs comparably to or better than the base model, indicating that our method effectively mitigates catastrophic forgetting during the CPT process.

### Scientific Benchmarks

"PHY", "CHE", and "BIO" denote the physics, chemistry, and biology sub-tasks of the corresponding benchmarks.

| **Models** | **SciEval PHY** | **SciEval CHE** | **SciEval BIO** | **SciEval Avg.** | **SciQ** | **GaoKao MathQA** | **GaoKao CHE** | **GaoKao BIO** | **ARC Easy** | **ARC Challenge** | **ARC Avg.** | **AQUA-RAT** |
|--------------------|-----------------|-----------------|-----------------|------------------|---------------|-------------------|----------------|----------------|---------------|-------------------|--------------|-------------------|
| Llama-3-8B | 46.95 | 63.45 | 74.53 | 65.47 | 90.90 | 27.92 | 32.85 | 43.81 | 91.37 | 77.73 | 84.51 | <ins>27.95</ins> |
| DCLM-7B | **56.71** | 64.39 | 72.03 | 66.25 | **92.50** | 29.06 | 31.40 | 37.14 | 89.52 | 76.37 | 82.94 | 20.08 |
| Mistral-7B-v0.3 | 48.17 | 59.41 | 68.89 | 61.51 | 89.40 | 30.48 | 30.92 | 41.43 | 87.33 | 74.74 | 81.04 | 23.23 |
| Llama-3-Chinese-8B | 48.17 | 67.34 | 73.90 | <ins>67.34</ins> | 89.20 | 27.64 | 30.43 | 38.57 | 88.22 | 70.48 | 79.35 | 27.56 |
| MAmmoTH2-8B | 49.39 | **69.36** | <ins>76.83</ins> | **69.60** | 90.20 | **32.19** | <ins>36.23</ins> | <ins>49.05</ins> | **92.85** | **84.30** | **88.57** | 27.17 |
| Galactica-6.7B | 34.76 | 43.39 | 54.07 | 46.27 | 71.50 | 23.65 | 27.05 | 24.76 | 65.91 | 46.76 | 56.33 | 20.87 |
| **Llama-3-SynE (ours)** | <ins>53.66</ins> | <ins>67.81</ins> | **77.45** | **69.60** | <ins>91.20</ins> | <ins>31.05</ins> | **51.21** | **69.52** | <ins>91.58</ins> | <ins>80.97</ins> | <ins>86.28</ins> | **28.74** |

> On **scientific evaluation benchmarks** (such as SciEval, GaoKao, and ARC), Llama-3-SynE significantly outperforms the base model, with particularly notable gains on Chinese scientific benchmarks (for example, a 25.71% improvement on the GaoKao biology subtest).

## Quick Start

Use the `transformers` backend for inference:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = "survivi/Llama-3-SynE"
# Load the tokenizer and the model in bfloat16, then move the model to the GPU.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.to("cuda:0")
model.eval()

prompt = "Hello world!"
inputs = tokenizer(prompt, return_tensors="pt")
inputs = inputs.to("cuda")
# Sample a continuation of the prompt.
pred = model.generate(
    **inputs,
    max_new_tokens=2048,
    repetition_penalty=1.05,
    temperature=0.5,
    top_k=5,
    top_p=0.85,
    do_sample=True,
)
# Strip the prompt tokens and decode only the newly generated text.
pred = pred[0][len(inputs.input_ids[0]) :]
output = tokenizer.decode(pred, skip_special_tokens=True)
print(output)
```
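
If GPU memory is limited, the same checkpoint can also be loaded with 4-bit quantization; this is our own suggestion (the model card only shows the bfloat16 setup above) and assumes `bitsandbytes` is installed:

```python
# Optional 4-bit loading (an assumption, not part of the official instructions):
# quantize the weights with bitsandbytes to fit the 8B model on smaller GPUs.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_path = "survivi/Llama-3-SynE"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```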

Use the vLLM backend for inference:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "survivi/Llama-3-SynE"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Sampling parameters matching the transformers example above.
sampling_params = SamplingParams(
    max_tokens=2048,
    repetition_penalty=1.05,
    temperature=0.5,
    top_k=5,
    top_p=0.85,
)
# tensor_parallel_size controls how many GPUs the model is sharded across.
llm = LLM(
    model=model_path,
    tensor_parallel_size=1,
    trust_remote_code=True,
)
prompt = "Hello world!"
output = llm.generate(prompt, sampling_params)
output = output[0].outputs[0].text
print(output)
```

## License

This project is built upon Meta's Llama-3 model. Use of the Llama-3-SynE model weights must follow the Llama-3 [license agreement](https://github.com/meta-llama/llama3/blob/main/LICENSE).

## Citation

If you find our work helpful, please consider citing the following paper:

```
@article{jie2024llama3syne,
  title={Towards Effective and Efficient Continual Pre-training of Large Language Models},
  author={Chen, Jie and Chen, Zhipeng and Wang, Jiapeng and Zhou, Kun and Zhu, Yutao and Jiang, Jinhao and Min, Yingqian and Zhao, Wayne Xin and Dou, Zhicheng and Mao, Jiaxin and others},
  journal={arXiv preprint arXiv:2407.18743},
  year={2024}
}
```