zhs12 committed on
Commit
73d743f
·
verified ·
1 Parent(s): f013916

Update README.md

Files changed (1)
  1. README.md +6 -4
README.md CHANGED
@@ -8,7 +8,7 @@ language:
8
  ---
9
  # Model Card for llama3-8B-360Zhinao-360k-Instruct
10
 
11
- llama3-8B-360Zhinao-360k-Instruct is 360Zhinao's extension of llama3-8B-Instruct to a 360k context window.
12
 
13
  Within the 360k-token length,
14
  llama3-8B-360Zhinao-360k-Instruct achieves:
@@ -78,8 +78,6 @@ python -m vllm.entrypoints.openai.api_server \
78
  > log8.server 2>&1
79
  ```
80
 
81
- <!-- NIAH scripts -->
82
-
83
 
84
  ## Methods
85
 
@@ -87,7 +85,11 @@ llama3-8B-360Zhinao-360k-Instruct is trained from [llama3-8B-Instruct](https://h
87
  Its original context-length is 8k with RoPE base 500,000.
88
 
89
  We directly extended its context length to 360k. We changed RoPE base to 500,000,000 and trained on a combined SFT dataset of [LWM's open-sourced data](https://huggingface.co/LargeWorldModel) and internal long-context data in Chinese and English.
90
- We implemented SFT on top of [EasyContext](https://github.com/jzhang38/EasyContext/) but later found that turning on pretraining loss produced much better results.
 
 
 
 
91
 
92
  ## Contact & License
93
  Email: [email protected]
 
8
  ---
9
  # Model Card for llama3-8B-360Zhinao-360k-Instruct
10
 
11
+ llama3-8B-360Zhinao-360k-Instruct is 360Zhinao's extension of llama3-8B-Instruct to a 360k context window [[GitHub]](https://github.com/Qihoo360/360zhinao/tree/main/360k).
12
 
13
  Within the 360k-token length,
14
  llama3-8B-360Zhinao-360k-Instruct achieves:
 
78
  > log8.server 2>&1
79
  ```
80
 
 
 
81
 
82
  ## Methods
83
 
 
85
  Its original context-length is 8k with RoPE base 500,000.
86
 
87
  We directly extended its context length to 360k. We changed RoPE base to 500,000,000 and trained on a combined SFT dataset of [LWM's open-sourced data](https://huggingface.co/LargeWorldModel) and internal long-context data in Chinese and English.
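To make the effect of the RoPE base change concrete, the sketch below compares the rotary frequencies for the two bases mentioned above. It uses the standard RoPE formula `inv_freq[i] = base ** (-2i / d)`; the head dimension of 128 is an assumption (4096 hidden size over 32 heads for an 8B Llama-3 model) and the function names are illustrative only, not taken from the training code.

```python
import math

def rope_inv_freq(base: float, head_dim: int) -> list[float]:
    """Inverse frequencies of standard RoPE: base ** (-2i / head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def longest_wavelength(base: float, head_dim: int) -> float:
    """Wavelength (in tokens) of the slowest-rotating dimension pair."""
    return 2.0 * math.pi / rope_inv_freq(base, head_dim)[-1]

HEAD_DIM = 128  # assumption: 4096 hidden size / 32 attention heads

for base in (500_000.0, 500_000_000.0):
    wavelength = longest_wavelength(base, HEAD_DIM)
    print(f"base={base:>13,.0f}  longest wavelength ~ {wavelength:,.0f} tokens")
```

A larger base slows the rotation of the low-frequency dimensions, so positions far apart in a 360k-token sequence still map to distinguishable angles.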
88
+ We implemented SFT on top of [EasyContext](https://github.com/jzhang38/EasyContext/) ([code](https://github.com/Qihoo360/360zhinao/blob/main/360k/train.sft.EasyContext.py), with a simple derivation of the loss reduction), but later found that turning on the pretraining loss produced much better results.
89
+ SFT is likely suitable for further finetuning within the already extended context window.
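The linked script is the reference for the loss-reduction derivation mentioned above. As a hedged illustration only, the sketch below shows one common reason such a derivation is needed when a sequence is split across devices: with SFT masking, shards hold different numbers of supervised tokens, so averaging per-shard means differs from the token-weighted mean. That this is the issue the authors addressed is an assumption; the function names and numbers are hypothetical.

```python
def mean_of_shard_means(shard_losses: list[list[float]]) -> float:
    """Naive reduction: average each shard's mean loss (empty shards skipped)."""
    means = [sum(s) / len(s) for s in shard_losses if s]
    return sum(means) / len(means)

def token_weighted_mean(shard_losses: list[list[float]]) -> float:
    """Reduction matching a single-device mean over all supervised tokens."""
    total_loss = sum(sum(s) for s in shard_losses)
    total_tokens = sum(len(s) for s in shard_losses)
    return total_loss / total_tokens

# Hypothetical per-token losses on supervised (unmasked) tokens of two shards.
shards = [[2.0, 2.0, 2.0, 2.0], [4.0]]

print("mean of shard means:", mean_of_shard_means(shards))
print("token-weighted mean:", token_weighted_mean(shards))
```

In a distributed run, the token-weighted form corresponds to summing both the loss and the supervised-token count across shards before dividing.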
90
+
91
+ We used [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) for several months with tailored GPU-memory optimizations. Its context parallelism was not quite ready at the time, so we have since switched to ring-attention implementations such as EasyContext.
92
+
93
 
94
  ## Contact & License
95
  Email: [email protected]