Canwen Xu committed on
Commit
ce1c7c6
1 Parent(s): 05d7855

Update README.md

Files changed (1): README.md (+1, -1)
README.md CHANGED
@@ -41,7 +41,7 @@ We collect different kinds of texts in our pre-training, including encyclopedia,
 
 ## Training procedure
 
- Based on the hyper-parameter searching on the learning rate and batch size, we set the learning rate as $1.5\times10^{-4}$ and the batch size as $3,072$, which makes the model training more stable. In the first version, we still adopt the dense attention and the max sequence length is $1,024$. We will implement sparse attention in the future. We pre-train our model for $20,000$ steps, and the first $5,000$ steps are for warm-up. The optimizer is Adam. It takes two weeks to train our largest model using $64$ NVIDIA V100.
+ Based on the hyper-parameter searching on the learning rate and batch size, we set the learning rate as \\(1.5\times10^{-4}\\) and the batch size as \\(3,072\\), which makes the model training more stable. In the first version, we still adopt the dense attention and the max sequence length is \\(1,024\\). We will implement sparse attention in the future. We pre-train our model for \\(20,000\\) steps, and the first \\(5,000\\) steps are for warm-up. The optimizer is Adam. It takes two weeks to train our largest model using \\(64\\) NVIDIA V100.
 
 ## Eval results
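The training-procedure paragraph in the diff states a peak learning rate of \\(1.5\times10^{-4}\\) with the first 5,000 of 20,000 Adam steps used for warm-up. As a minimal sketch of what that schedule implies, the snippet below computes the learning rate at a given step, assuming linear warm-up; the README does not specify the warm-up shape or the post-warm-up schedule, so holding the rate constant afterwards is an assumption, and `learning_rate_at` is a hypothetical helper, not code from the repository.

```python
# Hyperparameters quoted in the README paragraph above.
LEARNING_RATE = 1.5e-4   # peak learning rate
WARMUP_STEPS = 5_000     # "the first 5,000 steps are for warm-up"
TOTAL_STEPS = 20_000     # total pre-training steps

def learning_rate_at(step: int) -> float:
    """Learning rate at a given optimizer step.

    Assumes linear warm-up to the peak rate over WARMUP_STEPS,
    then a constant rate for the remaining steps (the README does
    not state the schedule beyond the warm-up count).
    """
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    return LEARNING_RATE
```

For example, halfway through warm-up (step 2,500) the rate would be \\(7.5\times10^{-5}\\), and it reaches the full \\(1.5\times10^{-4}\\) at step 5,000.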