Text Generation
Transformers
PyTorch
English
gpt2
feature-extraction
causal-lm
text-generation-inference
rskuzma committed on
Commit
d777cc2
1 Parent(s): a3d228a

clarification on deduplicated data in Pile

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -100,7 +100,7 @@ Cerebras-GPT is trained using [the Pile](https://pile.eleuther.ai) dataset from
 
 We tokenized the data using byte-pair encoding using the GPT-2 vocabulary. Our tokenized version of the Pile has 371B tokens. We include more details about the training dataset preprocessing in Appendix B.1 of our paper.
 
-Recent works find significant duplicate data present in the Pile. Eleuther’s Pythia applies a deduplication process to reduce replicated data, decreasing the total token count by 33%. Our models are trained on the Pile **without deduplication**, which presents an opportunity for further improvement with the deduplicated data set.
+Recent works find significant duplicate data present in the Pile. Eleuther’s Pythia applies a deduplication process to reduce replicated data, decreasing the Pile dataset size. Pythia was trained on both the standard dataset and deduplicated dataset to characterize the impact. Our models are trained on the standard Pile without deduplication, which may present an opportunity for further improvement with the deduplicated data set.
 
 <br><br>
 
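
For readers who want to see what the GPT-2 byte-pair encoding mentioned in the changed paragraph looks like in practice, here is a minimal, hedged sketch using the Hugging Face `transformers` tokenizer. The sample text is made up, and this is not the actual Pile preprocessing pipeline (that is described in Appendix B.1 of the paper).

```python
from transformers import GPT2TokenizerFast

# Load the standard GPT-2 byte-pair-encoding vocabulary (50,257 tokens),
# the same vocabulary the model card says was used to tokenize the Pile.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Example document; in the real pipeline every Pile document is converted
# to token IDs like this before training.
text = "Cerebras-GPT is trained using the Pile dataset."
token_ids = tokenizer(text)["input_ids"]

print(len(token_ids))               # number of BPE tokens in this document
print(tokenizer.decode(token_ids))  # decoding round-trips to the original text
```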
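
To make the deduplication discussion concrete, the toy sketch below drops exact duplicate documents by hashing their normalized text. This is only an illustration of the general idea: the deduplicated Pile used for Pythia is produced by a more involved, corpus-scale near-duplicate process, and none of the names below come from that pipeline.

```python
import hashlib

def dedup_exact(documents):
    """Drop exact duplicate documents by hashing their normalized text.

    Illustrative only: real Pile deduplication is done with corpus-scale
    near-duplicate detection, not simple exact matching like this.
    """
    seen = set()
    unique_docs = []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

docs = ["The same text.", "the same text.  ", "A different document."]
print(dedup_exact(docs))  # the near-identical copies collapse to one entry
```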