Removing the word "exclusively", since it implies the model is trained ONLY on the data described in this section.

#7
Files changed (1)
  1. README.md +1 -1
README.md CHANGED
```diff
@@ -40,7 +40,7 @@ It is primarily intended for research purposes.*
 _____
 ## Pretraining corpus
 
-The model is pretrained exclusively on publicly available data. We combine the resources from [the public part of the NCC corpus](https://huggingface.co/datasets/NbAiLab/NCC), from [the cleaned HPLT corpus](https://hplt-project.org/datasets/v1.2), and from [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX).
+The model was initialized from [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), which was based on an unknown corpus, and then pretrained on publicly available data. We combine the resources from [the public part of the NCC corpus](https://huggingface.co/datasets/NbAiLab/NCC), from [the cleaned HPLT corpus](https://hplt-project.org/datasets/v1.2), and from [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX).
 This resulted in over 34B subword tokens of Norwegian (Bokmål or Nynorsk) in total, which amounts to about 26.7B whitespace-separated tokens.
 We also augment the corpus with [Starcoder](https://huggingface.co/datasets/vikp/starcoder_filtered); 20% of the 260B tokens are sampled from this code corpus.
 The natural language data is repeated six times to get the pretraining budget of 260B tokens, in accordance with findings from [Muennighoff et al. (2023)](https://neurips.cc/virtual/2023/poster/70706).
```
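As a quick sanity check on the figures in this paragraph, here is a back-of-the-envelope sketch in Python. The constants are copied from the README text; the flat 80/20 natural-language/code split is our reading of the paragraph, not a published training configuration.

```python
# Back-of-the-envelope check of the pretraining budget described in the diff.
# All constants come from the README text; the exact sampling schedule is an
# assumption of this sketch, not a published training configuration.

TOTAL_BUDGET = 260e9      # total pretraining budget, in subword tokens
CODE_FRACTION = 0.20      # share sampled from the Starcoder corpus
NORWEGIAN_TOKENS = 34e9   # Norwegian subword tokens (NCC + HPLT + CulturaX)
REPEATS = 6               # natural-language data is repeated six times

code_tokens = CODE_FRACTION * TOTAL_BUDGET     # tokens drawn from Starcoder
natural_budget = TOTAL_BUDGET - code_tokens    # budget left for Norwegian text
natural_seen = NORWEGIAN_TOKENS * REPEATS      # Norwegian tokens after repeats

print(f"code tokens:             {code_tokens / 1e9:.0f}B")
print(f"natural-language budget: {natural_budget / 1e9:.0f}B")
print(f"34B tokens x 6 repeats:  {natural_seen / 1e9:.0f}B (roughly fills it)")
```

The numbers are consistent: 20% of 260B leaves a 208B natural-language budget, and six passes over the 34B Norwegian tokens supply roughly 204B of it.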