Removing the word "exclusively", since it implies the model is trained ONLY on the data described in the section.
The team who trained NORA likely doesn't know what Mistral AI used to train Mistral-7B-v0.1 (since Mistral AI hasn't revealed it, to avoid legal action), and as such cannot make claims about whether Mistral was trained exclusively on public web data.
Clarified this in the section so that somebody who is just skimming the text (or an LLM trying to summarize the page) doesn't miss that this model is Mistral (unknown corpus) + the NORA corpus when looking for details in the "Pretraining corpus" section.
It is LIKELY that Mistral trained on public data like Common Crawl, but it is impossible to claim this was done exclusively: it might very well have been trained on books or other corpora that are not on the open internet, private datasets, etc.
README.md CHANGED

```diff
@@ -40,7 +40,7 @@ It is primarily intended for research purposes.*
 _____
 ## Pretraining corpus
 
-The model
+The model was initialized from [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) which was based on an unknown corpus, then pretrained on publicly available data. We combine the resources from [the public part of the NCC corpus](https://huggingface.co/datasets/NbAiLab/NCC), from [the cleaned HPLT corpus](https://hplt-project.org/datasets/v1.2), and from [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX).
 This resulted in over 34B subword tokens of Norwegian (Bokmål or Nynorsk) in total, which amounts to about 26.7B whitespace-separated tokens.
 We also augment the corpus with [Starcoder](https://huggingface.co/datasets/vikp/starcoder_filtered); 20% of the 260B tokens are sampled from this code corpus.
 The natural language data is repeated six times to get the pretraining budget of 260B tokens, in accordance with findings from [Muennighoff et al. (2023)](https://neurips.cc/virtual/2023/poster/70706).
```
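As a side note, the figures in the section are internally consistent; a minimal sketch (not from the model card itself, just back-of-the-envelope arithmetic using the 260B budget, 20% code share, and 6x repetition it states):

```python
# Sanity-check the token budget described in the "Pretraining corpus" section.
TOTAL_BUDGET = 260e9   # total pretraining budget, in subword tokens
CODE_FRACTION = 0.20   # share sampled from the Starcoder corpus
REPETITIONS = 6        # natural-language data is repeated six times

code_tokens = TOTAL_BUDGET * CODE_FRACTION    # tokens drawn from Starcoder
natural_tokens = TOTAL_BUDGET - code_tokens   # natural-language tokens (with repeats)
unique_natural = natural_tokens / REPETITIONS # unique natural-language tokens

print(f"code: {code_tokens / 1e9:.0f}B")            # 52B
print(f"natural, repeated: {natural_tokens / 1e9:.0f}B")  # 208B
print(f"unique natural: {unique_natural / 1e9:.1f}B")     # ~34.7B
```

The ~34.7B unique natural-language tokens recovered this way matches the "over 34B subword tokens of Norwegian" figure in the text.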