Commit · 8576cf5
1 Parent(s): 9718c77
Update README.md
README.md CHANGED
@@ -43,12 +43,39 @@ The baseline is the [Multilingual BERT](https://github.com/google-research/bert/
 The model is trained on the following corpora (stats in the table below are after cleaning):
 
 | Corpus | Lines(M) | Words(M) | Chars(B) | Size(GB) |
-
+|-----------|:--------:|:--------:|:--------:|:--------:|
 | OPUS | 55.05 | 635.04 | 4.045 | 3.8 |
 | OSCAR | 33.56 | 1725.82 | 11.411 | 11 |
 | Wikipedia | 1.54 | 60.47 | 0.411 | 0.4 |
 | **Total** | **90.15** | **2421.33** | **15.867** | **15.2** |
 
+### Citation
+
+If you use this model in a research paper, I'd kindly ask you to cite the following paper:
+
+```
+Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.
+```
+
+or, in bibtex:
+
+```
+@inproceedings{dumitrescu-etal-2020-birth,
+    title = "The birth of {R}omanian {BERT}",
+    author = "Dumitrescu, Stefan and
+      Avram, Andrei-Marius and
+      Pyysalo, Sampo",
+    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
+    month = nov,
+    year = "2020",
+    address = "Online",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2020.findings-emnlp.387",
+    doi = "10.18653/v1/2020.findings-emnlp.387",
+    pages = "4324--4328",
+}
+```
+
 #### Acknowledgements
 
 - We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!