Update README.md
README.md CHANGED
@@ -66,13 +66,34 @@ The following table presents the F1 scores:
 ## Publication
 
 ```bibtex
-@
-
-
-
-
-
-
+@inproceedings{dada-etal-2023-impact,
+    title = "On the Impact of Cross-Domain Data on {G}erman Language Models",
+    author = "Dada, Amin and
+      Chen, Aokun and
+      Peng, Cheng and
+      Smith, Kaleb and
+      Idrissi-Yaghir, Ahmad and
+      Seibold, Constantin and
+      Li, Jianning and
+      Heiliger, Lars and
+      Friedrich, Christoph and
+      Truhn, Daniel and
+      Egger, Jan and
+      Bian, Jiang and
+      Kleesiek, Jens and
+      Wu, Yonghui",
+    editor = "Bouamor, Houda and
+      Pino, Juan and
+      Bali, Kalika",
+    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
+    month = dec,
+    year = "2023",
+    address = "Singapore",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2023.findings-emnlp.922",
+    doi = "10.18653/v1/2023.findings-emnlp.922",
+    pages = "13801--13813",
+    abstract = "Traditionally, large language models have been either trained on general web crawls or domain-specific data. However, recent successes of generative large language models, have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements up to 4.45{\%} over the previous state-of-the-art.",
 }
 ```
 ## Contact