feat: add readme.md
README.md
CHANGED
@@ -1,3 +1,89 @@
---
language:
- cs
- en
- de
- fr
- tr
- zh
- es
- ru
tags:
- Summarization
- abstractive summarization
- mbart-large-cc25
- Czech
- text2text generation
- text generation
license: cc-by-sa-4.0
datasets:
- Multilingual_large_dataset_(multilarge)
- cnn/dm
- xsum
- mlsum
- cnewsum
- cnc
- sumeczech
metrics:
- rouge
- rougeraw
- MemesCS
---

# mbart25-multilingual-summarization-multilarge-cs
This model is a checkpoint of [facebook/mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25) fine-tuned on the Multilingual large summarization dataset, which focuses on Czech texts, to produce multilingual summaries.

## Task
The model produces multi-sentence summaries in eight languages. By adding documents in other languages to a considerable amount of Czech documents, we aimed to improve summarization quality in Czech. Supported languages: `'en_XX': 'en'`, `'de_DE': 'de'`, `'es_XX': 'es'`, `'fr_XX': 'fr'`, `'ru_RU': 'ru'`, `'tr_TR': 'tr'`.
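
The language pairs listed above can be kept in a small lookup table when preparing inputs. The sketch below assumes the first element of each pair is an mBART-style language code and the second is the two-letter code used in the metadata. The helper name `to_mbart_code` is hypothetical and not part of any published API, and only the six pairs listed above are included.

```python
# Language pairs copied from the list above: mBART-style code -> two-letter code.
MBART_TO_ISO = {
    "en_XX": "en",
    "de_DE": "de",
    "es_XX": "es",
    "fr_XX": "fr",
    "ru_RU": "ru",
    "tr_TR": "tr",
}

# Reverse lookup: two-letter code -> mBART-style code.
ISO_TO_MBART = {iso: code for code, iso in MBART_TO_ISO.items()}


def to_mbart_code(iso_code: str) -> str:
    """Return the mBART-style code for a two-letter code (hypothetical helper)."""
    try:
        return ISO_TO_MBART[iso_code.lower()]
    except KeyError:
        raise ValueError(f"Unsupported language: {iso_code!r}") from None
```

For example, `to_mbart_code("de")` returns `"de_DE"`, while a code that is not in the table raises `ValueError`.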

## Dataset
The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and daily mails. Training used the entire training set and 72% of the validation set.
```
Train set:        3 464 563 docs
Validation set:     121 260 docs
```
| Stats | Fragment | | | Avg. document length | | Avg. summary length | | Documents |
|-------------|----------|---------------------|--------------------|--------|---------|--------|--------|--------|
| __dataset__ | __compression__ | __density__ | __coverage__ | __sentences__ | __words__ | __sentences__ | __words__ | __count__ |
| cnc | 7.388 | 0.303 | 0.088 | 16.121 | 316.912 | 3.272 | 46.805 | 750K |
| sumeczech | 11.769 | 0.471 | 0.115 | 27.857 | 415.711 | 2.765 | 38.644 | 1M |
| cnndm | 13.688 | 2.983 | 0.538 | 32.783 | 676.026 | 4.134 | 54.036 | 300K |
| xsum | 18.378 | 0.479 | 0.194 | 18.607 | 369.134 | 1.000 | 21.127 | 225K |
| mlsum/tu | 8.666 | 5.418 | 0.461 | 14.271 | 214.496 | 1.793 | 25.675 | 274K |
| mlsum/de | 24.741 | 8.235 | 0.469 | 32.544 | 539.653 | 1.951 | 23.077 | 243K |
| mlsum/fr | 24.388 | 2.688 | 0.424 | 24.533 | 612.080 | 1.320 | 26.93 | 425K |
| mlsum/es | 36.185 | 3.705 | 0.510 | 31.914 | 746.927 | 1.142 | 21.671 | 291K |
| mlsum/ru | 78.909 | 1.194 | 0.246 | 62.141 | 948.079 | 1.012 | 11.976 | 27K |
| cnewsum | 20.183 | 0.000 | 0.000 | 16.834 | 438.271 | 1.109 | 21.926 | 304K |
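
The compression, density, and coverage columns in the table above resemble the extractive fragment statistics introduced with the Newsroom dataset. The README does not state the exact definition, so the sketch below is an assumption: it computes the three values for one document-summary pair using the Newsroom formulas, with lowercased whitespace tokens and illustrative function names.

```python
from typing import Dict, List


def extractive_fragments(article: List[str], summary: List[str]) -> List[List[str]]:
    """Greedily collect the longest token sequences shared with the article,
    scanning the summary from left to right."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0  # length of the longest match starting at summary[i]
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary[i:i + best])
            i += best
        else:
            i += 1
    return fragments


def fragment_stats(article_text: str, summary_text: str) -> Dict[str, float]:
    """Compression, coverage, and density for one pair (assumed definitions)."""
    article = article_text.lower().split()
    summary = summary_text.lower().split()
    if not summary:
        raise ValueError("summary must contain at least one token")
    frags = extractive_fragments(article, summary)
    n = len(summary)
    return {
        "compression": len(article) / n,                 # |A| / |S|
        "coverage": sum(len(f) for f in frags) / n,      # share of copied tokens
        "density": sum(len(f) ** 2 for f in frags) / n,  # rewards long copied spans
    }
```

Under these definitions, a summary that copies long spans from its document gets high coverage and density, while a fully rephrased summary scores close to zero on both.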

#### Tokenization
Inputs were truncated or padded to 512 tokens for the encoder (input text) and 128 tokens for the decoder (summary).
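
To make the fixed lengths concrete, here is a minimal pure-Python sketch of truncating and right-padding a sequence of token ids to the 512 and 128 token limits stated above. The `pad_id` default and the helper name `fit_to_length` are placeholders chosen for illustration. In practice the tokenizer would handle this step, including special tokens, which the sketch ignores.

```python
ENCODER_MAX_LEN = 512  # input text
DECODER_MAX_LEN = 128  # summary


def fit_to_length(token_ids, max_len, pad_id=0):
    """Truncate to max_len, then right-pad with pad_id.

    Returns the fixed-length ids and an attention mask
    (1 for real tokens, 0 for padding).
    """
    kept = list(token_ids[:max_len])
    n_pad = max_len - len(kept)
    return kept + [pad_id] * n_pad, [1] * len(kept) + [0] * n_pad


# A 600-token document is cut to 512 ids; a 3-token summary is padded to 128.
doc_ids, doc_mask = fit_to_length(list(range(600)), ENCODER_MAX_LEN)
sum_ids, sum_mask = fit_to_length([5, 6, 7], DECODER_MAX_LEN)
```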

## Training
The model was trained with a cross-entropy loss.
```
Time: 3 days 8 hours
Epochs: approx. 8 of 10 (860K steps)
GPUs: 4x NVIDIA A100-SXM4-40GB
eloss: 2.214 - 1.762
tloss: 3.365 - 1.445
```
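
As a reminder of what a cross-entropy objective measures, the sketch below computes the average negative log-likelihood of the reference tokens from raw logits, skipping padded positions. It is a plain-Python illustration and not the training code used for this model. The `ignore_id` value and the function names are assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_cross_entropy(step_logits, target_ids, ignore_id=-100):
    """Mean negative log-likelihood of the target tokens.

    step_logits: one list of vocabulary logits per decoder position.
    target_ids:  reference token id per position; ignore_id marks padding.
    """
    total, count = 0.0, 0
    for logits, target in zip(step_logits, target_ids):
        if target == ignore_id:
            continue  # padded positions do not contribute to the loss
        total -= log_softmax(logits)[target]
        count += 1
    return total / count if count else 0.0
```

A lower value means the model assigns higher probability to the reference summary tokens.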

### ROUGE results per individual dataset test set:
| ROUGE | ROUGE-1 | | | ROUGE-2 | | | ROUGE-L | | |
|-----------|---------|---------|-----------|--------|--------|-----------|--------|--------|---------|
| dataset | Precision | Recall | F-score | Precision | Recall | F-score | Precision | Recall | F-score |
| cnc | 27.45 | 24.8 | 25.24 | 9.35 | 8.54 | 8.67 | 20.14 | 18.19 | 18.54 |
| sumeczech | 25.38 | 21.61 | 22.66 | 7.71 | 6.67 | 6.96 | 18.76 | 16.02 | 16.78 |
| cnndm | 41.97 | 42.61 | 41.05 | 19.64 | 19.88 | 19.16 | 29.38 | 29.85 | 28.73 |
| xsum | 39.18 | 39.8 | 38.83 | 16.59 | 16.98 | 16.5 | 31.25 | 31.74 | 30.96 |
| mlsum/tu | 51.02 | 47.95 | 47.72 | 36.15 | 34.07 | 33.9 | 44.59 | 41.9 | 41.74 |
| mlsum/de | 46.96 | 46.16 | 46.02 | 35.95 | 35.87 | 35.66 | 43.26 | 42.7 | 42.53 |
| mlsum/fr | 34.51 | 31.4 | 32.03 | 16.56 | 15.07 | 15.37 | 26.73 | 24.41 | 24.86 |
| mlsum/es | 32.62 | 29.66 | 30.21 | 13.3 | 12.2 | 12.39 | 26.24 | 24.02 | 24.4 |
| mlsum/ru | 1.25 | 1.54 | 1.31 | 0.46 | 0.46 | 0.44 | 1.25 | 1.54 | 1.31 |
| cnewsum | 26.43 | 29.44 | 26.38 | 7.38 | 8.52 | 7.46 | 25.99 | 28.94 | 25.92 |
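
The table reports precision, recall, and F-score for each ROUGE variant. The sketch below shows how the three values relate for ROUGE-N, using lowercased whitespace tokens. It is a simplified illustration, not the scorer that produced the numbers above, and it does not cover ROUGE-L, which relies on the longest common subsequence. The function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 1):
    """Return (precision, recall, f_score) for n-gram overlap, each in [0, 1]."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score
```

Multiplying the returned values by 100 gives the percentage scale used in the table. For the candidate `"the cat sat"` and the reference `"the cat sat on the mat"`, ROUGE-1 precision is 1.0 and recall is 0.5.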

## Usage
```
soon
```