---
language:
- cs
- en
- de
- fr
- tr
- zh
- es
- ru
tags:
- Summarization
- abstractive summarization
- multilingual summarization
- m2m100_418M
- Czech
- text2text generation
- text generation
license: cc-by-sa-4.0
datasets:
- Multilingual_large_dataset_(multilarge)
- cnn/dm
- xsum
- mlsum
- cnewsum
- cnc
- sumeczech
metrics:
- rouge
- rougeraw
- MemesCS
---
# m2m100-418M-multilingual-summarization-multilarge-cs
This model is a checkpoint of [facebook/m2m100_418M](https://huggingface.co/facebook/m2m100_418M) fine-tuned on the Multilingual large summarization dataset, which is focused on Czech texts, to produce multilingual summaries.
## Task
The model produces multi-sentence summaries in eight languages. By combining a considerable amount of Czech documents with documents in other languages, we aimed to improve summarization quality in Czech. Supported languages: 'cs', 'en', 'de', 'es', 'fr', 'ru', 'tr', 'zh'
The example below assumes that you are using the provided MultilingualSummarizer.ipynb notebook and the accompanying files from the git repository.
```python
from collections import OrderedDict

# MultiSummarizer is defined in the repository files used by
# MultilingualSummarizer.ipynb; make sure it is loaded before running this.


def summ_config():
    """Build the configuration of the summarization pipeline."""
    cfg = OrderedDict([
        # Summarization model checkpoint, one of:
        #   ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
        #   ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
        #   ctu-aic/mbart25-multilingual-summarization-multilarge-cs
        ("model_name", "ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs"),
        # Language of the summarization task (string):
        #   cs, en, de, fr, es, tr, ru, zh
        ("language", "en"),
        # Generation method parameters
        ("inference_cfg", OrderedDict([
            ("num_beams", 4),
            ("top_k", 40),
            ("top_p", 0.92),
            ("do_sample", True),
            ("temperature", 0.95),
            ("repetition_penalty", 1.23),
            ("no_repeat_ngram_size", None),
            ("early_stopping", True),
            ("max_length", 128),
            ("min_length", 10),
        ])),
        # Texts to summarize: a list of strings, a single string, or a dataset
        ("texts", [
            "english text1 to summarize",
            "english text2 to summarize",
        ]),
        # OPTIONAL: target summaries: a list of strings, a single string, or None
        ("golds", [
            "target english text1",
            "target english text2",
        ]),
        # ("golds", None),
    ])
    return cfg


cfg = summ_config()
mSummarize = MultiSummarizer(**cfg)
ret = mSummarize(**cfg)
```
## Dataset
The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news articles and Daily Mail stories. Training used the entire training set and 72% of the validation set.
```
Train set: 3 464 563 docs
Validation set: 121 260 docs
```
| Stats | fragment | | | avg document length | | avg summary length | | Documents |
|-------------|----------|---------------------|--------------------|--------|---------|--------|--------|--------|
| __dataset__ |__compression__ | __density__ | __coverage__ | __nsent__ | __nwords__ | __nsent__ | __nwords__ | __count__ |
| cnc | 7.388 | 0.303 | 0.088 | 16.121 | 316.912 | 3.272 | 46.805 | 750K |
| sumeczech | 11.769 | 0.471 | 0.115 | 27.857 | 415.711 | 2.765 | 38.644 | 1M |
| cnndm | 13.688 | 2.983 | 0.538 | 32.783 | 676.026 | 4.134 | 54.036 | 300K |
| xsum | 18.378 | 0.479 | 0.194 | 18.607 | 369.134 | 1.000 | 21.127 | 225K|
| mlsum/tu | 8.666 | 5.418 | 0.461 | 14.271 | 214.496 | 1.793 | 25.675 | 274K |
| mlsum/de | 24.741 | 8.235 | 0.469 | 32.544 | 539.653 | 1.951 | 23.077 | 243K|
| mlsum/fr | 24.388 | 2.688 | 0.424 | 24.533 | 612.080 | 1.320 | 26.93 | 425K |
| mlsum/es | 36.185 | 3.705 | 0.510 | 31.914 | 746.927 | 1.142 | 21.671 | 291K |
| mlsum/ru | 78.909 | 1.194 | 0.246 | 62.141 | 948.079 | 1.012 | 11.976 | 27K|
| cnewsum | 20.183 | 0.000 | 0.000 | 16.834 | 438.271 | 1.109 | 21.926 | 304K |
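
The compression, density, and coverage columns above are not defined in this card. The sketch below assumes they follow the usual extractive-fragment definitions (greedy matching of shared token spans between document and summary); the function names and whitespace tokenization are illustrative choices, not the authors' evaluation code.

```python
from typing import List


def extractive_fragments(document: List[str], summary: List[str]) -> List[List[str]]:
    """Greedily collect the longest token spans of the summary found in the document."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0
        j = 0
        while j < len(document):
            if summary[i] == document[j]:
                # Extend the match as far as both sequences agree.
                k = 0
                while (i + k < len(summary) and j + k < len(document)
                       and summary[i + k] == document[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary[i:i + best])
            i += best
        else:
            i += 1
    return fragments


def fragment_stats(document: List[str], summary: List[str]) -> dict:
    """Compression, density and coverage for one document/summary pair."""
    frags = extractive_fragments(document, summary)
    n = len(summary)
    return {
        "compression": len(document) / n,            # document words per summary word
        "density": sum(len(f) ** 2 for f in frags) / n,
        "coverage": sum(len(f) for f in frags) / n,   # share of copied summary words
    }


stats = fragment_stats("the cat sat on the mat".split(), "cat sat mat".split())
```

Under these definitions, a per-dataset value would be the average of the per-pair values, which is why compression need not equal the ratio of the average word counts in the table.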
#### Tokenization
Truncation and padding were set to 512 tokens for the encoder (input text) and 128 for the decoder (summary).
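As an illustration of the limits above, here is a minimal sketch of truncating and padding token id sequences to fixed lengths. It operates on plain lists of ids; `PAD_ID` and the helper names are hypothetical, and a real tokenizer would additionally handle special tokens such as end-of-sequence when truncating.

```python
from typing import List

MAX_SOURCE_LEN = 512  # encoder (input text) limit stated above
MAX_TARGET_LEN = 128  # decoder (summary) limit stated above
PAD_ID = 0            # hypothetical padding id, for illustration only


def truncate_and_pad(ids: List[int], max_len: int, pad_id: int = PAD_ID) -> List[int]:
    """Clip the sequence to max_len, then right-pad it to exactly max_len."""
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


def attention_mask(ids: List[int], max_len: int) -> List[int]:
    """1 for real tokens, 0 for padding positions."""
    n_real = min(len(ids), max_len)
    return [1] * n_real + [0] * (max_len - n_real)


long_input = list(range(1, 601))          # 600 token ids, longer than the limit
source_ids = truncate_and_pad(long_input, MAX_SOURCE_LEN)
target_ids = truncate_and_pad([11, 12, 13], MAX_TARGET_LEN)
```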
## Training
The model was trained with a cross-entropy loss.
```
Time: 3 days 10 hours
Epochs: 1072K steps = 10 (from 10)
GPUs: 4x NVIDIA A100-SXM4-40GB
eloss: 2.824 - 1.745
tloss: 4.559 - 1.615
```
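For readers unfamiliar with the objective, the following is a minimal sketch of token-level cross-entropy averaged over the non-padded target positions. It is a plain-Python illustration, not the training code; the `ignore_id` value of -100 is an assumption borrowed from common practice for masking padded labels.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits_per_token: List[List[float]],
                  target_ids: List[int],
                  ignore_id: int = -100) -> float:
    """Mean negative log-likelihood of the target tokens, skipping ignored positions."""
    total, count = 0.0, 0
    for logits, target in zip(logits_per_token, target_ids):
        if target == ignore_id:
            continue  # padded position, contributes nothing
        total -= log_softmax(logits)[target]
        count += 1
    return total / count if count else 0.0


# A uniform prediction over 4 tokens costs log(4) per position.
loss = cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])
```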
### ROUGE results per individual dataset test set:
| ROUGE | ROUGE-1 | | | ROUGE-2 | | | ROUGE-L | | |
|------------|---------|---------|-----------|--------|--------|-----------|--------|--------|---------|
| dataset | Precision | Recall | Fscore | Precision | Recall | Fscore | Precision | Recall | Fscore |
| cnc | 30.13 | 22.56 | 25.21 | 10.53 | 8.01 | 8.9 | 22.47 | 16.92 | 18.86 |
| sumeczech | 26.6 | 19.66 | 22.01 | 8.17 | 6.12 | 6.82 | 19.93 | 14.81 | 16.54 |
| cnndm | 41.8 | 38.41 | 38.94 | 18.74 | 17.14 | 17.4 | 29.69 | 27.33 | 27.68 |
| xsum | 38.27 | 33.62 | 35.16 | 14.39 | 12.69 | 13.25 | 30.77 | 27.05 | 28.29 |
| mlsum-tu | 52.44 | 44.36 | 46.39 | 36.98 | 31.51 | 32.86 | 46.04 | 39.04 | 40.8 |
| mlsum-de | 42.19 | 40.5 | 40.7 | 28.8 | 28.51 | 28.37 | 38.95 | 37.7 | 37.79 |
| mlsum-fr | 34.57 | 27.74 | 29.95 | 16.27 | 13.04 | 14.08 | 27.18 | 21.89 | 23.6 |
| mlsum-es | 30.93 | 26.41 | 27.66 | 11.42 | 9.85 | 10.28 | 25.12 | 21.59 | 22.55 |
| mlsum-ru | 0.65 | 0.52 | 0.56 | 0.15 | 0.15 | 0.15 | 0.65 | 0.52 | 0.56 |
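
To make the precision, recall, and F-score columns concrete, here is a simplified sketch of ROUGE-N computed from clipped n-gram overlap. It assumes lowercased whitespace tokenization with no stemming, so its numbers will not exactly match an official ROUGE implementation; the function names are illustrative. The table reports scores multiplied by 100.

```python
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 1) -> Dict[str, float]:
    """Precision, recall and F-score from clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # each n-gram counted at most min(cand, ref) times
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "fscore": fscore}


r1 = rouge_n("the cat lay on the mat", "the cat sat on the mat", n=1)
r2 = rouge_n("the cat lay on the mat", "the cat sat on the mat", n=2)
```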
# USAGE
```
soon
```