---
language:
- cs
- en
- de
- fr
- tr
- zh
- es
- ru
tags:
- Summarization
- abstractive summarization
- multilingual summarization
- m2m100_418M
- Czech
- text2text generation
- text generation
license: cc-by-sa-4.0
datasets:
- Multilingual_large_dataset_(multilarge)
- cnndm
- xsum
- mlsum
- cnewsum
- cnc
- sumeczech
metrics:
- rouge
- rougeraw
- MemesCS
---
# m2m100-418M-multilingual-summarization-multilarge-cs

This model is a fine-tuned checkpoint of [facebook/m2m100_418M](https://huggingface.co/facebook/m2m100_418M) on the Multilingual large summarization dataset, focused on Czech texts, to produce multilingual summaries.

## Task

The model produces multi-sentence summaries in eight languages. By combining a considerable amount of Czech documents with documents in the other languages, we aimed to improve the model's summarization of Czech. Supported languages: 'cs', 'en', 'de', 'es', 'fr', 'ru', 'tr', 'zh'.

## Usage

The following assumes you are working from the provided MultilingualSummarizer.ipynb notebook and the accompanying files from the git repository.
```python
from collections import OrderedDict

## Configuration of the summarization pipeline
def summ_config():
    cfg = OrderedDict([

        ## summarization model checkpoint; one of:
        # ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
        # ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
        # ctu-aic/mbart25-multilingual-summarization-multilarge-cs
        ("model_name", "ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs"),

        ## language of the summarization task
        # language : string : cs, en, de, fr, es, tr, ru, zh
        ("language", "en"),

        ## generation method parameters, as a dictionary
        ("inference_cfg", OrderedDict([
            ("num_beams", 4),
            ("top_k", 40),
            ("top_p", 0.92),
            ("do_sample", True),
            ("temperature", 0.95),
            ("repetition_penalty", 1.23),
            ("no_repeat_ngram_size", None),
            ("early_stopping", True),
            ("max_length", 128),
            ("min_length", 10),
        ])),

        ## texts to summarize; value = list of strings, a string, or a dataset
        ("texts",
         [
             "english text1 to summarize",
             "english text2 to summarize",
         ]),

        ## OPTIONAL: target summaries; value = list of strings, a string, or None
        ("golds",
         [
             "target english text1",
             "target english text2",
         ]),
        # ("golds", None),
    ])
    return cfg


# MultiSummarizer is defined by the files included in the git repository.
cfg = summ_config()
mSummarize = MultiSummarizer(**cfg)
ret = mSummarize(**cfg)
```
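
If you prefer to call the checkpoint directly rather than through the notebook, the sketch below uses the standard Hugging Face Transformers m2m100 interface. It is an illustration, not the authors' pipeline: the generation parameters mirror `inference_cfg` above, and forcing the target-language BOS token is an assumption carried over from the base facebook/m2m100_418M usage.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

ckpt = "ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs"
tokenizer = M2M100Tokenizer.from_pretrained(ckpt)
model = M2M100ForConditionalGeneration.from_pretrained(ckpt)

text = "Dlouhý český článek, který chceme shrnout ..."  # placeholder input document
tokenizer.src_lang = "cs"  # language of the input document

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.get_lang_id("cs"),  # assumed: summary language
    num_beams=4,
    do_sample=True,
    top_k=40,
    top_p=0.92,
    temperature=0.95,
    repetition_penalty=1.23,
    early_stopping=True,
    max_length=128,
    min_length=10,
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
```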

## Dataset

The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and daily-mail articles. Training used the entire training set and 72% of the validation set.

```
Train set:      3 464 563 docs
Validation set:   121 260 docs
```

Fragment statistics (compression, density, coverage), average document and summary lengths, and document counts per sub-dataset:

| dataset | compression | density | coverage | avg doc sents | avg doc words | avg summary sents | avg summary words | docs |
|---------|-------------|---------|----------|---------------|---------------|-------------------|-------------------|------|
| cnc | 7.388 | 0.303 | 0.088 | 16.121 | 316.912 | 3.272 | 46.805 | 750K |
| sumeczech | 11.769 | 0.471 | 0.115 | 27.857 | 415.711 | 2.765 | 38.644 | 1M |
| cnndm | 13.688 | 2.983 | 0.538 | 32.783 | 676.026 | 4.134 | 54.036 | 300K |
| xsum | 18.378 | 0.479 | 0.194 | 18.607 | 369.134 | 1.000 | 21.127 | 225K |
| mlsum/tu | 8.666 | 5.418 | 0.461 | 14.271 | 214.496 | 1.793 | 25.675 | 274K |
| mlsum/de | 24.741 | 8.235 | 0.469 | 32.544 | 539.653 | 1.951 | 23.077 | 243K |
| mlsum/fr | 24.388 | 2.688 | 0.424 | 24.533 | 612.080 | 1.320 | 26.930 | 425K |
| mlsum/es | 36.185 | 3.705 | 0.510 | 31.914 | 746.927 | 1.142 | 21.671 | 291K |
| mlsum/ru | 78.909 | 1.194 | 0.246 | 62.141 | 948.079 | 1.012 | 11.976 | 27K |
| cnewsum | 20.183 | 0.000 | 0.000 | 16.834 | 438.271 | 1.109 | 21.926 | 304K |
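
The compression, density, and coverage columns appear to be the extractive-fragment statistics of Grusky et al. (Newsroom). Assuming that convention, here is a minimal sketch of how such statistics are computed; the greedy matcher and the whitespace tokenization are illustrative simplifications, not the authors' exact tooling.

```python
def extractive_fragments(doc_tokens, sum_tokens):
    """Greedily match maximal token spans shared by document and summary."""
    fragments, i = [], 0
    while i < len(sum_tokens):
        best = 0
        for j in range(len(doc_tokens)):
            k = 0
            while (i + k < len(sum_tokens) and j + k < len(doc_tokens)
                   and sum_tokens[i + k] == doc_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(sum_tokens[i:i + best])
            i += best
        else:
            i += 1
    return fragments


def fragment_stats(document, summary):
    d, s = document.split(), summary.split()
    frags = extractive_fragments(d, s)
    compression = len(d) / len(s)                       # doc/summary word ratio
    coverage = sum(len(f) for f in frags) / len(s)      # fraction of copied tokens
    density = sum(len(f) ** 2 for f in frags) / len(s)  # rewards long copied spans
    return compression, density, coverage
```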

#### Tokenization

Truncation and padding were set to 512 tokens for the encoder (input text) and 128 tokens for the decoder (summary).
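
A minimal sketch of that preprocessing step using the Hugging Face tokenizer API; the helper and the `text`/`summary` column names are illustrative, not taken from the authors' code.

```python
def preprocess(batch, tokenizer):
    # Encoder side: truncate/pad the source document to 512 tokens.
    model_inputs = tokenizer(
        batch["text"], max_length=512, truncation=True, padding="max_length",
    )
    # Decoder side: truncate/pad the reference summary to 128 tokens.
    labels = tokenizer(
        text_target=batch["summary"], max_length=128,
        truncation=True, padding="max_length",
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```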

## Training

The model was trained with cross-entropy loss.

```
Time:       3 days 10 hours
Epochs:     10 (of 10), ~1072K steps
GPUs:       4x NVIDIA A100-SXM4-40GB
Eval loss:  2.824 -> 1.745
Train loss: 4.559 -> 1.615
```
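
The training loop itself is not included in this card. Purely to illustrate the objective, the snippet below shows how the token-level cross-entropy reported above falls out of a standard seq2seq forward pass; the texts and language settings are placeholders.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer.src_lang, tokenizer.tgt_lang = "cs", "cs"

inputs = tokenizer("Dlouhý vstupní dokument ...", return_tensors="pt",
                   truncation=True, max_length=512)
labels = tokenizer(text_target="Cílový souhrn ...", return_tensors="pt",
                   truncation=True, max_length=128).input_ids

out = model(**inputs, labels=labels)  # out.loss is the token-level cross-entropy
out.loss.backward()                   # an optimizer step would follow in training
```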

### ROUGE results per individual dataset test set

| dataset | ROUGE-1 P | ROUGE-1 R | ROUGE-1 F | ROUGE-2 P | ROUGE-2 R | ROUGE-2 F | ROUGE-L P | ROUGE-L R | ROUGE-L F |
|---------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| cnc | 30.13 | 22.56 | 25.21 | 10.53 | 8.01 | 8.90 | 22.47 | 16.92 | 18.86 |
| sumeczech | 26.60 | 19.66 | 22.01 | 8.17 | 6.12 | 6.82 | 19.93 | 14.81 | 16.54 |
| cnndm | 41.80 | 38.41 | 38.94 | 18.74 | 17.14 | 17.40 | 29.69 | 27.33 | 27.68 |
| xsum | 38.27 | 33.62 | 35.16 | 14.39 | 12.69 | 13.25 | 30.77 | 27.05 | 28.29 |
| mlsum-tu | 52.44 | 44.36 | 46.39 | 36.98 | 31.51 | 32.86 | 46.04 | 39.04 | 40.80 |
| mlsum-de | 42.19 | 40.50 | 40.70 | 28.80 | 28.51 | 28.37 | 38.95 | 37.70 | 37.79 |
| mlsum-fr | 34.57 | 27.74 | 29.95 | 16.27 | 13.04 | 14.08 | 27.18 | 21.89 | 23.60 |
| mlsum-es | 30.93 | 26.41 | 27.66 | 11.42 | 9.85 | 10.28 | 25.12 | 21.59 | 22.55 |
| mlsum-ru | 0.65 | 0.52 | 0.56 | 0.15 | 0.15 | 0.15 | 0.65 | 0.52 | 0.56 |

(P = precision, R = recall, F = F-score.)
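
Scores of this shape can be computed with the `rouge_score` package, though the exact preprocessing may differ from the authors' setup; RougeRAW (the language-agnostic ROUGE variant listed in the metadata) is not covered by this snippet.

```python
from rouge_score import rouge_scorer

# Hypothetical example pair; in practice, iterate over a test split.
prediction = "generated summary"
reference = "reference summary"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(target=reference, prediction=prediction)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.4f} R={s.recall:.4f} F={s.fmeasure:.4f}")
```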