---
language:
- cs
- en
- de
- fr
- tu
- zh
- es
- ru
tags:
- Summarization
- abstractive summarization
- multilingual summarization
- m2m100_418M
- Czech
- text2text generation
- text generation
license: cc-by-sa-4.0
datasets:
- Multilingual_large_dataset_(multilarge)
- cnndm
- xsum
- mlsum
- cnewsum
- cnc
- sumeczech
metrics:
- rouge
- rougeraw
- MemesCS
---
# m2m100-418M-multilingual-summarization-multilarge-cs
This model is a checkpoint of facebook/m2m100_418M fine-tuned on the Multilingual large summarization dataset, with a focus on Czech texts, to produce multilingual summaries.
## Task
The model generates multi-sentence summaries in eight languages. By adding documents in other languages to an already considerable amount of Czech documents, we aimed to improve the model's summarization of Czech texts. Supported languages: 'cs', 'en', 'de', 'es', 'fr', 'ru', 'tu', 'zh'.
## Usage
Assuming you use the provided MultilingualSummarizer.ipynb notebook and the included files from the git repository:
```python
from collections import OrderedDict

# MultiSummarizer is defined in MultilingualSummarizer.ipynb / the repository files.

def summ_config():
    cfg = OrderedDict([
        # checkpoint to load
        ("model_name", "ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs"),
        ("language", "en"),
        # parameters forwarded to model.generate()
        ("inference_cfg", OrderedDict([
            ("num_beams", 4),
            ("top_k", 40),
            ("top_p", 0.92),
            ("do_sample", True),
            ("temperature", 0.95),
            ("repetition_penalty", 1.23),
            ("no_repeat_ngram_size", None),
            ("early_stopping", True),
            ("max_length", 128),
            ("min_length", 10),
        ])),
        # input documents to summarize
        ("texts", [
            "english text1 to summarize",
            "english text2 to summarize",
        ]),
        # reference summaries used for scoring
        ("golds", [
            "target english text1",
            "target english text2",
        ]),
    ])
    return cfg

cfg = summ_config()
mSummarize = MultiSummarizer(**cfg)
summaries, scores = mSummarize(**cfg)
```
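The model can also be called directly through transformers without the notebook helper. A minimal sketch, assuming the checkpoint id matches this repository and reusing the generation parameters from the config above:

```python
# Minimal sketch: direct inference with transformers instead of the notebook
# helper. The checkpoint id is an assumption based on this repository's name.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

text = "english text to summarize"
tokenizer.src_lang = "en"  # source language code
inputs = tokenizer(text, max_length=512, truncation=True, return_tensors="pt")

summary_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.get_lang_id("en"),  # language of the summary
    num_beams=4,
    do_sample=True,
    top_k=40,
    top_p=0.92,
    temperature=0.95,
    repetition_penalty=1.23,
    early_stopping=True,
    max_length=128,
    min_length=10,
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
```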
## Dataset
The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news articles and daily mails. Training used the entire train set plus 72% of the validation set.
- Train set: 3 464 563 docs
- Validation set: 121 260 docs
| dataset | compression | density | coverage | avg. doc nsent | avg. doc nwords | avg. summary nsent | avg. summary nwords | count |
|---|---|---|---|---|---|---|---|---|
| cnc | 7.388 | 0.303 | 0.088 | 16.121 | 316.912 | 3.272 | 46.805 | 750K |
| sumeczech | 11.769 | 0.471 | 0.115 | 27.857 | 415.711 | 2.765 | 38.644 | 1M |
| cnndm | 13.688 | 2.983 | 0.538 | 32.783 | 676.026 | 4.134 | 54.036 | 300K |
| xsum | 18.378 | 0.479 | 0.194 | 18.607 | 369.134 | 1.000 | 21.127 | 225K |
| mlsum/tu | 8.666 | 5.418 | 0.461 | 14.271 | 214.496 | 1.793 | 25.675 | 274K |
| mlsum/de | 24.741 | 8.235 | 0.469 | 32.544 | 539.653 | 1.951 | 23.077 | 243K |
| mlsum/fr | 24.388 | 2.688 | 0.424 | 24.533 | 612.080 | 1.320 | 26.93 | 425K |
| mlsum/es | 36.185 | 3.705 | 0.510 | 31.914 | 746.927 | 1.142 | 21.671 | 291K |
| mlsum/ru | 78.909 | 1.194 | 0.246 | 62.141 | 948.079 | 1.012 | 11.976 | 27K |
| cnewsum | 20.183 | 0.000 | 0.000 | 16.834 | 438.271 | 1.109 | 21.926 | 304K |

(compression, density, and coverage are fragment statistics; lengths are per-document and per-summary averages)
## Tokenization
Truncation and padding were set to 512 tokens for the encoder (input text) and 128 tokens for the decoder (summary).
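In terms of the standard transformers tokenizer API, this corresponds roughly to the following sketch (the repository's own preprocessing may differ in details):

```python
# Sketch of the length limits above via the standard transformers tokenizer
# API; the repository's actual preprocessing may differ in details.
from transformers import M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

document = "a long input document ..."
summary = "its reference summary ..."

# Encoder side: input text padded/truncated to 512 tokens.
enc = tokenizer(document, max_length=512, padding="max_length",
                truncation=True, return_tensors="pt")

# Decoder side: target summary padded/truncated to 128 tokens.
dec = tokenizer(text_target=summary, max_length=128, padding="max_length",
                truncation=True, return_tensors="pt")
```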
## Training
Trained with a cross-entropy loss.
- Time: 3 days 10 hours
- Epochs: 10 of 10 (1072K steps)
- GPUs: 4× NVIDIA A100-SXM4-40GB
- eloss (validation loss): 2.824 → 1.745
- tloss (training loss): 4.559 → 1.615
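For reference, a minimal sketch of the objective (not the repository's actual training script): when `labels` are passed to a transformers seq2seq model, it computes token-level cross-entropy internally, ignoring positions set to -100.

```python
# Minimal sketch of the cross-entropy objective, not the actual training
# script. Passing `labels` makes transformers compute token-level
# cross-entropy, ignoring label positions set to -100.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

batch = tokenizer("input document", text_target="reference summary",
                  max_length=512, truncation=True, return_tensors="pt")
labels = batch.pop("labels")
labels[labels == tokenizer.pad_token_id] = -100  # exclude padding from the loss

loss = model(**batch, labels=labels).loss  # cross-entropy over summary tokens
loss.backward()
```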
ROUGE results on each individual dataset's test set:
| dataset | ROUGE-1 P | ROUGE-1 R | ROUGE-1 F | ROUGE-2 P | ROUGE-2 R | ROUGE-2 F | ROUGE-L P | ROUGE-L R | ROUGE-L F |
|---|---|---|---|---|---|---|---|---|---|
| cnc | 30.13 | 22.56 | 25.21 | 10.53 | 8.01 | 8.9 | 22.47 | 16.92 | 18.86 |
| sumeczech | 26.6 | 19.66 | 22.01 | 8.17 | 6.12 | 6.82 | 19.93 | 14.81 | 16.54 |
| cnndm | 41.8 | 38.41 | 38.94 | 18.74 | 17.14 | 17.4 | 29.69 | 27.33 | 27.68 |
| xsum | 38.27 | 33.62 | 35.16 | 14.39 | 12.69 | 13.25 | 30.77 | 27.05 | 28.29 |
| mlsum-tu | 52.44 | 44.36 | 46.39 | 36.98 | 31.51 | 32.86 | 46.04 | 39.04 | 40.8 |
| mlsum-de | 42.19 | 40.5 | 40.7 | 28.8 | 28.51 | 28.37 | 38.95 | 37.7 | 37.79 |
| mlsum-fr | 34.57 | 27.74 | 29.95 | 16.27 | 13.04 | 14.08 | 27.18 | 21.89 | 23.6 |
| mlsum-es | 30.93 | 26.41 | 27.66 | 11.42 | 9.85 | 10.28 | 25.12 | 21.59 | 22.55 |
| mlsum-ru | 0.65 | 0.52 | 0.56 | 0.15 | 0.15 | 0.15 | 0.65 | 0.52 | 0.56 |
| cnewsum | 25.14 | 26.56 | 24.45 | 6.89 | 7.54 | 6.78 | 24.77 | 26.15 | 24.08 |

(P = Precision, R = Recall, F = Fscore)
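Scores like these can be computed with a standard ROUGE implementation; below is a sketch using the Hugging Face `evaluate` package (a tooling assumption: the card's RougeRAW variant for Czech is a separate implementation, and `evaluate` reports F-measures by default rather than the precision/recall columns above).

```python
# Sketch: scoring summaries with a standard ROUGE implementation via the
# `evaluate` package (tooling assumption; RougeRAW for Czech is separate).
# By default this returns F-measures only.
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["generated summary one", "generated summary two"],
    references=["reference summary one", "reference summary two"],
    use_stemmer=True,
)
print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```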