---
language:
- cs
- en
- de
- fr
- tr
- zh
- es
- ru
tags:
- Summarization
- abstractive summarization
- multilingual summarization
- m2m100_418M
- Czech
- text2text generation
- text generation
license: cc-by-sa-4.0
datasets:
- Multilingual_large_dataset_(multilarge)
- cnn/dm
- xsum
- mlsum
- cnewsum
- cnc
- sumeczech
metrics:
- rouge
- rougeraw
- MemesCS
---
# m2m100-418M-multilingual-summarization-multilarge-cs
This model is a checkpoint of [facebook/m2m100_418M](https://huggingface.co/facebook/m2m100_418M) fine-tuned on the Multilingual large summarization dataset, which focuses on Czech texts, to produce multilingual summaries.
## Task
The model produces multi-sentence summaries in eight languages. By combining a considerable amount of Czech documents with documents in other languages, we aimed to improve the model's summarization in Czech. Supported languages: 'cs', 'en', 'de', 'es', 'fr', 'ru', 'tr', 'zh'.

The example below assumes that you are using the provided MultilingualSummarizer.ipynb notebook and the accompanying files from the git repository.

```python
from collections import OrderedDict

## Configuration of summarization pipeline
#
def summ_config():
    cfg = OrderedDict([
        
        ## summarization model - checkpoint
        #   ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
        #   ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
        #   ctu-aic/mbart25-multilingual-summarization-multilarge-cs
        ("model_name", "ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs"),
        
        ## language of summarization task
        #   language : string : cs, en, de, fr, es, tr, ru, zh
        ("language", "en"), 
        
        ## generation method parameters in dictionary
        #
        ("inference_cfg", OrderedDict([
            ("num_beams", 4),
            ("top_k", 40),
            ("top_p", 0.92),
            ("do_sample", True),
            ("temperature", 0.95),
            ("repetition_penalty", 1.23),
            ("no_repeat_ngram_size", None),
            ("early_stopping", True),
            ("max_length", 128),
            ("min_length", 10),
        ])),
        #texts to summarize values = (list of strings, string, dataset)
        ("texts",
            [
               "english text1 to summarize",
               "english text2 to summarize",
            ]
        ),
        #OPTIONAL: Target summaries values = (list of strings, string, None)
        ('golds',
         [
               "target english text1",
               "target english text2",
         ]),
        #('golds', None),
    ])
    return cfg

cfg = summ_config()
# MultiSummarizer is defined in the provided MultilingualSummarizer.ipynb notebook
mSummarize = MultiSummarizer(**cfg)
ret = mSummarize(**cfg)
```

## Dataset
The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and daily mails. Training used the entire training set and 72% of the validation set.
```
Train set:        3 464 563 docs
Validation set:     121 260 docs
```

| dataset | fragment compression | fragment density | fragment coverage | avg document length (nsent) | avg document length (nwords) | avg summary length (nsent) | avg summary length (nwords) | documents (count) |
|-------------|----------|---------------------|--------------------|--------|---------|--------|--------|--------|
| cnc      | 7.388    | 0.303               | 0.088              | 16.121 | 316.912 | 3.272  | 46.805 | 750K |
| sumeczech   | 11.769   | 0.471               | 0.115              | 27.857 | 415.711 | 2.765  | 38.644 | 1M |
| cnndm       | 13.688   | 2.983               | 0.538              | 32.783 | 676.026 | 4.134  | 54.036 | 300K |
| xsum        | 18.378   | 0.479               | 0.194              | 18.607 | 369.134 | 1.000  | 21.127 | 225K|
| mlsum/tu    | 8.666    | 5.418               | 0.461              | 14.271 | 214.496 | 1.793  | 25.675 | 274K |
| mlsum/de    | 24.741   | 8.235               | 0.469              | 32.544 | 539.653 | 1.951  | 23.077 | 243K|
| mlsum/fr    | 24.388   | 2.688               | 0.424              | 24.533 | 612.080 | 1.320  | 26.93  | 425K |
| mlsum/es    | 36.185   | 3.705               | 0.510              | 31.914 | 746.927 | 1.142  | 21.671 | 291K |
| mlsum/ru    | 78.909   | 1.194               | 0.246              | 62.141 | 948.079 | 1.012  | 11.976 | 27K|
| cnewsum     | 20.183   | 0.000               | 0.000              | 16.834 | 438.271 | 1.109  | 21.926 | 304K |
#### Tokenization
Truncation and padding were set to 512 tokens for the encoder (input text) and 128 for the decoder (summary). 
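
The effect of these limits can be illustrated with a minimal sketch. It is not the tokenizer code used for training: the helper `truncate_and_pad` and the padding id `0` are hypothetical placeholders, and in practice a tokenizer would handle both steps.

```python
from typing import List

ENCODER_MAX_LEN = 512  # limit for the input text, as stated above
DECODER_MAX_LEN = 128  # limit for the summary, as stated above


def truncate_and_pad(token_ids: List[int], max_len: int, pad_id: int = 0) -> List[int]:
    """Cut token_ids to max_len, then right-pad with pad_id up to max_len.

    pad_id=0 is an illustrative placeholder, not the model's real padding id.
    """
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


# A long input is cut to the encoder limit; a short summary is padded to the decoder limit.
encoder_input = truncate_and_pad(list(range(1, 601)), ENCODER_MAX_LEN)
decoder_input = truncate_and_pad([7, 8, 9], DECODER_MAX_LEN)
```

Every sequence in a batch ends up with the same fixed length, which is what allows the batch to be stacked into a single tensor.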
## Training
The model was trained with cross-entropy loss.
```
Time: 3 days 10 hours
Epochs: 1072K steps = 10 (from 10)
GPUs: 4x NVIDIA A100-SXM4-40GB
eloss: 2.824 - 1.745
tloss: 4.559 - 1.615
```
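
For reference, token-level cross-entropy can be sketched in plain Python as below. This is an illustrative re-implementation and not the training code; the function name `cross_entropy` and the toy inputs are assumptions made for this example.

```python
import math
from typing import List


def cross_entropy(logits: List[List[float]], targets: List[int]) -> float:
    """Mean negative log-likelihood of the target tokens.

    logits:  one row of unnormalized scores per target position
    targets: index of the correct token for each position
    """
    total = 0.0
    for row, target in zip(logits, targets):
        # log-sum-exp with the max subtracted for numerical stability
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[target]
    return total / len(targets)


# Uniform scores over 4 tokens give a loss of log(4).
uniform_loss = cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])
```

A lower value means the model assigns higher probability to the reference summary tokens.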
### ROUGE results per individual dataset test set:

| dataset | ROUGE-1 Precision | ROUGE-1 Recall | ROUGE-1 Fscore | ROUGE-2 Precision | ROUGE-2 Recall | ROUGE-2 Fscore | ROUGE-L Precision | ROUGE-L Recall | ROUGE-L Fscore |
|------------|---------|---------|-----------|--------|--------|-----------|--------|--------|---------|
| cnc    | 30.13   | 22.56   | 25.21     | 10.53  | 8.01   | 8.9       | 22.47  | 16.92  | 18.86   |
| sumeczech  | 26.6    | 19.66   | 22.01     | 8.17   | 6.12   | 6.82      | 19.93  | 14.81  | 16.54   |
| cnndm      | 41.8    | 38.41   | 38.94     | 18.74  | 17.14  | 17.4      | 29.69  | 27.33  | 27.68   |
| xsum       | 38.27   | 33.62   | 35.16     | 14.39  | 12.69  | 13.25     | 30.77  | 27.05  | 28.29   |
| mlsum-tu   | 52.44   | 44.36   | 46.39     | 36.98  | 31.51  | 32.86     | 46.04  | 39.04  | 40.8    |
| mlsum-de   | 42.19   | 40.5    | 40.7      | 28.8   | 28.51  | 28.37     | 38.95  | 37.7   | 37.79   |
| mlsum-fr   | 34.57   | 27.74   | 29.95     | 16.27  | 13.04  | 14.08     | 27.18  | 21.89  | 23.6    |
| mlsum-es   | 30.93   | 26.41   | 27.66     | 11.42  | 9.85   | 10.28     | 25.12  | 21.59  | 22.55   |
| mlsum-ru   | 0.65    | 0.52    | 0.56      | 0.15   | 0.15   | 0.15      | 0.65   | 0.52   | 0.56    |
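
To make the Precision, Recall and Fscore columns concrete, here is a simplified sketch of ROUGE-N. It assumes lowercasing and whitespace tokenization and is not the scorer used to produce the table above; the function name `rouge_n` is hypothetical.

```python
from collections import Counter
from typing import Tuple


def rouge_n(candidate: str, reference: str, n: int = 1) -> Tuple[float, float, float]:
    """Return (precision, recall, fscore) of n-gram overlap between two texts."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Clipped overlap: each n-gram counts at most as often as it occurs in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f = rouge_n("the cat sat", "the cat sat on the mat")
```

Precision is measured against the generated summary and recall against the reference, so a short summary that copies part of the reference scores high precision but low recall.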



# USAGE
```
soon
```