metadata
language:
  - cs
  - en
  - de
  - fr
  - tr
  - zh
  - es
  - ru
tags:
  - Summarization
  - abstractive summarization
  - mbart-large-cc25
  - Czech
  - text2text generation
  - text generation
license: cc-by-sa-4.0
datasets:
  - Multilingual_large_dataset_(multilarge)
  - cnn/dm
  - xsum
  - mlsum
  - cnewsum
  - cnc
  - sumeczech
metrics:
  - rouge
  - rougeraw
  - MemesCS

mbart25-multilingual-summarization-multilarge-cs

This model is a checkpoint of facebook/mbart-large-cc25 fine-tuned on the Multilingual large summarization dataset, which focuses on Czech texts, to produce multilingual summaries.

Task

The model produces multi-sentence summaries in eight languages. By training on a considerable number of Czech documents together with documents in other languages, we aimed to improve summarization quality in Czech. Supported languages: 'cs_CZ': 'cs', 'en_XX': 'en', 'de_DE': 'de', 'es_XX': 'es', 'fr_XX': 'fr', 'ru_RU': 'ru', 'tr_TR': 'tr', 'zh_CN': 'zh'.
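The language-code mapping above can be kept in code so that the right mBART code is selected for each input. This is a minimal sketch: the dictionary and the `mbart_code` helper are illustrative names, not part of the model's API, and the `cs_CZ` and `zh_CN` entries are the standard mBART-cc25 codes, included on the assumption that they complete the eight languages.

```python
# Hypothetical helper: maps mBART-cc25 language codes to two-letter tags.
MBART_TO_ISO = {
    "en_XX": "en",
    "de_DE": "de",
    "es_XX": "es",
    "fr_XX": "fr",
    "ru_RU": "ru",
    "tr_TR": "tr",
    # Assumed additions: standard mBART-cc25 codes for Czech and Chinese.
    "cs_CZ": "cs",
    "zh_CN": "zh",
}

# Reverse lookup: two-letter tag -> mBART code.
ISO_TO_MBART = {iso: code for code, iso in MBART_TO_ISO.items()}


def mbart_code(lang: str) -> str:
    """Return the mBART code for a two-letter tag, e.g. 'cs' -> 'cs_CZ'."""
    try:
        return ISO_TO_MBART[lang.lower()]
    except KeyError:
        raise ValueError(f"unsupported language: {lang!r}") from None
```

The returned code is the kind of value typically passed as the tokenizer's source language when preparing input for an mBART model.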

Dataset

The Multilingual large summarization dataset consists of 10 sub-datasets, mainly news articles (including CNN/Daily Mail). Training used the entire training set and 72% of the validation set.

Train set:        3 464 563 docs
Validation set:     121 260 docs
| Dataset | Compression (fragment) | Density (fragment) | Coverage (fragment) | Avg. document length (nsent) | Avg. document length (nwords) | Avg. summary length (nsent) | Avg. summary length (nwords) | Documents (count) |
|---|---|---|---|---|---|---|---|---|
| cnc | 7.388 | 0.303 | 0.088 | 16.121 | 316.912 | 3.272 | 46.805 | 750K |
| sumeczech | 11.769 | 0.471 | 0.115 | 27.857 | 415.711 | 2.765 | 38.644 | 1M |
| cnndm | 13.688 | 2.983 | 0.538 | 32.783 | 676.026 | 4.134 | 54.036 | 300K |
| xsum | 18.378 | 0.479 | 0.194 | 18.607 | 369.134 | 1.000 | 21.127 | 225K |
| mlsum/tu | 8.666 | 5.418 | 0.461 | 14.271 | 214.496 | 1.793 | 25.675 | 274K |
| mlsum/de | 24.741 | 8.235 | 0.469 | 32.544 | 539.653 | 1.951 | 23.077 | 243K |
| mlsum/fr | 24.388 | 2.688 | 0.424 | 24.533 | 612.080 | 1.320 | 26.93 | 425K |
| mlsum/es | 36.185 | 3.705 | 0.510 | 31.914 | 746.927 | 1.142 | 21.671 | 291K |
| mlsum/ru | 78.909 | 1.194 | 0.246 | 62.141 | 948.079 | 1.012 | 11.976 | 27K |
| cnewsum | 20.183 | 0.000 | 0.000 | 16.834 | 438.271 | 1.109 | 21.926 | 304K |

Tokenization

Truncation and padding were set to 512 tokens for the encoder (input text) and 128 for the decoder (summary).
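The effect of these limits can be illustrated without the tokenizer itself. The sketch below operates on plain lists of token ids; `truncate_and_pad` is a hypothetical helper, and the pad id of 1 is an assumption that should be read from the real tokenizer in practice. Unlike a real tokenizer, it does not preserve special tokens (end-of-sequence, language code) when truncating.

```python
from typing import List, Tuple

PAD_ID = 1            # assumed pad token id; take it from the actual tokenizer
MAX_SOURCE_LEN = 512  # encoder limit stated above
MAX_TARGET_LEN = 128  # decoder limit stated above


def truncate_and_pad(
    ids: List[int], max_len: int, pad_id: int = PAD_ID
) -> Tuple[List[int], List[int]]:
    """Cut ids to max_len, then right-pad; return (ids, attention_mask)."""
    kept = ids[:max_len]
    n_pad = max_len - len(kept)
    padded = kept + [pad_id] * n_pad
    # 1 marks real tokens, 0 marks padding positions.
    mask = [1] * len(kept) + [0] * n_pad
    return padded, mask
```

A 600-token document is cut to 512 ids with a mask of all ones, while a 3-token summary is padded out to 128 ids with only the first three mask positions set.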

Training

The model was trained with cross-entropy loss.
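For reference, token-level cross-entropy is the mean negative log-probability the model assigns to each reference summary token. The following is a plain-Python sketch of that quantity, not the training code used for this model; skipping positions marked with `-100` is an assumed convention for padded targets.

```python
import math
from typing import Sequence

IGNORE_INDEX = -100  # assumed marker for padded target positions


def cross_entropy(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean negative log-likelihood of target tokens, skipping ignored positions."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == IGNORE_INDEX:
            continue
        m = max(row)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_norm - row[target]  # -log softmax(row)[target]
        count += 1
    if count == 0:
        raise ValueError("no target positions to score")
    return total / count
```

With uniform logits over a vocabulary of size V, the loss equals ln V, which gives a useful baseline when reading the loss values reported for this model.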

Time: 3 days 8 hours
Epochs: approx. 8 of 10 (860K steps)
GPUs: 4x NVIDIA A100-SXM4-40GB
eloss: 2.214 - 1.762
tloss: 3.365 - 1.445

ROUGE results per individual dataset test set:

| Dataset | ROUGE-1 Precision | ROUGE-1 Recall | ROUGE-1 F-score | ROUGE-2 Precision | ROUGE-2 Recall | ROUGE-2 F-score | ROUGE-L Precision | ROUGE-L Recall | ROUGE-L F-score |
|---|---|---|---|---|---|---|---|---|---|
| cnc | 27.45 | 24.8 | 25.24 | 9.35 | 8.54 | 8.67 | 20.14 | 18.19 | 18.54 |
| sumeczech | 25.38 | 21.61 | 22.66 | 7.71 | 6.67 | 6.96 | 18.76 | 16.02 | 16.78 |
| cnndm | 41.97 | 42.61 | 41.05 | 19.64 | 19.88 | 19.16 | 29.38 | 29.85 | 28.73 |
| xsum | 39.18 | 39.8 | 38.83 | 16.59 | 16.98 | 16.5 | 31.25 | 31.74 | 30.96 |
| mlsum-tu | 51.02 | 47.95 | 47.72 | 36.15 | 34.07 | 33.9 | 44.59 | 41.9 | 41.74 |
| mlsum-de | 46.96 | 46.16 | 46.02 | 35.95 | 35.87 | 35.66 | 43.26 | 42.7 | 42.53 |
| mlsum-fr | 34.51 | 31.4 | 32.03 | 16.56 | 15.07 | 15.37 | 26.73 | 24.41 | 24.86 |
| mlsum-es | 32.62 | 29.66 | 30.21 | 13.3 | 12.2 | 12.39 | 26.24 | 24.02 | 24.4 |
| mlsum-ru | 1.25 | 1.54 | 1.31 | 0.46 | 0.46 | 0.44 | 1.25 | 1.54 | 1.31 |
| cnewsum | 26.43 | 29.44 | 26.38 | 7.38 | 8.52 | 7.46 | 25.99 | 28.94 | 25.92 |
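To make the precision, recall, and F-score columns concrete, here is a simplified sketch of ROUGE-N for a single summary pair. It is not the evaluation script behind the numbers above: it assumes lowercased whitespace tokenization, omits ROUGE-L, and returns values in the range 0 to 1 rather than on the 0 to 100 scale used in the table. The function names are illustrative.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 1) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection clips each n-gram to its minimum count in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1
```

For example, `rouge_n("the cat sat", "the cat sat on the mat")` gives precision 1.0 and recall 0.5, because every candidate unigram appears in the reference but only half of the reference unigrams are covered.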

Usage

Coming soon.