	---
	language:
	- cs
	- en
	- de
	- fr
	- tr
	- zh
	- es
	- ru
	tags:
	- Summarization
	- abstractive summarization
	- mbart-large-cc25
	- Czech
	- text2text generation
	- text generation
	license: cc-by-sa-4.0
	datasets:
	- Multilingual_large_dataset_(multilarge)
	- cnc/dm
	- xsum
	- mlsum
	- cnewsum
	- cnc
	- sumeczech
	metrics:
	- rouge
	- rougeraw
	- MemesCS
	---

	# mbart25-multilingual-summarization-multilarge-cs
	This model is a checkpoint of [facebook/mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25) fine-tuned on the Multilingual large summarization dataset, which focuses on Czech texts, to produce multilingual summaries.

	## Task
	The model generates multi-sentence summaries in eight languages. By adding documents in other languages alongside a considerable number of Czech documents, we aimed to improve summarization in Czech. Supported languages (mBART code: short code): `'en_XX': 'en'`, `'de_DE': 'de'`, `'es_XX': 'es'`, `'fr_XX': 'fr'`, `'ru_RU': 'ru'`, `'tr_TR': 'tr'`.
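
The language pairs listed above can be kept in a small lookup table when preparing inputs. The following is a minimal sketch and not part of the repository: the names `MBART_TO_SHORT`, `SHORT_TO_MBART`, and `to_mbart_code` are hypothetical, and only the pairs listed above are included.

```python
# Pairs copied from the "Supported languages" list; nothing else is filled in.
MBART_TO_SHORT = {
    "en_XX": "en",
    "de_DE": "de",
    "es_XX": "es",
    "fr_XX": "fr",
    "ru_RU": "ru",
    "tr_TR": "tr",
}

# Reverse lookup: short code -> mBART language code.
SHORT_TO_MBART = {short: code for code, short in MBART_TO_SHORT.items()}


def to_mbart_code(language):
    """Return the mBART code for a short language code (hypothetical helper)."""
    try:
        return SHORT_TO_MBART[language]
    except KeyError:
        raise ValueError(f"no mBART code listed for language {language!r}") from None
```

For example, `to_mbart_code("de")` returns `"de_DE"`, while a code that is not in the table raises `ValueError`.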

	## Usage
	The example below assumes that you are using the provided `MultilingualSummarizer.ipynb` notebook together with the files included in the git repository.

	```python
	from collections import OrderedDict

	# MultiSummarizer is expected to come from the files shipped with
	# MultilingualSummarizer.ipynb; it is not defined in this snippet.

	## Configuration of summarization pipeline
	#
	def summ_config():
	    cfg = OrderedDict([
	        ## summarization model - checkpoint
	        # ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
	        # ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
	        # ctu-aic/mbart25-multilingual-summarization-multilarge-cs
	        ("model_name", "ctu-aic/mbart25-multilingual-summarization-multilarge-cs"),

	        ## language of summarization task
	        # language : string : cs, en, de, fr, es, tr, ru, zh
	        ("language", "en"),

	        ## generation method parameters in dictionary
	        #
	        ("inference_cfg", OrderedDict([
	            ("num_beams", 4),
	            ("top_k", 40),
	            ("top_p", 0.92),
	            ("do_sample", True),
	            ("temperature", 0.95),
	            ("repetition_penalty", 1.23),
	            ("no_repeat_ngram_size", None),
	            ("early_stopping", True),
	            ("max_length", 128),
	            ("min_length", 10),
	        ])),
	        # texts to summarize, values = (list of strings, string, dataset)
	        ("texts", [
	            "english text1 to summarize",
	            "english text2 to summarize",
	        ]),
	        # OPTIONAL: target summaries, values = (list of strings, string, None)
	        ("golds", [
	            "target english text1",
	            "target english text2",
	        ]),
	        # ("golds", None),
	    ])
	    return cfg

	cfg = summ_config()
	msummarizer = MultiSummarizer(**cfg)
	ret = msummarizer(**cfg)
	```
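
Outside the notebook, the same generation settings could in principle be passed to the Hugging Face `transformers` API directly. The sketch below is an assumption, not the repository's `MultiSummarizer`: it assumes the checkpoint loads with `AutoTokenizer` and `AutoModelForSeq2SeqLM` and follows the usual mBART convention of starting decoding with the target-language token. `build_generate_kwargs` and `summarize` are hypothetical helper names.

```python
MODEL_NAME = "ctu-aic/mbart25-multilingual-summarization-multilarge-cs"


def build_generate_kwargs(inference_cfg):
    """Drop None-valued entries so generate() falls back to its defaults."""
    return {key: value for key, value in inference_cfg.items() if value is not None}


def summarize(texts, lang_code="en_XX", model_name=MODEL_NAME, inference_cfg=None):
    """Summarize a list of strings (assumed usage, requires transformers and torch)."""
    # Imported lazily so the helper above can be used without the heavy dependencies.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.src_lang = lang_code  # assumption: mBART-style source language tag
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    batch = tokenizer(
        texts, return_tensors="pt", padding=True, truncation=True, max_length=512
    )
    output_ids = model.generate(
        **batch,
        # assumption: decoding starts with the target-language token
        decoder_start_token_id=tokenizer.convert_tokens_to_ids(lang_code),
        **build_generate_kwargs(inference_cfg or {}),
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

Calling `summarize(cfg["texts"], inference_cfg=cfg["inference_cfg"])` would reuse the settings from `summ_config()`; the `None` value of `no_repeat_ngram_size` is simply omitted.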

	## Dataset
	The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and daily mails. Training used the entire training set and 72% of the validation set.
	```
	Train set: 3 464 563 docs
	Validation set: 121 260 docs
	```
	| Stats | fragment | | | avg document length | | avg summary length | | Documents |
	|-------------|----------|---------------------|--------------------|--------|---------|--------|--------|--------|
	| __dataset__ | __compression__ | __density__ | __coverage__ | __nsent__ | __nwords__ | __nsent__ | __nwords__ | __count__ |
	| cnc | 7.388 | 0.303 | 0.088 | 16.121 | 316.912 | 3.272 | 46.805 | 750K |
	| sumeczech | 11.769 | 0.471 | 0.115 | 27.857 | 415.711 | 2.765 | 38.644 | 1M |
	| cnndm | 13.688 | 2.983 | 0.538 | 32.783 | 676.026 | 4.134 | 54.036 | 300K |
	| xsum | 18.378 | 0.479 | 0.194 | 18.607 | 369.134 | 1.000 | 21.127 | 225K |
	| mlsum/tu | 8.666 | 5.418 | 0.461 | 14.271 | 214.496 | 1.793 | 25.675 | 274K |
	| mlsum/de | 24.741 | 8.235 | 0.469 | 32.544 | 539.653 | 1.951 | 23.077 | 243K |
	| mlsum/fr | 24.388 | 2.688 | 0.424 | 24.533 | 612.080 | 1.320 | 26.93 | 425K |
	| mlsum/es | 36.185 | 3.705 | 0.510 | 31.914 | 746.927 | 1.142 | 21.671 | 291K |
	| mlsum/ru | 78.909 | 1.194 | 0.246 | 62.141 | 948.079 | 1.012 | 11.976 | 27K |
	| cnewsum | 20.183 | 0.000 | 0.000 | 16.834 | 438.271 | 1.109 | 21.926 | 304K |
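
The compression, density, and coverage columns look like the extractive fragment statistics commonly used to describe summarization datasets. Assuming that is what they are, the sketch below shows how such values can be computed for one tokenized document and summary pair. The function names are hypothetical, whitespace tokenization is an assumption, and the script actually used to produce the table is not shown in this card.

```python
def extractive_fragments(article, summary):
    """Greedily collect the longest token runs of the summary found in the article."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                # Extend the match as far as both sequences agree.
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best:
            fragments.append(summary[i:i + best])
        i += max(best, 1)
    return fragments


def fragment_stats(article, summary):
    """Return compression, coverage and density (summary must be non-empty)."""
    fragments = extractive_fragments(article, summary)
    n = len(summary)
    return {
        "compression": len(article) / n,
        "coverage": sum(len(f) for f in fragments) / n,
        "density": sum(len(f) ** 2 for f in fragments) / n,
    }


stats = fragment_stats(
    "the cat sat on the mat today".split(),
    "cat sat on a mat".split(),
)
```

Under these definitions, low coverage and density mean that the summary shares few token runs with the document, while high density indicates long copied passages.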

	### Tokenization
	Truncation and padding were set to 512 tokens for the encoder (input text) and 128 for the decoder (summary).
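
To make the fixed lengths concrete, here is a minimal pure-Python sketch of truncating and padding token ID lists to 512 (encoder) and 128 (decoder). The helper `pad_or_truncate` and the `PAD_ID` value are hypothetical placeholders; in practice the tokenizer performs this step and also takes care of special tokens, which this sketch ignores.

```python
ENCODER_MAX_LENGTH = 512  # input text, as stated above
DECODER_MAX_LENGTH = 128  # summary, as stated above
PAD_ID = 1  # hypothetical placeholder; use the tokenizer's real pad token ID


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Clip a list of token IDs to max_length, then right-pad it with pad_id."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


long_input = pad_or_truncate(list(range(600)), ENCODER_MAX_LENGTH)
short_summary = pad_or_truncate([5, 6, 7], DECODER_MAX_LENGTH)
```

Both results have exactly the configured length: the 600-token input is cut to 512 IDs, and the 3-token summary is padded to 128 IDs.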

	## Training
	The model was trained with cross-entropy loss.
	```
	Time: 3 days 8 hours
	Epochs: approx. 8 (out of 10), 860K steps
	GPUs: 4x NVIDIA A100-SXM4-40GB
	eloss: 2.214 - 1.762
	tloss: 3.365 - 1.445
	```
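
For reference, token-level cross-entropy averages the negative log-probability of each target token. The sketch below computes it from raw logits in plain Python. It illustrates the loss itself, not the training code, and `ignore_index=-100` is an assumption borrowed from the common PyTorch convention for skipping padded positions.

```python
import math


def token_cross_entropy(logits, targets, ignore_index=-100):
    """Mean negative log-likelihood of the target tokens.

    logits: list of per-position score lists (one score per vocabulary entry)
    targets: list of target token IDs; positions equal to ignore_index are skipped
    """
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count


# Uniform scores over four tokens give a loss of log(4).
uniform_loss = token_cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])
```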

	### ROUGE results per individual dataset test set:
	| ROUGE | ROUGE-1 | | | ROUGE-2 | | | ROUGE-L | | |
	|-----------|---------|---------|-----------|--------|--------|-----------|--------|--------|---------|
	| dataset | Precision | Recall | Fscore | Precision | Recall | Fscore | Precision | Recall | Fscore |
	| cnc | 27.45 | 24.8 | 25.24 | 9.35 | 8.54 | 8.67 | 20.14 | 18.19 | 18.54 |
	| sumeczech | 25.38 | 21.61 | 22.66 | 7.71 | 6.67 | 6.96 | 18.76 | 16.02 | 16.78 |
	| cnndm | 41.97 | 42.61 | 41.05 | 19.64 | 19.88 | 19.16 | 29.38 | 29.85 | 28.73 |
	| xsum | 39.18 | 39.8 | 38.83 | 16.59 | 16.98 | 16.5 | 31.25 | 31.74 | 30.96 |
	| mlsum-tu | 51.02 | 47.95 | 47.72 | 36.15 | 34.07 | 33.9 | 44.59 | 41.9 | 41.74 |
	| mlsum-de | 46.96 | 46.16 | 46.02 | 35.95 | 35.87 | 35.66 | 43.26 | 42.7 | 42.53 |
	| mlsum-fr | 34.51 | 31.4 | 32.03 | 16.56 | 15.07 | 15.37 | 26.73 | 24.41 | 24.86 |
	| mlsum-es | 32.62 | 29.66 | 30.21 | 13.3 | 12.2 | 12.39 | 26.24 | 24.02 | 24.4 |
	| mlsum-ru | 1.25 | 1.54 | 1.31 | 0.46 | 0.46 | 0.44 | 1.25 | 1.54 | 1.31 |
	| cnewsum | 26.43 | 29.44 | 26.38 | 7.38 | 8.52 | 7.46 | 25.99 | 28.94 | 25.92 |
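
As a reminder of what the precision, recall, and F-score columns measure, the sketch below computes ROUGE-N from n-gram overlap between a candidate summary and a reference. It is a simplified illustration with whitespace tokenization and hypothetical function names, not the scorer used to produce the table; it returns fractions in [0, 1], whereas the table values appear to be on a 0-100 scale.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, fscore) of the n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


p1, r1, f1 = rouge_n("the cat sat".split(), "the cat sat down".split(), n=1)
p2, r2, f2 = rouge_n("the cat sat".split(), "the cat sat down".split(), n=2)
```

In this example every candidate unigram appears in the reference, so ROUGE-1 precision is 1.0, while recall is 0.75 because the reference word "down" is missing from the candidate.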