---
language:
- cs
- en
- de
- fr
- tr
- zh
- es
- ru
tags:
- Summarization
- abstractive summarization
- multilingual summarization
- m2m100_418M
- Czech
- text2text generation
- text generation
license: cc-by-sa-4.0
datasets:
- Multilingual_large_dataset_(multilarge)
- cnndm
- xsum
- mlsum
- cnewsum
- cnc
- sumeczech
metrics:
- rouge
- rougeraw
- MemesCS
---
# m2m100-418M-multilingual-summarization-multilarge-cs

This model is a fine-tuned checkpoint of [facebook/m2m100_418M](https://huggingface.co/facebook/m2m100_418M) on the Multilingual large summarization dataset, focused on Czech texts, to produce multilingual summaries.

## Task

The model produces multi-sentence summaries in eight languages. By combining a considerable amount of Czech documents with documents in the other languages, we aimed to improve the model's summarization of Czech. Supported languages: 'cs', 'en', 'de', 'es', 'fr', 'ru', 'tr', 'zh'.

## Usage

The following assumes you are working from the provided MultilingualSummarizer.ipynb notebook and the accompanying files from the git repository.
```python
from collections import OrderedDict

## Configuration of the summarization pipeline
def summ_config():
    cfg = OrderedDict([

        ## summarization model checkpoint; one of:
        # ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
        # ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
        # ctu-aic/mbart25-multilingual-summarization-multilarge-cs
        ("model_name", "ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs"),

        ## language of the summarization task
        # language : string : cs, en, de, fr, es, tr, ru, zh
        ("language", "en"),

        ## generation method parameters, as a dictionary
        ("inference_cfg", OrderedDict([
            ("num_beams", 4),
            ("top_k", 40),
            ("top_p", 0.92),
            ("do_sample", True),
            ("temperature", 0.95),
            ("repetition_penalty", 1.23),
            ("no_repeat_ngram_size", None),
            ("early_stopping", True),
            ("max_length", 128),
            ("min_length", 10),
        ])),

        ## texts to summarize; value = list of strings, a string, or a dataset
        ("texts",
         [
             "english text1 to summarize",
             "english text2 to summarize",
         ]),

        ## OPTIONAL: target summaries; value = list of strings, a string, or None
        ("golds",
         [
             "target english text1",
             "target english text2",
         ]),
        # ("golds", None),
    ])
    return cfg


# MultiSummarizer is defined by the files included in the git repository.
cfg = summ_config()
mSummarize = MultiSummarizer(**cfg)
ret = mSummarize(**cfg)
```
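
If you prefer to call the checkpoint directly rather than through the notebook, the sketch below uses the standard Hugging Face Transformers m2m100 interface. It is an illustration, not the authors' pipeline: the generation parameters mirror `inference_cfg` above, and forcing the target-language BOS token is an assumption carried over from the base facebook/m2m100_418M usage.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

ckpt = "ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs"
tokenizer = M2M100Tokenizer.from_pretrained(ckpt)
model = M2M100ForConditionalGeneration.from_pretrained(ckpt)

text = "Dlouhý český článek, který chceme shrnout ..."  # placeholder input document
tokenizer.src_lang = "cs"  # language of the input document

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.get_lang_id("cs"),  # assumed: summary language
    num_beams=4,
    do_sample=True,
    top_k=40,
    top_p=0.92,
    temperature=0.95,
    repetition_penalty=1.23,
    early_stopping=True,
    max_length=128,
    min_length=10,
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
```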

## Dataset

The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and daily-mail articles. Training used the entire training set and 72% of the validation set.

```
Train set:      3 464 563 docs
Validation set:   121 260 docs
```

Fragment statistics (compression, density, coverage), average document and summary lengths, and document counts per sub-dataset:

| dataset | compression | density | coverage | avg doc sents | avg doc words | avg summary sents | avg summary words | docs |
|---------|-------------|---------|----------|---------------|---------------|-------------------|-------------------|------|
| cnc | 7.388 | 0.303 | 0.088 | 16.121 | 316.912 | 3.272 | 46.805 | 750K |
| sumeczech | 11.769 | 0.471 | 0.115 | 27.857 | 415.711 | 2.765 | 38.644 | 1M |
| cnndm | 13.688 | 2.983 | 0.538 | 32.783 | 676.026 | 4.134 | 54.036 | 300K |
| xsum | 18.378 | 0.479 | 0.194 | 18.607 | 369.134 | 1.000 | 21.127 | 225K |
| mlsum/tu | 8.666 | 5.418 | 0.461 | 14.271 | 214.496 | 1.793 | 25.675 | 274K |
| mlsum/de | 24.741 | 8.235 | 0.469 | 32.544 | 539.653 | 1.951 | 23.077 | 243K |
| mlsum/fr | 24.388 | 2.688 | 0.424 | 24.533 | 612.080 | 1.320 | 26.930 | 425K |
| mlsum/es | 36.185 | 3.705 | 0.510 | 31.914 | 746.927 | 1.142 | 21.671 | 291K |
| mlsum/ru | 78.909 | 1.194 | 0.246 | 62.141 | 948.079 | 1.012 | 11.976 | 27K |
| cnewsum | 20.183 | 0.000 | 0.000 | 16.834 | 438.271 | 1.109 | 21.926 | 304K |
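
The compression, density, and coverage columns appear to be the extractive-fragment statistics of Grusky et al. (Newsroom). Assuming that convention, here is a minimal sketch of how such statistics are computed; the greedy matcher and the whitespace tokenization are illustrative simplifications, not the authors' exact tooling.

```python
def extractive_fragments(doc_tokens, sum_tokens):
    """Greedily match maximal token spans shared by document and summary."""
    fragments, i = [], 0
    while i < len(sum_tokens):
        best = 0
        for j in range(len(doc_tokens)):
            k = 0
            while (i + k < len(sum_tokens) and j + k < len(doc_tokens)
                   and sum_tokens[i + k] == doc_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(sum_tokens[i:i + best])
            i += best
        else:
            i += 1
    return fragments


def fragment_stats(document, summary):
    d, s = document.split(), summary.split()
    frags = extractive_fragments(d, s)
    compression = len(d) / len(s)                       # doc/summary word ratio
    coverage = sum(len(f) for f in frags) / len(s)      # fraction of copied tokens
    density = sum(len(f) ** 2 for f in frags) / len(s)  # rewards long copied spans
    return compression, density, coverage
```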

#### Tokenization

Truncation and padding were set to 512 tokens for the encoder (input text) and 128 tokens for the decoder (summary).
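
A minimal sketch of that preprocessing step using the Hugging Face tokenizer API; the helper and the `text`/`summary` column names are illustrative, not taken from the authors' code.

```python
def preprocess(batch, tokenizer):
    # Encoder side: truncate/pad the source document to 512 tokens.
    model_inputs = tokenizer(
        batch["text"], max_length=512, truncation=True, padding="max_length",
    )
    # Decoder side: truncate/pad the reference summary to 128 tokens.
    labels = tokenizer(
        text_target=batch["summary"], max_length=128,
        truncation=True, padding="max_length",
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```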

## Training

The model was trained with cross-entropy loss.

```
Time:       3 days 10 hours
Epochs:     10 (of 10), ~1072K steps
GPUs:       4x NVIDIA A100-SXM4-40GB
Eval loss:  2.824 -> 1.745
Train loss: 4.559 -> 1.615
```
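
The training loop itself is not included in this card. Purely to illustrate the objective, the snippet below shows how the token-level cross-entropy reported above falls out of a standard seq2seq forward pass; the texts and language settings are placeholders.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer.src_lang, tokenizer.tgt_lang = "cs", "cs"

inputs = tokenizer("Dlouhý vstupní dokument ...", return_tensors="pt",
                   truncation=True, max_length=512)
labels = tokenizer(text_target="Cílový souhrn ...", return_tensors="pt",
                   truncation=True, max_length=128).input_ids

out = model(**inputs, labels=labels)  # out.loss is the token-level cross-entropy
out.loss.backward()                   # an optimizer step would follow in training
```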

### ROUGE results per individual dataset test set

| dataset | ROUGE-1 P | ROUGE-1 R | ROUGE-1 F | ROUGE-2 P | ROUGE-2 R | ROUGE-2 F | ROUGE-L P | ROUGE-L R | ROUGE-L F |
|---------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| cnc | 30.13 | 22.56 | 25.21 | 10.53 | 8.01 | 8.90 | 22.47 | 16.92 | 18.86 |
| sumeczech | 26.60 | 19.66 | 22.01 | 8.17 | 6.12 | 6.82 | 19.93 | 14.81 | 16.54 |
| cnndm | 41.80 | 38.41 | 38.94 | 18.74 | 17.14 | 17.40 | 29.69 | 27.33 | 27.68 |
| xsum | 38.27 | 33.62 | 35.16 | 14.39 | 12.69 | 13.25 | 30.77 | 27.05 | 28.29 |
| mlsum-tu | 52.44 | 44.36 | 46.39 | 36.98 | 31.51 | 32.86 | 46.04 | 39.04 | 40.80 |
| mlsum-de | 42.19 | 40.50 | 40.70 | 28.80 | 28.51 | 28.37 | 38.95 | 37.70 | 37.79 |
| mlsum-fr | 34.57 | 27.74 | 29.95 | 16.27 | 13.04 | 14.08 | 27.18 | 21.89 | 23.60 |
| mlsum-es | 30.93 | 26.41 | 27.66 | 11.42 | 9.85 | 10.28 | 25.12 | 21.59 | 22.55 |
| mlsum-ru | 0.65 | 0.52 | 0.56 | 0.15 | 0.15 | 0.15 | 0.65 | 0.52 | 0.56 |

(P = precision, R = recall, F = F-score.)
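
Scores of this shape can be computed with the `rouge_score` package, though the exact preprocessing may differ from the authors' setup; RougeRAW (the language-agnostic ROUGE variant listed in the metadata) is not covered by this snippet.

```python
from rouge_score import rouge_scorer

# Hypothetical example pair; in practice, iterate over a test split.
prediction = "generated summary"
reference = "reference summary"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(target=reference, prediction=prediction)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.4f} R={s.recall:.4f} F={s.fmeasure:.4f}")
```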