---
language:
- cs
tags:
- abstractive summarization
- mbart-cc25
- Czech
license: apache-2.0
datasets:
- private Czech News Center dataset news-based
- SumeCzech dataset news-based
metrics:
- rouge
- rougeraw
---

# mBART fine-tuned model for Czech abstractive summarization (HT2A-CS)

This model is a fine-tuned checkpoint of [facebook/mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25), trained on a Czech news dataset to produce Czech abstractive summaries.

## Task

The model addresses the task ``Headline + Text to Abstract`` (HT2A): generating a multi-sentence summary, treated as an abstract, from the headline and full text of a Czech news article.

## Dataset

The model was trained on a large Czech news dataset built by concatenating two datasets: a private CNC dataset provided by Czech News Center, and the [SumeCzech](https://ufal.mff.cuni.cz/sumeczech) dataset. The combined dataset contains around 1.75M Czech news documents, each consisting of Headline, Abstract, and Full-text sections. Truncation and padding were set to 512 tokens for the encoder and 128 for the decoder.
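
In `transformers`, these limits are typically enforced through the tokenizer's `max_length`, `truncation=True`, and `padding="max_length"` arguments. The effect on the resulting token-id sequences is equivalent to this plain-Python pad/truncate (the function name is illustrative, not from the original training code; `pad_id=1` assumes mBART's `<pad>` token id):

```python
def pad_or_truncate(token_ids, max_len, pad_id=1):
    """Clip a token-id sequence to max_len, or right-pad it with pad_id.

    Mirrors the tokenizer's truncation=True / padding="max_length"
    behavior; pad_id=1 corresponds to mBART's <pad> token.
    """
    if len(token_ids) > max_len:
        return token_ids[:max_len]
    return token_ids + [pad_id] * (max_len - len(token_ids))

# Encoder inputs go to 512 tokens, decoder targets to 128.
encoder_ids = pad_or_truncate(list(range(600)), 512)  # long input: truncated
decoder_ids = pad_or_truncate(list(range(100)), 128)  # short target: padded
```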

## Training

The model was trained on 1x NVIDIA Tesla A100 40GB for 60 hours and on 4x NVIDIA Tesla A100 40GB for 40 hours. During training, the model saw 12896K documents, corresponding to roughly 8.4 epochs.

## Use

The snippet below assumes that you are using the provided Summarizer.ipynb notebook.

```python
from collections import OrderedDict
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def summ_config():
    cfg = OrderedDict([
        # summarization model - checkpoint from website
        ("model_name", "krotima1/mbart-ht2a-cs"),
        ("inference_cfg", OrderedDict([
            ("num_beams", 4),
            ("top_k", 40),
            ("top_p", 0.92),
            ("do_sample", True),
            ("temperature", 0.89),
            ("repetition_penalty", 1.2),
            ("no_repeat_ngram_size", None),
            ("early_stopping", True),
            ("max_length", 128),
            ("min_length", 10),
        ])),
        # texts to summarize
        ("text",
            [
                "Input your Czech text",
            ]
        ),
    ])
    return cfg

cfg = summ_config()
# load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(cfg["model_name"])
tokenizer = AutoTokenizer.from_pretrained(cfg["model_name"])
# init summarizer (the Summarizer class is defined in Summarizer.ipynb)
summarize = Summarizer(model, tokenizer, cfg["inference_cfg"])
summarize(cfg["text"])
```
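
The `Summarizer` class itself is defined in the notebook and not reproduced on this page. A minimal sketch of such a wrapper, assuming the standard `transformers` tokenize/generate/decode flow (the class body here is an illustration, not the notebook's code), might look like:

```python
from typing import List

class Summarizer:
    """Illustrative callable wrapper around model.generate (not the notebook's code)."""

    def __init__(self, model, tokenizer, inference_cfg):
        self.model = model
        self.tokenizer = tokenizer
        # Drop options left as None so generate() falls back to its defaults.
        self.gen_kwargs = {k: v for k, v in inference_cfg.items() if v is not None}

    def __call__(self, texts: List[str]) -> List[str]:
        # Encoder inputs are capped at 512 tokens, matching the training setup.
        batch = self.tokenizer(texts, max_length=512, truncation=True,
                               padding=True, return_tensors="pt")
        outputs = self.model.generate(**batch, **self.gen_kwargs)
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
```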