metadata

language:
  - cs
tags:
  - abstractive summarization
  - mbart-cc25
  - Czech
license: apache-2.0
datasets:
  - SumeCzech dataset news-based
metrics:
  - rouge
  - rougeraw

mBART fine-tuned model for Czech abstractive summarization (AT2H-S)

This model is a checkpoint of facebook/mbart-large-cc25 fine-tuned on the SumeCzech Czech news dataset to produce abstractive summaries in Czech.

Task

The model addresses the Abstract + Text to Headline (AT2H) task: generating a one- or two-sentence summary, treated as a headline, from a Czech news text.

Dataset

The model was trained on the SumeCzech dataset, which contains around 1M Czech news documents, each consisting of Headline, Abstract, and Full-text sections. Inputs were truncated or padded to 512 tokens for the encoder and 64 tokens for the decoder.
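To make the length limits concrete, here is a minimal pure-Python sketch of what truncating and padding to a fixed length does to a sequence of token ids. It is illustrative only: the actual preprocessing would be done by the model's tokenizer, which also handles special tokens that this sketch ignores. The helper name `fit_to_length`, the padding id, and the token ids are hypothetical.

```python
ENCODER_MAX_LEN = 512  # encoder limit stated above
DECODER_MAX_LEN = 64   # decoder limit stated above
PAD_ID = 1             # assumption: placeholder padding id, not taken from the model card


def fit_to_length(token_ids, max_len, pad_id=PAD_ID):
    """Truncate token_ids to max_len, or right-pad with pad_id up to max_len."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


# Hypothetical token ids standing in for a tokenized article and headline.
long_input = list(range(100, 700))    # 600 ids, longer than the encoder limit
short_target = list(range(100, 120))  # 20 ids, shorter than the decoder limit

encoder_ids = fit_to_length(long_input, ENCODER_MAX_LEN)
decoder_ids = fit_to_length(short_target, DECODER_MAX_LEN)

print(len(encoder_ids), len(decoder_ids))  # 512 64
```

In practice, article text beyond the first 512 encoder tokens would not be seen by the model under this configuration.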

Training

The model was trained on a single NVIDIA Tesla A100 40GB GPU for 40 hours. During training, it saw 2576K documents, corresponding to roughly 3 epochs.
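As a rough sanity check on these figures, the short calculation below derives the implied number of documents per epoch and the average training throughput. Since "roughly 3 epochs" is itself approximate, the results are estimates only. The implied epoch size is below the dataset's total of about 1M documents, which would be consistent with training on a subset such as a training split, though the text above does not state this.

```python
docs_seen = 2_576_000  # documents seen during training (2576K)
epochs = 3             # "roughly 3 epochs", so an approximation
train_hours = 40       # total training time in hours

docs_per_epoch = docs_seen / epochs
docs_per_second = docs_seen / (train_hours * 3600)

print(round(docs_per_epoch))      # 858667
print(round(docs_per_second, 1))  # 17.9
```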

Use

The example below assumes you are using the provided Summarizer.ipynb file.

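from collections import OrderedDict

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Summarizer is assumed to be the class defined in the provided Summarizer.ipynb notebook.

def summ_config():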
    cfg = OrderedDict([
        # summarization model - checkpoint from website
        ("model_name", "krotima1/mbart-at2h-s"),
        ("inference_cfg", OrderedDict([
            ("num_beams", 4),
            ("top_k", 40),
            ("top_p", 0.92),
            ("do_sample", True),
            ("temperature", 0.89),
            ("repetition_penalty", 1.2),
            ("no_repeat_ngram_size", None),
            ("early_stopping", True),
            ("max_length", 64),
            ("min_length", 10),
        ])),
        # texts to summarize
        ("text",
            [
                "Input your Czech text",
            ]
        ),
    ])
    return cfg

cfg = summ_config()

# load the fine-tuned model and its tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(cfg["model_name"])
tokenizer = AutoTokenizer.from_pretrained(cfg["model_name"])

# initialize the summarizer (class from Summarizer.ipynb) and summarize the input texts
summarize = Summarizer(model, tokenizer, cfg["inference_cfg"])
summarize(cfg["text"])
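
If the notebook's Summarizer class is not at hand, the same inference settings could in principle be passed to the model's `generate` method directly. The sketch below only prepares the keyword arguments. The helper `build_generate_kwargs` is hypothetical, and dropping entries set to None (so that library defaults apply) is an assumption about reasonable handling, not a description of what Summarizer does internally. Note that `do_sample=True` combined with `num_beams=4` selects beam-search sampling in the transformers library rather than plain beam search.

```python
from collections import OrderedDict


def build_generate_kwargs(inference_cfg):
    """Return a plain dict of generation settings, omitting entries set to None."""
    return {key: value for key, value in inference_cfg.items() if value is not None}


# Same settings as in the configuration example above.
inference_cfg = OrderedDict([
    ("num_beams", 4),
    ("top_k", 40),
    ("top_p", 0.92),
    ("do_sample", True),
    ("temperature", 0.89),
    ("repetition_penalty", 1.2),
    ("no_repeat_ngram_size", None),
    ("early_stopping", True),
    ("max_length", 64),
    ("min_length", 10),
])

gen_kwargs = build_generate_kwargs(inference_cfg)
print(sorted(gen_kwargs))

# With a model and tokenizer loaded as in the example above, a direct call
# might look like this (not run here, since it requires downloading the model):
#   inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
#   output_ids = model.generate(**inputs, **gen_kwargs)
#   summaries = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

The `max_length=512` in the commented tokenizer call mirrors the encoder limit used during training.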