arxiv:2112.08804

CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs

Published on Dec 16, 2021

Abstract

We present CrossSum, a large-scale cross-lingual abstractive summarization dataset comprising 1.7 million article-summary samples in 1500+ language pairs. We create CrossSum by aligning identical articles written in different languages via cross-lingual retrieval from a multilingual summarization dataset. We propose a multi-stage data sampling algorithm to effectively train a cross-lingual summarization model capable of summarizing an article in any target language. We also propose LaSE, a new metric for automatically evaluating model-generated summaries, which shows a strong correlation with ROUGE. Performance on ROUGE and LaSE indicates that pretrained models fine-tuned on CrossSum consistently outperform baseline models, even when the source and target languages are linguistically distant. To the best of our knowledge, CrossSum is the largest cross-lingual summarization dataset and the first that does not rely solely on English as the pivot language. We are releasing the dataset, the alignment and training scripts, and the models to spur future research on cross-lingual abstractive summarization. The resources are available at https://github.com/csebuetnlp/CrossSum.
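
The alignment step described above can be pictured with a short sketch: embed articles from two languages with a multilingual sentence encoder and match them by cosine similarity. This is a minimal sketch, not the paper's exact pipeline; it assumes LaBSE embeddings via the sentence-transformers library, toy article strings, and a hypothetical similarity threshold.

```python
# Minimal sketch of cross-lingual article alignment via multilingual
# sentence embeddings. Assumes the sentence-transformers library and the
# LaBSE model; the paper's actual pipeline may differ in its matching
# strategy, thresholds, and filtering.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

en_articles = ["A cyclone struck coastal districts on Monday, displacing thousands."]
bn_articles = ["সোমবার ঘূর্ণিঝড় উপকূলীয় জেলাগুলোতে আঘাত হানে, হাজারো মানুষ বাস্তুচ্যুত হয়।"]

# Embed both sides; normalized embeddings make the dot product a cosine similarity.
en_emb = model.encode(en_articles, normalize_embeddings=True)
bn_emb = model.encode(bn_articles, normalize_embeddings=True)

# Similarity matrix: rows are English articles, columns are Bengali articles.
sim = en_emb @ bn_emb.T

# Greedy nearest-neighbor matching with a hypothetical threshold.
THRESHOLD = 0.7
for i, row in enumerate(sim):
    j = int(np.argmax(row))
    if row[j] >= THRESHOLD:
        print(f"en[{i}] <-> bn[{j}]  cosine={row[j]:.3f}")
```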
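
In the same spirit, the core idea behind an embedding-based evaluation metric like LaSE can be approximated as the similarity between the embedding of a generated summary and that of a reference. The sketch below is an illustrative simplification assuming LaBSE embeddings; it omits components of the full metric, whose exact formulation is in the released code.

```python
# Illustrative approximation of an embedding-based summary evaluation
# score in the spirit of LaSE: cosine similarity between LaBSE embeddings
# of a generated summary and a reference. The full metric in the paper
# has additional components; see the released code for the exact form.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def meaning_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between normalized summary embeddings."""
    emb = model.encode([generated, reference], normalize_embeddings=True)
    return float(emb[0] @ emb[1])

print(meaning_similarity(
    "A cyclone hit the coast, displacing thousands.",
    "Thousands were displaced after a cyclone struck coastal areas.",
))
```

Because the encoder is language-agnostic, a score like this can be computed even when the generated summary and the reference are in different languages, which is what makes an embedding-based metric attractive for cross-lingual evaluation.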

Models citing this paper: 7

Datasets citing this paper: 1

Spaces citing this paper: 3
