Dataset Card for Custom Text Dataset

Dataset Name

  • Custom Text Summarization Dataset (CNN/DailyMail Subset)

Overview

This dataset contains a subset of the CNN/DailyMail news dataset, which is used for training text summarization models. The dataset consists of articles paired with human-generated summaries. It is widely used in the development of natural language processing models for summarization tasks.

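  • Number of examples in the full CNN/DailyMail 3.0.0 release: 287,113 (training set), 13,368 (validation set), 11,490 (test set). The 1% training subset described in this card therefore holds roughly 2,870 examples.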
  • Languages: English

Composition

  • Source: CNN and DailyMail news articles
  • Size: the first 1% of the full dataset's training split
  • Text Fields: Each example consists of:
    • article: The news article text
    • highlights: The human-generated summary of the article
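
As a rough illustration of the two text fields listed above, the sketch below models one example as a typed dictionary and checks that both fields are present and non-empty. The class name `SummarizationExample`, the helper `is_well_formed`, and the sample record are hypothetical and are not part of the dataset or of any library.

```python
from typing import TypedDict


class SummarizationExample(TypedDict):
    """Shape of one example, using the two fields named in this card."""
    article: str      # the news article text
    highlights: str   # the human-generated summary of the article


def is_well_formed(example: dict) -> bool:
    """Return True if both text fields exist and contain non-blank strings."""
    return all(
        isinstance(example.get(key), str) and bool(example[key].strip())
        for key in ("article", "highlights")
    )


# Made-up record for illustration only; it is not taken from the dataset.
sample: SummarizationExample = {
    "article": "The city council voted on Tuesday to fund a new bridge.",
    "highlights": "Council approves funding for a new bridge.",
}
print(is_well_formed(sample))
```

A check like this can be useful as a quick filter before training, since empty articles or summaries tend to produce degenerate training pairs.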

Collection Process

The underlying corpus was collected by scraping news articles from the CNN and DailyMail websites. Each article is paired with the highlights written alongside it by the article's authors, which serve as the reference summary. The corpus was originally built for machine reading comprehension and question answering; the summarization version used here was derived from it later.

Preprocessing

  • Tokenization using a pretrained tokenizer (e.g., T5 tokenizer)
  • Maximum token length capped at 512 for both input and output sequences
  • Lowercasing of all texts to maintain consistency
  • Special tokens for start and end of sequences
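
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration and not the pipeline actually used: whitespace splitting stands in for the pretrained tokenizer, the `<s>` and `</s>` markers are placeholder special tokens, and counting those markers toward the 512-token cap is an assumption.

```python
from typing import List

MAX_TOKENS = 512           # cap stated in this card, for inputs and outputs
BOS, EOS = "<s>", "</s>"   # placeholder start/end markers (assumption)


def preprocess(text: str, max_tokens: int = MAX_TOKENS) -> List[str]:
    """Lowercase, tokenize, truncate, and wrap a text in special tokens."""
    # Whitespace splitting is a stand-in for a pretrained subword tokenizer.
    tokens = text.lower().split()
    # Reserve two positions so the special tokens fit inside the cap (assumption).
    tokens = tokens[: max_tokens - 2]
    return [BOS, *tokens, EOS]


# Made-up example pair, not taken from the dataset.
example = {
    "article": "The Mayor OPENED a new bridge.",
    "highlights": "Mayor opens bridge.",
}
model_input = preprocess(example["article"])
model_target = preprocess(example["highlights"])
print(model_input)
print(model_target)
```

With a real subword tokenizer the token counts will differ from whitespace counts, so a 512-token cap typically truncates long news articles well before their end.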

How to Use

from datasets import load_dataset

# Load the first 1% of the CNN/DailyMail (version 3.0.0) training split
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:1%]")

Evaluation

  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
  • BLEU (Bilingual Evaluation Understudy)
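
To make the ROUGE idea concrete, here is a simplified sketch of ROUGE-N computed from clipped n-gram overlap between a generated summary and a reference. It is not the official scorer: it uses lowercased whitespace tokens with no stemming, and the function name `rouge_n` is chosen here for illustration. BLEU is not covered by this sketch.

```python
from collections import Counter
from typing import Dict


def rouge_n(candidate: str, reference: str, n: int = 1) -> Dict[str, float]:
    """Simplified ROUGE-N: precision, recall, and F1 over clipped n-gram overlap."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up sentences for illustration only.
generated = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(rouge_n(generated, reference, n=1))
print(rouge_n(generated, reference, n=2))
```

In this toy pair, five of the six reference unigrams and three of the five reference bigrams are matched, so ROUGE-1 recall is 5/6 and ROUGE-2 recall is 3/5. For reported results, an established scoring package is the safer choice, since tokenization and stemming choices change the numbers.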

Limitations

  • Data bias: The dataset is composed of news articles from only two major sources, CNN and DailyMail, which may introduce a specific writing style and focus into the summaries.
  • Domain-specific issues: The dataset is limited to news articles and may not generalize well to other domains such as scientific texts or casual conversations.

Ethical Considerations

  • Privacy: Since the dataset consists of publicly available news articles, privacy concerns are minimal. However, users should be cautious when generating summaries for sensitive or private information.
  • Bias: News articles from CNN and DailyMail may reflect specific political or cultural biases, which could influence the summaries generated by models trained on this dataset.