Sports Text Classifier

Overview

The Sports Text Classifier is a core component of the OnlySports Dataset creation pipeline. It identifies and extracts sports-related documents from a large corpus of web content.

Model Architecture

  • Base model: Snowflake-arctic-embed-xs
  • Additional layer: Binary classification layer
  • Training: 10 epochs with a learning rate of 3e-4
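
As a rough illustration of the architecture listed above, the binary classification layer can be thought of as a single linear map from a pooled document embedding to one logit, followed by a sigmoid. The sketch below is not the released training code: the class name `BinaryHead`, the embedding width of 384, and the random initialisation are assumptions made for illustration, and it uses plain Python so it runs without a deep-learning framework.

```python
import math
import random

# Assumed embedding width of the base encoder; verify against the model config.
EMBED_DIM = 384


class BinaryHead:
    """Hypothetical binary head: one linear layer producing a single logit."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        # Small random weights, as a freshly initialised layer would have.
        self.weights = [rng.gauss(0.0, 0.02) for _ in range(dim)]
        self.bias = 0.0

    def logit(self, embedding):
        if len(embedding) != len(self.weights):
            raise ValueError("embedding width does not match the head")
        return sum(w * x for w, x in zip(self.weights, embedding)) + self.bias

    def probability(self, embedding):
        """Return the sigmoid of the logit, read as P(document is sports)."""
        z = self.logit(embedding)
        # Numerically stable sigmoid for both signs of z.
        if z >= 0:
            return 1.0 / (1.0 + math.exp(-z))
        ez = math.exp(z)
        return ez / (1.0 + ez)
```

In a real setup the embedding would come from the base encoder and the head's weights would be learned during the 10 training epochs; here they are untrained placeholders.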

Performance

The classifier distinguishes sports from non-sports documents with high accuracy:

[Figure: classifier performance results]

Training Data

The classifier was trained on roughly 100k labeled documents, split about 64% sports and 36% non-sports:

  • 64k samples from seven prestigious sports websites
  • 36k non-sports text documents classified using GPT-3.5

Usage

This classifier is primarily used to create the OnlySports Dataset, presented in the accompanying paper. It can also be applied to filter other large text corpora for sports-related content.
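
Filtering a corpus with the classifier amounts to scoring each document and keeping those above a cutoff. The following is a minimal sketch under stated assumptions: `filter_sports`, the 0.5 threshold, and `toy_score` are hypothetical names and values, and `toy_score` is only a keyword stand-in for the real model so the example runs on its own.

```python
from typing import Callable, Iterable, Iterator


def filter_sports(
    docs: Iterable[str],
    score: Callable[[str], float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> Iterator[str]:
    """Yield documents whose sports score is at or above the threshold."""
    for doc in docs:
        if score(doc) >= threshold:
            yield doc


def toy_score(text: str) -> float:
    """Stand-in scorer for illustration; replace with the classifier's output."""
    keywords = {"goal", "match", "league", "playoff"}
    words = set(text.lower().split())
    return 1.0 if words & keywords else 0.0
```

Because `filter_sports` is a generator, it can stream over a corpus that does not fit in memory; passing the real model's probability function as `score` is the only change needed.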

Integration

The classifier is integrated into a MapReduce architecture for efficient processing of large-scale datasets. It's used in conjunction with URL keyword filtering to create a comprehensive sports text dataset.
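
One way to picture this integration is a map step that filters each shard of the corpus and a reduce step that merges the surviving records. The sketch below is an assumption about how the pieces could fit together, not the pipeline's actual code: the keyword list, the record layout (`url` and `text` keys), the choice to require both the URL match and the classifier score, and all function names are hypothetical.

```python
from functools import reduce

# Hypothetical keyword list; the real pipeline's keywords are not given here.
URL_KEYWORDS = ("sport", "football", "nba")


def url_matches(url):
    """Return True if the URL contains any of the sports keywords."""
    lowered = url.lower()
    return any(keyword in lowered for keyword in URL_KEYWORDS)


def map_shard(shard, score, threshold=0.5):
    """Map step: keep records that pass the URL filter and the classifier."""
    return [
        record
        for record in shard
        if url_matches(record["url"]) and score(record["text"]) >= threshold
    ]


def merge(left, right):
    """Reduce step: concatenate the filtered records of two shards."""
    return left + right


def run(shards, score):
    """Apply the map step to every shard, then fold the results together."""
    return reduce(merge, (map_shard(shard, score) for shard in shards), [])
```

Checking the cheap URL filter before calling `score` means the classifier only runs on records that already look promising, which matters when the corpus is large.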

Related Projects

This classifier is part of the larger OnlySports collection. For more information, see our paper.

Model Details

  • Format: Safetensors
  • Model size: 22.7M params
  • Tensor type: F32