jakelever
/

coronabert

Text Classification

Inference Endpoints

Model card Files Files and versions Community

coronabert / README.md

jakelever's picture

More metadata

1f0cd54 over 3 years ago

|

2.99 kB

	---
	language: en
	thumbnail: https://raw.githubusercontent.com/jakelever/corona-ml/master/logo-circle.png
	tags:
	- coronavirus
	- covid
	- bionlp
	datasets:
	- cord19
	- pubmed
	license: mit
	---
	# CoronaCentral BERT Model for Topic / Article Type Classification

	This is the topic / article type classification for the [CoronaCentral website](https://coronacentral.ai). This forms part of the pipeline for downloading and processing coronavirus literature described in the [corona-ml repo](https://github.com/jakelever/corona-ml) with available [step-by-step descriptions](https://github.com/jakelever/corona-ml/blob/master/stepByStep.md). The method is described in the [preprint](https://doi.org/10.1101/2020.12.21.423860) and detailed performance results can be found in the [machine learning details](https://github.com/jakelever/corona-ml/blob/master/machineLearningDetails.md) document.

	This is derived from the [microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract) model and fine-tuned for the sequence classification task.

	## Usage

	Below are two Google Colab notebooks with example usage of this sequence classification model using HuggingFace transformers and KTrain.

	- [HuggingFace example on Google Colab](https://colab.research.google.com/drive/1cBNgKd4o6FNWwjKXXQQsC_SaX1kOXDa4?usp=sharing)
	- [KTrain example on Google Colab](https://colab.research.google.com/drive/1h7oJa2NDjnBEoox0D5vwXrxiCHj3B1kU?usp=sharing)

	## Training Data

	The model is trained on ~3200 manually-curated articles sampled at various stages during the coronavirus pandemic. The code for training is available in the [category\_prediction](https://github.com/jakelever/corona-ml/tree/master/category_prediction) directory of the main Github Repo. The data is available in the [annotated_documents.json.gz](https://github.com/jakelever/corona-ml/blob/master/category_prediction/annotated_documents.json.gz) file.

	## Inputs and Outputs

	The model takes in a tokenized title and abstract (combined into a single string and separated by a new line). The outputs are topics and article types, broadly called categories in the pipeline code. The types are listed below. Some others are managed by hand-coded rules described in the [step-by-step descriptions](https://github.com/jakelever/corona-ml/blob/master/stepByStep.md).

	### List of Article Types

	- Comment/Editorial
	- Meta-analysis
	- News
	- Review

	### List of Topics

	- Clinical Reports
	- Communication
	- Contact Tracing
	- Diagnostics
	- Drug Targets
	- Education
	- Effect on Medical Specialties
	- Forecasting & Modelling
	- Health Policy
	- Healthcare Workers
	- Imaging
	- Immunology
	- Inequality
	- Infection Reports
	- Long Haul
	- Medical Devices
	- Misinformation
	- Model Systems & Tools
	- Molecular Biology
	- Non-human
	- Non-medical
	- Pediatrics
	- Prevalence
	- Prevention
	- Psychology
	- Recommendations
	- Risk Factors
	- Surveillance
	- Therapeutics
	- Transmission
	- Vaccines