---
license: apache-2.0
base_model: bert-base-cased
tags:
- generated_from_trainer
- medical
model-index:
- name: bert-base-cased-biomedical-ner
  results: []
language:
- en
datasets:
- EMBO/SourceData
pipeline_tag: token-classification
---


# Model Card: bert-base-cased-biomedical-ner

## Model Details

- **Model Name**: bert-base-cased-biomedical-ner
- **Model Architecture**: BERT (Bidirectional Encoder Representations from Transformers)
- **Pre-trained Model**: [bert-base-cased](https://huggingface.co/bert-base-cased)
- **Fine-tuned on**: [SourceData Dataset](https://huggingface.co/datasets/EMBO/SourceData)

## Model Description

`bert-base-cased-biomedical-ner` is a fine-tuned variant of BERT (Bidirectional Encoder Representations from Transformers) for Named Entity Recognition (NER) in the biomedical domain. It was fine-tuned on the SourceData dataset, a large biomedical corpus built for machine learning and AI in the publishing context.

NER is a core task in natural language processing. In the biomedical field, identifying and classifying entities such as genes, proteins, and diseases underpins applications including information retrieval, knowledge extraction, and data mining.

## Intended Use

The `bert-base-cased-biomedical-ner` model is intended for NER tasks within the biomedical domain. It can be used for a range of applications, including but not limited to:

- Identifying and extracting biomedical entities (e.g., genes, proteins, diseases) from unstructured text.
- Enhancing information retrieval systems for scientific literature.
- Supporting knowledge extraction and data mining from biomedical literature.
- Facilitating the creation of structured biomedical databases.

## Labels

| Label           | Description                                      |
|-----------------|--------------------------------------------------|
| SMALL_MOLECULE  | Small molecules                                  |
| GENEPROD        | Gene products (genes and proteins)               |
| SUBCELLULAR     | Subcellular components                           |
| CELL_LINE       | Cell lines                                       |
| CELL_TYPE       | Cell types                                       |
| TISSUE          | Tissues and organs                               |
| ORGANISM        | Species                                          |
| DISEASE         | Diseases                                         | 
| EXP_ASSAY       | Experimental assays                              |

*Source of label information: [EMBO/SourceData Dataset](https://huggingface.co/datasets/EMBO/SourceData)*

## Usage
```python
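from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
import pandas as pd

# Load the fine-tuned tokenizer and token-classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("Kushtrim/bert-base-cased-biomedical-ner")
model = AutoModelForTokenClassification.from_pretrained("Kushtrim/bert-base-cased-biomedical-ner")

# aggregation_strategy="first" merges word pieces into whole words and
# labels each word with the prediction of its first sub-token
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

text = "Add your text here"

# Each result is a dict with entity_group, score, word, start, and end
results = ner(text)

# Show the predicted entities as a table
pd.DataFrame.from_records(results)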
```
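
The aggregated pipeline output is a list of dictionaries, so it can be post-processed with plain Python. The sketch below groups predicted words by label and drops low-confidence predictions. The helper `group_entities`, the `min_score` threshold of 0.5, and the sample predictions are illustrative assumptions, not part of the model or its published results.

```python
from collections import defaultdict


def group_entities(results, min_score=0.5):
    """Group aggregated NER predictions by label, skipping low-confidence ones.

    `results` is expected to look like the output of a token-classification
    pipeline with an aggregation strategy: dicts with the keys
    `entity_group`, `score`, `word`, `start`, and `end`.
    """
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hypothetical predictions for illustration only (not real model output)
sample_results = [
    {"entity_group": "GENEPROD", "score": 0.98, "word": "p53", "start": 0, "end": 3},
    {"entity_group": "CELL_LINE", "score": 0.91, "word": "HeLa", "start": 21, "end": 25},
    {"entity_group": "GENEPROD", "score": 0.95, "word": "MDM2", "start": 40, "end": 44},
    {"entity_group": "DISEASE", "score": 0.30, "word": "cells", "start": 26, "end": 31},
]

entities_by_label = group_entities(sample_results)
```

With real predictions, pass the `results` list from the pipeline call in place of `sample_results`, and tune `min_score` to the precision your application needs.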


## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3
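
To make the `linear` scheduler setting concrete, the following sketch computes the learning rate at a given optimizer step under linear decay to zero. It assumes zero warmup steps, a single device, and no gradient accumulation, none of which are stated in this card. The function `linear_lr` and the dataset size `num_train_examples` are hypothetical and used only for illustration.

```python
# Hyperparameters copied from the list above
HPARAMS = {
    "learning_rate": 2e-5,
    "train_batch_size": 16,
    "num_epochs": 3,
}


def linear_lr(step, total_steps, base_lr=HPARAMS["learning_rate"], warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical dataset size, for illustration only; the card does not state it
num_train_examples = 1600

# Ceiling division: the last batch of an epoch may be smaller than the rest
steps_per_epoch = -(-num_train_examples // HPARAMS["train_batch_size"])
total_steps = steps_per_epoch * HPARAMS["num_epochs"]

lr_start = linear_lr(0, total_steps)
lr_midpoint = linear_lr(total_steps // 2, total_steps)
lr_end = linear_lr(total_steps, total_steps)
```

Under these assumptions the rate starts at the configured `learning_rate`, halves by the midpoint of training, and reaches zero at the final step.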

### Framework versions

- Transformers 4.35.0
- Pytorch 2.1.0+cu118
- Datasets 2.14.6
- Tokenizers 0.14.1