UMLS-KGI-BERT-multilingual

This is a trilingual (FR, EN, ES) BERT encoder trained jointly on the European Clinical Case (E3C) corpus and the UMLS Metathesaurus knowledge graph, as described in this paper. The training corpus is a custom combination of clinical documents from E3C and text sequences derived from the Metathesaurus (see our GitHub repo for more details).

Model Details

This model was trained using a multi-task approach combining Masked Language Modelling with knowledge-graph-based classification and fill-mask objectives. The aim of this framework is to improve the robustness of specialised biomedical BERT models by having them learn from structured data as well as natural language, while remaining within the cross-entropy-based learning paradigm.
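
As a rough sketch of how such a joint objective can be set up (an illustration only, not the actual training code; the example texts, masking scheme and loss weighting are assumptions, and the real pipeline is in the GitHub repository linked above), both the standard MLM term and the knowledge-graph term reduce to cross-entropy losses that can simply be summed:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Hypothetical sketch: a single DistilBERT-style encoder whose fill-mask head
# scores both clinical free text and UMLS-derived sequences, so every
# objective is a cross-entropy loss and the per-task losses can be summed.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-multilingual-cased")

def masked_lm_loss(text: str) -> torch.Tensor:
    """Mask one token and return the cross-entropy loss of predicting it."""
    enc = tokenizer(text, return_tensors="pt")
    labels = enc["input_ids"].clone()
    mask_pos = 3                 # illustrative; real training masks 15% at random
    labels[:, :] = -100          # ignore all unmasked positions in the loss
    labels[0, mask_pos] = enc["input_ids"][0, mask_pos]
    enc["input_ids"][0, mask_pos] = tokenizer.mask_token_id
    return model(**enc, labels=labels).loss

# Clinical free text (standard MLM) and a verbalised UMLS triple (KG objective).
mlm_loss = masked_lm_loss("Le patient présente une fièvre persistante depuis trois jours.")
kg_loss = masked_lm_loss("fièvre est un symptôme d'une infection")  # illustrative triple text
total_loss = mlm_loss + kg_loss  # joint objective: both terms are cross-entropy
total_loss.backward()
```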

  • Developed by: Aidan Mannion
  • Funded by: GENCI-IDRIS grant AD011013535R1
  • Model type: DistilBERT
  • Language(s) (NLP): French, English, Spanish

For further details on the model architecture, training objectives, hardware & software used, as well as the preliminary downstream evaluation experiments carried out, refer to the arXiv paper.

UMLS-KGI Models

| Model | Model Repo | Dataset Size | Base Architecture | Base Model | Total KGI training steps |
|---|---|---|---|---|---|
| UMLS-KGI-BERT-multilingual | url-multi | 940MB | DistilBERT | n/a | 163,904 |
| UMLS-KGI-BERT-FR | url-fr | 604MB | DistilBERT | n/a | 126,720 |
| UMLS-KGI-BERT-EN | url-en | 174MB | DistilBERT | n/a | 19,008 |
| UMLS-KGI-BERT-ES | url-es | 162MB | DistilBERT | n/a | 18,176 |
| DrBERT-UMLS-KGI | url-drbert | 604MB | CamemBERT/RoBERTa | DrBERT-4GB | 126,720 |
| PubMedBERT-UMLS-KGI | url-pubmedbert | 174MB | BERT | microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract | 19,008 |
| BioRoBERTa-ES-UMLS-KGI | url-bioroberta | 162MB | RoBERTa | RoBERTa-base-biomedical-es | 18,176 |

Direct/Downstream Use

This model is intended for use in experimental clinical/biomedical NLP work, either as part of a larger system requiring text encoding or fine-tuned on a specific downstream task requiring clinical language modelling. It has not been sufficiently tested for accuracy, robustness and bias to be used in production settings.
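
As a minimal usage sketch for text encoding (the repository id below is a placeholder; substitute the actual Hugging Face Hub id of this model):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder repository id; replace with the actual Hub id of this model.
model_id = "your-org/UMLS-KGI-BERT-multilingual"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

text = "Le patient présente une insuffisance rénale aiguë."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pooled sentence embedding from the final hidden states: (1, hidden_size).
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)
```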

Out-of-Scope Use

Experiments on general-domain data suggest that, given its specialised training corpus, this model is not suitable for out-of-domain NLP tasks, and we recommend that it only be used for processing clinical text.

Training Data

Training Hyperparameters

  • sequence length: 256
  • learning rate: 7.5e-5
  • learning rate schedule: linear, with 10,770 warmup steps
  • effective batch size: 1,500 (15 sequences per batch × 100 gradient accumulation steps)
  • MLM masking probability: 0.15

Training regime: the model was trained in (non-mixed) fp16 precision, using the AdamW optimizer with default parameters.
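
For orientation, these settings map roughly onto a Hugging Face TrainingArguments configuration like the one below. This is an approximation, not the original training script (which is available in the GitHub repository); in particular, the standard fp16 flag enables mixed precision rather than the non-mixed fp16 used here.

```python
from transformers import TrainingArguments

# Approximate reconstruction of the hyperparameters listed above;
# not the authors' exact training configuration.
training_args = TrainingArguments(
    output_dir="umls-kgi-bert-multilingual",
    per_device_train_batch_size=15,    # 15 sequences per batch
    gradient_accumulation_steps=100,   # effective batch size of 1,500
    learning_rate=7.5e-5,
    lr_scheduler_type="linear",
    warmup_steps=10_770,
    fp16=True,                         # closest standard flag; original used non-mixed fp16
    optim="adamw_torch",               # AdamW with default parameters
)
# The sequence length (256) and MLM masking probability (0.15) are set in the
# tokenizer / data collator rather than in TrainingArguments.
```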

Evaluation

Testing Data, Factors & Metrics

Testing Data

This model was evaluated on the following datasets:

Metrics

We report macro-averaged F1 scores here. Given that all of the downstream token-classification tasks in these experiments show significant class imbalance, the weighted-average scores tend to be uniformly higher than their macro-averaged counterparts; in the interest of representing the less prevalent classes more fairly and highlighting the difficulty of capturing the long-tailed label distributions in these datasets, we stick to the macro average.
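
To illustrate why the choice of average matters (toy labels, not results from these experiments), scikit-learn's two averaging modes can diverge sharply under class imbalance:

```python
from sklearn.metrics import f1_score

# Toy token-classification labels with heavy class imbalance ("O" dominates);
# illustrative only, not data from the evaluation.
y_true = ["O"] * 90 + ["DISO"] * 8 + ["PROC"] * 2
y_pred = ["O"] * 90 + ["DISO"] * 4 + ["O"] * 4 + ["O"] * 2

print(f1_score(y_true, y_pred, average="weighted"))  # dominated by the majority class
print(f1_score(y_true, y_pred, average="macro"))     # penalises the missed rare classes
```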

Results

[More Information Needed]

Citation [BibTeX]

@inproceedings{mannion-etal-2023-umls,
    title = "{UMLS}-{KGI}-{BERT}: Data-Centric Knowledge Integration in Transformers for Biomedical Entity Recognition",
    author = "Mannion, Aidan  and
      Schwab, Didier  and
      Goeuriot, Lorraine",
    booktitle = "Proceedings of the 5th Clinical Natural Language Processing Workshop",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.clinicalnlp-1.35",
    pages = "312--322",
    abstract = "Pre-trained transformer language models (LMs) have in recent years become the dominant paradigm in applied NLP. These models have achieved state-of-the-art performance on tasks such as information extraction, question answering, sentiment analysis, document classification and many others. In the biomedical domain, significant progress has been made in adapting this paradigm to NLP tasks that require the integration of domain-specific knowledge as well as statistical modelling of language. In particular, research in this area has focused on the question of how best to construct LMs that take into account not only the patterns of token distribution in medical text, but also the wealth of structured information contained in terminology resources such as the UMLS. This work contributes a data-centric paradigm for enriching the language representations of biomedical transformer-encoder LMs by extracting text sequences from the UMLS. This allows for graph-based learning objectives to be combined with masked-language pre-training. Preliminary results from experiments in the extension of pre-trained LMs as well as training from scratch show that this framework improves downstream performance on multiple biomedical and clinical Named Entity Recognition (NER) tasks. All pre-trained models, data processing pipelines and evaluation scripts will be made publicly available.",
}

@misc{mannion2023umlskgibert,
      title={UMLS-KGI-BERT: Data-Centric Knowledge Integration in Transformers for Biomedical Entity Recognition}, 
      author={Aidan Mannion and Thierry Chevalier and Didier Schwab and Lorraine Goeuriot},
      year={2023},
      eprint={2307.11170},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}