🦠 BIOMEDtra 🏥

BIOMEDtra (small) is an ELECTRA-like model (the discriminator, in this case) trained on a Spanish biomedical crawled corpus (see Dataset details below).

As mentioned in the original paper: ELECTRA is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the SQuAD 2.0 dataset.
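
A toy sketch of the replaced-token-detection idea described above (illustrative only; the sentences and the "generator" replacements are made up, and this is not the actual training pipeline): the discriminator's per-token targets are simply whether each token in a corrupted copy differs from the original input.

# Toy illustration of ELECTRA's replaced-token-detection objective (not real training code)
original  = ["los", "españoles", "sufren", "déficit", "de", "vitamina", "c"]
corrupted = ["los", "españoles", "toman",  "déficit", "de", "vitamina", "d"]  # hypothetical generator output

# Discriminator targets: 1 where the generator replaced the token, 0 where it is original
labels = [int(o != c) for o, c in zip(original, corrupted)]
print(list(zip(corrupted, labels)))
# [('los', 0), ('españoles', 0), ('toman', 1), ('déficit', 0), ('de', 0), ('vitamina', 0), ('d', 1)]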

For a detailed description and experimental results, please refer to the paper ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.

Training details

The model was trained with the official ELECTRA codebase for 3 days on 1 GPU (Tesla V100, 16 GB).

Dataset details

The largest Spanish biomedical and health corpus to date was gathered with a massive crawler of the Spanish health domain: more than 3,000 URLs were downloaded and preprocessed. The collected data were then processed to produce CoWeSe (Corpus Web Salud Español), a large-scale, high-quality corpus intended for biomedical and health NLP in Spanish.

Model details ⚙

| Param  | Value |
| ------ | ----- |
| Layers | 12    |
| Hidden | 256   |
| Params | 14M   |
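
These hyperparameters can be verified against the configuration shipped with the checkpoint. A minimal sketch, assuming only the model id used later in this card:

from transformers import ElectraConfig, ElectraForPreTraining

config = ElectraConfig.from_pretrained("mrm8488/biomedtra-small-es")
print(config.num_hidden_layers)  # transformer layers (12 per the table above)
print(config.hidden_size)        # hidden dimension (256 per the table above)

# Rough parameter count of the discriminator (~14M for the small configuration)
model = ElectraForPreTraining.from_pretrained("mrm8488/biomedtra-small-es")
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")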

Evaluation metrics (for discriminator) 🧾

| Metric    | Score  |
| --------- | ------ |
| Accuracy  | 0.9561 |
| Precision | 0.808  |
| Recall    | 0.531  |
| AUC       | 0.949  |
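
These scores refer to the discriminator's token-level classification of original vs. replaced tokens. As a rough illustration of how such metrics can be computed from discriminator logits (this is not the evaluation script behind the numbers above, and the labels and logits below are hypothetical):

import torch
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Hypothetical token-level labels (1 = replaced token) and raw discriminator logits
labels = torch.tensor([0, 0, 1, 0, 1, 0])
logits = torch.tensor([-2.1, -1.3, 0.7, -0.4, 1.9, -2.5])

probs = torch.sigmoid(logits)   # probability that each token was replaced
preds = (probs > 0.5).long()    # hard predictions at a 0.5 threshold

print("Accuracy :", accuracy_score(labels, preds))
print("Precision:", precision_score(labels, preds))
print("Recall   :", recall_score(labels, preds))
print("AUC      :", roc_auc_score(labels, probs))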

Benchmarks 🔨

WIP 🚧

How to use the discriminator in transformers

from transformers import ElectraForPreTraining, ElectraTokenizerFast
import torch

discriminator = ElectraForPreTraining.from_pretrained("mrm8488/biomedtra-small-es")
tokenizer = ElectraTokenizerFast.from_pretrained("mrm8488/biomedtra-small-es")

sentence = "Los españoles tienden a sufrir déficit de vitamina c"
fake_sentence = "Los españoles tienden a déficit sufrir de vitamina c"

# Tokenize the corrupted sentence (with special tokens, so tokens align with the predictions)
fake_tokens = tokenizer.tokenize(fake_sentence, add_special_tokens=True)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
discriminator_outputs = discriminator(fake_inputs)

# 1 = the discriminator flags the token as replaced ("fake"), 0 = original
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)

[print("%7s" % token, end="") for token in fake_tokens]
print()
[print("%7s" % int(prediction), end="") for prediction in predictions.squeeze().tolist()]

Acknowledgments

TBA

Citation

If you want to cite this model, you can use the following:

@misc{mromero2022biomedtra,
  title={Spanish BioMedical Electra (small)},
  author={Romero, Manuel},
  publisher={Hugging Face},
  journal={Hugging Face Hub},
  howpublished={\url{https://huggingface.co/mrm8488/biomedtra-small-es}},
  year={2022}
}

Created by Manuel Romero/@mrm8488

Made with ♥ in Spain
