
Lb_mBERT

Lb_mBERT is a BERT-like language model for the Luxembourgish language.

We initialized the model with the weights of multilingual BERT (mBERT) and continued pre-training it on the masked language modeling (MLM) task, using the same corpus as for our LuxemBERT model (https://huggingface.co/lothritz/LuxemBERT).
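
For readers unfamiliar with the MLM objective, the following is a minimal, dependency-free sketch of the token-masking step as described in the original BERT recipe (select about 15% of positions; of those, replace 80% with `[MASK]`, 10% with a random token, and leave 10% unchanged). It is an illustration only: the function name `mask_tokens`, the toy vocabulary, and the example sentence are hypothetical, and the exact masking settings used to train Lb_mBERT are not stated here.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) using BERT-style MLM masking.

    labels[i] holds the original token at positions selected for
    prediction and None everywhere else.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN  # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


if __name__ == "__main__":
    # Toy, already-tokenized input (hypothetical example sentence).
    sentence = ["moien", "wéi", "geet", "et", "dir"]
    toy_vocab = ["haus", "buch", "waasser"]
    print(mask_tokens(sentence, toy_vocab, seed=0))
```

During continued pre-training, the model receives the masked sequence and is trained to predict the original tokens at the positions where `labels` is not `None`.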

On some downstream tasks, Lb_mBERT outperforms both the original LuxemBERT and DA BERT (https://huggingface.co/iolariu/DA_BERT), another Luxembourgish BERT model.

If you would like to know more about our work or the pre-training corpus, or to use our models or datasets, please see and cite the following papers:

@inproceedings{lothritz-etal-2022-luxembert,
    title = "{L}uxem{BERT}: Simple and Practical Data Augmentation in Language Model Pre-Training for {L}uxembourgish",
    author = "Lothritz, Cedric  and
      Lebichot, Bertrand  and
      Allix, Kevin  and
      Veiber, Lisa  and
      Bissyande, Tegawende  and
      Klein, Jacques  and
      Boytsov, Andrey  and
      Lefebvre, Cl{\'e}ment  and
      Goujon, Anne",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.543",
    pages = "5080--5089",
    abstract = "Pre-trained Language Models such as BERT have become ubiquitous in NLP where they have achieved state-of-the-art performance in most NLP tasks. While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish. In this paper, we present LuxemBERT, a BERT model for the Luxembourgish language that we create using the following approach: we augment the pre-training dataset by considering text data from a closely related language that we partially translate using a simple and straightforward method. We are then able to produce the LuxemBERT model, which we show to be effective for various NLP tasks: it outperforms a simple baseline built with the available Luxembourgish text data as well the multilingual mBERT model, which is currently the only option for transformer-based language models in Luxembourgish. Furthermore, we present datasets for various downstream NLP tasks that we created for this study and will make available to researchers on request.",
}

@inproceedings{lothritz2023comparing,
    title = "Comparing Pre-Training Schemes for Luxembourgish BERT Models",
    author = "Lothritz, Cedric  and
      Ezzini, Saad  and
      Purschke, Christoph  and
      Bissyande, Tegawend{\'e} Fran{\c{c}}ois D Assise  and
      Klein, Jacques  and
      Olariu, Isabella  and
      Boytsov, Andrey  and
      Lefebvre, Clement  and
      Goujon, Anne",
    booktitle = "Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)",
    year = "2023",
}