sahajBERT

A model collaboratively pre-trained on the Bengali language using the masked language modeling (MLM) and sentence order prediction (SOP) objectives.

Model description

sahajBERT is a model composed of 1) a tokenizer specially designed for Bengali and 2) an ALBERT architecture collaboratively pre-trained on a dump of Wikipedia in Bengali and the Bengali part of OSCAR.
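
To make the two pre-training objectives (MLM and SOP) concrete, here is a minimal, library-free sketch of how training examples for each can be built. It is an illustration, not the code used to train sahajBERT: the whitespace tokenization, the 15% masking probability, the simplified rule of always substituting [MASK], the 0/1 label convention, and the helper names `make_sop_example` and `mask_tokens` are all assumptions.

```python
import random

MASK_TOKEN = "[MASK]"  # same mask string as in the usage examples of this card


def make_sop_example(first_segment, second_segment, rng):
    """Return (segment_a, segment_b, label) for sentence order prediction.

    label 0: the two consecutive segments are kept in their original order.
    label 1: the two segments were swapped.
    """
    if rng.random() < 0.5:
        return first_segment, second_segment, 0
    return second_segment, first_segment, 1


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace a random subset of tokens with MASK_TOKEN.

    Returns (masked_tokens, labels); labels holds the original token at
    masked positions and None everywhere else.
    """
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


rng = random.Random(0)  # fixed seed so the sketch is reproducible
seg_a, seg_b, sop_label = make_sop_example(
    "ধন্যবাদ।", "আপনার সাথে কথা বলে ভালো লাগলো", rng
)
masked_tokens, mlm_labels = mask_tokens(seg_b.split(), rng)
```

During pre-training, the model sees the (possibly swapped) pair with some tokens masked, and is asked both to recover the masked tokens and to predict whether the pair was swapped.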

Intended uses & limitations

You can use the raw model for either masked language modeling or sentence order prediction, but it is mostly intended to be fine-tuned on downstream tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering.

We fine-tuned our model on 2 of these downstream tasks: sequence classification and token classification.
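
As a starting point for fine-tuning, the sketch below loads the pre-trained weights with a fresh sequence classification head. It is one possible setup, not the script used for the results in this card: the helper names `build_label_maps` and `load_for_sequence_classification` are hypothetical, and the label names you pass in depend entirely on your own dataset.

```python
def build_label_maps(label_names):
    """Map class names to integer ids and back, as classification heads expect."""
    id2label = dict(enumerate(label_names))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def load_for_sequence_classification(label_names, checkpoint="neuropark/sahajBERT"):
    """Load the pre-trained encoder with a new, randomly initialized classifier head."""
    # Imported lazily so build_label_maps works without transformers installed.
    from transformers import AlbertForSequenceClassification, PreTrainedTokenizerFast

    id2label, label2id = build_label_maps(label_names)
    tokenizer = PreTrainedTokenizerFast.from_pretrained(checkpoint)
    model = AlbertForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(label_names),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

Calling `load_for_sequence_classification(["class_a", "class_b"])` (placeholder label names) downloads the checkpoint and returns a tokenizer and model ready for your own training loop. For token-level tasks such as NER, `AlbertForTokenClassification` can be swapped in the same way.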

How to use

You can use this model directly with a pipeline for masked language modeling:


from transformers import AlbertForMaskedLM, FillMaskPipeline, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model
model = AlbertForMaskedLM.from_pretrained("neuropark/sahajBERT")

# Initialize pipeline
pipeline = FillMaskPipeline(tokenizer=tokenizer, model=model)

raw_text = "ধন্যবাদ। আপনার সাথে কথা [MASK] ভালো লাগলো"  # Change me
pipeline(raw_text)
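
For an input with a single [MASK] token, the fill-mask pipeline in transformers returns a list of dictionaries with the keys `score`, `token`, `token_str`, and `sequence`. The sketch below shows one way to pull out the best candidates. The helper name `top_predictions` is hypothetical, and the example list uses placeholder tokens and scores only to show the structure; it is not real model output.

```python
def top_predictions(predictions, k=3):
    """Return the k highest-scoring (token_str, score) pairs from fill-mask output."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 4)) for p in ranked[:k]]


# Hypothetical pipeline output (placeholder tokens and scores, not real results).
example_output = [
    {"score": 0.10, "token": 11, "token_str": "tok_b", "sequence": "... tok_b ..."},
    {"score": 0.70, "token": 10, "token_str": "tok_a", "sequence": "... tok_a ..."},
    {"score": 0.05, "token": 12, "token_str": "tok_c", "sequence": "... tok_c ..."},
]
print(top_predictions(example_output, k=2))  # [('tok_a', 0.7), ('tok_b', 0.1)]
```

With the real model, you would pass the result of `pipeline(raw_text)` instead of `example_output`.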

Here is how to use this model to get the features of a given text in PyTorch:


from transformers import AlbertModel, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model
model = AlbertModel.from_pretrained("neuropark/sahajBERT")

text = "ধন্যবাদ। আপনার সাথে কথা বলে ভালো লাগলো"  # Change me
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
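
The model output exposes one vector per token in `output.last_hidden_state`. If you need a single vector per sentence, one common option (not something this card prescribes) is to average the token vectors while ignoring padding. The sketch below shows that idea on plain Python lists so it runs without any dependencies; the helper name `mean_pool` is hypothetical, and in practice you would likely do the same computation directly on tensors.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors over positions where attention_mask is 1.

    token_vectors: list of seq_len vectors, each a list of hidden_size floats.
    attention_mask: list of seq_len ints (1 = real token, 0 = padding).
    """
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    hidden_size = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(hidden_size)]


# Toy input: three positions, the last one is padding and must be ignored.
toy_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
sentence_vector = mean_pool(toy_vectors, toy_mask)  # [2.0, 3.0]
```

With the real model, assuming the tokenizer returns an attention mask, the inputs would be `output.last_hidden_state[0].tolist()` and `encoded_input["attention_mask"][0].tolist()`.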

Limitations and bias

WIP

Training data

The tokenizer was trained on the Bengali part of OSCAR, and the model on a dump of Wikipedia in Bengali and the Bengali part of OSCAR.

Training procedure

This model was trained in a collaborative manner by volunteer participants.

Contributors leaderboard

Rank Username Total contributed runtime
1 khalidsaifullaah 11 days 21:02:08
2 ishanbagchi 9 days 20:37:00
3 tanmoyio 9 days 18:08:34
4 debajit 8 days 14:15:10
5 skylord 6 days 16:35:29
6 ibraheemmoosa 5 days 01:05:57
7 SaulLu 5 days 00:46:36
8 lhoestq 4 days 20:11:16
9 nilavya 4 days 08:51:51
10 Priyadarshan 4 days 02:28:55
11 anuragshas 3 days 05:00:55
12 sujitpal 2 days 20:52:33
13 manandey 2 days 16:17:13
14 albertvillanova 2 days 14:14:31
15 justheuristic 2 days 13:20:52
16 w0lfw1tz 2 days 07:22:48
17 smoker 2 days 02:52:03
18 Soumi 1 days 20:42:02
19 Anjali 1 days 16:28:00
20 OptimusPrime 1 days 09:16:57
21 theainerd 1 days 04:48:57
22 yhn112 0 days 20:57:02
23 kolk 0 days 17:57:37
24 arnab 0 days 17:54:12
25 imavijit 0 days 16:07:26
26 osanseviero 0 days 14:16:45
27 subhranilsarkar 0 days 13:04:46
28 sagnik1511 0 days 12:24:57
29 anindabitm 0 days 08:56:44
30 borzunov 0 days 04:07:35
31 thomwolf 0 days 03:53:15
32 priyadarshan 0 days 03:40:11
33 ali007 0 days 03:34:37
34 sbrandeis 0 days 03:18:16
35 Preetha 0 days 03:13:47
36 Mrinal 0 days 03:01:43
37 laxya007 0 days 02:18:34
38 lewtun 0 days 00:34:43
39 Rounak 0 days 00:26:10
40 kshmax 0 days 00:06:38

Hardware used

Eval results

We evaluate the quality of sahajBERT and of 2 baseline models (XLM-R-large and IndicBert) by fine-tuning each pre-trained model 3 times on two downstream tasks in Bengali:

  • NER: a named entity recognition task on the Bengali split of the WikiANN dataset

  • NCC: a multi-class news classification task on the Soham News Category Classification dataset from IndicGLUE

Base pre-trained model   NER F1 (mean ± std)   NCC accuracy (mean ± std)
sahajBERT                95.45 ± 0.53          91.97 ± 0.47
XLM-R-large              96.48 ± 0.22          90.05 ± 0.38
IndicBert                92.52 ± 0.45          74.46 ± 1.91
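
The "mean ± std" figures summarize the repeated fine-tuning runs of each model. The sketch below shows how such a summary can be computed from per-run scores. The helper name `summarize_runs` is hypothetical, the input scores are placeholders rather than real results, and the use of the sample standard deviation is an assumption, since this card does not say which variant was used.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Format repeated-run scores as 'mean ± std' with two decimals."""
    # stdev is the sample standard deviation; use statistics.pstdev for the
    # population variant instead.
    return f"{mean(scores):.2f} ± {stdev(scores):.2f}"


# Placeholder scores for three hypothetical fine-tuning runs (not real results).
hypothetical_scores = [90.0, 92.0, 94.0]
summary = summarize_runs(hypothetical_scores)  # "92.00 ± 2.00"
```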

BibTeX entry and citation info

Coming soon!
