This model was pretrained on the bookcorpus dataset using knowledge distillation.
The particularity of this model is that, although it shares BERT's architecture, it has a hidden size of 384 (half of BERT's 768) and 6 attention heads (half of BERT's 12), so the per-head size of 64 is the same as in BERT.
The knowledge distillation was performed using multiple loss functions.
The weights of the model were initialized from scratch.
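The exact loss functions used for the distillation are not detailed here. Purely as an illustration, a common combination for MLM distillation is a temperature-scaled KL-divergence term against the teacher's logits plus the standard masked-language-modelling cross-entropy on the true labels; the sketch below is an assumption about that general recipe, not this model's actual training code:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Illustrative distillation objective (an assumption, not the exact recipe used here).

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    labels: (batch, seq_len), with -100 at positions that are not masked
    """
    # Soft-target term: KL divergence between temperature-scaled distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: standard MLM cross-entropy on the masked positions.
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
    return alpha * kl + (1 - alpha) * ce
```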
PS: the tokenizer is the same as the one used by bert-base-uncased.
**PS2: I am currently fixing a bug in this model. Do not expect anything from it until my next update.**
To load the model and tokenizer:
```python
from transformers import AutoModelForMaskedLM, BertTokenizer

model_name = "eli4s/Bert-L12-h384-A6"
model = AutoModelForMaskedLM.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)
```
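As a quick sanity check of the architecture described above, the standard BertConfig attributes can be inspected directly; the per-head size is 384 / 6 = 64, the same as bert-base-uncased (768 / 12):

```python
config = model.config
print(config.hidden_size)                                # 384
print(config.num_attention_heads)                        # 6
print(config.hidden_size // config.num_attention_heads)  # 64, same head size as bert-base-uncased
```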
To run it on a sentence:
```python
import torch

sentence = "Let's have a [MASK]."

model.eval()
encoded_inputs = tokenizer([sentence], padding='longest')
input_ids = torch.tensor(encoded_inputs['input_ids'])
attention_mask = torch.tensor(encoded_inputs['attention_mask'])

with torch.no_grad():
    output = model(input_ids, attention_mask=attention_mask)

# Locate the [MASK] token and take the highest-scoring prediction for it.
mask_index = input_ids.tolist()[0].index(tokenizer.mask_token_id)
masked_token = output['logits'][0][mask_index].argmax(dim=-1)
predicted_token = tokenizer.decode(masked_token)
print(predicted_token)
```
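Alternatively, the same prediction can be obtained with the fill-mask pipeline; this is a minimal sketch that reuses the model and tokenizer objects loaded above:

```python
from transformers import pipeline

# Reuse the model and tokenizer objects loaded earlier.
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(unmasker("Let's have a [MASK]."))
```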
We can also retrieve the n most likely predictions for the masked position:
```python
top_n = 5

vocab_size = model.config.vocab_size
logits = output['logits'][0][mask_index].tolist()

# Rank every token in the vocabulary by its logit and keep the n best.
top_tokens = sorted(range(vocab_size), key=lambda i: logits[i], reverse=True)[:top_n]
print(tokenizer.decode(top_tokens))
```
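Sorting the whole vocabulary works, but torch.topk gives the same result more directly; an equivalent sketch:

```python
# Equivalent top-n selection without sorting the whole vocabulary.
top_tokens = torch.topk(output['logits'][0][mask_index], k=top_n).indices.tolist()
print(tokenizer.decode(top_tokens))
```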