dumitrescustefan/t5-v1_1-base-romanian

This is a T5v1.1 base model (247M parameters), pretrained from scratch with t5x.

Training was performed on a clean 80GB Romanian text corpus for 4M steps with these scripts. The model was trained with an encoder sequence length of 512 and a decoder sequence length of 256.

!! IMPORTANT !! This model was pretrained only on the span corruption (MLM) task, so it is not usable for any downstream task without fine-tuning first!
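To make the span corruption objective concrete, here is a minimal sketch of how an input/target pair is typically built for T5-style pretraining. It is an illustration only, not the preprocessing actually used for this model: the helper `span_corrupt` is hypothetical, the spans are chosen by hand rather than sampled at random, the text is split on whitespace instead of being tokenized with SentencePiece, and the sentinel names `<extra_id_N>` are assumed to follow the usual T5 convention.

```python
def span_corrupt(tokens, spans):
    """Hypothetical helper: mask the given (start, end) spans T5-style.

    `spans` must be sorted and non-overlapping; `end` is exclusive.
    Returns (inputs, targets) as lists of tokens.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"  # assumed sentinel naming
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # replace the span with one sentinel
        targets.append(sentinel)             # the target repeats the sentinel...
        targets.extend(tokens[start:end])    # ...followed by the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])           # keep the text after the last span
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


words = "Acesta este un test simplu pentru model".split()
inputs, targets = span_corrupt(words, [(1, 3), (5, 6)])
print(" ".join(inputs))   # Acesta <extra_id_0> test simplu <extra_id_1> model
print(" ".join(targets))  # <extra_id_0> este un <extra_id_1> pentru <extra_id_2>
```

Because the model has only ever learned to fill in such masked spans, it will not summarize, translate, or answer questions until it has been fine-tuned on examples of that task.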

How to load a t5x model

from transformers import T5Tokenizer, T5Model

tokenizer = T5Tokenizer.from_pretrained('dumitrescustefan/t5-v1_1-base-romanian')
model = T5Model.from_pretrained('dumitrescustefan/t5-v1_1-base-romanian')

input_ids = tokenizer("Acesta este un test", return_tensors="pt").input_ids  # Batch size 1
decoder_input_ids = tokenizer("Acesta este", return_tensors="pt").input_ids  # Batch size 1

# Preprocess: prepend the start token to decoder_input_ids; for T5Model the start token is the pad token.
# This is not needed for T5ForConditionalGeneration, which does it internally when given the labels argument.
decoder_input_ids = model._shift_right(decoder_input_ids)

# forward pass
outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
last_hidden_states = outputs.last_hidden_state

print(last_hidden_states.shape)  # torch.Size([1, 3, 768]): (batch size, decoder length, hidden size)

Remember to always sanitize your text! Replace the cedilla letters ş and ţ with their comma-below counterparts using:

text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

because the model was not trained on the cedilla forms of ş and ţ. If you skip this step, performance will drop because of <UNK> tokens and a higher number of tokens per word.
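The replacement above can also be wrapped in a small reusable function. The sketch below is one possible way to do it, not part of the original model card: the name `sanitize` and the translation table are hypothetical, and the characters are written as Unicode escapes so that the visually similar cedilla and comma-below forms cannot be confused.

```python
# Map cedilla letters to their comma-below counterparts.
_CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
})


def sanitize(text: str) -> str:
    """Hypothetical helper: replace cedilla letters in a single pass."""
    return text.translate(_CEDILLA_TO_COMMA)


raw = "\u015ecoala \u015fi \u0163ara"  # written with cedilla letters
clean = sanitize(raw)
print(clean)
```

Calling `sanitize` on every string before it reaches the tokenizer, both during fine-tuning and at inference time, keeps the inputs consistent with the pretraining data.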

Acknowledgements

We'd like to thank TPU Research Cloud for providing the TPUv4 cores we used to train these models!

Authors

Yours truly,

Stefan Dumitrescu, Mihai Ilie and Per Egil Kummervold