BERT Tokenizer for Hinglish
This repository contains a BERT tokenizer that has been trained on more than 200,000 Hinglish words. Hinglish is a hybrid language that combines Hindi and English, and is commonly used in informal communication in India.
The tokenizer splits Hinglish text into word and subword tokens that can be used as input to a BERT model. The following example loads the tokenizer and tokenizes a sentence:
# Load the tokenizer from the Hugging Face Hub
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("obaidtambo/hinglish_bert_tokenizer")
example = "aap se kuch keha tha kehte kehte reh gaye"
tokens = tokenizer.tokenize(example)
print(tokens)
Output:
['aap', 'se', 'kuch', 'keh', '##a', 'tha', 'keh', '##te', 'keh', '##te', 'reh', 'gaye']
As you can see, common words such as "aap" and "gaye" are kept whole, while "keha" and "kehte" are split into a stem and a suffix ("keh" + "##a", "keh" + "##te"). The ## prefix marks a subword token that continues the preceding token.
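To make the ## convention concrete, here is a minimal sketch of greedy longest-match-first splitting, the strategy conventionally used by BERT WordPiece tokenizers, together with a helper that merges subword tokens back into words. It assumes this tokenizer follows that convention. The vocabulary `TOY_VOCAB` and the function names are hypothetical and exist only for illustration; the real vocabulary is far larger and is loaded by `AutoTokenizer`.

```python
from typing import List, Set

# Toy vocabulary for illustration only (an assumption, NOT the real
# vocabulary of the published tokenizer).
TOY_VOCAB: Set[str] = {
    "[UNK]", "aap", "se", "kuch", "keh", "##a", "##te", "tha", "reh", "gaye",
}


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word by repeatedly taking the longest piece found in vocab."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the ## prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Lowercase, split on whitespace, then split each word into pieces."""
    tokens: List[str] = []
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


def merge_wordpieces(tokens: List[str]) -> List[str]:
    """Glue ##-prefixed tokens back onto the token that precedes them."""
    words: List[str] = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words


example = "aap se kuch keha tha kehte kehte reh gaye"
tokens = tokenize(example, TOY_VOCAB)
print(tokens)
print(" ".join(merge_wordpieces(tokens)))
```

With this toy vocabulary the sketch yields the same split as the output shown above, and merging the pieces recovers the original sentence. The real tokenizer also handles punctuation and special tokens, which this sketch leaves out.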
We hope that this tokenizer will be useful for researchers and practitioners working on natural language processing tasks involving Hinglish text. If you have any questions or feedback, please feel free to open an issue or submit a pull request.