# NLTK Tokenizer for Transformers 🤗
## 📚 Overview
The NLTK Tokenizer is a custom tokenizer class designed for use with the Hugging Face Transformers library. The `NlktTokenizer` class extends `PreTrainedTokenizer` from Transformers to build an NLTK-based tokenizer, combining the pretrained-tokenizer plumbing and easy integration of `PreTrainedTokenizer` with the linguistic processing strengths of NLTK's `word_tokenize`. The result is a tokenizer that handles diverse language patterns well while remaining compatible with modern NLP modeling workflows.
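Under the hood this follows the standard pattern for custom "slow" tokenizers: load a word-level vocabulary, delegate splitting to NLTK, and let the base class handle the rest. The sketch below is illustrative only; the class name `SketchNltkTokenizer` and its internals are assumptions, and the real implementation is `NlktTokenizer` in `tokenization_nltk.py`.

```python
from nltk.tokenize import word_tokenize
from transformers import PreTrainedTokenizer


class SketchNltkTokenizer(PreTrainedTokenizer):
    """Illustrative sketch only; the real class is NlktTokenizer in tokenization_nltk.py."""

    def __init__(self, vocab_file, unk_token="[UNK]", **kwargs):
        # Load the plain-text vocabulary: one token per line, line number = token ID.
        with open(vocab_file, encoding="utf-8") as f:
            self.vocab = {line.rstrip("\n"): i for i, line in enumerate(f)}
        self.ids_to_tokens = {i: t for t, i in self.vocab.items()}
        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def vocab_size(self):
        return len(self.vocab)

    def get_vocab(self):
        return dict(self.vocab)

    def _tokenize(self, text):
        # Delegate the actual word splitting to NLTK.
        return word_tokenize(text)

    def _convert_token_to_id(self, token):
        return self.vocab.get(token, self.vocab.get(self.unk_token))

    def _convert_id_to_token(self, index):
        return self.ids_to_tokens.get(index, self.unk_token)
```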
## 🛠️ Installation
To use the NLTK Tokenizer, make sure you have both the `transformers` and `nltk` libraries installed. You can install them with pip:

```bash
pip install transformers nltk
```

or with conda:

```bash
conda install -c huggingface transformers nltk
```
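NLTK's `word_tokenize` also relies on the Punkt tokenizer models, which are downloaded separately; if you have not used NLTK before, you will most likely need to fetch them once:

```python
import nltk

# Punkt models are required by nltk.word_tokenize.
# On newer NLTK releases the resource is named 'punkt_tab' instead of 'punkt'.
nltk.download('punkt')
```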
## 🚴‍♂️ Getting Started
### Initializing the Tokenizer
- Clone this repo
- Go to the directory where you cloned this repo
- Initialize the NLTK Tokenizer with a vocabulary file. Note that your vocab file should list one token per line (a tiny illustrative example follows this list):

  ```python
  from tokenization_nltk import NlktTokenizer

  tokenizer = NlktTokenizer(vocab_file='path/to/your/vocabulary.txt')  # vocab.txt
  ```
- Enjoy 🤗
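If you just want to experiment and don't have a vocabulary yet, a tiny file in the expected format (one token per line) can be generated like this; the tokens shown here are purely hypothetical and will not match the project's real `vocab.txt`:

```python
# Hypothetical example: write a minimal word-level vocabulary, one token per line.
example_vocab = ["[PAD]", "[UNK]", "<s>", "<end_of_text>", "the", "weather", "!"]
with open("vocabulary.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(example_vocab) + "\n")
```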
## 💬 Basic Usage Examples
Simple Tokenization:
text = "Hello Shirin, How are you?" tokens = tokenizer.tokenize(text) print("Tokens:", tokens) #ouput: Tokens: ['Hello', 'Shirin', ',', 'How', 'are', 'you', '?']
Including Special Tokens:
text = "<s>Hello, world!<end_of_text>" tokens = tokenizer.tokenize(text, add_special_tokens=True) print(tokens) #output: ['<s>', 'Hello', ',', 'world', '!', '<end_of_text>']
Token-ID Conversion:
```python
tokens = ['the', 'weather', 'IS', 'Sunny', '!']
# lower() because the vocab.txt is all in lower case for us
token_ids = [tokenizer.convert_tokens_to_ids(token.lower()) for token in tokens]
print(token_ids)
# output: [1997, 4634, 2004, 11560, 1000]
```
ID-Token Conversion:
```python
ids = [1, 24707, 4634, 19238, 1000, 31000]
tokens = [tokenizer.convert_ids_to_tokens(token_id) for token_id in ids]
print(tokens)
# output: ['[PAD]', 'cloudy', 'weather', 'sucks', '!', '[UNK]']
```
Tokenizing a Long Text:
long_text = "In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort." long_tokens = tokenizer.tokenize(long_text) print("Tokens:", long_tokens) #output: ['In', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit', '.', 'Not', 'a', 'nasty', ',', 'dirty', ',', 'wet', 'hole', ',', 'filled', 'with', 'the', 'ends', 'of', 'worms', 'and', 'an', 'oozy', 'smell', ',', 'nor', 'yet', 'a', 'dry', ',', 'bare', ',', 'sandy', 'hole', 'with', 'nothing', 'in', 'it', 'to', 'sit', 'down', 'on', 'or', 'to', 'eat', ':', 'it', 'was', 'a', 'hobbit-hole', ',', 'and', 'that', 'means', 'comfort', '.']
Tokenizing Sentences with Emojis:
```python
text_with_emoji = "I love pizza 🍕! Do you like it too?"
tokens_with_emoji = tokenizer.tokenize(text_with_emoji)
print("Tokens:", tokens_with_emoji)
# output: ['I', 'love', 'pizza', '🍕', '!', 'Do', 'you', 'like', 'it', 'too', '?']
```
Saving the Tokenizer:
Save the tokenizer's state, including its vocabulary:
```python
tokenizer.save_vocabulary(save_directory='path/to/save')
```
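To load the tokenizer again later, you can point the constructor back at the saved file; this assumes the vocabulary is written out under the conventional name `vocab.txt`:

```python
# Assumes the saved vocabulary file is named vocab.txt.
tokenizer = NlktTokenizer(vocab_file='path/to/save/vocab.txt')
```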
## 🧪 Evaluation using pytest
The tokenizer is covered by a suite of pytest test cases exercising different input scenarios to ensure its robustness and functionality. You can run the tests yourself with:

```bash
pytest test_tokenization_nltk.py
```
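As a purely illustrative sketch (not one of the actual test cases), a test in that style might look like:

```python
# Illustrative only: the real test suite lives in test_tokenization_nltk.py.
from tokenization_nltk import NlktTokenizer


def test_tokenize_splits_punctuation():
    tokenizer = NlktTokenizer(vocab_file='path/to/your/vocabulary.txt')
    assert tokenizer.tokenize("Hello, world!") == ['Hello', ',', 'world', '!']
```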
## ⚠️ Limitations
- Contextual understanding: NLTK's tokenization operates mainly at the word level, so it cannot make the sub-word or character-level tokenization decisions that some NLP tasks require.
- Language complexity: NLTK may struggle with languages that have complex morphology or need specialized tokenization rules, such as some forms of Chinese or Japanese, where word boundaries depend heavily on context.
- Out-of-vocabulary words: If the tokenizer encounters words that are not in its vocabulary (like ID 31000 in the ID-to-token example above), it maps them to the [UNK] (unknown) token or otherwise handles them poorly, which can hurt downstream task performance (see the short example after this list).
- Limited preprocessing: Emojis are not fully supported.
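For example, converting a word that is missing from `vocab.txt` simply falls back to the unknown token; the exact ID depends on your vocabulary, and the word used here is just a stand-in:

```python
unk_id = tokenizer.convert_tokens_to_ids('supercalifragilistic')  # not in vocab.txt
print(tokenizer.convert_ids_to_tokens(unk_id))
# output: [UNK]
```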
## 🤗 Hub Integration
Make sure you have your vocabulary file (`vocab.txt`) in the same directory as the project.
Simple Tokenization:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ShirinYamani/task", trust_remote_code=True)

text = "Example sentence for tokenization."

# Tokenize the text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
```
Including Special Tokens:
text = "<s>Hello, world!<end_of_text>" tokens = tokenizer.tokenize(text, add_special_tokens=True) print(tokens) #output: ['<s>', 'Hello', ',', 'world', '!', '<end_of_text>']
Token-ID Conversion:
```python
tokens = ['the', 'weather', 'IS', 'Sunny', '!']
# lower() because the vocab.txt is all in lower case for us
token_ids = [tokenizer.convert_tokens_to_ids(token.lower()) for token in tokens]
print(token_ids)
# output: [1997, 4634, 2004, 11560, 1000]
```
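Since the class inherits from `PreTrainedTokenizer`, the standard encoding call should also work out of the box. A quick sketch; the resulting IDs depend entirely on the vocabulary:

```python
encoding = tokenizer("the weather is sunny!")
print(encoding["input_ids"])  # IDs looked up in vocab.txt
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```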