PersianBPETokenizer Model Card

Model Details

Model Description

The PersianBPETokenizer is a custom tokenizer for the Persian (Farsi) language. It uses the Byte-Pair Encoding (BPE) algorithm to build a subword vocabulary from Persian text, so that words can be split into smaller units learned from the training data. It is intended for use with Transformer language models such as BERT and RoBERTa on Persian NLP tasks.
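
As a rough illustration of the BPE idea, the toy sketch below learns merge rules from a handful of words using only the Python standard library. It is not the implementation used by the tokenizers library; the function name learn_bpe_merges, the sample words, and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy version)."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties go to the pair counted first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


# Illustrative words sharing a common stem (chosen for this example only).
words = ["کتاب", "کتابها", "کتابخانه"]
merges, segmented = learn_bpe_merges(words, num_merges=3)
print(merges)
print(list(segmented))
```

Because the three sample words share a stem, the first merges build that stem up character by character, which is how frequent subwords end up as single vocabulary entries.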

Model Type

  • Tokenization Algorithm: Byte-Pair Encoding (BPE)
  • Normalization: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)
  • Pre-tokenization: Whitespace
  • Post-processing: TemplateProcessing for special tokens

Model Version

  • Version: 1.0
  • Date: September 6, 2024

License

  • License: MIT

Developers

  • Developer: Mohammad Shojaei

Citation

If you use this tokenizer in your research, please cite it as:

Mohammad Shojaei. (2024). PersianBPETokenizer [Software]. Available at https://huggingface.co/mshojaei77/PersianBPETokenizer.

Model Use

Intended Use

  • Primary Use: Tokenization of Persian text for NLP tasks such as text classification, named entity recognition, machine translation, and more.
  • Secondary Use: Integration with pre-trained language models like BERT and RoBERTa for fine-tuning on Persian datasets.

Out-of-Scope Use

  • Non-Persian Text: This tokenizer is not designed for languages other than Persian.
  • Non-NLP Tasks: It is not intended for use in non-NLP tasks such as image processing or audio analysis.

Data

Training Data

  • Dataset: mshojaei77/PersianTelegramChannels
  • Description: A rich collection of Persian text extracted from various Telegram channels. This dataset provides a diverse range of language patterns and vocabulary, making it suitable for training a general-purpose Persian tokenizer.
  • Size: 60,730 samples

Data Preprocessing

  • Normalization: Applied NFD Unicode normalization, removed accents, converted text to lowercase, stripped leading and trailing whitespace, and removed ZWNJ characters.
  • Pre-tokenization: Used whitespace pre-tokenization.
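
The normalization steps above can be approximated with the Python standard library, which makes their effect on Persian text easier to inspect. This is a sketch, not the tokenizer's own code: the function name normalize_persian is hypothetical, and replacing ZWNJ with the empty string is an assumption based on the "removed ZWNJ characters" wording.

```python
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner, common inside Persian compound words


def normalize_persian(text, zwnj_replacement=""):
    """Approximate NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)."""
    # NFD: decompose characters into base characters plus combining marks.
    text = unicodedata.normalize("NFD", text)
    # StripAccents: drop combining marks (Unicode categories Mn, Mc, Me).
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("M"))
    # Lowercase: affects cased scripts such as Latin; Persian letters are unchanged.
    text = text.lower()
    # Strip: remove leading and trailing whitespace.
    text = text.strip()
    # Replace (ZWNJ): assumed here to mean deleting the character.
    return text.replace(ZWNJ, zwnj_replacement)


print(normalize_persian("  می\u200cروم به Café  "))
```

One effect worth checking on your own data: in this approximation, NFD decomposes آ (U+0622) into ا plus a combining madda, so stripping combining marks turns it into a plain ا, and Persian diacritics such as kasra are removed as well.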

Performance

Evaluation Metrics

  • Tokenization Accuracy: The tokenizer has been tested on various Persian sentences and is reported to tokenize and encode them accurately; no quantitative figures are given.
  • Compatibility: Loads through Hugging Face Transformers (for example via AutoTokenizer), so it can be used with language models from that library.

Known Limitations

  • Vocabulary Size: The vocabulary is learned from the training data. For very specialized domains, retraining or extending the tokenizer on domain-specific data may be required.
  • Out-of-Vocabulary Words: Rare or domain-specific words may be tokenized as unknown tokens ([UNK]).
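
One simple way to gauge the out-of-vocabulary limitation on your own text is to measure how often [UNK] appears in the tokenizer's output. The helper below is a minimal sketch; the name unk_rate and the sample token list are hypothetical and not part of the tokenizer.

```python
def unk_rate(tokens, unk_token="[UNK]"):
    """Return the fraction of tokens equal to the unknown token."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token == unk_token) / len(tokens)


# Made-up token list for illustration; in practice, pass the output of
# persian_tokenizer.tokenize(text) for text from your own domain.
sample_tokens = ["سلام", "[UNK]", "دنیا", "[UNK]"]
print(unk_rate(sample_tokens))
```

A rate that is noticeably high on domain text suggests the vocabulary does not cover that domain well.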

Training Procedure

Training Steps

  1. Environment Setup: Installed necessary libraries (datasets, tokenizers, transformers).
  2. Data Preparation: Loaded the mshojaei77/PersianTelegramChannels dataset and created a batch iterator for efficient training.
  3. Tokenizer Model: Initialized the tokenizer with a BPE model and applied normalization and pre-tokenization steps.
  4. Training: Trained the tokenizer on the Persian text corpus using the BPE algorithm.
  5. Post-processing: Set up post-processing to handle special tokens.
  6. Saving: Saved the tokenizer to disk for future use.
  7. Compatibility: Converted the tokenizer to a PreTrainedTokenizerFast object for compatibility with Hugging Face Transformers.
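
The steps above can be sketched with the tokenizers library as follows. This is a reconstruction under stated assumptions, not the author's training script: the function name train_bpe_tokenizer, the vocab_size value, the tiny in-memory corpus, the BERT-style template, and the choice to replace ZWNJ with the empty string are all placeholders chosen for illustration.

```python
from tokenizers import Tokenizer, normalizers
from tokenizers.models import BPE
from tokenizers.normalizers import NFD, Lowercase, Replace, Strip, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import BpeTrainer

SPECIAL_TOKENS = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]


def train_bpe_tokenizer(texts, vocab_size=1000):
    """Train a BPE tokenizer on an iterable of strings (vocab_size is a placeholder)."""
    # Step 3: BPE model plus normalization and pre-tokenization.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.normalizer = normalizers.Sequence(
        # Assumption: ZWNJ (U+200C) is replaced with the empty string.
        [NFD(), StripAccents(), Lowercase(), Strip(), Replace("\u200c", "")]
    )
    tokenizer.pre_tokenizer = Whitespace()

    # Step 4: learn merges from the corpus.
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=SPECIAL_TOKENS)
    tokenizer.train_from_iterator(texts, trainer=trainer)

    # Step 5: wrap sequences in special tokens (BERT-style template, assumed).
    tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B:1 [SEP]:1",
        special_tokens=[
            ("[CLS]", tokenizer.token_to_id("[CLS]")),
            ("[SEP]", tokenizer.token_to_id("[SEP]")),
        ],
    )
    return tokenizer


# Tiny stand-in corpus; the real training used the Telegram dataset.
tiny_corpus = ["سلام دنیا", "سلام دوست من", "روز خوبی داشته باشید"]
tokenizer = train_bpe_tokenizer(tiny_corpus)
encoding = tokenizer.encode("سلام دنیا")
print(encoding.tokens)
```

For steps 6 and 7, the trained object can be written to disk with tokenizer.save("tokenizer.json") and wrapped with transformers.PreTrainedTokenizerFast(tokenizer_object=tokenizer, unk_token="[UNK]", cls_token="[CLS]", sep_token="[SEP]", pad_token="[PAD]", mask_token="[MASK]"); the file name here is only an example.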

Hyperparameters

  • Special Tokens: [UNK], [CLS], [SEP], [PAD], [MASK]
  • Batch Size: 1000 samples per batch
  • Normalization Steps: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)
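
A batch iterator with the batch size listed above might look like the sketch below. The function name batch_iterator and the placeholder samples are assumptions; the original script may have iterated over the dataset differently.

```python
def batch_iterator(texts, batch_size=1000):
    """Yield successive slices of at most `batch_size` items from a sequence."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Placeholder strings standing in for the 60,730 dataset samples.
samples = ["text"] * 60730
batch_sizes = [len(batch) for batch in batch_iterator(samples)]
print(len(batch_sizes), batch_sizes[-1])
```

With 60,730 samples and a batch size of 1000, this yields 61 batches, the last one holding 730 samples.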

How to Use

Installation

To use the PersianBPETokenizer, first install the required libraries:

pip install -q --upgrade datasets tokenizers transformers

Loading the Tokenizer

You can load the tokenizer using the Hugging Face Transformers library:

from transformers import AutoTokenizer

persian_tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")

Tokenization Example

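test_sentence = "سلام، چطور هستید؟ امیدوارم روز خوبی داشته باشید"
# Split the sentence into subword tokens (no special tokens are added here).
tokens = persian_tokenizer.tokenize(test_sentence)
print("Tokens:", tokens)
# Encode to input IDs; calling the tokenizer adds special tokens by default.
encoded = persian_tokenizer(test_sentence)
print("Input IDs:", encoded["input_ids"])
# Decode the IDs back to text (special tokens are kept unless skip_special_tokens=True is passed).
print("Decoded:", persian_tokenizer.decode(encoded["input_ids"]))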

Acknowledgments

  • Dataset: mshojaei77/PersianTelegramChannels
  • Libraries: Hugging Face datasets, tokenizers, and transformers
