---
library_name: transformers
license: mit
language:
- ky
---

# Model Card for Kyrgyz BERT Tokenizer

This is a WordPiece-based BERT tokenizer trained specifically for the Kyrgyz language. It was developed to support Kyrgyz NLP applications, including text classification, translation, and morphological analysis. The tokenizer was trained on a large corpus drawn from a variety of Kyrgyz text sources.

## Model Details

### Model Description

- **Developed by:** Metinov Adilet
- **Funded by:** Self-funded (MetinLab)
- **Shared by:** metinovadilet
- **Model type:** WordPiece tokenizer (BERT-style)
- **Language(s) (NLP):** Kyrgyz (ky)
- **License:** MIT
- **Finetuned from model:** N/A (trained from scratch)

### Model Sources

- **Repository:** metinovadilet/bert-kyrgyz-tokenizer
- **Paper:** N/A
- **Demo:** N/A

## Uses

### Direct Use

This tokenizer can be used directly for NLP tasks such as:

- Tokenizing Kyrgyz texts for training language models
- Preparing data for Kyrgyz BERT pretraining or fine-tuning (see the batch-processing sketch after the getting-started example below)
- Kyrgyz text segmentation and WordPiece-based analysis

### Downstream Use

- Can serve as the tokenizer for BERT-based models trained on Kyrgyz text
- Supports NLP applications such as sentiment analysis, morphological modeling, and machine translation

### Out-of-Scope Use

- This tokenizer is not optimized for multilingual text; it is designed for Kyrgyz-only corpora.
- It may not work well on transliterated or mixed-script text (e.g., text combining Latin and Cyrillic scripts).

## Bias, Risks, and Limitations

- The tokenizer is limited by its training corpus: rare words, dialectal forms, and domain-specific terms may not be well represented.
- As with most tokenizers, it may reflect biases present in the source text, particularly around gender, ethnicity, and socio-political topics.

### Recommendations

Users should be aware of potential biases and evaluate performance for their specific application. If biases or inefficiencies are found, fine-tuning or retraining on a more diverse corpus is recommended.

## How to Get Started with the Model

Use the code below to get started with the tokenizer:

```
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

text = "Бул кыргыз тилинде жазылган текст."
tokens = tokenizer(text, return_offsets_mapping=True)

print("Input Text:", text)
print("Tokens:", tokenizer.convert_ids_to_tokens(tokens['input_ids']))
print("Token IDs:", tokens['input_ids'])
print("Offsets:", tokens['offset_mapping'])
```
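For preparing training data in batches (one of the direct uses listed above), the same tokenizer can pad and truncate many sentences at once. This is a minimal sketch, assuming PyTorch is installed for `return_tensors="pt"`; the example sentences and the `max_length` value are illustrative, not part of the released configuration:

```
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# Placeholder Kyrgyz sentences standing in for a real dataset
sentences = [
    "Бул кыргыз тилинде жазылган текст.",
    "Кыргыз тили - мамлекеттик тил.",
]

# Pad/truncate every example to the same length and return PyTorch tensors
batch = tokenizer(
    sentences,
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)

print(batch["input_ids"].shape)       # torch.Size([2, 128])
print(batch["attention_mask"].shape)  # torch.Size([2, 128])
```

The resulting `input_ids` and `attention_mask` tensors can be fed directly to a BERT-style model or a data collator.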
## Training Details and Training Data

Not disclosable.

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** NVIDIA RTX 3090
- **Hours used:** ~1 hour
- **Compute Region:** Central Asia
- **Carbon Emitted:** ~0.1 kg CO2

## Technical Specifications

### Model Architecture and Objective

- Architecture: WordPiece-based BERT tokenizer
- Objective: Efficient tokenization for Kyrgyz NLP applications

### Compute Infrastructure

#### Hardware

- GPU: NVIDIA RTX 3090 (24 GB VRAM)
- CPU: Intel Core i5-13400F

#### Software

- Python 3.10
- Transformers (Hugging Face)
- Tokenizers (Hugging Face)

## Citation

If you use this tokenizer, please cite:

```
@misc{bert-kyrgyz-tokenizer,
  author = {Metinov Adilet},
  title  = {BERT Kyrgyz Tokenizer},
  year   = {2025},
  url    = {https://huggingface.co/metinovadilet/bert-kyrgyz-tokenizer},
  note   = {Trained at MetinLab}
}
```

## Model Card Contact

For questions or issues, reach out to MetinLab via email: metinovadilet@gmail.com

This model was made in collaboration with UlutsoftLLC.
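## Appendix: Training a Comparable Tokenizer (Sketch)

The actual training corpus and configuration are not disclosed. The sketch below only shows how a WordPiece tokenizer of this style could be trained with the Hugging Face Tokenizers library listed under Software; the corpus path, vocabulary size, and minimum frequency are assumptions for illustration, not the settings used for this model:

```
import os
from tokenizers import BertWordPieceTokenizer

# Hypothetical corpus file; the real training data is not disclosable
files = ["kyrgyz_corpus.txt"]

# Keep case information, since casing is meaningful in Kyrgyz Cyrillic text
tokenizer = BertWordPieceTokenizer(lowercase=False)

tokenizer.train(
    files=files,
    vocab_size=30000,  # assumed; the released vocabulary size may differ
    min_frequency=2,   # assumed
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes vocab.txt, which BertTokenizerFast can load later
os.makedirs("bert-kyrgyz-tokenizer", exist_ok=True)
tokenizer.save_model("bert-kyrgyz-tokenizer")
```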