---
library_name: transformers
license: mit
language:
- ky
pipeline_tag: fill-mask
tags:
- kyrgyz
- low-resource-language
- bert
- nlp
- masked-language-modeling
---
# Model Card for Kyrgyz BERT Tokenizer
This is a WordPiece-based BERT tokenizer trained specifically for the Kyrgyz language. It was developed to support Kyrgyz NLP applications, including text classification, translation, and morphological analysis. The tokenizer was trained on a large corpus from various Kyrgyz text sources.
## Model Details
### Model Description
- **Developed by:** Metinov Adilet
- **Funded by:** Self-funded (MetinLab)
- **Shared by:** metinovadilet
- **Model type:** WordPiece Tokenizer (BERT-style)
- **Language(s) (NLP):** Kyrgyz (ky)
- **License:** MIT
- **Finetuned from model:** N/A (trained from scratch)
### Model Sources
- **Repository:** metinovadilet/bert-kyrgyz-tokenizer
- **Paper:** N/A
- **Demo:** N/A
## Uses
### Direct Use
This tokenizer can be used directly for NLP tasks such as:
- Tokenizing Kyrgyz texts for training language models
- Preparing data for Kyrgyz BERT training or fine-tuning
- Kyrgyz text segmentation and wordpiece-based analysis
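For example, here is a minimal sketch of preparing a small batch of Kyrgyz sentences for masked-language-model training; the sample sentences and the 15% masking rate are illustrative assumptions, not settings from this repository:
```python
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# Illustrative sample sentences; substitute your own Kyrgyz corpus.
sentences = [
    "Бул кыргыз тилинде жазылган текст.",
    "Кыргызстан тоолуу өлкө.",
]

# Batch-encode with padding and truncation, returning PyTorch tensors.
batch = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors="pt")

# Apply dynamic masking for masked-language-model training (assumed 15% rate).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
mlm_batch = collator([{"input_ids": ids} for ids in batch["input_ids"]])

print(mlm_batch["input_ids"].shape)  # masked inputs
print(mlm_batch["labels"].shape)     # original ids at masked positions, -100 elsewhere
```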
### Downstream Use
- Can be used as the tokenizer for BERT-based models trained on Kyrgyz text
- Supports various NLP applications like sentiment analysis, morphological modeling, and machine translation
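As an illustration, the tokenizer can be paired with a BERT classification head; `path/to/kyrgyz-bert` below is a hypothetical placeholder for a Kyrgyz BERT checkpoint pretrained with this tokenizer:
```python
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")
# Hypothetical checkpoint path; use a model actually trained with this tokenizer.
model = BertForSequenceClassification.from_pretrained("path/to/kyrgyz-bert", num_labels=2)

inputs = tokenizer("Бул фильм абдан жакшы экен!", return_tensors="pt")
logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities, e.g. for sentiment
```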
### Out-of-Scope Use
- This tokenizer is not optimized for multilingual text. It is designed for Kyrgyz-only corpora.
- It may not work well for transliterated or mixed-script text (e.g., combining Latin and Cyrillic scripts).
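A quick, hedged way to check whether a given input is in-scope is to measure the share of `[UNK]` tokens it produces; a high rate suggests mixed-script or otherwise out-of-domain text:
```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

def unk_rate(text: str) -> float:
    """Fraction of WordPiece tokens that fall back to [UNK]."""
    tokens = tokenizer.tokenize(text)
    return tokens.count(tokenizer.unk_token) / max(len(tokens), 1)

print(unk_rate("Бул кыргыз тилинде жазылган текст."))  # expected to be low
print(unk_rate("Кыргыз text аралаш mixed жазуу"))       # may be higher for mixed script
```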
## Bias, Risks, and Limitations
- The tokenizer is limited by the training corpus, meaning rare words, dialectal forms, and domain-specific terms may not be well-represented.
- As with most tokenizers, it may exhibit biases from the source text, particularly in areas of gender, ethnicity, or socio-political context.
### Recommendations
Users should be aware of potential biases and evaluate performance for their specific application. If biases or inefficiencies are found, fine-tuning or training with a more diverse corpus is recommended.
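One simple way to evaluate the tokenizer on a target domain is token fertility (average subword tokens per whitespace-separated word); a sketch, assuming a hypothetical file `domain_sample.txt` with one Kyrgyz sentence per line:
```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# Hypothetical sample file; one in-domain Kyrgyz sentence per line.
with open("domain_sample.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

words = sum(len(line.split()) for line in lines)
pieces = sum(len(tokenizer.tokenize(line)) for line in lines)

# Higher fertility means words are split into more pieces, which can hurt
# downstream efficiency; compare against a general-domain sample as a baseline.
print(f"Fertility: {pieces / words:.2f} subword tokens per word")
```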
## How to Get Started with the Model
Use the code below to load the tokenizer and inspect its output:
```python
from transformers import BertTokenizerFast

# Load the tokenizer from the Hugging Face Hub.
tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

text = "Бул кыргыз тилинде жазылган текст."

# Encode the text; the offset mapping ties each token to its character span.
tokens = tokenizer(text, return_offsets_mapping=True)

print("Input Text:", text)
print("Tokens:", tokenizer.convert_ids_to_tokens(tokens["input_ids"]))
print("Token IDs:", tokens["input_ids"])
print("Offsets:", tokens["offset_mapping"])
```
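Subword continuations in the printed tokens carry the standard WordPiece `##` prefix, and the offset mapping lets you align each piece back to its character span in the input.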
## Training Details and Training Data
The training corpus and training procedure are not publicly disclosable.
## Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** NVIDIA RTX 3090
- **Hours used:** ~1 hour
- **Compute Region:** Central Asia
- **Carbon Emitted:** ~0.1 kg CO2
## Technical Specifications
### Model Architecture and Objective
- Architecture: WordPiece-based BERT tokenizer
- Objective: Efficient tokenization for Kyrgyz NLP applications
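For reference, here is a minimal sketch of how a BERT-style WordPiece tokenizer can be trained from scratch with the Hugging Face `tokenizers` library; the corpus path, vocabulary size, and casing are illustrative assumptions, not this repository's actual configuration:
```python
from tokenizers import BertWordPieceTokenizer

# Assumed settings for illustration only.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=["kyrgyz_corpus.txt"],  # hypothetical corpus file
    vocab_size=30000,             # assumed vocabulary size
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes vocab.txt, loadable via BertTokenizerFast.from_pretrained(...).
tokenizer.save_model("bert-kyrgyz-tokenizer")
```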
### Compute Infrastructure
#### Hardware
- GPU: NVIDIA RTX 3090 (24GB VRAM)
- CPU: Intel Core i5-13400F
#### Software
- Python 3.10
- Transformers (Hugging Face)
- Tokenizers (Hugging Face)
## Citation
If you use this tokenizer, please cite:
```bibtex
@misc{bert-kyrgyz-tokenizer,
  author = {Metinov Adilet},
  title = {BERT Kyrgyz Tokenizer},
  year = {2025},
  url = {https://huggingface.co/metinovadilet/bert-kyrgyz-tokenizer},
  note = {Trained at MetinLab}
}
```
## Model Card Contact
For questions or issues, reach out to MetinLab via email: [email protected]
## Acknowledgements
This model was made in collaboration with UlutsoftLLC.