|
--- |
|
library_name: transformers |
|
license: mit |
|
language: |
|
- ky |
|
pipeline_tag: fill-mask |
|
tags: |
|
- kyrgyz |
|
- low-resource-language |
|
- bert |
|
- nlp |
|
- masked-language-modeling |
|
--- |
|
|
|
# Model Card for Kyrgyz BERT Tokenizer |
|
|
|
|
|
This is a WordPiece-based BERT tokenizer trained specifically for the Kyrgyz language. It was developed to support Kyrgyz NLP applications, including text classification, translation, and morphological analysis. The tokenizer was trained on a large corpus from various Kyrgyz text sources. |
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
- **Developed by:** Metinov Adilet |
|
- **Funded by:** Self-funded (MetinLab)

- **Shared by:** metinovadilet
|
- **Model type:** WordPiece Tokenizer (BERT-style) |
|
- **Language(s) (NLP):** Kyrgyz (ky) |
|
- **License:** MIT |
|
- **Finetuned from model:** N/A (trained from scratch)
|
|
|
### Model Sources |
|
|
|
|
|
|
- **Repository:** metinovadilet/bert-kyrgyz-tokenizer |
|
- **Paper [optional]:** N/A |
|
- **Demo [optional]:** N/A |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
This tokenizer can be used directly for NLP tasks such as: |
|
|
|
- Tokenizing Kyrgyz texts for training language models |
|
|
|
- Preparing data for Kyrgyz BERT training or fine-tuning |
|
|
|
- Kyrgyz text segmentation and wordpiece-based analysis |
|
|
|
### Downstream Use
|
|
|
- Can be used as the tokenizer for BERT-based models trained on Kyrgyz text |
|
|
|
- Supports various NLP applications like sentiment analysis, morphological modeling, and machine translation |
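The masked-language-modeling objective that downstream BERT training uses can be sketched in plain Python. The function below is a simplified, hypothetical version of BERT's standard masking scheme (15% of non-special tokens are selected; of those, 80% become `[MASK]`, 10% a random token, 10% stay unchanged); the token ids and vocabulary size are made up for illustration:

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids, mlm_prob=0.15, seed=0):
    """BERT-style masking. Returns (masked_ids, labels), where labels is the
    original id at selected positions and -100 (ignored by the loss) elsewhere."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tid in token_ids:
        if tid not in special_ids and rng.random() < mlm_prob:
            labels.append(tid)          # predict the original token here
            r = rng.random()
            if r < 0.8:
                masked.append(mask_id)  # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.randrange(vocab_size))  # 10%: random token
            else:
                masked.append(tid)      # 10%: keep unchanged
        else:
            masked.append(tid)
            labels.append(-100)
    return masked, labels

ids = [2, 15, 37, 8, 92, 3]  # hypothetical: [CLS] ... [SEP]
masked, labels = mask_tokens(ids, mask_id=4, vocab_size=100, special_ids={2, 3})
print(masked, labels)
```

In practice, `transformers.DataCollatorForLanguageModeling` implements this scheme for you; the sketch only shows what happens to the token ids this tokenizer produces.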
|
|
|
### Out-of-Scope Use |
|
|
|
- This tokenizer is not optimized for multilingual text. It is designed for Kyrgyz-only corpora. |
|
|
|
- It may not work well for transliterated or mixed-script text (e.g., combining Latin and Cyrillic scripts). |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
- The tokenizer is limited by the training corpus, meaning rare words, dialectal forms, and domain-specific terms may not be well-represented. |
|
|
|
- As with most tokenizers, it may exhibit biases from the source text, particularly in areas of gender, ethnicity, or socio-political context. |
|
|
|
### Recommendations |
|
|
|
Users should be aware of potential biases and evaluate performance for their specific application. If biases or inefficiencies are found, fine-tuning or training with a more diverse corpus is recommended. |
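One simple coverage check before committing to this tokenizer is the fraction of tokens that map to `[UNK]` on a sample of target-domain text; a high rate suggests the vocabulary under-represents that domain. A minimal sketch (the helper name is ours, not part of any library):

```python
def unk_rate(token_ids, unk_id):
    """Fraction of tokens mapped to the unknown token -- a rough proxy for
    how well the vocabulary covers a target corpus or domain."""
    return sum(1 for t in token_ids if t == unk_id) / len(token_ids) if token_ids else 0.0

# With the released tokenizer, this would be used roughly as:
#   rate = unk_rate(tokenizer(text)["input_ids"], tokenizer.unk_token_id)
print(unk_rate([1, 2, 0, 3], unk_id=0))
```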
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to load the tokenizer and inspect how it segments Kyrgyz text:
|
```python
|
from transformers import BertTokenizerFast |
|
|
|
tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer") |
|
|
|
text = "Бул кыргыз тилинде жазылган текст." |
|
|
|
tokens = tokenizer(text, return_offsets_mapping=True) |
|
|
|
print("Input Text:", text) |
|
print("Tokens:", tokenizer.convert_ids_to_tokens(tokens['input_ids'])) |
|
print("Token IDs:", tokens['input_ids']) |
|
print("Offsets:", tokens['offset_mapping']) |
|
``` |
|
## Training Details and Training Data |
|
|
|
The training corpus cannot be disclosed.
|
|
|
## Environmental Impact |
|
|
|
|
|
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). |
|
|
|
- **Hardware Type:** NVIDIA RTX 3090 |
|
- **Hours used:** ~1 hour |
|
- **Compute Region:** Central Asia |
|
- **Carbon Emitted:** ~0.1 kg CO2 |
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
|
|
- Architecture: WordPiece-based BERT tokenizer |
|
|
|
- Objective: Efficient tokenization for Kyrgyz NLP applications |
|
|
|
### Compute Infrastructure |
|
|
|
|
|
|
#### Hardware |
|
|
|
- GPU: NVIDIA RTX 3090 (24GB VRAM) |
|
- CPU: Intel Core i5-13400F
|
|
|
#### Software |
|
|
|
- Python 3.10 |
|
|
|
- Transformers (Hugging Face) |
|
|
|
- Tokenizers (Hugging Face) |
|
|
|
## Citation
|
|
|
If you use this tokenizer, please cite: |
|
```bibtex
|
@misc{bert-kyrgyz-tokenizer, |
|
author = {Metinov Adilet}, |
|
title = {BERT Kyrgyz Tokenizer}, |
|
year = {2025}, |
|
url = {https://huggingface.co/metinovadilet/bert-kyrgyz-tokenizer}, |
|
note = {Trained at MetinLab} |
|
} |
|
``` |
|
## Model Card Contact |
|
|
|
For questions or issues, reach out to MetinLab via: |
|
|
|
Email: [email protected] |
|
|
|
## Acknowledgements

This model was made in collaboration with UlutsoftLLC.