Model Card for Kyrgyz BERT Tokenizer

This is a WordPiece-based BERT tokenizer trained specifically for the Kyrgyz language. It was developed to support Kyrgyz NLP applications, including text classification, translation, and morphological analysis. The tokenizer was trained on a large corpus from various Kyrgyz text sources.

Model Details

Model Description

  • Developed by: Metinov Adilet
  • Funded by: Self-funded (MetinLab)
  • Shared by: metinovadilet
  • Model type: WordPiece Tokenizer (BERT-style)
  • Language(s) (NLP): Kyrgyz (ky)
  • License: MIT
  • Finetuned from model [optional]: N/A (trained from scratch)

Model Sources

  • Repository: metinovadilet/bert-kyrgyz-tokenizer
  • Paper [optional]: N/A
  • Demo [optional]: N/A

Uses

Direct Use

This tokenizer can be used directly for NLP tasks such as:

  • Tokenizing Kyrgyz texts for training language models

  • Preparing data for Kyrgyz BERT training or fine-tuning

  • Kyrgyz text segmentation and wordpiece-based analysis
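
For the segmentation use case, tokenizer.tokenize() shows directly how a Kyrgyz sentence is split into wordpieces (the example sentence below is arbitrary):

from transformers import BertTokenizerFast

# Load the pretrained Kyrgyz WordPiece tokenizer from the Hugging Face Hub
tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# Split a Kyrgyz sentence into wordpiece tokens (no special tokens are added)
print(tokenizer.tokenize("Бишкек Кыргызстандын борбору."))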

Downstream Use [optional]

  • Can be used as the tokenizer for BERT-based models trained on Kyrgyz text

  • Supports various NLP applications like sentiment analysis, morphological modeling, and machine translation
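
As a minimal sketch of the downstream setup (not the author's recipe), the tokenizer can be paired with a freshly initialized BERT model for Kyrgyz pretraining; the model hyperparameters below are illustrative assumptions, not values used for any released model:

from transformers import BertConfig, BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# Assumed hyperparameters for a small Kyrgyz BERT; adjust to your compute budget
config = BertConfig(
    vocab_size=tokenizer.vocab_size,  # keep the embedding matrix in sync with this tokenizer
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
)
model = BertForMaskedLM(config)
# The model can now be pretrained on tokenized Kyrgyz text with a masked-language-modeling objective.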

Out-of-Scope Use

  • This tokenizer is not optimized for multilingual text. It is designed for Kyrgyz-only corpora.

  • It may not work well for transliterated or mixed-script text (e.g., combining Latin and Cyrillic scripts).

Bias, Risks, and Limitations

  • The tokenizer is limited by the training corpus, meaning rare words, dialectal forms, and domain-specific terms may not be well-represented.

  • As with most tokenizers, it may exhibit biases from the source text, particularly in areas of gender, ethnicity, or socio-political context.

Recommendations

Users should be aware of potential biases and evaluate performance for their specific application. If biases or inefficiencies are found, fine-tuning or training with a more diverse corpus is recommended.

How to Get Started with the Model

Use the code below to get started with the tokenizer.

from transformers import BertTokenizerFast

# Load the pretrained Kyrgyz WordPiece tokenizer from the Hugging Face Hub
tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

text = "Бул кыргыз тилинде жазылган текст."

# Encode the text; return_offsets_mapping adds character spans for each token
tokens = tokenizer(text, return_offsets_mapping=True)

print("Input Text:", text)
print("Tokens:", tokenizer.convert_ids_to_tokens(tokens['input_ids']))
print("Token IDs:", tokens['input_ids'])
print("Offsets:", tokens['offset_mapping'])

Training Details and Training Data

The training corpus and training procedure are not disclosable.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: NVIDIA RTX 3090
  • Hours used: ~1 hour
  • Compute Region: Central Asia
  • Carbon Emitted: ~0.1 kg CO2
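
As a rough back-of-the-envelope check of that figure: an RTX 3090 draws on the order of 350 W, so one hour of use is about 0.35 kWh; at an assumed grid intensity of roughly 0.3 kg CO2 per kWh, this gives approximately 0.1 kg CO2, consistent with the estimate above.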

Technical Specifications

Model Architecture and Objective

  • Architecture: WordPiece-based BERT tokenizer

  • Objective: Efficient tokenization for Kyrgyz NLP applications
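
The actual corpus and training configuration are not disclosed (see Training Details above). Purely as an illustration of the general approach, a WordPiece tokenizer of this style can be trained with the Hugging Face tokenizers library; the corpus file name, vocabulary size, and other settings below are assumptions, not the values actually used:

from tokenizers import BertWordPieceTokenizer

# Assumed settings for illustration only
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=["kyrgyz_corpus.txt"],  # hypothetical corpus file
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("bert-kyrgyz-tokenizer")  # writes vocab.txt, loadable with BertTokenizerFast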

Compute Infrastructure

See Hardware and Software below.

Hardware

  • GPU: NVIDIA RTX 3090 (24GB VRAM)
  • CPU: Intel Core i5-13400F

Software

  • Python 3.10

  • Transformers (Hugging Face)

  • Tokenizers (Hugging Face)

Citation [optional]

If you use this tokenizer, please cite:

@misc{bert-kyrgyz-tokenizer,
  author = {Metinov Adilet},
  title = {BERT Kyrgyz Tokenizer},
  year = {2025},
  url = {https://huggingface.co/metinovadilet/bert-kyrgyz-tokenizer},
  note = {Trained at MetinLab}
}

Model Card Contact

For questions or issues, reach out to MetinLab via:

Email: [email protected]

This model was made in collaboration with UlutsoftLLC.
