---
library_name: transformers
license: mit
language:
  - ky
---

Model Card for Kyrgyz BERT Tokenizer

This is a WordPiece-based BERT tokenizer trained specifically for the Kyrgyz language. It was developed to support Kyrgyz NLP applications, including text classification, translation, and morphological analysis. The tokenizer was trained on a large corpus from various Kyrgyz text sources.

Model Details

Model Description

  • Developed by: Metinov Adilet
  • Funded by: Self-funded (MetinLab)
  • Shared by: metinovadilet
  • Model type: WordPiece Tokenizer (BERT-style)
  • Language(s) (NLP): Kyrgyz (ky)
  • License: MIT
  • Finetuned from model: N/A (trained from scratch)

Model Sources

  • Repository: metinovadilet/bert-kyrgyz-tokenizer
  • Paper: N/A
  • Demo: N/A

Uses

Direct Use

This tokenizer can be used directly for NLP tasks such as:

  • Tokenizing Kyrgyz texts for training language models

  • Preparing data for Kyrgyz BERT training or fine-tuning

  • Kyrgyz text segmentation and wordpiece-based analysis
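For example, here is a minimal sketch of preparing a training batch. The padding, truncation, and max_length settings below are illustrative assumptions, not values prescribed by this repository, and return_tensors="pt" assumes PyTorch is installed:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# Encode a small batch of Kyrgyz sentences into fixed-shape tensors
batch = tokenizer(
    ["Бул биринчи сүйлөм.", "Бул экинчи сүйлөм."],  # "This is the first/second sentence."
    padding=True,         # pad to the longest sequence in the batch
    truncation=True,      # cut off sequences longer than max_length
    max_length=128,       # illustrative limit, tune for your model
    return_tensors="pt",  # PyTorch tensors
)
print(batch["input_ids"].shape)  # (batch_size, sequence_length)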

Downstream Use

  • Can be used as the tokenizer for BERT-based models trained on Kyrgyz text

  • Supports various NLP applications like sentiment analysis, morphological modeling, and machine translation
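As a sketch of that workflow (the hidden size and layer counts below are illustrative assumptions, not the configuration of any released MetinLab model), the tokenizer's vocabulary size can seed a from-scratch BERT for masked-language-model pretraining:

from transformers import BertConfig, BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

# The model's embedding matrix must match the tokenizer's vocabulary size
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=768,         # illustrative, not an official setting
    num_hidden_layers=12,    # illustrative
    num_attention_heads=12,  # illustrative
)
model = BertForMaskedLM(config)  # randomly initialized, ready for pretraining
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")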

Out-of-Scope Use

  • This tokenizer is not optimized for multilingual text. It is designed for Kyrgyz-only corpora.

  • It may not work well for transliterated or mixed-script text (e.g., combining Latin and Cyrillic scripts).

Bias, Risks, and Limitations

  • The tokenizer is limited by the training corpus, meaning rare words, dialectal forms, and domain-specific terms may not be well-represented.

  • As with most tokenizers, it may exhibit biases from the source text, particularly in areas of gender, ethnicity, or socio-political context.

Recommendations

Users should be aware of potential biases and evaluate performance for their specific application. If biases or inefficiencies are found, fine-tuning or training with a more diverse corpus is recommended.
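One concrete way to run such an evaluation is to measure subword fertility (average WordPiece tokens per whitespace-separated word) and the unknown-token rate on a sample of your own text; high values on either metric suggest the vocabulary covers your domain poorly. A minimal sketch, assuming samples is replaced with sentences from your corpus:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

samples = ["Бул кыргыз тилинде жазылган текст."]  # replace with your own sentences

# Fertility: how many subword pieces the tokenizer needs per word
word_count = sum(len(s.split()) for s in samples)
pieces = [tok for s in samples for tok in tokenizer.tokenize(s)]

print("Fertility (pieces/word):", len(pieces) / word_count)
print("UNK rate:", pieces.count(tokenizer.unk_token) / len(pieces))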

How to Get Started with the Model

Use the code below to get started with the tokenizer.

from transformers import BertTokenizerFast

# Load the tokenizer from the Hugging Face Hub
tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

text = "Бул кыргыз тилинде жазылган текст."  # "This is a text written in Kyrgyz."

# Offset mapping requires a fast tokenizer; it maps each token to its character span
encoding = tokenizer(text, return_offsets_mapping=True)

print("Input Text:", text)
print("Tokens:", tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print("Token IDs:", encoding["input_ids"])
print("Offsets:", encoding["offset_mapping"])

Training Details and Training Data

The training corpus and training procedure are not publicly disclosable.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: NVIDIA RTX 3090
  • Hours used: ~1 hour
  • Compute Region: Central Asia
  • Carbon Emitted: ~0.1 kg CO2

Technical Specifications

Model Architecture and Objective

  • Architecture: WordPiece-based BERT tokenizer

  • Objective: Efficient tokenization for Kyrgyz NLP applications
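A short sketch for inspecting the vocabulary (the printed values depend on the released tokenizer files; a long derived word will typically split into ##-prefixed WordPiece continuations):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer")

print("Vocabulary size:", tokenizer.vocab_size)
print("Special tokens:", tokenizer.all_special_tokens)

# WordPiece marks word-internal pieces with a leading "##"
print(tokenizer.tokenize("кыргызстандыктар"))  # "citizens of Kyrgyzstan"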

Compute Infrastructure


Hardware

  • GPU: NVIDIA RTX 3090 (24 GB VRAM)
  • CPU: Intel Core i5-13400F

Software

  • Python 3.10

  • Transformers (Hugging Face)

  • Tokenizers (Hugging Face)

Citation

If you use this tokenizer, please cite:

@misc{bert-kyrgyz-tokenizer,
  author = {Metinov Adilet},
  title = {BERT Kyrgyz Tokenizer},
  year = {2025},
  url = {https://huggingface.co/metinovadilet/bert-kyrgyz-tokenizer},
  note = {Trained at MetinLab}
}

Model Card Contact

For questions or issues, reach out to MetinLab via:

Email: [email protected]

This model was made in collaboration with UlutsoftLLC.