|
--- |
|
library_name: transformers |
|
license: mit |
|
language: |
|
- ky |
|
pipeline_tag: fill-mask |
|
tags: |
|
- kyrgyz |
|
- low-resource-language |
|
- bert |
|
- nlp |
|
- masked-language-modeling |
|
--- |
|
|
|
# Model Card for Kyrgyz BERT Tokenizer |
|
|
|
|
|
This is a WordPiece-based BERT tokenizer trained specifically for the Kyrgyz language. It was developed to support Kyrgyz NLP applications, including text classification, translation, and morphological analysis. The tokenizer was trained on a large corpus from various Kyrgyz text sources. |
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
- **Developed by:** Metinov Adilet |
|
- **Funded by:** Self-funded (MetinLab)

- **Shared by:** metinovadilet
|
- **Model type:** WordPiece Tokenizer (BERT-style) |
|
- **Language(s) (NLP):** Kyrgyz (ky) |
|
- **License:** MIT |
|
- **Finetuned from model:** N/A (trained from scratch)
|
|
|
### Model Sources |
|
|
|
|
|
|
- **Repository:** metinovadilet/bert-kyrgyz-tokenizer |
|
- **Paper [optional]:** N/A |
|
- **Demo [optional]:** N/A |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
This tokenizer can be used directly for NLP tasks such as: |
|
|
|
- Tokenizing Kyrgyz texts for training language models |
|
|
|
- Preparing data for Kyrgyz BERT training or fine-tuning |
|
|
|
- Kyrgyz text segmentation and wordpiece-based analysis |
|
|
|
### Downstream Use
|
|
|
- Can be used as the tokenizer for BERT-based models trained on Kyrgyz text |
|
|
|
- Supports various NLP applications like sentiment analysis, morphological modeling, and machine translation |
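The masked-language-modeling objective that downstream BERT training uses can be sketched in plain Python. The function below is a simplified, hypothetical version of BERT's standard masking scheme (15% of non-special tokens are selected; of those, 80% become `[MASK]`, 10% a random token, 10% stay unchanged); the token ids and vocabulary size are made up for illustration:

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids, mlm_prob=0.15, seed=0):
    """BERT-style masking. Returns (masked_ids, labels), where labels is the
    original id at selected positions and -100 (ignored by the loss) elsewhere."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tid in token_ids:
        if tid not in special_ids and rng.random() < mlm_prob:
            labels.append(tid)          # predict the original token here
            r = rng.random()
            if r < 0.8:
                masked.append(mask_id)  # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.randrange(vocab_size))  # 10%: random token
            else:
                masked.append(tid)      # 10%: keep unchanged
        else:
            masked.append(tid)
            labels.append(-100)
    return masked, labels

ids = [2, 15, 37, 8, 92, 3]  # hypothetical: [CLS] ... [SEP]
masked, labels = mask_tokens(ids, mask_id=4, vocab_size=100, special_ids={2, 3})
print(masked, labels)
```

In practice, `transformers.DataCollatorForLanguageModeling` implements this scheme for you; the sketch only shows what happens to the token ids this tokenizer produces.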
|
|
|
### Out-of-Scope Use |
|
|
|
- This tokenizer is not optimized for multilingual text. It is designed for Kyrgyz-only corpora. |
|
|
|
- It may not work well for transliterated or mixed-script text (e.g., combining Latin and Cyrillic scripts). |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
- The tokenizer is limited by the training corpus, meaning rare words, dialectal forms, and domain-specific terms may not be well-represented. |
|
|
|
- As with most tokenizers, it may exhibit biases from the source text, particularly in areas of gender, ethnicity, or socio-political context. |
|
|
|
### Recommendations |
|
|
|
Users should be aware of potential biases and evaluate performance for their specific application. If biases or inefficiencies are found, fine-tuning or training with a more diverse corpus is recommended. |
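One simple coverage check before committing to this tokenizer is the fraction of tokens that map to `[UNK]` on a sample of target-domain text; a high rate suggests the vocabulary under-represents that domain. A minimal sketch (the helper name is ours, not part of any library):

```python
def unk_rate(token_ids, unk_id):
    """Fraction of tokens mapped to the unknown token -- a rough proxy for
    how well the vocabulary covers a target corpus or domain."""
    return sum(1 for t in token_ids if t == unk_id) / len(token_ids) if token_ids else 0.0

# With the released tokenizer, this would be used roughly as:
#   rate = unk_rate(tokenizer(text)["input_ids"], tokenizer.unk_token_id)
print(unk_rate([1, 2, 0, 3], unk_id=0))
```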
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to load the tokenizer and inspect how it segments Kyrgyz text:
|
```python
|
from transformers import BertTokenizerFast |
|
|
|
tokenizer = BertTokenizerFast.from_pretrained("metinovadilet/bert-kyrgyz-tokenizer") |
|
|
|
text = "Бул кыргыз тилинде жазылган текст." |
|
|
|
tokens = tokenizer(text, return_offsets_mapping=True) |
|
|
|
print("Input Text:", text) |
|
print("Tokens:", tokenizer.convert_ids_to_tokens(tokens['input_ids'])) |
|
print("Token IDs:", tokens['input_ids']) |
|
print("Offsets:", tokens['offset_mapping']) |
|
``` |
|
## Training Details and Training Data |
|
|
|
The training corpus cannot be disclosed.
|
|
|
## Environmental Impact |
|
|
|
|
|
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). |
|
|
|
- **Hardware Type:** NVIDIA RTX 3090 |
|
- **Hours used:** ~1 hour |
|
- **Compute Region:** Central Asia |
|
- **Carbon Emitted:** ~0.1 kg CO2 |
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
|
|
- Architecture: WordPiece-based BERT tokenizer |
|
|
|
- Objective: Efficient tokenization for Kyrgyz NLP applications |
|
|
|
### Compute Infrastructure |
|
|
|
|
|
|
#### Hardware |
|
|
|
- GPU: NVIDIA RTX 3090 (24GB VRAM) |
|
- CPU: Intel Core i5-13400F
|
|
|
#### Software |
|
|
|
- Python 3.10 |
|
|
|
- Transformers (Hugging Face) |
|
|
|
- Tokenizers (Hugging Face) |
|
|
|
## Citation
|
|
|
If you use this tokenizer, please cite: |
|
```bibtex
|
@misc{bert-kyrgyz-tokenizer, |
|
author = {Metinov Adilet}, |
|
title = {BERT Kyrgyz Tokenizer}, |
|
year = {2025}, |
|
url = {https://huggingface.co/metinovadilet/bert-kyrgyz-tokenizer}, |
|
note = {Trained at MetinLab} |
|
} |
|
``` |
|
## Model Card Contact |
|
|
|
For questions or issues, reach out to MetinLab via: |
|
|
|
Email: [email protected] |
|
|
|
## Acknowledgements

This model was made in collaboration with UlutsoftLLC.