---
license: apache-2.0
---

# Dutch-Llama Tokenizer

## Overview

The Dutch-Llama Tokenizer is a versatile tokenizer trained to handle a variety of languages and formats, including Dutch, English, Python code, Markdown, and general text. It was trained on a dataset drawn from diverse sources, which allows it to tokenize a wide range of text inputs effectively.

## Dataset Composition

The tokenizer was trained on a comprehensive dataset, including:

- MC4 Dutch and English texts (195M)
- English and Dutch Wikipedia (278M and 356M, respectively)
- Dutch and English book datasets (211M and 355M, respectively)
- Dutch news articles (256M)
- CodeParrot GitHub Python code (158M)
- CodeSearchNet Python code (126M)
- Markdown files with math markup (5.8M)
- arXiv scientific papers (169M)

## Tokenizer Settings

The tokenizer was trained using the `spm_train` command with the following settings (a sketch of a matching invocation follows the list):

- Model type: Byte Pair Encoding (BPE)
- Vocabulary size: 32,000
- Character coverage: 100%
- Digits are split and whitespace-only pieces are allowed
- Optimized for training on a very large corpus
- Byte fallback enabled, with Dutch (nl) and English (en) as accepted languages
- Special tokens and IDs for unknown, beginning of sentence, end of sentence, padding, and custom user-defined symbols
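
A hypothetical `spm_train` invocation matching these settings might look like the following. The input file, model prefix, and special-token IDs are illustrative assumptions, not the exact values used for this tokenizer:

```shell
# Sketch only: the input path, model prefix, and token IDs are assumptions.
spm_train \
  --input=corpus.txt \
  --model_prefix=dutch_llama \
  --model_type=bpe \
  --vocab_size=32000 \
  --character_coverage=1.0 \
  --split_digits=true \
  --allow_whitespace_only_pieces=true \
  --train_extremely_large_corpus=true \
  --byte_fallback=true \
  --accept_language=nl,en \
  --unk_id=0 --bos_id=1 --eos_id=2 --pad_id=3
```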

## Installation

To use the Dutch-Llama Tokenizer, ensure you have Python 3.10.12 or later installed. Then, install the Transformers library from Hugging Face:

```shell
pip install transformers
```

## Usage

First, import the `AutoTokenizer` class from the Transformers library and load the Dutch-Llama Tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yhavinga/dutch-llama-tokenizer")
```

To tokenize text, use the `tokenizer.tokenize` method. To convert tokens to IDs and decode them back to text, use `tokenizer.convert_tokens_to_ids` and `tokenizer.decode`, respectively:

```python
# Example text
text = "Steenvliegen of oevervliegen[2] (Plecoptera) 华为发布Mate60手机"

# Tokenization and decoding
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
decoded_text = tokenizer.decode(token_ids)

print(decoded_text)
```
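
Because the tokenizer was trained with byte fallback, characters not covered by the learned vocabulary (such as the Chinese text in the example) are broken into byte-level pieces rather than collapsed into the unknown token. You can see this by printing the tokens; byte-fallback pieces render as `<0x..>` tokens:

```python
# Out-of-vocabulary characters are encoded as byte tokens (e.g. <0xE5>)
# instead of being replaced by the unknown token.
print(tokens)
print(len(tokens), "tokens")
```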

## Dutch Tokenizer Arena

Compare the effectiveness of this tokenizer on different inputs at the Hugging Face Space: [Dutch Tokenizer Arena](https://huggingface.co/spaces/yhavinga/dutch-tokenizer-arena).

## Comparison with Other Tokenizers

The following table shows the number of tokens produced by the Dutch-Llama Tokenizer, the Mistral tokenizer, the GroNLP GPT-2 Dutch tokenizer, and the UL2 Dutch tokenizer on a variety of inputs.

| Input type   | Dutch-Llama (32k) | Mistral (32k) | GroNLP GPT-2 Dutch (40k) | UL2 Dutch (32k)   |
|--------------|-------------------|---------------|--------------------------|-------------------|
| Dutch news   | 440               | 658           | 408                      | 410               |
| English news | 414               | 404           | 565                      | 402               |
| Python code  | 566               | 582           | 767                      | 639 (no newlines) |
| LaTeX math   | 491               | 497           | 717                      | 666 (no newlines) |
| **Total**    | 1911              | 2141          | 2457                     | 2117              |
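
A comparison like the one above can be approximated with a short script that loads each tokenizer and counts tokens. The repository IDs below are assumptions for illustration (the exact checkpoints used for the table may differ, and some may require authentication), and the sample texts are placeholders:

```python
from transformers import AutoTokenizer

# Assumed repository IDs, for illustration; the table above may have
# used different checkpoints.
TOKENIZERS = [
    "yhavinga/dutch-llama-tokenizer",
    "mistralai/Mistral-7B-v0.1",
    "GroNLP/gpt2-small-dutch",
    "yhavinga/ul2-base-dutch",
]

# Placeholder samples; substitute the texts you want to compare.
SAMPLES = {
    "Dutch news": "Plaats hier een Nederlands nieuwsartikel.",
    "English news": "Place an English news article here.",
}

for repo_id in TOKENIZERS:
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    counts = {name: len(tokenizer.tokenize(text)) for name, text in SAMPLES.items()}
    print(repo_id, counts)
```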

🇳🇱 🇧🇪