---
license: apache-2.0
---
# Dutch-Llama Tokenizer
## Overview
The Dutch-Llama Tokenizer is a versatile tokenizer trained to handle a variety of languages and formats, including Dutch, English, Python code, Markdown, and general text. It was trained on a dataset drawn from diverse sources, which allows it to tokenize a wide range of text inputs effectively.
## Dataset Composition
The tokenizer was trained on a comprehensive dataset, including:
- MC4 Dutch and English texts (195M)
- English and Dutch Wikipedia (278M and 356M, respectively)
- Dutch and English book datasets (211M and 355M, respectively)
- Dutch news articles (256M)
- CodeParrot GitHub Python code (158M)
- CodeSearchNet Python code (126M)
- Markdown files with math markup (5.8M)
- Arxiv scientific papers (169M)
## Tokenizer Settings
The tokenizer was trained using the `spm_train` command with the following settings (an equivalent Python training call is sketched after the list):
- Model Type: Byte Pair Encoding (BPE)
- Vocab Size: 32,000
- Character Coverage: 100%
- Support for splitting digits and whitespace-only pieces
- Optimized for large corpus training
- Byte Fallback and language acceptance for Dutch (nl) and English (en)
- Special tokens and IDs for unknown, beginning of sentence, end of sentence, padding, and custom user-defined symbols
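For reference, here is a minimal sketch of how these settings map onto the SentencePiece Python trainer. The corpus path, model prefix, and exact special-token IDs are placeholders and assumptions, not the actual values used for this tokenizer:
```python
import sentencepiece as spm

# A sketch based on the settings listed above; the input file, model prefix,
# and special-token IDs are assumptions, not the actual training setup.
spm.SentencePieceTrainer.train(
    input="corpus.txt",                 # placeholder for the mixed Dutch/English/code corpus
    model_prefix="dutch_llama",         # placeholder output name
    model_type="bpe",                   # Byte Pair Encoding
    vocab_size=32000,
    character_coverage=1.0,             # 100% character coverage
    split_digits=True,                  # split numbers into single-digit pieces
    allow_whitespace_only_pieces=True,  # keep whitespace-only pieces
    train_extremely_large_corpus=True,  # optimized for large corpus training
    byte_fallback=True,                 # fall back to bytes for out-of-vocabulary characters
    accept_language="nl,en",            # Dutch and English
    unk_id=0, bos_id=1, eos_id=2, pad_id=3,  # assumed IDs for the special tokens
    # user_defined_symbols=[...],       # custom symbols (not listed in this README)
)
```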
## Installation
To use the Dutch-Llama Tokenizer, ensure you have Python 3.10.12 or later installed. Then, install the Transformers library from Hugging Face:
```shell
pip install transformers
```
## Usage
First, import the `AutoTokenizer` from the Transformers library and load the Dutch-Llama Tokenizer:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("yhavinga/dutch-llama-tokenizer")
```
To tokenize text, use the `tokenizer.tokenize` method. To convert tokens to IDs and decode them back to text, use `tokenizer.convert_tokens_to_ids` and `tokenizer.decode`, respectively:
```python
# Example text: Dutch with a Chinese fragment, exercising byte fallback
text = "Steenvliegen of oevervliegen[2] (Plecoptera) 华为发布Mate60手机"
# Tokenize, map the tokens to IDs, then decode the IDs back to text
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
decoded_text = tokenizer.decode(token_ids)
print(decoded_text)
```
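The same round trip can be done in one call with `tokenizer.encode`, which is also a convenient way to count tokens:
```python
# One-step encoding; add_special_tokens=False counts only the content tokens
token_ids = tokenizer.encode(text, add_special_tokens=False)
print(len(token_ids), tokenizer.decode(token_ids))
```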
## Dutch Tokenizer Arena
Compare the effectiveness of this tokenizer on different inputs at the Hugging Face Space: [Dutch Tokenizer Arena](https://huggingface.co/spaces/yhavinga/dutch-tokenizer-arena).
## Comparison with Other Tokenizers
The following table shows the number of tokens produced by the Dutch-Llama Tokenizer, the Mistral Tokenizer, the GroNLP GPT-2 Dutch Tokenizer, and the UL2 Dutch Tokenizer on a variety of inputs. Fewer tokens means the text is encoded more compactly; a sketch for reproducing such a comparison follows the table.
| Input Type   | Dutch-Llama (32k) | Mistral (32k) | GroNLP GPT-2 Dutch (40k) | UL2 Dutch (32k)   |
|--------------|-------------------|---------------|--------------------------|-------------------|
| Dutch news   | 440               | 658           | 408                      | 410               |
| English news | 414               | 404           | 565                      | 402               |
| Python code  | 566               | 582           | 767                      | 639 (no newlines) |
| LaTeX math   | 491               | 497           | 717                      | 666 (no newlines) |
| **Total**    | 1911              | 2141          | 2457                     | 2117              |
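A comparison along these lines can be reproduced with a few lines of code. This is a sketch: the model IDs for the other three tokenizers are assumptions, and the exact benchmark texts used for the table above are not included in this README:
```python
from transformers import AutoTokenizer

# All model IDs except the first are assumptions about the tokenizers compared above
tokenizer_ids = [
    "yhavinga/dutch-llama-tokenizer",
    "mistralai/Mistral-7B-v0.1",  # assumed Mistral tokenizer
    "GroNLP/gpt2-small-dutch",    # assumed GroNLP GPT-2 Dutch tokenizer
    "yhavinga/ul2-base-dutch",    # assumed UL2 Dutch tokenizer
]

text = "Steenvliegen of oevervliegen[2] (Plecoptera) 华为发布Mate60手机"
for model_id in tokenizer_ids:
    tok = AutoTokenizer.from_pretrained(model_id)
    n = len(tok.encode(text, add_special_tokens=False))
    print(f"{model_id}: {n} tokens")
```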
๐Ÿ‡ณ๐Ÿ‡ฑ ๐Ÿ‡ง๐Ÿ‡ช๐Ÿ๐Ÿ“