---
license: apache-2.0
---

# Dutch-Llama Tokenizer

## Overview

The Dutch-Llama Tokenizer is a versatile tokenizer trained to handle a variety of languages and formats, including Dutch, English, Python code, Markdown, and general text. It was trained on a dataset drawn from diverse sources, which allows it to tokenize a wide range of text inputs effectively.

## Dataset Composition

The tokenizer was trained on a comprehensive dataset, including:

- MC4 Dutch and English texts (195M)
- English and Dutch Wikipedia (278M and 356M, respectively)
- Dutch and English book datasets (211M and 355M, respectively)
- Dutch news articles (256M)
- CodeParrot GitHub Python code (158M)
- CodeSearchNet Python code (126M)
- Markdown files with math markup (5.8M)
- arXiv scientific papers (169M)

## Tokenizer Settings

The tokenizer was trained using the `spm_train` command with the following settings (a sketch of a matching invocation follows the list):

- Model type: Byte Pair Encoding (BPE)
- Vocabulary size: 32,000
- Character coverage: 100%
- Digits are split and whitespace-only pieces are allowed
- Optimized for training on a very large corpus
- Byte fallback enabled, with Dutch (nl) and English (en) as accepted languages
- Special tokens and IDs for unknown, beginning of sentence, end of sentence, padding, and custom user-defined symbols
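
A hypothetical `spm_train` invocation matching these settings might look like the following. The input file, model prefix, and special-token IDs are illustrative assumptions, not the exact values used for this tokenizer:

```shell
# Sketch only: the input path, model prefix, and token IDs are assumptions.
spm_train \
  --input=corpus.txt \
  --model_prefix=dutch_llama \
  --model_type=bpe \
  --vocab_size=32000 \
  --character_coverage=1.0 \
  --split_digits=true \
  --allow_whitespace_only_pieces=true \
  --train_extremely_large_corpus=true \
  --byte_fallback=true \
  --accept_language=nl,en \
  --unk_id=0 --bos_id=1 --eos_id=2 --pad_id=3
```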

## Installation

To use the Dutch-Llama Tokenizer, ensure you have Python 3.10.12 or later installed. Then, install the Transformers library from Hugging Face:

```shell
pip install transformers
```

## Usage

First, import the `AutoTokenizer` class from the Transformers library and load the Dutch-Llama Tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yhavinga/dutch-llama-tokenizer")
```

To tokenize text, use the `tokenizer.tokenize` method. To convert tokens to IDs and decode them back to text, use `tokenizer.convert_tokens_to_ids` and `tokenizer.decode`, respectively:

```python
# Example text
text = "Steenvliegen of oevervliegen[2] (Plecoptera) 华为发布Mate60手机"

# Tokenization and decoding
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
decoded_text = tokenizer.decode(token_ids)

print(decoded_text)
```
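
Because the tokenizer was trained with byte fallback, characters not covered by the learned vocabulary (such as the Chinese text in the example) are broken into byte-level pieces rather than collapsed into the unknown token. You can see this by printing the tokens; byte-fallback pieces render as `<0x..>` tokens:

```python
# Out-of-vocabulary characters are encoded as byte tokens (e.g. <0xE5>)
# instead of being replaced by the unknown token.
print(tokens)
print(len(tokens), "tokens")
```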

## Dutch Tokenizer Arena

Compare the effectiveness of this tokenizer on different inputs at the Hugging Face Space: [Dutch Tokenizer Arena](https://huggingface.co/spaces/yhavinga/dutch-tokenizer-arena).

## Comparison with Other Tokenizers

The following table shows the number of tokens produced by the Dutch-Llama Tokenizer, the Mistral tokenizer, the GroNLP GPT-2 Dutch tokenizer, and the UL2 Dutch tokenizer on a variety of inputs.

| Input type   | Dutch-Llama (32k) | Mistral (32k) | GroNLP GPT-2 Dutch (40k) | UL2 Dutch (32k)   |
|--------------|-------------------|---------------|--------------------------|-------------------|
| Dutch news   | 440               | 658           | 408                      | 410               |
| English news | 414               | 404           | 565                      | 402               |
| Python code  | 566               | 582           | 767                      | 639 (no newlines) |
| LaTeX math   | 491               | 497           | 717                      | 666 (no newlines) |
| **Total**    | 1911              | 2141          | 2457                     | 2117              |
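
A comparison like the one above can be approximated with a short script that loads each tokenizer and counts tokens. The repository IDs below are assumptions for illustration (the exact checkpoints used for the table may differ, and some may require authentication), and the sample texts are placeholders:

```python
from transformers import AutoTokenizer

# Assumed repository IDs, for illustration; the table above may have
# used different checkpoints.
TOKENIZERS = [
    "yhavinga/dutch-llama-tokenizer",
    "mistralai/Mistral-7B-v0.1",
    "GroNLP/gpt2-small-dutch",
    "yhavinga/ul2-base-dutch",
]

# Placeholder samples; substitute the texts you want to compare.
SAMPLES = {
    "Dutch news": "Plaats hier een Nederlands nieuwsartikel.",
    "English news": "Place an English news article here.",
}

for repo_id in TOKENIZERS:
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    counts = {name: len(tokenizer.tokenize(text)) for name, text in SAMPLES.items()}
    print(repo_id, counts)
```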

🇳🇱 🇧🇪