---
license: apache-2.0
---

# Dutch-Llama Tokenizer

## Overview
The Dutch-Llama Tokenizer is a versatile tokenizer trained to handle a variety of languages and formats, including Dutch, English, Python code, Markdown, and general text. It was trained on a dataset drawn from diverse sources, which makes it effective across a wide range of text inputs.

## Dataset Composition
The tokenizer was trained on a comprehensive dataset, including:
- MC4 Dutch and English texts (195M)
- English and Dutch Wikipedia (278M and 356M, respectively)
- Dutch and English book datasets (211M and 355M, respectively)
- Dutch news articles (256M)
- CodeParrot GitHub Python code (158M)
- CodeSearchNet Python code (126M)
- Markdown files with math markup (5.8M)
- arXiv scientific papers (169M)

## Tokenizer Settings
The tokenizer was trained using the `spm_train` command with the following settings (see the sketch after this list):
- Model type: Byte-Pair Encoding (BPE)
- Vocabulary size: 32,000
- Character coverage: 100%
- Digit splitting and whitespace-only pieces enabled
- Optimized for training on a large corpus
- Byte fallback enabled, with accepted languages Dutch (nl) and English (en)
- Special tokens and IDs for unknown, beginning of sentence, end of sentence, padding, and custom user-defined symbols
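
The exact training invocation is not included in this card. Below is a minimal sketch via the SentencePiece Python API, assuming default Llama-style special-token IDs; the input path, model prefix, and user-defined symbols are illustrative placeholders, not the values actually used:
```python
import sentencepiece as spm

# Hedged reconstruction of the settings described above.
# `input`, `model_prefix`, and `user_defined_symbols` are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.txt",                      # placeholder corpus path
    model_prefix="dutch-llama",              # placeholder output prefix
    model_type="bpe",
    vocab_size=32000,
    character_coverage=1.0,
    split_digits=True,
    allow_whitespace_only_pieces=True,
    train_extremely_large_corpus=True,
    byte_fallback=True,
    accept_language="nl,en",
    unk_id=0, bos_id=1, eos_id=2, pad_id=3,  # assumed Llama-style ID layout
    user_defined_symbols=["<sym0>", "<sym1>"],  # placeholder custom symbols
)
```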

## Installation
To use the Dutch-Llama Tokenizer, ensure you have Python 3.10.12 or later installed. Then, install the Transformers library from Hugging Face:
```shell
pip install transformers
```

## Usage
First, import the `AutoTokenizer` from the Transformers library and load the Dutch-Llama Tokenizer:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("yhavinga/dutch-llama-tokenizer")
```
To tokenize text, use the `tokenizer.tokenize` method. To convert tokens to IDs and decode them back to text, use `tokenizer.convert_tokens_to_ids` and `tokenizer.decode`, respectively:
```python
# Example text
text = "Steenvliegen of oevervliegen[2] (Plecoptera) 华为发布Mate60手机"

# Tokenization and decoding
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
decoded_text = tokenizer.decode(token_ids)

print(decoded_text)
```
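
The same round trip can also be done directly with `tokenizer.encode`, skipping the intermediate token strings. Thanks to byte fallback, characters outside the Dutch and English training data (such as the Chinese in the example above) should still survive encoding and decoding:
```python
# Encode straight to IDs and decode back; byte fallback covers
# characters that have no dedicated vocabulary entry.
token_ids = tokenizer.encode(text)
print(tokenizer.decode(token_ids))
```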

## Dutch Tokenizer Arena
Compare the effectiveness of this tokenizer on different inputs at the Hugging Face Space: [Dutch Tokenizer Arena](https://huggingface.co/spaces/yhavinga/dutch-tokenizer-arena).

## Comparison with Other Tokenizers

The following table shows the number of tokens produced by the Dutch-Llama Tokenizer, the Mistral Tokenizer, the GroNLP GPT-2 Dutch Tokenizer, and the UL2 Dutch Tokenizer on a variety of inputs; fewer tokens means a more compact encoding.

| Input Type   | Dutch-Llama (32k) | Mistral (32k) | GroNLP GPT-2 Dutch (40k) | UL2 Dutch (32k)   |
|--------------|-------------------|---------------|--------------------------|-------------------|
| Dutch news   | 440               | 658           | 408                      | 410               |
| English news | 414               | 404           | 565                      | 402               |
| Python code  | 566               | 582           | 767                      | 639 (no newlines) |
| LaTeX math   | 491               | 497           | 717                      | 666 (no newlines) |
| **Total**    | 1911              | 2141          | 2457                     | 2117              |
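
A hedged sketch of how such a comparison could be reproduced is shown below. The baseline repo IDs are assumptions (they are not stated in this card), the UL2 Dutch tokenizer is omitted because its repo ID is unknown here, and exact counts depend on the input text and tokenizer versions:
```python
from transformers import AutoTokenizer

# Repo IDs other than the first are assumptions, not taken from this card.
tokenizers = {
    "Dutch-Llama (32k)": "yhavinga/dutch-llama-tokenizer",
    "Mistral (32k)": "mistralai/Mistral-7B-v0.1",           # assumed repo ID
    "GroNLP GPT-2 Dutch (40k)": "GroNLP/gpt2-small-dutch",  # assumed repo ID
}

sample = "Steenvliegen of oevervliegen (Plecoptera) zijn een orde van insecten."

for name, repo in tokenizers.items():
    tok = AutoTokenizer.from_pretrained(repo)
    count = len(tok(sample, add_special_tokens=False)["input_ids"])
    print(f"{name}: {count} tokens")
```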


🇳🇱 🇧🇪🐍📐