---
license: apache-2.0
---

# Dutch-Llama Tokenizer

## Overview

The Dutch-Llama Tokenizer is a versatile tokenizer trained to handle a variety of languages and formats, including Dutch, English, Python code, Markdown, and general text. It was trained on a dataset drawn from diverse sources, which enables it to tokenize a wide range of text inputs effectively.

## Dataset Composition

The tokenizer was trained on a comprehensive dataset, including:

- MC4 Dutch and English texts (195M)
- English and Dutch Wikipedia (278M and 356M, respectively)
- Dutch and English book datasets (211M and 355M, respectively)
- Dutch news articles (256M)
- CodeParrot GitHub Python code (158M)
- CodeSearchNet Python code (126M)
- Markdown files with math markup (5.8M)
- arXiv scientific papers (169M)

## Tokenizer Settings

The tokenizer was trained with the `spm_train` command using the following settings (a reconstructed example invocation is sketched at the end of this card):

- Model type: Byte Pair Encoding (BPE)
- Vocabulary size: 32,000
- Character coverage: 100%
- Digits split into individual tokens, and whitespace-only pieces allowed
- Training mode optimized for an extremely large corpus
- Byte fallback enabled, with accepted languages Dutch (nl) and English (en)
- Special tokens and IDs for unknown, beginning of sentence, end of sentence, padding, and custom user-defined symbols

## Installation

To use the Dutch-Llama Tokenizer, ensure you have Python 3.10.12 or later installed. Then install the Transformers library from Hugging Face:

```shell
pip install transformers
```

## Usage

First, import `AutoTokenizer` from the Transformers library and load the Dutch-Llama Tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yhavinga/dutch-llama-tokenizer")
```

To tokenize text, use the `tokenizer.tokenize` method. To convert tokens to IDs and decode them back to text, use `tokenizer.convert_tokens_to_ids` and `tokenizer.decode`, respectively:

```python
# Example text mixing Dutch, a reference marker, and Chinese
text = "Steenvliegen of oevervliegen[2] (Plecoptera) 华为发布Mate60手机"

# Tokenize, convert tokens to ids, and decode back to text
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
decoded_text = tokenizer.decode(token_ids)
print(decoded_text)
```

## Dutch Tokenizer Arena

Compare the effectiveness of this tokenizer on different inputs in the Hugging Face Space: [Dutch Tokenizer Arena](https://huggingface.co/spaces/yhavinga/dutch-tokenizer-arena).

## Comparison with Other Tokenizers

The following table shows the number of tokens produced by the Dutch-Llama Tokenizer, the Mistral Tokenizer, the GroNLP GPT-2 Dutch Tokenizer, and the UL2 Dutch Tokenizer on a variety of inputs; fewer tokens means the tokenizer encodes the same text more compactly. A sketch for reproducing this kind of count is included at the end of this card.

| Input Type   | Dutch-Llama (32k) | Mistral (32k) | GroNLP GPT-2 Dutch (40k) | UL2 Dutch (32k)   |
|--------------|-------------------|---------------|--------------------------|-------------------|
| Dutch news   | 440               | 658           | 408                      | 410               |
| English news | 414               | 404           | 565                      | 402               |
| Python code  | 566               | 582           | 767                      | 639 (no newlines) |
| LaTeX math   | 491               | 497           | 717                      | 666 (no newlines) |
| **Total**    | 1911              | 2141          | 2457                     | 2117              |

🇳🇱 🇧🇪🐍📐
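As a rough way to reproduce the comparison above, the sketch below counts tokens for a sample text. The sample sentence is an assumption (the exact evaluation texts are not included in this card), and any additional repository ids you append to `repo_ids` for the other tokenizers are likewise assumptions that may require access rights.

```python
from transformers import AutoTokenizer

# Hypothetical sample text; the evaluation inputs behind the table above
# are not part of this card, so substitute your own Dutch/English samples.
sample = "Steenvliegen of oevervliegen (Plecoptera) zijn een orde van insecten."

# Only yhavinga/dutch-llama-tokenizer is documented in this card; append
# other tokenizer repositories here to compare token counts side by side.
repo_ids = ["yhavinga/dutch-llama-tokenizer"]

for repo_id in repo_ids:
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    print(f"{repo_id}: {len(tokenizer.tokenize(sample))} tokens")
```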
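For reference, here is a minimal sketch of an `spm_train` invocation matching the settings listed under Tokenizer Settings. The input path, model prefix, special-token IDs, and the absence of user-defined symbols are assumptions, not values taken from this card.

```shell
# Sketch only: --input, --model_prefix and the special-token ids below are
# assumptions; the actual training invocation may have differed.
spm_train \
  --input=corpus.txt \
  --model_prefix=dutch_llama \
  --model_type=bpe \
  --vocab_size=32000 \
  --character_coverage=1.0 \
  --split_digits=true \
  --allow_whitespace_only_pieces=true \
  --train_extremely_large_corpus=true \
  --byte_fallback=true \
  --accept_language=nl,en \
  --unk_id=0 --bos_id=1 --eos_id=2 --pad_id=3
```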