ChocoLlama
A Llama-2/3-based family of Dutch language models
ChocoLlama-2-7B-instruct: Getting Started
We present ChocoLlama-2-7B-instruct, an instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO. Its base model, ChocoLlama-2-7B-base, is a language-adapted version of Meta's Llama-2-7b, fine-tuned on 32B Dutch Llama-2 tokens (104GB) using LoRA.
Use the code below to get started with the model.
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model; device_map="auto" places the weights on the available device(s).
tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/ChocoLlama-2-7B-instruct')
model = AutoModelForCausalLM.from_pretrained('ChocoLlama/ChocoLlama-2-7B-instruct', device_map="auto")

# Chat in role/content format: an optional system prompt followed by the user turn.
messages = [
    {"role": "system", "content": "Je bent een artificiële intelligentie-assistent en geeft behulpzame, gedetailleerde en beleefde antwoorden op de vragen van de gebruiker."},
    {"role": "user", "content": "Jacques Brel, Willem Elsschot en Jan Jambon zitten op café. Waar zouden ze over babbelen?"},
]

# Apply the model's chat template and append the assistant generation prompt.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop generating at the tokenizer's EOS token or at the <|eot_id|> token.
new_terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    eos_token_id=new_terminators,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)

# Keep only the newly generated tokens (drop the prompt) and decode them.
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
Note that the datasets used for instruction-tuning were translated using GPT-3.5/4, which means that this instruction-tuned model cannot be used for commercial purposes. For any commercial application, we therefore recommend fine-tuning the base model on your own Dutch data.
Model Details
ChocoLlama is a family of open LLMs specifically adapted to Dutch, contributing to the state of the art of Dutch open LLMs in their weight class.
We provide six variants, three base models and three instruction-tuned models:
- ChocoLlama-2-7B-base (link): A language-adapted version of Meta's Llama-2-7b, fine-tuned on 32B Dutch Llama-2 tokens (104GB) using LoRA.
- ChocoLlama-2-7B-instruct (link): An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
- ChocoLlama-2-7B-tokentrans-base (link): A language-adapted version of Meta's Llama-2-7b, using a Dutch RoBERTa-based tokenizer. The token embeddings of this model were reinitialized using the token translation algorithm proposed by Remy et al. The model was subsequently fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
- ChocoLlama-2-7B-tokentrans-instruct (link): An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
- Llama-3-ChocoLlama-8B-base (link): A language-adapted version of Meta's Llama-3-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
- Llama-3-ChocoLlama-instruct (link): An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
For benchmark results for all models, including comparisons with their base models and other Dutch LLMs, we refer to our paper (arXiv:2412.07633).
Model Description
- Developed by: Matthieu Meeus, Anthony Rathé
- Funded by: Vlaams Supercomputer Centrum, through a grant of approximately 40K GPU hours (NVIDIA A100-80GB)
- Language(s): Dutch
- License: cc-by-nc-4.0
- Finetuned from model: ChocoLlama-2-7B-base
Model Sources
- Repository: on GitHub here.
- Paper: on arXiv (arXiv:2412.07633).
Uses
Direct Use
This is an instruction-tuned (SFT + DPO) Dutch model, optimized for Dutch language generation in conversational settings. For optimal behavior, we advise using the model only with the correct chat template (see the Python code above), optionally together with a system prompt.
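As a minimal sketch of the advice above, the message list can be assembled by a small helper and rendered to text with the tokenizer's chat template, so that the exact prompt can be inspected before generation. The helper names `build_messages` and `render_prompt` are hypothetical and not part of the ChocoLlama code; `tokenizer` is assumed to be the Hugging Face tokenizer loaded in the getting-started snippet.

```python
from typing import Dict, List, Optional

# System prompt copied from the getting-started snippet in this model card.
DEFAULT_SYSTEM_PROMPT = (
    "Je bent een artificiële intelligentie-assistent en geeft behulpzame, "
    "gedetailleerde en beleefde antwoorden op de vragen van de gebruiker."
)


def build_messages(
    user_prompt: str,
    system_prompt: Optional[str] = DEFAULT_SYSTEM_PROMPT,
) -> List[Dict[str, str]]:
    """Hypothetical helper: build a chat in the role/content format."""
    messages: List[Dict[str, str]] = []
    if system_prompt:
        # The system prompt is optional; it is skipped when None or empty.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


def render_prompt(tokenizer, messages: List[Dict[str, str]]) -> str:
    """Hypothetical helper: return the templated prompt as a string.

    `tokenizer` is assumed to be a Hugging Face tokenizer that ships a chat
    template, such as the one loaded for ChocoLlama-2-7B-instruct.
    """
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,              # return text instead of token ids
        add_generation_prompt=True,  # end with the assistant turn marker
    )
```

Calling `render_prompt(tokenizer, build_messages("Wat is de hoofdstad van België?"))` and printing the result shows exactly what the model receives, which is a quick way to confirm that the chat template is being applied.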
Out-of-Scope Use
Use cases requiring understanding or generation of text in languages other than Dutch. The dataset on which this model was fine-tuned does not contain data in other languages, so we expect significant catastrophic forgetting to have occurred for English, the language on which Llama-2 was originally trained.
Bias, Risks, and Limitations
We have taken care to include only widely used and high-quality data in our dataset. Some of this data has been filtered by the original creators. However, we did not explicitly conduct any additional filtering of this dataset with regard to biased or otherwise harmful content.
Training Details
We adopt the same strategy as used to align GEITje-7B to GEITje-7B-ultra. First, we apply supervised finetuning (SFT), utilizing the data made available by Vanroy:
- BramVanroy/ultrachat_200k_dutch
- BramVanroy/no_robots_dutch
- BramVanroy/stackoverflow-chat-dutch
- BramVanroy/alpaca-cleaned-dutch
- BramVanroy/dolly-15k-dutch
Next, we apply Direct Preference Optimization (DPO) to the SFT version of all the pretrained models we develop, now utilizing a Dutch version of the data used to train Zephyr-7B-$\beta$, BramVanroy/ultra_feedback_dutch.
For both the SFT and DPO stages, we update all model weights and apply the same set of hyperparameters to all models as used in GEITje-7B-ultra:
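To make the DPO step more concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It is an illustration of the general objective, not the training code used for ChocoLlama; the value `beta=0.1` and the example log-probabilities are assumptions, since the model card does not report them.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not reported in this model card
) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more likely each response is under the policy than under the
    # frozen reference (SFT) model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # loss = -log(sigmoid(margin)), evaluated in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy already prefers the chosen answer.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

When the policy and the reference model assign identical log-probabilities, the margin is zero and the loss equals log 2; it decreases as the policy favors the chosen response more strongly than the reference does.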
- learning_rate: 5e-07
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 4
- total_train_batch_size: 64
- total_eval_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
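The total batch sizes in the list above follow from the per-device settings. A short check of that arithmetic, assuming the usual convention that the total training batch size is the per-device batch size times the number of devices times the gradient accumulation steps:

```python
# Values copied from the hyperparameter list above.
train_batch_size = 4             # per device
eval_batch_size = 4              # per device
num_devices = 4
gradient_accumulation_steps = 4

# Sequences contributing to one optimizer update.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
# Evaluation does not accumulate gradients, so only the devices multiply.
total_eval_batch_size = eval_batch_size * num_devices

print(total_train_batch_size, total_eval_batch_size)  # 64 16
```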
Further, we leverage the publicly available alignment handbook and use four NVIDIA A100 (80 GB) GPUs for both stages.
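The cosine schedule with a warmup ratio of 0.1 listed among the hyperparameters can be sketched as follows. This is a plain-Python illustration of the commonly used shape (linear warmup to the peak learning rate, then cosine decay to zero); whether the training code matches it step for step is an assumption, and `total_steps=1000` is a hypothetical value because the number of optimizer steps is not reported here.

```python
import math


def cosine_lr_with_warmup(
    step: int,
    total_steps: int,
    peak_lr: float = 5e-7,      # learning_rate from the list above
    warmup_ratio: float = 0.1,  # lr_scheduler_warmup_ratio from the list above
) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase that has been completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, cosine_lr_with_warmup(step, total_steps))
```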
Evaluation
Quantitative evaluation
We have evaluated our models on several industry-standard Dutch benchmarks, translated from their original versions. The results can be found in the table below, together with results from several other prominent Dutch models.
Model | ARC | HellaSwag | MMLU | TruthfulQA | Avg. |
---|---|---|---|---|---|
Llama-3-ChocoLlama-instruct | 0.48 | 0.66 | 0.49 | 0.49 | 0.53 |
llama-3-8B-rebatch | 0.44 | 0.64 | 0.46 | 0.48 | 0.51 |
llama-3-8B-instruct | 0.47 | 0.59 | 0.47 | 0.52 | 0.51 |
llama-3-8B | 0.44 | 0.64 | 0.47 | 0.45 | 0.50 |
Reynaerde-7B-Chat | 0.44 | 0.62 | 0.39 | 0.52 | 0.49 |
Llama-3-ChocoLlama-base | 0.45 | 0.64 | 0.44 | 0.44 | 0.49 |
zephyr-7b-beta | 0.43 | 0.58 | 0.43 | 0.53 | 0.49 |
geitje-7b-ultra | 0.40 | 0.66 | 0.36 | 0.49 | 0.48 |
ChocoLlama-2-7B-tokentrans-instruct | 0.45 | 0.62 | 0.34 | 0.42 | 0.46 |
mistral-7b-v0.1 | 0.43 | 0.58 | 0.37 | 0.45 | 0.46 |
ChocoLlama-2-7B-tokentrans-base | 0.42 | 0.61 | 0.32 | 0.43 | 0.45 |
ChocoLlama-2-7B-instruct | 0.36 | 0.57 | 0.33 | 0.45 | 0.43 |
ChocoLlama-2-7B-base | 0.35 | 0.56 | 0.31 | 0.43 | 0.41 |
llama-2-7b-chat-hf | 0.36 | 0.49 | 0.33 | 0.44 | 0.41 |
llama-2-7b-hf | 0.36 | 0.51 | 0.32 | 0.41 | 0.40 |
On average, Llama-3-ChocoLlama-instruct surpasses the previous state-of-the-art on these benchmarks.
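The Avg. column appears to be the unweighted mean of the four benchmark scores. A small sketch that recomputes it for a few rows copied from the table; because the per-benchmark scores are already rounded to two decimals, a recomputed mean can differ slightly from the reported value.

```python
# (ARC, HellaSwag, MMLU, TruthfulQA, reported Avg.) copied from the table above.
rows = {
    "Llama-3-ChocoLlama-instruct": (0.48, 0.66, 0.49, 0.49, 0.53),
    "ChocoLlama-2-7B-instruct": (0.36, 0.57, 0.33, 0.45, 0.43),
    "ChocoLlama-2-7B-base": (0.35, 0.56, 0.31, 0.43, 0.41),
    "llama-2-7b-hf": (0.36, 0.51, 0.32, 0.41, 0.40),
}


def mean_score(scores):
    """Unweighted mean of the benchmark scores."""
    return sum(scores) / len(scores)


for name, (*scores, reported) in rows.items():
    recomputed = mean_score(scores)
    print(f"{name}: recomputed {recomputed:.4f}, reported {reported:.2f}")
```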
Qualitative evaluation
In our paper, we also provide an additional qualitative evaluation of all models, which we empirically find more reliable. For details, we refer to the paper and to our benchmark ChocoLlama-Bench.
Compute Infrastructure
All ChocoLlama models have been trained on the compute cluster provided by the Flemish Supercomputer Center (VSC). We used 8 to 16 NVIDIA A100 GPUs with 80 GB of VRAM.
Citation
If you found this useful for your work, kindly cite our paper:
@article{meeus2024chocollama,
title={ChocoLlama: Lessons Learned From Teaching Llamas Dutch},
author={Meeus, Matthieu and Rath{\'e}, Anthony and Remy, Fran{\c{c}}ois and Delobelle, Pieter and Decorte, Jens-Joris and Demeester, Thomas},
journal={arXiv preprint arXiv:2412.07633},
year={2024}
}