---
library_name: transformers
tags:
- chocolatine
- dpo
license: apache-2.0
datasets:
- jpacifico/french-orca-dpo-pairs-revised
language:
- fr
- en
---
### Chocolatine-2-14B-Instruct-v2.0.3

DPO fine-tuning of the merged model [jpacifico/Chocolatine-2-merged-qwen25arch](https://huggingface.co/jpacifico/Chocolatine-2-merged-qwen25arch) (Qwen-2.5-14B architecture)  
using the [jpacifico/french-orca-dpo-pairs-revised](https://huggingface.co/datasets/jpacifico/french-orca-dpo-pairs-revised) RLHF dataset.  
Training in French also improves the model's overall capabilities.  
  
> [!TIP]
> Context window: up to 128K tokens
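
For reference, the sketch below shows how a comparable DPO fine-tune could be set up with the [TRL](https://github.com/huggingface/trl) library. It is only an illustration: the hyperparameters, the dataset column mapping, and the exact TRL argument names are assumptions, not the actual training recipe used for this model.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Base merged model named in this card (assumed starting point for the sketch)
base_model = "jpacifico/Chocolatine-2-merged-qwen25arch"
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# French DPO preference pairs used for this fine-tune
dataset = load_dataset("jpacifico/french-orca-dpo-pairs-revised", split="train")

# DPOTrainer expects "prompt", "chosen" and "rejected" columns;
# the "question" column name below is an assumption about the dataset schema
dataset = dataset.map(lambda row: {"prompt": row["question"]})

# Hyperparameters here are placeholders, not the values used for Chocolatine-2
training_args = DPOConfig(
    output_dir="chocolatine-2-dpo",
    beta=0.1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
)
trainer.train()
```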


### OpenLLM Leaderboard

Chocolatine-2 is the best-performing fine-tuned 14B model (tied for first, with an average score of 41.08) on the [OpenLLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard).  
[Updated 2025-02-12]  

|      Metric       |Value|
|-------------------|----:|
|**Avg.**               |**41.08**|
|IFEval     |70.37|
|BBH        |50.63|
|MATH Lvl 5 |40.56|
|GPQA       |17.23|
|MuSR       |19.07|
|MMLU-PRO   |48.60|


### LLM Leaderboard FR

Ranked in the top 3 across all categories on the French government's [Leaderboard LLM FR](https://huggingface.co/spaces/fr-gouv-coordination-ia/llm_leaderboard_fr#/).

![image/png](https://github.com/jpacifico/Chocolatine-LLM/blob/main/Assets/Leaderboard_FR_14022025.png?raw=true)  

[Updated 2025-02-15]  

### MT-Bench-French

Chocolatine-2 outperforms its previous versions and its Qwen-2.5 base model on [MT-Bench-French](https://huggingface.co/datasets/bofenghuang/mt-bench-french), evaluated with [multilingual-mt-bench](https://github.com/Peter-Devine/multilingual_mt_bench) and GPT-4-Turbo as the LLM judge.  
My goal was to match GPT-4o-mini's performance in French; according to this benchmark, this version equals the OpenAI model.

```
########## First turn ##########
                                             score
model                                 turn        
gpt-4o-mini                           1     9.287500
Chocolatine-2-14B-Instruct-v2.0.3     1     9.112500
Qwen2.5-14B-Instruct                  1     8.887500
Chocolatine-14B-Instruct-DPO-v1.2     1     8.612500
Phi-3.5-mini-instruct                 1     8.525000
Chocolatine-3B-Instruct-DPO-v1.2      1     8.375000
DeepSeek-R1-Distill-Qwen-14B          1     8.375000
phi-4                                 1     8.300000
Phi-3-medium-4k-instruct              1     8.225000
gpt-3.5-turbo                         1     8.137500
Chocolatine-3B-Instruct-DPO-Revised   1     7.987500
Meta-Llama-3.1-8B-Instruct            1     7.050000
vigostral-7b-chat                     1     6.787500
Mistral-7B-Instruct-v0.3              1     6.750000
gemma-2-2b-it                         1     6.450000

########## Second turn ##########
                                               score
model                                 turn
Chocolatine-2-14B-Instruct-v2.0.3     2     9.050000         
gpt-4o-mini                           2     8.912500
Qwen2.5-14B-Instruct                  2     8.912500
Chocolatine-14B-Instruct-DPO-v1.2     2     8.337500
DeepSeek-R1-Distill-Qwen-14B          2     8.200000
phi-4                                 2     8.131250
Chocolatine-3B-Instruct-DPO-Revised   2     7.937500
Chocolatine-3B-Instruct-DPO-v1.2      2     7.862500
Phi-3-medium-4k-instruct              2     7.750000
gpt-3.5-turbo                         2     7.679167
Phi-3.5-mini-instruct                 2     7.575000
Meta-Llama-3.1-8B-Instruct            2     6.787500
Mistral-7B-Instruct-v0.3              2     6.500000
vigostral-7b-chat                     2     6.162500
gemma-2-2b-it                         2     6.100000

########## Average ##########
                                          score
model                                          
gpt-4o-mini                            9.100000
Chocolatine-2-14B-Instruct-v2.0.3      9.081250
Qwen2.5-14B-Instruct                   8.900000
Chocolatine-14B-Instruct-DPO-v1.2      8.475000
DeepSeek-R1-Distill-Qwen-14B           8.287500
phi-4                                  8.215625
Chocolatine-3B-Instruct-DPO-v1.2       8.118750
Phi-3.5-mini-instruct                  8.050000
Phi-3-medium-4k-instruct               7.987500
Chocolatine-3B-Instruct-DPO-Revised    7.962500
gpt-3.5-turbo                          7.908333
Meta-Llama-3.1-8B-Instruct             6.918750
Mistral-7B-Instruct-v0.3               6.625000
vigostral-7b-chat                      6.475000
gemma-2-2b-it                          6.275000
```

### Usage

You can run this model using my [Colab notebook](https://github.com/jpacifico/Chocolatine-LLM/blob/main/Chocolatine_14B_inference_test_colab.ipynb) 

You can also run Chocolatine-2 using the following code:

```python
import transformers
from transformers import AutoTokenizer

# Model repository on the Hugging Face Hub
model_name = "jpacifico/Chocolatine-2-14B-Instruct-v2.0.3"

# Format the prompt with the model's chat template
message = [
    {"role": "system", "content": "You are a helpful assistant chatbot."},
    {"role": "user", "content": "What is a Large Language Model?"}
]
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = tokenizer.apply_chat_template(message, add_generation_prompt=True, tokenize=False)

# Create the text-generation pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer
)

# Generate text
sequences = pipeline(
    prompt,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    num_return_sequences=1,
    max_length=200,
)
print(sequences[0]['generated_text'])
```
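
If you prefer calling `generate()` directly (for instance to control `max_new_tokens` and the dtype/device placement of the 14B weights), a minimal variant could look like the sketch below; the bfloat16 dtype and `device_map="auto"` settings are assumptions to adjust to your hardware.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "jpacifico/Chocolatine-2-14B-Instruct-v2.0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # assumed dtype; adjust to your GPU
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant chatbot."},
    {"role": "user", "content": "What is a Large Language Model?"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```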

### Limitations

The Chocolatine-2 model series is a quick demonstration that a base model can be easily fine-tuned to achieve compelling performance.  
It does not have any moderation mechanism.  

- **Developed by:** Jonathan Pacifico, 2025  
- **Model type:** LLM 
- **Language(s) (NLP):** French, English  
- **License:** Apache-2.0  

Made with ❤️ in France