Improve eval mainly
README.md
CHANGED
````diff
@@ -19,7 +19,7 @@ Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama
 # Model Information
 
 - Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens
-- 128k context length (approximately 80,000 Greek words)
+- 128k context length (**approximately 80,000 Greek words**)
 - We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language by utilizing a large training corpus.
 * This corpus includes 56.7 billion monolingual Greek tokens, constructed from publicly available resources.
 * Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (21 billion tokens) and Greek-English parallel data (5.5 billion tokens).
````
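As a quick check of the vocabulary extension described in this hunk, the released tokenizer can be inspected along the following lines (an illustrative sketch, not part of the README; the Greek sample sentence is arbitrary):

```python
from transformers import AutoTokenizer

# Load the Krikri tokenizer from the Hub (assumes network access to huggingface.co).
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")

# Vocabulary size after the Greek extension of the Llama-3.1 tokenizer.
print(len(tokenizer))

# Greek text should segment into comparatively few tokens thanks to the added Greek vocabulary.
sample = "Η επεξεργασία φυσικής γλώσσας είναι ένα συναρπαστικό πεδίο."
print(tokenizer.tokenize(sample))
```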
````diff
@@ -41,6 +41,7 @@ Chosen subsets of the 91 billion corpus were upsampled resulting in a size of **
 
 # How to use
 
+## With Transformers
 
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
````
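The Transformers snippet is only partly visible in this diff (the import above and the generate/decode calls in the following hunk's context). A self-contained sketch consistent with those visible lines might look as follows; the dtype, device placement, decoding setting, and the Greek prompt are illustrative assumptions, not taken from the README:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ilsp/Llama-Krikri-8B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# bfloat16 and device_map="auto" are illustrative choices.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Plain completion prompt, since Llama-Krikri-8B-Base is a base (non-instruct) model.
input_text = tokenizer("Η πρωτεύουσα της Ελλάδας είναι", return_tensors="pt").to(model.device)
# The do_sample value is truncated in the hunk header; greedy decoding is used here.
outputs = model.generate(input_text["input_ids"], max_new_tokens=256, do_sample=False)

print(tokenizer.batch_decode(outputs)[0])
```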
````diff
@@ -58,7 +59,7 @@ outputs = model.generate(input_text['input_ids'], max_new_tokens=256, do_sample=
 print(tokenizer.batch_decode(outputs)[0])
 ```
 
-
+## With OpenAI compatible server via vLLM
 
 ```bash
 vllm serve ilsp/Llama-Krikri-8B-Base \
````
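The next hunk's header shows `print(response.choices[0].text)`, which suggests the vLLM server is queried through the OpenAI completions API. A minimal client sketch, assuming the server runs locally on vLLM's default port 8000 and that no real API key is required:

```python
from openai import OpenAI

# Points at the OpenAI-compatible endpoint exposed by `vllm serve ilsp/Llama-Krikri-8B-Base`.
# Host, port, and the dummy API key are assumptions about a local deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="ilsp/Llama-Krikri-8B-Base",
    prompt="Η πρωτεύουσα της Ελλάδας είναι",
    max_tokens=128,
)
print(response.choices[0].text)
```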
````diff
@@ -86,8 +87,14 @@ print(response.choices[0].text)
 
 # Evaluation
 
+Below, we report improvements of Llama-Krikri-8B-Base over Llama-3.1-8B for Greek and English:
+- **+10.8%** on Greek benchmarks
+- **+0.8%** on English benchmarks
+
+Our evaluations for Llama-Krikri-8B-Base, Llama-3.1-8B, and Meltemi 7B v1.5 are performed in a few-shot setting, consistent with the settings in the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
+
+## Greek Benchmarks
 
-## Greek Benchmarks
 
 The evaluation suite we created for the Greek language includes 6 test sets. You can run the suite by cloning this [lighteval fork](https://github.com/LeonVouk/lighteval).
 
````
````diff
@@ -96,7 +103,7 @@ Our evaluation suite includes:
 * An existing benchmark for question answering in Greek ([Belebele](https://arxiv.org/abs/2308.16884))
 * A novel benchmark created by the ILSP team for medical question answering based on the medical exams of [DOATAP](https://www.doatap.gr) ([Medical MCQA](https://huggingface.co/datasets/ilsp/medical_mcqa_greek)).
 
-
+We can see that our training enhances performance across all Greek test sets, with an average improvement of **+10.8%**. The results for the Greek test sets are shown in the following table:
 
 | | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
 |----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
````
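For readers who want to look at the novel Medical MCQA benchmark referenced above, it can be pulled from the Hub with `datasets`; a minimal sketch (split and column names are not shown in this diff, so inspect the dataset card before relying on them):

```python
from datasets import load_dataset

# Loads the ILSP Medical MCQA benchmark referenced in the evaluation suite.
ds = load_dataset("ilsp/medical_mcqa_greek")
print(ds)  # shows the available splits and features
```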
````diff
@@ -107,6 +114,8 @@ Our evaluation for Llama-Krikri-8B is performed in a few-shot setting, consisten
 
 ## English Benchmarks
 
+We can also see that our training methodology not only mitigates catastrophic forgetting effectively, but also improves average performance across all English test sets by **+0.8%**. The results for the English test sets are shown in the following table:
+
 | | Winogrande (5-shot) | Belebele (5-shot) | HellaSwag (10-shot) | ARC-Challenge (25-shot) | TruthfulQA MC2 (0-shot) | MMLU (5-shot) | Average |
 |----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
 | Meltemi 7B v1.5 | 73.4% | 77.7% | 79.6% | 54.1% | 40.5% | 56.9% | 63.7% |
````
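The Average column appears to be the plain arithmetic mean of the six task scores; the Meltemi 7B v1.5 row above checks out under that assumption (a small verification sketch, assuming an unweighted mean):

```python
# Meltemi 7B v1.5 English scores from the table above (in %).
scores = [73.4, 77.7, 79.6, 54.1, 40.5, 56.9]
average = sum(scores) / len(scores)
print(f"{average:.1f}%")  # 63.7%, matching the Average column
```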