droussis committed · verified
Commit 4e4f7dc · Parent(s): 7e877a9

Improve eval mainly

Files changed (1):
  1. README.md +13 -4
README.md CHANGED
@@ -19,7 +19,7 @@ Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama
 # Model Information
 
 - Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens
- - 128k context length (approximately 80,000 Greek words)
+ - 128k context length (**approximately 80,000 Greek words**)
 - We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language, by utilizing a large training corpus.
   * This corpus includes 56.7 billion monolingual Greek tokens, constructed from publicly available resources.
   * Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (21 billion tokens) and Greek-English parallel data (5.5 billion tokens).
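The changed bullet's estimate of a 128k-token context covering roughly 80,000 Greek words implies about 1.6 tokens per Greek word with the extended tokenizer. A minimal sketch for checking that ratio on your own text, assuming the released `ilsp/Llama-Krikri-8B-Base` tokenizer and an illustrative sentence:

```python
# Editor's sketch: rough tokens-per-word ratio of the extended tokenizer on Greek text.
# The sample sentence is illustrative; any Greek passage works.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")

text = "Η τεχνητή νοημοσύνη αλλάζει τον τρόπο που επεξεργαζόμαστε την ελληνική γλώσσα."
n_tokens = len(tokenizer.tokenize(text))
n_words = len(text.split())

# 128_000 / 80_000 = 1.6 tokens per word is the ratio the README's estimate implies.
print(f"{n_tokens} tokens / {n_words} words = {n_tokens / n_words:.2f} tokens per word")
```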
@@ -41,6 +41,7 @@ Chosen subsets of the 91 billion corpus were upsampled resulting in a size of **
 
 # How to use
 
+ ## With Transformers
 
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -58,7 +59,7 @@ outputs = model.generate(input_text['input_ids'], max_new_tokens=256, do_sample=
 print(tokenizer.batch_decode(outputs)[0])
 ```
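The diff shows only the first and last lines of the Transformers snippet. A minimal end-to-end sketch consistent with those visible lines follows; the model id is taken from the vLLM command further down, while the dtype, device placement, and prompt are assumptions rather than the README's exact code:

```python
# Editor's sketch, not the README's exact snippet: load the base model and generate a continuation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ilsp/Llama-Krikri-8B-Base"  # repo id taken from the vLLM serve example
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# The visible last lines pass input_text['input_ids'] to generate(), which matches
# tokenizing the prompt with return_tensors="pt".
input_text = tokenizer("Η Ακρόπολη των Αθηνών είναι", return_tensors="pt").to(model.device)
outputs = model.generate(input_text["input_ids"], max_new_tokens=256, do_sample=True)
print(tokenizer.batch_decode(outputs)[0])
```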
 
- # How to serve with OpenAI compatible server via vLLM
+ ## With OpenAI compatible server via vLLM
 
 ```bash
 vllm serve ilsp/Llama-Krikri-8B-Base \
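The serve command and the README's client snippet are truncated by the diff; only `print(response.choices[0].text)` survives in the next hunk header. A minimal client sketch against the local OpenAI-compatible endpoint, assuming vLLM's default port 8000 and the `openai` Python package:

```python
# Editor's sketch: query the vLLM OpenAI-compatible server started above.
# Port 8000 and the dummy API key are assumptions (vLLM default / placeholder).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="ilsp/Llama-Krikri-8B-Base",
    prompt="Η Ακρόπολη των Αθηνών είναι",
    max_tokens=128,
)
print(response.choices[0].text)  # matches the last line shown in the next hunk header
```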
@@ -86,8 +87,14 @@ print(response.choices[0].text)
 
 # Evaluation
 
- ## Greek Benchmarks
+ Below, we report improvements of Llama-Krikri-8B-Base over Llama-3.1-8B for Greek and English:
+ - **+10.8%** on Greek benchmarks
+ - **+0.8%** on English benchmarks
+
+ Our evaluations for Llama-Krikri-8B-Base, Llama-3.1-8B, and Meltemi 7B v1.5 are performed in a few-shot setting, consistent with the settings in the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
+
+ ## Greek Benchmarks
 
 The evaluation suite we created for the Greek language includes 6 test sets. You can run the suite by cloning this [lighteval fork](https://github.com/LeonVouk/lighteval).
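A minimal sketch for fetching the suite locally, assuming the fork installs like upstream lighteval; the exact launch command and the task names for the 6 Greek test sets are documented in the fork itself and are not reproduced here:

```bash
# Editor's sketch: fetch and install the lighteval fork that ships the Greek test sets.
# The editable install mirrors upstream lighteval; consult the fork's README for the
# command that actually launches the Greek evaluation suite.
git clone https://github.com/LeonVouk/lighteval.git
cd lighteval
pip install -e .
lighteval --help  # inspect the available subcommands and options
```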
 
@@ -96,7 +103,7 @@ Our evaluation suite includes:
 * An existing benchmark for question answering in Greek ([Belebele](https://arxiv.org/abs/2308.16884))
 * A novel benchmark created by the ILSP team for medical question answering based on the medical exams of [DOATAP](https://www.doatap.gr) ([Medical MCQA](https://huggingface.co/datasets/ilsp/medical_mcqa_greek)).
 
- Our evaluation for Llama-Krikri-8B is performed in a few-shot setting, consistent with the settings in the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). We can see that our training enhances performance across all Greek test sets by a **+10.8%** average improvement. The results for the Greek test sets are shown in the following table:
+ We can see that our training enhances performance across all Greek test sets by a **+10.8%** average improvement. The results for the Greek test sets are shown in the following table:
 
 | | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
 |----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
@@ -107,6 +114,8 @@
 
 ## English Benchmarks
 
+ We can also see that our training methodology not only mitigates catastrophic forgetting effectively, but also improves average performance across all English test sets by **+0.8%**. The results for the English test sets are shown in the following table:
+
 | | Winogrande (5-shot) | Belebele (5-shot) | HellaSwag (10-shot) | ARC-Challenge (25-shot) | TruthfulQA MC2 (0-shot) | MMLU (5-shot) | Average |
 |----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
 | Meltemi 7B v1.5 | 73.4% | 77.7% | 79.6% | 54.1% | 40.5% | 56.9% | 63.7% |
 