Add way to serve with vLLM
README.md CHANGED
@@ -19,22 +19,14 @@ Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama
 # Model Information
 
 - Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens
-- 128k context length
+- 128k context length (approximately 80,000 Greek words)
 - We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language, by utilizing a large training corpus.
-  * This corpus includes
-  * Additionaly, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (
-  * The training corpus also contains
+  * This corpus includes 56.7 billion monolingual Greek tokens, constructed from publicly available resources.
+  * Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (21 billion tokens) and Greek-English parallel data (5.5 billion tokens).
+  * The training corpus also contains 7.8 billion math and code tokens.
   * This corpus has been processed, filtered, and deduplicated to ensure data quality and is outlined below:
 
 
-| Sub-corpus | # Tokens | Percentage |
-|-----------|------------------|------------|
-| Greek | 55,097,452,359 | 61.4% |
-| English | 23,340,749,356 | 26.0% |
-| Parallel | 5,262,998,873 | 6.0% |
-| Math/Code | 5,951,964,497 | 6.6% |
-| **Total** | **89,653,165,085** | **100%** |
-
 | Sub-corpus | # Tokens | Percentage |
 |-----------|------------------|------------|
 | Greek | 56.7 B | 62.3 % |
@@ -43,7 +35,8 @@ Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama
 | Math/Code | 7.8 B | 8.6 % |
 | **Total** | 91 B | **100%** |
 
+Chosen subsets of the 91 billion token corpus were upsampled, resulting in a size of **110 billion tokens**.
 
 
 # How to use
@@ -65,6 +58,29 @@ outputs = model.generate(input_text['input_ids'], max_new_tokens=256, do_sample=
 print(tokenizer.batch_decode(outputs)[0])
 ```
 
+# How to serve with an OpenAI-compatible server via vLLM
+
+```bash
+vllm serve ilsp/Llama-Krikri-8B-Base \
+  --enforce-eager \
+  --dtype 'bfloat16' \
+  --api-key token-abc123
+```
+
+The model can then be queried from Python using:
+```python
+from openai import OpenAI
+
+api_key = "token-abc123"
+base_url = "http://localhost:8000/v1"
+client = OpenAI(
+    api_key=api_key,
+    base_url=base_url,
+)
+response = client.completions.create(model="ilsp/Llama-Krikri-8B-Base",
+                                     prompt="Η εκπαίδευση μεγάλων γλωσσικών μοντέλων περιλαμβάνει")
+print(response.choices[0].text)
+```
 
 # Evaluation
 
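Before sending prompts to the server started by the `vllm serve` command above, it can be useful to confirm it is reachable. A minimal sketch using the same OpenAI client, assuming the server is running locally on the default port 8000 with the API key from the example:

```python
from openai import OpenAI

# Endpoint and key mirror the serving example above; adjust them
# if the server runs on a different host, port, or API key.
client = OpenAI(api_key="token-abc123", base_url="http://localhost:8000/v1")

# vLLM's OpenAI-compatible server exposes the /v1/models endpoint,
# so listing models is a cheap way to check that it is up and
# serving the expected model.
for model in client.models.list().data:
    print(model.id)  # should include ilsp/Llama-Krikri-8B-Base
```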
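Since Llama-Krikri-8B-Base is a base model, the plain completions endpoint shown above (rather than chat completions) is the natural fit. The same call also accepts standard sampling parameters and can stream tokens as they are generated; a minimal sketch, with illustrative parameter values and an example prompt not taken from the model card:

```python
from openai import OpenAI

client = OpenAI(api_key="token-abc123", base_url="http://localhost:8000/v1")

# Stream a completion with explicit sampling parameters.
# The prompt translates to: "The capital of Greece is"
stream = client.completions.create(
    model="ilsp/Llama-Krikri-8B-Base",
    prompt="Η πρωτεύουσα της Ελλάδας είναι",
    max_tokens=128,
    temperature=0.7,
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental piece of generated text.
    print(chunk.choices[0].text, end="", flush=True)
print()
```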