Update README.md
## Tokenizer Details
We extended the vocabulary of the base Llama model from 32,000 to 57,000 tokens by adding 25,000 non-overlapping tokens from the new language.
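
As a rough illustration of this vocabulary extension (not necessarily the exact procedure used for this model), the sketch below adds tokens from a new-language vocabulary to a Llama tokenizer and resizes the embeddings to match; the base checkpoint name and `new_language_tokens.txt` are placeholders.

```python
# Sketch of the vocabulary-extension step; the checkpoint name and the
# token file are hypothetical placeholders, not artifacts of this repo.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Candidate tokens produced by a tokenizer trained on the new language.
with open("new_language_tokens.txt", encoding="utf-8") as f:
    candidate_tokens = [line.rstrip("\n") for line in f if line.strip()]

# add_tokens() skips tokens already in the vocabulary, so only
# non-overlapping tokens actually grow it.
num_added = tokenizer.add_tokens(candidate_tokens)
print(f"Added {num_added} tokens; new vocabulary size: {len(tokenizer)}")

# Resize the input/output embedding matrices so the new rows can be
# learned during continued pre-training.
model.resize_token_embeddings(len(tokenizer))
```
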
## Evaluation

|                              | SambaLingo-Serbian-Base | sr-gpt2 | bloom-7b1 | xglm-7.5B | mGPT-13B |
|------------------------------|-------------------------|---------|-----------|-----------|----------|
| Perplexity (Lower Is Better) | **1.436**               | -       | 2.140     | 2.404     | 2.429    |
| FLORES en->sr (8 shot, CHRF) | **0.448**               | 0.002   | 0.171     | 0.090     | 0.024    |
| FLORES sr->en (8 shot, CHRF) | **0.625**               | 0.071   | 0.206     | 0.257     | 0.026    |
| FLORES en->sr (8 shot, BLEU) | **0.188**               | 0.000   | 0.003     | 0.001     | 0.000    |
| FLORES sr->en (8 shot, BLEU) | **0.352**               | 0.000   | 0.019     | 0.040     | 0.000    |
| Belebele (3 shot)            | **48.33%**              | 23.00%  | 23.89%    | 27.00%    | 25.22%   |
| SIB-200 (3 shot)             | 55.39%                  | -       | 32.35%    | **61.76%**| 39.22%   |
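
For reference, a minimal sketch of how corpus-level CHRF and BLEU scores like the FLORES numbers above can be computed with `sacrebleu`; this is an illustration, not the exact evaluation harness used here, and the file names are placeholders.

```python
# Illustrative scoring of few-shot translations with sacrebleu; not the
# exact harness behind the table above. The two files are hypothetical:
# one model hypothesis and one FLORES reference per line, aligned.
import sacrebleu

with open("hypotheses.sr.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("references.sr.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

chrf = sacrebleu.corpus_chrf(hypotheses, [references])
bleu = sacrebleu.corpus_bleu(hypotheses, [references])

# sacrebleu reports scores on a 0-100 scale; the table uses 0-1.
print(f"CHRF: {chrf.score / 100:.3f}  BLEU: {bleu.score / 100:.3f}")
```
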
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->