Update README.md
README.md CHANGED
````diff
@@ -1,24 +1,21 @@
 ---
 license: llama2
 datasets:
-- …
+- HuggingFaceH4/ultrachat_200k
+- HuggingFaceH4/ultrafeedback_binarized
 language:
 - tr
 - en
-metrics:
-- chrf
-- accuracy
-- bleu
 ---
 
 
 
-# SambaLingo-Turkish-Base
+# SambaLingo-Turkish-Chat
 
 <img src="SambaLingo_Logo.png" width="340" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
 
 <!-- Provide a quick summary of what the model is/does. -->
-SambaLingo-Turkish-Base is a pretrained Bi-lingual Turkish and English model that adapts [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf) to Turkish by training on 63 billion tokens from the Turkish split of the [Cultura-X](https://huggingface.co/datasets/uonlp/CulturaX) dataset.
+SambaLingo-Turkish-Chat is a bi-lingual human aligned chat model trained for Turkish and English. It is trained using direct preference optimization on top of the base model [SambaLingo-Turkish-Base](https://huggingface.co/sambanovasystems/SambaLingo-Turkish-Base). The base model adapts [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf) to Turkish by training on 63 billion tokens from the Turkish split of the [Cultura-X](https://huggingface.co/datasets/uonlp/CulturaX) dataset.
 
 ## Model Description
 <!-- Provide a longer summary of what this model is. -->
````
````diff
@@ -35,8 +32,8 @@ SambaLingo-Turkish-Base is a pretrained Bi-lingual Turkish and English model tha…
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
-tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Turkish-Base")
-model = AutoModelForCausalLM.from_pretrained("sambanovasystems/SambaLingo-Turkish-Base", device_map="auto", torch_dtype="auto")
+tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Turkish-Chat")
+model = AutoModelForCausalLM.from_pretrained("sambanovasystems/SambaLingo-Turkish-Chat", device_map="auto", torch_dtype="auto")
 ```
 
 ### Suggested Inference Parameters
````
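The hunk above only swaps the checkpoint name in the loading snippet and never shows generation. A minimal end-to-end sketch using the card's suggested top-p of 0.9 follows; the example prompt, the temperature value, and the reliance on a bundled chat template are assumptions of mine, not part of the card.

```python
# Hedged usage sketch: load the chat model and sample one reply.
# top_p=0.9 comes from the card; temperature and the prompt are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Turkish-Chat")
model = AutoModelForCausalLM.from_pretrained(
    "sambanovasystems/SambaLingo-Turkish-Chat", device_map="auto", torch_dtype="auto"
)

# Assumes the repo ships a chat template; apply_chat_template then adds the turn markers.
messages = [{"role": "user", "content": "İstanbul'un en ünlü yemeği nedir?"}]  # "What is Istanbul's most famous dish?"
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    top_p=0.9,        # suggested inference parameter from the card
    temperature=0.8,  # assumption: the card's full parameter list is truncated here
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```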
````diff
@@ -45,12 +42,10 @@
 - Top-p: 0.9
 
 ### Suggested Prompting
-This model is a pretrained checkpoint, so to use it effectively please use few shot prompting with exemplars. The only other prompt templating required is the standard \<s\> (BOS) token from the Llama tokenizer. If you want to interact with this model with direct questions or queries, please use the chat version of the model that has been aligned with human preferences [sambanovasystems/SambaLingo-Turkish-Chat](https://huggingface.co/sambanovasystems/SambaLingo-Turkish-Chat).
 
 ## Evaluation Results
 
 ## Training Details
-All pre-training is done on the [Cultura-X](https://huggingface.co/datasets/uonlp/CulturaX) dataset. We mix the data to be 75% data from the language we are adapting to, and 25% English as suggested by [Csaki et al.](https://arxiv.org/abs/2311.05741) We pack the data into sequences of length 4096, and ensure that when learning a token we only attend to previous tokens in the context of the corresponding text document. We train with a global batch size of 1024, sequence length of 4096, maximum learning rate of 1e-4 with cosine decay, warmup ratio of 0.01 and a weight decay of 0.1.
 
 ## Uses
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
````
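The deleted Suggested Prompting paragraph describes few-shot prompting of the base checkpoint with exemplars plus the Llama tokenizer's \<s\> (BOS) token. A minimal sketch of that pattern, with invented exemplars for illustration:

```python
# Hedged few-shot sketch for the *base* checkpoint, per the deleted guidance above.
# The exemplars are invented; the Llama tokenizer prepends the required <s> (BOS) token.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Turkish-Base")
model = AutoModelForCausalLM.from_pretrained(
    "sambanovasystems/SambaLingo-Turkish-Base", device_map="auto", torch_dtype="auto"
)

# Few-shot exemplars, then the query we want the pretrained model to complete.
prompt = (
    "English: book -> Turkish: kitap\n"
    "English: water -> Turkish: su\n"
    "English: friend -> Turkish:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # BOS added automatically
output_ids = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```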
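The deleted Training Details paragraph packs documents into 4096-token sequences while letting each token attend only to earlier tokens of its own document. A small sketch of that document-boundary masking idea; this is my own illustration, not SambaNova's training code, and the names and shapes are assumptions:

```python
# Illustration of per-document causal masking for packed sequences,
# as described in the deleted training paragraph above.
import torch

def packed_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """doc_ids: (seq_len,) document index of each token in the packed sequence.
    Returns a (seq_len, seq_len) boolean mask; True means attention is allowed."""
    seq_len = doc_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    # Each token attends only to earlier positions within the same document.
    return causal & same_doc

# Two documents of lengths 3 and 2 packed into one length-5 sequence.
print(packed_causal_mask(torch.tensor([0, 0, 0, 1, 1])).int())
```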
````diff
@@ -95,9 +90,9 @@ Hugging Face H4 team for open sourcing the zephyr training recipe and alignment handbook
 ## Cite SambaLingo
 ```
 @software{sambalingo,
-title = {{SambaLingo: Language Experts}},
+title = {{SambaLingo: Open Source Language Experts}},
 author = {SambaNova Systems},
-url = {https://huggingface.co/sambanovasystems/SambaLingo-Turkish-Base},
+url = {https://huggingface.co/sambanovasystems/SambaLingo-Turkish-Chat},
 month = {2},
 year = {2024},
 version = {1.0},
````
|