Text Generation · Transformers · Safetensors · Turkish · English · llama · conversational · text-generation-inference · Inference Endpoints
zolicsaki committed (verified) · Commit 9930822 · 1 parent: 7cf1209

Update README.md

Files changed (1): README.md (+8 −13)
README.md CHANGED
@@ -1,24 +1,21 @@
  ---
  license: llama2
  datasets:
- - uonlp/CulturaX
+ - HuggingFaceH4/ultrachat_200k
+ - HuggingFaceH4/ultrafeedback_binarized
  language:
  - tr
  - en
- metrics:
- - chrf
- - accuracy
- - bleu
  ---



- # SambaLingo-Turkish-Base
+ # SambaLingo-Turkish-Chat

  <img src="SambaLingo_Logo.png" width="340" style="margin-left:'auto' margin-right:'auto' display:'block'"/>

  <!-- Provide a quick summary of what the model is/does. -->
- SambaLingo-Turkish-Base is a pretrained Bi-lingual Turkish and English model that adapts [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf) to Turkish by training on 63 billion tokens from the Turkish split of the [Cultura-X](https://huggingface.co/datasets/uonlp/CulturaX) dataset. This model reports state of the art evaluation results in perplexity and FLORES-200 translation. For the chat version of this model please see [sambanovasystems/SambaLingo-Turkish-Chat](https://huggingface.co/sambanovasystems/SambaLingo-Turkish-Chat).
+ SambaLingo-Turkish-Chat is a bilingual, human-aligned chat model trained for Turkish and English. It is trained using direct preference optimization on top of the base model [SambaLingo-Turkish-Base](https://huggingface.co/sambanovasystems/SambaLingo-Turkish-Base). The base model adapts [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf) to Turkish by training on 63 billion tokens from the Turkish split of the [Cultura-X](https://huggingface.co/datasets/uonlp/CulturaX) dataset.

  ## Model Description
  <!-- Provide a longer summary of what this model is. -->
@@ -35,8 +32,8 @@ SambaLingo-Turkish-Base is a pretrained Bi-lingual Turkish and English model tha
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

- tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Turkish-Base")
- model = AutoModelForCausalLM.from_pretrained("sambanovasystems/SambaLingo-Turkish-Base", device_map="auto", torch_dtype="auto")
+ tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Turkish-Chat")
+ model = AutoModelForCausalLM.from_pretrained("sambanovasystems/SambaLingo-Turkish-Chat", device_map="auto", torch_dtype="auto")
  ```

  ### Suggested Inference Parameters
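The loading snippet above pairs with a short generation call. Below is a minimal sketch using the card's suggested top-p of 0.9; the temperature value and the prompt are assumptions, as neither appears in this diff:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Turkish-Chat")
model = AutoModelForCausalLM.from_pretrained(
    "sambanovasystems/SambaLingo-Turkish-Chat", device_map="auto", torch_dtype="auto"
)

# Illustrative Turkish prompt (not from the model card).
prompt = "Türkiye'nin başkenti neresidir?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    top_p=0.9,        # suggested in the card
    temperature=0.8,  # assumption: the card's value is not shown in this diff
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```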
@@ -45,12 +42,10 @@ model = AutoModelForCausalLM.from_pretrained("sambanovasystems/SambaLingo-Turkis
  - Top-p: 0.9

  ### Suggested Prompting
- This model is a pretrained checkpoint, so to use it effectively please use few shot prompting with exemplars. The only other prompt templating required is the standard \<s\> (BOS) token from the Llama tokenizer. If you want to interact with this model with direct questions or queries, please use the chat version of the model that has been aligned with human preferences [sambanovasystems/SambaLingo-Turkish-Chat](https://huggingface.co/sambanovasystems/SambaLingo-Turkish-Chat).

  ## Evaluation Results

  ## Training Details
- All pre-training is done on the [Cultura-X](https://huggingface.co/datasets/uonlp/CulturaX) dataset. We mix the data to be 75% data from the language we are adapting to, and 25% English as suggested by [Csaki et al.](https://arxiv.org/abs/2311.05741) We pack the data into sequences of length 4096, and ensure that when learning a token we only attend to previous tokens in the context of the corresponding text document. We train with a global batch size of 1024, sequence length of 4096, maximum learning rate of 1e-4 with cosine decay, warmup ratio of 0.01 and a weight decay of 0.1.

  ## Uses
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
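The Suggested Prompting paragraph above (removed for the chat card) describes few-shot prompting for the base checkpoint. A minimal sketch of that pattern, with illustrative exemplars that are not from the card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The few-shot pattern applies to the pretrained base checkpoint, not the chat model.
tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Turkish-Base")
model = AutoModelForCausalLM.from_pretrained(
    "sambanovasystems/SambaLingo-Turkish-Base", device_map="auto", torch_dtype="auto"
)

# The Llama tokenizer prepends the <s> (BOS) token automatically, so no
# templating is needed beyond the exemplars themselves.
few_shot_prompt = (
    "English: good morning\nTürkçe: günaydın\n\n"
    "English: thank you\nTürkçe: teşekkür ederim\n\n"
    "English: see you tomorrow\nTürkçe:"
)
inputs = tokenizer(few_shot_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```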
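The Training Details paragraph above packs documents into 4096-token sequences while restricting each token's attention to earlier tokens of the same document. A toy sketch of that masking scheme (an illustration of the described setup, not the actual training code):

```python
import torch

def packed_causal_mask(doc_lengths):
    """Causal attention mask for documents packed into one sequence:
    a token may attend only to earlier tokens of its own document."""
    total = sum(doc_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in doc_lengths:
        # Lower-triangular block: causal attention within one document.
        mask[start:start + n, start:start + n] = torch.tril(
            torch.ones(n, n, dtype=torch.bool)
        )
        start += n
    return mask  # mask[i, j] == True means position i may attend to position j

# Two documents of lengths 3 and 2 packed into a 5-token sequence.
print(packed_causal_mask([3, 2]).int())
```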
@@ -95,9 +90,9 @@ Hugging Face-H4 team for open source the zephyr training recipe and alignment ha
  ## Cite SambaLingo
  ```
  @software{sambalingo,
- title = {{SambaLingo: Language Experts Adapted From Llama}},
+ title = {{SambaLingo: Open Source Language Experts}},
  author = {SambaNova Systems},
- url = {https://huggingface.co/sambanovasystems/SambaLingo-Turkish-Base}
+ url = {https://huggingface.co/sambanovasystems/SambaLingo-Turkish-Chat}
  month = {2},
  year = {2024},
  version = {1.0},
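For the aligned checkpoint this diff introduces, a sketch of chat-style inference via the tokenizer's chat template. This assumes the released tokenizer ships a template; the chat card's own prompting guidance is not shown in this diff:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Turkish-Chat")
model = AutoModelForCausalLM.from_pretrained(
    "sambanovasystems/SambaLingo-Turkish-Chat", device_map="auto", torch_dtype="auto"
)

# Illustrative user message (not from the model card).
messages = [{"role": "user", "content": "İstanbul hakkında kısa bir paragraf yazar mısın?"}]

# Assumes tokenizer_config.json defines a chat template; recent
# transformers versions raise an error if none is present.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=256, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```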