---
license: llama2
datasets:
  - HuggingFaceH4/ultrachat_200k
  - HuggingFaceH4/ultrafeedback_binarized
language:
  - tr
  - en
---

SambaLingo-Turkish-Chat

SambaLingo-Turkish-Chat is a bilingual, human-aligned chat model trained for Turkish and English. It is trained using direct preference optimization on top of the base model SambaLingo-Turkish-Base. The base model adapts Llama 2 to Turkish by training on 63 billion tokens from the Turkish split of the CulturaX dataset.

Model Description

  • Developed by: SambaNova Systems
  • Model type: Language Model
  • Language(s): Turkish, English
  • Finetuned from model: Llama 2
  • Blog Post: Will be released soon!

Getting Started

Loading in model with Hugging Face

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Turkish-Chat")
# device_map="auto" spreads the weights across available GPUs/CPU;
# torch_dtype="auto" loads them in the checkpoint's native precision.
model = AutoModelForCausalLM.from_pretrained("sambanovasystems/SambaLingo-Turkish-Chat", device_map="auto", torch_dtype="auto")

Suggested Inference Parameters

  • Temperature: 0.8
  • Repetition penalty: 1.0
  • Top-p: 0.9
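The parameters above can be packaged into a `GenerationConfig` and passed to `model.generate`. One assumption in this sketch: sampling is enabled with `do_sample=True`, since temperature and top-p only take effect when sampling (the card does not state this explicitly), and `max_new_tokens=256` is an illustrative cap, not a value from the card.

```python
from transformers import GenerationConfig

# Suggested decoding settings from the model card.
# Assumption: do_sample=True so temperature/top-p are applied;
# max_new_tokens is an illustrative choice.
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.8,
    repetition_penalty=1.0,
    top_p=0.9,
    max_new_tokens=256,
)

# Usage (assumes `model` and `tokenizer` from the loading snippet above):
# inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# outputs = model.generate(**inputs, generation_config=generation_config)
# print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```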

Suggested Prompting
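The prompt format is not spelled out in this section. As a hedged sketch: because the model was aligned with the Hugging Face H4 Zephyr recipe (credited in the acknowledgments), it plausibly uses the Zephyr-style chat template, shown below. This template string is an assumption, not taken from this card; the hypothetical `build_prompt` helper is for illustration only.

```python
# Assumption: Zephyr-style single-turn template
# ("<|user|>\n{question}</s>\n<|assistant|>\n"), inferred from the
# Zephyr training recipe; verify against the tokenizer's chat template.
def build_prompt(question: str) -> str:
    return f"<|user|>\n{question}</s>\n<|assistant|>\n"

prompt = build_prompt("İstanbul hakkında kısa bir paragraf yaz.")
```

Where available, prefer `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` so the template stored with the tokenizer is used instead of a hand-written one.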

Evaluation Results

Training Details

Uses

Direct Use

This model is intended for commercial and research use.

Out-of-Scope Use

SambaLingo should NOT be used for:

  • Mission-critical applications
  • Applications that involve the safety of others
  • Making highly important decisions

Bias, Risks, and Limitations

Like all LLMs, SambaLingo has certain limitations:

  • Hallucination: The model may sometimes generate responses that contain plausible-sounding but factually incorrect or irrelevant information.
  • Code Switching: The model might unintentionally switch between languages or dialects within a single response, affecting the coherence and understandability of the output.
  • Repetition: The model may produce repetitive phrases or sentences, leading to less engaging and informative responses.
  • Coding and Math: The model's performance in generating accurate code or solving complex mathematical problems may be limited.
  • Toxicity: The model could inadvertently generate responses containing inappropriate or harmful content.

Acknowledgments

We extend our heartfelt gratitude to the open-source AI community; this endeavor would not have been achievable without open source. SambaNova embraces the open-source community and aspires to actively contribute to this initiative.

We would like to give a special thanks to the following groups:

  • Meta for open sourcing Llama 2 and the FLORES-200 dataset
  • Nguyen et al. for open sourcing the CulturaX dataset
  • CohereAI for their amazing work with AYA-101 and for open sourcing a multilingual instruction tuning dataset
  • EleutherAI for their open-source evaluation framework
  • The Hugging Face H4 team for open sourcing the Zephyr training recipe and the alignment handbook repo

Cite SambaLingo

@software{sambalingo,
  title = {{SambaLingo: Open Source Language Experts}},
  author = {SambaNova Systems},
  url = {https://huggingface.co/sambanovasystems/SambaLingo-Turkish-Chat},
  month = {2},
  year = {2024},
  version = {1.0},
}