---
license: llama2
datasets:
- HuggingFaceH4/ultrachat_200k
- HuggingFaceH4/ultrafeedback_binarized
language:
- tr
- en
---

# SambaLingo-Turkish-Chat

SambaLingo-Turkish-Chat is a bilingual, human-aligned chat model trained for Turkish and English. It is trained using direct preference optimization on top of the base model [SambaLingo-Turkish-Base](https://huggingface.co/sambanovasystems/SambaLingo-Turkish-Base). The base model adapts [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf) to Turkish by training on 63 billion tokens from the Turkish split of the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset.

## Model Description

- **Developed by:** [SambaNova Systems](https://sambanova.ai/)
- **Model type:** Language Model
- **Language(s):** Turkish, English
- **Finetuned from model:** [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- **Blog Post**: Will be released soon!

## Getting Started

### Loading the model with Hugging Face

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download (or load from the local cache) the tokenizer and the model weights.
tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Turkish-Chat")
# device_map="auto" places the weights on the available devices;
# torch_dtype="auto" uses the dtype stored in the checkpoint.
model = AutoModelForCausalLM.from_pretrained("sambanovasystems/SambaLingo-Turkish-Chat", device_map="auto", torch_dtype="auto")
```

### Suggested Inference Parameters

- Temperature: 0.8
- Repetition penalty: 1.0
- Top-p: 0.9

### Suggested Prompting

## Evaluation Results

## Training Details

## Uses

### Direct Use

This model is intended for commercial and research use.

### Out-of-Scope Use

SambaLingo should NOT be used for:

- Mission-critical applications
- Applications that involve the safety of others
- Making highly important decisions

## Bias, Risks, and Limitations

Like all LLMs, SambaLingo has certain limitations:

- Hallucination: The model may sometimes generate responses that contain plausible-sounding but factually incorrect or irrelevant information.
- Code Switching: The model might unintentionally switch between languages or dialects within a single response, affecting the coherence and understandability of the output.
- Repetition: The model may produce repetitive phrases or sentences, leading to less engaging and informative responses.
- Coding and Math: The model's performance in generating accurate code or solving complex mathematical problems may be limited.
- Toxicity: The model could inadvertently generate responses containing inappropriate or harmful content.

## Acknowledgments

We extend our heartfelt gratitude to the open-source AI community; this endeavor would not have been achievable without open source. SambaNova embraces the open-source community and aspires to actively contribute to this initiative.

We would like to give special thanks to the following groups:

- Meta for open-sourcing Llama 2 and the FLORES-200 dataset
- Nguyen et al. for open-sourcing the CulturaX dataset
- CohereAI for their amazing work with AYA-101 and for open-sourcing a multilingual instruction tuning dataset
- EleutherAI for their open-source evaluation framework
- The Hugging Face H4 team for open-sourcing the Zephyr training recipe and the alignment handbook repo

## Cite SambaLingo

```
@software{sambalingo,
  title = {{SambaLingo: Open Source Language Experts}},
  author = {SambaNova Systems},
  url = {https://huggingface.co/sambanovasystems/SambaLingo-Turkish-Chat},
  month = {2},
  year = {2024},
  version = {1.0},
}
```
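
## Example: Applying the Suggested Inference Parameters

The values listed under Suggested Inference Parameters (temperature 0.8, top-p 0.9, repetition penalty 1.0) could be passed to the Transformers `generate` method roughly as sketched below. This is a minimal illustration, not part of the original card: the helper names `suggested_generation_kwargs` and `generate_reply` are hypothetical, `max_new_tokens=256` is an arbitrary assumption, and `do_sample=True` is added because temperature and top-p only take effect when sampling is enabled.

```python
def suggested_generation_kwargs(max_new_tokens=256):
    """Collect the card's suggested sampling settings as keyword arguments.

    max_new_tokens is an assumption for illustration; the card does not specify it.
    """
    return {
        "do_sample": True,          # needed for temperature/top-p to apply
        "temperature": 0.8,         # from the card
        "top_p": 0.9,               # from the card
        "repetition_penalty": 1.0,  # from the card (1.0 means no penalty)
        "max_new_tokens": max_new_tokens,
    }


def generate_reply(model, tokenizer, prompt, **overrides):
    """Generate a completion for `prompt` with the suggested settings.

    `model` and `tokenizer` are the objects loaded in Getting Started.
    The prompt is passed through unchanged; any chat formatting is up to the caller.
    """
    kwargs = {**suggested_generation_kwargs(), **overrides}
    # Tokenize and move the input tensors to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, **kwargs)
    # Drop the prompt tokens so only the newly generated text is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

With the model and tokenizer loaded as shown in Getting Started, a call such as `generate_reply(model, tokenizer, prompt)` would return only the newly generated text. Individual settings can be overridden per call, for example `generate_reply(model, tokenizer, prompt, temperature=0.5)`.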