---
license: llama2
datasets:
  - HuggingFaceH4/ultrachat_200k
  - HuggingFaceH4/ultrafeedback_binarized
language:
  - tr
  - en
---

SambaLingo-Turkish-Chat

SambaLingo-Turkish-Chat is a bilingual, human-aligned chat model trained for Turkish and English. It is trained using direct preference optimization on top of the base model SambaLingo-Turkish-Base. The base model adapts Llama 2 to Turkish by training on 63 billion tokens from the Turkish split of the CulturaX dataset.

Model Description

  • Developed by: SambaNova Systems
  • Model type: Language Model
  • Language(s): Turkish, English
  • Finetuned from model: Llama 2
  • Blog Post: Will be released soon!

Getting Started

Loading in model with Hugging Face

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Turkish-Chat")
# device_map="auto" spreads the weights across available GPUs/CPU;
# torch_dtype="auto" loads them in the checkpoint's native precision.
model = AutoModelForCausalLM.from_pretrained("sambanovasystems/SambaLingo-Turkish-Chat", device_map="auto", torch_dtype="auto")

Suggested Inference Parameters

  • Temperature: 0.8
  • Repetition penalty: 1.0
  • Top-p: 0.9
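The parameters above can be packaged into a `GenerationConfig` and passed to `model.generate`. One assumption in this sketch: sampling is enabled with `do_sample=True`, since temperature and top-p only take effect when sampling (the card does not state this explicitly), and `max_new_tokens=256` is an illustrative cap, not a value from the card.

```python
from transformers import GenerationConfig

# Suggested decoding settings from the model card.
# Assumption: do_sample=True so temperature/top-p are applied;
# max_new_tokens is an illustrative choice.
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.8,
    repetition_penalty=1.0,
    top_p=0.9,
    max_new_tokens=256,
)

# Usage (assumes `model` and `tokenizer` from the loading snippet above):
# inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# outputs = model.generate(**inputs, generation_config=generation_config)
# print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```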

Suggested Prompting
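The prompt format is not spelled out in this section. As a hedged sketch: because the model was aligned with the Hugging Face H4 Zephyr recipe (credited in the acknowledgments), it plausibly uses the Zephyr-style chat template, shown below. This template string is an assumption, not taken from this card; the hypothetical `build_prompt` helper is for illustration only.

```python
# Assumption: Zephyr-style single-turn template
# ("<|user|>\n{question}</s>\n<|assistant|>\n"), inferred from the
# Zephyr training recipe; verify against the tokenizer's chat template.
def build_prompt(question: str) -> str:
    return f"<|user|>\n{question}</s>\n<|assistant|>\n"

prompt = build_prompt("İstanbul hakkında kısa bir paragraf yaz.")
```

Where available, prefer `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` so the template stored with the tokenizer is used instead of a hand-written one.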

Evaluation Results

Training Details

Uses

Direct Use

This model is intended for commercial and research use.

Out-of-Scope Use

SambaLingo should NOT be used for:

  • Mission-critical applications
  • Applications that involve the safety of others
  • Making highly important decisions

Bias, Risks, and Limitations

Like all LLMs, SambaLingo has certain limitations:

  • Hallucination: The model may sometimes generate responses that contain plausible-sounding but factually incorrect or irrelevant information.
  • Code Switching: The model might unintentionally switch between languages or dialects within a single response, affecting the coherence and understandability of the output.
  • Repetition: The model may produce repetitive phrases or sentences, leading to less engaging and informative responses.
  • Coding and Math: The model's performance in generating accurate code or solving complex mathematical problems may be limited.
  • Toxicity: The model could inadvertently generate responses containing inappropriate or harmful content.

Acknowledgments

We extend our heartfelt gratitude to the open-source AI community; this endeavor would not have been achievable without open source. SambaNova embraces the open-source community and aspires to actively contribute to this initiative.

We would like to give a special thanks to the following groups:

  • Meta for open sourcing Llama 2 and the FLORES-200 dataset
  • Nguyen et al. for open sourcing the CulturaX dataset
  • CohereAI for their amazing work with AYA-101 and for open sourcing a multilingual instruction tuning dataset
  • EleutherAI for their open-source evaluation framework
  • The Hugging Face H4 team for open sourcing the Zephyr training recipe and the alignment handbook repo

Cite SambaLingo

@software{sambalingo,
  title = {{SambaLingo: Open Source Language Experts}},
  author = {SambaNova Systems},
  url = {https://huggingface.co/sambanovasystems/SambaLingo-Turkish-Chat},
  month = {2},
  year = {2024},
  version = {1.0},
}