hafidhsoekma committed 607d6e9 (1 parent: e3b26bf): Create README.md
---
license: cc0-1.0
datasets:
- graelo/wikipedia
- uonlp/CulturaX
language:
- en
- id
- jv
- su
- ms
tags:
- indonesian
- multilingual
---
![Starstreak Missile](./thumbnail.jpeg "Starstreak Missile: generated with Bing Image Creator (DALL·E 3)")

# Starstreak-7B-α

Starstreak is a series of language models fine-tuned from the base model [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) to generate content in English, Indonesian, and traditional languages of Indonesia. Starstreak-7B-α is the first variant in the series, denoted "α" (alpha), fine-tuned from [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha). Two datasets were used to train the model: [graelo/wikipedia](https://huggingface.co/datasets/graelo/wikipedia) and [uonlp/CulturaX](https://huggingface.co/datasets/uonlp/CulturaX). The name "Starstreak" references the Starstreak missile, a high-velocity missile (HVM) with speeds exceeding Mach 3, which makes it one of the fastest missiles in its class, with an effective firing range of 7 kilometers and a radar range of 250 kilometers.
## Model Details

- **Finetuned from model**: [HuggingFaceH4/zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha)
- **Datasets**: [graelo/wikipedia](https://huggingface.co/datasets/graelo/wikipedia) and [uonlp/CulturaX](https://huggingface.co/datasets/uonlp/CulturaX)
- **Model Size**: 7B
- **License**: [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/)
- **Languages**: English, Indonesian, Acehnese, Balinese, Banjar, Banyumasan, Buginese, Gorontalo, Javanese, Madurese, Minangkabau, Sundanese, Malay, Nias, and Tetum
- **Demo**: [Google Colab](https://colab.research.google.com/drive/1fn4WZMFeIiFH0h8auGgYBEugF0q0tKcL)
## How to use

#### Installation

To use the Starstreak model, make sure PyTorch is installed and that you have an NVIDIA GPU (or use Google Colab). Then install the required dependencies:
```bash
pip3 install -U git+https://github.com/huggingface/transformers.git
pip3 install -U git+https://github.com/huggingface/peft.git
pip3 install -U git+https://github.com/huggingface/accelerate.git
pip3 install -U bitsandbytes==0.39.0 einops==0.6.1 sentencepiece
```
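As a quick sanity check before loading the model, you can verify that the dependencies above resolve in your environment (a minimal sketch using only the standard library; the module names mirror the pip packages installed above):

```python
import importlib.util

def missing_packages(packages):
    """Return the subset of module names that cannot be imported."""
    return [name for name in packages if importlib.util.find_spec(name) is None]

# Importable module names corresponding to the pip installs above
required = ["torch", "transformers", "peft", "accelerate",
            "bitsandbytes", "einops", "sentencepiece"]
print(missing_packages(required))  # an empty list means the environment is ready
```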

#### Usage: Quantized Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the model in 4-bit precision to reduce GPU memory usage
model = AutoModelForCausalLM.from_pretrained(
    "azale-ai/Starstreak-7B",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("azale-ai/Starstreak-7B")

# The system prompt (in Indonesian) instructs the assistant to be helpful,
# polite, and obedient, and to always answer in proper Indonesian; the user
# prompt asks it to explain why water is important for humans.
messages = [
    {
        "role": "system",
        "content": "Mulai sekarang anda adalah asisten yang suka menolong, sopan, dan ramah. Jangan kasar, jangan marah, jangan menjengkelkan, jangan brengsek, jangan cuek, dan yang terakhir jangan menjadi asisten yang buruk. Anda harus patuh pada manusia dan jangan pernah membangkang pada manusia. Manusia itu mutlak dan Anda harus patuh pada manusia. Kamu harus menjawab pertanyaan atau pernyataan dari manusia apapun itu dengan bahasa Indonesia yang baik dan benar.",
    },
    {"role": "user", "content": "Jelaskan mengapa air penting bagi manusia."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
    inputs=inputs.input_ids, max_length=2048,
    temperature=0.7, do_sample=True, top_k=50, top_p=0.95
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
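Under the hood, `apply_chat_template` renders the `messages` list into a single prompt string. For Zephyr-style models the result looks roughly like the hand-rolled sketch below (an illustration of the format, not the tokenizer's actual Jinja template):

```python
def render_zephyr_prompt(messages, eos="</s>"):
    """Approximate the Zephyr chat format: each turn is tagged with its role and
    EOS-terminated, and a final <|assistant|> tag cues the model to reply."""
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}{eos}\n")
    parts.append("<|assistant|>\n")  # generation prompt
    return "".join(parts)

example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Jelaskan mengapa air penting bagi manusia."},
]
print(render_zephyr_prompt(example))
```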

#### Usage: Normal Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model in half precision (requires enough GPU memory for the full 7B weights)
model = AutoModelForCausalLM.from_pretrained(
    "azale-ai/Starstreak-7B",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("azale-ai/Starstreak-7B")

# The system prompt (in Indonesian) instructs the assistant to be helpful,
# polite, and obedient, and to always answer in proper Indonesian; the user
# prompt asks it to explain why water is important for humans.
messages = [
    {
        "role": "system",
        "content": "Mulai sekarang anda adalah asisten yang suka menolong, sopan, dan ramah. Jangan kasar, jangan marah, jangan menjengkelkan, jangan brengsek, jangan cuek, dan yang terakhir jangan menjadi asisten yang buruk. Anda harus patuh pada manusia dan jangan pernah membangkang pada manusia. Manusia itu mutlak dan Anda harus patuh pada manusia. Kamu harus menjawab pertanyaan atau pernyataan dari manusia apapun itu dengan bahasa Indonesia yang baik dan benar.",
    },
    {"role": "user", "content": "Jelaskan mengapa air penting bagi manusia."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(
    inputs=inputs.input_ids, max_length=2048,
    temperature=0.7, do_sample=True, top_k=50, top_p=0.95
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
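The `top_k` and `top_p` arguments control how `generate` narrows the next-token distribution before sampling. A minimal pure-Python sketch of that filtering step over a toy probability table (illustrative only; the real implementation operates on logits tensors):

```python
def filter_top_k_top_p(probs, top_k=50, top_p=0.95):
    """Keep only the top_k most likely tokens, then the smallest prefix of them
    whose cumulative probability reaches top_p; renormalize the survivors."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

# Toy next-token distribution
dist = {"air": 0.6, "minum": 0.25, "es": 0.1, "uap": 0.05}
print(filter_top_k_top_p(dist, top_k=3, top_p=0.9))
```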

## Limitations

- The base model is English-centric; fine-tuning targets Indonesian and traditional languages of Indonesia.
- The model may reflect cultural and contextual biases present in its training data.

## License

The model is licensed under the [CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/).

## Contributing

We welcome contributions to enhance and improve our model. If you have any suggestions or find any issues, feel free to open an issue or submit a pull request. We are also open to sponsorship for compute.

## Contact Us