wxxwxxw committed 6615e28 (verified · 1 parent: e295123): Create README.md

Files changed (1): README.md (+66 -0)
---
license: llama3
library_name: transformers
pipeline_tag: text-generation
base_model: yentinglin/Llama-3-Taiwan-8B-Instruct-128k
language:
- zh
- en
tags:
- zhtw
---

# wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ

This model is a 4-bit AWQ quantization of [`yentinglin/Llama-3-Taiwan-8B-Instruct-128k`](https://huggingface.co/yentinglin/Llama-3-Taiwan-8B-Instruct-128k).

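For a quick local test outside vLLM, the quantized checkpoint can also be loaded with AutoAWQ directly. This is a minimal sketch, not part of the original card: it assumes `autoawq` and a CUDA GPU are available, and arguments such as `fuse_layers` may differ between autoawq versions.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_id = 'wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ'

# Load the 4-bit AWQ weights and the matching tokenizer from the Hub
model = AutoAWQForCausalLM.from_quantized(quant_id, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_id, trust_remote_code=True)

# Build a Llama-3 chat prompt and generate a short answer
prompt = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "how tall is taipei 101"}],
    tokenize=False,
    add_generation_prompt=True,
)
input_ids = tokenizer(prompt, return_tensors='pt').input_ids.cuda()
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

For serving or higher throughput, the vLLM example below is the better starting point.
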
# Quantization

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'yentinglin/Llama-3-Taiwan-8B-Instruct-128k'
quant_path = 'Llama-3-Taiwan-8B-Instruct-128k-AWQ'

# 4-bit weights, group size 128, zero-point enabled, GEMM kernels
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM", "modules_to_not_convert": []}

# Load the full-precision model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration and quantize the weights
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

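After `save_quantized`, the `quant_path` folder holds the 4-bit weights, config, and tokenizer files. One way to publish that folder is with `huggingface_hub`; this is an assumption about the upload step rather than the card's own workflow, and the repo name below is a placeholder to replace with your own namespace.

```python
from huggingface_hub import HfApi

api = HfApi()

# Hypothetical target repo; change to your own username/repo name
repo_id = "your-username/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ"

# Create the repo if it does not exist, then upload the quantized folder
api.create_repo(repo_id, repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="Llama-3-Taiwan-8B-Instruct-128k-AWQ",
    repo_id=repo_id,
    repo_type="model",
)
```
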
# Inference with vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model='wxxwxxw/Llama-3-Taiwan-8B-Instruct-128k-4bit-AWQ',
    quantization="AWQ",
    tensor_parallel_size=2,   # number of GPUs
    gpu_memory_utilization=0.9,
    dtype='half',
)

tokenizer = llm.get_tokenizer()

# Build a Llama-3 chat prompt; add_generation_prompt appends the assistant
# header so the model answers instead of continuing the user turn
conversations = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': "how tall is taipei 101"}],
    tokenize=False,
    add_generation_prompt=True,
)

outputs = llm.generate(
    [conversations],
    SamplingParams(
        temperature=0.5,
        top_p=0.9,
        min_tokens=20,
        max_tokens=1024,
    ),
)

for output in outputs:
    generated_ids = output.outputs[0].token_ids
    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    print(generated_text)
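
The same engine can serve many prompts in one call, and vLLM batches them automatically. A small sketch that continues from the example above (it reuses `llm` and `tokenizer`; the extra questions are only illustrative):

```python
# Continues from the example above (reuses llm and tokenizer)
questions = [
    "how tall is taipei 101",
    "what is the tallest building in Taiwan",
    "name three night markets in Taipei",
]

# Apply the same chat template to every question
prompts = [
    tokenizer.apply_chat_template(
        [{'role': 'user', 'content': q}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for q in questions
]

# One generate call; vLLM schedules the prompts as a batch
outputs = llm.generate(prompts, SamplingParams(temperature=0.5, top_p=0.9, max_tokens=512))

for output in outputs:
    print(tokenizer.decode(output.outputs[0].token_ids, skip_special_tokens=True))
```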