Commit 5599253 by GradientGuru · 1 Parent(s): 2026f61

Update README.md

Files changed (1): README.md (+18 -0)
@@ -59,6 +59,24 @@ response = model.chat(tokenizer, messages)
  print(response)
  ```

+ ## Quantized Deployment
+
+ Baichuan-13B supports int8 and int4 quantization; enabling it takes only a two-line change to the inference code.
+
+ Note: if you quantize to save GPU memory, load the original-precision model onto the CPU first and quantize it there. When calling `from_pretrained`, avoid `device_map='auto'` and any other argument that would load the original-precision model directly onto the GPU.
+
+ To use int8 quantization:
+ ```python
+ model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-13B-Chat", torch_dtype=torch.float16, trust_remote_code=True)
+ model = model.quantize(8).cuda()  # quantize on the CPU, then move to the GPU
+ ```
+
+ Similarly, to use int4 quantization:
+ ```python
+ model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-13B-Chat", torch_dtype=torch.float16, trust_remote_code=True)
+ model = model.quantize(4).cuda()  # quantize on the CPU, then move to the GPU
+ ```
+
  ## Model Details

  ### Model Description
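
The quantization steps added in this commit can be sketched as follows, to show roughly why the README recommends quantizing on the CPU before moving the model to the GPU. This is a minimal sketch under stated assumptions: the 13e9 parameter count is an approximation read off the model name, the estimate covers weights only (no activations, KV cache, or quantization scales), and `estimate_weight_gib` and `load_quantized` are hypothetical helper names, not part of the Baichuan or transformers APIs.

```python
"""Sketch: weight-memory estimate and a CPU-first quantized loader."""

MODEL_ID = "baichuan-inc/Baichuan-13B-Chat"
APPROX_PARAMS = 13e9  # assumption: rounded from the "13B" in the model name
SUPPORTED_BITS = (8, 4)  # the two bit widths the README mentions


def estimate_weight_gib(bits, n_params=APPROX_PARAMS):
    """Return approximate weight storage in GiB at `bits` bits per parameter.

    Weight-only estimate: ignores activations, the KV cache, quantization
    scales, and any layers the model may leave unquantized.
    """
    return n_params * bits / 8 / 2**30


def load_quantized(bits):
    """Load in float16 on the CPU, quantize there, then move to the GPU."""
    if bits not in SUPPORTED_BITS:
        raise ValueError(f"bits must be one of {SUPPORTED_BITS}, got {bits}")
    # Imported lazily so the estimate above works without torch installed.
    import torch
    from transformers import AutoModelForCausalLM

    # No device_map here, so the float16 weights stay in CPU RAM for now.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16, trust_remote_code=True
    )
    # quantize(bits) is the method the README calls on the loaded model.
    return model.quantize(bits).cuda()


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: ~{estimate_weight_gib(bits):.1f} GiB")
```

Under these assumptions the weights alone come to roughly 24 GiB at float16, 12 GiB at int8, and 6 GiB at int4, which is why loading the float16 model straight onto a smaller GPU can run out of memory before quantization starts. Calling `load_quantized(8)` or `load_quantized(4)` requires a CUDA device and downloads the model.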