DeepSeek-R1-AWQ / README.md

Updated README.md to include serving command and inference speed.

524e5e5 verified about 8 hours ago

1.16 kB

	---
	license: mit
	language:
	- en
	- zh
	base_model:
	- deepseek-ai/DeepSeek-R1
	pipeline_tag: text-generation
	library_name: transformers
	---
	# DeepSeek R1 AWQ
	AWQ of DeepSeek R1.

	This quant modified some of the model code to fix an overflow issue when using float16.

	To serve using vLLM with 8x 80GB GPUs, use the following command:
	```sh
	python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 12345 --max-model-len 65536 --trust-remote-code --tensor-parallel-size 8 --quantization moe_wna16 --gpu-memory-utilization 0.97 --kv-cache-dtype fp8_e5m2 --calculate-kv-scales --served-model-name deepseek-reasoner --model cognitivecomputations/DeepSeek-R1-AWQ
	```
	The max model length flag ensures that KV cache usage won't be higher than available memory, the `moe_wna16` kernel doubles the inference speed, but you must build vLLM from source as of 2025/2/3. \
	You can download the wheel I built for PyTorch 2.6, Python 3.12 by clicking [here](https://huggingface.co/x2ray/wheels/resolve/main/vllm-0.7.1.dev69%2Bg4f4d427a.d20220101.cu126-cp312-cp312-linux_x86_64.whl).

	Inference speed with batch size 1 and short prompt:
	- 8x H100: 34 TPS
	- 8x A100: 27 TPS