---
license: mit
license_name: mit
license_link: LICENSE
library_name: transformers
tags:
- fp8
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
---

# DeepSeek-R1-Distill-Qwen-14B-FP8

FP8-quantized version of [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B), optimized for inference with vLLM. The quantization reduces the model's memory footprint by approximately 50%. A usage example is given at the end of this card.

## Model Overview

- **Base Model**: DeepSeek-R1-Distill-Qwen-14B
- **Quantization**: FP8 (weights and activations)
- **Memory Reduction**: ~50% (from 16-bit to 8-bit)
- **License**: MIT License (following the original model's license)

## Compression Details

Compressed using [LLM Compressor](https://github.com/vllm-project/llm-compressor) with:

- 512 calibration samples from UltraChat
- Symmetric per-tensor quantization
- Quantization applied to linear operators within transformer blocks

The compression script is available in `compress.py`; a sketch of the flow is shown at the end of this card.

## Requirements

- vLLM
- transformers
- torch
- accelerate

## Note

This is an experimental compression of the model. Performance metrics and optimal usage parameters have not been thoroughly tested yet.
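
## Usage

The snippet below is a minimal sketch of offline inference with vLLM, which loads FP8 checkpoints produced by LLM Compressor directly. The local path `./DeepSeek-R1-Distill-Qwen-14B-FP8` is a placeholder for wherever you keep this checkpoint, and the sampling settings follow the recommendations published for the upstream DeepSeek-R1 distills; neither has been validated specifically for this FP8 variant.

```python
from vllm import LLM, SamplingParams

# Path (or Hugging Face repo id) of this FP8 checkpoint -- adjust to your setup.
MODEL_PATH = "./DeepSeek-R1-Distill-Qwen-14B-FP8"

# Sampling settings follow the upstream DeepSeek-R1 distill recommendations;
# tune them for your workload.
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

# vLLM picks up the FP8 (compressed-tensors) format from the checkpoint config.
llm = LLM(model=MODEL_PATH)

prompts = ["Explain the difference between FP8 and INT8 quantization in one paragraph."]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)
```

Recent vLLM releases can also serve the same checkpoint behind an OpenAI-compatible endpoint via `vllm serve <model path>`.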
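
## Compression Sketch

For reference, the following is a sketch of an FP8 static-quantization run with LLM Compressor that matches the settings listed under Compression Details. It is illustrative only: the actual `compress.py` in this repository may differ, and the dataset id (`HuggingFaceH4/ultrachat_200k`), the chat-template preprocessing, the sequence length, and the `lm_head` exclusion are assumptions; exact import paths and arguments also vary across llm-compressor versions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
NUM_CALIBRATION_SAMPLES = 512   # matches the model card
MAX_SEQUENCE_LENGTH = 2048      # assumption, not stated in the card

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: UltraChat samples rendered through the model's chat template.
# The exact dataset id and split are assumptions.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})

# FP8 with static, symmetric per-tensor scales on the Linear layers inside the
# transformer blocks; the lm_head is left in higher precision here (assumption).
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

# One-shot calibration pass that attaches the quantization parameters.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save in the compressed-tensors format that vLLM can load directly.
model.save_pretrained("DeepSeek-R1-Distill-Qwen-14B-FP8", save_compressed=True)
tokenizer.save_pretrained("DeepSeek-R1-Distill-Qwen-14B-FP8")
```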