---
license: mit
license_name: mit
license_link: LICENSE
library_name: transformers
tags:
- fp8
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
---

# DeepSeek-R1-Distill-Qwen-14B-FP8

FP8-quantized version of [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B), optimized for inference with vLLM. The quantization reduces the model's memory footprint by approximately 50%.
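
Since the checkpoint targets vLLM, a minimal inference sketch could look like the following (the model identifier and sampling parameters below are placeholders, not validated settings):

```python
from vllm import LLM, SamplingParams

# Placeholder identifier: point this at the repository or a local copy of
# the FP8 checkpoint.
llm = LLM(model="DeepSeek-R1-Distill-Qwen-14B-FP8")

sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
outputs = llm.generate(["Why is the sky blue? Think step by step."], sampling)
print(outputs[0].outputs[0].text)
```

Full FP8 activation support generally requires a GPU generation with native FP8 kernels (e.g., Hopper or Ada); on older hardware vLLM may fall back to weight-only handling of the FP8 weights.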

## Model Overview

- **Base Model**: DeepSeek-R1-Distill-Qwen-14B
- **Quantization**: FP8 (weights and activations)
- **Memory Reduction**: ~50% (from 16-bit to 8-bit; see the rough estimate below)
- **License**: MIT (following the original model's license)
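
As a back-of-the-envelope check on the memory figure (weights only, ignoring activations and the KV cache; the parameter count is approximate):

```python
# Approximate weight memory for a ~14B-parameter model.
params = 14e9
bf16_gib = params * 2 / 1024**3   # 16-bit weights -> ~26 GiB
fp8_gib = params * 1 / 1024**3    # 8-bit weights  -> ~13 GiB
print(f"BF16: {bf16_gib:.0f} GiB, FP8: {fp8_gib:.0f} GiB")
```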

## Compression Details

Compressed using [LLM Compressor](https://github.com/vllm-project/llm-compressor) with:

- 512 calibration samples from UltraChat
- Symmetric per-tensor quantization
- Applied to linear operators within transformer blocks

The compression script is available in `compress.py`.
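
For reference, a minimal sketch of this kind of recipe using LLM Compressor's `oneshot` API (the dataset identifier, ignore list, and output directory here are assumptions and may differ from the actual `compress.py`):

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Static FP8 quantization of weights and activations for the Linear layers
# inside the transformer blocks; the lm_head is assumed to be kept in
# higher precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8",
    ignore=["lm_head"],
)

oneshot(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    dataset="ultrachat_200k",          # calibration data (assumed identifier)
    recipe=recipe,
    num_calibration_samples=512,
    output_dir="DeepSeek-R1-Distill-Qwen-14B-FP8",
)
```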

## Requirements

- vLLM
- transformers
- torch
- accelerate

## Note

This is an experimental compression of the model. Performance metrics and optimal usage parameters have not yet been thoroughly evaluated.