---
license: mit
license_name: mit
license_link: LICENSE
library_name: transformers
tags:
- fp8
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
---
# DeepSeek-R1-Distill-Qwen-14B-FP8
An FP8-quantized version of [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B), optimized for inference with vLLM. Quantization reduces the model's memory footprint by approximately 50%.
## Model Overview
- Base Model: [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B)
- Quantization: FP8 (weights and activations)
- Memory Reduction: ~50% (from 16-bit to 8-bit)
- License: MIT (following the original model's license)
## Compression Details
Compressed using [LLM Compressor](https://github.com/vllm-project/llm-compressor) with:
- 512 calibration samples from UltraChat
- Symmetric per-tensor quantization
- Applied to linear operators within transformer blocks
The compression script is available in `compress.py`.
## Requirements
- vLLM
- transformers
- torch
- accelerate
## Note
This is an experimental compression of the model. Accuracy benchmarks and optimal serving parameters have not yet been thoroughly evaluated.