DeepSeek-R1-Distill-Qwen-14B-FP8
An FP8-quantized version of DeepSeek-R1-Distill-Qwen-14B, optimized for inference with vLLM. The quantization reduces the model's memory footprint by approximately 50%.
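As a rough back-of-the-envelope check (assuming roughly 15B parameters, in line with the Qwen2.5-14B base): 16-bit weights occupy about 15B × 2 bytes ≈ 30 GB, while FP8 weights occupy about 15B × 1 byte ≈ 15 GB, before activations and KV cache.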
Model Overview
- Base Model: DeepSeek-R1-Distill-Qwen-14B
- Quantization: FP8 (weights and activations)
- Memory Reduction: ~50% (from 16-bit to 8-bit)
- License: MIT License (following the original model's license)
Compression Details
Compressed using LLM Compressor with:
- 512 calibration samples from UltraChat
- Symmetric per-tensor quantization
- Applied to linear operators within transformer blocks
The compression script is available in compress.py.
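For orientation, here is a minimal sketch of what such a run can look like with LLM Compressor. The import paths, dataset identifier, and sequence length below are assumptions and may differ from the actual compress.py:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
SAVE_DIR = "DeepSeek-R1-Distill-Qwen-14B-FP8"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 weights and activations on the linear operators inside transformer blocks;
# the lm_head is typically kept in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

# Calibrate activation scales on 512 UltraChat samples (symmetric, per-tensor).
# The dataset alias and max_seq_length are illustrative assumptions.
oneshot(
    model=model,
    dataset="ultrachat_200k",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```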
Requirements
- vLLM
- transformers
- torch
- accelerate
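With these installed, the model can be served through vLLM's offline API. A minimal inference sketch (the prompt and sampling parameters are purely illustrative):

```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint; vLLM picks up the quantization config from the repo.
llm = LLM(model="enferAI/DeepSeek-R1-Distill-Qwen-14B-FP8")

sampling_params = SamplingParams(temperature=0.6, max_tokens=1024)

outputs = llm.generate(["Explain FP8 quantization in one paragraph."], sampling_params)
print(outputs[0].outputs[0].text)
```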
Note
This is an experimental compression of the model. Performance metrics and optimal usage parameters have not been thoroughly tested yet.