DeepSeek-R1-Distill-Qwen-14B-FP8
An FP8-quantized version of DeepSeek-R1-Distill-Qwen-14B, optimized for inference with vLLM. The quantization reduces the model's memory footprint by approximately 50%.
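As a rough back-of-the-envelope check (assuming roughly 15B parameters, in line with the Qwen2.5-14B base): 16-bit weights occupy about 15B × 2 bytes ≈ 30 GB, while FP8 weights occupy about 15B × 1 byte ≈ 15 GB, before activations and KV cache.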
Model Overview
- Base Model: DeepSeek-R1-Distill-Qwen-14B
- Quantization: FP8 (weights and activations)
- Memory Reduction: ~50% (from 16-bit to 8-bit)
- License: MIT License (following the original model's license)
Compression Details
Compressed using LLM Compressor with:
- 512 calibration samples from UltraChat
- Symmetric per-tensor quantization
- Applied to linear operators within transformer blocks
The compression script is available in compress.py.
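For orientation, here is a minimal sketch of what such a run can look like with LLM Compressor. The import paths, dataset identifier, and sequence length below are assumptions and may differ from the actual compress.py:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
SAVE_DIR = "DeepSeek-R1-Distill-Qwen-14B-FP8"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 weights and activations on the linear operators inside transformer blocks;
# the lm_head is typically kept in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

# Calibrate activation scales on 512 UltraChat samples (symmetric, per-tensor).
# The dataset alias and max_seq_length are illustrative assumptions.
oneshot(
    model=model,
    dataset="ultrachat_200k",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```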
Requirements
- vLLM
- transformers
- torch
- accelerate
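With these installed, the model can be served through vLLM's offline API. A minimal inference sketch (the prompt and sampling parameters are purely illustrative):

```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint; vLLM picks up the quantization config from the repo.
llm = LLM(model="enferAI/DeepSeek-R1-Distill-Qwen-14B-FP8")

sampling_params = SamplingParams(temperature=0.6, max_tokens=1024)

outputs = llm.generate(["Explain FP8 quantization in one paragraph."], sampling_params)
print(outputs[0].outputs[0].text)
```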
Note
This is an experimental compression of the model. Performance metrics and optimal usage parameters have not been thoroughly tested yet.