DeepSeek-R1-Distill-Qwen-14B-FP8

FP8-quantized version of DeepSeek-R1-Distill-Qwen-14B, optimized for inference with vLLM. The quantization reduces the model's memory footprint by approximately 50%.

Model Overview

  • Base Model: DeepSeek-R1-Distill-Qwen-14B
  • Quantization: FP8 (E4M3), weights and activations
  • Memory Reduction: ~50% (from 16-bit to 8-bit; roughly 29.6 GB to 14.8 GB of weights for 14.8B parameters)
  • License: MIT License (following the original model's license)

Compression Details

Compressed using LLM Compressor with:

  • 512 calibration samples from UltraChat
  • Symmetric per-tensor quantization
  • Applied to linear operators within transformer blocks

The compression script is available in compress.py.
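compress.py is not reproduced here, but a comparable LLM Compressor one-shot recipe would look roughly like the sketch below. Only the 512 UltraChat calibration samples and the per-tensor FP8 scheme are taken from this card; the dataset handle, sequence length, and save directory are illustrative, and exact import paths depend on the installed llm-compressor version.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot  # newer releases also expose `from llmcompressor import oneshot`

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
SAVE_DIR = "DeepSeek-R1-Distill-Qwen-14B-FP8"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Static, symmetric per-tensor FP8 quantization of weights and activations,
# applied to the Linear layers inside the transformer blocks; lm_head is left unquantized.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

oneshot(
    model=model,
    tokenizer=tokenizer,
    dataset="ultrachat_200k",       # dataset handle is illustrative; the card only says "UltraChat"
    recipe=recipe,
    max_seq_length=2048,            # not stated in the card
    num_calibration_samples=512,    # matches the card
)

# Save in compressed-tensors format so vLLM can load the checkpoint directly.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```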

Requirements

  • vLLM
  • transformers
  • torch
  • accelerate
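
Usage

With these dependencies installed, the checkpoint can be loaded directly in vLLM, which reads the FP8 quantization config from the repository. A minimal offline-inference sketch (the prompt and sampling parameters are illustrative; optimal settings are untested, see the note below):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="enferAI/DeepSeek-R1-Distill-Qwen-14B-FP8")

# Sampling settings are illustrative; optimal parameters have not been tuned (see Note).
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

prompts = ["Explain the difference between FP8 and BF16 number formats."]
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)
```

The same model ID can also be passed to `vllm serve` to expose an OpenAI-compatible endpoint.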

Note

This is an experimental compression of the model. Performance metrics and optimal usage parameters have not been thoroughly tested yet.
