---
license: mit
license_name: mit
license_link: LICENSE
library_name: transformers
tags:
- fp8
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
---
# DeepSeek-R1-Distill-Qwen-14B-FP8
An FP8-quantized version of [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B), optimized for inference with vLLM. Quantization reduces the model's memory footprint by approximately 50%.
## Model Overview
- Base Model: [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B)
- Quantization: FP8 (weights and activations)
- Memory Reduction: ~50% (from 16-bit to 8-bit)
- License: MIT (following the original model's license)
## Compression Details
Compressed using [LLM Compressor](https://github.com/vllm-project/llm-compressor) with:
- 512 calibration samples from UltraChat
- Symmetric per-tensor quantization
- Applied to linear operators within transformer blocks
The compression script is available in `compress.py`.
## Requirements
- vLLM
- transformers
- torch
- accelerate
## Note
This is an experimental compression of the model. Accuracy benchmarks and optimal serving parameters have not yet been thoroughly evaluated.