---
license: mit
license_name: mit
license_link: LICENSE
library_name: transformers
tags:
    - fp8
    - vllm
language:
    - en
    - de
    - fr
    - it
    - pt
    - hi
    - es
    - th
pipeline_tag: text-generation
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
---

# DeepSeek-R1-Distill-Qwen-14B-FP8

FP8-quantized version of [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B), optimized for inference with vLLM. The quantization reduces the model's memory footprint by approximately 50%.
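
A minimal sketch of running inference with vLLM is shown below. The repository id and generation parameters are illustrative assumptions; point `model=` at wherever the FP8 weights are actually hosted.

```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint; vLLM picks up the quantization config from the model files.
# The repository id here is an assumption; replace it with the actual model path.
llm = LLM(model="DeepSeek-R1-Distill-Qwen-14B-FP8", max_model_len=4096)

sampling = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Explain FP8 quantization in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```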

## Model Overview

-   **Base Model**: DeepSeek-R1-Distill-Qwen-14B
-   **Quantization**: FP8 (weights and activations)
-   **Memory Reduction**: ~50% (from 16-bit to 8-bit)
-   **License**: MIT License (following the original model's license)

## Compression Details

Compressed using [LLM Compressor](https://github.com/vllm-project/llm-compressor) with:

-   512 calibration samples from UltraChat
-   Symmetric per-tensor quantization
-   Applied to linear operators within transformer blocks

The compression script is available in `compress.py`.
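
For reference, the outline below sketches how such a script might look with LLM Compressor. It is an illustrative approximation of the settings above, not the verbatim contents of `compress.py`; the dataset identifier, sequence length, and output directory are assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Static, symmetric per-tensor FP8 quantization of weights and activations for the
# Linear layers inside the transformer blocks; lm_head stays in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

# 512 UltraChat calibration samples are used to fit the quantization scales.
oneshot(
    model=model,
    dataset="ultrachat_200k",   # assumed dataset identifier
    recipe=recipe,
    num_calibration_samples=512,
    max_seq_length=2048,        # assumed calibration sequence length
    output_dir="DeepSeek-R1-Distill-Qwen-14B-FP8",
)

tokenizer.save_pretrained("DeepSeek-R1-Distill-Qwen-14B-FP8")
```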

## Requirements

-   vLLM
-   transformers
-   torch
-   accelerate

## Note

This is an experimental compression of the model; accuracy, throughput, and recommended inference settings have not yet been thoroughly evaluated.