---
license: mit
license_name: mit
license_link: LICENSE
library_name: transformers
tags:
- fp8
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
---

# DeepSeek-R1-Distill-Qwen-14B-FP8

FP8-quantized version of [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B), optimized for inference with vLLM. The quantization reduces the model's memory footprint by approximately 50%.
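
Since the checkpoint targets vLLM, a minimal inference sketch could look like the following (the model identifier and sampling parameters below are placeholders, not validated settings):

```python
from vllm import LLM, SamplingParams

# Placeholder identifier: point this at the repository or a local copy of
# the FP8 checkpoint.
llm = LLM(model="DeepSeek-R1-Distill-Qwen-14B-FP8")

sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
outputs = llm.generate(["Why is the sky blue? Think step by step."], sampling)
print(outputs[0].outputs[0].text)
```

Full FP8 activation support generally requires a GPU generation with native FP8 kernels (e.g., Hopper or Ada); on older hardware vLLM may fall back to weight-only handling of the FP8 weights.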

## Model Overview

- **Base Model**: DeepSeek-R1-Distill-Qwen-14B
- **Quantization**: FP8 (weights and activations)
- **Memory Reduction**: ~50% (from 16-bit to 8-bit; see the rough estimate below)
- **License**: MIT (following the original model's license)
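
As a back-of-the-envelope check on the memory figure (weights only, ignoring activations and the KV cache; the parameter count is approximate):

```python
# Approximate weight memory for a ~14B-parameter model.
params = 14e9
bf16_gib = params * 2 / 1024**3   # 16-bit weights -> ~26 GiB
fp8_gib = params * 1 / 1024**3    # 8-bit weights  -> ~13 GiB
print(f"BF16: {bf16_gib:.0f} GiB, FP8: {fp8_gib:.0f} GiB")
```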

## Compression Details

Compressed using [LLM Compressor](https://github.com/vllm-project/llm-compressor) with:

- 512 calibration samples from UltraChat
- Symmetric per-tensor quantization
- Applied to linear operators within transformer blocks

The compression script is available in `compress.py`.
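
For reference, a minimal sketch of this kind of recipe using LLM Compressor's `oneshot` API (the dataset identifier, ignore list, and output directory here are assumptions and may differ from the actual `compress.py`):

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Static FP8 quantization of weights and activations for the Linear layers
# inside the transformer blocks; the lm_head is assumed to be kept in
# higher precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8",
    ignore=["lm_head"],
)

oneshot(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    dataset="ultrachat_200k",          # calibration data (assumed identifier)
    recipe=recipe,
    num_calibration_samples=512,
    output_dir="DeepSeek-R1-Distill-Qwen-14B-FP8",
)
```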

## Requirements

- vLLM
- transformers
- torch
- accelerate

## Note

This is an experimental compression of the model. Performance metrics and optimal usage parameters have not yet been thoroughly evaluated.