---
base_model:
- deepseek-ai/DeepSeek-R1
---
# Model Overview
## Description:
The NVIDIA DeepSeek R1 FP4 model is the quantized version of DeepSeek AI's DeepSeek R1 model, an auto-regressive language model that uses an optimized transformer architecture. For more information, please check [here](https://huggingface.co/deepseek-ai/DeepSeek-R1). The NVIDIA DeepSeek R1 FP4 model is quantized with [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer).
This model is ready for commercial/non-commercial use.
## Third-Party Community Consideration
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see link to Non-NVIDIA [(DeepSeek R1) Model Card](https://huggingface.co/deepseek-ai/DeepSeek-R1).
### License/Terms of Use:
[nvidia-open-model-license](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/)
## Model Architecture:
**Architecture Type:** Transformer
**Network Architecture:** DeepSeek R1
## Input:
**Input Type(s):** Text
**Input Format(s):** String
**Input Parameters:** 1D (One Dimensional): Sequences
**Other Properties Related to Input:** Context length up to 128K
## Output:
**Output Type(s):** Text
**Output Format:** String
**Output Parameters:** 1D (One Dimensional): Sequences
**Other Properties Related to Output:** N/A
## Software Integration:
**Supported Runtime Engine(s):**
* TensorRT-LLM
**Supported Hardware Microarchitecture Compatibility:**
* NVIDIA Blackwell
**Preferred Operating System(s):**
* Linux
## Model Version(s):
The model is quantized with nvidia-modelopt **v0.23.0**
## Datasets:
* Calibration Dataset: [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail)
  * Data collection method: Automated.
  * Labeling method: Unknown.
* Evaluation Dataset: [MMLU](https://github.com/hendrycks/test)
  * Data collection method: Unknown.
  * Labeling method: N/A.
## Inference:
**Engine:** TensorRT-LLM
**Test Hardware:** B200
## Post Training Quantization
This model was obtained by quantizing the weights and activations of DeepSeek R1 to the FP4 data type, ready for inference with TensorRT-LLM. Only the weights and activations of the linear operators within the transformer blocks are quantized. This optimization reduces the number of bits per parameter from 8 to 4; since the remaining layers stay in higher precision and block scaling factors are stored alongside the FP4 values, disk size and GPU memory requirements shrink by approximately 1.6x rather than a full 2x.
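As a rough illustration of how such a checkpoint is produced, the sketch below uses the ModelOpt post-training quantization API. The config name `NVFP4_DEFAULT_CFG`, the calibration split size, and the use of `abisee/cnn_dailymail` text are assumptions drawn from the ModelOpt documentation and the calibration dataset listed above, not the exact recipe used for this checkpoint:
```python
# Hedged sketch of FP4 post-training quantization with TensorRT Model Optimizer
# (ModelOpt). NVFP4_DEFAULT_CFG and the calibration details are assumptions,
# not the exact recipe used to produce nvidia/DeepSeek-R1-FP4.
import modelopt.torch.quantization as mtq
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1"  # the full model requires a multi-GPU node
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
)

# A few hundred calibration samples are typically sufficient for PTQ.
calib = load_dataset("abisee/cnn_dailymail", "3.0.0", split="train[:512]")

def forward_loop(m):
    # Run calibration text through the model so ModelOpt can collect the
    # activation statistics used to compute the FP4 scaling factors.
    for sample in calib:
        inputs = tokenizer(
            sample["article"], return_tensors="pt",
            truncation=True, max_length=2048,
        ).to(m.device)
        m(**inputs)

# Quantize the weights and activations of the linear layers to FP4.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```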
## Usage
### Deploy with TensorRT-LLM
To deploy the quantized checkpoint with the [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) LLM API, follow the sample code below:
* LLM API sample usage:
```python
from tensorrt_llm import LLM, SamplingParams


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="nvidia/DeepSeek-R1-FP4")

    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


# The entry point of the program needs to be protected because the LLM API
# may spawn worker processes.
if __name__ == '__main__':
    main()
```
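The `if __name__ == '__main__':` guard is required because the LLM API may spawn worker processes for model-parallel execution; without it, each spawned process would re-execute the script. Note that serving the full checkpoint generally requires a multi-GPU NVIDIA Blackwell system (see Software Integration above).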
* Accuracy evaluation:
1) Prepare the MMLU dataset:
```sh
mkdir -p data && wget https://people.eecs.berkeley.edu/~hendrycks/data.tar -O data/mmlu.tar
tar -xf data/mmlu.tar -C data && mv data/data data/mmlu
```
2) Measure MMLU (run from the root of the [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) repository, which provides the script):
```sh
python examples/mmlu_llmapi.py --data_dir data/mmlu --hf_model_dir nvidia/DeepSeek-R1-FP4 --backend=pytorch
```
* Throughput evaluation:
Please refer to the [TensorRT-LLM benchmarking documentation](https://github.com/NVIDIA/TensorRT-LLM/tree/main/benchmarks) for details; a sketch of one possible invocation follows below.
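As one hedged starting point, the `trtllm-bench` utility that ships with TensorRT-LLM can measure serving throughput; the subcommand and flag names below follow the benchmarking documentation and may differ between releases:
```sh
# Hypothetical invocation: dataset.jsonl is a placeholder for a file of
# prepared benchmark requests (see the benchmarking docs for how to
# generate one). Flag names may vary across TensorRT-LLM versions.
trtllm-bench --model nvidia/DeepSeek-R1-FP4 throughput \
    --dataset dataset.jsonl --backend pytorch
```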
#### Evaluation
The accuracy (MMLU, 5-shot) benchmark results are presented in the table below:
| Precision | MMLU |
|:---------:|:----:|
| FP8       | 90.8 |
| FP4       | 86.9 |