How to deploy and fine-tune DeepSeek models on AWS
A running document to showcase how to deploy and fine-tune DeepSeek R1 models with Hugging Face on AWS.
What is DeepSeek-R1?
If you’ve ever struggled with a tough math problem, you know how useful it is to think a little longer and work through it carefully. OpenAI’s o1 model showed that when LLMs are trained to do the same—by using more compute during inference—they get significantly better at solving reasoning tasks like mathematics, coding, and logic.
However, the recipe behind OpenAI’s reasoning models has been a well-kept secret. That is, until last week, when DeepSeek released their DeepSeek-R1 model and promptly broke the internet (and the stock market!).
DeepSeek AI open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen architectures. You can find them all in the DeepSeek R1 collection.
We collaborate with Amazon Web Services to make it easier for developers to deploy the latest Hugging Face models on AWS services to build better generative AI applications.
Let’s review how you can deploy and fine-tune DeepSeek R1 models with Hugging Face on AWS.
Deploy DeepSeek R1 models
Deploy on AWS with Hugging Face Inference Endpoints
Hugging Face Inference Endpoints offers an easy and secure way to deploy Machine Learning models on dedicated compute for use in production on AWS. Inference Endpoints empower developers and data scientists alike to create AI applications without managing infrastructure: simplifying the deployment process to a few clicks, including handling large volumes of requests with autoscaling, reducing infrastructure costs with scale-to-zero, and offering advanced security.
With Inference Endpoints, you can deploy any of the 6 distilled models from DeepSeek-R1, as well as a quantized version of DeepSeek R1 made by Unsloth: https://huggingface.co/unsloth/DeepSeek-R1-GGUF. On the model page, click on Deploy, then on HF Inference Endpoints. You will be redirected to the Inference Endpoint page, where we have pre-selected an optimized inference container and the recommended hardware to run the model. Once you have created your endpoint, you can send your queries to DeepSeek R1 for $8.30 per hour on AWS 🤯.
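If you prefer to create the endpoint programmatically rather than through the UI, here is a minimal sketch using the create_inference_endpoint helper from the huggingface_hub library. The endpoint name, instance type, and instance size below are illustrative assumptions; use the values recommended on the endpoint creation page.
from huggingface_hub import create_inference_endpoint

# Create a dedicated endpoint on AWS for a DeepSeek R1 distilled model.
# Instance values are assumptions; match them to the configuration
# recommended in the Inference Endpoints UI.
endpoint = create_inference_endpoint(
    "deepseek-r1-distill-llama-70b",  # hypothetical endpoint name
    repository="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_type="nvidia-a100",  # assumption: pick the recommended GPU
    instance_size="x8",           # assumption: pick the recommended size
)
endpoint.wait()  # block until the endpoint is running

# Query the endpoint through its attached inference client
print(endpoint.client.text_generation("What is 1 + 1?", max_new_tokens=64))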
You can find DeepSeek R1 and distilled models, as well as other popular open LLMs, ready to deploy on optimized configurations in the Inference Endpoints Model Catalog.
| Note: The team is working on enabling the deployment of DeepSeek models on Inferentia instances. Stay tuned!
Deploy on Amazon SageMaker AI with Hugging Face LLM DLCs
DeepSeek R1 on GPUs
| Note: The team is working on enabling DeepSeek-R1 deployment on Amazon SageMaker AI with the Hugging Face LLM DLCs on GPU. Stay tuned!
Distilled models on GPUs
Let’s walk through the deployment of DeepSeek-R1-Distill-Llama-70B.
Code snippets are available on the model page under the Deploy button!
First, let’s cover a few prerequisites. Make sure you have a SageMaker Domain configured, sufficient quota in SageMaker, and a JupyterLab space. For DeepSeek-R1-Distill-Llama-70B, you should raise the default quota for ml.g6.48xlarge for endpoint usage to 1 (you can check your current quota with the CLI snippet after the table below).
For reference, here are the hardware configurations we recommend for each of the distilled variants:
| Model | Instance Type | # of GPUs per replica |
| --- | --- | --- |
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | ml.g6.48xlarge | 8 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | ml.g6.12xlarge | 4 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | ml.g6.12xlarge | 4 |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | ml.g6.2xlarge | 1 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | ml.g6.2xlarge | 1 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | ml.g6.2xlarge | 1 |
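To check your current quota before requesting an increase, you can query Service Quotas from the AWS CLI. This is a sketch: the quota name follows SageMaker’s usual naming convention and the region is an example, so adjust both to your setup.
# List the SageMaker endpoint-usage quota for ml.g6.48xlarge (quota name is an assumption)
aws service-quotas list-service-quotas \
    --service-code sagemaker \
    --region us-east-1 \
    --query "Quotas[?QuotaName=='ml.g6.48xlarge for endpoint usage']"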
Once in a notebook, make sure to install the latest version of the SageMaker SDK.
!pip install sagemaker --upgrade
Then, instantiate a sagemaker_session which is used to determine the current region and execution role.
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
# get the execution role; fall back to a named IAM role when running outside SageMaker
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]
Create the SageMaker Model object with the Python SDK:
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
model_name = model_id.split("/")[-1].lower()

# Hub Model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": model_id,
    "SM_NUM_GPUS": json.dumps(8),  # shard the model across the 8 GPUs of ml.g6.48xlarge
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="3.0.1"),
    env=hub,
    role=role,
)
Deploy the model to a SageMaker endpoint and test the endpoint:
endpoint_name = f"{model_name}-ep"

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type="ml.g6.48xlarge",
    container_startup_health_check_timeout=2400,
)

# send request
predictor.predict({"inputs": "What is the meaning of life?"})
That’s it, you deployed a Llama 70B reasoning model!
Because you are using a TGI v3 container under the hood, the most performant parameters for the given hardware will be automatically selected.
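You can also call the endpoint from any environment that has AWS credentials, without the SageMaker SDK, by going through the SageMaker runtime API. A minimal sketch, assuming the endpoint keeps the f"{model_name}-ep" name used above:
import json
import boto3

# Invoke the deployed endpoint directly through the SageMaker runtime
smr = boto3.client("sagemaker-runtime")

response = smr.invoke_endpoint(
    EndpointName="deepseek-r1-distill-llama-70b-ep",  # assumption: matches f"{model_name}-ep"
    ContentType="application/json",
    Body=json.dumps({
        "inputs": "Solve step by step: what is 17 * 24?",
        "parameters": {"max_new_tokens": 512, "temperature": 0.6},
    }),
)
print(json.loads(response["Body"].read()))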
Make sure you delete the endpoint once you have finished testing it.
predictor.delete_model()
predictor.delete_endpoint()
Distilled models on Neuron
Let’s walk through the deployment of DeepSeek-R1-Distill-Llama-70B on a Neuron instance, such as AWS Trainium 2 or AWS Inferentia 2.
Code snippets are available on the model page under the Deploy button!
The prerequisites to deploy to a Neuron instance are the same. Make sure you have a SageMaker Domain configured, sufficient quota in SageMaker, and a JupyterLab space. For DeepSeek-R1-Distill-Llama-70B, you should raise the default quota for ml.inf2.48xlarge for endpoint usage to 1.
Then, instantiate a sagemaker_session which is used to determine the current region and execution role.
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
# get the execution role; fall back to a named IAM role when running outside SageMaker
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]
Create the SageMaker Model object with the Python SDK:
image_uri = get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.25")
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
model_name = model_id.split("/")[-1].lower()

# Hub Model configuration
hub = {
    "HF_MODEL_ID": model_id,
    "HF_NUM_CORES": "24",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "4",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=image_uri,
    env=hub,
    role=role,
)
Deploy the model to a SageMaker endpoint and test the endpoint:
endpoint_name = f"{model_name}-ep"
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type="ml.inf2.48xlarge",
    container_startup_health_check_timeout=3600,
    volume_size=512,
)

# send request
predictor.predict(
    {
        "inputs": "What is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        },
    }
)
That’s it, you deployed a Llama 70B reasoning model on a Neuron instance! Under the hood, it downloaded a pre-compiled model from Hugging Face to speed up the endpoint start time.
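If you come back in a fresh session or a separate script, you can re-attach a predictor to the running endpoint instead of redeploying it. A minimal sketch, assuming the endpoint name created above:
from sagemaker.huggingface import HuggingFacePredictor

# Re-attach to the already running endpoint (name is an assumption,
# matching the f"{model_name}-ep" pattern used above)
predictor = HuggingFacePredictor(endpoint_name="deepseek-r1-distill-llama-70b-ep")
print(predictor.predict({"inputs": "In one sentence, why is the sky blue?"}))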
Make sure you delete the endpoint once you have finished testing it.
predictor.delete_model()
predictor.delete_endpoint()
Deploy on EC2 Neuron with the Hugging Face Neuron Deep Learning AMI
This guide will detail how to export, deploy and run DeepSeek-R1-Distill-Llama-70B on an inf2.48xlarge AWS EC2 instance.
First, let’s cover a few prerequisites. Make sure you have subscribed to the Hugging Face Neuron Deep Learning AMI on the AWS Marketplace. It provides all the necessary dependencies to train and deploy Hugging Face models on Trainium & Inferentia. Then, launch an inf2.48xlarge instance in EC2 with the AMI and connect through SSH. You can check our step-by-step guide if you have never done it.
Once connected to the instance, you can deploy the model on an endpoint with this command:
docker run -p 8080:80 \
-v $(pwd)/data:/data \
--device=/dev/neuron0 \
--device=/dev/neuron1 \
--device=/dev/neuron2 \
--device=/dev/neuron3 \
--device=/dev/neuron4 \
--device=/dev/neuron5 \
--device=/dev/neuron6 \
--device=/dev/neuron7 \
--device=/dev/neuron8 \
--device=/dev/neuron9 \
--device=/dev/neuron10 \
--device=/dev/neuron11 \
-e HF_BATCH_SIZE=4 \
-e HF_SEQUENCE_LENGTH=4096 \
-e HF_AUTO_CAST_TYPE="bf16" \
-e HF_NUM_CORES=24 \
ghcr.io/huggingface/neuronx-tgi:latest \
--model-id deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
--max-batch-size 4 \
--max-total-tokens 4096
It will take a few minutes to download the compiled model from the Hugging Face cache and launch a TGI endpoint.
Then, you can test the endpoint:
curl localhost:8080/generate \
-X POST \
-d '{"inputs":"Why is the sky dark at night?"}' \
-H 'Content-Type: application/json'
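TGI also exposes an OpenAI-compatible Messages API. Assuming the neuronx-tgi image you launched includes it, you can send chat-formatted requests to the /v1/chat/completions route:
curl localhost:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
        "messages": [{"role": "user", "content": "Why is the sky dark at night?"}],
        "max_tokens": 256
    }'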
Make sure you stop or terminate the EC2 instance once you are done testing it.
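You can do that from the EC2 console, or from the AWS CLI; the instance ID below is a placeholder for your own.
aws ec2 stop-instances --instance-ids <your-instance-id>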
| Note: The team is working on enabling DeepSeek R1 deployment on Trainium & Inferentia with the Hugging Face Neuron Deep Learning AMI. Stay tuned!
Fine-tune DeepSeek R1 models
Fine-tune on Amazon SageMaker AI with Hugging Face Training DLCs
| Note: The team is working on enabling fine-tuning of all DeepSeek models with the Hugging Face Training DLCs. Stay tuned!
Fine-tune on EC2 Neuron with the Hugging Face Neuron Deep Learning AMI
| Note: The team is working on enabling fine-tuning of all DeepSeek models with the Hugging Face Neuron Deep Learning AMI. Stay tuned!