Deploying a Text Generation Inference (TGI) server on a Google Cloud TPU instance
Text Generation Inference (TGI) enables serving Large Language Models (LLMs) on TPUs, with Optimum TPU providing a specialized TGI runtime that is fully optimized for TPU hardware.
TGI also offers an OpenAI-compatible API, making it easy to integrate with numerous tools.
For a list of supported models, check the Supported Models page.
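For example, once a server is running (see the deployment steps below), the OpenAI-compatible Messages API is exposed under the /v1/chat/completions route. A minimal sketch, assuming the google/gemma-2b-it deployment used later in this guide, with the server listening on the host's port 80:
curl localhost/v1/chat/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{"model": "tgi", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens": 32, "stream": false}'
The "model" field is required by the OpenAI schema, but TGI serves whatever model it was started with, so the placeholder value "tgi" is commonly used.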
Deploy TGI on a Cloud TPU Instance
This guide assumes you have a Cloud TPU instance running. If not, please refer to our deployment guide.
You have two options for deploying TGI:
- Use our pre-built TGI image (recommended)
- Build the image manually for the latest features
Option 1: Using the Pre-built Image
The optimum-tpu image is available at ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi. Check the optimum-tpu container documentation for the latest TGI image; the serving tutorial also walks you through starting the TGI container from a pre-built image. Here’s how to deploy it:
docker run -p 8080:80 \
--shm-size 16GB \
--privileged \
--net host \
-e LOG_LEVEL=text_generation_router=debug \
-v ~/hf_data:/data \
-e HF_TOKEN=<your_hf_token_here> \
ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
--model-id google/gemma-2b-it \
--max-input-length 512 \
--max-total-tokens 1024 \
--max-batch-prefill-tokens 512 \
--max-batch-total-tokens 1024
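Model download and TPU warmup can take several minutes. As a quick readiness check, you can poll TGI’s standard /health and /info routes (with --net host the server listens on the host’s port 80; adjust the address if you change the networking flags):
curl -s -o /dev/null -w '%{http_code}\n' localhost/health   # prints 200 once the server is ready
curl -s localhost/info                                      # JSON with the loaded model id and token limits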
You can also use the GCP-provided image referenced on the optimum-tpu container page.
Option 2: Manual Image Building
For the latest features (main branch of optimum-tpu) or custom modifications, build the image yourself:
- Clone the repository:
git clone https://github.com/huggingface/optimum-tpu.git
- Build the image:
make tpu-tgi
- Run the container:
HF_TOKEN=<your_hf_token_here>
MODEL_ID=google/gemma-2b-it
sudo docker run --net=host \
--privileged \
-v $(pwd)/data:/data \
-e HF_TOKEN=${HF_TOKEN} \
huggingface/optimum-tpu:latest \
--model-id ${MODEL_ID} \
--max-concurrent-requests 4 \
--max-input-length 32 \
--max-total-tokens 64 \
--max-batch-size 1
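The command above runs in the foreground. As a variant (a sketch, not part of the official docs; the container name tgi-tpu is just an illustrative choice), you can run it detached and follow the logs while the model downloads and the TPU graphs warm up:
HF_TOKEN=<your_hf_token_here>
MODEL_ID=google/gemma-2b-it
sudo docker run -d --name tgi-tpu --net=host \
--privileged \
-v $(pwd)/data:/data \
-e HF_TOKEN=${HF_TOKEN} \
huggingface/optimum-tpu:latest \
--model-id ${MODEL_ID} \
--max-concurrent-requests 4 \
--max-input-length 32 \
--max-total-tokens 64 \
--max-batch-size 1
# Follow startup logs; the server is ready once the router reports it is listening.
sudo docker logs -f tgi-tpu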
Executing requests against the service
You can query the model using either the /generate or /generate_stream routes:
curl localhost/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
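The /generate route returns the complete generation as a single JSON object once decoding finishes, along these lines (the generated text itself will vary):
{"generated_text":"\n\nDeep Learning is a subset of machine learning that uses neural networks..."}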
curl localhost/generate_stream \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
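The /generate_stream route responds with server-sent events, emitting one JSON payload per generated token; pass curl’s -N flag to disable output buffering so tokens appear as they are produced. The framing looks roughly like this (fields abbreviated, and the exact schema may vary across TGI versions):
data: {"token":{"id":109,"text":"\n\n","logprob":-0.18,"special":false},"generated_text":null,"details":null}
data: {"token":{"id":22431,"text":"Deep","logprob":-0.05,"special":false},"generated_text":null,"details":null}
...
data: {"token":{"id":1,"text":"","logprob":0.0,"special":true},"generated_text":"\n\nDeep Learning is ...","details":null}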