First TPU Inference on Google Cloud
This tutorial guides you through setting up and running inference on a TPU using Text Generation Inference (TGI); see the TGI documentation for details. The TGI server is compatible with the OpenAI Messages API and offers an optimized solution for serving models on TPU.
Prerequisites
Before starting, ensure you have:
- A running TPU instance (see TPU Setup Guide)
- SSH access to your TPU instance
- A HuggingFace account
Step 1: Initial Setup
SSH Access
First, connect to your TPU instance via SSH.
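If you created your instance with gcloud, one way to connect is the tpu-vm ssh subcommand; the instance name and zone below are placeholders for your own values:
# "my-tpu-vm" and the zone are placeholders; substitute your own instance name and zone
gcloud compute tpus tpu-vm ssh my-tpu-vm --zone=us-central1-a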
Install Required Tools
Install the HuggingFace Hub CLI:
pip install huggingface_hub
Authentication
Log in to HuggingFace:
huggingface-cli login
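When prompted, paste an access token from your HuggingFace account settings. If you need a non-interactive login (for example in a setup script), the CLI also accepts the token as an argument; here the HF_TOKEN environment variable is assumed to already hold your token:
# Assumes the token is already exported in HF_TOKEN
huggingface-cli login --token $HF_TOKEN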
Step 2: Model Deployment
Model Selection
We will use the gemma-2b-it model for this tutorial:
- Visit https://huggingface.co/google/gemma-2b-it
- Accept the model terms and conditions
- This enables model download access (you can verify access as shown below)
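To confirm that access was granted before launching the server, you can optionally try downloading a single small file from the repository; the download will fail with an authorization error if the terms have not been accepted:
huggingface-cli download google/gemma-2b-it config.json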
Launch TGI Server
We will use the Optimum-TPU image, a TPU-optimized TGI image provided by HuggingFace.
docker run -p 8080:80 \
--shm-size 16GB \
--privileged \
--net host \
-e LOG_LEVEL=text_generation_router=debug \
-v ~/hf_data:/data \
-e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
--model-id google/gemma-2b-it \
--max-input-length 512 \
--max-total-tokens 1024 \
--max-batch-prefill-tokens 512 \
--max-batch-total-tokens 1024
Understanding the Configuration
Key parameters explained:
- --shm-size 16GB --privileged --net host: Required for Docker to access the TPU
- -v ~/hf_data:/data: Volume mount for model storage
- --max-input-length: Maximum input sequence length
- --max-total-tokens: Maximum combined input and output tokens
- --max-batch-prefill-tokens: Maximum tokens for batch prefill processing
- --max-batch-total-tokens: Maximum total tokens in a batch
Step 3: Making Inference Requests
Server Readiness
Wait for the “Connected” message in the logs:
2025-01-11T10:40:00.256056Z INFO text_generation_router::server: router/src/server.rs:2393: Connected
Your TGI server is now ready to serve requests.
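If you prefer not to watch the logs, you can also poll TGI's health route from another terminal; it returns HTTP 200 once the server is ready to accept requests:
# Prints 200 once the server is up
curl -s -o /dev/null -w "%{http_code}\n" 0.0.0.0:8080/health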
Testing from the TPU VM
Query the server from another terminal on the TPU instance:
curl 0.0.0.0:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
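Since TGI is compatible with the OpenAI Messages API, the same server can also be queried through its chat completions route. The model field is required by the API schema, but the server only serves the model it was launched with:
curl 0.0.0.0:8080/v1/chat/completions \
    -X POST \
    -d '{"model":"google/gemma-2b-it","messages":[{"role":"user","content":"What is Deep Learning?"}],"max_tokens":20}' \
    -H 'Content-Type: application/json'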
Remote Access
To query from outside the TPU instance:
- Find your TPU’s external IP in Google Cloud Console
- Replace the IP in the request:
curl 34.174.11.242:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
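As an alternative to the Console, you can look up the external IP with gcloud; this is a sketch in which the instance name and zone are placeholders and the format expression assumes a single network endpoint:
# "my-tpu-vm" and the zone are placeholders for your own instance
gcloud compute tpus tpu-vm describe my-tpu-vm \
    --zone=us-central1-a \
    --format='value(networkEndpoints[0].accessConfig.externalIp)'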
(Optional) Firewall Configuration
You may need to configure GCP firewall rules to allow remote access:
- Use gcloud compute firewall-rules create to allow traffic (see the sketch below)
- Ensure port 8080 is accessible
- Consider security best practices for production
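A minimal sketch of such a rule follows; the rule name is illustrative, and in production you should restrict --source-ranges to trusted addresses rather than leaving it open:
# Rule name and source range are illustrative; tighten source-ranges for production
gcloud compute firewall-rules create allow-tgi-8080 \
    --allow=tcp:8080 \
    --direction=INGRESS \
    --source-ranges=0.0.0.0/0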
Request Parameters
Key parameters for inference requests:
- inputs: The prompt text
- max_new_tokens: Maximum number of tokens to generate
- Additional parameters are available in the TGI documentation (see the example below)
Next Steps
- Please check the TGI Consuming Guide to learn how to query your new TGI server.
- Check the rest of our documentation for advanced settings that can be used on your new TGI server.