First TPU Inference on Google Cloud
This tutorial guides you through setting up and running inference on a TPU using Text Generation Inference (TGI); see the TGI documentation for details. The TGI server is compatible with the OpenAI Messages API and offers an optimized solution for serving models on TPU.
Prerequisites
Before starting, ensure you have:
- A running TPU instance (see TPU Setup Guide)
- SSH access to your TPU instance
- A HuggingFace account
Step 1: Initial Setup
SSH Access
First, connect to your TPU instance via SSH.
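If you created your instance with gcloud, one way to connect is the tpu-vm ssh subcommand; the instance name and zone below are placeholders for your own values:
# "my-tpu-vm" and the zone are placeholders; substitute your own instance name and zone
gcloud compute tpus tpu-vm ssh my-tpu-vm --zone=us-central1-a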
Install Required Tools
Install the HuggingFace Hub CLI:
pip install huggingface_hub
Authentication
Log in to HuggingFace:
huggingface-cli login
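When prompted, paste an access token from your HuggingFace account settings. If you need a non-interactive login (for example in a setup script), the CLI also accepts the token as an argument; here the HF_TOKEN environment variable is assumed to already hold your token:
# Assumes the token is already exported in HF_TOKEN
huggingface-cli login --token $HF_TOKEN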
Step 2: Model Deployment
Model Selection
We will use the gemma-2b-it model for this tutorial:
- Visit https://huggingface.co/google/gemma-2b-it
- Accept the model terms and conditions
- This enables model download access (you can verify access as shown below)
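To confirm that access was granted before launching the server, you can optionally try downloading a single small file from the repository; the download will fail with an authorization error if the terms have not been accepted:
huggingface-cli download google/gemma-2b-it config.json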
Launch TGI Server
We will use the Optimum-TPU image, a TPU-optimized TGI image provided by HuggingFace.
docker run -p 8080:80 \
--shm-size 16GB \
--privileged \
--net host \
-e LOG_LEVEL=text_generation_router=debug \
-v ~/hf_data:/data \
-e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
--model-id google/gemma-2b-it \
--max-input-length 512 \
--max-total-tokens 1024 \
--max-batch-prefill-tokens 512 \
--max-batch-total-tokens 1024
Understanding the Configuration
Key parameters explained:
- --shm-size 16GB --privileged --net host: Required for Docker to access the TPU
- -v ~/hf_data:/data: Volume mount for model storage
- --max-input-length: Maximum input sequence length
- --max-total-tokens: Maximum combined input and output tokens
- --max-batch-prefill-tokens: Maximum tokens for batch prefill processing
- --max-batch-total-tokens: Maximum total tokens in a batch
Step 3: Making Inference Requests
Server Readiness
Wait for the “Connected” message in the logs:
2025-01-11T10:40:00.256056Z INFO text_generation_router::server: router/src/server.rs:2393: Connected
Your TGI server is now ready to serve requests.
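If you prefer not to watch the logs, you can also poll TGI's health route from another terminal; it returns HTTP 200 once the server is ready to accept requests:
# Prints 200 once the server is up
curl -s -o /dev/null -w "%{http_code}\n" 0.0.0.0:8080/health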
Testing from the TPU VM
Query the server from another terminal on the TPU instance:
curl 0.0.0.0:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
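Since TGI is compatible with the OpenAI Messages API, the same server can also be queried through its chat completions route. The model field is required by the API schema, but the server only serves the model it was launched with:
curl 0.0.0.0:8080/v1/chat/completions \
    -X POST \
    -d '{"model":"google/gemma-2b-it","messages":[{"role":"user","content":"What is Deep Learning?"}],"max_tokens":20}' \
    -H 'Content-Type: application/json'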
Remote Access
To query from outside the TPU instance:
- Find your TPU’s external IP in Google Cloud Console
- Replace the IP in the request:
curl 34.174.11.242:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
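As an alternative to the Console, you can look up the external IP with gcloud; this is a sketch in which the instance name and zone are placeholders and the format expression assumes a single network endpoint:
# "my-tpu-vm" and the zone are placeholders for your own instance
gcloud compute tpus tpu-vm describe my-tpu-vm \
    --zone=us-central1-a \
    --format='value(networkEndpoints[0].accessConfig.externalIp)'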
(Optional) Firewall Configuration
You may need to configure GCP firewall rules to allow remote access:
- Use gcloud compute firewall-rules create to allow traffic (see the sketch below)
- Ensure port 8080 is accessible
- Consider security best practices for production
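A minimal sketch of such a rule follows; the rule name is illustrative, and in production you should restrict --source-ranges to trusted addresses rather than leaving it open:
# Rule name and source range are illustrative; tighten source-ranges for production
gcloud compute firewall-rules create allow-tgi-8080 \
    --allow=tcp:8080 \
    --direction=INGRESS \
    --source-ranges=0.0.0.0/0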
Request Parameters
Key parameters for inference requests:
- inputs: The prompt text
- max_new_tokens: Maximum number of tokens to generate
- Additional parameters are available in the TGI documentation (see the example below)
Next Steps
- Please check the TGI Consuming Guide to learn how to query your new TGI server.
- Check the rest of our documentation for advanced settings that can be used on your new TGI server.