TGI Configuration Reference Guide

Required Configuration

Required Environment Variables

  • HF_TOKEN: Hugging Face authentication token, used to download models from the Hub (required for gated models such as Gemma)
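
A common way to provide the token is to export it in your shell and forward it to the container. A minimal sketch, with a placeholder token value:

# your personal access token (placeholder value)
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
# forward it to the container when launching TGI (see the full example further below)
docker run -e HF_TOKEN=$HF_TOKEN ...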

Required Command Line Arguments

Docker-specific parameters

  • --shm-size 16GB: Shared memory allocation
  • --privileged: Enable privileged container mode
  • --net host: Use host network mode

These flags are needed to run a TPU container so that the Docker container can properly access the TPU hardware.

TGI-specific parameters

  • --model-id: Model identifier to load from the HuggingFace hub

These parameters are used by TGI and Optimum TPU to configure the server behavior.
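
Putting the required pieces together, a minimal launch command could look like the sketch below; the image tag and model are the ones used in the complete example further down:

docker run --shm-size 16GB --privileged --net host \
    -e HF_TOKEN=<your_hf_token_here> \
    ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
    --model-id google/gemma-2b-it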

Optional Configuration

Optional Environment Variables

  • JETSTREAM_PT_DISABLE: Disable Jetstream PyTorch backend
  • QUANTIZATION: Enable int8 quantization
  • MAX_BATCH_SIZE: Set the batch size used for processing (batch sizes are static on TPUs)
  • LOG_LEVEL: Set logging verbosity (useful for debugging). It can be set to info, debug, or a comma-separated list of targets and levels, such as text_generation_launcher,text_generation_router=debug
  • SKIP_WARMUP: Skip model warmup phase

Note on warmup:

  • TGI performs warmup to compile TPU operations for optimal performance
  • For production use, never set SKIP_WARMUP=1; you can, however, use this parameter for debugging purposes to speed up model loading, at the cost of slower inference
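
For debugging only, you can combine SKIP_WARMUP with a higher log verbosity to iterate faster; a sketch reusing the image and model from the complete example below:

docker run --shm-size 16GB --privileged --net host \
    -e HF_TOKEN=<your_hf_token_here> \
    -e SKIP_WARMUP=1 \
    -e LOG_LEVEL=debug \
    ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
    --model-id google/gemma-2b-it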

TIP for TGI: you can pass most parameters to TGI either as Docker environment variables or as command-line arguments. For example, you can pass `--model-id google/gemma-2b-it` or `-e MODEL_ID=google/gemma-2b-it` to the `docker run` command.
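
Concretely, the two invocations below should be equivalent; only the relevant parts of the command are shown:

# model passed as a TGI command-line argument
docker run <docker options> ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi --model-id google/gemma-2b-it
# the same model passed as an environment variable instead
docker run <docker options> -e MODEL_ID=google/gemma-2b-it ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi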

Optional Command Line Arguments

  • --max-input-length: Maximum input sequence length
  • --max-total-tokens: Maximum combined input/output tokens
  • --max-batch-prefill-tokens: Maximum number of tokens handled in the prefill stage of a batch
  • --max-batch-total-tokens: Maximum total tokens in batch

You can view more options in the TGI documentation. Not all parameters are compatible with TPUs (for example, the CUDA-specific parameters).

Docker Requirements

When running TGI inside a container (recommended), the container should be started with:

  • Privileged mode for TPU access (--privileged)
  • Shared memory allocation (--shm-size 16GB recommended)
  • Host network mode (--net host)

Example Command

Here’s a complete example showing all major configuration options:

docker run -p 8080:80 \
    --shm-size 16GB \
    --privileged \
    --net host \
    -e QUANTIZATION=1 \
    -e MAX_BATCH_SIZE=2 \
    -e LOG_LEVEL=text_generation_router=debug \
    -v ~/hf_data:/data \
    -e HF_TOKEN=<your_hf_token_here> \
    ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
    --model-id google/gemma-2b-it \
    --max-input-length 512 \
    --max-total-tokens 1024 \
    --max-batch-prefill-tokens 512 \
    --max-batch-total-tokens 1024
You need to replace <your_hf_token_here> with a Hugging Face access token that you can get [here](https://huggingface.co/settings/tokens).
If you have already logged in via `huggingface-cli login`, you can instead set HF_TOKEN=$(cat ~/.cache/huggingface/token) for more convenience.
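
Once the server has started and warmup has completed, you can send a test request to TGI's generate endpoint. A minimal sketch; adjust the host and port to wherever your server is actually listening (the example above maps port 8080):

curl http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}}'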

Additional Resources