Differences between JetStream PyTorch and PyTorch XLA
This guide explains the differences between JetStream PyTorch and PyTorch XLA, the two backends available to optimum-tpu users when serving models with TGI (Text Generation Inference).
JetStream PyTorch is a high-performance inference engine built on top of PyTorch XLA. It is optimized for throughput and memory efficiency when running Large Language Models (LLMs) on TPUs.
| Feature | JetStream PyTorch | PyTorch XLA |
|---|---|---|
| Training | ❌ | ✅ |
| Serving | ✅ | ✅ |
| Performance | Higher serving performance | Standard performance |
| Flexibility | Limited to serving | Full PyTorch ecosystem |
| Use case | Production inference | Development and training |
| Integration | Optimized for deployment | Standard PyTorch workflow |
Note: By default, optimum-tpu uses PyTorch XLA for training and JetStream PyTorch for serving.
You can configure optimum-tpu to use either backend for serving with TGI. To select the PyTorch XLA backend, set the environment variable `JETSTREAM_PT_DISABLE=1`, for example by passing `-e JETSTREAM_PT_DISABLE=1` in your `docker run` arguments, as in the sketch below.
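
As a minimal sketch, the container invocation might look like the following. The image name and model ID are placeholders, not values taken from this guide; substitute the TGI TPU image and model you actually deploy:

```bash
# Serve with the PyTorch XLA backend by disabling JetStream PyTorch.
# <tgi-tpu-image> and <model-id> are placeholders; replace them with your
# actual optimum-tpu TGI image and model.
docker run -p 8080:80 \
  -e JETSTREAM_PT_DISABLE=1 \
  <tgi-tpu-image> \
  --model-id <model-id>
```

Omitting `JETSTREAM_PT_DISABLE` leaves the default JetStream PyTorch backend in place.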
You can find more information about each backend here:
- PyTorch XLA: https://pytorch.org/xla/ and https://github.com/pytorch/xla
- JetStream PyTorch: https://github.com/AI-Hypercomputer/jetstream-pytorch