TGI on Gaudi

Text Generation Inference (TGI) on the Intel® Gaudi® AI Accelerator is supported through the Intel® Gaudi® TGI repository. To start a TGI service on a Gaudi system, pull a TGI Gaudi Docker image and launch a local TGI service instance.

For example, a TGI service on Gaudi for the Llama 2 7B model can be started with:

docker run \
  -p 8080:80 \
  -v "$PWD/data":/data \
  --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice \
  --ipc=host \
  ghcr.io/huggingface/tgi-gaudi:2.0.1 \
  --model-id meta-llama/Llama-2-7b-hf \
  --max-input-tokens 1024 \
  --max-total-tokens 2048

You can then send a simple request:

curl 127.0.0.1:8080/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
  -H 'Content-Type: application/json'
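
The server may not answer immediately after the container starts, so it can help to wait for it before sending requests. The following is a minimal sketch, not part of the TGI Gaudi tooling: the helper names `build_payload`, `wait_for_tgi`, and `generate` and the `TGI_URL` variable are hypothetical, the base URL assumes the `-p 8080:80` mapping used above, and the readiness check assumes the server exposes TGI's `/health` route.

```shell
# Assumed base URL, matching the -p 8080:80 port mapping in the docker command.
TGI_URL="${TGI_URL:-http://127.0.0.1:8080}"

# Build the JSON body for /generate.
# $1 = prompt (assumed to contain no double quotes or backslashes),
# $2 = max_new_tokens (defaults to 32).
build_payload() {
  printf '{"inputs":"%s","parameters":{"max_new_tokens":%d}}' "$1" "${2:-32}"
}

# Poll the assumed /health route every 5 seconds.
# $1 = maximum number of attempts (defaults to 60).
# Returns 0 once the server answers, 1 if the attempts run out.
wait_for_tgi() {
  attempts="${1:-60}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -sf "$TGI_URL/health" > /dev/null; then
      return 0
    fi
    sleep 5
    i=$((i + 1))
  done
  return 1
}

# Send one request to /generate with the same headers as the curl example.
generate() {
  curl -s "$TGI_URL/generate" \
    -X POST \
    -H 'Content-Type: application/json' \
    -d "$(build_payload "$@")"
}
```

After sourcing these helpers, `wait_for_tgi && generate "What is Deep Learning?" 32` sends the same request as the curl command above once the service responds. For prompts with quotes or other special characters, building the body with a JSON-aware tool is safer than `printf`.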

To run a static benchmark test, refer to TGI’s benchmark tool. More examples of running service instances on single-HPU or multi-HPU systems are available here.
