Readme, add infinity deployment documentation
#21 by michaelfeil - opened
This PR adds a short example of how to deploy the model via https://github.com/michaelfeil/infinity.
- Added a parallel deployment to simplify short context length embedding. https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct/discussions/20
docker run --gpus "0" -p "7997":"7997" michaelf34/infinity:0.0.68-trt-onnx v2 --model-id Alibaba-NLP/gte-Qwen2-1.5B-instruct --revision "refs/pr/20" --dtype float16 --batch-size 16 --device cuda --engine torch --port 7997
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO 2024-11-12 05:34:57,116 infinity_emb INFO: Creating 1engines: engines=['Alibaba-NLP/gte-Qwen2-1.5B-instruct'] infinity_server.py:89
INFO 2024-11-12 05:34:57,120 infinity_emb INFO: Anonymized telemetry can be disabled via environment variable `DO_NOT_TRACK=1`. telemetry.py:30
INFO 2024-11-12 05:34:57,127 infinity_emb INFO: model=`Alibaba-NLP/gte-Qwen2-1.5B-instruct` selected, using engine=`torch` and device=`cuda` select_model.py:64
INFO 2024-11-12 05:34:57,322 sentence_transformers.SentenceTransformer INFO: Load pretrained SentenceTransformer: Alibaba-NLP/gte-Qwen2-1.5B-instruct SentenceTransformer.py:216
INFO 2024-11-12 05:38:59,420 sentence_transformers.SentenceTransformer INFO: 1 prompts are loaded, with the keys: ['query'] SentenceTransformer.py:355
INFO 2024-11-12 05:38:59,790 infinity_emb INFO: Adding optimizations via Huggingface optimum. acceleration.py:56
The class `optimum.bettertransformers.transformation.BetterTransformer` is deprecated and will be removed in a future release.
WARNING 2024-11-12 05:38:59,792 infinity_emb WARNING: BetterTransformer is not available for model: <class 'transformers_modules.Alibaba-NLP.gte-Qwen2-1.5B-instruct.2e8a2b8d43dcd68042d6f2bf7670086f90055a67.modeling_qwen.Qwen2Model'> Continue without bettertransformer modeling code. acceleration.py:67
INFO 2024-11-12 05:39:00,890 infinity_emb INFO: Getting timings for batch_size=16 and avg tokens per sentence=2 select_model.py:97
2.11 ms tokenization
18.11 ms inference
0.09 ms post-processing
20.30 ms total
embeddings/sec: 788.19
INFO 2024-11-12 05:39:01,364 infinity_emb INFO: Getting timings for batch_size=16 and avg tokens per sentence=513 select_model.py:103
9.03 ms tokenization
215.76 ms inference
0.24 ms post-processing
225.03 ms total
embeddings/sec: 71.10
INFO 2024-11-12 05:39:01,367 infinity_emb INFO: model warmed up, between 71.10-788.19 embeddings/sec at batch_size=16 select_model.py:104
INFO 2024-11-12 05:39:01,368 infinity_emb INFO: creating batching engine batch_handler.py:386
INFO 2024-11-12 05:39:01,370 infinity_emb INFO: ready to batch requests. batch_handler.py:453
INFO 2024-11-12 05:39:01,373 infinity_emb INFO: infinity_server.py:104
♾️ Infinity - Embedding Inference Server
MIT License; Copyright (c) 2023-now Michael Feil
Version 0.0.68
Open the Docs via Swagger UI:
http://0.0.0.0:7997/docs
Access all deployed models via 'GET':
curl http://0.0.0.0:7997/models
Visit the docs for more information:
https://michaelfeil.github.io/infinity
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7997 (Press CTRL+C to quit)
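Two parts of the log above can be sketched in a few lines of Python: the warm-up throughput figures and a first call against the running server. This is a minimal sketch, not part of the PR. It assumes the container is reachable at `http://localhost:7997`, that embeddings are served from an OpenAI-style `POST /embeddings` route taking `model` and `input` fields, and that the reported embeddings/sec is simply batch size divided by total batch time. The helper names (`embeddings_per_second`, `build_models_request`, `build_embeddings_request`, `send`) are hypothetical.

```python
import json
import urllib.request

# Assumption: the container started above is reachable on localhost.
BASE_URL = "http://localhost:7997"
MODEL_ID = "Alibaba-NLP/gte-Qwen2-1.5B-instruct"


def embeddings_per_second(batch_size: int, total_ms: float) -> float:
    """Throughput implied by one warm-up batch: sentences divided by seconds."""
    return batch_size / (total_ms / 1000.0)


def build_models_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare GET /models, the endpoint named in the startup banner."""
    return urllib.request.Request(f"{base_url}/models", method="GET")


def build_embeddings_request(texts, model: str = MODEL_ID,
                             base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare POST /embeddings with an OpenAI-style JSON body (assumed schema)."""
    body = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0):
    """Send a prepared request and decode the JSON response (needs a live server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Warm-up totals copied from the log: 20.30 ms and 225.03 ms at batch_size=16.
short_rate = embeddings_per_second(16, 20.30)
long_rate = embeddings_per_second(16, 225.03)
print(f"{short_rate:.1f} and {long_rate:.1f} embeddings/sec")
```

The computed rates land within rounding of the logged 788.19 and 71.10 embeddings/sec; the small gap on the first figure is consistent with the log rounding the total time to two decimals. With the server running, `send(build_models_request())` should list the deployed model, and `send(build_embeddings_request(["hello world"]))` should return the embedding payload. The exact response layout is best checked in the Swagger UI at `/docs`.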
michaelfeil changed pull request title from "Update README.md" to "Readme, add infinity deployment documentation"
thenlper changed pull request status to merged