<!--
Copyright (c) 2022-2023, NVIDIA CORPORATION. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->


# Triton clients

Before working through this page, install PyTriton and start the `Linear` model described in quick_start so that the clients have a server to connect to.

This page shows how to send requests to the Triton Inference Server using the PyTriton library.

## ModelClient

ModelClient is a simple client that performs inference requests synchronously. It communicates with the deployed model over HTTP or gRPC; you choose the protocol when you create the ModelClient object.

For example, you can use ModelClient to send requests to a PyTorch model that performs linear regression:

<!-- This readme is for testing code snippets with pytest. It has codeblocks marked with pytest-codeblocks:cont to combine them into one test. -->

<!-- First test -->
<!--
```python

import torch

model = torch.nn.Linear(2, 3).eval()

import numpy as np
from pytriton.decorators import batch


@batch
def infer_fn(**inputs: np.ndarray):
    (input1_batch,) = inputs.values()
    input1_batch_tensor = torch.from_numpy(input1_batch)
    output1_batch_tensor = model(input1_batch_tensor)  # Calling the Python model inference
    output1_batch = output1_batch_tensor.detach().numpy()
    return [output1_batch]


from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

# Connecting inference callable with Triton Inference Server
triton = Triton()
# Load model into Triton Inference Server
triton.bind(
    model_name="Linear",
    infer_func=infer_fn,
    inputs=[
        Tensor(dtype=np.float32, shape=(-1,)),
    ],
    outputs=[
        Tensor(dtype=np.float32, shape=(-1,)),
    ],
    config=ModelConfig(max_batch_size=128)
)


triton.run()
```
-->


<!--pytest-codeblocks:cont-->

```python
import torch
from pytriton.client import ModelClient

# Create some input data as a numpy array
input1_data = torch.randn(128, 2).cpu().detach().numpy()

# Create a ModelClient object with the server address and model name
client = ModelClient("localhost:8000", "Linear")
# Call the infer_batch method with the input data
result_dict = client.infer_batch(input1_data)
# Close the client to release the resources
client.close()

# Print the result dictionary
print(result_dict)
```

<!--pytest-codeblocks:cont-->
<!--
```python
# Stop the Triton server to free up resources
triton.stop()
# End of the first test

assert result_dict["OUTPUT_1"].shape == (128, 3)
```
-->


You can also use ModelClient to send requests to a model that performs image classification. The example assumes that a model takes in an image and returns the top 5 predicted classes. This model is not included in the PyTriton library.

You need to convert the image to a numpy array and resize it to the expected input shape. You can use the Pillow package to do this.

<!--pytest.mark.skip-->
```python
import numpy as np
from PIL import Image
from pytriton.client import ModelClient

# Create some input data as a numpy array of an image
img = Image.open("cat.jpg")
img = img.resize((224, 224))
input_data = np.array(img)

# Create a ModelClient object with the server address and model name
client = ModelClient("localhost:8000", "ImageNet")
# Call the infer_sample method with the input data
result_dict = client.infer_sample(input_data)
# Close the client to release the resources
client.close()

# Print the result dictionary
print(result_dict)
```

You need to install the Pillow package to run the above example:
```bash
pip install Pillow
```

## FuturesModelClient

FuturesModelClient is a concurrent.futures based client that performs inference requests in parallel. It communicates with the deployed model over HTTP or gRPC; you choose the protocol when you create the FuturesModelClient object.

For example, you can use FuturesModelClient to send multiple requests to a text generation model that takes in text prompts and returns generated texts. The TextGen model is not included in the PyTriton library. The example assumes that the model returns a single output tensor with the generated text. The example also assumes that the model takes in a list of text prompts and returns a list of generated texts.

You need to convert the text prompts to numpy arrays of token IDs using a tokenizer from transformers. You also need to detokenize the output using the same tokenizer:

<!--pytest.mark.skip-->
```python
import numpy as np
from pytriton.client import FuturesModelClient
from transformers import AutoTokenizer

# Create some input data as a list of text prompts
input_data_list_text = ["Write a haiku about winter.", "Summarize the article below in one sentence.", "Generate a catchy slogan for PyTriton."]

# Create a tokenizer from transformers
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Convert the text prompts to numpy arrays of token IDs using the tokenizer
input_data_list = [np.array(tokenizer.encode(prompt)) for prompt in input_data_list_text]

# Create a FuturesModelClient object with the server address and model name
with FuturesModelClient("localhost:8000", "TextGen") as client:
    # Call the infer_sample method for each input data in the list and store the returned futures
    output_data_futures = [client.infer_sample(input_data) for input_data in input_data_list]
    # Wait for all the futures to complete and get the results
    output_data_list = [output_data_future.result() for output_data_future in output_data_futures]

# Print tokens
print(output_data_list)

# Detokenize the output texts using the tokenizer and print them
output_texts = [tokenizer.decode(output_data["OUTPUT_1"]) for output_data in output_data_list]
for output_text in output_texts:
    print(output_text)
```

You need to install the transformers package to run the above example:
```bash
pip install transformers
```

You can also use FuturesModelClient to send multiple requests to an image classification model that takes in image data and returns class labels or probabilities. The ImageNet model is described above.

In this case, you can use the infer_batch method to send a batch of images as input and get a batch of outputs. You need to stack the images along the first dimension to form a batch. You can also print the class names corresponding to the output labels:

<!--pytest.mark.skip-->
```python
import numpy as np
from PIL import Image
from pytriton.client import FuturesModelClient

# Create some input data as a list of lists of image arrays
input_data_list = []
for batch in [
    ["cat.jpg", "dog.jpg", "bird.jpg"],
    ["car.jpg", "bike.jpg", "bus.jpg"],
    ["apple.jpg", "banana.jpg", "orange.jpg"],
]:
    batch_data = []
    for filename in batch:
        img = Image.open(filename)
        img = img.resize((224, 224))
        img = np.array(img)
        batch_data.append(img)
    # Stack the images along the first dimension to form a batch
    batch_data = np.stack(batch_data, axis=0)
    input_data_list.append(batch_data)

# Create a list of class names for ImageNet
class_names = ["tench", "goldfish", "great white shark", ...]

# Create a FuturesModelClient object with the server address and model name
with FuturesModelClient("localhost:8000", "ImageNet") as client:
    # Call the infer_batch method for each input data in the list and store the returned futures
    output_data_futures = [client.infer_batch(input_data) for input_data in input_data_list]
    # Wait for all the futures to complete and get the results
    output_data_list = [output_data_future.result() for output_data_future in output_data_futures]

# Print the list of result dictionaries
print(output_data_list)

# Print the class names corresponding to the output labels for each batch
for output_data in output_data_list:
    output_labels = output_data["OUTPUT_1"]
    for output_label in output_labels:
        class_name = class_names[output_label]
        print(f"The image is classified as {class_name}.")
```
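
The submit-then-collect pattern used in both FuturesModelClient examples can be tried without a running server. The sketch below is only an illustration built on the standard library's `concurrent.futures`; `fake_infer_sample` is a hypothetical stand-in for a real inference call and is not part of PyTriton.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_infer_sample(prompt: str) -> dict:
    """Hypothetical stand-in for an inference call; not part of PyTriton."""
    time.sleep(0.01)  # simulate network and model latency
    return {"OUTPUT_1": prompt.upper()}


prompts = ["first prompt", "second prompt", "third prompt"]

with ThreadPoolExecutor(max_workers=3) as executor:
    # submit() returns immediately with a Future for each request
    futures = [executor.submit(fake_infer_sample, prompt) for prompt in prompts]
    # result() blocks until the corresponding request has finished
    results = [future.result() for future in futures]

# Results follow submission order because the list of futures is ordered
print([result["OUTPUT_1"] for result in results])
```

Collecting results by iterating over the list of futures keeps the outputs aligned with the inputs, even when the requests finish in a different order.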

## AsyncioModelClient

AsyncioModelClient is an asynchronous client that performs inference requests using the asyncio library. It communicates with the deployed model over HTTP or gRPC; you choose the protocol when you create the AsyncioModelClient object.

For example, you can use AsyncioModelClient to send requests to a PyTorch model that performs linear regression:

<!--pytest.mark.skip-->
```python
import asyncio

import torch
from pytriton.client import AsyncioModelClient


async def main():
    # Create some input data as a numpy array
    input1_data = torch.randn(2).cpu().detach().numpy()

    # Create an AsyncioModelClient object with the server address and model name
    client = AsyncioModelClient("localhost:8000", "Linear")
    # Call the infer_sample method with the input data
    result_dict = await client.infer_sample(input1_data)
    # Close the client to release the resources
    await client.close()

    # Print the result dictionary
    print(result_dict)


# await is only valid inside a coroutine, so run the example in an event loop
asyncio.run(main())
```
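
Because `infer_sample` is awaitable, several requests can be in flight at once. The sketch below shows the general `asyncio.gather` pattern with a hypothetical stand-in coroutine, `fake_infer_sample`, so it runs without a server. With a real client you would presumably await `client.infer_sample(...)` in its place.

```python
import asyncio


async def fake_infer_sample(value: float) -> dict:
    """Hypothetical stand-in for an awaitable inference call; not part of PyTriton."""
    await asyncio.sleep(0.01)  # simulate waiting for the server
    return {"OUTPUT_1": value * 2}


async def main() -> list:
    inputs = [1.0, 2.0, 3.0]
    # gather() runs the coroutines concurrently and preserves input order
    return await asyncio.gather(*(fake_infer_sample(v) for v in inputs))


results = asyncio.run(main())
print(results)
```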

You can also use FastAPI to create a web application that exposes the results of inference at an HTTP endpoint. FastAPI is a modern, fast, web framework for building APIs with Python 3.6+ based on standard Python type hints.

To use FastAPI, you need to install it with:

```bash
pip install fastapi
```

For production, you also need an ASGI server such as Uvicorn or Hypercorn.

To install Uvicorn, run:

```bash
pip install "uvicorn[standard]"
```

By default, `uvicorn` serves on port `8000`, which is also the default HTTP port of the Triton server. You can change the uvicorn port with the `--port` option. PyTriton also supports custom port configuration for the Triton server: the `TritonConfig` class contains the port parameters, and you can pass it to `Triton` during initialization:

<!--pytest.mark.skip-->
```python
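from pytriton.triton import Triton, TritonConfig

# Serve the Triton HTTP endpoint on port 8015 instead of the default 8000
config = TritonConfig(http_port=8015)
triton_server = Triton(config=config)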
```

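You can use this `triton_server` object to bind your inference model and serve the Triton Inference Server HTTP endpoint on port `8015`. The FastAPI example that follows takes the other route: Triton stays on its default port `8000` and uvicorn is started on port `8015`.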
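
Since both servers default to the same port, it can help to check which ports are already taken before starting either of them. The helper below is a standard-library sketch, not a PyTriton feature; `is_port_in_use` is a hypothetical name chosen for this illustration.

```python
import socket


def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds
        return sock.connect_ex((host, port)) == 0


for port in (8000, 8015):
    state = "in use" if is_port_in_use(port) else "free"
    print(f"Port {port} is {state}")
```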


Then you can create a FastAPI app that uses the AsyncioModelClient to perform inference and return the results as JSON:

<!--pytest.mark.skip-->
```python
from fastapi import FastAPI
import torch
from pytriton.client import AsyncioModelClient

# Create an AsyncioModelClient object with the server address and model name
config_client = AsyncioModelClient("localhost:8000", "Linear")

app = FastAPI()

@app.get("/predict")
async def predict():
    # Create some input data as a numpy array
    input1_data = torch.randn(2).cpu().detach().numpy()

    # Create an AsyncioModelClient object from existing client to avoid pulling config from server
    async with AsyncioModelClient.from_existing_client(config_client) as request_client:
        # Call the infer_sample method with the input data
        result_dict = await request_client.infer_sample(input1_data)

    # Return the result dictionary as JSON
    return result_dict

@app.on_event("shutdown")
async def shutdown():
    # Close the client to release the resources
    await config_client.close()
```

Save this file as `main.py`.

To run the app, use the command:

<!--pytest.mark.skip-->
```bash
uvicorn main:app --reload --port 8015
```

You can then access the endpoint at `http://127.0.0.1:8015/predict` and see the JSON response.

You can also check the interactive API documentation at `http://127.0.0.1:8015/docs`.

You can test your server using curl:

<!--pytest.mark.skip-->
```bash
curl -X 'GET' \
  'http://127.0.0.1:8015/predict' \
  -H 'accept: application/json'
```

The command prints three numbers, which differ on every call because the input is random:
<!--pytest.mark.skip-->
```python
[-0.2608422636985779,-0.6435106992721558,-0.3492531180381775]
```

For more information about FastAPI and Uvicorn, check out these links:

- [FastAPI documentation](https://fastapi.tiangolo.com/)
- [Uvicorn documentation](https://www.uvicorn.org/)


## Client timeouts

When creating a [ModelClient][pytriton.client.client.ModelClient] or [FuturesModelClient][pytriton.client.client.FuturesModelClient] object, you can specify the timeout for waiting until the server and model are ready using the `init_timeout_s` parameter. By default, the timeout is set to 5 minutes (300 seconds).

Example usage:

<!--pytest.mark.skip-->
```python
import numpy as np
from pytriton.client import ModelClient, FuturesModelClient

input1_data = np.random.randn(128, 2)
with ModelClient("localhost", "MyModel", init_timeout_s=120) as client:
    # Raises PyTritonClientTimeoutError if the server or model is not ready within the specified timeout
    result_dict = client.infer_batch(input1_data)


with FuturesModelClient("localhost", "MyModel", init_timeout_s=120) as client:
    future = client.infer_batch(input1_data)
    ...
    # It will raise `PyTritonClientTimeoutError` if the server is not ready and the model is not loaded within 120 seconds
    # from the time `infer_batch` was called by a thread from `ThreadPoolExecutor`
    result_dict = future.result()
```
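
Conceptually, `init_timeout_s` bounds a wait-until-ready loop. The sketch below illustrates that general pattern with the standard library only. It is not PyTriton's implementation, and `wait_until_ready` and `is_ready` are hypothetical names.

```python
import time
from typing import Callable


def wait_until_ready(is_ready: Callable[[], bool], timeout_s: float, poll_interval_s: float = 0.01) -> None:
    """Poll is_ready() until it returns True or timeout_s seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready within {timeout_s} s")
        time.sleep(poll_interval_s)


# Hypothetical readiness probe that succeeds on the third call
calls = {"count": 0}


def ready_on_third_call() -> bool:
    calls["count"] += 1
    return calls["count"] >= 3


wait_until_ready(ready_on_third_call, timeout_s=1.0)
print(f"ready after {calls['count']} checks")
```

Using `time.monotonic()` for the deadline keeps the wait unaffected by system clock adjustments.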

You can disable the default behavior of waiting for the server and model to be ready during the first inference request by setting `lazy_init` to `False`:

<!--pytest.mark.skip-->
```python
import numpy as np
from pytriton.client import ModelClient, FuturesModelClient

input1_data = np.random.randn(128, 2)

# Raises PyTritonClientTimeoutError if the server is not ready and the model is not loaded
# within 120 seconds during initialization of the client
with ModelClient("localhost", "MyModel", init_timeout_s=120, lazy_init=False) as client:
    result_dict = client.infer_batch(input1_data)
```

You can specify the timeout for the client to wait for the inference response from the server.
The default timeout is 60 seconds. You can specify the timeout when creating the [ModelClient][pytriton.client.client.ModelClient] or [FuturesModelClient][pytriton.client.client.FuturesModelClient] object:

<!--pytest.mark.skip-->
```python
import numpy as np
from pytriton.client import ModelClient, FuturesModelClient

input1_data = np.random.randn(128, 2)
with ModelClient("localhost", "MyModel", inference_timeout_s=240) as client:
    # Raises `PyTritonClientTimeoutError` if the server does not respond to inference request within 240 seconds
    result_dict = client.infer_batch(input1_data)


with FuturesModelClient("localhost", "MyModel", inference_timeout_s=240) as client:
    future = client.infer_batch(input1_data)
    ...
    # Raises `PyTritonClientTimeoutError` if the server does not respond within 240 seconds
    # from the time `infer_batch` was called by a thread from `ThreadPoolExecutor`
    result_dict = future.result()
```

!!! warning "gRPC client timeout not fully supported"

    There are some missing features in the gRPC client that prevent it from working correctly with timeouts
    used during the wait for the server and model to be ready. This may cause the client to hang if the server
    doesn't respond with the current server or model state.

!!! info "Server side timeout not implemented"

    Currently, there is no support for server-side timeout. The server will continue to process the request even if the client timeout is reached.
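
The practical consequence is that a client-side timeout only stops the waiting, not the work. The standard-library sketch below mimics this with a worker thread. `slow_inference` is a hypothetical stand-in for server-side processing, and no PyTriton API is involved.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError

completed = []


def slow_inference(request_id: int) -> int:
    """Hypothetical stand-in for server-side work that outlasts the client timeout."""
    time.sleep(0.3)
    completed.append(request_id)
    return request_id


with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(slow_inference, 1)
    try:
        # The "client" gives up waiting after 0.05 s
        future.result(timeout=0.05)
        timed_out = False
    except FutureTimeoutError:
        timed_out = True
    # The work keeps running; leaving the with block waits for it to finish

print(f"client timed out: {timed_out}, work completed: {completed}")
```

If abandoned requests are expensive, it may be worth sizing `inference_timeout_s` generously rather than retrying quickly, since each retry adds work on top of requests the server is still processing.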