<!--
Copyright (c) 2022-2023, NVIDIA CORPORATION. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Quick Start

The prerequisite for this page is to have PyTriton installed. The installation steps are described on the
[installation](installation.md) page.

The Quick Start shows how to run a Python model in the Triton Inference Server without needing to change the current working
environment. In this example, we use a simple `Linear` PyTorch model.

The integration of the model requires providing the following elements:

- The model - a framework model or a plain Python model or function that handles inference requests
- Inference Callable - a function or a class with a `__call__` method that handles the input data coming from Triton and returns the results
- The connection with the Triton Inference Server - a binding that allows Triton to communicate with the Inference Callable

This example requires PyTorch to be installed in your environment. You can install it by running:

<!--pytest.mark.skip-->

```shell
pip install torch
```

In the first step, define the `Linear` model:

```python
import torch

# A simple Linear layer placed on the GPU and switched to evaluation mode
model = torch.nn.Linear(2, 3).to("cuda").eval()
```

In the second step, create an inference callable as a function. The function receives the HTTP/gRPC request data as arguments in the form of NumPy arrays and is expected to return NumPy arrays as well. Decorating the function with the `@batch` decorator from PyTriton converts the incoming requests into a format that can be passed directly to the model. You can read more about [decorators here](decorators.md).

Example implementation:

<!--pytest-codeblocks:cont-->

```python
import numpy as np
import torch

from pytriton.decorators import batch


@batch
def infer_fn(**inputs: np.ndarray):
    (input1_batch,) = inputs.values()
    input1_batch_tensor = torch.from_numpy(input1_batch).to("cuda")
    output1_batch_tensor = model(input1_batch_tensor)  # Calling the Python model inference
    output1_batch = output1_batch_tensor.cpu().detach().numpy()
    return [output1_batch]
```

In the third step, create the binding between the inference callable and the Triton Inference Server using the `bind` method from PyTriton. This method takes the model name, the inference callable, the input and output tensor specifications, and an optional model configuration object.

<!--pytest.mark.skip-->
```python
import numpy as np

from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

# Connecting inference callable with Triton Inference Server
with Triton() as triton:
    triton.bind(
        model_name="Linear",
        infer_func=infer_fn,
        inputs=[
            Tensor(dtype=np.float32, shape=(-1,)),
        ],
        outputs=[
            Tensor(dtype=np.float32, shape=(-1,)),
        ],
        config=ModelConfig(max_batch_size=128)
    )
    ...
```

Finally, serve the model with the Triton Inference Server:

<!--pytest.mark.skip-->

```python
from pytriton.triton import Triton

with Triton() as triton:
    ...  # Load models here
    triton.serve()
```

The `bind` method creates a connection between the Triton Inference Server and the `infer_fn`, which handles
the inference queries. The `inputs` and `outputs` arguments describe the model inputs and outputs that are exposed in
Triton. The `config` argument allows you to set additional deployment parameters, such as the maximum batch size.
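
If you need more control over the deployment, `ModelConfig` accepts additional options. The sketch below assumes that `DynamicBatcher` is available from `pytriton.model_config` and that `batching` and `batcher` are valid `ModelConfig` fields; check the API reference for your PyTriton version if they differ:

<!--pytest.mark.skip-->

```python
from pytriton.model_config import DynamicBatcher, ModelConfig  # DynamicBatcher import is an assumption

# Assumed fields: enable server-side batching, cap the batch size at 128, and
# let Triton wait up to 100 microseconds to assemble larger batches.
config = ModelConfig(
    batching=True,
    max_batch_size=128,
    batcher=DynamicBatcher(max_queue_delay_microseconds=100),
)
```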

The `serve` method is blocking; from this point on, the application waits for incoming HTTP/gRPC requests and the
model is available under the name `Linear` in the Triton server. Inference queries can be sent to
`localhost:8000/v2/models/Linear/infer` and are passed to the `infer_fn` function.
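
For convenience, the snippets above can be assembled into a single script that defines the model, binds it, and serves it:

<!--pytest.mark.skip-->

```python
import numpy as np
import torch

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

# A simple Linear layer placed on the GPU and switched to evaluation mode
model = torch.nn.Linear(2, 3).to("cuda").eval()


@batch
def infer_fn(**inputs: np.ndarray):
    (input1_batch,) = inputs.values()
    input1_batch_tensor = torch.from_numpy(input1_batch).to("cuda")
    output1_batch_tensor = model(input1_batch_tensor)  # Calling the Python model inference
    output1_batch = output1_batch_tensor.cpu().detach().numpy()
    return [output1_batch]


with Triton() as triton:
    triton.bind(
        model_name="Linear",
        infer_func=infer_fn,
        inputs=[Tensor(dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=128),
    )
    triton.serve()  # Blocks and serves requests until interrupted
```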

If you would like to use Triton in background mode, use the `run` method instead, as sketched below. More about that can be found
on the [Deploying Models](initialization.md) page.
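
A minimal sketch of background mode, assuming that leaving the `with` block (or calling `stop`) shuts the server down:

<!--pytest.mark.skip-->

```python
from pytriton.triton import Triton

with Triton() as triton:
    triton.bind(...)  # Bind the model as shown above (placeholder arguments)
    triton.run()  # Start Triton in the background; this call returns immediately

    # The application can do other work here while Triton serves requests.
    ...
```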

Once the `serve` or `run` method is called on the `Triton` object, you can verify that the server is live using:

<!--pytest.mark.skip-->

```shell
curl -v localhost:8000/v2/health/live
```

The model is loaded right after the server starts, and its readiness can be queried using:

<!--pytest.mark.skip-->

```shell
curl -v localhost:8000/v2/models/Linear/ready
```

Finally, you can send an inference query to the model:
<!--pytest.mark.skip-->

```shell
curl -X POST \
  -H "Content-Type: application/json"  \
  -d @input.json \
  localhost:8000/v2/models/Linear/infer
```

The `input.json` file with a sample query:

```json
{
  "id": "0",
  "inputs": [
    {
      "name": "INPUT_1",
      "shape": [1, 2],
      "datatype": "FP32",
      "parameters": {},
      "data": [[-0.04281254857778549, 0.6738349795341492]]
    }
  ]
}
```

Read more about the HTTP/gRPC interface in the Triton Inference Server
[documentation](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/inference_protocols.md#httprest-and-grpc-protocols).

You can also validate the deployed model using a simple Python client that performs inference requests:

<!--pytest.mark.skip-->

```python
import torch
from pytriton.client import ModelClient

# Generate a random batch of 128 two-element inputs as a NumPy array
input1_data = torch.randn(128, 2).cpu().detach().numpy()

# Connect to the Triton server and send the whole batch for inference
with ModelClient("localhost:8000", "Linear") as client:
    result_dict = client.infer_batch(input1_data)

print(result_dict)
```
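
The `infer_batch` call returns a dictionary that maps output names to NumPy arrays. As a quick sanity check (a minimal sketch; the output name is whatever was assigned when the outputs were declared), you can inspect the returned arrays:

<!--pytest.mark.skip-->

```python
# result_dict maps output names to NumPy arrays; with the Linear(2, 3) model
# defined above, each returned array has shape (128, 3).
for name, value in result_dict.items():
    print(name, value.shape, value.dtype)
```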

The full example code can be found in [examples/linear_random_pytorch](../examples/linear_random_pytorch).

More information about running the server and models can be found on the [Deploying Models](initialization.md) page.