# dinov2_vitl14_trt_a4000_fp16

DINOv2 ViT-L/14 packaged as a TensorRT FP16 engine (built for an NVIDIA RTX A4000, per the repository name) and served with Triton Inference Server.
## Triton

```
make triton
```
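The Makefile targets themselves are not reproduced in this card. As a rough sketch only, an equivalent standalone launch of Triton 23.04 against the repository layout shown below might look like this (the port mapping is an assumption chosen so gRPC lands on 6001, which the perf_analyzer call later in this card uses):

```
docker run --gpus all --rm \
  -p 6000:8000 -p 6001:8001 -p 6002:8002 \
  -v $PWD/model_repository:/models \
  nvcr.io/nvidia/tritonserver:23.04-py3 \
  tritonserver --model-repository=/models
```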
## Build TensorRT Model

```
make model
make trt
```
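`make model` and `make trt` are likewise not shown here. Assuming the first step exports DINOv2 ViT-L/14 to ONNX and the second builds the FP16 TensorRT engine at the 3x560x560 input shape used by perf_analyzer below, a sketch with illustrative file names and an assumed dynamic batch range could look like:

```
# Assumed export step; the script name is illustrative, not part of this repo.
python export_onnx.py --output dinov2_vitl14.onnx

# Build an FP16 engine with a dynamic batch dimension (range is an assumption).
trtexec --onnx=dinov2_vitl14.onnx \
        --saveEngine=model_repository/dinov2_vitl14/1/model.plan \
        --fp16 \
        --minShapes=input:1x3x560x560 \
        --optShapes=input:8x3x560x560 \
        --maxShapes=input:16x3x560x560
```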
```
tree model_repository
model_repository/
└── dinov2_vitl14
    ├── 1
    │   └── model.plan
    └── config.pbtxt
```
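The `config.pbtxt` itself is not reproduced in the card. A minimal sketch of what a matching configuration could look like: the input name and 3x560x560 shape come from the perf_analyzer command below, while the output name, embedding size, batch limit, and dynamic batching are assumptions (dynamic batching is at least consistent with the execution count being far lower than the inference count in the perf run).

```
name: "dinov2_vitl14"
platform: "tensorrt_plan"
max_batch_size: 16              # assumption
input [
  {
    name: "input"               # matches --shape input:3,560,560 below
    data_type: TYPE_FP32
    dims: [ 3, 560, 560 ]
  }
]
output [
  {
    name: "output"              # assumed output tensor name
    data_type: TYPE_FP32
    dims: [ 1024 ]              # ViT-L/14 embedding size (assumed)
  }
]
dynamic_batching { }
instance_group [ { count: 1, kind: KIND_GPU } ]
```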
## Perf

```
make perf
```
```
docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:23.04-py3-sdk perf_analyzer -m dinov2_vitl14 --percentile=95 -i grpc -u 0.0.0.0:6001 --concurrency-range 16:16 --shape input:3,560,560

=================================
== Triton Inference Server SDK ==
=================================

NVIDIA Release 23.04 (build 58408269)

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.1 driver version 530.30.02 with kernel driver version 525.125.06.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 16 concurrent requests
  Using synchronous calls for inference
  Stabilizing using p95 latency

Request concurrency: 16
  Client:
    Request count: 4009
    Throughput: 222.66 infer/sec
    p50 latency: 70762 usec
    p90 latency: 83940 usec
    p95 latency: 90235 usec
    p99 latency: 102226 usec
    Avg gRPC time: 71655 usec ((un)marshal request/response 741 usec + response wait 70914 usec)
  Server:
    Inference count: 4009
    Execution count: 728
    Successful request count: 4009
    Avg request latency: 66080 usec (overhead 8949 usec + queue 16114 usec + compute input 1163 usec + compute infer 24751 usec + compute output 15103 usec)

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 16, throughput: 222.66 infer/sec, latency 90235 usec
```
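Beyond perf_analyzer, the deployed engine can be queried with the Triton Python client. A minimal sketch, assuming the gRPC endpoint on port 6001 and an output tensor named `"output"` (the output name is not stated in this card):

```python
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="0.0.0.0:6001")

# Dummy batch of one 3x560x560 image; real use would substitute a preprocessed image.
image = np.random.rand(1, 3, 560, 560).astype(np.float32)

inputs = [grpcclient.InferInput("input", list(image.shape), "FP32")]
inputs[0].set_data_from_numpy(image)
outputs = [grpcclient.InferRequestedOutput("output")]  # assumed output tensor name

result = client.infer(model_name="dinov2_vitl14", inputs=inputs, outputs=outputs)
embedding = result.as_numpy("output")
print(embedding.shape)
```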