Runtime Setup
We need torch as a dependency for NVIDIA GPU deployment; it provides the CUDA synchronization and dlpack utilities used below.
!pip install torch
import os
from pylre import LatentRuntimeEngine as LRE
import numpy as np
import torch
import time
optimized_output_dir = "optimized_outputs"
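Before proceeding, it is worth confirming that torch can actually see a CUDA device. A quick check:

# Sanity check that a CUDA-capable GPU is visible to torch.
print(torch.cuda.is_available())      # expect: True
print(torch.cuda.get_device_name(0))  # e.g. the name of the installed NVIDIA GPU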
We will instantiate several LREs for the different optimizations we carried out.
lre_cuda_fp32 = LRE(f"{optimized_output_dir}/notebook_cuda_fp32/modelLibrary.so")
lre_trt_fp32 = LRE(f"{optimized_output_dir}/notebook_trt_fp32/modelLibrary.so")
lre_cuda_int8 = LRE(f"{optimized_output_dir}/notebook_cuda_int8/modelLibrary.so")
lre_trt_int8 = LRE(f"{optimized_output_dir}/notebook_trt_int8/modelLibrary.so")
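Optionally, we can sanity-check that each variant loaded correctly by printing the input metadata each engine exposes (a small sketch using the input_shapes and input_dtypes attributes that also appear later in this notebook):

# Print the expected input shapes and dtypes for each optimized variant.
variants = {
    "cuda_fp32": lre_cuda_fp32,
    "trt_fp32": lre_trt_fp32,
    "cuda_int8": lre_cuda_int8,
    "trt_int8": lre_trt_int8,
}
for name, engine in variants.items():
    print(name, engine.input_shapes, engine.input_dtypes)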
We can set up a cache for TensorRT engine builds; once the engines are built, they are reused for subsequent inference.
cache_dir = f"{optimized_output_dir}/cache"
if not os.path.exists(cache_dir):
    os.makedirs(cache_dir)
os.environ['TVM_TENSORRT_CACHE_DIR'] = cache_dir
Similarly, we can set up a timing cache, which TensorRT reuses to speed up subsequent engine builds.
timing_cache_dir = f"{optimized_output_dir}/timing_cache"
if not os.path.exists(timing_cache_dir):
    os.makedirs(timing_cache_dir)
os.environ['LAI_TENSORRT_TIMING_CACHE'] = timing_cache_dir
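Both caches follow the same pattern, so if you prefer, a small helper (a sketch equivalent to the two cells above) keeps the setup in one place:

def setup_cache(env_var, path):
    # Create the cache directory if needed and point the runtime at it.
    os.makedirs(path, exist_ok=True)
    os.environ[env_var] = path

setup_cache('TVM_TENSORRT_CACHE_DIR', f"{optimized_output_dir}/cache")
setup_cache('LAI_TENSORRT_TIMING_CACHE', f"{optimized_output_dir}/timing_cache")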
Speed measurements
def speed_test(lre, sample_input, iterations):
    # Warm up so that one-time costs (e.g. TensorRT engine builds) are not
    # attributed to the timed inference loop.
    print("==== Warm up ====")
    t_start = time.time()
    lre.warm_up(10)
    warmup_time = time.time() - t_start
    print("==== Inference ====")
    # Synchronize before and after timing so we only measure completed GPU work.
    torch.cuda.synchronize()
    t_start = time.time()
    for _ in range(iterations):
        lre.infer(sample_input)
    torch.cuda.synchronize()
    elapsed_time = time.time() - t_start
    latency = elapsed_time / iterations
    fps = iterations / elapsed_time
    torch.cuda.empty_cache()
    print()
    print(f"Warmup: {np.round(warmup_time, 2)}s")
    print(f"FPS: {np.round(fps, 2)}; Latency: {np.round(latency, 2)}s")
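Because latencies are rounded to hundredths of a second, fast models can print as 0.0s. A variant (a sketch; speed_test_percentiles is a hypothetical helper name, using the same warm_up/infer API) reports per-iteration percentiles in milliseconds with the higher-resolution time.perf_counter:

def speed_test_percentiles(lre, sample_input, iterations):
    # Sketch: time each call individually with a high-resolution clock.
    lre.warm_up(10)
    torch.cuda.synchronize()
    times = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        lre.infer(sample_input)
        torch.cuda.synchronize()  # per-call sync adds overhead but isolates each call
        times.append(time.perf_counter() - t0)
    times_ms = np.asarray(times) * 1000.0
    print(f"mean {times_ms.mean():.2f} ms, "
          f"p50 {np.percentile(times_ms, 50):.2f} ms, "
          f"p95 {np.percentile(times_ms, 95):.2f} ms")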
We will create a random input for our measurements.
shape = lre_cuda_fp32.input_shapes[0]
dtype = lre_cuda_fp32.input_dtypes[0]
input = np.random.random(shape).astype(dtype)
We will first run the CUDA LREs.
speed_test(lre_cuda_fp32, input, 2)
speed_test(lre_cuda_int8, input, 2)
We can observe that the warm-up is not strictly necessary here, since these models are already fully compiled.
Next, we run the TensorRT LREs.
speed_test(lre_trt_fp32, input, 2)
speed_test(lre_trt_int8, input, 2)
We can observe that the first run takes a long time because the TensorRT engines are being generated. Subsequent runs are faster because these engines are cached.
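To see the cache at work, we can, for example, reload one of the TensorRT models and time its first inference, which should now pick up the engines written to TVM_TENSORRT_CACHE_DIR (a sketch; lre_trt_fp32_reloaded is just an illustrative name):

# Sketch: a freshly loaded engine should reuse the cached TensorRT engines
# rather than rebuilding them from scratch.
lre_trt_fp32_reloaded = LRE(f"{optimized_output_dir}/notebook_trt_fp32/modelLibrary.so")
t_start = time.time()
lre_trt_fp32_reloaded.infer(input)
torch.cuda.synchronize()
print(f"First inference after reload: {np.round(time.time() - t_start, 2)}s")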
Managing inference configurations
GPU inference entails copying input data over to GPU memory and copying results back to CPU memory once the accelerated section completes. We can check whether an engine returns CPU-resident outputs:
lre_trt_fp32.is_cpu_output
Since the output is not on the CPU, we can follow one of two approaches, depending on our use case.
- Convert the tensor to a torch tensor while keeping it in GPU memory (see the sketch after this list)
output = lre_trt_fp32(input)
torch_output = torch.from_dlpack(output[0])
- Call set_cpu_output(True) to copy data over to the CPU
lre_trt_fp32.set_cpu_output(True)
output = lre_trt_fp32(input)
numpy_output = np.from_dlpack(output[0])
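As an example of why the first approach is useful, keeping the output on the GPU lets post-processing run there before any copy to the host. A minimal sketch, assuming the model emits class scores and that set_cpu_output(False) is the counterpart of set_cpu_output(True) that restores device-resident outputs:

# Assumption: passing False restores GPU-resident outputs.
lre_trt_fp32.set_cpu_output(False)
output = lre_trt_fp32(input)
torch_output = torch.from_dlpack(output[0])
probs = torch.softmax(torch_output.float(), dim=-1)  # assumes class-score outputs
print(probs.device)  # expected: cuda:0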
TensorRT inference also allows you to configure model precision at runtime. Switching precision requires the relevant calibration data, so this applies only to the TensorRT-quantized model.
lre_trt_int8.model_precision
speed_test(lre_trt_int8, input, 2)
lre_trt_int8.set_model_precision('float16')
speed_test(lre_trt_int8, input, 2)
lre_trt_int8.set_model_precision('float32')
speed_test(lre_trt_int8, input, 2)
Changing the model precision gives us different accuracy/latency trade-offs.
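Putting this together, a small sweep (a sketch, using only the precision identifiers demonstrated above) makes the comparison repeatable:

# Sketch: time each precision in turn with the speed_test helper defined earlier.
for precision in ("float16", "float32"):
    lre_trt_int8.set_model_precision(precision)
    print(f"---- {precision} ----")
    speed_test(lre_trt_int8, input, 2)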