Runtime Setup
We need torch as a dependency for NVIDIA GPU deployment; it provides the CUDA synchronization and dlpack utilities used below.
!pip install torch
import os
from pylre import LatentRuntimeEngine as LRE
import numpy as np
import torch
import time
optimized_output_dir = "optimized_outputs"
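Before proceeding, it is worth confirming that torch can actually see a CUDA device. A quick check:

# Sanity check that a CUDA-capable GPU is visible to torch.
print(torch.cuda.is_available())      # expect: True
print(torch.cuda.get_device_name(0))  # e.g. the name of the installed NVIDIA GPU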
We will instantiate several LREs for the different optimizations we carried out.
lre_cuda_fp32 = LRE(f"{optimized_output_dir}/notebook_cuda_fp32/modelLibrary.so")
lre_trt_fp32 = LRE(f"{optimized_output_dir}/notebook_trt_fp32/modelLibrary.so")
lre_cuda_int8 = LRE(f"{optimized_output_dir}/notebook_cuda_int8/modelLibrary.so")
lre_trt_int8 = LRE(f"{optimized_output_dir}/notebook_trt_int8/modelLibrary.so")
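Optionally, we can sanity-check that each variant loaded correctly by printing the input metadata each engine exposes (a small sketch using the input_shapes and input_dtypes attributes that also appear later in this notebook):

# Print the expected input shapes and dtypes for each optimized variant.
variants = {
    "cuda_fp32": lre_cuda_fp32,
    "trt_fp32": lre_trt_fp32,
    "cuda_int8": lre_cuda_int8,
    "trt_int8": lre_trt_int8,
}
for name, engine in variants.items():
    print(name, engine.input_shapes, engine.input_dtypes)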
We can set up a cache for TensorRT engine builds; once the engines are built, they are reused for subsequent inference.
cache_dir = f"{optimized_output_dir}/cache"
if not os.path.exists(cache_dir):
    os.makedirs(cache_dir)
os.environ['TVM_TENSORRT_CACHE_DIR'] = cache_dir
Similarly, we can set up a timing cache, which TensorRT reuses to speed up subsequent engine builds.
timing_cache_dir = f"{optimized_output_dir}/timing_cache"
if not os.path.exists(timing_cache_dir):
    os.makedirs(timing_cache_dir)
os.environ['LAI_TENSORRT_TIMING_CACHE'] = timing_cache_dir
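Both caches follow the same pattern, so if you prefer, a small helper (a sketch equivalent to the two cells above) keeps the setup in one place:

def setup_cache(env_var, path):
    # Create the cache directory if needed and point the runtime at it.
    os.makedirs(path, exist_ok=True)
    os.environ[env_var] = path

setup_cache('TVM_TENSORRT_CACHE_DIR', f"{optimized_output_dir}/cache")
setup_cache('LAI_TENSORRT_TIMING_CACHE', f"{optimized_output_dir}/timing_cache")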
Speed measurements
def speed_test(lre, sample_input, iterations):
    # Warm up so that one-time costs (e.g. TensorRT engine builds) are not
    # attributed to the timed inference loop.
    print("==== Warm up ====")
    t_start = time.time()
    lre.warm_up(10)
    warmup_time = time.time() - t_start
    print("==== Inference ====")
    # Synchronize before and after timing so we only measure completed GPU work.
    torch.cuda.synchronize()
    t_start = time.time()
    for _ in range(iterations):
        lre.infer(sample_input)
    torch.cuda.synchronize()
    elapsed_time = time.time() - t_start
    latency = elapsed_time / iterations
    fps = iterations / elapsed_time
    torch.cuda.empty_cache()
    print()
    print(f"Warmup: {np.round(warmup_time, 2)}s")
    print(f"FPS: {np.round(fps, 2)}; Latency: {np.round(latency, 2)}s")
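Because latencies are rounded to hundredths of a second, fast models can print as 0.0s. A variant (a sketch; speed_test_percentiles is a hypothetical helper name, using the same warm_up/infer API) reports per-iteration percentiles in milliseconds with the higher-resolution time.perf_counter:

def speed_test_percentiles(lre, sample_input, iterations):
    # Sketch: time each call individually with a high-resolution clock.
    lre.warm_up(10)
    torch.cuda.synchronize()
    times = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        lre.infer(sample_input)
        torch.cuda.synchronize()  # per-call sync adds overhead but isolates each call
        times.append(time.perf_counter() - t0)
    times_ms = np.asarray(times) * 1000.0
    print(f"mean {times_ms.mean():.2f} ms, "
          f"p50 {np.percentile(times_ms, 50):.2f} ms, "
          f"p95 {np.percentile(times_ms, 95):.2f} ms")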
We will create a random input for our measurements.
shape = lre_cuda_fp32.input_shapes[0]
dtype = lre_cuda_fp32.input_dtypes[0]
input = np.random.random(shape).astype(dtype)
We will first run the CUDA LREs.
speed_test(lre_cuda_fp32, input, 2)
speed_test(lre_cuda_int8, input, 2)
We can observe that the warm-up is not strictly necessary here, since these models are already fully compiled.
Next, we run the TensorRT LREs.
speed_test(lre_trt_fp32, input, 2)
speed_test(lre_trt_int8, input, 2)
We can observe that the first run takes a long time because the TensorRT engines are being generated. Subsequent runs are faster because these engines are cached.
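To see the cache at work, we can, for example, reload one of the TensorRT models and time its first inference, which should now pick up the engines written to TVM_TENSORRT_CACHE_DIR (a sketch; lre_trt_fp32_reloaded is just an illustrative name):

# Sketch: a freshly loaded engine should reuse the cached TensorRT engines
# rather than rebuilding them from scratch.
lre_trt_fp32_reloaded = LRE(f"{optimized_output_dir}/notebook_trt_fp32/modelLibrary.so")
t_start = time.time()
lre_trt_fp32_reloaded.infer(input)
torch.cuda.synchronize()
print(f"First inference after reload: {np.round(time.time() - t_start, 2)}s")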
Managing inference configurations
GPU inference entails copying input data over to GPU memory and copying results back to CPU memory once the accelerated section completes. We can check whether an engine returns CPU-resident outputs:
lre_trt_fp32.is_cpu_output
Since the output is not on the CPU, we can follow one of two approaches, depending on our use case.
- Convert the tensor to a torch tensor while keeping it in GPU memory (see the sketch after this list)
output = lre_trt_fp32(input)
torch_output = torch.from_dlpack(output[0])
- Call set_cpu_output(True) to copy data over to the CPU
lre_trt_fp32.set_cpu_output(True)
output = lre_trt_fp32(input)
numpy_output = np.from_dlpack(output[0])
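As an example of why the first approach is useful, keeping the output on the GPU lets post-processing run there before any copy to the host. A minimal sketch, assuming the model emits class scores and that set_cpu_output(False) is the counterpart of set_cpu_output(True) that restores device-resident outputs:

# Assumption: passing False restores GPU-resident outputs.
lre_trt_fp32.set_cpu_output(False)
output = lre_trt_fp32(input)
torch_output = torch.from_dlpack(output[0])
probs = torch.softmax(torch_output.float(), dim=-1)  # assumes class-score outputs
print(probs.device)  # expected: cuda:0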
TensorRT inference also allows you to configure model precision at runtime. Switching precision requires the relevant calibration data, so this applies only to the TensorRT-quantized model.
lre_trt_int8.model_precision
speed_test(lre_trt_int8, input, 2)
lre_trt_int8.set_model_precision('float16')
speed_test(lre_trt_int8, input, 2)
lre_trt_int8.set_model_precision('float32')
speed_test(lre_trt_int8, input, 2)
Changing the model precision gives us different accuracy/latency trade-offs.
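Putting this together, a small sweep (a sketch, using only the precision identifiers demonstrated above) makes the comparison repeatable:

# Sketch: time each precision in turn with the speed_test helper defined earlier.
for precision in ("float16", "float32"):
    lre_trt_int8.set_model_precision(precision)
    print(f"---- {precision} ----")
    speed_test(lre_trt_int8, input, 2)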