Compile a Model for an NVIDIA Target¶
The Forge TVM tutorial covers basic compilation steps. However, different hardware backends offer device-specific optimizations you can leverage to get significantly better performance. This tutorial offers step-by-step instructions for targeting NVIDIA GPUs with Forge.
Environment Setup¶
To get started, you'll first need to install additional tools needed for this tutorial and set up a Forge environment. Follow the installation instructions to set up a Docker container or conda environment.
Dependencies for this tutorial:
- ultralytics
- onnx
- torch
- torchvision
- Pillow (PIL)
!pip install torch==2.4.1 torchvision==0.19.1 --extra-index-url https://download.pytorch.org/whl/cu121
!pip install ultralytics onnx
!apt-get update
!apt-get install -y libgl1
import os
import urllib.request
import zipfile
from ultralytics import YOLO
import onnx
import torch
import forge
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image
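Since we'll be compiling for a GPU on this machine, it's worth confirming up front that PyTorch can see a CUDA device. This quick check is an optional addition and uses the torch module imported above.
# Optional sanity check: confirm a CUDA-capable GPU is visible before targeting it.
if torch.cuda.is_available():
    print(f"Found GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device detected; the CUDA and TensorRT targets below will not work.")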
Model and Dataset Setup¶
Next, we'll acquire some useful inputs for the compilation process and create a folder structure to maintain artifacts. We'll place our input traced model from Ultralytics in a models directory, download the COCO val2017 dataset for quantization calibration, and save our compiled models inside an optimized_outputs directory.
model_path = "models"
if not os.path.exists(model_path):
os.makedirs(model_path)
if not os.path.exists(f"{model_path}/yolov8n.onnx"):
model = YOLO(f"{model_path}/yolov8n.pt")
model.export(format="onnx")
dataset_dir = "val2017"
if not os.path.exists(dataset_dir):
print("Downloading val2017 dataset from COCO. This is required to quantize our model with INT8 precision")
url = "http://images.cocodataset.org/zips/val2017.zip"
file_path = "val2017.zip"
urllib.request.urlretrieve(url, file_path)
with zipfile.ZipFile(file_path, 'r') as zip_ref:
zip_ref.extractall(".")
optimized_output_dir = "optimized_outputs"
if not os.path.exists(optimized_output_dir):
os.makedirs(optimized_output_dir)
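As a quick optional check (not part of the original flow), you can confirm that the ONNX export and the calibration images are where the rest of the tutorial expects them.
# Confirm the exported model and the calibration images are in place.
print(os.path.exists(f"{model_path}/yolov8n.onnx"))
print(f"{len(os.listdir(dataset_dir))} images in {dataset_dir}")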
Loading the Model¶
Forge supports loading models from various frameworks. The guide on loading models describes the available methods.
We can load the traced model with ONNX and then ingest it into Forge using forge.from_onnx(), which loads a model from an ONNX file. This creates an IR object that Forge uses for subsequent transformations.
onnx_model = onnx.load(f"{model_path}/yolov8n.onnx")
ir = forge.from_onnx(onnx_model)
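If you want to verify that the exported file is a well-formed ONNX graph, you can optionally run the standard ONNX checker; this step isn't required by Forge.
# Optional: validate the ONNX graph with the standard ONNX checker.
onnx.checker.check_model(onnx_model)
print(onnx.helper.printable_graph(onnx_model.graph)[:500])  # preview the first part of the graph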
Now you can use the IR object to introspect the model. For instance, you can see which operators the model uses with the following property. Knowing the operators helps you understand how a model will compile into machine code for a particular target.
ir.operators
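If you'd rather see a summary than the raw listing, you can tally how often each operator appears. This is a minimal sketch that assumes ir.operators returns an iterable of operator names, matching its use above.
from collections import Counter

# Count occurrences of each operator (assumes ir.operators yields operator names).
op_counts = Counter(ir.operators)
for op, count in sorted(op_counts.items()):
    print(f"{op}: {count}")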
An IR object is a graph, and graphs are easy to manipulate. IR objects can also be subjected to optimizations—quantization, for example—and eventually compiled to machine code. This makes an IR graph quite powerful, as it can be used to design, manipulate, and translate a representation of a machine learning model to create optimal machine code. Consult the LEIP Optimize how-to guides for further details. It is important to note that some of these transforms are irreversible.
Compiling the Model¶
You can compile any IR graph to machine code; however, some hardware targets may not support your model. Forge employs hardware-specific compiler toolchains to do this translation. For NVIDIA GPU targets, Forge uses nvcc along with the hardware-optimized cuDNN and TensorRT libraries. Let's target the GPU on the machine you're working on right now for deployment. For more compilation details, consult the compilation guide.
target = "cuda"
We can set the target to cuda, and Forge will automatically target the accessible GPU for compilation. This compiles the model with accelerated kernels from the CUDA libraries.
ir.compile(target=target, output_path=f"{optimized_output_dir}/cuda_fp32", force_overwrite=True)
Pro Tip: Some operations over a graph are irreversible; it's good practice to make a copy before you do such transforms.
ir_trt = ir.copy()
However, NVIDIA also provides a more specialized TensorRT library with additional device-specific acceleration. TensorRT does not support all operators, so we will first partition the graph into subgraphs that are and are not supported by TensorRT.
ir_trt.partition_for_tensorrt()
Once partitioned, you can use the same command as before to compile to TensorRT.
ir_trt.compile(target=target, output_path=f"{optimized_output_dir}/trt_fp32", force_overwrite=True)
Once you've decided on a compilation target, you can use the compile function. Note that we're setting the compiled outputs to optimized_outputs/cuda_fp32 and optimized_outputs/trt_fp32. By default, Forge will avoid overwriting compiled artifacts that already exist; we're forcing an overwrite here so we can observe the compilation process.
You can take this compiled model to a deployment environment and deploy it.
You can also proceed further and attempt an optimization—in this case, quantization—on your model before deployment.
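To see what each compile step has written out so far, you can list the contents of the output directory. The exact artifact names depend on your Forge version, so treat this as an optional inspection step.
# Inspect the compiled artifacts produced so far.
for entry in sorted(os.listdir(optimized_output_dir)):
    print(entry)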
Quantizing the Model¶
For CUDA targets, quantization happens at compile time; for TensorRT, it happens at runtime. That means the CUDA flow performs both calibration and quantization before compiling, while the TensorRT flow performs only calibration at compile time and generates the quantized engine at runtime.
For more details, consult the how-to guide on quantization.
Loading a Calibration Dataset¶
Let's conduct the same calibration we did in the basics tutorial.
class CustomImageDataset(Dataset):
    def __init__(self, img_dir, end_index=20, transform=None):
        self.img_dir = img_dir
        self.transform = transform
        self.img_labels = [f for f in os.listdir(img_dir) if f.endswith(('jpg', 'jpeg', 'png'))]
        self.end_index = end_index if end_index <= len(self.img_labels) else len(self.img_labels)
        self.img_labels = self.img_labels[:self.end_index]

    def __len__(self):
        return self.end_index

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels[idx])
        image = Image.open(img_path).convert("RGB")
        if self.transform:
            image = self.transform(image).unsqueeze(0)  # add a leading batch dimension
        return image
transform = transforms.Compose([
    transforms.Resize((640, 640)),  # Resize the images to a fixed size
    transforms.ToTensor(),  # Convert the images to tensors
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Normalize the images
])
coco_dataset = CustomImageDataset(img_dir=dataset_dir, transform=transform)
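Before calibrating, it can help to confirm that one sample from the dataset has the shape the model expects. Given the transform above and the unsqueeze in __getitem__, each item should be a 1x3x640x640 tensor; this quick check is an addition to the original flow.
# Quick check: each calibration sample should be a [1, 3, 640, 640] tensor.
sample = coco_dataset[0]
print(type(sample), sample.shape)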
Calibrating the Model¶
We will first calibrate for CUDA.
ir.calibrate(coco_dataset)
For CUDA we have to do quantization at compile time.
ir.quantize(activation_dtype="uint8", quant_type="static")
Once quantized, we can compile.
ir.compile(target=target, output_path=f"{optimized_output_dir}/cuda_int8", force_overwrite=True)
Then we will calibrate for TensorRT.
ir_trt.calibrate(coco_dataset)
Now we can compile a TensorRT model that allows us to generate a quantized engine at runtime.
ir_trt.compile(target=target, output_path=f"{optimized_output_dir}/trt_int8", force_overwrite=True)
Head over to LEIP Deploy to learn how to deploy this compiled model on your target with NVIDIA GPUs!