Compile a Model for an NVIDIA Target¶
The Forge TVM tutorial covers basic compilation steps. However, different hardware backends offer device-specific optimizations you can leverage to get significantly better performance. This tutorial offers step-by-step instructions for targeting NVIDIA GPUs with Forge.
Environment Setup¶
To get started, you'll first need to install additional tools needed for this tutorial and set up a Forge environment. Follow the installation instructions to set up a Docker container or conda environment.
Dependencies for this tutorial:
- ultralytics
- onnx
- torch
- torchvision
- Pillow (PIL)
!pip install torch==2.4.1 torchvision==0.19.1 --extra-index-url https://download.pytorch.org/whl/cu121
!pip install ultralytics onnx
!apt-get update
!apt-get install -y libgl1
import os
import urllib.request
import zipfile
from ultralytics import YOLO
import onnx
import torch
import forge
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image
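Since we'll be compiling for a GPU on this machine, it's worth confirming up front that PyTorch can see a CUDA device. This quick check is an optional addition and uses the torch module imported above.
# Optional sanity check: confirm a CUDA-capable GPU is visible before targeting it.
if torch.cuda.is_available():
    print(f"Found GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device detected; the CUDA and TensorRT targets below will not work.")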
Model and Dataset Setup¶
Next, we'll acquire some useful inputs for the compilation process and create a folder structure to maintain artifacts. We'll place our input traced model from Ultralytics in a models directory, download the COCO val2017 dataset for quantization calibration, and save our compiled models inside an optimized_outputs directory.
model_path = "models"
if not os.path.exists(model_path):
os.makedirs(model_path)
if not os.path.exists(f"{model_path}/yolov8n.onnx"):
model = YOLO(f"{model_path}/yolov8n.pt")
model.export(format="onnx")
dataset_dir = "val2017"
if not os.path.exists(dataset_dir):
print("Downloading val2017 dataset from COCO. This is required to quantize our model with INT8 precision")
url = "http://images.cocodataset.org/zips/val2017.zip"
file_path = "val2017.zip"
urllib.request.urlretrieve(url, file_path)
with zipfile.ZipFile(file_path, 'r') as zip_ref:
zip_ref.extractall(".")
optimized_output_dir = "optimized_outputs"
if not os.path.exists(optimized_output_dir):
os.makedirs(optimized_output_dir)
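As a quick optional check (not part of the original flow), you can confirm that the ONNX export and the calibration images are where the rest of the tutorial expects them.
# Confirm the exported model and the calibration images are in place.
print(os.path.exists(f"{model_path}/yolov8n.onnx"))
print(f"{len(os.listdir(dataset_dir))} images in {dataset_dir}")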
Loading the Model¶
Forge supports loading models from various frameworks. The guide on loading models describes the available methods.
We can load the traced model with ONNX and then ingest it into Forge using forge.from_onnx(), which loads a model from an ONNX file. This creates an IR object that Forge uses for subsequent transformations.
onnx_model = onnx.load(f"{model_path}/yolov8n.onnx")
ir = forge.from_onnx(onnx_model)
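If you want to verify that the exported file is a well-formed ONNX graph, you can optionally run the standard ONNX checker; this step isn't required by Forge.
# Optional: validate the ONNX graph with the standard ONNX checker.
onnx.checker.check_model(onnx_model)
print(onnx.helper.printable_graph(onnx_model.graph)[:500])  # preview the first part of the graph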
Now you can use the IR object to introspect the model. For instance, you can see which operators the model uses with the following property. Knowing the operators helps you understand how a model will compile into machine code for a particular target.
ir.operators
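If you'd rather see a summary than the raw listing, you can tally how often each operator appears. This is a minimal sketch that assumes ir.operators returns an iterable of operator names, matching its use above.
from collections import Counter

# Count occurrences of each operator (assumes ir.operators yields operator names).
op_counts = Counter(ir.operators)
for op, count in sorted(op_counts.items()):
    print(f"{op}: {count}")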
An IR object is a graph, and graphs are easy to manipulate. IR objects can also be subjected to optimizations—quantization, for example—and eventually compiled to machine code. This makes an IR graph quite powerful, as it can be used to design, manipulate, and translate a representation of a machine learning model to create optimal machine code. Consult the LEIP Optimize how-to guides for further details. It is important to note that some of these transforms are irreversible.
Compiling the Model¶
You can compile any IR graph to machine code; however, some hardware targets may not support your model. Forge employs hardware-specific compiler toolchains to do this translation. For NVIDIA GPU targets, Forge uses nvcc along with the hardware-optimized cuDNN and TensorRT libraries. Let's target the GPU on the machine you're working on right now for deployment. For more compilation details, consult the compilation guide.
target = "cuda"
We can set the target to cuda, and Forge will automatically target the accessible GPU for compilation. This compiles the model with accelerated kernels from the CUDA libraries.
ir.compile(target=target, output_path=f"{optimized_output_dir}/cuda_fp32", force_overwrite=True)
Pro Tip: Some operations over a graph are irreversible; it's good practice to make a copy before you do such transforms.
ir_trt = ir.copy()
However, NVIDIA also provides a more specialized TensorRT library with additional device-specific acceleration. TensorRT does not support all operators, so we will first partition the graph into subgraphs that are and are not supported by TensorRT.
ir_trt.partition_for_tensorrt()
Once partitioned, you can use the same command as before to compile to TensorRT.
ir_trt.compile(target=target, output_path=f"{optimized_output_dir}/trt_fp32", force_overwrite=True)
Once you've decided on a compilation target, you can use the compile function. Note that we're setting the compiled outputs to optimized_outputs/cuda_fp32 and optimized_outputs/trt_fp32. By default, Forge will avoid overwriting compiled artifacts that already exist; we're forcing an overwrite here so we can observe the compilation process.
You can take this compiled model to a deployment environment and deploy it.
You can also proceed further and attempt an optimization—in this case, quantization—on your model before deployment.
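To see what each compile step has written out so far, you can list the contents of the output directory. The exact artifact names depend on your Forge version, so treat this as an optional inspection step.
# Inspect the compiled artifacts produced so far.
for entry in sorted(os.listdir(optimized_output_dir)):
    print(entry)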
Quantizing the Model¶
For CUDA targets, quantization happens at compile time; for TensorRT, it happens at runtime. That means the CUDA flow performs both calibration and quantization before compiling, while the TensorRT flow performs only calibration at compile time and generates the quantized engine at runtime.
For more details, consult the how-to guide on quantization.
Loading a Calibration Dataset¶
Let's conduct the same calibration we did in the basics tutorial.
class CustomImageDataset(Dataset):
    def __init__(self, img_dir, end_index=20, transform=None):
        self.img_dir = img_dir
        self.transform = transform
        self.img_labels = [f for f in os.listdir(img_dir) if f.endswith(('jpg', 'jpeg', 'png'))]
        self.end_index = end_index if end_index <= len(self.img_labels) else len(self.img_labels)
        self.img_labels = self.img_labels[:self.end_index]

    def __len__(self):
        return self.end_index

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels[idx])
        image = Image.open(img_path).convert("RGB")
        if self.transform:
            image = self.transform(image).unsqueeze(0)  # add a leading batch dimension
        return image
transform = transforms.Compose([
    transforms.Resize((640, 640)),  # Resize the images to a fixed size
    transforms.ToTensor(),  # Convert the images to tensors
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Normalize the images
])
coco_dataset = CustomImageDataset(img_dir=dataset_dir, transform=transform)
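Before calibrating, it can help to confirm that one sample from the dataset has the shape the model expects. Given the transform above and the unsqueeze in __getitem__, each item should be a 1x3x640x640 tensor; this quick check is an addition to the original flow.
# Quick check: each calibration sample should be a [1, 3, 640, 640] tensor.
sample = coco_dataset[0]
print(type(sample), sample.shape)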
Calibrating the Model¶
We will first calibrate for CUDA.
ir.calibrate(coco_dataset)
For CUDA we have to do quantization at compile time.
ir.quantize(activation_dtype="uint8", quant_type="static")
Once quantized, we can compile.
ir.compile(target=target, output_path=f"{optimized_output_dir}/cuda_int8", force_overwrite=True)
Then we will calibrate for TensorRT.
ir_trt.calibrate(coco_dataset)
Now we can compile a TensorRT model that allows us to generate a quantized engine at runtime.
ir_trt.compile(target=target, output_path=f"{optimized_output_dir}/trt_int8", force_overwrite=True)
Head over to LEIP Deploy to learn how to deploy this compiled model on your target with NVIDIA GPUs!