Quantize and Compile a Model¶
You can bring a model from most common frameworks into Forge. This walkthrough illustrates the compilation steps in Forge using a YOLOv8 model from Ultralytics, guiding you through compiling the model for different targets. It also shows how to quantize this float32 (FP32) model to run with 8-bit integer (INT8) precision prior to compilation. Once the model is compiled, you can run it on any of these target devices using LEIP Deploy.
This tutorial uses the Forge TVM backend, which is useful for pre-compiling a model for more efficient execution at runtime. For the Forge Onnx backend, see the Forge Onnx tutorial. For TensorRT compilation, see the Forge TVM TensorRT tutorial.
Environment Setup¶
Before diving into the tutorial, let’s ensure that your environment is correctly set up with all the tools required.
You have two options:
- Set up a Docker container.
- Create a Conda environment.
Follow the installation guide for step-by-step instructions.
To run this tutorial, you’ll need the following Python packages:
- ultralytics
- onnx
- torch
- torchvision
- PIL (the Pillow package)
Let’s first ensure these dependencies are installed in your environment.
!pip install torch==2.4.1 torchvision==0.19.1 --extra-index-url https://download.pytorch.org/whl/cu121
!pip install ultralytics
!apt-get update
!apt-get install -y libgl1
import os
import urllib.request
import zipfile
from ultralytics import YOLO
import onnx
import torch
import forge
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image
Model and Dataset Setup¶
Next, we'll acquire some useful inputs for the compilation process and create a folder structure to maintain artifacts. We'll place our input traced model from Ultralytics in the models directory, download the COCO val2017 dataset for quantization, and save our compiled models inside the optimized_outputs directory.
model_path = "models"
if not os.path.exists(model_path):
os.makedirs(model_path)
model = YOLO(f"{model_path}/yolov8n.pt")
model.export(format="onnx")
dataset_dir = "val2017"
if not os.path.exists(dataset_dir):
print("Downloading val2017 dataset from COCO. This is required to quantize our model with INT8 precision")
url = "http://images.cocodataset.org/zips/val2017.zip"
file_path = "val2017.zip"
urllib.request.urlretrieve(url, file_path)
with zipfile.ZipFile(file_path, 'r') as zip_ref:
zip_ref.extractall(".")
optimized_output_dir = "optimized_outputs"
if not os.path.exists(optimized_output_dir):
os.makedirs(optimized_output_dir)
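As an optional sanity check, you can count the extracted images; the COCO 2017 validation split contains 5,000 JPEG images:

# Optional: confirm the archive extracted correctly
num_images = len([f for f in os.listdir(dataset_dir) if f.endswith(".jpg")])
print(f"Found {num_images} images in {dataset_dir}")  # expect 5000 for val2017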
Loading the Model¶
Forge supports loading models from various frameworks. The guide on loading a model describes the available methods.
We can load the traced model with ONNX and then ingest it into Forge. We will use forge.from_onnx() to load the model from the ONNX file; this creates an IR object that Forge will use for subsequent transformations.
onnx_model = onnx.load(f"{model_path}/yolov8n.onnx")
ir = forge.from_onnx(onnx_model)
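Optionally, you can run ONNX's built-in checker over the exported graph as a quick sanity check. This is not required by Forge; it is just a way to catch a malformed export early:

# Optional: validate the exported ONNX graph
onnx.checker.check_model(onnx_model)  # raises an exception if the graph is malformed
print("ONNX model check passed")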
Now you can use the IR Object to introspect on the model. For example, you can see what operators are used in this model as follows. Knowing the operators can be useful for understanding how a model will compile to machine code for a particular target.
ir.operators
An IR object is a graph, and graphs are easy to manipulate. IR objects can also be subjected to optimizations—quantization, for example—and eventually compiled to machine code. This makes an IR graph quite powerful, as it can be used to design, manipulate, and translate a representation of a machine learning model to create optimal machine code. Consult the LEIP Optimize how-to guides for further details. It is important to note that some of these transforms are irreversible.
Compiling the Model¶
You can compile any IR graph to machine code. However, some hardware targets may not support your model. Forge employs hardware-specific compiler toolchains to do this translation. For CPU targets, Forge uses the LLVM compiler, and you can pass llvm cross-compilers and optimization flags that match your target hardware. For this tutorial, let's assume you will deploy the model on the same device you're working on right now. For more compilation details, please consult the compilation guide.
Use this Linux command to get specific details about your CPU target device:
!lscpu | grep "Model name"
A quick search should reveal what family of Intel processor you have. You can set your target for either basic compilation to the current device or something more specific. It is always possible to fall back to basic llvm compilation if you run into issues running the compiled artifact.
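If you're unsure which instruction-set extensions your CPU supports, you can inspect the feature flags reported by the kernel before choosing -mcpu or -mattr values. A minimal sketch for Linux; the exact flag names in /proc/cpuinfo can vary by kernel version:

# Check for a few SIMD feature flags in /proc/cpuinfo (Linux only)
with open("/proc/cpuinfo") as f:
    cpu_flags = set()
    for line in f:
        if line.startswith("flags"):
            cpu_flags = set(line.split(":", 1)[1].split())
            break

for feature in ("avx2", "avx512f", "avx512_vnni"):
    print(f"{feature}: {'yes' if feature in cpu_flags else 'no'}")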
Please note that if you set incorrect compiler flags, your model may compile successfully but fail to run on your device. It is your responsibility to ensure the compilation target matches the target hardware.
# target = "llvm"
target = "llvm -mcpu=cascadelake"
Example target tags for CPUs:
- target = "llvm -mcpu=cascadelake -mattr=avx2,+avx512f,+avx512cd,+avx512bw,+avx512dq,+avx512vl"
- target = "llvm -mattr=avx512_vnni2"
- target = "llvm -mcpu=cascadelake -mattr=avx512_vnni2"
- target = "llvm -mcpu=skylake"
- target = "llvm -mattr=avx512"
Pro Tip: Forge includes predefined aliases that simplify passing a target to the compiler. Methods for generating an up-to-date list of aliases are included in the compilation guide.
Once you've decided on a compilation target, you can call the compile function.
ir.compile(target=target, output_path=f"{optimized_output_dir}/cpu_fp32", force_overwrite=True)
You can find the compiled artifact in optimized_outputs/cpu_fp32 as modelLibrary.so. You can take this compiled model to a deployment environment and deploy it.
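As a quick check, you can list the output directory to confirm the compiled library was written (the exact set of accompanying files may vary):

# Confirm the compiled artifact exists
print(os.listdir(f"{optimized_output_dir}/cpu_fp32"))  # should include modelLibrary.so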
You can also proceed further and attempt an optimization—in this case, quantization—on your model before compilation and deployment.
Quantizing the Model¶
Model quantization is a process that reduces the precision of a neural network's weights to decrease its size. Reduced size translates into fewer and less expensive computations, which speeds up inference and reduces power consumption. It converts floating-point numbers into lower-precision formats like INT8, which can increase model efficiency for deployment on devices with limited resources. Depending on your data and model, quantization can provide a lot of performance and power benefits with negligible accuracy loss.
Keep in mind that not all hardware supports integer formats, and that quantization is a lossy transformation that can reduce accuracy. It's important to test during development what quantization can accomplish for your model. For more information, consult the guide on quantization.
In this section, we'll use the uncompiled model we have been working with to attempt static quantization. Then we'll compile the quantized model.
Loading a Calibration Dataset¶
For static quantization, we need to calibrate the model using a calibration dataset, a representative sample of input data, to collect statistics from the model's intermediate layers. This ensures the calibration statistics mirror the distributions expected in a deployment environment. Because calibration can be time-consuming, we're running it on just 20 images for this tutorial. Keep in mind that there is no simple linear relationship between the sample size and the quality of the statistics gathered.
Pro Tip: Modify your calibration dataset according to your model or select some images from your validation dataloader.
class CustomImageDataset(Dataset):
    def __init__(self, img_dir, end_index=20, transform=None):
        self.img_dir = img_dir
        self.transform = transform
        self.img_labels = [f for f in os.listdir(img_dir) if f.endswith(('jpg', 'jpeg', 'png'))]
        self.end_index = end_index if end_index <= len(self.img_labels) else len(self.img_labels)
        self.img_labels = self.img_labels[:self.end_index]

    def __len__(self):
        return self.end_index

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels[idx])
        image = Image.open(img_path).convert("RGB")
        if self.transform:
            image = self.transform(image).unsqueeze(0)
        return image
transform = transforms.Compose([
    transforms.Resize((640, 640)),  # Resize the images to a fixed size
    transforms.ToTensor(),  # Convert the images to tensors
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Normalize the images
])
coco_dataset = CustomImageDataset(img_dir=dataset_dir, transform=transform)
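You can pull a single sample to confirm the dataset yields tensors in the layout the model expects (a batch of 1, three channels, 640x640):

# Inspect one calibration sample
sample = coco_dataset[0]
print(sample.shape)  # expected: torch.Size([1, 3, 640, 640])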
Calibrating the Model¶
Now that we have our calibration dataset, we can calibrate the model. The objective of calibration is to maximize accuracy retention and minimize information loss during subsequent quantization, compared to quantizing without these statistics. We'll create a copy of our IR object, since the calibration statistics will be associated with it. Even though we're compiling for CPU, we can still leverage GPU resources if they are available in the developer environment. In fact, using a GPU is recommended, since calibration parallelizes efficiently. For this tutorial, we have disabled GPU use by setting use_cuda to False, but please try enabling it if you have GPU resources.
Pro Tip: Check whether you have GPU resources with nvidia-smi or an equivalent command, and if you do, attempt calibration with use_cuda=True.
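If you prefer to decide programmatically rather than checking by hand, here is a small sketch using torch.cuda.is_available(); you could pass the resulting flag to calibrate instead of hard-coding False:

# Optionally pick use_cuda based on whether PyTorch can see a CUDA device
use_cuda = torch.cuda.is_available()
print(f"CUDA available: {use_cuda}")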
ir_int8 = ir.copy()
ir_int8.calibrate(coco_dataset, use_cuda=False)
Static Quantizing and Compiling the Model¶
Forge supports different quantization formats for activations and kernels during static quantization. You can choose between INT8 and UINT8 formats based on your model's needs:
- Activations: Supported formats are INT8 and UINT8.
- Kernels (Weights): Supported formats include INT8 and UINT8.
For more advanced options, consult the guide on quantization.
In this tutorial, we will choose UINT8 for both activations and kernels. For dynamic quantization, consult the relevant section of the quantization guide.
Pro Tip: Calibration is not device specific. You can use the same calibration data for different quantization schemes and compilation for different targets.
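If you plan to experiment with more than one quantization scheme, one approach is to keep an extra copy of the calibrated IR before quantizing, since quantize modifies the IR in place and some transforms are irreversible. A sketch, assuming copy() carries over the calibration statistics (as suggested by the copy made above):

# Optional: keep a calibrated copy for trying other quantization schemes later
# (assumes copy() preserves the calibration statistics gathered above)
ir_calibrated_backup = ir_int8.copy()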
ir_int8.quantize(activation_dtype="uint8", kernel_dtype="uint8", quant_type="static")
ir_int8.compile(target=target, output_path=f"{optimized_output_dir}/cpu_int8", force_overwrite=True)
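As a final check, you can compare the two compiled libraries on disk. This assumes the INT8 build also writes a modelLibrary.so, matching the FP32 output described above; exact sizes will depend on your model and target:

# Compare the FP32 and INT8 compiled libraries on disk (sizes in MB)
for build in ["cpu_fp32", "cpu_int8"]:
    lib_path = os.path.join(optimized_output_dir, build, "modelLibrary.so")
    print(f"{build}: {os.path.getsize(lib_path) / 1e6:.1f} MB")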
Head over to LEIP Deploy to learn how to deploy this compiled model on your target!