Explore Compilation, Quantization, and Calibration Parameters in LEIP Optimize¶
When compiling a model for deployment, it's important to choose the right compilation and quantization parameters to optimize performance for your target hardware. LEIP Design provides several options to customize these optimization settings. To get started, let's load a recipe from the detector GRDB:
from pathlib import Path
import leip_recipe_designer as rd
workspace = Path('./workspace')
pantry = rd.Pantry.build(workspace / "./my_combined_pantry/", force_rebuild=False)
recipe = rd.create.from_recipe_id('59845', pantry=pantry, allow_upgrade=True)
2025-09-07 22:59:47,670 | WARNING | pantry.build-119 | You requested to build a Pantry, but haven't specified the desired execution contexts. Therefore, will use the installed ones ['leip_af', 'leip_forge', 'leip_stub_gen']
Downloading goldenrecipedb with name "xval_det" and variant "Xval0.3" (0)... Download completed: workspace/goldenrecipedbs/xval_det/Xval0.3 This is the Cross-validation volume. Available methods are- get_golden_df describe_table
Selecting the Compiler¶
Our recipes include everything you need out of the box, but here we'll show how the compiler parameters in a recipe can be modified.
Key Differences Between TVM and ONNXRuntime:¶
TVM:
- Versatile and supports a wide range of hardware (CPUs, GPUs, TPUs).
- Ideal for environments where custom optimizations are needed.
- Requires more manual tuning for best performance on GPUs.
ONNXRuntime:
- Flexible runtime supporting both CPU and CUDA execution.
- Can also delegate to NVIDIA TensorRT for maximum GPU performance.
- Provides built-in support for mixed precision (FP16 and INT8) when using TensorRT.
LEIP Design supports multiple compiler backends. To view the available options, use the recipe.options() method as shown below:
recipe.options("compiler")
> Help: Compiler.
> Ingredients that fit:
Index Parameter Type Version UUID
0 TensorRT Compiler compiler 1.1.0 ceca5d1d2c1eb6c6f8394ba1695e04514dd55a3ed83fabbc3cae5de3f612bf04
1 ONNXRuntime compiler 1.0.0 fd5d9762d8c4825ce50b85ac772583cfe38b791592e2575854e7263c224de8cc
2 TVM Compiler compiler 1.1.0 ec505da794d7c5bf8590e5917cf01047dcac124aa448d65132f1ae764ac65b90
> Use recipe.assign_ingredients('compiler', ingredient_name) to add it to the recipe.
> Or alternatively, use recipe['compiler'] = ingredient_id.
Important Notice: The TensorRT Compiler ingredient is being deprecated. It is recommended to use ONNXRuntime with TensorRT enabled for NVIDIA GPU targets. This provides better compatibility and maintenance going forward.
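Following that recommendation, targeting an NVIDIA GPU would mean assigning the ONNXRuntime compiler ingredient by name, using the same `assign_ingredients` API shown in the help output above. This is a recipe-configuration sketch; whether TensorRT is actually used is an assumption to verify against the compiler ingredient's own parameters for your LEIP Design version.

```python
# For NVIDIA GPU targets, select the ONNXRuntime compiler ingredient
# by its name from the options table (avoid the deprecated
# "TensorRT Compiler" entry).
recipe.assign_ingredients('compiler', 'ONNXRuntime')
```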
TVM can be used for compiling to CPU devices and provides flexibility for custom hardware targets. ONNXRuntime is recommended for devices with NVIDIA GPUs, where it can optionally use TensorRT as the execution provider for maximum performance.
First, we will compile for your local host machine's CPU, so we'll choose the TVM Compiler. ONNXRuntime with TensorRT is limited to NVIDIA hardware, but it offers the best performance on those platforms.
recipe.assign_ingredients('compiler', "TVM Compiler")
[{'choice_id': 'ec505da794d7c5bf8590e5917cf01047dcac124aa448d65132f1ae764ac65b90',
'choice_name': 'TVM Compiler',
'synonym': 'compiler',
'parent': 'Full Recipe',
'slot': 'slot:compiler',
'path': ['slot:compiler']}]
There are a few things we will need to update for the compiler component. Let's print the compiler ingredient to see what the options are.
recipe['compiler']
VBox(children=(HTML(value='\n<style>\n .recipe-accordion-style > div[class*="jupyter-widget-Accordion-"] > …
We see that the quantizer component still needs to be filled with an ingredient. Let's look at the available quantizer options using the recipe.options() API.
recipe['compiler'].options('quantizer')
> Help: Quantizer.
> Ingredients that fit:
Index Parameter Type Version UUID
0 No Quantizer quantizer.none 1.0.0 5f4d1d4b877405e1a4abf47b8bc5fe02d46013ccffbf2a78c7d81176d15156fa
1 TVM Quantizer quantizer.tvm 1.0.0 89a87d628e252474ab602a5fb24128b9026f7faa7d3bfbcaed60503b36ffda05
> Use recipe.assign_ingredients('quantizer', ingredient_name) to add it to the recipe.
> Or alternatively, use recipe['quantizer'] = ingredient_id.
If you do not want to apply quantization, select the No Quantizer ingredient:
recipe['compiler']['quantizer'] = recipe['compiler'].options('quantizer')[0]
Using the TVM Quantizer is optional, but it helps you optimize your model. To enable quantization, you need to add an ingredient to the quantizer slot. Additionally, you will need to add a calibrator to the quantizer's calibrator slot to determine the data used for calibrating your model.
Quantizer¶
- The quantizer reduces the precision of model weights and activations (e.g., from FP32 to INT8), improving inference speed and reducing memory usage.
- TVM supports multiple quantization methods, which can be assigned to the quantizer slot. Choosing the right quantizer is crucial to maintaining model accuracy.
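As intuition for what the quantizer does, here is a minimal sketch of symmetric INT8 quantization in plain Python. This shows the general FP32-to-INT8 scheme, not LEIP's exact implementation, and the helper names are hypothetical.

```python
# Minimal sketch of symmetric INT8 quantization: map real values in
# [-threshold, threshold] onto the integer range [-127, 127].

def quantize(values, threshold):
    # one integer step corresponds to `scale` real units
    qvals = [max(-127, min(127, round(v * 127.0 / threshold))) for v in values]
    return qvals, threshold / 127.0  # quantized values and the scale

def dequantize(qvals, scale):
    # recover approximate real values from the integers
    return [q * scale for q in qvals]

weights = [0.5, -1.0, 0.03, 0.98]
q, scale = quantize(weights, threshold=1.0)
print(q)  # [64, -127, 4, 124]
# Dequantized values recover the originals to within one quantization step.
print(max(abs(w - d) for w, d in zip(weights, dequantize(q, scale))) < scale)  # True
```

The `threshold` here is exactly what calibration (described below) estimates from data: too small and large activations clip, too large and small values collapse onto the same integer.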
Calibrator¶
Calibration is a key step when optimizing models for deployment, especially when applying quantization techniques. In essence, calibration helps adjust the model’s activations and parameters to better fit reduced-precision formats like INT8. It ensures that the model retains accuracy despite lower precision. We support:
- average: Computes the average of the min-max extrema across the calibration dataset. Useful when the distribution of activations is relatively stable.
- entropy: Optimizes quantization thresholds by maximizing the entropy of the quantized values, which can improve model performance on datasets with non-uniform activation distributions.
- minmax: Uses the absolute most extreme min-max values from the calibration data. This method is ideal when you want to capture the full range of activations.
- percentile: Sets quantization thresholds based on the 99th percentile of activation values across the calibration data, which can help exclude outliers and provide better overall performance.
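To build intuition for how these methods pick different clipping ranges, here is a small illustrative sketch in plain Python. The helper functions are hypothetical, not the LEIP implementation, and entropy is omitted since it involves an iterative threshold search.

```python
# Illustrative sketches of min-max, average, and percentile calibration
# over a toy calibration set of activation batches.

def minmax_range(batches):
    # absolute extremes across the whole calibration set
    return (min(min(b) for b in batches), max(max(b) for b in batches))

def average_range(batches):
    # average of each batch's min/max extrema
    lo = sum(min(b) for b in batches) / len(batches)
    hi = sum(max(b) for b in batches) / len(batches)
    return (lo, hi)

def percentile_range(batches, pct=99.0):
    # threshold at the pct-th percentile of absolute activation values
    vals = sorted(abs(v) for b in batches for v in b)
    hi = vals[min(len(vals) - 1, int(len(vals) * pct / 100.0))]
    return (-hi, hi)

batches = [[-0.9, 0.1, 0.8], [-1.2, 0.0, 6.0]]   # 6.0 is an outlier
print(minmax_range(batches))          # (-1.2, 6.0): the outlier dominates
print(percentile_range(batches, 80))  # (-1.2, 1.2): the outlier is clipped
```

On this tiny sample an 80th percentile is used so the clipping effect is visible; on a realistic calibration set the 99th percentile plays the same role.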
recipe['compiler'].assign_ingredients("quantizer", "TVM")
recipe['compiler']['quantizer'].assign_ingredients("calibrator", "^Calibrator$")
[{'choice_id': '806df2fab9e2fef7cdb72977b5a9ca14f63c46d159c7a3a2740060e84fd62b0c',
'choice_name': 'Calibrator',
'synonym': 'calibrator',
'parent': 'TVM Quantizer',
'slot': 'slot:calibrator',
'path': ['slot:compiler', 'slot:quantizer', 'slot:calibrator']}]
By default, we use average:
recipe['compiler']['quantizer.calib_method']
'average'
Once you've set your compiler, quantizer, and calibrator options, you can compile the model. Before compiling, ensure that the CUDA binary path is correctly set (if you're using any GPU-based calibrator). You can do this by adding CUDA to your PATH environment variable:
import os
os.environ["PATH"] = os.environ["PATH"] + ":/usr/local/cuda/bin/"
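Optionally, you can sanity-check that the CUDA toolchain is actually discoverable after updating PATH. This is a small stdlib-only sketch, and it assumes a default CUDA install location; adjust the path for your system.

```python
import os
import shutil

# Append the default CUDA binary directory (adjust for your install).
os.environ["PATH"] = os.environ["PATH"] + os.pathsep + "/usr/local/cuda/bin"

# shutil.which returns None if nvcc is still not discoverable.
if shutil.which("nvcc") is None:
    print("nvcc not found on PATH; GPU-based calibration may fail")
```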
Now, compile away as shown before in the Getting Started tutorial! (Hint: use rd.tasks.compile()). To compile artifacts for different hardware targets, refer to the Setting Compiler Targets guide.