Explore Compilation, Quantization, and Calibration Parameters in LEIP Optimize¶
When compiling a model for deployment, it's important to choose the right compilation and quantization parameters to optimize performance for your target hardware. LEIP Design provides several options to customize these optimization settings. To get started, let's load a recipe:
from pathlib import Path
import leip_recipe_designer as rd
workspace = Path('./workspace')
pantry = rd.Pantry.build(workspace / "my_combined_pantry", force_rebuild=False)
recipe = rd.create.from_recipe_id('59845', pantry=pantry, allow_upgrade=True)
2024-10-29 11:53:18,915 | WARNING | pantry.build-119 | You requested to build a Pantry, but haven't specified the desired execution contexts. Therefore, will use the installed ones ['leip_af', 'leip_forge', 'leip_stub_gen']
Skipped downloading goldenrecipedb with name "xval_det" and variant "Xval0.3" (0), as it already exists. This is the Cross-validation volume. Available methods are- get_golden_df describe_table
Selecting the Compiler¶
Our recipes include everything you need, but here we will show how the compiler parameters in a recipe can be modified.
Key Differences Between TVM and TensorRT:¶
TVM:
- Versatile and supports a wide range of hardware (CPUs, GPUs, TPUs).
- Ideal for environments where custom optimizations are needed.
- Requires more manual tuning for best performance on GPUs.
TensorRT:
- Highly optimized for NVIDIA GPUs and provides the best performance on this hardware.
- Includes built-in support for mixed precision (FP16 and INT8), making it easier to achieve speedups.
- Limited to NVIDIA hardware but offers ease of use and superior performance in those environments.
LEIP Design supports multiple compiler backends. To view the available options, use the recipe.options() method as shown below:
recipe.options("compiler")
> Help: Compiler.
> Ingredients that fit:
  Index  Parameter          Type      Version  UUID
  0      TensorRT Compiler  compiler  1.0.1    f0494dd3296ae368a7d1b7c6a5aa45a6d633832847795c3cc5f255747b49e6bc
  1      TVM Compiler       compiler  1.0.1    30fa1c69d6756bb43531d054dddf241bcdfdd790a7b06acfeef164f18a5efec3
> Use recipe.assign_ingredients('compiler', ingredient_name) to add it to the recipe.
> Or alternatively, use recipe['compiler'] = ingredient_id.
TVM can be used for CPU devices. TensorRT is recommended for devices with NVIDIA GPUs.
First, we will compile for your local host machine's CPU, so we'll choose the TVM Compiler. TensorRT is limited to NVIDIA hardware, and it's less flexible when it comes to extending support for other hardware architectures.
recipe.assign_ingredients('compiler', "TVM Compiler")
[{'choice_id': '30fa1c69d6756bb43531d054dddf241bcdfdd790a7b06acfeef164f18a5efec3', 'choice_name': 'TVM Compiler', 'synonym': 'compiler', 'parent': 'Full Recipe', 'slot': 'slot:compiler', 'path': ['slot:compiler']}]
There are a few things we will need to update for the compiler component. Let's print the compiler ingredient to see what the options are.
recipe['compiler']
(Displays an interactive widget summarizing the compiler ingredient and its slots.)
For now, we see that the quantizer component needs to be filled with an ingredient. Let's first look at the options for the quantizer, again using the recipe.options() API.
recipe['compiler'].options('quantizer')
> Help: Quantizer.
> Ingredients that fit:
  Index  Parameter     Type            Version  UUID
  0      No Quantizer  quantizer.none  1.0.0    2d87ce7d89753e291d11dc8d5eadedc2047bf82a0361e903f20d977faea5f681
  1      TVM Quantizer quantizer.tvm   1.0.0    0516fd2b6b4d10e1de029e40a13a0b542aa90ea0546427b8b7f35d28606229c0
> Use recipe.assign_ingredients('quantizer', ingredient_name) to add it to the recipe.
> Or alternatively, use recipe['quantizer'] = ingredient_id.
If you do not want to apply quantization, select the No Quantizer ingredient:
recipe['compiler']['quantizer'] = recipe['compiler'].options('quantizer')[0]
Using the TVM Quantizer is optional and helps you optimize your recipe. To optimize your model, you need to add an ingredient to the quantizer slot. Additionally, you will need to add a calibrator ingredient to the quantizer's calibrator slot to determine the data you will use for calibrating your model.
Quantizer¶
- The quantizer reduces the precision of model weights and activations (e.g., from FP32 to INT8), improving inference speed and reducing memory usage.
- TVM supports multiple quantization methods, which can be assigned to the quantizer slot. Choosing the right quantizer is crucial to maintaining model accuracy.
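To make the idea concrete, here is a minimal, self-contained sketch of symmetric INT8 quantization in plain Python. It is illustrative only and is not how TVM implements quantization internally; the function names and the toy weight values are our own.

```python
# Illustrative sketch of symmetric INT8 quantization: floats are mapped to
# integers in [-127, 127] using a single scale factor, shrinking storage
# from 32 bits to 8 bits per value at the cost of some rounding error.

def quantize_int8(values):
    """Quantize a list of floats to INT8 using a symmetric min-max scale."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(quantized, scale):
    """Map INT8 values back to floats; the residual is the quantization error."""
    return [q * scale for q in quantized]

weights = [0.02, -1.27, 0.635, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)         # small integers in [-127, 127]
print(restored)  # close to the original weights, within one scale step
```

The rounding error per value is bounded by the scale factor, which is why choosing a good range (the calibrator's job, below) matters so much for accuracy.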
Calibrator¶
Calibration is a key step when optimizing models for deployment, especially when applying quantization techniques. In essence, calibration helps adjust the model’s activations and parameters to better fit reduced-precision formats like INT8. It ensures that the model retains accuracy despite lower precision. We support:
- average: Computes the average of the min-max extrema across the calibration dataset. Useful when the distribution of activations is relatively stable.
- entropy: Optimizes quantization thresholds by maximizing the entropy of the quantized values, which can improve model performance on datasets with non-uniform activation distributions.
- minmax: Uses the absolute most extreme min-max values from the calibration data. This method is ideal when you want to capture the full range of activations.
- percentile: Sets quantization thresholds based on the 99th percentile of activation values across the calibration data, which can help exclude outliers and provide better overall performance.
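To see why the choice of method matters, here is a hedged, pure-Python sketch contrasting minmax and percentile threshold selection on synthetic activations. It illustrates the general idea only; the function names and data are our own, not LEIP's implementation.

```python
# Illustrative sketch: how minmax and percentile calibration pick a clipping
# threshold from recorded activations. A single outlier inflates the minmax
# threshold (wasting INT8 range), while the percentile method ignores it.

def minmax_threshold(acts):
    """Use the most extreme absolute activation -- captures the full range."""
    return max(abs(a) for a in acts)

def percentile_threshold(acts, pct=99.0):
    """Clip at the pct-th percentile of absolute values, excluding outliers."""
    ordered = sorted(abs(a) for a in acts)
    idx = min(len(ordered) - 1, int(len(ordered) * pct / 100.0))
    return ordered[idx]

# 1000 well-behaved activations plus one extreme outlier
acts = [i / 1000.0 for i in range(1000)] + [50.0]
print(minmax_threshold(acts))      # 50.0 -- dominated by the outlier
print(percentile_threshold(acts))  # 0.99 -- ignores the outlier
```

With the minmax threshold, nearly all of the INT8 range would encode values that never occur; the percentile threshold keeps resolution where the activations actually live.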
recipe['compiler'].assign_ingredients("quantizer", "TVM")
recipe['compiler']['quantizer'].assign_ingredients("calibrator", "^Calibrator$")
[{'choice_id': '5731364051ba4974cf10c2a71f94518ba7047df442fff5887e1d5760961cf8be', 'choice_name': 'Calibrator', 'synonym': 'calibrator', 'parent': 'TVM Quantizer', 'slot': 'slot:calibrator', 'path': ['slot:compiler', 'slot:quantizer', 'slot:calibrator']}]
By default, we use average:
recipe['compiler']['quantizer.calib_method']
'average'
Once you've set your compiler, quantizer, and calibrator options, you can compile the model. Before compiling, ensure that the CUDA binary path is correctly set (if you're using any GPU-based calibrator). You can do this by adding CUDA to your PATH environment variable:
import os
os.environ["PATH"] = os.environ["PATH"] + ":/usr/local/cuda/bin/"
Now, compile away as shown before! (Hint: use rd.tasks.compile().)