Calibration and Quantization¶
Model quantization reduces the precision of a neural network's parameters to shrink its size and speed up inference. It converts floating-point numbers into lower-precision formats such as 8-bit integer (INT8), making the model more efficient to deploy on devices with limited resources.
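As a conceptual sketch only (this illustrates the general idea of affine INT8 quantization, not Forge's internal implementation), a float tensor is mapped to INT8 with a scale and zero-point derived from its observed range:

import numpy as np

x = np.random.randn(4, 8).astype(np.float32)          # hypothetical float activations

# map the observed [min, max] range onto the INT8 range [-128, 127]
scale = float(x.max() - x.min()) / 255.0
zero_point = np.round(-128.0 - x.min() / scale)
x_q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)

# dequantize to see the (lossy) reconstruction
x_dq = (x_q.astype(np.float32) - zero_point) * scale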
Note
Forge only supports post-training quantization; it does not support quantization-aware training.
Load an IRModule
import forge
import onnx
onnx_model = onnx.load("path/to/model.onnx")
ir = forge.from_onnx(onnx_model)
Quantization with Forge¶
Quantization with TensorRT
If compiling for TensorRT, skip ahead to the section "Quantization with TensorRT" below.
Quantize Method Docstring

forge.IRModule.quantize(activation_dtype='int8', kernel_dtype=None, bias_dtype=None, per_channel=False, calib_method='average', quant_type='any')¶

Applies quantization to the model with specified parameters.

This method quantizes the model's activations, kernels, and biases to the specified data types. If kernel_dtype and bias_dtype are None, they default to the activation_dtype. The quantization can be "static" (requiring prior calibration), "dynamic" (no calibration needed), or "any" (prioritizing static if possible).

Note: When using the "static" quant_type, ensure calibration is performed beforehand or provide calib_data for calibration. If split_tensors is enabled, existing calibration data is discarded due to graph changes, necessitating fresh calib_data.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
activation_dtype | str | Data type for activations ("int8", "uint8"); default is "int8". | 'int8' |
kernel_dtype | Optional[str] | Data type for kernels ("int8", "uint8"); defaults to activation_dtype. | None |
bias_dtype | Optional[str] | Data type for biases; defaults to activation_dtype. | None |
per_channel | bool | If True, performs per-channel quantization on kernels. Default is False. | False |
calib_method | str | Method for calibration ("average", "entropy", "minmax", "percentile"); default is "average". Overview of calibration methods: "average" - computed average of the min-max extrema across calibration data; "entropy" - distribution-based maximization of the entropy of quantized values; "minmax" - absolute most extreme min-max values across calibration data; "percentile" - computed 99th-percentile cut-offs across calibration data. | 'average' |
quant_type | str | Type of quantization ("static", "dynamic", "any"); default is "any". | 'any' |

Returns:

Name | Type | Description |
---|---|---|
None | None | This method operates in place. |
There are two ways to apply quantization to a model, dynamic and static, selected via the quant_type argument.
Dynamic quantization skips the calibration and only quantizes the weights of the model in advance. The activations are quantized on-the-fly during inference. This method doesn't require a representative dataset for calibration and offers more flexibility, as it adapts to the actual range of activation values seen at runtime. While dynamic quantization is simpler and faster to apply, it might not optimize computational performance as effectively as static quantization.
Static quantization involves two steps: calibration and then quantization. During calibration, the model runs on a subset of the dataset to gather statistics on how its weights and activations typically distribute. This step determines the optimal way to map these values to a lower precision. Once calibration is complete, the actual quantization step converts the model's weights and activations to this lower precision format, effectively reducing the model's size and potentially speeding up its inference.
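As a minimal sketch of the two flows (rep_data is a hypothetical iterable of pre-processed inputs; see Calibration below):

# dynamic: weights are quantized ahead of time, activations are handled at runtime
ir_dyn = ir.copy()
ir_dyn.quantize(quant_type="dynamic")

# static: calibrate on representative data first, then quantize
ir_stat = ir.copy()
ir_stat.calibrate(rep_data)
ir_stat.quantize(quant_type="static")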
Quantization Data Types¶
A user can specify the quantized data types of three elements: activation_dtype, kernel_dtype, and bias_dtype.
Activation (activation_dtype)
The data type for the activation can either be "int8" or "uint8". Generally, "int8" is preferred for optimal computational performance, while "uint8" is often used to minimize accuracy loss during quantization.
Convolution & Matrix-Multiply Weight (kernel_dtype)
The data type of the weights for convolutional and matrix-multiply operators can be either "int8" or "uint8". Depending on the target, the compiler sometimes accepts (and may prefer) specific data types.
Bias-Add Weight (bias_dtype)
The data type of the weights for bias-add operators can either match the activation data type or be set to "int32". An 8-bit data type may be preferred for optimal computational performance, while 32 bits ("int32") can be used to minimize accuracy loss during quantization.
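A sketch combining the three settings (whether a particular combination is accepted may depend on the compilation target, as noted above):

ir_dtypes = ir.copy()
ir_dtypes.quantize(
    activation_dtype="uint8",  # often used to minimize accuracy loss
    kernel_dtype="int8",       # weight dtype for conv / matmul operators
    bias_dtype="int32",        # wider biases to minimize accuracy loss
    quant_type="dynamic",
)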
Boosting Quantization Accuracy¶
Forge provides two optimizations for improving a quantized model's accuracy. Each optimization can reduce accuracy loss during quantization, but each comes at a cost in computational performance.
Per-Channel Quantization (per_channel)
Per-channel quantization is a technique where each channel of a model's weights is quantized independently, allowing for more fine-grained control over the quantization parameters. Because it accounts for the different distributions across channels, this approach often preserves the model's accuracy better than per-tensor quantization, which applies a single scale and zero-point to the entire weight tensor.
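Conceptually (a numpy illustration only, not Forge internals), per-channel quantization computes one scale per output channel of a weight tensor instead of a single scale for the whole tensor:

import numpy as np

w = np.random.randn(64, 3, 3, 3).astype(np.float32)   # hypothetical conv weights, 64 output channels

# per-tensor: one symmetric scale for the whole tensor
scale_per_tensor = np.abs(w).max() / 127.0

# per-channel: one symmetric scale per output channel, shape (64,)
scale_per_channel = np.abs(w).reshape(64, -1).max(axis=1) / 127.0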
Tensor-Splitting
For matrix-multiply-style operators such as convolution, tensor splitting divides the model's weight matrix into a couple of operations over sub-matrices, each quantized separately for optimal quantization parameters. Layers are selected for splitting by analyzing the weights with the widest distributions, often prioritizing those with more significant contributions to output variance. This targeted approach enhances precision and efficiency in critical areas, improving quantized accuracy.
Dynamic Quantization¶
Dynamic quantization does not require the calibration step. Not all operators are dynamically quantized; only convolutional and matrix-multiply layers are transformed.
Using Checkpoints
Quantization with Forge is an in-place transformation of the model. A checkpoint can be created by making an in-memory copy of an IRModule or by saving it to disk. These checkpoints let a user quickly return to the model's pre-quantized state. The example code in this guide uses copying.
Copying (Deep Copy)
ir_a = ir.copy()
ir_b = ir.copy()
Saving-to-Disk (Pickling)
import pickle
pickle.dump(ir, open("ir.pkl", "wb"))
ir_a = pickle.load(open("ir.pkl", "rb"))
ir_b = pickle.load(open("ir.pkl", "rb"))
Example Code¶
# dynamic quantization w/ defaults
ir_a = ir.copy()
ir_a.quantize(quant_type="dynamic")
# dynamic quantization w/ per-channel quantization of weights
ir_b = ir.copy()
ir_b.quantize(activation_dtype="int8", per_channel=True, quant_type="dynamic")
# dynamic quantization w/ tensor-splitting
ir_c = ir.copy()
ir_c.split_tensors()
ir_c.quantize(quant_type="dynamic")
# dynamic quantization is the default when the IRModule is not calibrated
ir_d = ir.copy()
assert not ir_d.is_calibrated
ir_d.quantize(quant_type="any")
# setting of quantization datatypes
ir_e = ir.copy()
ir_e.quantize(activation_dtype="uint8", kernel_dtype="int8")
Static Quantization¶
Static quantization looks very similar to dynamic quantization, apart from an extra calibration step prior to quantization.
Calibration¶
For static quantization, users must use a calibration dataset—a representative sample of input data—to collect statistics from the model's intermediate layers, ensuring it mirrors the expected distributions in the test environment. The objective of calibration is to maximize model accuracy retention and minimize information loss.
For calibration, just provide an iterable of pre-processed data.
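For example, for a hypothetical single-input image model, the iterable could be a small list of pre-processed numpy arrays (the shape here is a placeholder; check IRModule.input_shapes for your model, and use real representative inputs in practice):

import numpy as np

# placeholder calibration set: in practice, use real pre-processed inputs matching IRModule.input_shapes
calib_samples = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(32)]

ir_cal = ir.copy()
ir_cal.calibrate(calib_samples)
assert ir_cal.is_calibrated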
IRModule Calibrate Method Docstring

forge.IRModule.calibrate(calib_data, reset=True, use_cuda=True)¶

Calibrates the model by tracking intermediate layer statistics.

This method collects statistics from intermediate layers of the model using the provided calibration dataset. These statistics are used to derive quantization parameters in a subsequent quantization step. It is essential that the calibration data is representative of the model's expected real-world inputs.

Note: Ensure that the calibration data is in the form of numpy arrays and has undergone the necessary pre-processing steps required for the model.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
calib_data | Iterable[Any] | An iterable of data samples for calibration. The samples should be in a format compatible with the model's input requirements; inspect IRModule.input_shapes for the expected shapes. | required |
reset | bool | If True, any previous calibration data is cleared before new data is processed. Defaults to True. This argument is always treated as True when a TensorRT-partitioned model is detected. | True |
use_cuda | bool | If True, Forge will use CUDA devices for calibration when GPUs are found, falling back to CPU otherwise. Default is True. | True |

Returns:

Name | Type | Description |
---|---|---|
None | None | This method operates in place. |

Raises:

Type | Description |
---|---|
ValueError | If |
Calibration Data
Selecting the right calibration data often leans more towards an art than a science, requiring intuition and experience to balance representativeness and diversity, as no precise formula guarantees the perfect sample. When choosing a calibration dataset, ensure it's diverse enough to represent various scenarios, sufficiently large to capture the range of expected inputs, and closely reflects real-world data distribution to guarantee the quantized model retains its accuracy and reliability.
The number of samples needed for calibration can greatly vary, sometimes as few as one for simple models, and reaching into the thousands for complex scenarios to ensure robust and accurate performance.
Calibration Method
Calibration methods help determine the range of values to be mapped during quantization by establishing the scale and zero-point that best fit the data distribution. Minmax uses the minimum and maximum values of the data to set the range. Average considers the average value, often to reduce the impact of outliers. Percentile uses specific percentiles (the 1st and 99th) instead of the absolute min and max, offering a balance between sensitivity to outliers and data representation. Entropy is a distribution-based approach that uses activation histograms to calculate optimal scaling factors that maximize the entropy of the quantized values. Each method has its strengths, and the optimal method depends on the specific distribution and nature of the data being quantized.
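To make the differences concrete, here is a small numpy illustration (conceptual only, not Forge's implementation) of the ranges the "minmax", "percentile", and "average" methods would derive from observed activations:

import numpy as np

acts = np.random.randn(8, 1024).astype(np.float32)   # hypothetical activations from 8 calibration batches

lo_minmax, hi_minmax = acts.min(), acts.max()         # "minmax": absolute extremes
lo_pct, hi_pct = np.percentile(acts, [1, 99])         # "percentile": 1st/99th cut-offs, trims outliers
lo_avg = np.mean(acts.min(axis=1))                    # "average": mean of the per-batch minima ...
hi_avg = np.mean(acts.max(axis=1))                    # ... and of the per-batch maxima
# ("entropy" instead searches for the range that maximizes the entropy of the quantized distribution)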
Calibration Method Selection
The calibration method is selected via the calib_method argument to quantization (see the forge.IRModule.quantize docstring above).
Example Code¶
The examples below show how checkpoints can be used to save calibrated statistics and reuse them across different quantization settings.
Tensor-Splitting and Calibration
When running static quantization with tensor-splitting, the model must be re-calibrated: tensor-splitting changes the graph structure, and the newly inserted layers require calibration statistics. To statically quantize with tensor-splitting, first run the tensor-splitting procedure, then calibrate, and finally quantize as normal.
Calibrate Once—Quantize Many
# calibration
ir_f = ir.copy()
ir_f.calibrate(rep_data_0)
# multiple quantization strategies w/ same calibration data
ir_f0 = ir_f.copy()
ir_f0.quantize(activation_dtype="int8", quant_type="static")
ir_f1 = ir_f.copy()
ir_f1.quantize(activation_dtype="uint8", quant_type="static")
ir_f2 = ir_f.copy()
ir_f2.quantize(activation_dtype="int8", per_channel=True) # default `quant_type` auto-detects calibration and is "static"
ir_f3 = ir_f.copy()
ir_f3.quantize(calib_method="minmax")
ir_f4 = ir_f.copy()
ir_f4.split_tensors(force=True)
ir_f4.calibrate(rep_data_0)
ir_f4.quantize(calib_method="percentile")
Calibrate Checkpoints
# first calibration
ir_g0 = ir.copy()
ir_g0.calibrate(rep_data_0)
# continue calibration with more data (no reset)
ir_g1 = ir_g0.copy()
ir_g1.calibrate(rep_data_1, reset=False)
ir_g2 = ir_g1.copy()
ir_g2.calibrate(rep_data_2, reset=False)
# quantization with single quantization strategy and different snapshots of calibration statistics
ir_g0.quantize()
ir_g1.quantize()
ir_g2.quantize()
Quantization Type
By default, the quantization type (quant_type) is set to "any". With "any", Forge automatically runs "static" quantization if the IRModule is already calibrated; otherwise, the quantization procedure falls back to "dynamic" quantization.
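A short sketch of how "any" resolves, reusing the copy/calibrate pattern from the examples above (rep_data_0 as before):

# not calibrated: "any" falls back to dynamic quantization
ir_any0 = ir.copy()
assert not ir_any0.is_calibrated
ir_any0.quantize()  # behaves like quant_type="dynamic" here

# calibrated: "any" prioritizes static quantization
ir_any1 = ir.copy()
ir_any1.calibrate(rep_data_0)
ir_any1.quantize()  # behaves like quant_type="static" here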
Calibration and Quantization Flags
The following IRModule properties are useful for determining whether an IRModule is calibrated or quantized, and with what settings it was quantized.
forge.IRModule.is_calibrated property¶
Flag to check whether the IRModule is calibrated.

forge.IRModule.is_quantized property¶
Flag to check whether the IRModule is quantized.

forge.IRModule.quantization_settings property¶
The applied quantization strategy settings (if applicable).
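For instance (a sketch; the exact structure of quantization_settings is not specified here), these properties can be used to verify a checkpoint's state:

ir_chk = ir.copy()
assert not ir_chk.is_calibrated and not ir_chk.is_quantized
ir_chk.quantize(quant_type="dynamic")   # dynamic quantization needs no calibration
assert ir_chk.is_quantized
print(ir_chk.quantization_settings)     # inspect the applied quantization strategy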
Quantization with TensorRT¶
To quantize a model for compilation with TensorRT, the steps differ slightly from those described in "Quantization with Forge". TensorRT supports only one quantization strategy: the "int8" data type, per-channel quantization, and the "entropy" calibration method.
TensorRT Quantization Steps

1. Partition the model for TensorRT.

ir_trt = ir.copy()
ir_trt.partition_for_tensorrt()

2. Calibrate the partitioned model. The reset argument is always treated as True for TensorRT-partitioned graphs, so each call to calibrate overwrites the previous calibration statistics.

ir_trt.calibrate(rep_data_0)
ir_trt.calibrate(rep_data_1) # overwrites calibration statistics
ir_trt.calibrate(rep_data_2, reset=False) # flag is ignored, still overwrites calibration stats

3. Compile for the CUDA target.

ir_trt.compile(target="cuda")
Tensor Splitting with TensorRT
Tensor-splitting can be applied to TensorRT quantization. Simply run the tensor-splitting procedure before partitioning and calibration.
ir.split_tensors()
ir.partition_for_tensorrt()
ir.calibrate(rep_data)
ir.compile(target="cuda")