Calibration and Quantization¶
Model quantization reduces the precision of a neural network's parameters to shrink its size and speed up inference. It converts floating-point numbers into lower-precision formats such as 8-bit integer (INT8), making the model more efficient to deploy on devices with limited resources.
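As a conceptual sketch only (this illustrates the general idea of affine INT8 quantization, not Forge's internal implementation), a float tensor is mapped to INT8 with a scale and zero-point derived from its observed range:

import numpy as np

x = np.random.randn(4, 8).astype(np.float32)          # hypothetical float activations

# map the observed [min, max] range onto the INT8 range [-128, 127]
scale = float(x.max() - x.min()) / 255.0
zero_point = np.round(-128.0 - x.min() / scale)
x_q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)

# dequantize to see the (lossy) reconstruction
x_dq = (x_q.astype(np.float32) - zero_point) * scale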
Note
Forge only supports post-training quantization; it does not support quantization-aware training.
Load an IRModule
import forge
import onnx
onnx_model = onnx.load("path/to/model.onnx")
ir = forge.from_onnx(onnx_model)
Quantization with Forge¶
Quantization with TensorRT
If compiling for TensorRT, skip ahead to the section "Quantization with TensorRT" below.
Quantize Method Docstring

forge.IRModule.quantize(activation_dtype='int8', kernel_dtype=None, bias_dtype=None, per_channel=False, calib_method='average', quant_type='any')¶

Applies quantization to the model with specified parameters.

This method quantizes the model's activations, kernels, and biases to the specified data types. If kernel_dtype and bias_dtype are None, they default to the activation_dtype. The quantization can be "static" (requiring prior calibration), "dynamic" (no calibration needed), or "any" (prioritizing static if possible).

Note: When using the "static" quant_type, ensure calibration is performed beforehand or provide calib_data for calibration. If split_tensors is enabled, existing calibration data is discarded due to graph changes, necessitating fresh calib_data.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
activation_dtype | str | Data type for activations ("int8", "uint8"); default is "int8". | 'int8' |
kernel_dtype | Optional[str] | Data type for kernels ("int8", "uint8"); defaults to activation_dtype. | None |
bias_dtype | Optional[str] | Data type for biases; defaults to activation_dtype. | None |
per_channel | bool | If True, performs per-channel quantization on kernels. Default is False. | False |
calib_method | str | Method for calibration ("average", "entropy", "minmax", "percentile"); default is "average". Overview of calibration methods: "average" - computed average of the min-max extrema across calibration data; "entropy" - distribution-based maximization of the entropy of quantized values; "minmax" - absolute most extreme min-max values across calibration data; "percentile" - computed 99th-percentile cut-offs across calibration data. | 'average' |
quant_type | str | Type of quantization ("static", "dynamic", "any"); default is "any". | 'any' |

Returns:

Name | Type | Description |
---|---|---|
None | None | This method operates in place. |
There are two ways to apply quantization to a model, dynamic and static, selected via the quant_type argument.
Dynamic quantization skips the calibration and only quantizes the weights of the model in advance. The activations are quantized on-the-fly during inference. This method doesn't require a representative dataset for calibration and offers more flexibility, as it adapts to the actual range of activation values seen at runtime. While dynamic quantization is simpler and faster to apply, it might not optimize computational performance as effectively as static quantization.
Static quantization involves two steps: calibration and then quantization. During calibration, the model runs on a subset of the dataset to gather statistics on how its weights and activations typically distribute. This step determines the optimal way to map these values to a lower precision. Once calibration is complete, the actual quantization step converts the model's weights and activations to this lower precision format, effectively reducing the model's size and potentially speeding up its inference.
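As a minimal sketch of the two flows (rep_data is a hypothetical iterable of pre-processed inputs; see Calibration below):

# dynamic: weights are quantized ahead of time, activations are handled at runtime
ir_dyn = ir.copy()
ir_dyn.quantize(quant_type="dynamic")

# static: calibrate on representative data first, then quantize
ir_stat = ir.copy()
ir_stat.calibrate(rep_data)
ir_stat.quantize(quant_type="static")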
Quantization Data Types¶
A user can specify the quantized data types of three elements: activation_dtype, kernel_dtype, and bias_dtype.
Activation (activation_dtype)
The data type for the activation can either be "int8" or "uint8". Generally, "int8" is preferred for optimal computational performance, while "uint8" is often used to minimize accuracy loss during quantization.
Convolution & Matrix-Multiply Weight (kernel_dtype)
The data type of the weights for convolutional and matrix-multiply operators can be either "int8" or "uint8". Depending on the target, the compiler sometimes accepts (and may prefer) specific data types.
Bias-Add Weight (bias_dtype)
The data type of the weights for bias-add operators can either match the activation data type or be set to "int32". An 8-bit data type may be preferred for optimal computational performance, while 32 bits ("int32") can be used to minimize accuracy loss during quantization.
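A sketch combining the three settings (whether a particular combination is accepted may depend on the compilation target, as noted above):

ir_dtypes = ir.copy()
ir_dtypes.quantize(
    activation_dtype="uint8",  # often used to minimize accuracy loss
    kernel_dtype="int8",       # weight dtype for conv / matmul operators
    bias_dtype="int32",        # wider biases to minimize accuracy loss
    quant_type="dynamic",
)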
Boosting Quantization Accuracy¶
Forge provides two optimizations for improving a quantized model's accuracy. Each optimization can reduce accuracy loss during quantization, but each comes at a cost in computational performance.
Per-Channel Quantization (per_channel)
Per-channel quantization is a technique where each channel of a model's weights is quantized independently, allowing for more fine-grained control over the quantization parameters. Because it accounts for the different distributions across channels, this approach often preserves the model's accuracy better than per-tensor quantization, which applies a single scale and zero-point to the entire weight tensor.
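Conceptually (a numpy illustration only, not Forge internals), per-channel quantization computes one scale per output channel of a weight tensor instead of a single scale for the whole tensor:

import numpy as np

w = np.random.randn(64, 3, 3, 3).astype(np.float32)   # hypothetical conv weights, 64 output channels

# per-tensor: one symmetric scale for the whole tensor
scale_per_tensor = np.abs(w).max() / 127.0

# per-channel: one symmetric scale per output channel, shape (64,)
scale_per_channel = np.abs(w).reshape(64, -1).max(axis=1) / 127.0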
Tensor-Splitting
For matrix-multiply-style operators such as convolution, tensor splitting divides the model's weight matrix into a couple of operations over sub-matrices, each quantized separately for optimal quantization parameters. Layers are selected for splitting by analyzing the weights with the widest distributions, often prioritizing those with more significant contributions to output variance. This targeted approach enhances precision and efficiency in critical areas, improving quantized accuracy.
Dynamic Quantization¶
Dynamic quantization does not require the calibration step. Not all operators are dynamically quantized; only convolutional and matrix-multiply layers are transformed.
Using Checkpoints
Quantization with Forge is an in-place transformation of the model. A checkpoint can be created by making an in-memory copy of an IRModule or by saving it to disk. These checkpoints let a user quickly return to the model's pre-quantized state. The example code in this guide uses copying.
Copying (Deep Copy)
ir_a = ir.copy()
ir_b = ir.copy()
Saving-to-Disk (Pickling)
import pickle
pickle.dump(ir, open("ir.pkl", "wb"))
ir_a = pickle.load(open("ir.pkl", "rb"))
ir_b = pickle.load(open("ir.pkl", "rb"))
Example Code¶
# dynamic quantization w/ defaults
ir_a = ir.copy()
ir_a.quantize(quant_type="dynamic")
# dynamic quantization w/ per-channel quantization of weights
ir_b = ir.copy()
ir_b.quantize(activation_dtype="int8", per_channel=True, quant_type="dynamic")
# dynamic quantization w/ tensor-splitting
ir_c = ir.copy()
ir_c.split_tensors()
ir_c.quantize(quant_type="dynamic")
# dynamic quantization is the default when the IRModule is not calibrated
ir_d = ir.copy()
assert not ir_d.is_calibrated
ir_d.quantize(quant_type="any")
# setting of quantization datatypes
ir_e = ir.copy()
ir_e.quantize(activation_dtype="uint8", kernel_dtype="int8")
Static Quantization¶
Static quantization looks very similar to dynamic quantization, apart from an extra calibration step prior to quantization.
Calibration¶
For static quantization, users must use a calibration dataset—a representative sample of input data—to collect statistics from the model's intermediate layers, ensuring it mirrors the expected distributions in the test environment. The objective of calibration is to maximize model accuracy retention and minimize information loss.
For calibration, just provide an iterable of pre-processed data.
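For example, for a hypothetical single-input image model, the iterable could be a small list of pre-processed numpy arrays (the shape here is a placeholder; check IRModule.input_shapes for your model, and use real representative inputs in practice):

import numpy as np

# placeholder calibration set: in practice, use real pre-processed inputs matching IRModule.input_shapes
calib_samples = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(32)]

ir_cal = ir.copy()
ir_cal.calibrate(calib_samples)
assert ir_cal.is_calibrated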
IRModule Calibrate Method Docstring

forge.IRModule.calibrate(calib_data, reset=True, use_cuda=True)¶

Calibrates the model by tracking intermediate layer statistics.

This method collects statistics from intermediate layers of the model using the provided calibration dataset. These statistics are used to derive quantization parameters in a subsequent quantization step. It is essential that the calibration data is representative of the model's expected real-world inputs.

Note: Ensure that the calibration data is in the form of numpy arrays and has undergone the necessary pre-processing steps required for the model.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
calib_data | Iterable[Any] | An iterable of data samples for calibration. The samples should be in a format compatible with the model's input requirements; inspect IRModule.input_shapes for the expected shapes. | required |
reset | bool | If True, any previous calibration data is cleared before new data is processed. Defaults to True. This argument is always treated as True when a TensorRT-partitioned model is detected. | True |
use_cuda | bool | If True, Forge will use CUDA devices for calibration when GPUs are found, falling back to CPU otherwise. Default is True. | True |

Returns:

Name | Type | Description |
---|---|---|
None | None | This method operates in place. |

Raises:

Type | Description |
---|---|
ValueError | If |
Calibration Data
Selecting the right calibration data often leans more towards an art than a science, requiring intuition and experience to balance representativeness and diversity, as no precise formula guarantees the perfect sample. When choosing a calibration dataset, ensure it's diverse enough to represent various scenarios, sufficiently large to capture the range of expected inputs, and closely reflects real-world data distribution to guarantee the quantized model retains its accuracy and reliability.
The number of samples needed for calibration can greatly vary, sometimes as few as one for simple models, and reaching into the thousands for complex scenarios to ensure robust and accurate performance.
Calibration Method
Calibration methods help determine the range of values to be mapped during quantization by establishing the scale and zero-point that best fit the data distribution. Minmax uses the minimum and maximum values of the data to set the range. Average considers the average value, often to reduce the impact of outliers. Percentile uses specific percentiles (the 1st and 99th) instead of the absolute min and max, offering a balance between sensitivity to outliers and data representation. Entropy is a distribution-based approach that uses activation histograms to calculate optimal scaling factors that maximize the entropy of the quantized values. Each method has its strengths, and the optimal method depends on the specific distribution and nature of the data being quantized.
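To make the differences concrete, here is a small numpy illustration (conceptual only, not Forge's implementation) of the ranges the "minmax", "percentile", and "average" methods would derive from observed activations:

import numpy as np

acts = np.random.randn(8, 1024).astype(np.float32)   # hypothetical activations from 8 calibration batches

lo_minmax, hi_minmax = acts.min(), acts.max()         # "minmax": absolute extremes
lo_pct, hi_pct = np.percentile(acts, [1, 99])         # "percentile": 1st/99th cut-offs, trims outliers
lo_avg = np.mean(acts.min(axis=1))                    # "average": mean of the per-batch minima ...
hi_avg = np.mean(acts.max(axis=1))                    # ... and of the per-batch maxima
# ("entropy" instead searches for the range that maximizes the entropy of the quantized distribution)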
Calibration Method Selection
The calibration method is selected via the calib_method argument to quantization (see the forge.IRModule.quantize docstring above).
Example Code¶
The examples below show how checkpoints can be used to save calibrated statistics and reuse them across different quantization settings.
Tensor-Splitting and Calibration
When running static quantization with tensor-splitting, the model must be re-calibrated: tensor-splitting changes the graph structure, and the newly inserted layers require calibration statistics. To statically quantize with tensor-splitting, first run the tensor-splitting procedure, then calibrate, and finally quantize as normal.
Calibrate Once—Quantize Many
# calibration
ir_f = ir.copy()
ir_f.calibrate(rep_data_0)
# multiple quantization strategies w/ same calibration data
ir_f0 = ir_f.copy()
ir_f0.quantize(activation_dtype="int8", quant_type="static")
ir_f1 = ir_f.copy()
ir_f1.quantize(activation_dtype="uint8", quant_type="static")
ir_f2 = ir_f.copy()
ir_f2.quantize(activation_dtype="int8", per_channel=True) # default `quant_type` auto-detects calibration and is "static"
ir_f3 = ir_f.copy()
ir_f3.quantize(calib_method="minmax")
ir_f4 = ir_f.copy()
ir_f4.split_tensors(force=True)
ir_f4.calibrate(rep_data_0)
ir_f4.quantize(calib_method="percentile")
Calibrate Checkpoints
# first calibration
ir_g0 = ir.copy()
ir_g0.calibrate(rep_data_0)
# continue calibration with more data (no reset)
ir_g1 = ir_g0.copy()
ir_g1.calibrate(rep_data_1, reset=False)
ir_g2 = ir_g1.copy()
ir_g2.calibrate(rep_data_2, reset=False)
# quantization with single quantization strategy and different snapshots of calibration statistics
ir_g0.quantize()
ir_g1.quantize()
ir_g2.quantize()
Quantization Type
By default, the quantization type (quant_type) is set to "any". With "any", Forge automatically runs "static" quantization if the IRModule is already calibrated; otherwise, the quantization procedure falls back to "dynamic" quantization.
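A short sketch of how "any" resolves, reusing the copy/calibrate pattern from the examples above (rep_data_0 as before):

# not calibrated: "any" falls back to dynamic quantization
ir_any0 = ir.copy()
assert not ir_any0.is_calibrated
ir_any0.quantize()  # behaves like quant_type="dynamic" here

# calibrated: "any" prioritizes static quantization
ir_any1 = ir.copy()
ir_any1.calibrate(rep_data_0)
ir_any1.quantize()  # behaves like quant_type="static" here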
Calibration and Quantization Flags
The following IRModule properties are useful for determining whether an IRModule is calibrated or quantized, and with what settings it was quantized.
forge.IRModule.is_calibrated property¶
Flag to check whether the IRModule is calibrated.

forge.IRModule.is_quantized property¶
Flag to check whether the IRModule is quantized.

forge.IRModule.quantization_settings property¶
The applied quantization strategy settings (if applicable).
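For instance (a sketch; the exact structure of quantization_settings is not specified here), these properties can be used to verify a checkpoint's state:

ir_chk = ir.copy()
assert not ir_chk.is_calibrated and not ir_chk.is_quantized
ir_chk.quantize(quant_type="dynamic")   # dynamic quantization needs no calibration
assert ir_chk.is_quantized
print(ir_chk.quantization_settings)     # inspect the applied quantization strategy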
Quantization with TensorRT¶
To quantize a model for compilation with TensorRT, the steps differ slightly from those described in "Quantization with Forge". TensorRT supports only one quantization strategy: the "int8" data type, per-channel quantization, and the "entropy" calibration method.
TensorRT Quantization Steps

1. Partition the model for TensorRT.

ir_trt = ir.copy()
ir_trt.partition_for_tensorrt()

2. Calibrate the partitioned model. The reset argument is always treated as True for TensorRT-partitioned graphs, so each call to calibrate overwrites the previous calibration statistics.

ir_trt.calibrate(rep_data_0)
ir_trt.calibrate(rep_data_1) # overwrites calibration statistics
ir_trt.calibrate(rep_data_2, reset=False) # flag is ignored, still overwrites calibration stats

3. Compile for the CUDA target.

ir_trt.compile(target="cuda")
Tensor Splitting with TensorRT
Tensor-splitting can be applied to TensorRT quantization. Simply run the tensor-splitting procedure before partitioning and calibration.
ir.split_tensors()
ir.partition_for_tensorrt()
ir.calibrate(rep_data)
ir.compile(target="cuda")