LEIP Optimize
LEIP Optimize is a model optimizer that applies post-training quantization algorithms to a model and produces a binary representation for the target specified by the user. The binary is a shared object file, which is loaded into a small runtime for execution.
Internally it consists of two phases:
LEIP Compress
Deep Neural Networks (DNNs) use a large number of parameters to learn. As a result, they have a large memory and compute footprint during runtime inference. These resource demands limit their deployment on edge devices, which have limited memory, size, and power budgets. LEIP Compress provides developers with state-of-the-art quantization optimization to facilitate deployment of edge AI solutions.
Quantization Algorithms
With quantization, LEIP Compress transforms the numerical representation of the DNN parameters from floating point to integers, resulting in a lower memory footprint and faster computation. LEIP supports the following Post-Training Quantization (PTQ) techniques:
Symmetric
First, the maximum absolute value M of the inputs x_f is selected: M = max(|x_f|). The floating-point range that is effectively quantized, [-M, M], is symmetric with respect to zero, as is the quantized range.
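The symmetric scheme can be illustrated with a minimal sketch in plain Python. This is not LEIP's implementation: the function names, the signed 8-bit range [-127, 127], and round-to-nearest are assumptions chosen for illustration.

```python
def symmetric_quantize(values, num_bits=8):
    """Quantize floats to signed integers over a zero-centered range."""
    qmax = 2 ** (num_bits - 1) - 1        # 127 for 8 bits (assumed range)
    m = max(abs(v) for v in values)       # M = max(|x_f|)
    scale = m / qmax if m > 0 else 1.0    # float width of one integer step
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def symmetric_dequantize(quantized, scale):
    """Map integers back to the (approximate) original floats."""
    return [q * scale for q in quantized]


weights = [-1.0, -0.3, 0.0, 0.25, 2.0]
q, scale = symmetric_quantize(weights)
restored = symmetric_dequantize(q, scale)
```

Because the range is centered on zero, the float value 0.0 always maps to the integer 0 and no offset needs to be stored.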
Asymmetric
In asymmetric quantization, the min/max of the float range is mapped to the min/max of the integer range. This is done by using a zero-point (also called quantization bias, or offset) in addition to the scale factor.
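A minimal sketch of the asymmetric mapping follows, again in plain Python. The unsigned 8-bit range [0, 255], the function names, and the rounding of the zero-point are illustrative assumptions, not LEIP's actual code.

```python
def asymmetric_quantize(values, num_bits=8):
    """Map [min, max] of the floats onto [0, 2**num_bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = round(qmin - lo / scale)   # integer that represents float min
    quantized = [
        max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def asymmetric_dequantize(quantized, scale, zero_point):
    """Undo the offset, then rescale back to floats."""
    return [(q - zero_point) * scale for q in quantized]


activations = [-1.0, 0.0, 0.5, 3.0]
q, scale, zero_point = asymmetric_quantize(activations)
restored = asymmetric_dequantize(q, scale, zero_point)
```

Compared with the symmetric scheme, no integer levels are wasted when the float range is skewed to one side of zero, at the cost of storing and applying the zero-point.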
Powers-of-Two
Very often the weights are distributed around zero, so that their histogram is bell-shaped (approximately a normal distribution). In that case, higher resolution is desirable for values closer to zero, which can be achieved with a logarithmic scale. More precisely, the weights are encoded using the power-of-two formula quantized = sign(w) * 2**round(log2(|w|)), i.e. the sign of w times two raised to the rounded base-2 logarithm of |w|.
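The formula can be tried out with a short Python sketch. It assumes a base-2 logarithm and keeps zero weights at zero (the logarithm of zero is undefined); both choices, and the function name, are assumptions for illustration.

```python
import math


def power_of_two_quantize(weights):
    """Snap each nonzero weight to sign(w) * 2**round(log2(|w|))."""
    result = []
    for w in weights:
        if w == 0.0:
            result.append(0.0)  # assumption: zero stays zero
            continue
        exponent = round(math.log2(abs(w)))
        result.append(math.copysign(2.0 ** exponent, w))
    return result


snapped = power_of_two_quantize([0.3, -0.3, 1.0, 5.0, 0.0, -0.06])
```

The representable values (..., 0.0625, 0.125, 0.25, 0.5, 1, 2, 4, ...) are densest near zero, which is where most of the weights of a bell-shaped distribution lie.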
Per-Channel
This quantization is used when the standard symmetric, asymmetric, and powers-of-two algorithms fail to achieve the required level of accuracy. This can happen when a resolution of 256 values is not sufficient to encode the behavior of the network. Convolutional and dense layers consist of a significant number of channels. Instead of quantizing all of them in bulk with a single set of quantization parameters, per-channel quantization quantizes each channel separately, which can improve accuracy.
Note that at this time, the per-channel quantizers are only available when casting to integer types. Support for the full LEIP workflow will be added soon.
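The difference between quantizing in bulk and quantizing per channel can be sketched as follows. The example uses a symmetric 8-bit scheme and treats each inner list as one channel; the helper names and the data are hypothetical and do not reflect LEIP internals.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` to qmax."""
    m = max(abs(v) for v in values)
    return m / qmax if m > 0 else 1.0


def quantize(values, scale, qmax=127):
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def max_error(values, quantized, scale):
    """Largest absolute difference after dequantizing."""
    return max(abs(v - q * scale) for v, q in zip(values, quantized))


# weights[c] holds the (flattened) weights of channel c
weights = [
    [0.012, -0.02, 0.015],  # small-magnitude channel
    [1.5, -2.0, 0.75],      # large-magnitude channel
]

# Bulk (per-tensor): one scale shared by every channel
tensor_scale = symmetric_scale([v for ch in weights for v in ch])
per_tensor = [quantize(ch, tensor_scale) for ch in weights]

# Per-channel: each channel gets its own scale
channel_scales = [symmetric_scale(ch) for ch in weights]
per_channel = [quantize(ch, s) for ch, s in zip(weights, channel_scales)]

bulk_err = max_error(weights[0], per_tensor[0], tensor_scale)
chan_err = max_error(weights[0], per_channel[0], channel_scales[0])
```

With a shared scale, the small-magnitude channel collapses onto a handful of integer levels. With its own scale, it uses the full integer range, so `chan_err` is much smaller than `bulk_err`.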
The efficacy of the quantization techniques depends heavily on the model and training dataset, and LEIP Compress provides the ability to explore their use, from the simplest to the most complex. The core idea of these quantization algorithms is to analyze the distribution of floating-point values and map them to integer values while minimizing the loss in overall accuracy.
Optimizations
In addition to the quantization algorithms offered by LEIP Compress, there are two other optimization techniques that can be used in conjunction with the process of casting to integer. Depending on the type of model and the specific distributions of values for its parameters, these additional optimizations may or may not benefit the overall accuracy after quantization. In some cases, however, they can bring the accuracy to within a small margin of the original baseline.
Tensor Splitting
The LEIP SDK supports a quantization technique called Tensor Splitting, where tensors are decomposed into sub-tensors so that each sub-tensor can be compressed separately, with a compression ratio better suited to it.
The algorithm provides a flow to automatically determine the layers whose tensors should be split, using a predefined heuristic.
To try out this optimization, simply add --compress_optimization tensor_splitting to the leip optimize command (and set --data_type to an integer), as shown in the example in the Using LEIP document.
Depending on the size of the model and its layers, the Tensor Splitting optimization pass could take several minutes.
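The general idea of decomposing a tensor into separately quantized sub-tensors can be illustrated with a small Python sketch. The source does not describe LEIP's splitting heuristic, so the magnitude threshold used here, and every name in the code, is a hypothetical stand-in rather than the actual algorithm.

```python
def quantize_dequantize(values, qmax=127):
    """Symmetric quantization followed by dequantization."""
    m = max(abs(v) for v in values)
    if m == 0:
        return list(values)
    scale = m / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def split_by_magnitude(values, threshold):
    """Decompose values so that values == inliers + outliers elementwise."""
    inliers = [v if abs(v) <= threshold else 0.0 for v in values]
    outliers = [v if abs(v) > threshold else 0.0 for v in values]
    return inliers, outliers


def max_error(reference, approx):
    return max(abs(r - a) for r, a in zip(reference, approx))


weights = [0.011, -0.023, 0.017, -0.008, 4.0]  # one outlier dominates the range
threshold = 1.0                                # hypothetical cut-off

# Quantizing the tensor as a whole: the outlier stretches the scale
whole = quantize_dequantize(weights)

# Quantizing each sub-tensor with its own scale, then recombining
inliers, outliers = split_by_magnitude(weights, threshold)
split = [
    a + b
    for a, b in zip(quantize_dequantize(inliers), quantize_dequantize(outliers))
]

whole_err = max_error(weights, whole)
split_err = max_error(weights, split)
```

In this toy example the small weights are no longer forced onto a scale set by the single large value, so `split_err` is far below `whole_err`.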
Bias Correction
The LEIP SDK supports a quantization technique called Bias Correction. Generally, quantization introduces a biased error in the output activations. Bias Correction will calibrate the model and adjust the biases to reduce this error. In some cases, this optimization will significantly improve the model’s performance.
To try out Bias Correction, simply add --compress_optimization bias_correction (and set --data_type to an integer), as shown in the example in the LEIP Introduction document.
Depending on the size of the model and its layers, the Bias Correction optimization pass could take several minutes.
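One common way to implement this kind of correction is to measure the mean shift in a layer's output on calibration data and subtract it from the bias. The sketch below does this for a single dense layer in plain Python. It is an illustration of the general technique under that assumption, not a description of LEIP's calibration procedure, and all names are hypothetical.

```python
def fake_quantize(matrix, qmax=127):
    """Symmetric quantize-then-dequantize of a weight matrix."""
    m = max(abs(w) for row in matrix for w in row)
    scale = m / qmax if m > 0 else 1.0
    return [[round(w / scale) * scale for w in row] for row in matrix]


def dense(weights, bias, x):
    """y = W x + b for one input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def mean_output(weights, bias, inputs):
    """Per-output mean of the layer over a set of inputs."""
    outputs = [dense(weights, bias, x) for x in inputs]
    return [sum(col) / len(outputs) for col in zip(*outputs)]


def correct_bias(weights, q_weights, bias, calibration_inputs):
    """Subtract the mean output shift caused by weight quantization."""
    float_mean = mean_output(weights, bias, calibration_inputs)
    quant_mean = mean_output(q_weights, bias, calibration_inputs)
    return [b - (q - f) for b, q, f in zip(bias, quant_mean, float_mean)]


weights = [[0.31, -0.47, 0.92], [0.05, 0.66, -0.13]]
bias = [0.1, -0.2]
calibration_inputs = [[1.0, 2.0, 3.0], [0.5, 1.5, 2.5], [2.0, 1.0, 0.5]]

q_weights = fake_quantize(weights, qmax=7)  # coarse grid makes the shift visible
corrected_bias = correct_bias(weights, q_weights, bias, calibration_inputs)
```

After the correction, the quantized layer's mean output over the calibration set matches that of the float layer. Individual outputs can still differ, which is why the benefit depends on the model and data.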
The tensor_splitting and bias_correction optimizations can be cascaded together by specifying --optimization tensor_splitting,bias_correction.
Target Data Type
LEIP Optimize provides two types of outputs based on the --data_type argument:
When --data_type=float32, only weights are quantized, and the output model has float32 dequantized weights.
When --data_type=int8 or --data_type=uint8, both weights and operations are quantized, and the output model has (u)int8 weights.
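The difference in how the weights end up being stored can be illustrated with Python's standard `array` module. This sketch covers weight storage only and assumes a symmetric 8-bit scheme; it says nothing about how LEIP quantizes operations, and the variable names are hypothetical.

```python
from array import array

weights = [0.42, -1.3, 0.07, 0.9]
qmax = 127
scale = max(abs(w) for w in weights) / qmax
levels = [round(w / scale) for w in weights]  # integer quantization levels

# int8-style output: the integer levels themselves are stored (1 byte each)
int8_weights = array("b", levels)

# float32-style output: levels are dequantized and stored as floats (4 bytes each)
float32_weights = array("f", (q * scale for q in levels))
```

In the float32 case the values sit on the quantization grid but still occupy four bytes each, so the output shows the accuracy effect of quantization without the memory savings.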
LEIP Compile
This phase also has an independent command, leip compile, and is described in LEIP Compile.
CLI Usage
The basic command is:
leip optimize --input_path inputModel/ \
--output_path optimizedModel/ \
--rep_dataset rep_dataset.txt
You may also use our legacy quantizer by adding the --use_legacy_quantizer option:
leip optimize --input_path inputModel/ \
--output_path optimizedModel/ \
--rep_dataset rep_dataset.txt \
--use_legacy_quantizer true
For a detailed explanation of each option see the CLI Reference for LEIP Optimize.