LEIP Optimize
LEIP Optimize is a state-of-the-art model optimizer that applies post-training quantization algorithms to a model and produces a binary representation for the target specified by the user. The binary is a shared object file that is loaded into a small runtime for execution.
Internally it consists of two phases:
LEIP Compress
Deep Neural Networks (DNNs) use a large number of parameters to learn. As a result, they have a large memory and compute footprint during runtime inference. This footprint limits their deployment on edge devices, which have limited memory, size, and power budgets. LEIP Compress provides developers with state-of-the-art quantization optimization to facilitate deployment of edge AI solutions.
Quantization Algorithms
With quantization, LEIP Compress transforms the numerical representation of the DNN parameters from floating point to integers. This results in a lower memory footprint and faster computation. LEIP supports the following Post-Training Quantization (PTQ) techniques:
Symmetric
First, the maximum absolute value of the inputs $x_f$, $M = \max(|x_f|)$, is selected. The floating-point range that is effectively being quantized, $[-M, M]$, is symmetric with respect to zero, as is the quantized range.
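The symmetric mapping can be sketched in plain Python as follows. This is a minimal illustration, not the LEIP implementation: the function names, the signed 8-bit range [-127, 127], and round-to-nearest are all assumptions made for the example.

```python
def symmetric_quantize(values, num_bits=8):
    """Map floats in [-M, M] to signed integers in [-qmax, qmax]."""
    qmax = 2 ** (num_bits - 1) - 1           # 127 for 8 bits (assumed range)
    m = max(abs(v) for v in values)          # M = max(|x_f|)
    scale = m / qmax if m > 0 else 1.0       # float width of one integer step
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def symmetric_dequantize(quantized, scale):
    """Recover approximate floats; integer 0 maps back to exactly 0.0."""
    return [q * scale for q in quantized]


weights = [-1.0, -0.3, 0.0, 0.6]             # hypothetical example values
q, scale = symmetric_quantize(weights)
restored = symmetric_dequantize(q, scale)
```

Because the range is centered on zero, only a scale factor needs to be stored alongside the integers.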
Asymmetric
In asymmetric quantization, the min/max of the float range is mapped to the min/max of the integer range. This is performed by using a zero-point (also called quantization bias, or offset) in addition to the scale factor.
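A minimal sketch of the asymmetric mapping, in plain Python, is shown here. The function names, the unsigned 8-bit range [0, 255], and the rounding choices are assumptions for illustration and are not taken from the LEIP implementation.

```python
def asymmetric_quantize(values, num_bits=8):
    """Map floats in [min, max] to unsigned integers in [0, 2**num_bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1        # assumed uint8 range
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = round(qmin - lo / scale)    # integer that float 0.0 maps to
    quantized = [
        max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def asymmetric_dequantize(quantized, scale, zero_point):
    """Undo the offset, then rescale back to floats."""
    return [(q - zero_point) * scale for q in quantized]


activations = [-1.0, 0.0, 0.5, 3.0]          # hypothetical, skewed toward positive
q, scale, zero_point = asymmetric_quantize(activations)
restored = asymmetric_dequantize(q, scale, zero_point)
```

For a skewed range like this one, the zero-point lets the full integer range cover [min, max] instead of spending half of it on negative values that never occur.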
Powers-of-Two
Very often the weights are distributed around zero, so their histogram is bell-shaped (approximately a normal distribution). Higher resolution is therefore desirable for values closer to zero, which can be achieved with a logarithmic scale. More precisely, the weights can be encoded with the power-of-two formula $\mathrm{quantized} = \operatorname{sign}(w) \cdot 2^{\operatorname{round}(\log_2 |w|)}$ (that is, two to the power of the rounded base-2 logarithm of the magnitude).
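The formula can be tried out directly in plain Python. This sketch assumes a base-2 logarithm and treats zero as a special case (its logarithm is undefined); the function name and sample weights are hypothetical, and a real implementation would also store and clip the exponent as a small integer.

```python
import math


def pow2_quantize(w):
    """Return sign(w) * 2**round(log2|w|); zero is passed through unchanged."""
    if w == 0.0:
        return 0.0
    exponent = round(math.log2(abs(w)))      # nearest integer exponent
    return math.copysign(2.0 ** exponent, w)


weights = [0.3, -0.12, 0.9, 0.0, -0.03]      # hypothetical example values
quantized = [pow2_quantize(w) for w in weights]
```

The representable values (..., 0.03125, 0.0625, 0.125, 0.25, 0.5, 1.0, ...) are packed more densely near zero, which is where most weights fall.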
Per-Channel
This quantization is used when the standard symmetric, asymmetric, and powers-of-two algorithms fail to achieve an acceptable level of accuracy. This can happen when a resolution of 256 values (8 bits) is not sufficient to encode the behavior of the network. Convolutional and dense layers consist of a significant number of channels. Instead of quantizing all of them in bulk with a single set of quantization parameters, the per-channel algorithm quantizes each channel separately, which can improve accuracy.
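The benefit of separate per-channel parameters can be illustrated with a small plain-Python comparison. This is a sketch under assumptions: it uses symmetric int8 scaling for both cases, and the helper names and weight values are hypothetical rather than taken from LEIP.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude onto qmax."""
    m = max(abs(v) for v in values)
    return m / qmax if m > 0 else 1.0


def quantize_dequantize(values, scale, qmax=127):
    """Round-trip through int8 so the quantization error becomes visible."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical weights: one row per output channel.
channels = [
    [0.011, -0.02, 0.007],                   # small-magnitude channel
    [5.0, -3.2, 1.7],                        # large-magnitude channel
]

# Per-tensor: a single scale shared by every channel.
tensor_scale = symmetric_scale([w for ch in channels for w in ch])
per_tensor_error = [
    max_error(ch, quantize_dequantize(ch, tensor_scale)) for ch in channels
]

# Per-channel: each channel gets its own scale.
per_channel_error = [
    max_error(ch, quantize_dequantize(ch, symmetric_scale(ch))) for ch in channels
]
```

With a shared scale, the large channel dictates the step size and the small channel collapses onto one or two integer levels; with its own scale, the small channel uses the full integer range.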
The efficacy of the quantization techniques depends heavily on the model and training dataset, and LEIP Compress provides the ability to explore their use, from the simplest to the more complex ones. The core idea of these quantization algorithms is to analyze the distribution of floating-point values and map them to integer values while minimizing the loss in overall accuracy.
Optimizations
In addition to the quantization algorithms, LEIP Compress offers two other optimization techniques that can be used in conjunction with the cast to integer. Depending on the type of model and the specific distributions of its parameter values, these optimizations may or may not improve the overall accuracy after quantization, but in some cases they can bring it to within a small percentage of the original baseline.
Tensor Splitting
LEIP SDK supports a quantization technique called Tensor Splitting. Tensors are decomposed into sub-tensors to allow for a separate and more optimal compression ratio. The algorithm provides a flow to automatically determine the layers whose tensors should be split using a predefined heuristic.
To try out this optimization, simply add --compress_optimization tensor_splitting to the leip optimize command, as shown in the example in the Using LEIP document. Depending on the size of the model and its layers, the Tensor Splitting optimization pass could take several minutes.
Bias Correction
LEIP SDK supports a quantization technique called Bias Correction. Generally, quantization introduces a biased error in the output activations. Bias Correction will calibrate the model and adjust the biases to reduce this error. In some cases, this optimization will significantly improve the model’s performance.
To try out Bias Correction, simply add --compress_optimization bias_correction as shown in the example in the LEIP Introduction document. Depending on the size of the model and its layers, the Bias Correction optimization pass could take several minutes.
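The general idea behind bias correction can be sketched for a single output neuron in plain Python. The calibration procedure LEIP uses is not described here, so this is only the generic technique under assumptions: the function names, weights, and calibration inputs are hypothetical, and the correction is simply the mean output error measured on the calibration data.

```python
def linear(weights, bias, x):
    """Single output neuron: dot(weights, x) + bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def corrected_bias(float_w, quant_w, bias, calibration_inputs):
    """Shift the bias by the mean output error seen on calibration data."""
    errors = [
        linear(quant_w, bias, x) - linear(float_w, bias, x)
        for x in calibration_inputs
    ]
    return bias - sum(errors) / len(errors)


float_w = [0.42, -0.17, 0.88]                # hypothetical original weights
quant_w = [0.5, -0.125, 1.0]                 # hypothetical weights after quantization
bias = 0.1
calibration = [[1.0, 2.0, 0.5], [0.8, 1.5, 0.7], [1.2, 1.8, 0.4]]

new_bias = corrected_bias(float_w, quant_w, bias, calibration)


def mean_error(candidate_bias):
    """Mean difference between the quantized and the float outputs."""
    diffs = [
        linear(quant_w, candidate_bias, x) - linear(float_w, bias, x)
        for x in calibration
    ]
    return sum(diffs) / len(diffs)
```

Comparing mean_error(bias) with mean_error(new_bias) shows the systematic shift being removed: the per-sample errors remain, but they no longer share a common offset.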
The tensor_splitting and bias_correction optimizations can be cascaded together by specifying --optimization tensor_splitting,bias_correction.
Target Data Type
LEIP Optimize provides two types of outputs based on the --quantizer argument:
When --quantizer=none, the model is only compiled, leaving weights in their original type.
When --quantizer=asymmetric, --quantizer=symmetric, or --quantizer=symmetricpc, both weights and operations are quantized, and the output model has (u)int8 weights.
LEIP Compile
This phase also has an independent command, leip compile, and is described in LEIP Compile.
CLI Usage
The basic command is:
leip optimize --input_path inputModel/ \
--output_path optimizedModel/ \
--rep_dataset rep_dataset.txt
For a detailed explanation of each option, refer to the CLI Reference for LEIP Optimize.