LEIP Compile takes a computational graph (for example, a quantized model from the quantization phase of LEIP Optimize) as input and produces a binary representation (for example, an LRE Object) for the target specified by the user. The binary LRE Object is a shared object file that can be loaded into a small runtime for execution.
The runtime is created through a Python script that can perform any pre- or post-processing of the data and include any other components of the application. It is also possible to bundle the runtime with the neural network model as a single binary.
LEIP Compile can perform several optimizations by manipulating the compute graph that represents the neural network. However, exploring the entire search space of optimizations can be computationally expensive, so by default only standard optimizations are performed.
Although LEIP Compile is capable of generating binaries for multiple processors, the ones fully supported at this time are those based on the x86, NVIDIA, and ARM architectures.
The compiler can generate binaries that use 32-bit floating point, 8-bit integer, and mixed data types. It selects the most suitable data type based on the hardware capabilities of the target architecture.
The basic command is:
$ leip compile --input_path path/to/model \
    --output_path path/to/output
You can specify the desired layout (NCHW or NHWC) using --layout, but only for non-CUDA targets. The default value is
You can use the --target parameter to specify the desired target hardware CodeGen; it defaults to llvm. For CPU targets, you can specify the architecture through the -device=architecture flags. Because llvm is used as the low-level optimizing compiler, you can also pass any other architecture-specific flags that llvm supports. Although only the x86 and ARM architectures have been tested with the LEIP tool flow, any other architecture supported by llvm can also be targeted.
Optimized compilation for hardware-based accelerators is available by passing the parameter as --target family[:model]. Currently, the cuda family is supported.
As an example, this is how a model is compiled and optimized for an NVIDIA 2080 Ti:
$ leip compile --input_path path/to/model \
    --output_path path/to/output \
    --target cuda:2080ti
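The family[:model] form described above can be illustrated with a small parser. This is only a sketch for scripts that wrap the command line: the names Target and parse_target are hypothetical and are not part of LEIP, and the sketch assumes that any tokens after the first one are extra flags such as -device=arm_cpu.

```python
from typing import NamedTuple, Optional, Tuple


class Target(NamedTuple):
    """Hypothetical container for the parts of a --target value."""
    family: str
    model: Optional[str]
    extra_flags: Tuple[str, ...]


def parse_target(spec: str) -> Target:
    """Split a --target value of the form 'family[:model] [flags...]'.

    Assumption: the first whitespace-separated token is family[:model]
    and every remaining token is an additional target flag.
    """
    tokens = spec.split()
    if not tokens:
        raise ValueError("empty target specification")
    family, _, model = tokens[0].partition(":")
    if not family:
        raise ValueError(f"missing target family in {spec!r}")
    # An absent or empty model part is reported as None.
    return Target(family, model or None, tuple(tokens[1:]))


if __name__ == "__main__":
    print(parse_target("cuda:2080ti"))
    print(parse_target("llvm -device=arm_cpu"))
```

A wrapper script could use the family field to decide, for example, whether the --layout option applies.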
The following example shows how to target an x86 Skylake:
$ leip compile --input_path path/to/model \
    --output_path path/to/output \
    --target llvm -mcpu=skylake
Finally, this example shows how to target the ARM processor on a Raspberry Pi 4 system:
$ leip compile --input_path path/to/model \
    --output_path path/to/output \
    --target llvm -device=arm_cpu -model=bcm2837 -mtriple=aarch64-linux-gnu -mattr=+neon -mcpu=cortex-a72
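When the same model is compiled for several targets, it can help to assemble the command from a script. The following is a minimal sketch that uses only the flags shown in this section (--input_path, --output_path, --layout, --target). The function name build_compile_command is hypothetical, and the sketch assumes the target flags are passed as separate arguments, exactly as in the unquoted examples above.

```python
import shlex
from typing import List, Optional


def build_compile_command(input_path: str,
                          output_path: str,
                          target: Optional[str] = None,
                          layout: Optional[str] = None) -> List[str]:
    """Assemble an argument list for `leip compile` (hypothetical helper)."""
    cmd = ["leip", "compile",
           "--input_path", input_path,
           "--output_path", output_path]
    # Tokenize the target the same way an unquoted shell command would.
    target_tokens = shlex.split(target) if target else []
    family = target_tokens[0].split(":")[0] if target_tokens else "llvm"
    if layout is not None:
        if layout not in ("NCHW", "NHWC"):
            raise ValueError(f"unsupported layout: {layout!r}")
        # The text above states that --layout applies only to non-CUDA targets.
        if family == "cuda":
            raise ValueError("--layout cannot be used with CUDA targets")
        cmd += ["--layout", layout]
    if target_tokens:
        cmd += ["--target", *target_tokens]
    return cmd


if __name__ == "__main__":
    argv = build_compile_command("path/to/model", "path/to/output",
                                 target="llvm -mcpu=skylake")
    # Print the command; pass argv to subprocess.run to actually execute it.
    print(shlex.join(argv))
```

Returning a list rather than a string keeps the result suitable for subprocess.run without shell quoting concerns.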
You can specify different kinds of optimization by passing the --optimization parameter (not to be confused with leip optimize) more than once. The following optimizations are supported:
Kernel (category:kernel,level:N): Specifies the level of kernel optimization, between 1 and 4 (higher is better). The default is 3. Note that layout conversion is not possible for levels below 3.
CUDA: Specifies the CUDA optimization, for CUDA targets only. By default, it is enabled for CUDA targets and disabled otherwise.
Graph (category:graph,iterations:N): Specifies how many iterations to use for the graph optimization algorithm.
These options can be combined as follows:
$ leip compile --input_path path/to/model \
    --output_path path/to/output \
    --optimization category:kernel,level:4 \
    --optimization category:graph,iterations:2000
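The repeated --optimization values in the command above follow a category:name,key:value pattern. As a sketch, the helper below formats the kernel and graph options and checks the kernel level against the documented range of 1 to 4. The function name optimization_args is hypothetical, and the requirement that the iteration count be positive is an assumption rather than documented behavior.

```python
from typing import List, Optional


def optimization_args(kernel_level: Optional[int] = None,
                      graph_iterations: Optional[int] = None) -> List[str]:
    """Format repeated --optimization arguments (hypothetical helper)."""
    args: List[str] = []
    if kernel_level is not None:
        # The documented range for the kernel optimization level is 1 to 4.
        if not 1 <= kernel_level <= 4:
            raise ValueError("kernel level must be between 1 and 4")
        args += ["--optimization", f"category:kernel,level:{kernel_level}"]
    if graph_iterations is not None:
        # Assumption: a non-positive iteration count is not meaningful.
        if graph_iterations < 1:
            raise ValueError("graph iterations must be positive")
        args += ["--optimization",
                 f"category:graph,iterations:{graph_iterations}"]
    return args


if __name__ == "__main__":
    print(" ".join(optimization_args(kernel_level=4, graph_iterations=2000)))
```

The returned list can be appended to the rest of a leip compile argument list.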
The --inference_context family parameter must be set when running inference on a model that was optimized for a hardware accelerator.