ONNXModule Class

forge.ONNXModule
ONNXModule in Forge, an extension of ONNX's ModelProto, facilitates manipulation and optimization of machine learning models.
This class provides a user-friendly API to easily quantize and export ONNX models for inference with the LEIP LatentRuntimeEngine (LRE).
mod: ModelProto (property)
The current state of the ONNX model.

ir_version: int (property)
ONNX model's IR version.

input_count: int (property)
ONNX model's number of expected inputs.

input_shapes: List[Tuple[Union[int, str], ...]] (property)
List of ONNX model's input shapes.

input_dtypes: List[str] (property)
List of ONNX model's input data types.

input_names: List (property)
List of ONNX model's input names.

output_count: int (property)
ONNX model's number of expected outputs.

output_shapes: List[Tuple[Union[int, str], ...]] (property)
List of ONNX model's output shapes.

output_dtypes: List[str] (property)
List of ONNX model's output data types.

output_names: List (property)
List of ONNX model's output names.

is_calibrated: bool (property)
Flag to check whether or not the module is calibrated (for quantization).

is_quantized: bool (property)
Flag to check whether or not the module is quantized (non-TensorRT).

is_quantized_for_tensorrt: bool (property)
Flag to check whether or not the module is 'quantized' for TensorRT.
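
The sketch below reads a few of these properties from an existing module. The variable name `module`, the shapes, and the dtypes are illustrative assumptions; constructing an ONNXModule is not covered in this section.

```python
# Illustrative sketch: `module` is assumed to be an existing forge.ONNXModule.
print(module.input_names)     # e.g. ['images']
print(module.input_shapes)    # e.g. [(1, 3, 224, 224)]
print(module.input_dtypes)    # e.g. ['float32']
print(module.output_names)
print(module.is_calibrated)   # False until calibrate() has been run
print(module.is_quantized)    # False until quantize() has been run
```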
copy() -> ONNXModule
Returns a deep copy of the instance.
get_inference_function(providers: Optional[Union[str, List[str]]] = None, opt_level: Union[int, GraphOptimizationLevel] = ort.GraphOptimizationLevel.ORT_DISABLE_ALL) -> Callable
Creates an ONNX Runtime inference function from the given model.
This function loads the current state of the ONNX model and returns a callable inference function that can be used to run predictions. The returned function automatically handles input and output names and shapes, and executes inference using the specified execution providers and graph optimization level.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `providers` | `Optional[Union[str, List[str]]]` | The execution providers to use for inference. Can be a string, e.g. "CUDAExecutionProvider", or a list of provider strings to try in priority order. If not provided, defaults to "CPUExecutionProvider". | `None` |
| `opt_level` | `Union[int, GraphOptimizationLevel]` | The level of graph optimization to apply during model loading. Defaults to `ORT_DISABLE_ALL`. | `ORT_DISABLE_ALL` |
Returns:

| Name | Type | Description |
|---|---|---|
| `Callable` | `Callable` | A callable function that takes input data as positional arguments and returns a dictionary mapping output names to their corresponding NumPy arrays. The function has additional metadata attributes such as `output_names`. |
Inference Function Output
The inference function will return a dictionary mapping output node names to the respective node activations collected during inference.
Example

```python
inference_fn = ir.get_inference_function()
output = inference_fn(input_data)
output_name = inference_fn.output_names[0]
print(output[output_name])
```
Raises:

| Type | Description |
|---|---|
| `ValueError` | If the model, providers, or optimization level are invalid. |
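
The sketch below combines the `providers` and `opt_level` arguments. It assumes `module` is an ONNXModule instance, `input_data` is a pre-processed NumPy array, and the onnxruntime package is installed with CUDA support available.

```python
import onnxruntime as ort

# Sketch: prefer CUDA, fall back to CPU, and enable basic graph optimizations.
infer = module.get_inference_function(
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    opt_level=ort.GraphOptimizationLevel.ORT_ENABLE_BASIC,
)
outputs = infer(input_data)              # dict: output name -> NumPy array
first = outputs[infer.output_names[0]]   # metadata attribute on the returned function
```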
calibrate(calib_data: Iterable[Any], use_cuda: bool = True, reset: bool = True) -> None
Calibrates the model by tracking intermediate layer statistics.
This method collects statistics from intermediate layers of the model using the provided calibration dataset. These statistics are used for deriving quantization parameters in a subsequent quantization process. It's essential that the calibration data is representative of the model's expected real-world inputs.
Note: Ensure that the calibration data is in the form of numpy arrays and has undergone the necessary pre-processing steps required for the model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `calib_data` | `Iterable[Any]` | An iterable of data samples for calibration. The samples should be in a format compatible with the model's input requirements; inspect the `onnx_ir.input_shapes` property to confirm the expected format. | required |
| `reset` | `bool` | If True, any previous calibration data is cleared before new data is processed. Defaults to True. | `True` |
| `use_cuda` | `bool` | If True, Forge will utilize CUDA devices for calibration if GPUs can be found. Operation will fall back to CPU if GPUs are not found. Default is True. | `True` |
Returns:

| Name | Type | Description |
|---|---|---|
| `None` | `None` | This method operates in place. |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If |
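
A minimal calibration sketch follows. The sample count, batch shape, and dtype are illustrative assumptions; in practice, use real samples pre-processed exactly as the model expects (check the `input_shapes` and `input_dtypes` properties).

```python
import numpy as np

# Sketch: 32 random samples stand in for real, pre-processed calibration data.
calib_data = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(32)]
module.calibrate(calib_data, use_cuda=True)  # falls back to CPU if no GPU is found
assert module.is_calibrated
```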
quantize(activation_dtype: str = 'int8', kernel_dtype: Optional[str] = None, per_channel: bool = False, calib_method: str = 'entropy', quant_type: str = 'any') -> None
Applies quantization to the model with specified parameters.
This method quantizes the model's activations and kernels to the specified data types. If kernel_dtype is None, it defaults to the activation_dtype. The quantization can be "static" (requiring prior calibration), "dynamic" (no calibration needed), or "any" (prioritizing static if possible, i.e. the module 'is_calibrated').

This method performs two processes during quantization:

1) Quantize the current model state (non-TensorRT).
2) Compute 'quantization' for TensorRT. This step only accounts for the 'calib_method' argument; all other arguments have no effect on the 'quantization' for TensorRT.

Note: When using "static" quant_type, ensure calibration is performed beforehand or provide calib_data for calibration.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `activation_dtype` | `str` | Data type for activations ("int8", "uint8"), default is "int8". | `'int8'` |
| `kernel_dtype` | `Optional[str]` | Data type for kernels ("int8", "uint8"), defaults to `activation_dtype`. | `None` |
| `per_channel` | `bool` | If True, performs per-channel quantization on kernels. Default is False. | `False` |
| `calib_method` | `str` | Method for calibration ("average", "entropy", "minmax", "percentile"), default is "entropy". Overview of calibration methods: "average" computes the average of the min-max extrema across calibration data; "entropy" performs distribution-based maximization of the entropy of quantized values; "minmax" takes the absolute most extreme min-max values across calibration data; "percentile" computes 99th-percentile cut-offs across calibration data. | `'entropy'` |
| `quant_type` | `str` | Type of quantization ("static", "dynamic", "any"), default is "any". | `'any'` |
Returns:

| Name | Type | Description |
|---|---|---|
| `None` | `None` | This method operates in place. |
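
A static int8 quantization pass after calibration might look like the sketch below; the argument values are illustrative.

```python
# Sketch: with quant_type="any", Forge prioritizes static quantization because
# the module has already been calibrated.
module.quantize(
    activation_dtype="int8",
    per_channel=True,
    calib_method="minmax",
    quant_type="any",
)
assert module.is_quantized
print(module.is_quantized_for_tensorrt)  # TensorRT 'quantization' is computed as well
```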
export(f: Union[str, os.PathLike] = './model.onnx', force_overwrite: bool = False, is_tensorrt: bool = False, uuid: Optional[str] = None, encrypt_password: Optional[str] = None) -> None
Exports the current state of the ONNX model to the specified output with metadata for inference with the LEIP LatentRuntimeEngine (LRE).
This method manages output path validation and enforces the '.onnx' file extension. If the model is quantized, related metadata will be included in the export. If exporting for downstream use with TensorRT, set 'is_tensorrt=True' (applicable to quantized models only); the process will export an unquantized version of the model along with 'quantization' parameters in its metadata.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `f` | `Union[str, os.PathLike]` | A string containing a file name or a path-like object. Defaults to "./model.onnx". | `'./model.onnx'` |
| `force_overwrite` | `bool` | If True, overwrites the output path if it already exists. Defaults to False. | `False` |
| `is_tensorrt` | `bool` | If True, exports the model only in its unquantized state, but also exports the current state of collected calibration data needed to run the model with TensorRT's 8-bit quantization. See the module's `calibrate()` and `quantize()` methods, both necessary steps before any calibration data gets exported. | `False` |
| `uuid` | `Optional[str]` | A custom UUID for the export. If not provided, a new UUID4 is generated. | `None` |
| `encrypt_password` | `Optional[str]` | Optional password used to encrypt the exported model. When provided, the export produces both the model file and the key. | `None` |
Returns:

| Name | Type | Description |
|---|---|---|
| `None` | `None` | This method operates in place. |
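
A closing export sketch; the file names are placeholders, and the TensorRT variant assumes `calibrate()` and `quantize()` have already been run as described above.

```python
# Sketch: standard quantized export for the LEIP LatentRuntimeEngine (LRE).
module.export("./model.onnx", force_overwrite=True)

# Sketch: export for downstream TensorRT use; writes the unquantized model plus
# the collected 'quantization' parameters in its metadata.
module.export("./model_trt.onnx", is_tensorrt=True, force_overwrite=True)
```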