
ONNXModule Class

forge.ONNXModule

ONNXModule(onnx_model: Union[str, PathLike, bytes, ModelProto])

ONNXModule in Forge, an extension of ONNX's ModelProto, facilitates manipulation and optimization of machine learning models.

This class provides a user-friendly API to easily quantize and export ONNX models for inference with the LEIP LatentRuntimeEngine (LRE).

Initialize an ONNXModule instance.

Note

The ONNX model must be less than 2GB in size due to Protobuf serialization limitations.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| onnx_model | Union[str, PathLike, bytes, ModelProto] | Can be: a file-like object (has a 'read' function); a string/PathLike containing a file name of an ONNX model; a string or bytes object containing a serialized ModelProto; or a ModelProto instance. | required |
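
A minimal construction sketch (the file name model.onnx is illustrative; this assumes forge exposes ONNXModule at the top level, as the heading above suggests):

Example
import onnx
import forge

# From a file path (string or PathLike)
module = forge.ONNXModule("model.onnx")

# From an in-memory ModelProto
proto = onnx.load("model.onnx")
module = forge.ONNXModule(proto)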

Methods:

| Name | Description |
| --- | --- |
| copy | Returns a deep copy of the instance. |
| get_inference_function | Creates an ONNX Runtime inference function from the given model. |
| calibrate | Calibrates the model by tracking intermediate layer statistics. |
| quantize | Applies quantization to the model with specified parameters. |
| export | Exports the current state of the ONNX model to the specified output with metadata. |
| __repr__ | Returns the string representation of the module. |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| mod | ModelProto | The current state of the ONNX model |
| ir_version | int | ONNX model's IR version |
| input_count | int | ONNX model's number of expected inputs |
| input_shapes | List[Tuple[Union[int, str], ...]] | List of ONNX model's input shapes |
| input_dtypes | List[str] | List of ONNX model's input data types |
| input_names | List | List of ONNX model's input names |
| output_count | int | ONNX model's number of expected outputs |
| output_shapes | List[Tuple[Union[int, str], ...]] | List of ONNX model's output shapes |
| output_dtypes | List[str] | List of ONNX model's output data types |
| output_names | List | List of ONNX model's output names |
| is_calibrated | bool | Whether the module has been calibrated (for quantization) |
| is_quantized | bool | Whether the module has been quantized (non-TensorRT) |
| is_quantized_for_tensorrt | bool | Whether the module has been 'quantized' for TensorRT |

Attributes

mod property
mod: ModelProto

The current state of the ONNX model

ir_version property
ir_version: int

ONNX model's IR version

input_count property
input_count: int

ONNX model's number of expected inputs

input_shapes property
input_shapes: List[Tuple[Union[int, str], ...]]

List of ONNX model's input shapes

input_dtypes property
input_dtypes: List[str]

List of ONNX model's input data types

input_names property
input_names: List

List of ONNX model's input names

output_count property
output_count: int

ONNX model's number of expected outputs

output_shapes property
output_shapes: List[Tuple[Union[int, str], ...]]

List of ONNX model's output shapes

output_dtypes property
output_dtypes: List[str]

List of ONNX model's output data types

output_names property
output_names: List

List of ONNX model's output names

is_calibrated property
is_calibrated: bool

Whether the module has been calibrated (for quantization)

is_quantized property
is_quantized: bool

Whether the module has been quantized (non-TensorRT)

is_quantized_for_tensorrt property
is_quantized_for_tensorrt: bool

Whether the module has been 'quantized' for TensorRT
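
For example, these properties can be inspected before calibration or quantization (the names, shapes, and dtypes in the comments are illustrative, not guaranteed):

Example
module = forge.ONNXModule("model.onnx")
print(module.input_names)    # e.g. ['input']
print(module.input_shapes)   # e.g. [(1, 3, 224, 224)]
print(module.input_dtypes)   # e.g. ['float32']
print(module.is_calibrated)  # False for a freshly loaded model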

Functions

copy
copy() -> ONNXModule

Returns a deep copy of the instance

get_inference_function
get_inference_function(
    providers: Optional[Union[str, List[str]]] = None,
    opt_level: Union[int, GraphOptimizationLevel] = ORT_DISABLE_ALL,
) -> Callable

Creates an ONNX Runtime inference function from the given model.

This function loads the current state of the ONNX model and returns a callable inference function that can be used to run predictions. The returned function automatically handles input and output names and shapes, and executes inference using the specified execution providers and graph optimization level.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| providers | Optional[Union[str, List[str]]] | The execution providers to use for inference. Can be a string, e.g. "CUDAExecutionProvider", or a list of provider strings to try in priority order. If not provided, defaults to "CPUExecutionProvider". | None |
| opt_level | Union[int, GraphOptimizationLevel] | The level of graph optimization to apply during model loading. Defaults to GraphOptimizationLevel.ORT_DISABLE_ALL. | ORT_DISABLE_ALL |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| Callable | Callable | A callable that takes input data as positional arguments and returns a dictionary mapping output names to their corresponding NumPy arrays. The function carries additional metadata attributes such as input_names, input_shapes, output_names, output_shapes, and session. |

Inference Function Output

The inference function returns a dictionary mapping output node names to their respective activations collected during inference.

Example
inference_fn = ir.get_inference_function()
output = inference_fn(input_data)
output_name = inference_fn.output_names[0]
print(output[output_name])

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the model, providers, or optimization level are invalid. |
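
As a sketch of provider selection (this assumes an ONNX Runtime build with CUDA support is installed; the list is tried in priority order):

Example
inference_fn = ir.get_inference_function(
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)
output = inference_fn(input_data)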

calibrate
calibrate(
    calib_data: Iterable[Any], use_cuda: bool = True, reset: bool = True
) -> None

Calibrates the model by tracking intermediate layer statistics.

This method collects statistics from intermediate layers of the model using the provided calibration dataset. These statistics are used for deriving quantization parameters in a subsequent quantization process. It's essential that the calibration data is representative of the model's expected real-world inputs.

Note: Ensure that the calibration data is in the form of numpy arrays and has undergone the necessary pre-processing steps required for the model.

Parameters:

Name Type Description Default
calib_data Iterable[Any]

An iterable of data samples for calibration. The samples should be in a format compatible with the model's input requirements. Inspect the onnx_ir.input_shapesand onnx_ir.input_dtypes for details. For multiple inputs, each set of inputs should be an iterable of numpy arrays, e.g. a list or tuple of numpy arrays.

required
reset bool

If True, any previous calibration data is cleared before new data is processed. Defaults to True.

True
use_cuda bool

If True, Forge will utilize CUDA devices for calibration if GPUs can be found. Operation will fall back to CPU if GPUs are not found. Default is True.

True

Returns:

| Name | Type | Description |
| --- | --- | --- |
| None | None | This method operates in place. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If calib_data is not in the correct format. |
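
An illustrative calibration sketch; the random arrays below are placeholders for real pre-processed samples, and the (1, 3, 224, 224) float32 input shape is an assumption about the model:

Example
import numpy as np

onnx_ir = forge.ONNXModule("model.onnx")
# Placeholder calibration set: 16 samples shaped like the model's single input.
calib_data = [np.random.rand(1, 3, 224, 224).astype("float32") for _ in range(16)]
onnx_ir.calibrate(calib_data, use_cuda=False)
assert onnx_ir.is_calibrated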

quantize
quantize(
    activation_dtype: str = "int8",
    kernel_dtype: Optional[str] = None,
    per_channel: bool = False,
    calib_method: str = "entropy",
    quant_type: str = "any",
) -> None

Applies quantization to the model with specified parameters.

This method quantizes the model's activations and kernels to the specified data types. If kernel_dtype is None, it defaults to activation_dtype. The quantization can be "static" (requiring prior calibration), "dynamic" (no calibration needed), or "any" (prioritizing static when possible, i.e. when the module is_calibrated).

This method performs two quantization processes:

1) Quantize the current model state (non-TensorRT).
2) Compute 'quantization' for TensorRT. This step only accounts for the calib_method argument; all other arguments have no effect on the 'quantization' for TensorRT.

Note: When using the "static" quant_type, ensure the module has been calibrated beforehand via calibrate() with representative calib_data.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| activation_dtype | str | Data type for activations ("int8", "uint8"). Default is "int8". | 'int8' |
| kernel_dtype | Optional[str] | Data type for kernels ("int8", "uint8"); defaults to activation_dtype if None. | None |
| per_channel | bool | If True, performs per-channel quantization on kernels. Default is False. | False |
| calib_method | str | Method for calibration ("average", "entropy", "minmax", "percentile"). Default is "entropy". "average": average of the min-max extrema across calibration data; "entropy": distribution-based maximization of entropy of quantized values; "minmax": absolute most extreme min-max values across calibration data; "percentile": computed 99th-percentile cut-offs across calibration data. | 'entropy' |
| quant_type | str | Type of quantization ("static", "dynamic", "any"). Default is "any". | 'any' |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| None | None | This method operates in place. |
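
A typical static int8 flow, sketched under the same assumptions as the calibration example above:

Example
onnx_ir.calibrate(calib_data)  # required before static quantization
onnx_ir.quantize(
    activation_dtype="int8",
    per_channel=True,
    calib_method="entropy",
    quant_type="static",
)
assert onnx_ir.is_quantized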

export
export(
    f: Union[str, PathLike] = "./model.onnx",
    force_overwrite: bool = False,
    is_tensorrt: bool = False,
    uuid: Optional[str] = None,
    encrypt_password: Optional[str] = None,
) -> None

Exports the current state of the ONNX model to the specified output with metadata for inference with the LEIP LatentRuntimeEngine (LRE).

This method manages output path validation and enforces the '.onnx' file extension. If the model is quantized, related metadata will be included in the export. If exporting for downstream use with TensorRT, set 'is_tensorrt=True' (applicable to quantized models only); the process will export an unquantized version of the model along with 'quantization' parameters in its metadata.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| f | Union[str, PathLike] | A string containing a file name or a path-like object. Defaults to "./model.onnx". | './model.onnx' |
| force_overwrite | bool | If True, overwrites the output path if it already exists. Defaults to False. | False |
| is_tensorrt | bool | If True, exports the model only in its unquantized state, together with the collected calibration data needed to run the model with TensorRT's 8-bit quantization. See the calibrate() and quantize() methods, both necessary steps before any calibration data gets exported. | False |
| uuid | Optional[str] | A custom UUID for the export. If not provided, a new UUID4 is generated. | None |
| encrypt_password | Optional[str] | An optional password used to encrypt the exported model. When provided, the export produces both the model file and a key. | None |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| None | None | This method operates in place. |
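
A sketch of both export paths (file names are illustrative), assuming the module has been calibrated and quantized as above:

Example
# Standard quantized export for the LEIP LatentRuntimeEngine (LRE)
onnx_ir.export("./model_quantized.onnx", force_overwrite=True)

# TensorRT path: exports the unquantized model plus the TensorRT 'quantization'
# parameters collected during calibration
onnx_ir.export("./model_trt.onnx", is_tensorrt=True, force_overwrite=True)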

__repr__
__repr__() -> str

Returns the string representation of the module.