
ONNXModule Class

forge.ONNXModule

ONNXModule in Forge, an extension of ONNX's ModelProto, facilitates manipulation and optimization of machine learning models.

This class provides a user-friendly API to easily quantize and export ONNX models for inference with the LEIP LatentRuntimeEngine (LRE).
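A typical workflow chains the methods documented on this page. The sketch below is illustrative only: 'ir' is assumed to be an already-constructed ONNXModule (construction is not covered in this reference), and 'calib_samples' / 'sample' are placeholder NumPy inputs you would replace with pre-processed data matching ir.input_shapes and ir.input_dtypes.

Example
# calibrate on representative, pre-processed samples
ir.calibrate(calib_samples)
# apply static int8 quantization using the collected statistics
ir.quantize(activation_dtype="int8", quant_type="static")
# export for inference with the LEIP LatentRuntimeEngine
ir.export("./model_int8.onnx", force_overwrite=True)
# or run the model directly through ONNX Runtime
inference_fn = ir.get_inference_function()
outputs = inference_fn(sample)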

mod: ModelProto property

The current state of the ONNX model

ir_version: int property

ONNX model's IR version

input_count: int property

ONNX model's number of expected inputs

input_shapes: List[Tuple[Union[int, str], ...]] property

List of ONNX model's input shapes

input_dtypes: List[str] property

List of ONNX model's input data types

input_names: List property

List of ONNX model's input names

output_count: int property

ONNX model's number of expected outputs

output_shapes: List[Tuple[Union[int, str], ...]] property

List of ONNX model's output shapes

output_dtypes: List[str] property

List of ONNX model's output data types

output_names: List property

List of ONNX model's output names

is_calibrated: bool property

Flag to check whether or not the module is calibrated (for quantization)

is_quantized: bool property

Flag to check whether or not the module is quantized (non-TensorRT)

is_quantized_for_tensorrt: bool property

Flag to check whether or not the module is 'quantized' for TensorRT
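For instance, the input/output properties can be used to prepare inputs and to check the module's state before quantization (a brief sketch; 'ir' is an ONNXModule instance and the printed values are illustrative):

Example
print(ir.input_names)    # e.g. ['images']
print(ir.input_shapes)   # e.g. [(1, 3, 224, 224)]
print(ir.input_dtypes)   # e.g. ['float32']
print(ir.is_calibrated, ir.is_quantized, ir.is_quantized_for_tensorrt)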

copy() -> ONNXModule

Returns a deep copy of the instance

get_inference_function(providers: Optional[Union[str, List[str]]] = None, opt_level: Union[int, GraphOptimizationLevel] = ort.GraphOptimizationLevel.ORT_DISABLE_ALL) -> Callable

Creates an ONNX Runtime inference function from the given model.

This function loads the current state of the ONNX model and returns a callable inference function that can be used to run predictions. The returned function automatically handles input and output names and shapes, and executes inference using the specified execution providers and graph optimization level.

Parameters:

providers (Optional[Union[str, List[str]]], default: None)
    The execution providers to use for inference. Can be a string, e.g. "CUDAExecutionProvider", or a list of provider strings to try in priority order. If not provided, defaults to "CPUExecutionProvider".

opt_level (Union[int, GraphOptimizationLevel], default: ORT_DISABLE_ALL)
    The level of graph optimization to apply during model loading. Defaults to GraphOptimizationLevel.ORT_DISABLE_ALL.

Returns:

Callable
    A callable function that takes input data as positional arguments and returns a dictionary mapping output names to their corresponding NumPy arrays. The function has additional metadata attributes such as input_names, input_shapes, output_names, output_shapes, and session.

Inference Function Output

The inference function returns a dictionary mapping output node names to the respective node activations collected during inference.

Example
inference_fn = ir.get_inference_function()
output = inference_fn(input_data)
output_name = inference_fn.output_names[0]
print(output[output_name])

Raises:

ValueError
    If the model, providers, or optimization level are invalid.
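For example, to prefer GPU execution with a CPU fallback (this assumes the CUDA execution provider is available in your ONNX Runtime install; otherwise inference falls back to the CPU provider):

Example
inference_fn = ir.get_inference_function(
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)
output = inference_fn(input_data)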

calibrate(calib_data: Iterable[Any], use_cuda: bool = True, reset: bool = True) -> None

Calibrates the model by tracking intermediate layer statistics.

This method collects statistics from intermediate layers of the model using the provided calibration dataset. These statistics are used for deriving quantization parameters in a subsequent quantization process. It's essential that the calibration data is representative of the model's expected real-world inputs.

Note: Ensure that the calibration data is in the form of numpy arrays and has undergone the necessary pre-processing steps required for the model.

Parameters:

calib_data (Iterable[Any], required)
    An iterable of data samples for calibration. The samples should be in a format compatible with the model's input requirements; inspect onnx_ir.input_shapes and onnx_ir.input_dtypes for details. For multiple inputs, each set of inputs should be an iterable of numpy arrays, e.g. a list or tuple of numpy arrays.

use_cuda (bool, default: True)
    If True, Forge will utilize CUDA devices for calibration if GPUs can be found. Operation will fall back to CPU if GPUs are not found. Default is True.

reset (bool, default: True)
    If True, any previous calibration data is cleared before new data is processed. Defaults to True.

Returns:

None
    This method operates in place.

Raises:

ValueError
    If calib_data is not in the correct format.
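A minimal calibration sketch for a model with a single float32 input of shape (1, 3, 224, 224) (the shape, dtype, and sample count are assumptions; in practice, use representative pre-processed samples rather than random data):

Example
import numpy as np

# one numpy array per calibration sample, matching the model's input shape and dtype
calib_data = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(32)]
ir.calibrate(calib_data, use_cuda=True)
print(ir.is_calibrated)  # True once statistics have been collected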

quantize(activation_dtype: str = 'int8', kernel_dtype: Optional[str] = None, per_channel: bool = False, calib_method: str = 'entropy', quant_type: str = 'any') -> None

Applies quantization to the model with specified parameters.

This method quantizes the model's activations and kernels to the specified data types. If kernel_dtype is None, it defaults to activation_dtype. The quantization can be "static" (requiring prior calibration), "dynamic" (no calibration needed), or "any" (prioritizing static if possible, i.e. if the module 'is_calibrated').

This method performs two processes during quantization:

1) Quantize the current model state (non-TensorRT).

2) Compute 'quantization' parameters for TensorRT. This step only accounts for the 'calib_method' argument; all other arguments have no effect on the 'quantization' for TensorRT.

Note: When using "static" quant_type, ensure calibration is performed beforehand or provide calib_data for calibration.

Parameters:

activation_dtype (str, default: 'int8')
    Data type for activations ("int8", "uint8"). Default is "int8".

kernel_dtype (Optional[str], default: None)
    Data type for kernels ("int8", "uint8"). Defaults to activation_dtype if None.

per_channel (bool, default: False)
    If True, performs per-channel quantization on kernels. Default is False.

calib_method (str, default: 'entropy')
    Method for calibration ("average", "entropy", "minmax", "percentile"). Default is "entropy". Overview of calibration methods: "average" - computed average of the min-max extrema across calibration data; "entropy" - distribution-based maximization of entropy of quantized values; "minmax" - absolute most extreme min-max values across calibration data; "percentile" - computed 99th-percentile cut-offs across calibration data.

quant_type (str, default: 'any')
    Type of quantization ("static", "dynamic", "any"). Default is "any".

Returns:

None
    This method operates in place.
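A brief sketch of static int8 quantization following calibration (argument values are illustrative, not recommendations):

Example
ir.calibrate(calib_data)
ir.quantize(activation_dtype="int8", per_channel=True, calib_method="minmax", quant_type="static")
print(ir.is_quantized, ir.is_quantized_for_tensorrt)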

export(f: Union[str, os.PathLike] = './model.onnx', force_overwrite: bool = False, is_tensorrt: bool = False, uuid: Optional[str] = None, encrypt_password: Optional[str] = None) -> None

Exports the current state of the ONNX model to the specified output with metadata for inference with the LEIP LatentRuntimeEngine (LRE).

This method manages output path validation and enforces the '.onnx' file extension. If the model is quantized, related metadata will be included in the export. If exporting for downstream use with TensorRT, set 'is_tensorrt=True' (applicable to quantized models only); the process will export an unquantized version of the model along with 'quantization' parameters in its metadata.

Parameters:

f (Union[str, os.PathLike], default: './model.onnx')
    A string containing a file name or a path-like object. Defaults to "./model.onnx".

force_overwrite (bool, default: False)
    If True, overwrites the output path if it already exists. Defaults to False.

is_tensorrt (bool, default: False)
    If True, exports the model only in its unquantized state, along with the current state of collected calibration data needed to run the model with TensorRT's 8-bit quantization. See the module's 'calibrate()' and 'quantize()' methods, both necessary steps before any calibration data gets exported.

uuid (Optional[str], default: None)
    A custom UUID for the export. If not provided, a new UUID4 is generated.

encrypt_password (Optional[str], default: None)
    Optional password used to encrypt the exported model. When provided, the export produces both the model file and the key.

Returns:

None
    This method operates in place.
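For example (output paths are illustrative), a quantized module can be exported for the LRE, and, separately, re-exported with TensorRT calibration metadata:

Example
ir.export("./model_int8.onnx", force_overwrite=True)
ir.export("./model_trt.onnx", is_tensorrt=True, force_overwrite=True)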

__repr__() -> str

Returns a string representation of the module.