Latent Runtime Engine API

pylre.LatentRuntimeEngine(model_path: Union[str, os.PathLike], options: Optional[Union[pylre_onnx.ONNXOptions, pylre_tvm.TVMOptions]] = None)

A Python wrapper around the C++ LRE.

This class provides a Python API over the underlying C++ LRE implementation. The Python LRE can run inference on any tensor inputs that follow the DLPack protocol, i.e. tensor objects with a defined __dlpack__ method, e.g. NumPy arrays, PyTorch tensors, etc. The returned outputs will also be DLPack objects that can be ingested by common libraries like NumPy, PyTorch, etc.
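
For example, a minimal end-to-end sketch, assuming a single-input artifact at the placeholder path "model.so" and NumPy >= 1.22 (which provides numpy.from_dlpack):

    import numpy as np
    from pylre import LatentRuntimeEngine

    # "model.so" is a placeholder path for a LEIP Optimize artifact.
    lre = LatentRuntimeEngine("model.so")

    # Build an input that matches the model's reported shape and dtype.
    x = np.zeros(lre.input_shapes[0], dtype=lre.input_dtypes[0])

    # Calling the instance runs inference and returns DLPack outputs,
    # which NumPy can ingest directly.
    y = np.from_dlpack(lre(x)[0])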

Initialize a runtime instance

Parameters

model_path: Union[str, os.PathLike] Either an '.onnx' or '.so' artifact generated by LEIP Optimize.

options: Optional[Union[pylre_onnx.ONNXOptions, pylre_tvm.TVMOptions]]

The runtime options for the model. Use 'TVMOptions' for TVM-compiled binaries. Use 'ONNXOptions' for ONNX protobufs. Defaults to None.

Attributes

input_dtypes: List[str] property

Model's input data types

input_shapes: List[Tuple[int, ...]] property

Model's input shapes

is_cpu_output: bool property

True if the runtime's current output device is the CPU

is_trt: bool property

True if the runtime session uses TensorRT

model_id: str property

Model's UUID metadata field

number_inputs: int property

Model's number of inputs

number_outputs: int property

Model's number of outputs

output_dtypes: List[str] property

Model's output data types

output_shapes: List[Tuple[int, ...]] property

Model's output shapes

runtime_options: Union[pylre_onnx.ONNXOptions, pylre_tvm.TVMOptions] property

Runtime options of the current session
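
These properties can be used to inspect a loaded model before running inference, e.g. (a sketch; "model.so" is a placeholder path):

    from pylre import LatentRuntimeEngine

    lre = LatentRuntimeEngine("model.so")

    # Report the model's I/O signature and identity.
    print(lre.number_inputs, lre.input_shapes, lre.input_dtypes)
    print(lre.number_outputs, lre.output_shapes, lre.output_dtypes)
    print(lre.model_id)         # UUID metadata field
    print(lre.runtime_options)  # options for the current session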

Functions

__call__(inputs) -> List[PyDLPack]

Invokes inference and returns the outputs by calling the instance directly; a composition of infer() and get_outputs().
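
For example, continuing the sketch above (lre and the DLPack-compatible tensor x as before):

    # Equivalent to lre.infer(x) followed by lre.get_outputs().
    outputs = lre(x)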

get_metadata() -> dict

Get a dictionary of the model's metadata.

Returns

metadata: dict Dictionary of metadata key-values
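
The exact keys depend on the model artifact; a minimal sketch, continuing from an initialized engine lre:

    for key, value in lre.get_metadata().items():
        print(key, value)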

get_output(index: int) -> PyDLPack

Get a specific tensor output by index from the last executed inference.

Parameters

index: int The desired output tensor index.
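
For example, fetching only the first output tensor (a sketch assuming an engine lre on which infer() has already run):

    first_output = lre.get_output(0)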

get_outputs() -> List[PyDLPack]

Returns all the outputs from the last executed inference.

Returns

outputs: List[PyDLPack] List of DLPack-protocol objects
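
For example, converting every output to a NumPy array (assuming NumPy >= 1.22 and an engine lre on which infer() has already run):

    import numpy as np

    arrays = [np.from_dlpack(out) for out in lre.get_outputs()]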

infer(inputs) -> None

Runs inference on the provided input(s). Outputs are saved to internal buffers and can be retrieved with get_output() or get_outputs().

Parameters

inputs: Union[DLPack-Tensor, List[DLPack-Tensor]] Either a single "DLPack-Tensor" or a list/tuple of "DLPack-Tensor" objects. A "DLPack-Tensor" object is any tensor that implements the DLPack protocol, i.e. has a __dlpack__ method defined.
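
A sketch of the infer/get pattern, continuing from the earlier lre and x:

    # Single input, or a list/tuple for multi-input models.
    lre.infer(x)
    result = lre.get_output(0)  # outputs remain buffered until retrieved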

set_cpu_output(use_cpu: bool) -> None

Sets whether the output should be a CPU PyDLPack tensor.

This method configures the output to be a CPU PyDLPack tensor if the inference device is CUDA. If the inference device is already set to CPU, this setting has no effect since the output is already on the CPU.

Parameters

use_cpu: bool If set to True, the output will be a CPU PyDLPack tensor when the inference device is CUDA. If set to False, the output will remain on the device used for inference.
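
For example, forcing host-side outputs on a CUDA session so NumPy can read them directly (a sketch; "model.so" is a placeholder for a CUDA-device artifact):

    import numpy as np
    from pylre import LatentRuntimeEngine

    lre = LatentRuntimeEngine("model.so")
    lre.set_cpu_output(True)  # copy CUDA outputs to host memory
    lre.infer(np.zeros(lre.input_shapes[0], dtype=lre.input_dtypes[0]))
    host_output = np.from_dlpack(lre.get_output(0))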

TVMOptions

A class that provides options for configuring TVM-compiled models.

  • Example:

    import pylre
    from pylre import LatentRuntimeEngine as LRE
    
    # Create TVMOptions with all possible configurations
    options = pylre.TVMOptions(
        precision="int8",                    # Set precision to INT8
        tensorrt_timing_cache="timing_dir",  # Set TensorRT timing cache directory
        tensorrt_engine_cache="engine_dir",  # Set TensorRT engine cache directory
        device_id=0,                         # Use GPU device ID 0
        password="password",                 # Set encryption password
        key_path="path_to_key"               # Provide encryption key path
    )
    
    # Initialize the Latent Runtime Engine with the configured options
    lre = LRE(model_path="path_to_model", options=options)
    

  • precision: Optional[str]

    • The precision mode to use during inference. Possible values: "fp32", "fp16", "int8". Defaults to "fp32".
Supported runtime precision

The available precision modes at runtime depend on the precision at compilation; columns give the runtime precision, and a precision-override sketch follows this options list:

Precision at Compilation | INT8 | FP16 | FP32
INT8                     |  ✔️  |  ✔️  |  ✔️
FP16                     |      |  ✔️  |  ✔️
FP32                     |      |  ✔️  |  ✔️
  • tensorrt_timing_cache: Optional[str]

    • A cache path for TensorRT timing data. If not specified, timing data will be stored in memory.
  • tensorrt_engine_cache: Optional[str]

    • A cache path for TensorRT engine files. If not specified, engine files will be generated and stored in memory.
  • device_id: Optional[int]

    • The ID of the device to use for inference. Defaults to 0.
      • For vanilla CPU memory, pinned memory, or managed memory, this is set to 0.
      • On multi-GPU systems, this selects a specific GPU (e.g., 0 for GPU 0).
  • password: Optional[str]

    • Password that was used to encrypt your model. If not specified, no password is required.
  • key_path: Optional[str]

    • The path to a key file used for model encryption. If not specified, no key file is required.
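
As referenced in the precision note above, a minimal sketch of overriding runtime precision, assuming an INT8-compiled artifact at the placeholder path "model.so":

    import pylre
    from pylre import LatentRuntimeEngine as LRE

    # Per the precision table, an INT8-compiled artifact may also run
    # at FP16 or FP32 runtime precision.
    options = pylre.TVMOptions(precision="fp16")
    lre = LRE(model_path="model.so", options=options)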

ONNXOptions

A class that provides options for configuring ONNX-exported models.

  • Example:

    import pylre
    from pylre import LatentRuntimeEngine as LRE
    
    # Create ONNXOptions with all possible configurations
    options = pylre.ONNXOptions(
        execution_provider="cuda",           # Use CUDA for execution
        precision="int8",                    # Set precision to INT8
        tensorrt_timing_cache="timing_dir",  # Set TensorRT timing cache directory
        tensorrt_engine_cache="engine_dir",  # Set TensorRT engine cache directory
        device_id=0,                         # Use GPU device ID 0
        password="password",                 # Set encryption password
        key_path="path_to_key"               # Provide encryption key path
    )
    
    # Initialize the Latent Runtime Engine with the configured options
    lre = LRE(model_path="path_to_model", options=options)
    

  • execution_provider: Optional[str]

    • The provider to use for model execution. Possible values: "cpu", "cuda", "tensorrt". Defaults to "cpu".
  • precision: Optional[str]

    • The precision mode to use during inference. Possible values: "fp32", "int8". Defaults to "fp32".
Supported runtime precision

The available precision modes at runtime depend on the precision at compilation; columns give the runtime precision:

Precision at Compilation | INT8 | FP16 | FP32
INT8                     |  ✔️  |  ✔️  |  ✔️
FP16                     |      |  ✔️  |  ✔️
FP32                     |      |  ✔️  |  ✔️
  • tensorrt_timing_cache: Optional[str]

    • A cache path for TensorRT timing data. If not specified, timing data will be stored in memory.
  • tensorrt_engine_cache: Optional[str]

    • A cache path for TensorRT engine files. If not specified, engine files will be generated and stored in memory; see the cache-reuse sketch after this list.
  • device_id: Optional[int]

    • The ID of the device to use for inference. Defaults to 0.
      • For vanilla CPU memory, pinned memory, or managed memory, this is set to 0.
      • On multi-GPU systems, this selects a specific GPU (e.g., 0 for GPU 0).
  • password: Optional[str]

    • Password that was used to encrypt your model. If not specified, no password is required.
  • key_path: Optional[str]

    • The path to a key file used for model encryption. If not specified, no key file is required.
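
As referenced in the cache notes above, a minimal sketch of persisting TensorRT caches across sessions; "model.onnx" and the cache paths are placeholders:

    import pylre
    from pylre import LatentRuntimeEngine as LRE

    # With on-disk caches, later sessions can skip TensorRT kernel
    # timing and engine rebuilds.
    options = pylre.ONNXOptions(
        execution_provider="tensorrt",
        tensorrt_timing_cache="caches/timing",
        tensorrt_engine_cache="caches/engines",
    )
    lre = LRE(model_path="model.onnx", options=options)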