Latent Runtime Engine API

pylre.LatentRuntimeEngine

LatentRuntimeEngine(
    model_path: PathLike,
    execution_provider: ExecutionProvider = None,
    precision: ModelPrecision = None,
    tensorrt_timing_cache: PathLike | None = None,
    tensorrt_engine_cache: PathLike | None = None,
    device_id: int = 0,
    enable_cuda_graph: bool | None = None,
    cuda_stream: str | int | None = None,
    password: str | None = None,
    key_path: PathLike | None = None,
)

A Python wrapper around the C++ Latent Runtime Engine (LRE).

This class provides a Python API over the underlying C++ LRE implementation. The Python LRE can run inference on any tensor inputs that follow the DLPack protocol, i.e. tensor objects with a defined __dlpack__ method, such as NumPy arrays and PyTorch tensors. The returned outputs are likewise DLPack objects that can be ingested by common libraries such as NumPy and PyTorch.
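As a quick sanity check of the DLPack convention described above, the sketch below (plain NumPy, no pylre required) shows that NumPy arrays expose `__dlpack__` and that `np.from_dlpack` can ingest such an object:

```python
import numpy as np

# NumPy arrays implement the DLPack protocol via __dlpack__.
arr = np.arange(6, dtype=np.float32).reshape(2, 3)
print(hasattr(arr, "__dlpack__"))  # True

# Any library that understands DLPack (NumPy, PyTorch, ...) can ingest
# such an object; here we round-trip through NumPy itself.
roundtrip = np.from_dlpack(arr)
print(roundtrip.shape, roundtrip.dtype)  # (2, 3) float32
```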

Examples:

>>> import numpy as np
>>> from pylre import LatentRuntimeEngine
>>>
>>> # Initialize the LRE with a model
>>> lre = LatentRuntimeEngine("path/to/model.onnx", execution_provider="cpu")
>>>
>>> # Prepare input data using model's expected input shape and dtype
>>> input_shape = lre.input_shapes[0]  # Get the first input's shape
>>> input_dtype = lre.input_dtypes[0]  # Get the first input's data type
>>> # Create random data with the correct shape and dtype
>>> input_data = np.random.rand(*input_shape).astype(input_dtype)
>>>
>>> # Ensure output is in CPU memory for NumPy compatibility
>>> lre.set_cpu_output(True)
>>>
>>> # Run inference
>>> outputs = lre(input_data)
>>>
>>> # Process outputs
>>> for i, output in enumerate(outputs):
...     # Convert to a NumPy array using from_dlpack
...     np_output = np.from_dlpack(output)
...     print(f"Output {i} shape: {np_output.shape}, dtype: {np_output.dtype}")

Initialize a Latent Runtime Engine instance.

This constructor initializes a new LRE instance by loading a model from the
specified path. The model can be either an ONNX file (.onnx) or a TVM compiled
model (.so). The appropriate backend (ONNX or TVM) is selected automatically
based on the file extension or through fallback mechanisms if the extension
is not recognized.

Parameters:
model_path : Union[str, os.PathLike]
    Path to either an `.onnx` or `.so` artifact generated from LEIP Optimize.
execution_provider : Optional[str], default set from the optimized model.
    The provider to use for model execution. Possible values: "cpu",
    "cuda", "tensorrt".
precision : Optional[str], default="float32"
    Precision to use for inference (only for ONNX). Options depend on the precision set during
    model export. See the following compatibility tables:

    | **TRT** Precision at Export             | "float32" | "float16" | "int8" |
    | --------------------------------------- | --------- | --------- | ------ |
    | **float32**                             |     X     |     X     |        |
    | **int8**                                |           |           |    X   |

    | **non-TRT** Precision at Export         | "float32" | "float16" | "int8" |
    | --------------------------------------- | --------- | --------- | ------ |
    | **float32**                             |     X     |           |        |
    | **float16**                             |           |     X     |        |
    | **int8**                                |           |           |    X   |

tensorrt_timing_cache : Optional[os.PathLike], default="~/.cache/lre"
    A cache path for TensorRT timing data. Only applicable when using TensorRT.
tensorrt_engine_cache : Optional[os.PathLike], default="~/.cache/lre"
    A cache path for TensorRT engine files. Only applicable when using TensorRT.
enable_cuda_graph : Optional[bool], default=None
    Whether to enable CUDA graph optimization. When None, automatically enables
    for static models. CUDA graphs can significantly improve inference performance
    on NVIDIA GPUs by recording and replaying GPU operations, reducing CPU
    overhead. Only applicable when using CUDA or TensorRT execution
    providers.
cuda_stream : Optional[Union[str, int]], default=None
    CUDA stream to use for inference. Can be specified as either a string
    identifier or an integer stream ID. Only applicable when using the
    TensorRT execution provider. If None, the default CUDA stream is used.
device_id : Optional[int], default=0
    The ID of the device to use for inference.
    - For vanilla CPU memory, pinned memory, or managed memory, this is
      set to `0`.
    - For multi-GPU systems, allows selecting a specific GPU (e.g.,
      `0` for GPU 0).
password : Optional[str], default=None
    Password that was used to encrypt your model. If not specified, no
    password is required.
key_path : Optional[os.PathLike], default=None
    The path to a key file used for model encryption. If not specified,
    no key file is required.

Examples:

>>> # Initialize with an ONNX model using CPU
>>> lre = LatentRuntimeEngine("path/to/model.onnx", execution_provider="cpu")
>>>
>>> # Initialize with a TVM model
>>> lre = LatentRuntimeEngine("path/to/model.so")
>>>
>>> # Initialize with an ONNX model using TensorRT
>>> lre = LatentRuntimeEngine(
...     "path/to/model.onnx",
...     execution_provider="tensorrt",
...     precision="float16",
...     device_id=0,
... )
>>>
>>> # Initialize with an ONNX model on a CUDA stream
>>> import torch
>>> lre = LatentRuntimeEngine(
...     "path/to/model.onnx",
...     execution_provider="tensorrt",
...     cuda_stream=torch.cuda.Stream().cuda_stream,
... )

Raises:
FileNotFoundError
    If the specified model file does not exist.
RuntimeError
    If the model cannot be loaded with either TVM or ONNX backends.

Methods:

| Name | Description |
| --- | --- |
| `__call__` | Run inference and return outputs by calling the instance directly. |
| `get_metadata` | Get a dictionary of the model's metadata. |
| `get_output` | Get a specific tensor output by index from the last executed inference. |
| `get_outputs` | Get all output tensors from the last executed inference. |
| `infer` | Run inference on the provided input(s) and save outputs to internal buffers. |
| `set_cpu_output` | Configure whether output tensors should be placed in CPU memory. |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `input_dtypes` | `List[str]` | Get the data types of all input tensors expected by the model. |
| `input_shapes` | `List[Tuple[int, ...]]` | Get the shapes of all input tensors expected by the model. |
| `is_cpu_output` | `bool` | Check if the runtime's current output device is CPU. |
| `is_trt` | `bool` | Check if the runtime session is using TensorRT (only for ONNX). |
| `model_id` | `str` | Get the unique identifier (UUID) of the loaded model. |
| `model_precision` | `str` | Get the precision of the loaded model. |
| `number_inputs` | `int` | Get the number of input tensors expected by the model. |
| `number_outputs` | `int` | Get the number of output tensors produced by the model. |
| `output_dtypes` | `List[str]` | Get the data types of all output tensors produced by the model. |
| `output_shapes` | `List[Tuple[int, ...]]` | Get the shapes of all output tensors produced by the model. |
| `runtime_options` | `ONNXOptions \| TVMOptions` | Get the runtime options used for the current session. |

Attributes

input_dtypes property

input_dtypes: List[str]

Get the data types of all input tensors expected by the model.

This property returns a list of data type strings, one for each input tensor.

Returns:

| Type | Description |
| --- | --- |
| `List[str]` | A list of input tensor data types as strings (e.g., "float32", "int8"). |
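The dtype strings returned here map directly onto NumPy dtypes, which is convenient when allocating input buffers. A minimal sketch, using hypothetical dtype strings rather than values read from a live engine:

```python
import numpy as np

# Hypothetical dtype strings, as input_dtypes might report them:
input_dtypes = ["float32", "int8"]

# np.dtype accepts these strings directly.
numpy_dtypes = [np.dtype(s) for s in input_dtypes]
print(numpy_dtypes)  # [dtype('float32'), dtype('int8')]
```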

input_shapes property

input_shapes: List[Tuple[int, ...]]

Get the shapes of all input tensors expected by the model.

This property returns a list of shapes, where each shape is a tuple of integers representing the dimensions of an input tensor.

Returns:

| Type | Description |
| --- | --- |
| `List[Tuple[int, ...]]` | A list of input tensor shapes, each a tuple of integers. For example, `[(1, 3, 224, 224)]` for a model with a single input of shape (batch_size=1, channels=3, height=224, width=224). |
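Combined with `input_dtypes`, these shapes are enough to allocate correctly sized input buffers. A minimal sketch using hypothetical shape/dtype values in place of a live engine:

```python
import numpy as np

# Hypothetical values, as an engine might report them:
input_shapes = [(1, 3, 224, 224)]
input_dtypes = ["float32"]

# One zero-filled buffer per model input.
inputs = [np.zeros(shape, dtype=dtype)
          for shape, dtype in zip(input_shapes, input_dtypes)]
print(inputs[0].shape, inputs[0].dtype)  # (1, 3, 224, 224) float32
```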

is_cpu_output property

is_cpu_output: bool

Check if the runtime's current output device is CPU.

This property returns a boolean indicating whether the outputs from inference will be placed in CPU memory. This is particularly useful when working with libraries like NumPy that require tensors to be in CPU memory.

Returns:

| Type | Description |
| --- | --- |
| `bool` | True if outputs will be placed in CPU memory, False if they will remain on the device used for inference (e.g., GPU). |

See Also

set_cpu_output : Method to configure whether outputs should be in CPU memory.

is_trt property

is_trt: bool

Check if the runtime session is using TensorRT (only for ONNX).

This property returns a boolean indicating whether the current runtime session is using TensorRT as the execution provider. TensorRT is a high-performance deep learning inference optimizer and runtime for NVIDIA GPUs.

Returns:

| Type | Description |
| --- | --- |
| `bool` | True if the runtime session is using TensorRT, False otherwise. |

model_id property

model_id: str

Get the unique identifier (UUID) of the loaded model.

This property returns the UUID that was assigned to the model during compilation/export, which can be used to uniquely identify the model.

Returns:

| Type | Description |
| --- | --- |
| `str` | The UUID of the model as a string. |

model_precision property

model_precision: str

Get the precision of the loaded model.

This property returns the precision of the loaded model, which indicates the numerical format used for weights and computations (e.g., "float32", "float16", "int8").

Returns:

| Type | Description |
| --- | --- |
| `str` | The precision of the model as a string (e.g., "float32", "float16", "int8"). The available precision options depend on what was set during model compilation/export. |

Examples:

>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> precision = lre.model_precision
>>> print(f"Model is using {precision} precision")

number_inputs property

number_inputs: int

Get the number of input tensors expected by the model.

This property returns the count of input tensors that the model expects.

Returns:

| Type | Description |
| --- | --- |
| `int` | The number of input tensors. |

number_outputs property

number_outputs: int

Get the number of output tensors produced by the model.

This property returns the count of output tensors that the model produces.

Returns:

| Type | Description |
| --- | --- |
| `int` | The number of output tensors. |

output_dtypes property

output_dtypes: List[str]

Get the data types of all output tensors produced by the model.

This property returns a list of data type strings, one for each output tensor.

Returns:

| Type | Description |
| --- | --- |
| `List[str]` | A list of output tensor data types as strings (e.g., "float32", "int8"). |

output_shapes property

output_shapes: List[Tuple[int, ...]]

Get the shapes of all output tensors produced by the model.

This property returns a list of shapes, where each shape is a tuple of integers representing the dimensions of an output tensor.

Returns:

| Type | Description |
| --- | --- |
| `List[Tuple[int, ...]]` | A list of output tensor shapes, each a tuple of integers. For example, `[(1, 1000)]` for a classification model with 1000 classes. |
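For instance, a `(1, 1000)` classification output is typically post-processed with an argmax over the class axis. A sketch with synthetic data standing in for a real inference output:

```python
import numpy as np

# Synthetic stand-in for a (1, 1000) classifier output.
logits = np.linspace(0.0, 1.0, 1000, dtype=np.float32).reshape(1, 1000)
top_class = int(np.argmax(logits, axis=1)[0])
print(top_class)  # 999 (the largest value is at the last index)
```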

runtime_options property

runtime_options: ONNXOptions | TVMOptions

Get the runtime options used for the current session.

This property returns the options that were used to initialize the runtime engine, which may include settings like execution provider, precision, etc.

Returns:

| Type | Description |
| --- | --- |
| `Union[ONNXOptions, TVMOptions]` | The options object used to configure the runtime engine. The type depends on whether the model was loaded with ONNX or TVM. |

Functions

__call__

__call__(inputs: PyDLPack | List[PyDLPack]) -> List[PyDLPack]

Run inference and return outputs by calling the instance directly.

This method provides a convenient way to run inference and get outputs in a single call. It's equivalent to calling infer() followed by get_outputs().

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `inputs` | `Union[PyDLPack, List[PyDLPack]]` | Either a single DLPack-compatible tensor or a list of DLPack-compatible tensors. These tensors must implement the DLPack protocol (have a `__dlpack__` method). | required |

Returns:

| Type | Description |
| --- | --- |
| `List[PyDLPack]` | A list of output tensors as DLPack-compatible objects. |

Examples:

>>> import numpy as np
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> # Get input shape and dtype from the model
>>> input_shape = lre.input_shapes[0]
>>> input_dtype = lre.input_dtypes[0]
>>> # Create input tensor with the correct shape and dtype
>>> input_tensor = np.random.rand(*input_shape).astype(input_dtype)
>>> # Ensure output is in CPU memory for NumPy compatibility
>>> lre.set_cpu_output(True)
>>> outputs = lre(input_tensor)
>>> # Process outputs
>>> for i, output in enumerate(outputs):
...     np_output = np.from_dlpack(output)
...     print(f"Output {i} shape: {np_output.shape}, dtype: {np_output.dtype}")

get_metadata

get_metadata() -> Dict[str, Any]

Get a dictionary of the model's metadata.

This method retrieves metadata associated with the loaded model, which may include information such as model version, creation date, author, and other custom metadata fields.

Returns:

| Type | Description |
| --- | --- |
| `Dict[str, Any]` | Dictionary of metadata key-value pairs. The specific keys available depend on what was stored in the model during export/compilation. |

Examples:

>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> metadata = lre.get_metadata()
>>> # Access specific metadata fields
>>> if 'version' in metadata:
...     print(f"Model version: {metadata['version']}")

get_output

get_output(index: int) -> PyDLPackProtocol

Get a specific tensor output by index from the last executed inference.

This method retrieves a specific output tensor from the internal buffers populated by the most recent inference operation.

Parameters:

Name Type Description Default
index int

The index of the desired output tensor. Must be less than the total number of outputs (accessible via number_outputs property).

required

Returns:

| Type | Description |
| --- | --- |
| `PyDLPack` | The output tensor at the specified index as a DLPack-compatible object. |

Raises:

| Type | Description |
| --- | --- |
| `IndexError` | If the index is out of range. |

Examples:

>>> import numpy as np
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> # Get input shape and dtype from the model
>>> input_shape = lre.input_shapes[0]
>>> input_dtype = lre.input_dtypes[0]
>>> # Create input tensor with the correct shape and dtype
>>> input_tensor = np.random.rand(*input_shape).astype(input_dtype)
>>> # Ensure output is in CPU memory for NumPy compatibility
>>> lre.set_cpu_output(True)
>>> # Run inference
>>> lre.infer(input_tensor)
>>> # Get a specific output
>>> first_output = lre.get_output(0)
>>> # Convert to numpy array using from_dlpack
>>> np_output = np.from_dlpack(first_output)
>>> print(f"Output shape: {np_output.shape}, dtype: {np_output.dtype}")

get_outputs

get_outputs() -> List[PyDLPack]

Get all output tensors from the last executed inference.

This method retrieves all output tensors from the internal buffers populated by the most recent inference operation.

Returns:

| Type | Description |
| --- | --- |
| `List[PyDLPack]` | A list of all output tensors as DLPack-compatible objects. |

Examples:

>>> import numpy as np
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> # Get input shape and dtype from the model
>>> input_shape = lre.input_shapes[0]
>>> input_dtype = lre.input_dtypes[0]
>>> # Create input tensor with the correct shape and dtype
>>> input_tensor = np.random.rand(*input_shape).astype(input_dtype)
>>> # Ensure output is in CPU memory for NumPy compatibility
>>> lre.set_cpu_output(True)
>>> # Run inference
>>> lre.infer(input_tensor)
>>> # Get all outputs
>>> outputs = lre.get_outputs()
>>> # Process all outputs using from_dlpack
>>> for i, output in enumerate(outputs):
...     np_output = np.from_dlpack(output)
...     print(f"Output {i} shape: {np_output.shape}, dtype: {np_output.dtype}")

infer

infer(inputs: PyDLPack | List[PyDLPack]) -> None

Run inference on the provided input(s) and save outputs to internal buffers.

This method runs inference on the provided inputs and stores the results in internal buffers. The outputs can be retrieved using get_output() or get_outputs().

Parameters:

Name Type Description Default
inputs PyDLPack | List[PyDLPack]

Either a single DLPack-compatible tensor or a list/tuple of DLPack-compatible tensors. A DLPack-compatible tensor is any tensor that implements the DLPack protocol (has a __dlpack__ method).

required

Examples:

>>> import numpy as np
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> # Get input shape and dtype from the model
>>> input_shape = lre.input_shapes[0]
>>> input_dtype = lre.input_dtypes[0]
>>> # Create input tensor with the correct shape and dtype
>>> input_tensor = np.random.rand(*input_shape).astype(input_dtype)
>>> # Ensure output is in CPU memory for NumPy compatibility
>>> lre.set_cpu_output(True)
>>> lre.infer(input_tensor)
>>> # Get all outputs
>>> outputs = lre.get_outputs()
>>> for i, output in enumerate(outputs):
...     np_output = np.from_dlpack(output)
...     print(f"Output {i} shape: {np_output.shape}, dtype: {np_output.dtype}")
>>> # Or get a specific output
>>> first_output = lre.get_output(0)
>>> np_first_output = np.from_dlpack(first_output)
>>> print(f"First output shape: {np_first_output.shape}, dtype: {np_first_output.dtype}")

set_cpu_output

set_cpu_output(use_cpu: bool) -> None

Configure whether output tensors should be placed in CPU memory.

This method configures the output to be a CPU PyDLPack tensor if the inference device is CUDA. If the inference device is already set to CPU, this setting has no effect since the output is already on the CPU.

This is particularly useful when working with libraries like NumPy that require tensors to be in CPU memory for processing.

Parameters:

Name Type Description Default
use_cpu bool

If set to True, the output will be a CPU PyDLPack tensor when the inference device is CUDA. If set to False, the output will remain on the device used for inference.

required

Examples:

>>> import numpy as np
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> # Get input shape and dtype from the model
>>> input_shape = lre.input_shapes[0]
>>> input_dtype = lre.input_dtypes[0]
>>> # Create input tensor with the correct shape and dtype
>>> input_tensor = np.random.rand(*input_shape).astype(input_dtype)
>>> # Set outputs to be in CPU memory for NumPy compatibility
>>> lre.set_cpu_output(True)
>>> lre.infer(input_tensor)
>>> outputs = lre.get_outputs()
>>> # Now outputs can be directly used with numpy.from_dlpack
>>> for i, output in enumerate(outputs):
...     np_output = np.from_dlpack(output)
...     print(f"Output {i} shape: {np_output.shape}, dtype: {np_output.dtype}")