Latent Runtime Engine API
pylre.LatentRuntimeEngine
LatentRuntimeEngine(
model_path: PathLike,
execution_provider: ExecutionProvider = None,
precision: ModelPrecision = None,
tensorrt_timing_cache: PathLike | None = None,
tensorrt_engine_cache: PathLike | None = None,
device_id: int = 0,
enable_cuda_graph: bool | None = None,
cuda_stream: str | int | None = None,
password: str | None = None,
key_path: PathLike | None = None,
)
A Python wrapper around the C++ Latent Runtime Engine (LRE).
This class provides a Python API to the underlying C++ LRE
implementation. The Python LRE can run inference on any tensor inputs that
follow the DLPack protocol, i.e. tensor objects that define a
__dlpack__ method, such as NumPy arrays and PyTorch tensors. The returned
outputs are also DLPack objects that can be ingested by common libraries
like NumPy and PyTorch.
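The protocol contract above can be checked with NumPy alone; this minimal sketch uses no pylre objects and only illustrates what "DLPack-compatible" means for inputs and outputs:

```python
import numpy as np

# Any tensor following the DLPack protocol defines __dlpack__;
# NumPy arrays qualify, so they can be passed to the LRE directly.
x = np.arange(6, dtype=np.float32).reshape(2, 3)
assert hasattr(x, "__dlpack__")

# Consumers such as np.from_dlpack ingest any DLPack object,
# which is how LRE outputs are converted back to NumPy arrays.
y = np.from_dlpack(x)
assert y.shape == (2, 3) and y.dtype == np.float32
```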
Examples:
>>> import numpy as np
>>> from pylre import LatentRuntimeEngine
>>>
>>> # Initialize the LRE with a model
>>> lre = LatentRuntimeEngine("path/to/model.onnx", execution_provider="cpu")
>>>
>>> # Prepare input data using model's expected input shape and dtype
>>> input_shape = lre.input_shapes[0] # Get the first input's shape
>>> input_dtype = lre.input_dtypes[0] # Get the first input's data type
>>> # Create random data with the correct shape and dtype
>>> input_data = np.random.rand(*input_shape).astype(input_dtype)
>>>
>>> # Ensure output is in CPU memory for NumPy compatibility
>>> lre.set_cpu_output(True)
>>>
>>> # Run inference
>>> outputs = lre(input_data)
>>>
>>> # Process outputs
>>> for i, output in enumerate(outputs):
...     # Convert to numpy array using from_dlpack
...     np_output = np.from_dlpack(output)
...     print(f"Output {i} shape: {np_output.shape}, dtype: {np_output.dtype}")
Initialize a Latent Runtime Engine instance.
This constructor initializes a new LRE instance by loading a model from the
specified path. The model can be either an ONNX file (.onnx) or a TVM compiled
model (.so). The appropriate backend (ONNX or TVM) is selected automatically
based on the file extension or through fallback mechanisms if the extension
is not recognized.
Parameters
model_path: Union[str, os.PathLike]
Path to either an `.onnx` or `.so` artifact generated from LEIP Optimize.
execution_provider : Optional[str], default=None
The provider to use for model execution. If None, the provider is taken
from the optimized model. Possible values: "cpu", "cuda", "tensorrt".
precision : Optional[str], default="float32"
Precision to use for inference (only for ONNX). Options depend on the precision set during
model export. See the following compatibility tables:
| **TRT** Precision at Export | "float32" | "float16" | "int8" |
| --------------------------- | --------- | --------- | ------ |
| **float32**                 | X         | X         |        |
| **int8**                    |           |           | X      |

| **non-TRT** Precision at Export | "float32" | "float16" | "int8" |
| ------------------------------- | --------- | --------- | ------ |
| **float32**                     | X         |           |        |
| **float16**                     |           | X         |        |
| **int8**                        |           |           | X      |
tensorrt_timing_cache : Optional[os.PathLike], default="~/.cache/lre"
A cache path for TensorRT timing data. Only applicable when using TensorRT.
tensorrt_engine_cache : Optional[os.PathLike], default="~/.cache/lre"
A cache path for TensorRT engine files. Only applicable when using TensorRT.
enable_cuda_graph: Optional[bool], default=None
Whether to enable CUDA graph optimization. When None, automatically enables
for static models. CUDA graphs can significantly improve inference performance
on NVIDIA GPUs by recording and replaying GPU operations, reducing CPU
overhead. Only applicable when using CUDA or TensorRT execution
providers.
cuda_stream: Optional[Union[str, int]], default=None
CUDA stream to use for inference. Can be specified as either a string
identifier or an integer stream ID. Only applicable when using the
TensorRT execution provider. If None, the default CUDA stream is used.
device_id : Optional[int], default=0
The ID of the device to use for inference.
- For vanilla CPU memory, pinned memory, or managed memory, this is
set to `0`.
- For multi-GPU systems, allows selecting a specific GPU (e.g.,
`0` for GPU 0).
password : Optional[str], default=None
Password that was used to encrypt your model. If not specified, no
password is required.
key_path : Optional[os.PathLike], default=None
The path to a key file used for model encryption. If not specified,
no key file is required.
Examples:
>>> # Initialize with an ONNX model using CPU
>>> lre = LatentRuntimeEngine("path/to/model.onnx", execution_provider="cpu")
>>>
>>> # Initialize with a TVM model
>>> lre = LatentRuntimeEngine("path/to/model.so")
>>>
>>> # Initialize with an ONNX model using TensorRT
>>> lre = LatentRuntimeEngine(
...     "path/to/model.onnx",
...     execution_provider="tensorrt",
...     precision="float16",
...     device_id=0,
... )
>>>
>>> # Initialize with an ONNX model on a CUDA stream
>>> import torch
>>> lre = LatentRuntimeEngine(
...     "path/to/model.onnx",
...     execution_provider="tensorrt",
...     cuda_stream=torch.cuda.Stream().cuda_stream,
... )
Raises
FileNotFoundError
If the specified model file does not exist.
RuntimeError
If the model cannot be loaded with either TVM or ONNX backends.
Methods:

| Name | Description |
|---|---|
| `__call__` | Run inference and return outputs by calling the instance directly. |
| `get_metadata` | Get a dictionary of the model's metadata. |
| `get_output` | Get a specific tensor output by index from the last executed inference. |
| `get_outputs` | Get all output tensors from the last executed inference. |
| `infer` | Run inference on the provided input(s) and save outputs to internal buffers. |
| `set_cpu_output` | Configure whether output tensors should be placed in CPU memory. |
Attributes:

| Name | Type | Description |
|---|---|---|
| `input_dtypes` | `List[str]` | Get the data types of all input tensors expected by the model. |
| `input_shapes` | `List[Tuple[int, ...]]` | Get the shapes of all input tensors expected by the model. |
| `is_cpu_output` | `bool` | Check if the runtime's current output device is CPU. |
| `is_trt` | `bool` | Check if the runtime session is using TensorRT (only for ONNX). |
| `model_id` | `str` | Get the unique identifier (UUID) of the loaded model. |
| `model_precision` | `str` | Get the precision of the loaded model. |
| `number_inputs` | `int` | Get the number of input tensors expected by the model. |
| `number_outputs` | `int` | Get the number of output tensors produced by the model. |
| `output_dtypes` | `List[str]` | Get the data types of all output tensors produced by the model. |
| `output_shapes` | `List[Tuple[int, ...]]` | Get the shapes of all output tensors produced by the model. |
| `runtime_options` | `Union[ONNXOptions, TVMOptions]` | Get the runtime options used for the current session. |
Attributes
input_dtypes
property
input_dtypes: List[str]
Get the data types of all input tensors expected by the model.
This property returns a list of data type strings, one for each input tensor.
Returns:

| Type | Description |
|---|---|
| `List[str]` | A list of input tensor data types as strings (e.g., "float32", "int8"). |
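The dtype strings returned here map directly onto NumPy dtype names, which is why the examples in this document can pass them straight to `.astype()`. A standalone check, assuming strings like those listed above:

```python
import numpy as np

# dtype strings in the style returned by input_dtypes
for name in ["float32", "float16", "int8"]:
    dt = np.dtype(name)      # accepted directly by NumPy
    assert dt.name == name   # round-trips to the same string
```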
input_shapes
property
input_shapes: List[Tuple[int, ...]]
Get the shapes of all input tensors expected by the model.
This property returns a list of shapes, where each shape is a tuple of integers representing the dimensions of an input tensor.
Returns:

| Type | Description |
|---|---|
| `List[Tuple[int, ...]]` | A list of input tensor shapes, each a tuple of integers. For example, [(1, 3, 224, 224)] for a model with a single input of shape (batch_size=1, channels=3, height=224, width=224). |
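For models with several inputs, `input_shapes` and `input_dtypes` line up index by index, so one tensor per model input can be built in a single pass. A sketch using hypothetical stand-in values rather than a live engine:

```python
import numpy as np

# Stand-ins for lre.input_shapes / lre.input_dtypes on a two-input model
input_shapes = [(1, 3, 224, 224), (1, 10)]
input_dtypes = ["float32", "int8"]

# One random tensor per model input, matching shape and dtype
inputs = [
    np.random.rand(*shape).astype(dtype)
    for shape, dtype in zip(input_shapes, input_dtypes)
]
assert [t.shape for t in inputs] == input_shapes
assert [t.dtype.name for t in inputs] == input_dtypes
```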
is_cpu_output
property
is_cpu_output: bool
Check if the runtime's current output device is CPU.
This property returns a boolean indicating whether the outputs from inference will be placed in CPU memory. This is particularly useful when working with libraries like NumPy that require tensors to be in CPU memory.
Returns:

| Type | Description |
|---|---|
| `bool` | True if outputs will be placed in CPU memory, False if they will remain on the device used for inference (e.g., GPU). |
See Also
set_cpu_output : Method to configure whether outputs should be in CPU memory.
is_trt
property
is_trt: bool
Check if the runtime session is using TensorRT (only for ONNX).
This property returns a boolean indicating whether the current runtime session is using TensorRT as the execution provider. TensorRT is a high-performance deep learning inference optimizer and runtime for NVIDIA GPUs.
Returns:

| Type | Description |
|---|---|
| `bool` | True if the runtime session is using TensorRT, False otherwise. |
model_id
property
model_id: str
Get the unique identifier (UUID) of the loaded model.
This property returns the UUID that was assigned to the model during compilation/export, which can be used to uniquely identify the model.
Returns:

| Type | Description |
|---|---|
| `str` | The UUID of the model as a string. |
model_precision
property
model_precision: str
Get the precision of the loaded model.
This property returns the precision of the loaded model, which indicates the numerical format used for weights and computations (e.g., "float32", "float16", "int8").
Returns:

| Type | Description |
|---|---|
| `str` | The precision of the model as a string (e.g., "float32", "float16", "int8"). The available precision options depend on what was set during model compilation/export. |
Examples:
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> precision = lre.model_precision
>>> print(f"Model is using {precision} precision")
number_inputs
property
number_inputs: int
Get the number of input tensors expected by the model.
This property returns the count of input tensors that the model expects.
Returns:

| Type | Description |
|---|---|
| `int` | The number of input tensors. |
number_outputs
property
number_outputs: int
Get the number of output tensors produced by the model.
This property returns the count of output tensors that the model produces.
Returns:

| Type | Description |
|---|---|
| `int` | The number of output tensors. |
output_dtypes
property
output_dtypes: List[str]
Get the data types of all output tensors produced by the model.
This property returns a list of data type strings, one for each output tensor.
Returns:

| Type | Description |
|---|---|
| `List[str]` | A list of output tensor data types as strings (e.g., "float32", "int8"). |
output_shapes
property
output_shapes: List[Tuple[int, ...]]
Get the shapes of all output tensors produced by the model.
This property returns a list of shapes, where each shape is a tuple of integers representing the dimensions of an output tensor.
Returns:

| Type | Description |
|---|---|
| `List[Tuple[int, ...]]` | A list of output tensor shapes, each a tuple of integers. For example, [(1, 1000)] for a classification model with 1000 classes. |
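A (1, 1000) classification output like the example above typically holds raw logits; a common post-processing step is a softmax over the class axis. A sketch using a random stand-in array in place of a real engine output:

```python
import numpy as np

# Stand-in for a (1, 1000) classification output converted via np.from_dlpack
logits = np.random.rand(1, 1000).astype(np.float32)

# Numerically stable softmax over the class axis
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

top1 = int(probs.argmax(axis=1)[0])  # predicted class index
assert probs.shape == (1, 1000)
assert abs(float(probs.sum()) - 1.0) < 1e-4
```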
runtime_options
property
runtime_options: ONNXOptions | TVMOptions
Get the runtime options used for the current session.
This property returns the options that were used to initialize the runtime engine, which may include settings like execution provider, precision, etc.
Returns:

| Type | Description |
|---|---|
| `Union[ONNXOptions, TVMOptions]` | The options object used to configure the runtime engine. The type depends on whether the model was loaded with ONNX or TVM. |
Functions
__call__
__call__(inputs: PyDLPack | List[PyDLPack]) -> List[PyDLPack]
Run inference and return outputs by calling the instance directly.
This method provides a convenient way to run inference and get outputs in a single call.
It's equivalent to calling infer() followed by get_outputs().
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `Union[PyDLPack, List[PyDLPack]]` | Either a single DLPack-compatible tensor or a list of DLPack-compatible tensors. These tensors must implement the DLPack protocol (have a `__dlpack__` method). | required |

Returns:

| Type | Description |
|---|---|
| `List[PyDLPack]` | A list of output tensors as DLPack-compatible objects. |
Examples:
>>> import numpy as np
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> # Get input shape and dtype from the model
>>> input_shape = lre.input_shapes[0]
>>> input_dtype = lre.input_dtypes[0]
>>> # Create input tensor with the correct shape and dtype
>>> input_tensor = np.random.rand(*input_shape).astype(input_dtype)
>>> # Ensure output is in CPU memory for NumPy compatibility
>>> lre.set_cpu_output(True)
>>> outputs = lre(input_tensor)
>>> # Process outputs
>>> for i, output in enumerate(outputs):
...     np_output = np.from_dlpack(output)
...     print(f"Output {i} shape: {np_output.shape}, dtype: {np_output.dtype}")
get_metadata
get_metadata() -> Dict[str, Any]
Get a dictionary of the model's metadata.
This method retrieves metadata associated with the loaded model, which may include information such as model version, creation date, author, and other custom metadata fields.
Returns:

| Type | Description |
|---|---|
| `Dict[str, Any]` | Dictionary of metadata key-value pairs. The specific keys available depend on what was stored in the model during export/compilation. |
Examples:
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> metadata = lre.get_metadata()
>>> # Access specific metadata fields
>>> if 'version' in metadata:
...     print(f"Model version: {metadata['version']}")
get_output
get_output(index: int) -> PyDLPack
Get a specific tensor output by index from the last executed inference.
This method retrieves a specific output tensor from the internal buffers populated by the most recent inference operation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `index` | `int` | The index of the desired output tensor. Must be less than the total number of outputs (accessible via `number_outputs`). | required |

Returns:

| Type | Description |
|---|---|
| `PyDLPack` | The output tensor at the specified index as a DLPack-compatible object. |

Raises:

| Type | Description |
|---|---|
| `IndexError` | If the index is out of range. |
Examples:
>>> import numpy as np
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> # Get input shape and dtype from the model
>>> input_shape = lre.input_shapes[0]
>>> input_dtype = lre.input_dtypes[0]
>>> # Create input tensor with the correct shape and dtype
>>> input_tensor = np.random.rand(*input_shape).astype(input_dtype)
>>> # Ensure output is in CPU memory for NumPy compatibility
>>> lre.set_cpu_output(True)
>>> # Run inference
>>> lre.infer(input_tensor)
>>> # Get a specific output
>>> first_output = lre.get_output(0)
>>> # Convert to numpy array using from_dlpack
>>> np_output = np.from_dlpack(first_output)
>>> print(f"Output shape: {np_output.shape}, dtype: {np_output.dtype}")
get_outputs
get_outputs() -> List[PyDLPack]
Get all output tensors from the last executed inference.
This method retrieves all output tensors from the internal buffers populated by the most recent inference operation.
Returns:

| Type | Description |
|---|---|
| `List[PyDLPack]` | A list of all output tensors as DLPack-compatible objects. |
Examples:
>>> import numpy as np
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> # Get input shape and dtype from the model
>>> input_shape = lre.input_shapes[0]
>>> input_dtype = lre.input_dtypes[0]
>>> # Create input tensor with the correct shape and dtype
>>> input_tensor = np.random.rand(*input_shape).astype(input_dtype)
>>> # Ensure output is in CPU memory for NumPy compatibility
>>> lre.set_cpu_output(True)
>>> # Run inference
>>> lre.infer(input_tensor)
>>> # Get all outputs
>>> outputs = lre.get_outputs()
>>> # Process all outputs using from_dlpack
>>> for i, output in enumerate(outputs):
...     np_output = np.from_dlpack(output)
...     print(f"Output {i} shape: {np_output.shape}, dtype: {np_output.dtype}")
infer
infer(inputs: PyDLPack | List[PyDLPack]) -> None
Run inference on the provided input(s) and save outputs to internal buffers.
This method runs inference on the provided inputs and stores the results in
internal buffers. The outputs can be retrieved using get_output() or get_outputs().
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `Union[PyDLPack, List[PyDLPack]]` | Either a single DLPack-compatible tensor or a list/tuple of DLPack-compatible tensors. A DLPack-compatible tensor is any tensor that implements the DLPack protocol (has a `__dlpack__` method). | required |
Examples:
>>> import numpy as np
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> # Get input shape and dtype from the model
>>> input_shape = lre.input_shapes[0]
>>> input_dtype = lre.input_dtypes[0]
>>> # Create input tensor with the correct shape and dtype
>>> input_tensor = np.random.rand(*input_shape).astype(input_dtype)
>>> # Ensure output is in CPU memory for NumPy compatibility
>>> lre.set_cpu_output(True)
>>> lre.infer(input_tensor)
>>> # Get all outputs
>>> outputs = lre.get_outputs()
>>> for i, output in enumerate(outputs):
...     np_output = np.from_dlpack(output)
...     print(f"Output {i} shape: {np_output.shape}, dtype: {np_output.dtype}")
>>> # Or get a specific output
>>> first_output = lre.get_output(0)
>>> np_first_output = np.from_dlpack(first_output)
>>> print(f"First output shape: {np_first_output.shape}, dtype: {np_first_output.dtype}")
set_cpu_output
set_cpu_output(use_cpu: bool) -> None
Configure whether output tensors should be placed in CPU memory.
This method configures the output to be a CPU PyDLPack tensor if the inference device is CUDA. If the inference device is already set to CPU, this setting has no effect since the output is already on the CPU.
This is particularly useful when working with libraries like NumPy that require tensors to be in CPU memory for processing.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `use_cpu` | `bool` | If set to True, the output will be a CPU PyDLPack tensor when the inference device is CUDA. If set to False, the output will remain on the device used for inference. | required |
Examples:
>>> import numpy as np
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> # Get input shape and dtype from the model
>>> input_shape = lre.input_shapes[0]
>>> input_dtype = lre.input_dtypes[0]
>>> # Create input tensor with the correct shape and dtype
>>> input_tensor = np.random.rand(*input_shape).astype(input_dtype)
>>> # Set outputs to be in CPU memory for NumPy compatibility
>>> lre.set_cpu_output(True)
>>> lre.infer(input_tensor)
>>> outputs = lre.get_outputs()
>>> # Now outputs can be directly used with numpy.from_dlpack
>>> for i, output in enumerate(outputs):
...     np_output = np.from_dlpack(output)
...     print(f"Output {i} shape: {np_output.shape}, dtype: {np_output.dtype}")