Latent Runtime Engine API
pylre.LatentRuntimeEngine
LatentRuntimeEngine(
model_path: PathLike,
execution_provider: ExecutionProvider = None,
precision: ModelPrecision = None,
tensorrt_timing_cache: PathLike | None = None,
tensorrt_engine_cache: PathLike | None = None,
device_id: int = 0,
enable_cuda_graph: bool | None = None,
cuda_stream: str | int | None = None,
password: str | None = None,
key_path: PathLike | None = None,
)
A Python wrapper around the C++ Latent Runtime Engine (LRE).
This class provides a Python API to the underlying C++ LRE
implementation. The Python LRE can run inference on any tensor inputs that
follow the DLPack protocol, i.e. tensor objects that define a
__dlpack__ method, such as NumPy arrays and PyTorch tensors. The returned
outputs are also DLPack objects that can be ingested by common libraries
like NumPy and PyTorch.
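The protocol contract above can be checked with NumPy alone; this minimal sketch uses no pylre objects and only illustrates what "DLPack-compatible" means for inputs and outputs:

```python
import numpy as np

# Any tensor following the DLPack protocol defines __dlpack__;
# NumPy arrays qualify, so they can be passed to the LRE directly.
x = np.arange(6, dtype=np.float32).reshape(2, 3)
assert hasattr(x, "__dlpack__")

# Consumers such as np.from_dlpack ingest any DLPack object,
# which is how LRE outputs are converted back to NumPy arrays.
y = np.from_dlpack(x)
assert y.shape == (2, 3) and y.dtype == np.float32
```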
Examples:
>>> import numpy as np
>>> from pylre import LatentRuntimeEngine
>>>
>>> # Initialize the LRE with a model
>>> lre = LatentRuntimeEngine("path/to/model.onnx", execution_provider="cpu")
>>>
>>> # Prepare input data using model's expected input shape and dtype
>>> input_shape = lre.input_shapes[0] # Get the first input's shape
>>> input_dtype = lre.input_dtypes[0] # Get the first input's data type
>>> # Create random data with the correct shape and dtype
>>> input_data = np.random.rand(*input_shape).astype(input_dtype)
>>>
>>> # Ensure output is in CPU memory for NumPy compatibility
>>> lre.set_cpu_output(True)
>>>
>>> # Run inference
>>> outputs = lre(input_data)
>>>
>>> # Process outputs
>>> for i, output in enumerate(outputs):
...     # Convert to numpy array using from_dlpack
...     np_output = np.from_dlpack(output)
...     print(f"Output {i} shape: {np_output.shape}, dtype: {np_output.dtype}")
Initialize a Latent Runtime Engine instance.
This constructor initializes a new LRE instance by loading a model from the
specified path. The model can be either an ONNX file (.onnx) or a TVM compiled
model (.so). The appropriate backend (ONNX or TVM) is selected automatically
based on the file extension or through fallback mechanisms if the extension
is not recognized.
Parameters
model_path: Union[str, os.PathLike]
Path to either an `.onnx` or `.so` artifact generated from LEIP Optimize.
execution_provider : Optional[str], default=None
The provider to use for model execution. If None, the provider is taken
from the optimized model. Possible values: "cpu", "cuda", "tensorrt".
precision : Optional[str], default="float32"
Precision to use for inference (only for ONNX). Options depend on the precision set during
model export. See the following compatibility tables:
| **TRT** Precision at Export | "float32" | "float16" | "int8" |
| --------------------------- | --------- | --------- | ------ |
| **float32**                 | X         | X         |        |
| **int8**                    |           |           | X      |

| **non-TRT** Precision at Export | "float32" | "float16" | "int8" |
| ------------------------------- | --------- | --------- | ------ |
| **float32**                     | X         |           |        |
| **float16**                     |           | X         |        |
| **int8**                        |           |           | X      |
tensorrt_timing_cache : Optional[os.PathLike], default="~/.cache/lre"
A cache path for TensorRT timing data. Only applicable when using TensorRT.
tensorrt_engine_cache : Optional[os.PathLike], default="~/.cache/lre"
A cache path for TensorRT engine files. Only applicable when using TensorRT.
enable_cuda_graph: Optional[bool], default=None
Whether to enable CUDA graph optimization. When None, automatically enables
for static models. CUDA graphs can significantly improve inference performance
on NVIDIA GPUs by recording and replaying GPU operations, reducing CPU
overhead. Only applicable when using CUDA or TensorRT execution
providers.
cuda_stream: Optional[Union[str, int]], default=None
CUDA stream to use for inference. Can be specified as either a string
identifier or an integer stream ID. Only applicable when using the
TensorRT execution provider. If None, the default CUDA stream is used.
device_id : Optional[int], default=0
The ID of the device to use for inference.
- For vanilla CPU memory, pinned memory, or managed memory, this is
set to `0`.
- For multi-GPU systems, allows selecting a specific GPU (e.g.,
`0` for GPU 0).
password : Optional[str], default=None
Password that was used to encrypt your model. If not specified, no
password is required.
key_path : Optional[os.PathLike], default=None
The path to a key file used for model encryption. If not specified,
no key file is required.
Examples:
>>> # Initialize with an ONNX model using CPU
>>> lre = LatentRuntimeEngine("path/to/model.onnx", execution_provider="cpu")
>>>
>>> # Initialize with a TVM model
>>> lre = LatentRuntimeEngine("path/to/model.so")
>>>
>>> # Initialize with an ONNX model using TensorRT
>>> lre = LatentRuntimeEngine(
...     "path/to/model.onnx",
...     execution_provider="tensorrt",
...     precision="float16",
...     device_id=0,
... )
>>>
>>> # Initialize with an ONNX model on a CUDA stream
>>> import torch
>>> lre = LatentRuntimeEngine(
...     "path/to/model.onnx",
...     execution_provider="tensorrt",
...     cuda_stream=torch.cuda.Stream().cuda_stream,
... )
Raises
FileNotFoundError
If the specified model file does not exist.
RuntimeError
If the model cannot be loaded with either TVM or ONNX backends.
Methods:

| Name | Description |
|---|---|
| `__call__` | Run inference and return outputs by calling the instance directly. |
| `get_metadata` | Get a dictionary of the model's metadata. |
| `get_output` | Get a specific tensor output by index from the last executed inference. |
| `get_outputs` | Get all output tensors from the last executed inference. |
| `infer` | Run inference on the provided input(s) and save outputs to internal buffers. |
| `set_cpu_output` | Configure whether output tensors should be placed in CPU memory. |
Attributes:

| Name | Type | Description |
|---|---|---|
| `input_dtypes` | `List[str]` | Get the data types of all input tensors expected by the model. |
| `input_shapes` | `List[Tuple[int, ...]]` | Get the shapes of all input tensors expected by the model. |
| `is_cpu_output` | `bool` | Check if the runtime's current output device is CPU. |
| `is_trt` | `bool` | Check if the runtime session is using TensorRT (only for ONNX). |
| `model_id` | `str` | Get the unique identifier (UUID) of the loaded model. |
| `model_precision` | `str` | Get the precision of the loaded model. |
| `number_inputs` | `int` | Get the number of input tensors expected by the model. |
| `number_outputs` | `int` | Get the number of output tensors produced by the model. |
| `output_dtypes` | `List[str]` | Get the data types of all output tensors produced by the model. |
| `output_shapes` | `List[Tuple[int, ...]]` | Get the shapes of all output tensors produced by the model. |
| `runtime_options` | `Union[ONNXOptions, TVMOptions]` | Get the runtime options used for the current session. |
Attributes
input_dtypes
property
input_dtypes: List[str]
Get the data types of all input tensors expected by the model.
This property returns a list of data type strings, one for each input tensor.
Returns:

| Type | Description |
|---|---|
| `List[str]` | A list of input tensor data types as strings (e.g., "float32", "int8"). |
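The dtype strings returned here map directly onto NumPy dtype names, which is why the examples in this document can pass them straight to `.astype()`. A standalone check, assuming strings like those listed above:

```python
import numpy as np

# dtype strings in the style returned by input_dtypes
for name in ["float32", "float16", "int8"]:
    dt = np.dtype(name)      # accepted directly by NumPy
    assert dt.name == name   # round-trips to the same string
```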
input_shapes
property
input_shapes: List[Tuple[int, ...]]
Get the shapes of all input tensors expected by the model.
This property returns a list of shapes, where each shape is a tuple of integers representing the dimensions of an input tensor.
Returns:

| Type | Description |
|---|---|
| `List[Tuple[int, ...]]` | A list of input tensor shapes, each a tuple of integers. For example, [(1, 3, 224, 224)] for a model with a single input of shape (batch_size=1, channels=3, height=224, width=224). |
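For models with several inputs, `input_shapes` and `input_dtypes` line up index by index, so one tensor per model input can be built in a single pass. A sketch using hypothetical stand-in values rather than a live engine:

```python
import numpy as np

# Stand-ins for lre.input_shapes / lre.input_dtypes on a two-input model
input_shapes = [(1, 3, 224, 224), (1, 10)]
input_dtypes = ["float32", "int8"]

# One random tensor per model input, matching shape and dtype
inputs = [
    np.random.rand(*shape).astype(dtype)
    for shape, dtype in zip(input_shapes, input_dtypes)
]
assert [t.shape for t in inputs] == input_shapes
assert [t.dtype.name for t in inputs] == input_dtypes
```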
is_cpu_output
property
is_cpu_output: bool
Check if the runtime's current output device is CPU.
This property returns a boolean indicating whether the outputs from inference will be placed in CPU memory. This is particularly useful when working with libraries like NumPy that require tensors to be in CPU memory.
Returns:

| Type | Description |
|---|---|
| `bool` | True if outputs will be placed in CPU memory, False if they will remain on the device used for inference (e.g., GPU). |
See Also
set_cpu_output : Method to configure whether outputs should be in CPU memory.
is_trt
property
is_trt: bool
Check if the runtime session is using TensorRT (only for ONNX).
This property returns a boolean indicating whether the current runtime session is using TensorRT as the execution provider. TensorRT is a high-performance deep learning inference optimizer and runtime for NVIDIA GPUs.
Returns:

| Type | Description |
|---|---|
| `bool` | True if the runtime session is using TensorRT, False otherwise. |
model_id
property
model_id: str
Get the unique identifier (UUID) of the loaded model.
This property returns the UUID that was assigned to the model during compilation/export, which can be used to uniquely identify the model.
Returns:

| Type | Description |
|---|---|
| `str` | The UUID of the model as a string. |
model_precision
property
model_precision: str
Get the precision of the loaded model.
This property returns the precision of the loaded model, which indicates the numerical format used for weights and computations (e.g., "float32", "float16", "int8").
Returns:

| Type | Description |
|---|---|
| `str` | The precision of the model as a string (e.g., "float32", "float16", "int8"). The available precision options depend on what was set during model compilation/export. |
Examples:
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> precision = lre.model_precision
>>> print(f"Model is using {precision} precision")
number_inputs
property
number_inputs: int
Get the number of input tensors expected by the model.
This property returns the count of input tensors that the model expects.
Returns:

| Type | Description |
|---|---|
| `int` | The number of input tensors. |
number_outputs
property
number_outputs: int
Get the number of output tensors produced by the model.
This property returns the count of output tensors that the model produces.
Returns:

| Type | Description |
|---|---|
| `int` | The number of output tensors. |
output_dtypes
property
output_dtypes: List[str]
Get the data types of all output tensors produced by the model.
This property returns a list of data type strings, one for each output tensor.
Returns:

| Type | Description |
|---|---|
| `List[str]` | A list of output tensor data types as strings (e.g., "float32", "int8"). |
output_shapes
property
output_shapes: List[Tuple[int, ...]]
Get the shapes of all output tensors produced by the model.
This property returns a list of shapes, where each shape is a tuple of integers representing the dimensions of an output tensor.
Returns:

| Type | Description |
|---|---|
| `List[Tuple[int, ...]]` | A list of output tensor shapes, each a tuple of integers. For example, [(1, 1000)] for a classification model with 1000 classes. |
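A (1, 1000) classification output like the example above typically holds raw logits; a common post-processing step is a softmax over the class axis. A sketch using a random stand-in array in place of a real engine output:

```python
import numpy as np

# Stand-in for a (1, 1000) classification output converted via np.from_dlpack
logits = np.random.rand(1, 1000).astype(np.float32)

# Numerically stable softmax over the class axis
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

top1 = int(probs.argmax(axis=1)[0])  # predicted class index
assert probs.shape == (1, 1000)
assert abs(float(probs.sum()) - 1.0) < 1e-4
```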
runtime_options
property
runtime_options: ONNXOptions | TVMOptions
Get the runtime options used for the current session.
This property returns the options that were used to initialize the runtime engine, which may include settings like execution provider, precision, etc.
Returns:

| Type | Description |
|---|---|
| `Union[ONNXOptions, TVMOptions]` | The options object used to configure the runtime engine. The type depends on whether the model was loaded with ONNX or TVM. |
Functions
__call__
__call__(inputs: PyDLPack | List[PyDLPack]) -> List[PyDLPack]
Run inference and return outputs by calling the instance directly.
This method provides a convenient way to run inference and get outputs in a single call.
It's equivalent to calling infer() followed by get_outputs().
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `Union[PyDLPack, List[PyDLPack]]` | Either a single DLPack-compatible tensor or a list of DLPack-compatible tensors. These tensors must implement the DLPack protocol (have a `__dlpack__` method). | required |

Returns:

| Type | Description |
|---|---|
| `List[PyDLPack]` | A list of output tensors as DLPack-compatible objects. |
Examples:
>>> import numpy as np
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> # Get input shape and dtype from the model
>>> input_shape = lre.input_shapes[0]
>>> input_dtype = lre.input_dtypes[0]
>>> # Create input tensor with the correct shape and dtype
>>> input_tensor = np.random.rand(*input_shape).astype(input_dtype)
>>> # Ensure output is in CPU memory for NumPy compatibility
>>> lre.set_cpu_output(True)
>>> outputs = lre(input_tensor)
>>> # Process outputs
>>> for i, output in enumerate(outputs):
...     np_output = np.from_dlpack(output)
...     print(f"Output {i} shape: {np_output.shape}, dtype: {np_output.dtype}")
get_metadata
get_metadata() -> Dict[str, Any]
Get a dictionary of the model's metadata.
This method retrieves metadata associated with the loaded model, which may include information such as model version, creation date, author, and other custom metadata fields.
Returns:

| Type | Description |
|---|---|
| `Dict[str, Any]` | Dictionary of metadata key-value pairs. The specific keys available depend on what was stored in the model during export/compilation. |
Examples:
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> metadata = lre.get_metadata()
>>> # Access specific metadata fields
>>> if 'version' in metadata:
...     print(f"Model version: {metadata['version']}")
get_output
get_output(index: int) -> PyDLPack
Get a specific tensor output by index from the last executed inference.
This method retrieves a specific output tensor from the internal buffers populated by the most recent inference operation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `index` | `int` | The index of the desired output tensor. Must be less than the total number of outputs (accessible via `number_outputs`). | required |

Returns:

| Type | Description |
|---|---|
| `PyDLPack` | The output tensor at the specified index as a DLPack-compatible object. |

Raises:

| Type | Description |
|---|---|
| `IndexError` | If the index is out of range. |
Examples:
>>> import numpy as np
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> # Get input shape and dtype from the model
>>> input_shape = lre.input_shapes[0]
>>> input_dtype = lre.input_dtypes[0]
>>> # Create input tensor with the correct shape and dtype
>>> input_tensor = np.random.rand(*input_shape).astype(input_dtype)
>>> # Ensure output is in CPU memory for NumPy compatibility
>>> lre.set_cpu_output(True)
>>> # Run inference
>>> lre.infer(input_tensor)
>>> # Get a specific output
>>> first_output = lre.get_output(0)
>>> # Convert to numpy array using from_dlpack
>>> np_output = np.from_dlpack(first_output)
>>> print(f"Output shape: {np_output.shape}, dtype: {np_output.dtype}")
get_outputs
get_outputs() -> List[PyDLPack]
Get all output tensors from the last executed inference.
This method retrieves all output tensors from the internal buffers populated by the most recent inference operation.
Returns:

| Type | Description |
|---|---|
| `List[PyDLPack]` | A list of all output tensors as DLPack-compatible objects. |
Examples:
>>> import numpy as np
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> # Get input shape and dtype from the model
>>> input_shape = lre.input_shapes[0]
>>> input_dtype = lre.input_dtypes[0]
>>> # Create input tensor with the correct shape and dtype
>>> input_tensor = np.random.rand(*input_shape).astype(input_dtype)
>>> # Ensure output is in CPU memory for NumPy compatibility
>>> lre.set_cpu_output(True)
>>> # Run inference
>>> lre.infer(input_tensor)
>>> # Get all outputs
>>> outputs = lre.get_outputs()
>>> # Process all outputs using from_dlpack
>>> for i, output in enumerate(outputs):
...     np_output = np.from_dlpack(output)
...     print(f"Output {i} shape: {np_output.shape}, dtype: {np_output.dtype}")
infer
infer(inputs: PyDLPack | List[PyDLPack]) -> None
Run inference on the provided input(s) and save outputs to internal buffers.
This method runs inference on the provided inputs and stores the results in
internal buffers. The outputs can be retrieved using get_output() or get_outputs().
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `Union[PyDLPack, List[PyDLPack]]` | Either a single DLPack-compatible tensor or a list/tuple of DLPack-compatible tensors. A DLPack-compatible tensor is any tensor that implements the DLPack protocol (has a `__dlpack__` method). | required |
Examples:
>>> import numpy as np
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> # Get input shape and dtype from the model
>>> input_shape = lre.input_shapes[0]
>>> input_dtype = lre.input_dtypes[0]
>>> # Create input tensor with the correct shape and dtype
>>> input_tensor = np.random.rand(*input_shape).astype(input_dtype)
>>> # Ensure output is in CPU memory for NumPy compatibility
>>> lre.set_cpu_output(True)
>>> lre.infer(input_tensor)
>>> # Get all outputs
>>> outputs = lre.get_outputs()
>>> for i, output in enumerate(outputs):
...     np_output = np.from_dlpack(output)
...     print(f"Output {i} shape: {np_output.shape}, dtype: {np_output.dtype}")
>>> # Or get a specific output
>>> first_output = lre.get_output(0)
>>> np_first_output = np.from_dlpack(first_output)
>>> print(f"First output shape: {np_first_output.shape}, dtype: {np_first_output.dtype}")
set_cpu_output
set_cpu_output(use_cpu: bool) -> None
Configure whether output tensors should be placed in CPU memory.
This method configures the output to be a CPU PyDLPack tensor if the inference device is CUDA. If the inference device is already set to CPU, this setting has no effect since the output is already on the CPU.
This is particularly useful when working with libraries like NumPy that require tensors to be in CPU memory for processing.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `use_cpu` | `bool` | If set to True, the output will be a CPU PyDLPack tensor when the inference device is CUDA. If set to False, the output will remain on the device used for inference. | required |
Examples:
>>> import numpy as np
>>> # Assuming lre is an initialized LatentRuntimeEngine
>>> # Get input shape and dtype from the model
>>> input_shape = lre.input_shapes[0]
>>> input_dtype = lre.input_dtypes[0]
>>> # Create input tensor with the correct shape and dtype
>>> input_tensor = np.random.rand(*input_shape).astype(input_dtype)
>>> # Set outputs to be in CPU memory for NumPy compatibility
>>> lre.set_cpu_output(True)
>>> lre.infer(input_tensor)
>>> outputs = lre.get_outputs()
>>> # Now outputs can be directly used with numpy.from_dlpack
>>> for i, output in enumerate(outputs):
...     np_output = np.from_dlpack(output)
...     print(f"Output {i} shape: {np_output.shape}, dtype: {np_output.dtype}")