GraphExecutor Class

forge.GraphExecutor

GraphExecutor(module_path: Union[str, Path], device: str)

Lightweight wrapper around the TVM GraphExecutor to load and run a compiled model.

This is a quick and easy way to test the result of a compilation. It is not the ideal way to deploy the compiled model, since it requires 1) Python and 2) TVM, but it can be very helpful for debugging, "in-the-loop" testing, and approximating benchmarked latencies (see GraphExecutor.benchmark()).

Initialize an instance of the GraphExecutor to run inference with a compiled model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| module_path | Union[str, Path] | A path to the compiled .so binary | required |
| device | str | A string that denotes the device the .so binary is compiled for, e.g. 'cpu' or 'cuda'. A device ID can be appended to the device string with a ':' followed by the ID number, e.g. 'cuda:1'. If not set, the device ID defaults to 0. | required |
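
A minimal usage sketch, assuming GraphExecutor is importable from the top-level forge package and that a compiled binary exists at the hypothetical path model.so:

```python
from forge import GraphExecutor

# Hypothetical path to a binary produced by a prior compilation step.
executor = GraphExecutor("model.so", device="cpu")

# For a GPU build, a device ID can be selected with the ':<id>' suffix.
# executor = GraphExecutor("model_cuda.so", device="cuda:1")
```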

Methods:

| Name | Description |
| --- | --- |
| set_output_type | Set the object type of the array-like outputs. |
| infer | Perform inference using the model with the provided input data. |
| get_outputs | Retrieves the outputs of the model stored internally. |
| __call__ | Inference that mimics the inference call of Torch modules. |
| benchmark | Calculate runtime of a function by repeatedly calling it. |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| input_count | int | Model's number of inputs |
| input_shapes | List[Tuple[int, ...]] | Model's input shapes |
| input_dtypes | List[str] | Model's input data types |
| output_count | int | Model's number of outputs |
| output_shapes | List[Tuple[int, ...]] | Model's output shapes |
| output_dtypes | List[str] | Model's output data types |
| output_type | str | Output type for the output tensors |

Attributes

input_count property
input_count: int

Model's number of inputs

input_shapes property
input_shapes: List[Tuple[int, ...]]

Model's input shapes

input_dtypes property
input_dtypes: List[str]

Model's input data types

output_count property
output_count: int

Model's number of outputs

output_shapes property
output_shapes: List[Tuple[int, ...]]

Model's output shapes

output_dtypes property
output_dtypes: List[str]

Model's output data types

output_type property
output_type: str

Output type for the output tensors

It will be one of three types: 'numpy', 'dlpack', or 'torch'. A user can set the output type with GraphExecutor.set_output_type().
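
As a sketch of how these properties can be used together, the snippet below (reusing the executor from the constructor example) builds random NumPy inputs that match the model's declared shapes and dtypes:

```python
import numpy as np

# Inspect what the compiled model expects.
print(executor.input_count, executor.output_count)
print(executor.input_shapes, executor.input_dtypes)

# Build random host-side inputs matching each declared shape and dtype.
inputs = [
    np.random.rand(*shape).astype(dtype)
    for shape, dtype in zip(executor.input_shapes, executor.input_dtypes)
]
```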

Functions

set_output_type
set_output_type(tensor_type: str) -> None

Set the object type of the array-like outputs.

A user can select one of three options: 'numpy', 'dlpack', or 'torch', and can check the current selection with GraphExecutor.output_type. The conversion to the selected output type happens in the GraphExecutor.get_outputs() call. Selecting 'dlpack' or 'torch' gives the user more control over the location of the output tensors: for GraphExecutor instances running graphs compiled for 'cuda' targets, the 'torch' and 'dlpack' options allow the outputs to stay on the GPU, whereas 'numpy' outputs always result in a transfer of data from GPU to CPU.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| tensor_type | str | One of three options: 'numpy', 'dlpack', 'torch' | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| None | None | This operation happens in place. The setting is reflected in GraphExecutor.output_type. |
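
For example, for a model compiled for a 'cuda' target, the sketch below keeps the outputs on the GPU as torch tensors (hypothetical binary path; assumes PyTorch is installed):

```python
from forge import GraphExecutor

# Hypothetical CUDA-targeted binary.
gpu_executor = GraphExecutor("model_cuda.so", device="cuda:0")

# Keep outputs on the GPU instead of copying them back as NumPy arrays.
gpu_executor.set_output_type("torch")
print(gpu_executor.output_type)  # 'torch'
```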

infer
infer(*input_data: Any) -> None

Perform inference using the model with the provided input data.

This method processes the input data through the model to produce an inference result. It is designed to handle a variable number of input arguments, each of which should be an array or tensor-type object.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| *input_data | Any | A variable number of arguments containing the input data for the model. The number of arguments should match GraphExecutor.input_count, the shapes should match GraphExecutor.input_shapes, and the dtypes should match GraphExecutor.input_dtypes. The arguments can be numpy, torch, or tensorflow tensors, but they must be located on the host device, typically the CPU. | () |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| None | None | This method does not return a value. The inference result is stored internally within the object. Use GraphExecutor.get_outputs() to retrieve the outputs, or invoke GraphExecutor(*input_data) to run inference and get outputs in one call. |

get_outputs
get_outputs() -> List[Any]

Retrieves the outputs of the model stored internally.

This method gets the output tensors located at the output nodes of the underlying compute graph and converts them into the designated output type (see GraphExecutor.output_type and GraphExecutor.set_output_type()).

Returns:

| Name | Type | Description |
| --- | --- | --- |
| outputs | List[Any] | The output tensors currently located at the output nodes. The outputs reflect the results of the last-run inference. |
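
A sketch of the two-step flow, reusing the executor and the random inputs built in the earlier snippets:

```python
# Run inference; results are held internally by the executor.
executor.infer(*inputs)

# Retrieve the results in the currently selected output type (GraphExecutor.output_type).
outputs = executor.get_outputs()
for out in outputs:
    print(out.shape, out.dtype)
```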

__call__
__call__(*input_data: Any) -> List[Any]

Inference that mimics the inference call of Torch modules

This method composes GraphExecutor.infer() and GraphExecutor.get_outputs() to run inference and return the results in a single call.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| *input_data | Any | A variable number of arguments containing the input data for the model. The number of arguments should match GraphExecutor.input_count, the shapes should match GraphExecutor.input_shapes, and the dtypes should match GraphExecutor.input_dtypes. The arguments can be numpy, torch, or tensorflow tensors, but they must be located on the host device, typically the CPU. | () |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| outputs | List[Any] | The output tensors currently located at the output nodes. The outputs reflect the results of the last-run inference. |
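
The call form below is a one-line equivalent of the infer() / get_outputs() pair shown earlier:

```python
# Single-call form, mirroring how a torch.nn.Module is invoked.
outputs = executor(*inputs)
```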

benchmark
benchmark(
    repeat: int = 5,
    number: int = 5,
    min_repeat_ms: Optional[int] = None,
    limit_zero_time_iterations: Optional[int] = 100,
    end_to_end: bool = False,
    cooldown_interval_ms: Optional[int] = 0,
    repeats_to_cooldown: Optional[int] = 1,
    return_ms: bool = True,
)

Calculate runtime of a function by repeatedly calling it.

Use this function to get an accurate measurement of the runtime of a function. The function is run multiple times in order to account for variability in measurements, processor speed, or other external factors. Mean, median, standard deviation, min, and max runtime are all reported. On GPUs (CUDA and ROCm specifically), special on-device timers are used so that synchronization and data transfer operations are not counted towards the runtime. This allows for fair comparison of runtimes across different functions and models. The end_to_end flag switches this behavior to include data transfer operations in the runtime.

The benchmarking loop looks approximately like so:

```python
for r in range(repeat):
    time_start = now()
    for n in range(number):
        func_name()
    time_end = now()
    total_times.append((time_end - time_start) / number)
```

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| repeat | int | Number of times to run the outer loop of the timing code (see above). The output will contain repeat datapoints. Defaults to 5. | 5 |
| number | int | Number of times to run the inner loop of the timing code. This inner loop is run between the timer starting and stopping. To amortize any timing overhead, number should be increased when the runtime of the function is small (less than 1/10 of a millisecond). Defaults to 5. | 5 |
| min_repeat_ms | Optional[int] | If set, the inner loop will be run until it takes longer than min_repeat_ms milliseconds. This can be used to ensure that the function is run enough to get an accurate measurement. Defaults to None. | None |
| limit_zero_time_iterations | Optional[int] | The maximum number of repeats when the measured time is equal to 0. It helps to avoid hanging during measurements. Defaults to 100. | 100 |
| end_to_end | bool | If enabled, include the time to transfer input tensors to the device and the time to transfer returned tensors in the total runtime. This gives accurate timings for end-to-end workloads. Defaults to False. | False |
| cooldown_interval_ms | Optional[int] | The cooldown interval in milliseconds between the number of repeats defined by repeats_to_cooldown. Defaults to 0. | 0 |
| repeats_to_cooldown | Optional[int] | The number of repeats before the cooldown is activated. Defaults to 1. | 1 |
| return_ms | bool | A flag to convert all measurements to milliseconds. Defaults to True. | True |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| timing_results | Dict | Runtime results broken out into the raw "results", along with the computed statistics "max", "median", "min", "mean", and "std". |
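
A sketch of a typical benchmarking call on the executor from the earlier snippets; the dictionary keys are those listed above:

```python
# Time the compiled model: 10 outer repeats of 20 inner calls each,
# reported in milliseconds (return_ms=True is the default).
stats = executor.benchmark(repeat=10, number=20)

print(stats["mean"], stats["median"], stats["std"])
print(stats["min"], stats["max"])
print(stats["results"])  # raw per-repeat measurements
```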