GraphExecutor Class

forge.GraphExecutor


forge.GraphExecutor(module_path: Union[str, Path], device: str)

Lightweight wrapper around the TVM GraphExecutor to load and run a compiled model.

This is a quick and easy way to test the resulting compilation. It is not the ideal way to deploy the compiled model, since it requires both Python and TVM, but it can be very helpful for debugging, "in-the-loop" testing, and approximating benchmarked latencies (see GraphExecutor.benchmark()).

Initialize an instance of the GraphExecutor to run inference with a compiled model.

Parameters:

    module_path : Union[str, Path], required
        A path to the compiled .so binary.

    device : str, required
        A string that denotes the device the .so binary is compiled for, e.g. 'cpu' or 'cuda'. A device ID can be appended to the device string with a ':' followed by the ID number, e.g. 'cuda:1'. If not set, the device ID defaults to 0.
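As a quick illustration, loading a compiled binary might look like the following sketch; the file name is a placeholder for whatever the compilation step produced.

    import forge

    # 'compiled_model.so' is a placeholder path to a compiled binary.
    executor = forge.GraphExecutor("compiled_model.so", device="cuda:1")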

input_count: int property

Model's number of inputs

input_shapes: List[Tuple[int, ...]] property

Model's input shapes

input_dtypes: List[str] property

Model's input data types

output_count: int property

Model's number of outputs

output_shapes: List[Tuple[int, ...]] property

Model's output shapes

output_dtypes: List[str] property

Model's output data types

output_type: str property

Output type for the output tensors

It will be one of three types: 'numpy', 'dlpack', or 'torch'. A user can set the output type with GraphExecutor.set_output_type().
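Taken together, these properties describe the model's signature. A minimal sketch of inspecting them (the file name and the printed values are illustrative assumptions):

    import forge

    executor = forge.GraphExecutor("compiled_model.so", device="cpu")

    # Query the input/output signature before preparing data.
    print(executor.input_count)    # e.g. 1
    print(executor.input_shapes)   # e.g. [(1, 3, 224, 224)]
    print(executor.input_dtypes)   # e.g. ['float32']
    print(executor.output_count)   # e.g. 1
    print(executor.output_shapes)  # e.g. [(1, 1000)]
    print(executor.output_dtypes)  # e.g. ['float32']
    print(executor.output_type)    # e.g. 'numpy'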

set_output_type(tensor_type: str) -> None

Set the object type of the array-like outputs.

A user can select one of three options: 'numpy', 'dlpack', or 'torch', and can check the current selection with GraphExecutor.output_type. The conversion to the selected output type happens in the GraphExecutor.get_outputs() call. Selecting 'dlpack' or 'torch' gives a user more control over the location of the output tensors. For GraphExecutor instances running graphs on 'cuda' targets, the 'torch' and 'dlpack' options allow a user to keep the outputs on the GPU, whereas 'numpy' outputs always incur a transfer of data from GPU to CPU.

Parameters:

    tensor_type : str, required
        One of three options: 'numpy', 'dlpack', or 'torch'.

Returns:

    None
        This operation happens in place. The setting is reflected in GraphExecutor.output_type.
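A short sketch of switching the output type for a GPU-compiled model (the file name is a placeholder):

    import forge

    executor = forge.GraphExecutor("compiled_model.so", device="cuda")

    # Keep outputs on the GPU as torch tensors instead of copying them to numpy on the CPU.
    executor.set_output_type("torch")
    print(executor.output_type)  # 'torch'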

infer(*input_data: Any) -> None

Perform inference using the model with the provided input data.

This method processes the input data through the model to produce an inference result. It is designed to handle a variable number of input arguments, each of which should be an array or tensor-type object.

Parameters:

    *input_data : Any
        A variable number of arguments containing the input data for the model. The number of arguments should match GraphExecutor.input_count, the shapes of the arguments should match GraphExecutor.input_shapes, and the dtypes of the arguments should match GraphExecutor.input_dtypes. The arguments can be numpy, torch, or tensorflow tensors, but they must be located on the host device, typically the CPU.

Returns:

    None
        This method does not return a value. The inference result is stored internally within the object. Use GraphExecutor.get_outputs() to retrieve the outputs, or invoke GraphExecutor(*input_data) to run inference and get the outputs with one call.

get_outputs() -> List[Any]

Retrieves the outputs of the model stored internally.

This method gets the output tensors located at the output nodes of the underlying compute graph and converts them into the designated output type (see GraphExecutor.output_type and GraphExecutor.set_output_type()).

Returns:

    outputs : List[Any]
        The output tensors currently located at the output nodes. The outputs reflect the results of the last-run inference.
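Putting the two calls together, a minimal sketch of running inference with random host-side inputs (the shapes and dtypes are taken from the executor itself; the file name is a placeholder):

    import numpy as np
    import forge

    executor = forge.GraphExecutor("compiled_model.so", device="cpu")

    # Build host-side (CPU) inputs matching the model's signature.
    inputs = [
        np.random.rand(*shape).astype(dtype)
        for shape, dtype in zip(executor.input_shapes, executor.input_dtypes)
    ]

    executor.infer(*inputs)            # runs inference; returns None
    outputs = executor.get_outputs()   # converts and returns the stored results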

__call__(*input_data: Any) -> List[Any]

An inference call that mimics how Torch modules are invoked.

This method composes two methods to run inference and return the results of the inference. See GraphExecutor.infer() and GraphExecutor.get_outputs().

Parameters:

    *input_data : Any
        A variable number of arguments containing the input data for the model. The number of arguments should match GraphExecutor.input_count, the shapes of the arguments should match GraphExecutor.input_shapes, and the dtypes of the arguments should match GraphExecutor.input_dtypes. The arguments can be numpy, torch, or tensorflow tensors, but they must be located on the host device, typically the CPU.

Returns:

    outputs : List[Any]
        The output tensors currently located at the output nodes. The outputs reflect the results of the last-run inference.
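The call form collapses the two steps above into one, mirroring how a torch.nn.Module is invoked (again a sketch with a placeholder file name):

    import numpy as np
    import forge

    executor = forge.GraphExecutor("compiled_model.so", device="cpu")
    inputs = [
        np.random.rand(*shape).astype(dtype)
        for shape, dtype in zip(executor.input_shapes, executor.input_dtypes)
    ]

    # Equivalent to executor.infer(*inputs) followed by executor.get_outputs().
    outputs = executor(*inputs)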

benchmark(repeat: int = 5, number: int = 5, min_repeat_ms: Optional[int] = None, limit_zero_time_iterations: Optional[int] = 100, end_to_end: bool = False, cooldown_interval_ms: Optional[int] = 0, repeats_to_cooldown: Optional[int] = 1, return_ms: bool = True)

Calculate runtime of a function by repeatedly calling it.

Use this function to get an accurate measurement of the runtime of a function. The function is run multiple times in order to account for variability in measurements, processor speed, or other external factors. Mean, median, standard deviation, min, and max runtime are all reported. On GPUs, CUDA and ROCm specifically, special on-device timers are used so that synchronization and data transfer operations are not counted towards the runtime. This allows for fair comparison of runtimes across different functions and models. The end_to_end flag switches this behavior to include data transfer operations in the runtime.

The benchmarking loop looks approximately like so:

    for r in range(repeat):
        time_start = now()
        for n in range(number):
            func_name()
        time_end = now()
        total_times.append((time_end - time_start) / number)

Parameters:

    repeat : int, default 5
        Number of times to run the outer loop of the timing code (see above). The output will contain repeat datapoints.

    number : int, default 5
        Number of times to run the inner loop of the timing code. This inner loop is run between the timer starting and stopping. To amortize any timing overhead, number should be increased when the runtime of the function is small (less than 1/10 of a millisecond).

    min_repeat_ms : Optional[int], default None
        If set, the inner loop will be run until it takes longer than min_repeat_ms milliseconds. This can be used to ensure that the function is run enough times to get an accurate measurement.

    limit_zero_time_iterations : Optional[int], default 100
        The maximum number of repeats when the measured time is equal to 0. It helps to avoid hanging during measurements.

    end_to_end : bool, default False
        If enabled, include the time to transfer input tensors to the device and the time to transfer the returned tensors back in the total runtime. This gives accurate timings for end-to-end workloads.

    cooldown_interval_ms : Optional[int], default 0
        The cooldown interval in milliseconds between the number of repeats defined by repeats_to_cooldown.

    repeats_to_cooldown : Optional[int], default 1
        The number of repeats before the cooldown is activated.

    return_ms : bool, default True
        A flag to convert all measurements to milliseconds.

Returns:

    timing_results : Dict
        Runtime results broken out into the raw "results", along with the computed statistics "max", "median", "min", "mean", and "std".
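A sketch of collecting latency statistics for a GPU-compiled model (the file name and the chosen repeat/number values are illustrative):

    import forge

    executor = forge.GraphExecutor("compiled_model.so", device="cuda")

    stats = executor.benchmark(repeat=10, number=20, return_ms=True)
    print(stats["mean"], stats["median"], stats["std"])  # in milliseconds, since return_ms=True
    print(stats["results"])                              # raw per-repeat measurements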