GraphExecutor Class

forge.GraphExecutor

GraphExecutor(module_path: Union[str, Path], device: str)

Lightweight wrapper around the TVM GraphExecutor to load and run a compiled model.

This is a quick and easy way to test the result of a compilation. It is not the ideal way to deploy the compiled model, since it requires 1) Python and 2) TVM, but it can be very helpful for debugging, "in-the-loop" testing, and approximating benchmarked latencies (see GraphExecutor.benchmark()).

Initialize an instance of the GraphExecutor to run inference with a compiled model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| module_path | Union[str, Path] | A path to the compiled .so binary | required |
| device | str | A string that denotes the device the .so binary is compiled for, e.g. 'cpu' or 'cuda'. A device ID can be appended to the device string with a ':' followed by the ID number, e.g. 'cuda:1'. If not set, the device ID defaults to 0. | required |
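
A minimal usage sketch, assuming GraphExecutor is importable from the top-level forge package and that a compiled binary exists at the hypothetical path model.so:

```python
from forge import GraphExecutor

# Hypothetical path to a binary produced by a prior compilation step.
executor = GraphExecutor("model.so", device="cpu")

# For a GPU build, a device ID can be selected with the ':<id>' suffix.
# executor = GraphExecutor("model_cuda.so", device="cuda:1")
```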

Methods:

| Name | Description |
| --- | --- |
| set_output_type | Set the object type of the array-like outputs. |
| infer | Perform inference using the model with the provided input data. |
| get_outputs | Retrieves the outputs of the model stored internally. |
| __call__ | Inference that mimics the inference call of Torch modules. |
| benchmark | Calculate runtime of a function by repeatedly calling it. |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| input_count | int | Model's number of inputs |
| input_shapes | List[Tuple[int, ...]] | Model's input shapes |
| input_dtypes | List[str] | Model's input data types |
| output_count | int | Model's number of outputs |
| output_shapes | List[Tuple[int, ...]] | Model's output shapes |
| output_dtypes | List[str] | Model's output data types |
| output_type | str | Output type for the output tensors |

Attributes

input_count property
input_count: int

Model's number of inputs

input_shapes property
input_shapes: List[Tuple[int, ...]]

Model's input shapes

input_dtypes property
input_dtypes: List[str]

Model's input data types

output_count property
output_count: int

Model's number of outputs

output_shapes property
output_shapes: List[Tuple[int, ...]]

Model's output shapes

output_dtypes property
output_dtypes: List[str]

Model's output data types

output_type property
output_type: str

Output type for the output tensors

It will be one of three types: 'numpy', 'dlpack', or 'torch'. A user can set the output type with GraphExecutor.set_output_type().
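
As a sketch of how these properties can be used together, the snippet below (reusing the executor from the constructor example) builds random NumPy inputs that match the model's declared shapes and dtypes:

```python
import numpy as np

# Inspect what the compiled model expects.
print(executor.input_count, executor.output_count)
print(executor.input_shapes, executor.input_dtypes)

# Build random host-side inputs matching each declared shape and dtype.
inputs = [
    np.random.rand(*shape).astype(dtype)
    for shape, dtype in zip(executor.input_shapes, executor.input_dtypes)
]
```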

Functions

set_output_type
set_output_type(tensor_type: str) -> None

Set the object type of the array-like outputs.

A user can select one of three options: 'numpy', 'dlpack', or 'torch', and can check the current selection with GraphExecutor.output_type. The conversion to the selected output type happens in the GraphExecutor.get_outputs() call. Selecting 'dlpack' or 'torch' gives the user more control over the location of the output tensors: for GraphExecutor instances running graphs compiled for 'cuda' targets, the 'torch' and 'dlpack' options allow the outputs to stay on the GPU, whereas 'numpy' outputs always result in a transfer of data from GPU to CPU.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| tensor_type | str | One of three options: 'numpy', 'dlpack', 'torch' | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| None | None | This operation happens in place. The setting is reflected in GraphExecutor.output_type. |
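
For example, for a model compiled for a 'cuda' target, the sketch below keeps the outputs on the GPU as torch tensors (hypothetical binary path; assumes PyTorch is installed):

```python
from forge import GraphExecutor

# Hypothetical CUDA-targeted binary.
gpu_executor = GraphExecutor("model_cuda.so", device="cuda:0")

# Keep outputs on the GPU instead of copying them back as NumPy arrays.
gpu_executor.set_output_type("torch")
print(gpu_executor.output_type)  # 'torch'
```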

infer
infer(*input_data: Any) -> None

Perform inference using the model with the provided input data.

This method processes the input data through the model to produce an inference result. It is designed to handle a variable number of input arguments, each of which should be an array or tensor-type object.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| *input_data | Any | A variable number of arguments containing the input data for the model. The number of arguments should match GraphExecutor.input_count, the shapes should match GraphExecutor.input_shapes, and the dtypes should match GraphExecutor.input_dtypes. The arguments can be numpy, torch, or tensorflow tensors, but they must be located on the host device, typically the CPU. | () |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| None | None | This method does not return a value. The inference result is stored internally within the object. Use GraphExecutor.get_outputs() to retrieve the outputs, or invoke GraphExecutor(*input_data) to run inference and get outputs in one call. |

get_outputs
get_outputs() -> List[Any]

Retrieves the outputs of the model stored internally.

This method gets the output tensors located at the output nodes of the underlying compute graph and converts them into the designated output type (see GraphExecutor.output_type and GraphExecutor.set_output_type()).

Returns:

| Name | Type | Description |
| --- | --- | --- |
| outputs | List[Any] | The output tensors currently located at the output nodes. The outputs reflect the results of the last-run inference. |
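
A sketch of the two-step flow, reusing the executor and the random inputs built in the earlier snippets:

```python
# Run inference; results are held internally by the executor.
executor.infer(*inputs)

# Retrieve the results in the currently selected output type (GraphExecutor.output_type).
outputs = executor.get_outputs()
for out in outputs:
    print(out.shape, out.dtype)
```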

__call__
__call__(*input_data: Any) -> List[Any]

Inference that mimics the inference call of Torch modules

This method composes GraphExecutor.infer() and GraphExecutor.get_outputs() to run inference and return the results in a single call.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| *input_data | Any | A variable number of arguments containing the input data for the model. The number of arguments should match GraphExecutor.input_count, the shapes should match GraphExecutor.input_shapes, and the dtypes should match GraphExecutor.input_dtypes. The arguments can be numpy, torch, or tensorflow tensors, but they must be located on the host device, typically the CPU. | () |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| outputs | List[Any] | The output tensors currently located at the output nodes. The outputs reflect the results of the last-run inference. |
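
The call form below is a one-line equivalent of the infer() / get_outputs() pair shown earlier:

```python
# Single-call form, mirroring how a torch.nn.Module is invoked.
outputs = executor(*inputs)
```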

benchmark
benchmark(
    repeat: int = 5,
    number: int = 5,
    min_repeat_ms: Optional[int] = None,
    limit_zero_time_iterations: Optional[int] = 100,
    end_to_end: bool = False,
    cooldown_interval_ms: Optional[int] = 0,
    repeats_to_cooldown: Optional[int] = 1,
    return_ms: bool = True,
)

Calculate runtime of a function by repeatedly calling it.

Use this function to get an accurate measurement of the runtime of a function. The function is run multiple times in order to account for variability in measurements, processor speed, or other external factors. Mean, median, standard deviation, min, and max runtime are all reported. On GPUs (CUDA and ROCm specifically), special on-device timers are used so that synchronization and data transfer operations are not counted towards the runtime. This allows for fair comparison of runtimes across different functions and models. The end_to_end flag switches this behavior to include data transfer operations in the runtime.

The benchmarking loop looks approximately like so:

```python
for r in range(repeat):
    time_start = now()
    for n in range(number):
        func_name()
    time_end = now()
    total_times.append((time_end - time_start) / number)
```

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| repeat | int | Number of times to run the outer loop of the timing code (see above). The output will contain repeat datapoints. Defaults to 5. | 5 |
| number | int | Number of times to run the inner loop of the timing code. This inner loop is run between the timer starting and stopping. To amortize any timing overhead, number should be increased when the runtime of the function is small (less than 1/10 of a millisecond). Defaults to 5. | 5 |
| min_repeat_ms | Optional[int] | If set, the inner loop will be run until it takes longer than min_repeat_ms milliseconds. This can be used to ensure that the function is run enough to get an accurate measurement. Defaults to None. | None |
| limit_zero_time_iterations | Optional[int] | The maximum number of repeats when the measured time is equal to 0. It helps to avoid hanging during measurements. Defaults to 100. | 100 |
| end_to_end | bool | If enabled, include the time to transfer input tensors to the device and the time to transfer returned tensors in the total runtime. This gives accurate timings for end-to-end workloads. Defaults to False. | False |
| cooldown_interval_ms | Optional[int] | The cooldown interval in milliseconds between the number of repeats defined by repeats_to_cooldown. Defaults to 0. | 0 |
| repeats_to_cooldown | Optional[int] | The number of repeats before the cooldown is activated. Defaults to 1. | 1 |
| return_ms | bool | A flag to convert all measurements to milliseconds. Defaults to True. | True |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| timing_results | Dict | Runtime results broken out into the raw "results", along with the computed statistics "max", "median", "min", "mean", and "std". |
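
A sketch of a typical benchmarking call on the executor from the earlier snippets; the dictionary keys are those listed above:

```python
# Time the compiled model: 10 outer repeats of 20 inner calls each,
# reported in milliseconds (return_ms=True is the default).
stats = executor.benchmark(repeat=10, number=20)

print(stats["mean"], stats["median"], stats["std"])
print(stats["min"], stats["max"])
print(stats["results"])  # raw per-repeat measurements
```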