GraphExecutor Class
forge.GraphExecutor
forge.GraphExecutor(module_path: Union[str, Path], device: str)
Lightweight wrapper around the TVM GraphExecutor to load and run a compiled model.
This is a quick and easy way to test the result of a compilation. It is not the
ideal way to use the compiled model, since it requires both Python and TVM, but it
can be very helpful for debugging, "in-the-loop" testing, and approximating
benchmarked latencies (see GraphExecutor.benchmark()).
Initialize an instance of the GraphExecutor to run inference with a compiled model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
module_path | Union[str, Path] | A path to the compiled .so binary. | required |
device | str | A string that denotes the device the .so binary is compiled for, e.g. 'cpu' or 'cuda'. One can set the device ID with the device string. | required |
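
A minimal usage sketch follows; the binary path and the example shapes in the comments are hypothetical, and it assumes GraphExecutor is importable directly from the forge package:

```python
from pathlib import Path
from forge import GraphExecutor

# Hypothetical path to a model that was previously compiled for CPU.
module_path = Path("artifacts/model_cpu.so")

# Load the compiled binary on the device it was compiled for.
executor = GraphExecutor(module_path, device="cpu")

# Inspect the model's I/O signature (properties documented below).
print(executor.input_count)    # e.g. 1
print(executor.input_shapes)   # e.g. [(1, 3, 224, 224)]
print(executor.input_dtypes)   # e.g. ['float32']
print(executor.output_shapes)  # e.g. [(1, 1000)]
```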
input_count: int
property
Model's number of inputs
input_shapes: List[Tuple[int, ...]]
property
Model's input shapes
input_dtypes: List[str]
property
Model's input data types
output_count: int
property
Model's number of outputs
output_shapes: List[Tuple[int, ...]]
property
Model's output shapes
output_dtypes: List[str]
property
Model's output data types
output_type: str
property
Output type for the output tensors.
It will be one of three types: 'numpy', 'dlpack', or 'torch'.
A user can set the output type with GraphExecutor.set_output_type().
set_output_type(tensor_type: str) -> None
Set the object type of the array-like outputs.
A user can select one of three options: 'numpy', 'dlpack', or 'torch', and can check
the current selection with GraphExecutor.output_type. The conversion to the selected
output type happens in the GraphExecutor.get_outputs() call. Selecting 'dlpack' or
'torch' gives a user more control over the location of the output tensor. For
GraphExecutor instances running graphs compiled for 'cuda' targets, the 'torch' and
'dlpack' options allow a user to keep the outputs on the GPU, whereas 'numpy' outputs
always result in a transfer of data from GPU to CPU.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tensor_type | str | One of three options: 'numpy', 'dlpack', 'torch' | required |
Returns:
Name | Type | Description |
---|---|---|
None | None | This operation happens in place. The setting is reflected in GraphExecutor.output_type. |
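
A short sketch of switching the output type for a graph compiled for a CUDA target (the .so path below is hypothetical):

```python
from forge import GraphExecutor

# Hypothetical path to a model compiled for a CUDA target.
executor = GraphExecutor("artifacts/model_cuda.so", device="cuda")

# Return outputs as torch tensors so they can stay on the GPU,
# rather than being copied back to host NumPy arrays.
executor.set_output_type("torch")
print(executor.output_type)  # 'torch'
```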
infer(*input_data: Any) -> None
Perform inference using the model with the provided input data.
This method processes the input data through the model to produce an inference result. It is designed to handle a variable number of input arguments, each of which should be an array- or tensor-like object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*input_data | Any | A variable number of arguments containing the input data for the model. The number of arguments should match the model's input_count. | () |
Returns:
Name | Type | Description |
---|---|---|
None | None | This method does not return a value. The inference result is stored internally within the object. Use GraphExecutor.get_outputs() to retrieve it. |
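
Continuing the sketch above (executor is an already constructed GraphExecutor), inputs can be built to match the model's reported shapes and dtypes and passed positionally:

```python
import numpy as np

# Build one array per model input, matching the reported shapes and dtypes.
inputs = [
    np.random.rand(*shape).astype(dtype)
    for shape, dtype in zip(executor.input_shapes, executor.input_dtypes)
]

# Run inference; the results are stored internally on the executor.
executor.infer(*inputs)
```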
get_outputs() -> List[Any]
Retrieve the outputs of the model stored internally.
This method gets the output tensors located at the output nodes of the underlying
compute graph and converts them into the designated output type (see
GraphExecutor.output_type and GraphExecutor.set_output_type()).
Returns:
Name | Type | Description |
---|---|---|
outputs | List[Any] | Returns the output tensors currently located at the output nodes. The outputs reflect the results of the last-run inference. |
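
Continuing the inference sketch above, the stored results can then be retrieved:

```python
# Retrieve the outputs of the last infer() call, converted to the
# currently selected output type (see set_output_type()).
outputs = executor.get_outputs()
for out, shape in zip(outputs, executor.output_shapes):
    print(type(out), shape)
```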
__call__(*input_data: Any) -> List[Any]
Inference call that mimics the inference call of Torch modules.
This method composes GraphExecutor.infer() and GraphExecutor.get_outputs() to run inference and return the results.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*input_data | Any | A variable number of arguments containing the input data for the model. The number of arguments should match the model's input_count. | () |
Returns:
Name | Type | Description |
---|---|---|
outputs | List[Any] | Returns the output tensors currently located at the output nodes. The outputs reflect the results of the last-run inference. |
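
Because __call__ chains infer() and get_outputs(), the two-step sketch above collapses into a single call:

```python
# Equivalent to executor.infer(*inputs) followed by executor.get_outputs().
outputs = executor(*inputs)
```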
benchmark(repeat: int = 5, number: int = 5, min_repeat_ms: Optional[int] = None, limit_zero_time_iterations: Optional[int] = 100, end_to_end: bool = False, cooldown_interval_ms: Optional[int] = 0, repeats_to_cooldown: Optional[int] = 1, return_ms: bool = True)
Calculate the runtime of a function by repeatedly calling it.
Use this function to get an accurate measurement of runtime. The function is run
multiple times in order to account for variability in measurements, processor speed,
and other external factors. Mean, median, standard deviation, min, and max runtime
are all reported. On GPUs, CUDA and ROCm specifically, special on-device timers are
used so that synchronization and data-transfer operations are not counted towards the
runtime. This allows for fair comparison of runtimes across different functions and
models. The end_to_end flag switches this behavior to include data-transfer
operations in the runtime.
The benchmarking loop looks approximately like this:

```python
for r in range(repeat):
    time_start = now()
    for n in range(number):
        func_name()
    time_end = now()
    total_times.append((time_end - time_start) / number)
```
Parameters:
Name | Type | Description | Default |
---|---|---|---|
repeat | int | Number of times to run the outer loop of the timing code (see above). The output will contain repeat number of datapoints. | 5 |
number | int | Number of times to run the inner loop of the timing code. This inner loop is run in between the timer starting and stopping. In order to amortize any timing overhead, number should be increased when the runtime of the function is small. | 5 |
min_repeat_ms | Optional[int] | If set, the inner loop will be run until it takes longer than min_repeat_ms milliseconds. | None |
limit_zero_time_iterations | Optional[int] | The maximum number of repeats when the measured time is equal to 0. It helps to avoid hanging during measurements. Defaults to 100. | 100 |
end_to_end | bool | If enabled, include the time to transfer input tensors to the device and the time to transfer returned tensors in the total runtime. This gives accurate timings for end-to-end workloads. Defaults to False. | False |
cooldown_interval_ms | Optional[int] | The cooldown interval in milliseconds between the number of repeats defined by repeats_to_cooldown. Defaults to 0. | 0 |
repeats_to_cooldown | Optional[int] | The number of repeats before the cooldown is activated. Defaults to 1. | 1 |
return_ms | bool | A flag to convert all measurements to milliseconds. Defaults to True. | True |
Returns:
Name | Type | Description |
---|---|---|
timing_results | Dict | Runtime results broken out into the raw "results", along with the computed statistics "max", "median", "min", "mean", and "std". |
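
A sketch of benchmarking a loaded model and reading the summary statistics; the key names follow the dictionary description above, and executor is an already constructed GraphExecutor:

```python
# Time the compiled model; with return_ms=True (the default) all values are in milliseconds.
timing_results = executor.benchmark(repeat=10, number=20)

print(f"mean latency:   {timing_results['mean']:.3f} ms")
print(f"median latency: {timing_results['median']:.3f} ms")
print(f"std deviation:  {timing_results['std']:.3f} ms")
print(f"raw samples:    {timing_results['results']}")
```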