Guide to Inference with Forge¶
GraphExecutor¶
This guide will show you how to run inference on the compiled binary with Forge using its GraphExecutor.
Inference with Forge
Note that running inference with Forge is not an optimal setup for edge deployment, as Forge is a heavy environment. However, Forge does allow you to run inference with the compiled binary locally, which is convenient for testing.
Loading a GraphExecutor¶
Loading a compiled binary is simple: provide the path to the compiled .so file and specify the device string (usually "cpu" or "cuda"). An error will occur if an incompatible device string is passed, e.g. using device="cpu" for a GPU-compiled binary.
Example Code
import forge
gx = forge.GraphExecutor("path/to/modelLibrary.so", device="cpu")
GraphExecutor Introspection¶
The GraphExecutor exposes several properties for high-level introspection.
Example Code
gx.input_count # number of inputs
gx.input_shapes # list of input shapes
gx.input_dtypes # list of input dtypes
gx.output_count # number of outputs
gx.output_shapes # list of output shapes
gx.output_dtypes # list of output dtypes
gx.output_type # string denoting the output tensor type returned by inference (see "GraphExecutor Inference")
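For example, one can sanity-check a freshly loaded binary by printing its interface. A minimal sketch using only the properties above (the printed values depend on the compiled model):
Example Code
# Print a quick summary of the compiled model's interface
for i in range(gx.input_count):
    print(f"input {i}: shape={gx.input_shapes[i]}, dtype={gx.input_dtypes[i]}")
for i in range(gx.output_count):
    print(f"output {i}: shape={gx.output_shapes[i]}, dtype={gx.output_dtypes[i]}")
print("output type:", gx.output_type)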
GraphExecutor Benchmarking¶
The GraphExecutor provides a wrapper around TVM's benchmarking function. Below is the type signature and docstring.
GraphExecutor Benchmark Method Docstring
forge.GraphExecutor.benchmark(repeat=5, number=5, min_repeat_ms=None, limit_zero_time_iterations=100, end_to_end=False, cooldown_interval_ms=0, repeats_to_cooldown=1, return_ms=True)
Calculate runtime of a function by repeatedly calling it.
Use this function to get an accurate measurement of the runtime of a function. The function is run multiple times in order to account for variability in measurements, processor speed or other external factors. Mean, median, standard deviation, min and max runtime are all reported. On GPUs (CUDA and ROCm specifically), special on-device timers are used so that synchronization and data transfer operations are not counted towards the runtime. This allows for fair comparison of runtimes across different functions and models. The end_to_end flag switches this behavior to include data transfer operations in the runtime.
The benchmarking loop looks approximately like so:
.. code-block:: python

    for r in range(repeat):
        time_start = now()
        for n in range(number):
            func_name()
        time_end = now()
        total_times.append((time_end - time_start)/number)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
repeat | int | Number of times to run the outer loop of the timing code (see above). The output will contain one measurement per repeat. | 5 |
number | int | Number of times to run the inner loop of the timing code. This inner loop is run in between the timer starting and stopping. In order to amortize any timing overhead, number should be increased when the runtime of the function is small. | 5 |
min_repeat_ms | Optional[int] | If set, the inner loop will be run until it takes longer than min_repeat_ms milliseconds. | None |
limit_zero_time_iterations | Optional[int] | The maximum number of repeats when measured time is equal to 0. It helps to avoid hanging during measurements. Defaults to 100. | 100 |
end_to_end | bool | If enabled, include time to transfer input tensors to the device and time to transfer returned tensors in the total runtime. This will give accurate timings for end to end workloads. Defaults to False. | False |
cooldown_interval_ms | Optional[int] | The cooldown interval in milliseconds between the number of repeats defined by repeats_to_cooldown. Defaults to 0. | 0 |
repeats_to_cooldown | Optional[int] | The number of repeats before the cooldown is activated. Defaults to 1. | 1 |
return_ms | bool | A flag to convert all measurements to milliseconds. Defaults to True. | True |
Returns:
Name | Type | Description |
---|---|---|
timing_results | Dict | Runtime results broken out into the raw "results", along with the computed statistics of "max", "median", "min", "mean", and "std". |
Benchmarking Details
It is important to note that this benchmarking does not account for:
- the setting of the input
- the retrieval of the output
i.e. setting inputs and retrieving outputs are not measured within the benchmark loop.
Example Code
gx.benchmark(repeat=10, number=5, return_ms=False) # report raw timings rather than milliseconds
gx.benchmark(end_to_end=True) # include host-to-device and device-to-host transfer time
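The returned dictionary can then be inspected directly. A minimal sketch using the statistics keys documented above (assuming the default return_ms=True, so values are in milliseconds):
Example Code
results = gx.benchmark(repeat=10, number=5)
print(f"mean:   {results['mean']:.3f} ms")
print(f"median: {results['median']:.3f} ms")
print(f"std:    {results['std']:.3f} ms")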
GraphExecutor Inference¶
Inference Methods¶
There are two ways to run inference.
Method 1:
gx.infer(input_data) # runs inference but does not return output
output = gx.get_outputs() # retrieve a list of output tensors
Method 2:
output = gx(input_data) # runs inference and returns output
The input_data can be a single positional argument or multiple positional arguments. The inputs should be NumPy, Torch, or TensorFlow tensor objects, i.e. any tensor objects that uphold the DLPack protocol.
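As a minimal sketch, assuming a hypothetical model with two inputs (shapes and dtypes taken from the introspection properties above), inference with NumPy inputs might look like:
Example Code
import numpy as np

# Build random inputs shaped and typed to match the compiled model
x = np.random.rand(*gx.input_shapes[0]).astype(gx.input_dtypes[0])
y = np.random.rand(*gx.input_shapes[1]).astype(gx.input_dtypes[1])

# Method 1: run inference, then fetch the outputs separately
gx.infer(x, y)
outputs = gx.get_outputs()

# Method 2: run inference and get the outputs in one call
outputs = gx(x, y)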
Output Type¶
By default, the list of outputs returned will be NumPy arrays. However, a user can manually set this to one of three supported options: "numpy", "dlpack", or "torch". By using "dlpack" or "torch", a user can force the resulting output to stay on the target device (e.g. the GPU), avoiding the expense of transferring the output from the target device to the CPU. If "dlpack" is elected, it is up to the user to ingest the object into a framework of their choice.
gx.set_output_type("torch")
gx.output_type # the string denoting the tensor object type returned
Environment
One must have torch installed in the environment to have the GraphExecutor return torch.Tensor objects.
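If "dlpack" is selected, the returned objects can be ingested into a framework of choice. A minimal sketch, assuming Torch as the consuming framework and reusing the input x from the example above:
Example Code
import torch

gx.set_output_type("dlpack")
gx.infer(x)
capsules = gx.get_outputs()  # DLPack capsules, still resident on the target device

# Ingest into Torch without copying off the device
tensors = [torch.utils.dlpack.from_dlpack(c) for c in capsules]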
Inference with a TensorRT-Compiled Model¶
To use the GraphExecutor on a model compiled with TensorRT, the functional flow is no different from that described above. However, note that the inference engines are built by TensorRT at runtime. The first time a TensorRT-compiled model is loaded and run, TensorRT will kick off an engine-building process using the compiled .so. This is important because the first inference may appear exceptionally slow for a TensorRT-compiled model if engine-building is required. Built engines are cached in the same directory as the .so, and future loads and inference runs will not trigger engine-building so long as the cached engines are found.
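Because of this, it can be useful to run a warm-up inference before timing a TensorRT-compiled model, so that engine-building is not mistaken for steady-state latency. A minimal sketch, assuming a CUDA-compiled binary and reusing the input x from above:
Example Code
gx = forge.GraphExecutor("path/to/modelLibrary.so", device="cuda")

# First call may be slow: TensorRT builds (and caches) its engines here
gx.infer(x)

# Subsequent calls and benchmarks reflect steady-state performance
results = gx.benchmark(repeat=10)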