
Guide to Loading into Forge

This guide shows how to load a model into Forge. Every model is ingested and converted into Forge's intermediate representation.

Loading a Model

Framework Packages

Loading a model is relatively simple. Forge does not ship with any ML framework dependencies, so it is up to the user to install the framework(s) and versions needed for their models. Simply pip install as needed.
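For example, to load the ONNX and PyTorch models shown later in this guide (pin whatever versions your models require):

pip install onnx torch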

Forge's Loading Functions

All of Forge's loading functions are accessible from the forge module's top level.

import forge

forge.from_onnx()        # ONNX
forge.from_torch()       # PyTorch (TorchScript)
forge.from_tensorflow()  # TensorFlow
forge.from_tflite()      # TFLite
forge.from_keras()       # Keras

Each loading function expects a framework's model in-memory, not a path to a model, so it is up to the user to load the model first with the corresponding framework module. Each loader converts the framework model into Forge's intermediate representation module, the forge.IRModule.

import forge
import onnx
import torch

# ONNX example
onnx_model = onnx.load("path/to/model.onnx")
ir = forge.from_onnx(onnx_model)

# PyTorch example
torch_model = torch.jit.load("path/to/traced_model.pt")
ir = forge.from_torch(torch_model, [("input0", (1, 3, 224, 224))])

Nuances and Differences Between Loaders

It's important to note that not all the loaders have the same type signature. This is observable in the snippet above, where the PyTorch example requires the user to provide more detail than the ONNX example. The reason for this is threefold:

  1. Inherent differences in how these frameworks represent and handle model graphs and data types

  2. TVM's frontend functions reflect these inherent differences between frameworks

  3. Forge wraps around TVM's frontend functions without modification

As a concrete example, let's quickly consider why the ONNX and PyTorch loaders are different.


ONNX Frontend

Static Graph: ONNX (Open Neural Network Exchange) uses a static computational graph. This means the graph's structure, along with the shapes and types of its tensors, is usually (but not always) defined and fixed when the model is saved.

No Need for Input Information: Since ONNX models already contain this static information, the conversion process typically doesn't require sample input data. The model graph provides enough information for the Relay frontend to parse and convert it.
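You can see this static information for yourself by inspecting an ONNX model's declared inputs:

import onnx

# Print each declared input's name and static shape.
# Dynamic dimensions appear as a dim_param instead of a fixed dim_value.
model = onnx.load("path/to/model.onnx")
for inp in model.graph.input:
    dims = [d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)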

PyTorch Frontend

Dynamic Nature: PyTorch is known for its dynamic computational graph (eager execution), meaning the graph structure can change at runtime. This flexibility requires additional context during model conversion.

Input Information Requirement: When converting a PyTorch model, you typically need to provide input information (or a sample input tensor). This is necessary because it helps trace the model into a static graph representation and infer shapes and types that are not fixed until runtime.
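In practice, this means an eager-mode PyTorch model is usually traced (or scripted) into a static TorchScript graph before being handed to the loader. A minimal sketch, assuming a standard torchvision model:

import torch
import torchvision

import forge

# Trace an eager-mode model into a static TorchScript graph.
model = torchvision.models.resnet18(weights=None).eval()
example_input = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example_input)

# The traced module, plus input names and shapes, is what from_torch expects.
ir = forge.from_torch(traced, [("input0", (1, 3, 224, 224))])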


Loading Considerations

Since each framework has its own independent loading function and logic, a model converted from one framework to another before loading may result in a slightly different intermediate representation, i.e. semantically identical but computationally different.

Using the Loaders

Each loader is documented and provides more arguments and parameters than one would typically need. Occasionally, however, one may need to consult the documentation to handle edge cases. For example, an ONNX model is not always completely statically typed. This is easy to rectify, since the loader function allows the user to specify input information, much like the PyTorch loader.
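For illustration, such a call might mirror the from_torch signature shown earlier. The exact parameter form here is an assumption, so consult the loader's documentation:

import forge
import onnx

# Hypothetical: explicitly supplying input names and shapes for an ONNX
# model that is not fully statically typed. The argument form mirrors the
# from_torch example above; check forge.from_onnx's documentation for the
# exact parameter name and format.
onnx_model = onnx.load("path/to/model.onnx")
ir = forge.from_onnx(onnx_model, [("input0", (1, 3, 224, 224))])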

Optimization Passes During Loading

When using the Forge loading functions, the model will automatically undergo two optimization passes during ingestion.

Constant Folding

An optimization technique in which expressions involving only constant values are pre-computed and simplified at compile time, before execution. This reduces computation time and improves performance by replacing such expressions with their computed values. For example, an expression like 3 + 4 would be replaced with 7 during compilation.
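To make the idea concrete, here is a minimal, self-contained sketch of constant folding over a toy expression tree. This is an illustration only, not Forge's actual pass:

# Toy illustration of constant folding; not Forge's implementation.
class Const:
    def __init__(self, value):
        self.value = value

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

def fold(node):
    # Recursively replace constant-only subexpressions with their values.
    if isinstance(node, Add):
        left, right = fold(node.left), fold(node.right)
        if isinstance(left, Const) and isinstance(right, Const):
            return Const(left.value + right.value)  # e.g. 3 + 4 -> 7
        return Add(left, right)
    return node

expr = Add(Const(3), Const(4))
print(fold(expr).value)  # 7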

Batch-Norm Folding

An optimization technique in which the parameters of batch normalization layers are merged into the weights of the preceding or succeeding matrix-multiply layers (including convolutions). This reduces computational complexity during inference by decreasing the number of operations required, as it effectively eliminates the standalone batch normalization layers.
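As a sketch of the arithmetic involved (again, an illustration rather than Forge's actual pass), folding a batch-norm layer into a preceding convolution amounts to rescaling the convolution's weights and bias per output channel:

import numpy as np

# Sketch of folding batch norm into a preceding convolution's parameters.
# weight: (out_ch, in_ch, kh, kw); bias, gamma, beta, mean, var: (out_ch,)
def fold_batch_norm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    scale = gamma / np.sqrt(var + eps)               # per-channel BN scale
    folded_weight = weight * scale[:, None, None, None]
    folded_bias = (bias - mean) * scale + beta
    return folded_weight, folded_bias

At inference time, the rescaled convolution produces the same outputs as the original convolution followed by batch normalization, so the standalone batch-norm operation can be dropped.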