RelayModule Class

forge.RelayModule

RelayModule(
    mod: obj,
    params: Optional[Dict[str, Union[obj, ndarray]]] = None,
    inline_partitions: bool = False,
    fold_constants: bool = True,
    fold_batch_norms: bool = True,
)

RelayModule is Forge's extension of TVM's Relay.IRModule; it facilitates advanced manipulation, optimization, and compilation of machine learning models.

This class provides a user-friendly, graph-based interface for the Relay Intermediate Representation (IR), making it easier to work with than the standard TVM IRModule. It is designed to accommodate both beginners and expert users in machine learning, offering tools for model calibration, optimization, and quantization, as well as an entry point for direct graph manipulation.

Parameters:

Name Type Description Default
mod obj

A Relay IRModule, i.e. TVM-IRModule

required
params Dict[str, Union[obj, ndarray]]

The weights of the IRModule, defaults to None.

None
inline_partitions bool

Inlines all partitions during initialization if enabled, defaults to False.

False
fold_constants bool

Folds all constant branches during initialization if enabled, defaults to True.

True
fold_batch_norms bool

Folds all nn.batch_norm operators during initialization if enabled, defaults to True.

True
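
Example (sketch): constructing a RelayModule from a model imported through a TVM frontend. The ONNX file name, input name, and input shape below are illustrative assumptions; only the constructor signature above is Forge API.

import onnx
import tvm.relay as relay
import forge

# Import a model with a TVM frontend (file name and input name are illustrative).
onnx_model = onnx.load("model.onnx")
shape_dict = {"input": (1, 3, 224, 224)}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Wrap the Relay IRModule; constant folding and batch-norm folding run by default.
module = forge.RelayModule(mod, params=params)
print(module.input_shapes, module.input_dtypes)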

Methods:

Name Description
copy

Get a deepcopy of the RelayModule.

get_inference_function

Get a Python callable that emulates the model's inference

set_batch_size

Sets the RelayModule's batch-size

partition_for_tensorrt

DEPRECATED: This method is deprecated and will be removed in future releases.

split_tensors

Performs tensor splitting on the weight tensors of convolution layers.

inline_partitions

DEPRECATED: This method is deprecated and will be removed in future releases.

compile

Compiles the model for a specified target with various configuration options.

calibrate

Calibrates the model by tracking intermediate layer statistics.

quantize

Applies quantization to the model with specified parameters.

__eq__

Compares for equality of hash

__hash__

Structural hash of the RelayModule

__iter__

Iterator of the underlying Node objects in evaluation order

Attributes:

Name Type Description
graphs Dict[str, Graph]
fingerprint str

Deterministic hashing of a RelayModule's structure and data

params Dict[str, obj]

The weights of the RelayModule that are not "frozen" into the graph

input_count int

RelayModule's number of expected inputs

input_shapes List[Sequence[int]]

List of RelayModule's input shapes

input_dtypes List[str]

List of RelayModule's input data types

input_nodes List[Node]

The RelayModule's computational graph's input Node objects

output_count int

RelayModule's number of expected outputs

output_shapes List[Sequence[int]]

List of RelayModule's output shapes

output_dtypes List[str]

List of RelayModule's output data types

output_node Node

The RelayModule's computational graph's output Node object

graph_count int

Number of computational graphs in the RelayModule, includes the "main" graph

subgraph_count int

Number of computational subgraphs in the RelayModule, excludes the "main" graph

operators Dict[str, int]

Full count of the RelayModule's operators

mod obj

Relay IRModule (i.e. TVM-IRModule) without type-annotations

typed_mod obj

Relay IRModule (i.e. TVM-IRModule) with type-annotations

main Graph

The RelayModule's "main" computational Graph object

is_tensorrt bool

Flag to check if RelayModule is partitioned for TensorRT or not

is_calibrated bool

Flag to check if RelayModule is calibrated or not

is_quantized bool

Flag to check if RelayModule is quantized or not

is_split_tensors bool

Flag to check if RelayModule has been tensor-split or not

Attributes

graphs class-attribute instance-attribute
graphs: Dict[str, Graph] = {
    gv.name_hint: Graph(relay_expr=func_expr, params=params, sink_name=gv.name_hint)
    for (gv, func_expr) in mod.functions.items()
}
fingerprint property
fingerprint: str

Deterministic hashing of a RelayModule's structure and data

To get a structural hash (excluding data) use the hash() function.
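
Example (sketch), assuming module is an existing RelayModule:

print(module.fingerprint)  # deterministic string over both structure and data
print(hash(module))        # structural hash only (weights excluded)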

params property
params: Dict[str, obj]

The weights of the RelayModule that are not "frozen" into the graph

input_count property
input_count: int

RelayModule's number of expected inputs

input_shapes property
input_shapes: List[Sequence[int]]

List of RelayModule's input shapes

input_dtypes property
input_dtypes: List[str]

List of RelayModule's input data types

input_nodes property
input_nodes: List[Node]

The RelayModule's computational graph's input Node objects

output_count property
output_count: int

RelayModule's number of expected outputs

output_shapes property
output_shapes: List[Sequence[int]]

List of RelayModule's output shapes

output_dtypes property
output_dtypes: List[str]

List of RelayModule's output data types

output_node property
output_node: Node

The RelayModule's computational graph's output Node object

graph_count property
graph_count: int

Number of computational graphs in the RelayModule, includes the "main" graph

Note: This is a strictly positive number.

subgraph_count property
subgraph_count: int

Number of computational subgraphs in the RelayModule, excludes the "main" graph

Note: This is a strictly non-negative number.

operators property
operators: Dict[str, int]

Full count of the RelayModule's operators

mod property
mod: obj

Relay IRModule (i.e. TVM-IRModule) without type-annotations

typed_mod property
typed_mod: obj

Relay IRModule (i.e. TVM-IRModule) with type-annotations

Note: This property is a quick means of validating that the module is a well-formed Relay.IRModule. Accessing it will throw a TVMError for an invalid RelayModule.
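
Example (sketch) of using typed_mod as a validity check, assuming module is an existing RelayModule:

import tvm

try:
    _ = module.typed_mod  # triggers type inference over the whole module
    print("RelayModule is well-formed")
except tvm.TVMError as err:
    print(f"Invalid RelayModule: {err}")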

main property
main: Graph

The RelayModule's "main" computational Graph object

is_tensorrt property
is_tensorrt: bool

Flag to check if RelayModule is partitioned for TensorRT or not

is_calibrated property
is_calibrated: bool

Flag to check if RelayModule is calibrated or not

is_quantized property
is_quantized: bool

Flag to check if RelayModule is quantized or not

is_split_tensors property
is_split_tensors: bool

Flag to check if RelayModule has been tensor-split or not

Functions

copy
copy() -> RelayModule

Get a deepcopy of the RelayModule.

Copying a RelayModule is useful for taking a "checkpoint" of the module. This can be especially useful before any in-place transformations such as quantization or partitioning.

Returns:

Type Description
RelayModule

A duplicate copy of the RelayModule
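
Example (sketch) of the checkpoint pattern, assuming module is an existing RelayModule:

checkpoint = module.copy()                 # snapshot before an in-place transformation
module.quantize(activation_dtype="int8")   # modifies module in place
# If the quantized module is unsatisfactory, fall back to the snapshot:
module = checkpoint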

get_inference_function
get_inference_function(target: str = 'llvm') -> Callable

Get a Python callable that emulates the model's inference

This can be useful for debugging purposes and validating accuracy/correctness. The returned callable is not an optimized compilation and should not be used to measure optimized latency. A user of the returned callable should provide NumPy arrays as inputs, and can expect a NumPy array or a list of NumPy arrays as output.

Parameters:

Name Type Description Default
target str

A string that corresponds to the desired device target. A user typically should not need to explicitly set this (unless they really wish to run on GPU, i.e. target="cuda"). Please see the docstrings for RelayModule.compile() for more details on target strings.

'llvm'

Returns:

Type Description
Callable

Inference function
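
Example (sketch), assuming module is an existing RelayModule; the random input is for illustration only:

import numpy as np

infer = module.get_inference_function()  # defaults to target="llvm"
x = np.random.rand(*module.input_shapes[0]).astype(module.input_dtypes[0])
outputs = infer(x)  # NumPy array, or a list of NumPy arrays for multi-output models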

set_batch_size
set_batch_size(batch_size: int) -> None

Sets the RelayModule's batch-size

Parameters:

Name Type Description Default
batch_size int

The desired batch-size for the model

required

Returns:

Name Type Description
None None

This method operates in place.

partition_for_tensorrt
partition_for_tensorrt(
    remove_stacks: bool = True,
    simplify_batch_matmul: bool = True,
    simplify_scalar_add: bool = True,
    remove_no_mac_subgraphs: bool = True,
) -> None

DEPRECATED: This method is deprecated and will be removed in future releases. Please use forge.ONNXModule for TensorRT optimization

Partitions a RelayModule for TensorRT optimization by identifying and separating TensorRT-compatible subgraphs from the "main" computational graph.

This function analyzes the computational graph within the given RelayModule to identify subgraphs that can be optimized using TensorRT. It then partitions the graph, isolating these TensorRT-compatible subgraphs. The partitioning process ensures that only supported operations are included in these subgraphs, while the remaining graph continues to be handled by the default execution engine.

Note: This method will flag a RelayModule so that the RelayModule.is_tensorrt flag will be true. To undo the partitioning, use the RelayModule.inline_partitions() method.

Parameters:

Name Type Description Default
remove_stacks bool

Flag to run passes that remove stack operators by re-expressing them in different forms. Currently stack operators are not optimized by TensorRT compilation. Defaults to True.

True
simplify_batch_matmul bool

Flag to run a pass that will simplify "static" nn.batch_matmul operators. This pass is to avoid an error that can arise in TensorRT compilation. Defaults to True.

True
simplify_scalar_add bool

Flag to run a pass that will convert the values of scalar adds into "broadcasted" tensors. This is to circumvent the limitation of the TVM-TensorRT bridge, which doesn't accept scalar adds in its subgraphs. Defaults to True.

True
remove_no_mac_subgraphs bool

Flag to remove any subgraphs that don't contain multiply-accumulate (MAC) operators. Defaults to True.

True

Returns:

Name Type Description
None None

This method operates in place.

split_tensors
split_tensors(force: bool = False) -> None

Performs tensor splitting on the weight tensors of convolution layers.

This process divides a large weight tensor into two smaller tensors, which can facilitate parallel computation and aid in maintaining or improving quantization accuracy by allowing for more fine-grained parameterization over the quantization of different parts of the tensor.

Note: This method will flag a RelayModule so that the RelayModule.is_split_tensors flag will be true.

Parameters:

Name Type Description Default
force bool

If False, the operation will raise a ValueError when previously captured calibration data is detected. When True, the operation will wipe previously recorded calibration data. Default is False.

False
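
Example (sketch), assuming module is an existing RelayModule:

module.split_tensors()              # raises ValueError if calibration data is already present
# module.split_tensors(force=True)  # wipe prior calibration data and split anyway
assert module.is_split_tensors
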
inline_partitions
inline_partitions() -> None

DEPRECATED: This method is deprecated and will be removed in future releases. Please use forge.ONNXModule for TensorRT optimization

Undoes any partitions in the RelayModule and inlines all partitions back into the "main" computational graph, i.e. the inverse operation of RelayModule.partition_for_tensorrt().

Returns:

Name Type Description
None None

This method operates in place.

compile
compile(
    target: Union[str, Dict[str, Any]] = "llvm",
    host: Optional[Union[str, Dict[str, Any]]] = None,
    output_path: Optional[Union[str, Path]] = "./compile_output",
    opt_level: int = 3,
    set_float16: bool = False,
    set_channel_layout: Optional[str] = None,
    export_relay: bool = False,
    export_metadata: bool = False,
    force_overwrite: bool = False,
    uuid: Optional[str] = None,
    encrypt_password: Optional[str] = None,
) -> None

Compiles the model for a specified target with various configuration options.

This method compiles the model for a given target, which can be a string or a dictionary specifying the target attributes. The compilation can be customized through various parameters, including optimization level and data type settings.

Parameters:

Name Type Description Default
target Union[str, Dict[str, Any]]

Can be one of a literal target string, a target tag (pre-defined target alias), a JSON string describing a configuration, or a dictionary of configuration options. When using a dictionary or JSON string to configure the target, the possible values are:

kind : str (required) Which codegen path to use, for example "llvm" or "cuda".

keys : List of str (optional) A set of strategies that can be dispatched to. When using "kind=opencl" for example, one could set keys to ["mali", "opencl", "gpu"].

device : str (optional) A single key that corresponds to the actual device being run on. This will be effectively appended to the keys.

libs : List of str (optional) The set of external libraries to use. For example ["cblas", "mkl"].

system-lib : bool (optional) If True, build a module that contains self registered functions. Useful for environments where dynamic loading like dlopen is banned.

mcpu : str (optional) The specific cpu being run on. Serves only as an annotation.

model : str (optional) An annotation indicating what model a workload came from.

runtime : str (optional) An annotation indicating which runtime to use with a workload.

mtriple : str (optional) The llvm triplet describing the target, for example "arm64-linux-android".

mattr : List of str (optional) The llvm features to compile with, for example ["+avx512f", "+mmx"].

mfloat-abi : str (optional) An llvm setting that is one of "hard" or "soft" indicating whether to use hardware or software floating-point operations.

mabi : str (optional) An llvm setting. Generate code for the specified ABI, for example "lp64d".

host : Union[str, Dict[str, Any]] (optional) Description for target host. Can be recursive. Similar to target.

'llvm'
host Optional[Union[str, Dict[str, Any]]]

Similar to target but for the target host. Can be one of a literal target host string, a target tag (pre-defined target alias), a JSON string describing a configuration, or a dictionary of configuration options. When using a dictionary or JSON string, the possible values are the same as for target.

None
output_path Optional[Union[str, Path]]

The path to save the compiled output, ./compile_output by default.

'./compile_output'
opt_level int

Optimization level, ranging from 0 to 4. Larger numbers correspond to more aggressive compilation optimizations. Default is 3.

3
set_float16 bool

If True, enables Float16 data type for all operators permitted. Default is False.

False
set_channel_layout Optional[str]

Optional specification of the channel layout ("first", "last"); if None, the layout is left unchanged.

None
export_relay bool

If True, exports the Relay text representation of the model. Default is False.

False
export_metadata bool

If True, exports the metadata JSON of the model as a text file. Default is False.

False
force_overwrite bool

If True, the method will overwrite the provided output path if it already exists. A ValueError will be thrown if False and the output path already exists. Default is False.

False
uuid Optional[str]

Optional user-specified UUID for when the model needs a unique identifier set by the user. If this value is not set, a randomly generated UUID will be used.

None
encrypt_password Optional[str]

Optional password to be specified if it is desirable to have the model encrypted. When set, the output will consist of the model file and the key.

None

Returns:

Name Type Description
None None

The method operates in place.
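
Example (sketch): compiling for the local CPU and for a dictionary-configured AArch64 target. The mtriple, mattr, and output paths are illustrative; the dictionary keys are those listed for the target parameter above.

# Local CPU compilation with a literal target string.
module.compile(target="llvm", output_path="./compile_output", force_overwrite=True)

# Cross-compilation with a dictionary target (triple and features are illustrative).
arm_target = {
    "kind": "llvm",
    "mtriple": "aarch64-linux-gnu",
    "mattr": ["+neon"],
}
module.compile(target=arm_target, output_path="./compile_output_arm64", export_metadata=True)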

calibrate
calibrate(
    calib_data: Iterable[Any], reset: bool = True, use_cuda: bool = True
) -> None

Calibrates the model by tracking intermediate layer statistics.

This method collects statistics from intermediate layers of the model using the provided calibration dataset. These statistics are used for deriving quantization parameters in a subsequent quantization process. It's essential that the calibration data is representative of the model's expected real-world inputs.

Note: Ensure that the calibration data is in the form of numpy arrays and has undergone the necessary pre-processing steps required for the model.

Parameters:

Name Type Description Default
calib_data Iterable[Any]

An iterable of data samples for calibration. The samples should be in a format compatible with the model's input requirements. Inspect RelayModule.input_shapes and RelayModule.input_dtypes for details. For multiple inputs, each set of inputs should be an iterable of numpy arrays, e.g. a list or tuple of numpy arrays.

required
reset bool

If True, any previous calibration data is cleared before new data is processed. Defaults to True.

True
use_cuda bool

If True, Forge will utilize CUDA devices for calibration if GPUs can be found. Operation will fall back to CPU if GPUs are not found. Default is True.

True

Returns:

Name Type Description
None None

This method operates in place.

Raises:

Type Description
ValueError

If calib_data is not in the correct format.
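
Example (sketch), assuming module is an existing RelayModule; load_calibration_batches() is a hypothetical helper that yields pre-processed samples:

import numpy as np

calib_data = [
    np.expand_dims(sample, 0).astype("float32")  # one pre-processed sample per batch
    for sample in load_calibration_batches()     # hypothetical data-loading helper
]
module.calibrate(calib_data, reset=True, use_cuda=True)
assert module.is_calibrated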

quantize
quantize(
    activation_dtype: str = "int8",
    kernel_dtype: Optional[str] = None,
    bias_dtype: Optional[str] = None,
    per_channel: bool = False,
    calib_method: str = "average",
    quant_type: str = "any",
) -> None

Applies quantization to the model with specified parameters.

This method quantizes the model's activations, kernels, and biases to the specified data types. If kernel_dtype and bias_dtype are None, they default to the activation_dtype. The quantization can be "static" (requiring prior calibration), "dynamic" (no calibration needed), or "any" (prioritizing static if possible).

Note: When using "static" quant_type, ensure calibration is performed beforehand or provide calib_data for calibration. If split_tensors is enabled, existing calibration data is discarded due to graph changes, necessitating fresh calib_data.

Parameters:

Name Type Description Default
activation_dtype str

Data type for activations ("int8", "uint8"), default is "int8".

'int8'
kernel_dtype Optional[str]

Data type for kernels ("int8", "uint8"), defaults to activation_dtype if None.

None
bias_dtype Optional[str]

Data type for biases in nn.bias_add operators. Can be set to match activation_dtype or "int32", defaults to activation_dtype if None.

None
per_channel bool

If True, performs per-channel quantization on kernels. Default is False.

False
calib_method str

Method for calibration ("average", "entropy", "minmax", "percentile"), default is "average". Overview of calibration methods: "average" - computed average of the min-max extrema across calibration data; "entropy" - distribution-based maximization of entropy of quantized values; "minmax" - absolute most extreme min-max values across calibration data; "percentile" - computed 99th-percentile cut-offs across calibration data.

'average'
quant_type str

Type of quantization ("static", "dynamic", "any"), default is "any".

'any'

Returns:

Name Type Description
None None

This method operates in place.
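
Example (sketch) of static int8 quantization, assuming module has already been calibrated:

module.quantize(
    activation_dtype="int8",
    per_channel=True,
    calib_method="percentile",
    quant_type="static",
)
assert module.is_quantized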

__eq__
__eq__(other) -> bool

Compares for equality of hash

__hash__
__hash__() -> int

Structural hash of the RelayModule

__iter__
__iter__() -> Iterator

Iterator of the underlying Node objects in evaluation order
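
Example (sketch), assuming module is an existing RelayModule:

nodes = list(module)        # underlying Node objects in evaluation order
duplicate = module.copy()
assert duplicate == module  # equality compares structural hashes
assert hash(duplicate) == hash(module)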