5. Extending TVM¶
This section explains how to extend TVM to help improve the inference performance of a model. These extensions should not require a rebuild of TVM.
5.1. Transforming Relay IR¶
After a model is converted into TVM's Relay IR format, the Relay IR can be rewritten into an equivalent form for various reasons, such as simplifying arithmetic, removing identity ops, or maximizing TIDL offload.
The python/tvm/relay/backend/contrib/tidl/prepare.py
script contains the transformations
run before TIDL partitioning and C7x code generation. These transformations are Python based,
straightforward to understand, and easy to write.
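For illustration, a rewrite of this kind can be as small as a relay.ExprMutator subclass. The following sketch is a hypothetical pass that removes identity-like copy calls; it is not one of the transformations in prepare.py, but it shows the general shape of a Relay-to-Relay rewrite.

import tvm
from tvm import relay

class RemoveCopy(relay.ExprMutator):
    # Hypothetical rewrite: replace copy(x) with x itself.
    def visit_call(self, call):
        new_call = super().visit_call(call)
        if isinstance(new_call.op, tvm.ir.Op) and new_call.op.name == "copy":
            return new_call.args[0]
        return new_call

def remove_copy(func):
    # func is a relay.Function; returns an equivalent function without copy ops.
    return RemoveCopy().visit(func)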
The following example shows how transforming Relay IR can help maximize TIDL offload.
The example is available in the tests/python/relay/ti_tests/unit_tests/conv2d_1x2_stride.py
script.
The input for this example has a convolution layer with a 3x3 kernel size and a 1x2 stride, as shown in the Relay IR below.
%0 = nn.conv2d(%i0, %w1, strides=[1, 2], padding=[1, 1, 1, 1], kernel_size=[3, 3]);
However, TIDL does not support this combination of kernel size and stride; it does support a 3x3 kernel size with a 1x1 stride. Using the ConvertConvStride
transformation,
you can rewrite the original convolution as a convolution with a 3x3 kernel size and a 1x1 stride,
followed by max pooling with a 1x1 pool size and a 1x2 stride, as shown in the Relay IR below.
%0 = nn.conv2d(%i0, meta[relay.Constant][0] /* ty=Tensor[(16, 8, 3, 3), float32] */,
Tensor[(1, 8, 224, 224), float32], Tensor[(16, 8, 3, 3), float32],
padding=[1, 1, 1, 1], kernel_size=[3, 3]) /* ty=Tensor[(1, 16, 224, 224), float32] */;
%1 = nn.max_pool2d(%0, Tensor[(1, 16, 224, 224), float32], pool_size=[1, 1], strides=[1, 2],
padding=[0, 0, 0, 0]) /* ty=Tensor[(1, 16, 224, 112), float32] */;
After the transformation, the convolution layer can be offloaded to TIDL and benefit from the performance boost offered by TIDL on C7x/MMA. The max pooling layer with a 1x1 pool size and a 1x2 stride is still not supported by TIDL, but its computation is much simpler, and TVM C7x code generation can be used to generate code for this layer. Although this transformation doubles the amount of computation compared with the original convolution, the overall performance is still better than running the original convolution layer on Arm or C7x.
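A quick way to check that such a rewrite is numerically equivalent is to build both forms in Relay and compare their outputs on random data. The following sketch does that on the host with the generic graph executor (not the C7x flow), using the shapes from the example above.

import numpy as np
import tvm
from tvm import relay

i0 = relay.var("i0", shape=(1, 8, 224, 224))
w1 = relay.var("w1", shape=(16, 8, 3, 3))

# Original form: one conv2d with a 1x2 stride.
strided = relay.nn.conv2d(i0, w1, strides=(1, 2), padding=(1, 1, 1, 1), kernel_size=(3, 3))

# Rewritten form: 1x1-stride conv2d followed by 1x1 max pooling with a 1x2 stride.
unit = relay.nn.conv2d(i0, w1, strides=(1, 1), padding=(1, 1, 1, 1), kernel_size=(3, 3))
pooled = relay.nn.max_pool2d(unit, pool_size=(1, 1), strides=(1, 2))

data = np.random.uniform(size=(1, 8, 224, 224)).astype("float32")
weight = np.random.uniform(size=(16, 8, 3, 3)).astype("float32")

def run(expr):
    mod = tvm.IRModule.from_expr(relay.Function([i0, w1], expr))
    return relay.create_executor("graph", mod=mod).evaluate()(data, weight).numpy()

np.testing.assert_allclose(run(strided), run(pooled), rtol=1e-5)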
5.2. Customizing Compute and Schedule¶
TVM uses strategies to turn each operator into code. An operator can have many strategies associated with it. A strategy consists of two parts: the compute and the schedule.
The compute specifies what the operator does and is defined as a math formula.
The schedule specifies how to realize the computation via loop nests and data movement.
For each operator, you can customize the compute, the schedule, or both in order to generate better performing code on C7x. The following subsections provide some examples.
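As a minimal illustration of the two parts, not tied to any C7x operator, the tensor expression below defines a compute for element-wise addition and a schedule that splits its loop; tvm.lower prints the loop nest the schedule produces.

import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")

# Compute: the math formula describing what the operator produces.
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

# Schedule: how the computation is realized as loop nests.
s = te.create_schedule(C.op)
outer, inner = s[C].split(C.op.axis[0], factor=8)

print(tvm.lower(s, [A, B, C], simple_mode=True))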
5.2.1. Customizing compute¶
The python/tvm/topi/c7x/resize.py
script makes a copy of the default
tvm.relay.op.image._image.compute_resize2d
and simplifies the index computation so that
the generated code can be software pipelined on C7x devices.
The python/tvm/relay/op/strategy/c7x.py
script overrides the default strategy for resize2d
on C7x with the updated compute and schedule.
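The essence of such a simplification is that, for an integer upscale factor, the output-to-input coordinate mapping can be pure integer arithmetic. The following sketch illustrates the idea for nearest-neighbor resizing with integer scale factors; it is not the actual code in resize.py, which follows the upstream resize2d semantics.

from tvm import te

def resize2d_nearest_int_scale(data, scale_h, scale_w):
    # Hypothetical simplified compute for NCHW data: output pixel (y, x) reads
    # input pixel (y // scale_h, x // scale_w), with no floating-point
    # coordinate arithmetic in the index expression.
    n, c, h, w = data.shape
    return te.compute(
        (n, c, h * scale_h, w * scale_w),
        lambda nn, cc, y, x: data[nn, cc, y // scale_h, x // scale_w],
        name="resize2d_nearest",
    )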
5.2.2. Customizing schedule¶
The python/tvm/topi/c7x/injective.py
script uses a customized schedule for C7x injective
operators. It transforms loop nests, applies DMA with double buffering, and vectorizes the
innermost loop for C7x.
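The following sketch shows these scheduling primitives on a toy element-wise operator: the input is staged through a local buffer (which the C7x flow would realize as a DMA copy), double buffering is requested so the copy-in can overlap with compute, and the innermost loop is vectorized. It is an illustration of the primitives involved, not the schedule in injective.py.

import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A", dtype="float32")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")

s = te.create_schedule(B.op)
AA = s.cache_read(A, "local", [B])         # staging buffer for the input
outer, inner = s[B].split(B.op.axis[0], factor=64)
s[AA].compute_at(s[B], outer)              # copy one tile per outer iteration
s[AA].double_buffer()                      # overlap copy-in with compute
s[B].vectorize(inner)                      # vectorize the innermost loop

print(tvm.lower(s, [A, B], simple_mode=True))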
5.2.3. Customizing compute and schedule only for a special case¶
Significantly simplified computation is possible if all of the following are true:
Pool size is 1x1.
Data layout is “NCHW”.
Dilation is 1x1.
No padding is used.
The python/tvm/relay/op/strategy/c7x.py
script customizes the strategy for max_pool2d
only in this special case. The simplified computation is defined in the compute_max_pool2d_1x1_pool_size
function in the python/tvm/topi/c7x/pooling.py script, which treats the operator as an injective operator so that the C7x injective schedule can be used.
For all other cases, the default strategy is used.
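The idea behind the simplified compute: with a 1x1 window, NCHW layout, no padding, and 1x1 dilation, there is nothing to reduce over, so max pooling degenerates to a strided copy of the input, which is injective. A sketch of that compute (not the actual compute_max_pool2d_1x1_pool_size function) looks like this:

from tvm import te

def max_pool2d_1x1_pool_size_sketch(data, strides):
    # Every output element is simply the input element at the strided position.
    n, c, h, w = data.shape
    sh, sw = strides
    oh = (h - 1) // sh + 1
    ow = (w - 1) // sw + 1
    return te.compute(
        (n, c, oh, ow),
        lambda nn, cc, y, x: data[nn, cc, y * sh, x * sw],
        name="max_pool2d_1x1",
    )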
5.2.4. Customizing compute and schedule in compilation script¶
It is also possible to customize the strategy for a Relay operator in your compilation script.
The tests/python/relay/ti_tests/unit_tests/resize_nchw_1x2.py script shows how to override the default resize2d strategy for C7x in the add_c7x_resize_strategy function.
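A strategy override of this kind typically registers a new OpStrategy for the operator on the C7x target before the model is built. The sketch below follows the upstream TVM registration pattern; the helper names (wrap_compute_resize2d, wrap_topi_schedule) come from upstream TVM, while the c7x_resize2d compute and the registration details are assumptions here, so refer to the add_c7x_resize_strategy function for the actual code.

from tvm import topi
from tvm.relay.op import op as _op
from tvm.relay.op.strategy import resize2d_strategy, wrap_compute_resize2d, wrap_topi_schedule

def c7x_resize2d(*args, **kwargs):
    # Placeholder compute; the customized compute lives in the test script.
    return topi.image.resize2d(*args, **kwargs)

# Depending on the TVM version, an override flag may be needed if the "c7x"
# key already has a registered strategy.
@resize2d_strategy.register("c7x")
def resize2d_strategy_c7x(attrs, inputs, out_type, target):
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_resize2d(c7x_resize2d),
        wrap_topi_schedule(topi.generic.schedule_injective),
        name="resize2d.c7x",
    )
    return strategy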
5.2.5. Customizing a relay operator to call into an external library¶
If the resize2d operator has a 1x2 upscaling factor on NCHW float data, the same script used in the previous section, tests/python/relay/ti_tests/unit_tests/resize_nchw_1x2.py
customizes the compute
to call an external function, resize_nchw_1x2
, in an external library.
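In upstream TVM, the stock way to hand buffers from a compute to an external function is te.extern, whose callback emits a packed call in which each buffer is delivered as a DLTensor. The sketch below assumes a packed function registered under the hypothetical name "my.resize_nchw_1x2"; whether the C7x flow uses this exact hook or calls directly into the linked library is not shown here, so see resize_nchw_1x2.py for the actual wiring.

import tvm
from tvm import te

def resize2d_1x2_extern_compute(data):
    # Output keeps N, C, H and doubles W (the 1x2 upscaling factor).
    n, c, h, w = data.shape
    return te.extern(
        (n, c, h, w * 2),
        [data],
        lambda ins, outs: tvm.tir.call_packed(
            "my.resize_nchw_1x2", ins[0], outs[0]  # buffers arrive as DLTensors
        ),
        name="resize_nchw_1x2",
        dtype=data.dtype,
    )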
The resize_nchw_1x2.cpp C++ file contains the function definition. Tensors are passed from the TVM runtime to the C/C++ function as DLTensor data structures. The definition of DLTensor can be found in the 3rdparty/dlpack/include/dlpack/dlpack.h header file.
The V1
version of the code shows a naive implementation of resize2d with a 1x2 upscaling factor. The V2
version of the code optimizes the implementation using the C7x streaming engine (SE) feature.
The C++ file is compiled into a library and linked into
the C7x deployable module for the model.
As shown in the build_and_set_ext_lib function in the unit_utils.py script, the CGT7X_EXT_LIBS environment variable specifies additional libraries to be linked into the C7x deployable module. For example:
CGT7X_EXT_LIBS="-l /path/to/lib1 -l /path/to/lib2"
Note
resize_nchw_1x2.cpp
is not a generic implementation for all resize2d cases with an upscaling
factor 1x2. This file handles only float data in “NCHW” layout.