TI Deep Learning Library User Guide
TIDL: Quantization

Introduction

  • Floating point computations are not cost and power efficient. They can be substituted with fixed point computations (8 or 16 bit) with negligible loss in inference accuracy.
  • The Matrix Multiplier Accelerator (MMA) in the J7 platform supports 8-bit, 16-bit and 32-bit inference of deep learning models.
  • 8-bit inference provides a multiplier throughput of 4096 MACs per cycle for a 64x64 matrix multiply, so 8-bit inference is the preferred mode on the J7 platform.
    • 16-bit and 32-bit inference come with a significant performance penalty.
    • 16-bit inference multiplier throughput is 1024 MACs per cycle.
    • The required memory I/O would also be higher.

In TIDL, the scale for each parameter tensor and feature map is selected separately based on its range (minimum and maximum values). The scales are propagated through the layer processing up to the last layer. All feature maps and parameters are maintained in a fixed point representation with an associated floating point scale. At the end, the user can convert the final output tensors to floating point by dividing each element in the tensor by its scale.

TIDL_Quant_Intro1.png
TIDL - Basic Quantization
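
The basic flow can be illustrated with a short NumPy sketch (purely illustrative, not TIDL code; the function names are hypothetical): a float tensor is converted to 8-bit fixed point using a scale derived from its range, and converted back to floating point at the end by dividing by that scale.

  import numpy as np

  def quantize_symmetric(x, num_bits=8):
      """Quantize a float tensor to signed fixed point with a per-tensor scale."""
      qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8-bit signed
      scale = qmax / np.abs(x).max()                 # float -> fixed point multiplier
      x_q = np.clip(np.round(x * scale), -qmax - 1, qmax).astype(np.int8)
      return x_q, scale

  def dequantize(x_q, scale):
      """Convert the fixed point tensor back to floating point."""
      return x_q.astype(np.float32) / scale

  x = np.random.randn(4, 4).astype(np.float32)
  x_q, scale = quantize_symmetric(x)
  x_hat = dequantize(x_q, scale)                     # close to x, within quantization error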

There are multiple schemes to select the scale for a given tensor based on its expected range. The scale values can be restricted to powers of two for an optimal implementation on the supported device architecture. The image below shows two schemes for selecting the scale for a signed tensor.

TIDL_Quant_Intro2.png
TIDL - Scale Selection Schemes
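
The two schemes can be sketched as follows (illustrative NumPy, assuming a signed 8-bit representation): the first derives the scale from the exact maximum absolute value, giving a non power of two scale, while the second rounds the range up to the next power of two so that the scale itself is a power of two, which is cheaper to apply on the device.

  import numpy as np

  def scale_non_power_of_two(x, num_bits=8):
      """Scheme 1: map the exact |max| of the tensor to the largest signed code."""
      qmax = 2 ** (num_bits - 1) - 1
      return qmax / np.abs(x).max()

  def scale_power_of_two(x, num_bits=8):
      """Scheme 2: round the range up to the next power of two, so the
      resulting scale is a power of two (applied with cheap shifts)."""
      range_p2 = 2.0 ** np.ceil(np.log2(np.abs(x).max()))
      return (2 ** (num_bits - 1)) / range_p2

  x = np.array([-2.7, 0.3, 1.9], dtype=np.float32)
  print(scale_non_power_of_two(x))   # ~47.04, not a power of two
  print(scale_power_of_two(x))       # 32.0 (= 128 / 4)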

If the tensor data is unsigned (for example, the output of a ReLU layer), then the entire 8-bit range can be used to represent positive values only. This helps to achieve a lower quantization error.

TIDL_Quant_Intro3.png
TIDL - Scale Selection Schemes
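
For an unsigned tensor the same idea applies, but the full 0..255 range is used for the positive side only, roughly halving the quantization step compared with the signed case (again an illustrative sketch):

  import numpy as np

  def quantize_unsigned(x, num_bits=8):
      """All values are >= 0 (e.g. a ReLU output), so the whole
      0..255 range represents the positive side only."""
      qmax = 2 ** num_bits - 1                       # 255
      scale = qmax / x.max()
      x_q = np.clip(np.round(x * scale), 0, qmax).astype(np.uint8)
      return x_q, scale

  relu_out = np.array([0.0, 0.4, 1.1, 2.9], dtype=np.float32)
  x_q, scale = quantize_unsigned(relu_out)           # scale ~87.9 vs ~43.8 if signed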

The TIDL SW supports either power of two scales (TIDL_QuantStyleP2Dynamic) or non power of two scales (TIDL_QuantStyleNP2Fixed) for parameters/weights. Feature maps always use power of two scales and can be either signed or unsigned.

TIDL Layers which Require Parameter Quantization

  • Convolution Layer
  • De-convolution Layer
  • Inner-Product Layer
  • Batch Normalization (Scale/Mul, Bias/Add, PReLU)
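
As an illustration of how parameter and feature map scales combine inside such a layer, the sketch below (not the TIDL implementation; all names are hypothetical) shows an 8-bit inner product: weights and inputs are multiplied and accumulated in 32 bits, the bias is quantized with the product of the two scales, and the accumulator is then re-scaled to the output feature map scale.

  import numpy as np

  def quantized_inner_product(x_q, x_scale, w_q, w_scale, bias_f, out_scale):
      """8-bit x and w, 32-bit accumulation, then requantization to the output scale."""
      acc = x_q.astype(np.int32) @ w_q.astype(np.int32)       # int32 accumulator
      acc_scale = x_scale * w_scale                           # scale of the accumulator
      acc += np.round(bias_f * acc_scale).astype(np.int32)    # bias in accumulator scale
      # With power of two scales, out_scale / acc_scale is a power of two,
      # so this re-scaling reduces to a shift on the device.
      out_q = np.round(acc * (out_scale / acc_scale))
      return np.clip(out_q, -128, 127).astype(np.int8)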

Quantization Options

TIDL provides the following quantization options to the user:

  • A. Post Training Quantization
  • B. Training for Quantization
  • C. Quantization aware Training

We recommend using option 'A' for the network first. If the quantization accuracy loss is not acceptable, then you can try option 'B'. If the result with 'B' is also not acceptable, then the user can use option 'C'. This option works in all cases; its only drawback is that it requires additional effort from the user to re-train the network.

TIDL_Quant_Options.png
TIDL - Quantization Options

A. Post Training Quantization

  • Training-free quantization – the option most preferred by users.
  • Model parameters are quantized offline.
    • Power of 2 or non power of 2 scales
    • Scales are selected based on the min and max values in the given layer.
    • The range of each feature map is calibrated offline with a few sample inputs (see the calibration sketch after this list).
    • The calibrated range (min and max) values are used for quantizing the feature maps on the target during inference (real time).
    • Option to update the range dynamically – has a performance impact.
    • Observed accuracy drop is less than 1% w.r.t. floating point for many networks with 8 bits
      • ResNets, SqueezeNet, VGG, MobileNetV1+SSD, etc.
  • Future/Planned Improvements
    • Improved offline range calibration for both weights and features
    • Per-channel quantization for depthwise convolution parameters
    • Mixed Precision – 8-bit or 16-bit compute selection at layer level
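
A minimal sketch of the offline range calibration step mentioned in the list above (hypothetical helper; the TIDL import tool performs this internally): a few sample inputs are run through the floating point model, the min/max of each feature map is recorded, and a fixed power of two scale is derived for use on the target at inference time.

  import numpy as np

  def calibrate_feature_scale(sample_activations, num_bits=8):
      """Collect the range over a few calibration inputs and return a
      power of two scale for this feature map (signed case)."""
      max_abs = max(np.abs(a).max() for a in sample_activations)
      range_p2 = 2.0 ** np.ceil(np.log2(max_abs))    # round range up to a power of two
      return (2 ** (num_bits - 1)) / range_p2

  # activations of one layer collected from a few calibration images
  samples = [np.random.randn(1, 16, 32, 32) for _ in range(4)]
  feat_scale = calibrate_feature_scale(samples)      # fixed scale reused at inference time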

B. Training for Quantization

  • This option is incremental over Post Training Quantization.
  • For certain networks, post-training quantization is NOT enough to reach the desired accuracy.
  • Fine tune the network with regularization (weight decay); see the sketch after this list.
  • Does not need full re-training.
  • Regularization is supported in all popular training frameworks.
  • Observed accuracy drop is less than 1% with respect to floating point for MobileNet v1 trained in TensorFlow.
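
As a rough sketch of what fine tuning with regularization looks like in a training framework (PyTorch is used here purely as an example; the model and hyperparameter values are placeholders): weight decay penalizes large weight magnitudes, which keeps the per-layer weight ranges small and therefore quantization friendly.

  import torch
  import torch.nn as nn

  # stand-in for the pre-trained floating point network being fine tuned
  model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))

  # weight_decay (L2 regularization) discourages large weight values
  optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                              momentum=0.9, weight_decay=1e-4)

  criterion = nn.CrossEntropyLoss()
  images, labels = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
  loss = criterion(model(images), labels)
  loss.backward()
  optimizer.step()          # one fine-tuning step; a few epochs suffice, not full re-training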

C. Quantization aware Training

  • Model parameters are trained to comprehend the 8-bit fixed point inference loss.
  • This needs support/changes in the training framework.
  • The feature map range values are also part of the model, so there is no need for feature range calibration. Examples – CLIP, Minimum, PACT, ReLU6 operators (see the sketch after this list).
  • The accuracy drop can be limited to very close to zero.
  • Jacinto - Deep Learning/CNN Training Examples & Quantization can be used for this: Link
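
As an illustrative sketch (not the Jacinto training code), the PACT-style activation below keeps the clip threshold as a trainable parameter, so the feature map range becomes part of the model and no separate range calibration is required after training.

  import torch
  import torch.nn as nn

  class LearnableClippedReLU(nn.Module):
      """PACT-style activation: the clip threshold is a trainable parameter,
      so the feature map range is stored in the model itself."""
      def __init__(self, init_clip=6.0):
          super().__init__()
          self.clip = nn.Parameter(torch.tensor(init_clip))

      def forward(self, x):
          # y = min(relu(x), clip); gradients also flow into the clip value
          return torch.minimum(torch.relu(x), self.clip)

  act = LearnableClippedReLU()
  y = act(torch.randn(4, 8))            # activations limited to [0, clip]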

Handling of ReLU6

  • The Rectified Linear Unit (ReLU) is used as the activation in layers of many CNN networks (for vision processing).
  • Unlike the sigmoid activation (which is used in traditional neural networks), the range of ReLU activation values can be very large depending on the input.
  • To restrict the feature activation range, recent networks use ReLU6 as the activation instead of ReLU. Example: MobileNet v1 and MobileNet v2 trained in TensorFlow.
  • Implementing saturation to 6 in fixed point inference would need floating point computation or a look-up table, which are not efficient in terms of cost and power.
  • To overcome this, we propose the following:
    • Fine tune the network by replacing ReLU6 with ReLU.
    • Train the network with a power of two threshold (ReLU4 or ReLU8). This can be achieved by using Minimum or CLIP operators.
    • If fine tuning is not the desired solution, we propose replacing ReLU6 with ReLU8 (instead of ReLU) during inference. This is done automatically by TIDL.
    • ReLU8 can be performed with just a shift/saturation operation in fixed point inference, without needing floating point computation or a look-up table (see the sketch after this list).
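
A minimal fixed point sketch of why a power of two threshold is cheap (illustrative only): with a power of two feature map scale of 2**scale_shift, the threshold 8.0 corresponds to 1 << (scale_shift + 3), so the saturation needs only shifts and compares, with no floating point math or look-up table.

  import numpy as np

  def relu8_fixed_point(x_q, scale_shift):
      """ReLU8 on fixed point data whose scale is 2**scale_shift.
      The threshold 8.0 maps to 1 << (scale_shift + 3), so only
      shifts and compares are needed."""
      threshold_q = 1 << (scale_shift + 3)           # 8.0 in fixed point
      return np.clip(x_q, 0, threshold_q)

  # accumulator values with a power of two feature scale of 2**4 = 16
  x_q = np.array([-20, 5, 100, 200], dtype=np.int32)
  print(relu8_fixed_point(x_q, 4))                   # [  0   5 100 128]
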
TIDL_Quant_Relu6.png
TIDL - ReLU6