MMALIB Release Notes
Version: 00.07.01
Contents
- Introduction
- Licensing
- Getting Started
- Documentation
- What's New
- Upgrade and Compatibility Information
- Device Support
- Validation Information
- Fixed Issues
- Known Issues
- Technical Support
- Package Versioning
Introduction
The MMALIB package consists of the Texas Instruments optimized kernels for CNN, FFT, and LINALG algorithms.
Licensing
The licensing information for this library, a complete manifest, and export control information are detailed here [TBD].
Getting Started
The MMALIB User Guide [TBD] provides the documentation and references necessary to begin development on TI's platforms.
Documentation
Refer to the following documentation for further details:
MMALIB User Guide | Build instructions, API Guide | [TBD] |
Test Reports | MISRA C reports, conformance test reports, TI platform test reports | [TBD] |
Software Manifest | Licenses, terms of use | [TBD] |
What's New
Here are a few of the new features supported in this release:
CNN
[new] Linux host emulation support
[new] Row convolution 1x1 stride by 2
[new] Optimized row convolution 8 bit K < 3 for strided and non strided cases
[new] Last column processing is now handled
[new] Column convolution 3x3 stride by 2
[new] Fully connected layer and utility to generate Bias predicate registers
[new] Static performance test cases for row convolution, column convolution, and fully connected layers. Static test inputs are generated using Caffe
8 bit, 16 bit signed support
MMALIB_CNN_convolve_row_ixX_ixX_oxX
MMALIB_CNN_convolve_row_1x1stride2_ixX_ixX_oxX
MMALIB_CNN_convolve_row_3x3stride2_ixX_ixX_oxX
MMALIB_CNN_convolve_row_5x5stride2_ixX_ixX_oxX
MMALIB_CNN_convolve_row_7x7stride2_ixX_ixX_oxX
MMALIB_CNN_convolve_row_11x11stride4_ixX_ixX_oxX
MMALIB_CNN_convolve_col_smallNo_ixX_ixX_oxX
MMALIB_CNN_fullyConnected_ixX_ixX_oxX
MMALIB_CNN_generateFillSeamPredicateRegisters
MMALIB_CNN_generateFillBiasPredicateRegisters
LINALG
8, 16 and 32 bit signed integer support
MMALIB_LINALG_matrixMatrixMultiply_ixX_ixX_oxX
MMALIB_LINALG_pointwiseMatrixMatrixMultiply_ixX_ixX_oxX
MMALIB_LINALG_matrixMatrixMultiplyAccumulate_ixX_ixX_ixX_oxX
FFT
Support for 16-bit and 32-bit signed complex input/output
Support for real and imaginary parts of data to be in either interleaved or non-interleaved format. Optimal performance assumes interleaved data format
Support for FFT sizes from 2 through 8192. FFT sizes are assumed to be powers of 2
Support for batch processing of multiple FFTs in one kernel call. For optimal performance, input data and output data for the entire batch, and twiddle factor data are assumed to fit in L2 memory
MMALIB_FFT_dftSmall_ixX_cxX_oxX
MMALIB_FFT_dftLarge_ixX_cxX_oxX
MMALIB_FFT_highRadixDecompositions_ixX_cxX_oxX
MMALIB_FFT_fft_ixX_cxX_oxX
Utility functions (with suffixes _getSizes) to compute sizes required for data and twiddle factor buffers
Utility functions (with suffixes _twGen) to compute relevant DFT matrices and twiddle factors
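The interleaved and non-interleaved data formats listed above can be illustrated with a minimal sketch. It assumes that "interleaved" means the real and imaginary parts alternate in a single buffer (re0, im0, re1, im1, ...) and that "non-interleaved" means two separate buffers; these release notes do not define the layouts, so treat both as assumptions. The helper name `deinterleave_c16` is hypothetical and is not part of MMALIB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT an MMALIB API.
 * Assumed interleaved layout: re0, im0, re1, im1, ...
 * Splits n complex 16-bit samples into separate real and imaginary buffers. */
static void deinterleave_c16(const int16_t *interleaved, size_t n,
                             int16_t *re, int16_t *im)
{
    assert(interleaved != NULL && re != NULL && im != NULL);
    for (size_t i = 0; i < n; ++i) {
        re[i] = interleaved[2 * i];     /* even slots hold real parts */
        im[i] = interleaved[2 * i + 1]; /* odd slots hold imaginary parts */
    }
}
```

Since optimal performance is stated for the interleaved format, a conversion like this would more likely be used in test code than on a performance-critical path.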
Details
Dilation
D = 1: works
D > 1: untested
CNN style 2D convolution FxF stride by 1
K = 1: optimized 8 bit, 16 bit works with natural C
K = 2: optimized 8 bit, 16 bit works with natural C
K = 3: optimized 8 bit, 16 bit works with natural C
K > 3: optimized
1x1 : supported as regular convolution
CNN style 2D convolution 1x1 stride by 2
K = 1: currently natural C
K = 2: optimized 8 bit, 16 bit works with natural C
K = 3: optimized 8 bit, 16 bit works with natural C
K > 3: optimized
CNN style 2D convolution 3x3 stride by 2
K = 1: optimized 8 bit tested, 16 bit not tested
K = 2: optimized 8 bit tested, 16 bit not tested
K = 3: optimized 8 bit tested, 16 bit not tested
K > 3: optimized
CNN style 2D convolution 5x5 stride by 2
K = 1: optimized 8 bit tested, 16 bit not tested
K = 2: optimized 8 bit tested, 16 bit not tested
K = 3: optimized 8 bit tested, 16 bit not tested
K > 3: optimized
CNN style 2D convolution 7x7 stride by 2
K = 1: currently natural C
K = 2: currently natural C
K = 3: optimized 8 bit tested, 16 bit not tested
K > 3: optimized
CNN style 2D convolution 11x11 stride by 4
K = 1: currently natural C
K = 2: currently natural C
K = 3: currently natural C
K > 3: optimized
Optimized code falls back to the natural C version for cases that are not yet optimized. Further optimization is waiting on two-stage PE scheduling support from the compiler
Bias Rows flexible and use of DECDIM feature
Use of LEZR feature for optimization of K < 3 cases.
Strided convolution expects every kernel call to provide a fixed number of columns
Host emulation is not tested for strided 16 bit convolution or column convolution
CNN style 2D convolution 3x3 5x5 7x7 Ni = No = 1
stride-by-1
optimized-C and natural-C for data arrangement case 1.A.0 (even number of block columns to process)
[new] 3x3 stride-by-2
optimized-C and natural-C for data arrangement case 1.A.0 (even number of block columns to process)
optimized-C requires the number of rows in a feature map to be even
All:
[new] multiple groups processed in single kernel call
[new] supports multiple rows of bias provided the rows do not push the computation beyond the MMA block size
CNN style 2D convolution 3x3 5x5 7x7 Ni, No = small
All: not yet implemented
[new] Fully Connected
Supports 8- and 16-bit datatypes
Supports signed and unsigned combinations of datatypes with kernel matrix always assumed to be signed
Support of multiple Bias rows using 2D DECDIM feature
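As a point of reference for the fully connected layer, the following is a generic natural-C style sketch of the underlying arithmetic for one datatype combination: unsigned 8-bit input with a signed 8-bit kernel matrix. It is not the MMALIB kernel. The function name, the row-major kernel layout, the single bias value per output, and the 32-bit accumulator output are all assumptions made for illustration, and output scaling, saturation, and DECDIM-based bias handling are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generic reference computation, NOT the MMALIB kernel or its data layout.
 * out[o] = bias[o] + sum over i of kernel[o * n_in + i] * in[i]
 * kernel: signed 8-bit, row-major (assumption); in: unsigned 8-bit. */
static void fully_connected_ref_u8(const int8_t *kernel, const uint8_t *in,
                                   const int32_t *bias, int32_t *out,
                                   size_t n_out, size_t n_in)
{
    assert(kernel != NULL && in != NULL && bias != NULL && out != NULL);
    for (size_t o = 0; o < n_out; ++o) {
        int32_t acc = bias[o]; /* accumulate in 32 bits */
        for (size_t i = 0; i < n_in; ++i) {
            acc += (int32_t)kernel[o * n_in + i] * (int32_t)in[i];
        }
        out[o] = acc;
    }
}
```

A reference like this is mainly useful for checking the results of an optimized kernel on small inputs.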
DFT building blocks
16-bit, interleaved/non-interleaved, any FFT_SIZE, any BATCH_SIZE: works
32-bit, interleaved/non-interleaved, any FFT_SIZE, any BATCH_SIZE: works
FFT building blocks
16-bit, interleaved/non-interleaved, 64 < FFT_SIZE < 8192, BATCH_SIZE = 1,2,4,8,16,32,64: works
32-bit, interleaved/non-interleaved, 32 < FFT_SIZE < 8192, BATCH_SIZE = 1,2,4,8,16,32,64: works
FFT coordination
16-bit, interleaved/non-interleaved, any FFT_SIZE, any BATCH_SIZE: works
32-bit, interleaved/non-interleaved, any FFT_SIZE, any BATCH_SIZE: works
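The FFT building block entries above can be turned into a simple parameter check. The sketch below copies the listed ranges literally (strict inequalities, batch sizes 1 through 64 in powers of 2) and adds the power-of-2 assumption on FFT sizes stated earlier in these notes. The function names are hypothetical and not part of MMALIB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when x is a non-zero power of two. */
static bool is_power_of_two(uint32_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Hypothetical check, NOT an MMALIB API. Mirrors the "FFT building blocks"
 * entries: 16-bit needs 64 < FFT_SIZE < 8192, 32-bit needs
 * 32 < FFT_SIZE < 8192, and BATCH_SIZE must be one of 1,2,4,8,16,32,64. */
static bool fft_block_config_listed(unsigned bits, uint32_t fft_size,
                                    uint32_t batch_size)
{
    assert(bits == 16 || bits == 32);
    const uint32_t lower = (bits == 16) ? 64u : 32u;
    return is_power_of_two(fft_size) &&
           fft_size > lower && fft_size < 8192u &&
           is_power_of_two(batch_size) && batch_size <= 64u;
}
```

Configurations outside these ranges are not necessarily unsupported by the library as a whole. The check only reflects what is listed as working for the FFT building blocks.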
Upgrade and Compatibility Information
File | Change description | User application change required | User application recompile required |
Device Support
SoC | Host (OS) | Target (OS) | Test Platform |
J7ES C7x, MMA | No OS | No OS | Loki |
Validation Information
This release was built and validated using the following tools:
Build Tools (NOT included in MMALIB):
- C7x CGT Alpha version: 7.2
- Loki v4.7.339 Linux: includes the loki7x executable and Loki shared libraries only. Loki also depends on the GCC 4.9.x lib64 libraries
Fixed Issues
- [MMALIB-111] Last block is not handled for case when subMChannels == MChannels
- [MMALIB-124] Row flow non-strided case not working for K < 1
- [MMALIB-125] Row Flow : Seam insertion logic is not working for K<1 case
Known Issues
- None
Technical Support
For technical support, please post your questions on the TI E2E Forum for Automotive ADAS SoCs.
For additional assistance, contact your local TI Field Application Engineer.
Package Versioning
Each package version is composed of 4 period-delimited numbers, represented here by the letters M, m, p and b [M.m.p.b]. The table below provides a descriptive reference regarding package version numbering.
Digit | Meaning | Description |
1 (M=Major) | Major revision | Incremented when the new version is substantially different from the previous one. For example, a new module added or an existing module's algorithm significantly altered. |
2 (m=minor) | Minor revision | Incremented when the new version has changed but not in a major way. For example, some minor changes in the API or feature set. |
3 (p=patch) | Patch number | Incremented for all other source code changes. This includes any packaging support code. |
4 (b=build) | Build number | Incremented for each release delivery to CM. Reset for any change to M, m or p. |
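The M.m.p.b scheme in the table above can be sketched in code as a parser and an ordering function. The type and function names are hypothetical and not part of MMALIB. The sketch assumes all four fields are present as unsigned decimal numbers, so a three-field string is rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical type for illustration; fields follow the M.m.p.b table. */
typedef struct {
    unsigned major, minor, patch, build;
} pkg_version_t;

/* Parses "M.m.p.b". Returns true only when exactly four fields are read
 * and nothing follows the build number. */
static bool parse_pkg_version(const char *s, pkg_version_t *v)
{
    assert(s != NULL && v != NULL);
    char trailing;
    int n = sscanf(s, "%u.%u.%u.%u%c", &v->major, &v->minor, &v->patch,
                   &v->build, &trailing);
    return n == 4;
}

/* Returns -1, 0 or 1 when a is older than, equal to, or newer than b.
 * Fields are compared from most significant (M) to least significant (b). */
static int compare_pkg_version(const pkg_version_t *a, const pkg_version_t *b)
{
    const unsigned fa[4] = {a->major, a->minor, a->patch, a->build};
    const unsigned fb[4] = {b->major, b->minor, b->patch, b->build};
    for (int i = 0; i < 4; ++i) {
        if (fa[i] != fb[i]) {
            return fa[i] < fb[i] ? -1 : 1;
        }
    }
    return 0;
}
```

Leading zeros are read as decimal digits, so a field written as "07" parses as 7.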
Copyright 2018, Texas Instruments Incorporated