4.11. Memory Optimizations

Optimizations that improve the loading and storing of data are often crucial to the performance of an application. A detailed examination of useful memory optimizations on Keystone 3 devices is outside the scope of this document. However, the following are the most common optimizations used to aid memory system throughput and reduce memory hierarchy latency.

  • Blocking: Input, output, and temporary arrays/objects are often too large to fit into Multicore Shared Memory Controller (MSMC) or L2 memory. For example, a 1000x1000 pixel image is roughly 1 MB even at one byte per pixel, larger than most or all configurations of L2 memory, so an algorithm that traverses the entire image may thrash the caches, leading to poor performance. Keeping data as close to the CPU as possible improves memory system performance, but how can this be done when the image does not fit in the L2 cache? Depending on the algorithm, it may be useful to apply a technique called "blocking," in which the algorithm is modified to operate on only a portion of the data at a time. Once that "block" of data is processed, the algorithm moves on to the next block. This technique is often paired with the others in this list.
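The blocking idea can be sketched in plain C as follows. This is a minimal host-side illustration, not C7000-specific code; the image size, block size, and the `threshold_*` functions are made up for the example. The point is that the blocked version produces the same result while its inner loops touch only one small tile at a time, keeping the working set cache-resident.

```c
#include <stddef.h>

#define IMG_W 64            /* hypothetical image width  */
#define IMG_H 64            /* hypothetical image height */
#define BLK   16            /* block (tile) edge, sized to fit fast memory */

/* Whole-image pass: simple, but the working set is the entire image. */
static void threshold_whole(unsigned char *img, size_t w, size_t h)
{
    for (size_t i = 0; i < w * h; i++)
        img[i] = (img[i] > 128) ? 255 : 0;
}

/* Blocked pass: identical result, but each inner loop pair touches only
   one BLK x BLK tile, so the data being processed stays cache-resident. */
static void threshold_blocked(unsigned char *img, size_t w, size_t h)
{
    for (size_t by = 0; by < h; by += BLK)
        for (size_t bx = 0; bx < w; bx += BLK)
            for (size_t y = by; y < by + BLK && y < h; y++)
                for (size_t x = bx; x < bx + BLK && x < w; x++)
                    img[y * w + x] = (img[y * w + x] > 128) ? 255 : 0;
}
```

For a simple element-wise operation like this, blocking changes only the traversal order; the benefit grows for algorithms with data reuse (filters, matrix products), where each tile is read many times while it sits in fast memory.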

  • Direct Memory Access (DMA): Consider using the asynchronous DMA capabilities of the device to move new data into MSMC or L2 memory, and to move processed data back out. This frees the C7000 CPU to perform computations while the DMA engine readies data for the next frame, block, or layer.
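The staging pattern behind this can be sketched as below. This is only a structural illustration: `fake_dma` is a stand-in for a real asynchronous DMA transfer (on a Keystone 3 device this would be a driver-level channel submit plus a separate wait-for-completion, not a blocking `memcpy`), and the buffer and array sizes are invented for the example.

```c
#include <stddef.h>
#include <string.h>

#define TOTAL 256           /* hypothetical element count in "DDR"   */
#define CHUNK  32           /* hypothetical on-chip staging-buffer size */

/* Stand-in for a DMA transfer. A real implementation would submit the
   transfer asynchronously and wait on completion elsewhere, so the CPU
   could compute while the transfer is in flight. */
static void fake_dma(int *dst, const int *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Stage CHUNK-sized pieces of a large "DDR" array through a small
   "on-chip" buffer: DMA in, compute close to the CPU, DMA results out. */
static void scale_via_staging(int *ddr, size_t total)
{
    int l2_buf[CHUNK];                          /* stand-in for L2/MSMC buffer */
    for (size_t off = 0; off < total; off += CHUNK) {
        fake_dma(l2_buf, ddr + off, CHUNK);     /* "DMA in"  */
        for (size_t i = 0; i < CHUNK; i++)      /* compute on fast memory */
            l2_buf[i] *= 2;
        fake_dma(ddr + off, l2_buf, CHUNK);     /* "DMA out" */
    }
}
```

With real asynchronous transfers, the "DMA in" for iteration n+1 would be issued before the compute loop for iteration n, which is exactly the overlap the ping-pong scheme below formalizes.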

  • Ping-Pong Buffers: Consider using ping-pong memory buffers so that the C7000 CPU is processing data in one buffer while a DMA transfer is occurring to/from another buffer. When the C7000 CPU is finished processing the first buffer, the algorithm switches to the second buffer, which now has new data as a result of a DMA transfer. Consider placing these buffers in MSMC or L2 memory, which is much faster than DDR memory.
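The buffer-switching logic can be sketched as follows. This single-threaded sketch is illustrative only: `fill_next` stands in for a DMA transfer that, on real hardware, would run concurrently with the compute loop rather than before it, and the buffer size and chunk count are made up. The essential ingredients are the two buffers, the current-buffer index, and the swap at the end of each iteration.

```c
#include <stddef.h>

#define PP_CHUNK 8          /* hypothetical per-buffer element count */

/* Stand-in producer: "DMA" the next chunk of input into one buffer. */
static void fill_next(int *buf, size_t n, int base)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = base + (int)i;
}

/* Ping-pong loop: while a real DMA would be filling buffers[1 - cur]
   in the background, the CPU sums buffers[cur]; then the roles swap. */
static long pingpong_sum(int chunks)
{
    int buffers[2][PP_CHUNK];                   /* place in MSMC/L2 on device */
    long total = 0;
    int cur = 0;

    fill_next(buffers[cur], PP_CHUNK, 0);       /* prime the first buffer */
    for (int c = 0; c < chunks; c++) {
        if (c + 1 < chunks)                     /* would overlap with compute */
            fill_next(buffers[1 - cur], PP_CHUNK, (c + 1) * PP_CHUNK);
        for (size_t i = 0; i < PP_CHUNK; i++)   /* CPU works on current buffer */
            total += buffers[cur][i];
        cur = 1 - cur;                          /* switch ping <-> pong */
    }
    return total;
}
```

On device, the "switch" is just this index flip; the correctness requirement is that the CPU never reads a buffer until its DMA transfer has been confirmed complete, typically by waiting on the transfer-completion event before the swap.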