TIOVX User Guide
Memory Management in TIOVX

Introductory Concepts

A given OpenVX graph moves through the states shown in the diagram below. These four states are described below, and a minimal code sketch of the full lifecycle follows the diagram:

  • The "Create" phase involves the creation of all OpenVX objects. These objects may be the OpenVX graphs, nodes, or data objects, such as images, tensors, etc. OpenVX provides simple API's for creating each OpenVX object. For example, an OpenVX graph is created using the vxCreateGraph() API and an image data object is create using the vxCreateImage() API.
  • The "Verify" phase consists of a single API, vxVerifyGraph(). The verify graph API returns a status to the application, describing whether or not this graph is valid and can be processed. Many operations occur within this single API, which are vendor specific. This section will detail the operations occurring from a memory management point of view. For more information on other operations occurring during the verify phase, please see the User Target Kernels section.
  • The "Execute" phase consists of the actual scheduling and processing of the OpenVX graph. During this phase, the process callbacks of each of the nodes are called in a sorted sequence determined in the verify graph phase. The verify graph phase performs a topological sort of the OpenVX nodes in order to determine any data dependencies among the nodes in the graph. During the execute phase, each node is processed according to that order. Once each node completes, any nodes that were dependent on the output data from that node is processed. The below sections describe how this data is transferred across the cores of the SoC.
  • Similar to the "Create" phase, the "Destroy" phase involves the freeing of all OpenVX objects. Each corresponding object in OpenVX has an associated API for freeing the object. For example, a graph can be freed by calling the vxReleaseGraph() API while an image can be freed by calling the vxReleaseImage() API.
[Figure: OpenVX state machine (openvx_state_machine.png)]
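The following minimal sketch illustrates the four phases using standard OpenVX APIs. The image dimensions, format, and the Gaussian filter node are illustrative choices, not requirements; error handling is abbreviated.

    #include <VX/vx.h>

    int main(void)
    {
        /* Create: each create API returns an opaque handle */
        vx_context context = vxCreateContext();
        vx_graph   graph   = vxCreateGraph(context);
        vx_image   input   = vxCreateImage(context, 640u, 480u, VX_DF_IMAGE_U8);
        vx_image   output  = vxCreateImage(context, 640u, 480u, VX_DF_IMAGE_U8);
        vx_node    node    = vxGaussian3x3Node(graph, input, output);

        /* Verify: single point where the graph is validated and, in TIOVX,
         * the data buffers are allocated in shared memory */
        if (vxVerifyGraph(graph) == VX_SUCCESS)
        {
            /* Execute: nodes run in the order determined during verification */
            vxProcessGraph(graph);
        }

        /* Destroy: each release API frees the corresponding object */
        vxReleaseNode(&node);
        vxReleaseImage(&output);
        vxReleaseImage(&input);
        vxReleaseGraph(&graph);
        vxReleaseContext(&context);

        return 0;
    }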

TIOVX Create Phase

At init time of an application using OpenVX, the tivxInit API is called on each core that is being used within the OpenVX application. One of the purposes of the init API is to perform a static allocation, in a non-cached region of DDR, of the handles used for OpenVX objects. The framework can log how many of these statically allocated structures an application actually uses, in the event that these values need to be modified.

As mentioned previously, the "Create" phase calls the appropriate create API's for each of the OpenVX data objects within the application. With respect to memory management, the create API's do not allocate the memory needed for the data objects; the memory is allocated in the next phase, the verify phase. Instead, it simply returns an opaque handle to the OpenVX object. Therefore, the memory is not accessible directly within the application. These handles point to object descriptors referred to early that reside in a non-cached region of DDR as well as a set of cached attributes, including the OpenVX object data buffers.

This call sequence highlights the process from the application and framework perspectives when an object is created:

  • [Sequence diagram: Data Object Handle Acquisition Process (msc_data_object_allocation)]

As an example, the following illustrates this procedure when creating image and graph objects:

  • [Sequence diagram: Example: Image and Graph Handle Acquisition (msc_image_graph_allocation)]

In the event that the memory must be accessed within the application or initialized to a specific value, OpenVX provides standard APIs for performing a map or copy of the data object. For example, the image object has the map and unmap APIs vxMapImagePatch and vxUnmapImagePatch, as well as a copy API, vxCopyImagePatch. In the event that the memory must be accessed prior to the verify phase, the framework allocates the buffer(s) associated with the object within shared memory so that they can be accessed.
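As a sketch of this map/unmap pattern, the function below initializes the pixels of a U8 image before the graph is verified. The image dimensions are assumed to match those used at creation; the fill value is arbitrary.

    #include <VX/vx.h>

    /* Sketch: write every pixel of a 640x480 VX_DF_IMAGE_U8 image */
    static vx_status fill_image(vx_image image)
    {
        vx_rectangle_t rect = { 0u, 0u, 640u, 480u }; /* start_x, start_y, end_x, end_y */
        vx_imagepatch_addressing_t addr;
        vx_map_id map_id;
        void *ptr = NULL;
        vx_status status;

        status = vxMapImagePatch(image, &rect, 0u, &map_id, &addr, &ptr,
                                 VX_WRITE_ONLY, VX_MEMORY_TYPE_HOST, VX_NOGAP_X);
        if (status == VX_SUCCESS)
        {
            vx_uint32 x, y;
            for (y = 0u; y < addr.dim_y; y++)
            {
                vx_uint8 *row = (vx_uint8 *)ptr + (y * addr.stride_y);
                for (x = 0u; x < addr.dim_x; x++)
                {
                    row[x * addr.stride_x] = 128u; /* arbitrary fill value */
                }
            }
            /* Unmap commits the writes and performs any cache maintenance */
            status = vxUnmapImagePatch(image, map_id);
        }
        return status;
    }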

TIOVX Verify Phase

The allocation and mapping of the OpenVX data objects occurs during the verification phase. The memory for each of the data objects is allocated from a carved-out region of DDR shared memory by using the tivxMemBufferAlloc API internally within the framework. This allocation is done on the host CPU. By performing this allocation within the vxVerifyGraph call, the framework can report at a single point within the application whether the graph is valid and can be scheduled; an allocation failure causes vxVerifyGraph to return an error.

In addition to allocating memory for the OpenVX data objects, vxVerifyGraph also allocates the local memory needed for each kernel instance. In this context, local memory refers to memory regions accessible by the core the target kernel runs on. This local memory allocation occurs as the vxVerifyGraph call loops through all the nodes within the graph and calls the "create" callbacks of the given nodes. The design intention of the framework is that all memory allocation occurs during the create callbacks of each kernel, thereby avoiding memory allocation at run time inside the process callbacks. For more information about the callbacks for each kernel, see vxAddUserKernel and tivxAddTargetKernel. For more information about the order in which these callbacks are called during vxVerifyGraph, see User Target Kernels.

Each kernel instance has its own context that allows the kernel instance to store context variables within a data structure, using the APIs tivxSetTargetKernelInstanceContext and tivxGetTargetKernelInstanceContext. Therefore, local memory can be allocated and stored within the kernel structure. The simple APIs provided to allocate and free kernel memory are tivxMemAlloc and tivxMemFree. These APIs allow memory to be allocated from the various memory regions enumerated in tivx_mem_heap_region_e. For instance, if multiple algorithms are running consecutively within a kernel, intermediate data can be allocated within the kernel context. tivx_mem_heap_region_e also distinguishes between persistent memory within DDR and non-persistent, scratch memory within DDR.
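The sketch below shows how a target kernel's create callback might allocate scratch memory and store it in a per-instance context. The context structure, sizes, and kernel name are hypothetical; only tivxMemAlloc, tivxMemFree, and tivxSetTargetKernelInstanceContext are TIOVX APIs.

    #include <TI/tivx.h>
    #include <TI/tivx_target_kernel.h>

    /* Hypothetical per-instance context holding a scratch buffer */
    typedef struct
    {
        uint8_t  *scratch;
        uint32_t  scratch_size;
    } my_kernel_context_t;

    static vx_status VX_CALLBACK my_kernel_create(
        tivx_target_kernel_instance kernel,
        tivx_obj_desc_t *obj_desc[], uint16_t num_params, void *priv_arg)
    {
        vx_status status = VX_SUCCESS;
        my_kernel_context_t *ctx;

        ctx = tivxMemAlloc(sizeof(my_kernel_context_t), TIVX_MEM_EXTERNAL);
        if (NULL == ctx)
        {
            status = VX_ERROR_NO_MEMORY;
        }
        else
        {
            /* Allocate intermediate memory once, at create time, so that
             * no allocation is needed inside the process callback */
            ctx->scratch_size = 64u * 1024u; /* illustrative size */
            ctx->scratch = tivxMemAlloc(ctx->scratch_size, TIVX_MEM_EXTERNAL);
            if (NULL == ctx->scratch)
            {
                tivxMemFree(ctx, sizeof(my_kernel_context_t), TIVX_MEM_EXTERNAL);
                status = VX_ERROR_NO_MEMORY;
            }
            else
            {
                /* Store the context so the process and delete callbacks
                 * can retrieve it via tivxGetTargetKernelInstanceContext */
                status = tivxSetTargetKernelInstanceContext(
                             kernel, ctx, sizeof(my_kernel_context_t));
            }
        }
        return status;
    }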

TIOVX Execute Phase

During the scheduling and execution of the OpenVX graphs, the OpenVX data buffers reside in external shared memory, and the pointers to each of these data buffers are passed along to subsequent nodes in the graph. Inside each node's process callback, the node may access the external shared memory via the pointers that were passed from the previous node. The tivxMemBufferMap and tivxMemBufferUnmap APIs encapsulate the mapping and cache maintenance operations necessary for mapping the shared memory to the target core. After a node completes, the framework handles the triggering of any nodes that depend on the current node's output data.
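The sketch below shows this pattern in a target kernel's process callback, assuming the first parameter is a single-plane image. The callback and descriptor handling follow the shape of typical TIOVX target kernels; note that the exact signature of tivxMemShared2TargetPtr has varied across TIOVX versions.

    #include <TI/tivx.h>
    #include <TI/tivx_target_kernel.h>

    static vx_status VX_CALLBACK my_kernel_process(
        tivx_target_kernel_instance kernel,
        tivx_obj_desc_t *obj_desc[], uint16_t num_params, void *priv_arg)
    {
        vx_status status = VX_SUCCESS;
        tivx_obj_desc_image_t *src_desc = (tivx_obj_desc_image_t *)obj_desc[0];
        void *src_ptr;

        /* Convert the shared-memory pointer to a target-local address */
        src_ptr = tivxMemShared2TargetPtr(&src_desc->mem_ptr[0]);

        /* Map performs the cache maintenance needed before the target reads */
        tivxMemBufferMap(src_ptr, src_desc->mem_size[0],
                         VX_MEMORY_TYPE_HOST, VX_READ_ONLY);

        /* ... operate on the pixel data referenced by src_ptr ... */

        /* Unmap performs any cache maintenance needed after the access */
        tivxMemBufferUnmap(src_ptr, src_desc->mem_size[0],
                           VX_MEMORY_TYPE_HOST, VX_READ_ONLY);

        return status;
    }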

TIOVX Destroy Phase

As mentioned previously, the "Delete" phase calls the appropriate delete API's for each of the OpenVX data objects within the application. At this point, the data buffer(s) of the OpenVX data objects are freed from shared memory using the tivxMemBufferFree API. This freeing of memory occurs within the individual object's release API's, such as the vxReleaseImage for the image object.

TIOVX Memory Optimizations

As mentioned above, the default behavior of OpenVX data buffer transfer is to write intermediate buffers to DDR prior to these buffers being read by subsequent nodes. Below are a few recommendations for optimizing memory transfers:

  • As mentioned in the "Create" section, the number of statically allocated structures in non-cached memory can be queried and modified. To do so, refer to the following:
    • The maximum values of statically allocated structures are defined in the files <TIOVX_PATH>/include/TI/tivx_config.h and <TIOVX_PATH>/include/TI/tivx_config_<SOC>.h
    • These values were sized according to the applications using OpenVX within the vision_apps repo. However, these values can be increased or decreased depending on the needs of a given application.
    • The following utility functions were developed to assist in optimizing these values (a usage sketch follows this list):
      • tivxPrintAllResourceStats prints the currently used value, maximum used value, and minimum required values
      • tivxQueryResourceStats provides information about the parameter values of a specific resource whose name is passed as a parameter to the function
      • tivxExportAllResourceMaxUsedValueToFile generates a new configuration file called "tivx_config_generated.h" at VX_TEST_DATA_PATH. This config file initializes each parameter to the maximum value used during the previous run.
    • All of the parameter maximum values are documented in TIOVX Configuration Parameters
    • All of these APIs are documented further in the Application Interface APIs section
  • In the event that multiple algorithms must be run consecutively on a single core, it is typically recommended to encapsulate them in a single OpenVX kernel. This allows the intermediate data to be written to local memory, thereby improving memory access time, rather than splitting the operations into two separate kernels and writing the intermediate data to DDR. Note: this limits the flexibility of deploying these algorithms separately, which must also be weighed in the decision to merge the two algorithms into a single kernel.
  • Another optimization technique is to use DMA to parallelize the memory fetch with the compute of a given target kernel. By using DMA to fetch tiled portions of a given input, a kernel can operate on the input and generate an output in a block-based manner. This can greatly improve the throughput of a given kernel.
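As referenced in the configuration bullet above, the sketch below shows how the resource-statistics utilities might be called from application code after a representative run. The resource name string is illustrative; the set of queryable names corresponds to the configuration parameters.

    #include <TI/tivx.h>

    /* Sketch: inspect and export static-resource usage after graphs have
     * executed, so the values in tivx_config.h can be tuned */
    static void dump_tiovx_resource_usage(void)
    {
        tivx_resource_stats_t stat;

        /* Print current, maximum used, and minimum required values
         * for all tracked resources */
        tivxPrintAllResourceStats();

        /* Query a single resource by name (name is illustrative) */
        if (VX_SUCCESS == tivxQueryResourceStats("TIVX_GRAPH_MAX_NODES", &stat))
        {
            /* stat now holds the usage values for this parameter */
        }

        /* Write tivx_config_generated.h (at VX_TEST_DATA_PATH) with each
         * parameter set to its maximum observed value */
        tivxExportAllResourceMaxUsedValueToFile();
    }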