2.2. C7000 Split Datapath and Functional Units

Figure 2.1 shows how the C7100 DSP CPU is split into an A-side datapath and a B-side datapath. The diagram shows the functional units and the multiple, heterogeneous register files on each side. The A-side datapath is responsible for scalar computation, loading and storing scalars and vectors to and from memory, and control flow (branches and calls). The B-side datapath handles vector math operations, permutations of data, and vector predication operations.

Figure 2.1 C7000 Datapath Block Diagram

To keep Figure 2.1 simple, some data movement capabilities and data paths are not shown. Note the following:

  • In general, a functional unit can write to any register file on the same datapath.

  • Most functional units can obtain data from one or both of the streaming engines.

  • There is one 64-bit cross path per datapath (A and B). Each cross path allows one read per cycle from the global register file on the opposite side.

C7100 and C7120 cores have a 512-bit vector width. C7504 and C7524 cores have a 256-bit vector width. A register holds either 64 bits (a "scalar" register) or a "vector-width" number of bits (a vector register). Thus, vector registers are 512 bits wide on C7100 and C7120 cores and 256 bits wide on C7504 and C7524 cores.
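The practical consequence of the vector width is the number of elements (lanes) that fit in one vector register. The following is a minimal host-side sketch of that arithmetic in plain C. The helper name `lanes_per_vector` and the enum constants are hypothetical and are not part of any TI header; only the 512-bit and 256-bit widths come from the text above.

```c
#include <assert.h>

/* Vector widths in bits, as stated in the text above. */
enum {
    C7100_VECTOR_BITS = 512, /* also C7120 */
    C7504_VECTOR_BITS = 256  /* also C7524 */
};

/* Hypothetical helper (not a TI API): number of elements of
 * element_bits each that fit in one vector register of vector_bits. */
unsigned lanes_per_vector(unsigned vector_bits, unsigned element_bits)
{
    /* The element size must be nonzero and divide the vector width. */
    assert(element_bits > 0 && vector_bits % element_bits == 0);
    return vector_bits / element_bits;
}
```

For example, `lanes_per_vector(C7100_VECTOR_BITS, 32)` gives 16 lanes of 32-bit data, while `lanes_per_vector(C7504_VECTOR_BITS, 32)` gives 8.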

Each datapath has several different kinds of register files. Each functional unit can write to the global register file on its datapath and to most of the local register files on that datapath. However, only some functional units can read from a given local register file. The functional units are as follows:

  • D1 and D2 units: These reside on the A-side datapath and can load from and store to memory. Two 64-bit loads can execute in parallel. Two 64-bit stores can execute in parallel. A 64-bit or vector-width load can execute in parallel with a 64-bit or vector-width store. Two vector-width stores cannot execute in parallel, nor can two vector-width loads.

  • L1, S1, M1, and N1 units: These are general-purpose functional units, handling a varied mix of scalar and small vector computation. The M1 and N1 functional units perform various multiplication instructions.

  • L2, S2, M2, and N2 units: These are also general-purpose functional units, and can operate on full-width vector data. The M2 and N2 functional units perform various multiplication instructions.

  • B unit: This unit handles indirect branches and calls.

  • C unit: This unit performs permutations and shuffles of data.

  • P unit: This unit computes predicates used to mask off vector lanes so particular lanes are not computed or are not stored to memory.
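The role of the predicates computed by the P unit can be illustrated with a scalar model of a masked vector store. This is only a sketch in portable C, not C7000 code: the type `lane_mask`, the functions `tail_mask`, `masked_add_store`, and `vector_add`, and the choice of 16 lanes (a 512-bit vector of 32-bit elements) are all assumptions made for illustration. The per-lane loop stands in for what would be a single predicated vector operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16 /* assumed: 512-bit vector of 32-bit elements */

/* Hypothetical predicate model: bit i set means lane i is active. */
typedef uint64_t lane_mask;

/* Enable the first `remaining` lanes, capped at a full vector. */
lane_mask tail_mask(size_t remaining)
{
    if (remaining >= LANES)
        return ((lane_mask)1 << LANES) - 1;
    return ((lane_mask)1 << remaining) - 1;
}

/* Add one vector's worth of elements. Inactive lanes are neither
 * computed nor stored, so memory past the active lanes is untouched. */
void masked_add_store(int32_t *dst, const int32_t *a, const int32_t *b,
                      lane_mask mask)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t lane = 0; lane < LANES; ++lane) {
        if (mask & ((lane_mask)1 << lane))
            dst[lane] = a[lane] + b[lane];
    }
}

/* Process n elements LANES at a time. The final, partial vector is
 * handled by the predicate instead of a separate scalar cleanup loop. */
void vector_add(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i += LANES)
        masked_add_store(dst + i, a + i, b + i, tail_mask(n - i));
}
```

With `n = 20`, the first iteration uses all 16 lanes and the second enables only 4, so nothing beyond `dst[19]` is written.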

In addition to the D1 and D2 units, which give the CPU access to the memory hierarchy, the C7100 DSP has two streaming engines that provide a fast path for obtaining data from memory. A streaming engine is a hardware feature that allows you (or the compiler) to specify a pattern of memory addresses to fetch. The streaming engine does its best to pre-fetch that data from the memory hierarchy into a scratchpad memory close to the CPU, minimizing CPU stalls due to cold cache misses.
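To make "a pattern of memory addresses" concrete, the following host-side sketch models a simple two-level pattern: several runs of consecutive elements, with a fixed stride between runs. It is not the streaming engine programming interface. The structures `stream_pattern` and `stream_state` and the functions `stream_open` and `stream_next` are hypothetical names chosen for illustration, and the model makes no claim about which patterns the hardware supports.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pattern description: `rows` runs of `cols` consecutive
 * elements, with the start of each run `row_stride` elements apart. */
struct stream_pattern {
    size_t cols;
    size_t rows;
    size_t row_stride;
};

/* Position within the pattern. */
struct stream_state {
    struct stream_pattern pattern;
    size_t col;
    size_t row;
};

/* Start walking the pattern from its first element. */
void stream_open(struct stream_state *s, struct stream_pattern pattern)
{
    assert(s != NULL);
    s->pattern = pattern;
    s->col = 0;
    s->row = 0;
}

/* Write the next element offset to *offset and return 1,
 * or return 0 once the pattern is exhausted. */
int stream_next(struct stream_state *s, size_t *offset)
{
    assert(s != NULL && offset != NULL);
    if (s->pattern.cols == 0 || s->row >= s->pattern.rows)
        return 0;
    *offset = s->row * s->pattern.row_stride + s->col;
    if (++s->col == s->pattern.cols) {
        s->col = 0;
        ++s->row;
    }
    return 1;
}
```

A pattern with `cols = 3`, `rows = 2`, and `row_stride = 8` yields the offsets 0, 1, 2, 8, 9, 10. Because the whole sequence is known once the pattern is described, hardware working from such a description can fetch ahead of the consumer, which is the idea behind the pre-fetching described above.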