6. Comparison to C28x+FPU

Note

This discussion is limited to the C28x + 32-bit FPU and does not include TMU or FPU64.

For this discussion, let us define the following instruction sets:

C28x Instruction Set

This is the original fixed-point instruction set.

C28x+FPU Instruction Set

This is the C28x Instruction Set plus additional instructions that support native single-precision (32-bit) floating-point operations. Most of the additional instructions support single-precision floating-point math, but a few other useful instructions, such as RPTB (repeat block), are also included. Because these are part of the superset and are available only on devices with the FPU, we still refer to them as FPU instructions.

CLA Instruction Set

The CLA instruction set is largely a subset of the FPU instructions. A few FPU instructions, such as repeat block (RPTB), are not supported on the CLA. The CLA also has a few instructions that the FPU does not have, for example some native integer math instructions and native branch/call/return instructions.

6.1. Are benchmarks for FPU div, sin, cos, etc. the same?

The CLA instructions are a subset of the C28x+FPU instructions, and the math instructions have equivalents on each, but there are still differences that affect benchmarks. For example:

Cycle differences in multiply and conversion (see next question)
Differences in branch and call instructions
Resource differences (for example, 8 floating-point registers on the C28x+FPU vs. 4 on the CLA)
Addressing modes
CLA lacks RPTB (repeat block)

6.2. Is the CLA floating-point multiply faster?

Consider the following:

C28x 32-bit FPU:

Multiply and conversion instructions take 2p cycles (2 pipelined cycles). They take two cycles to complete, but another instruction, including another math instruction, can be placed in the delay slot.

CLA:

Math and conversion instructions take 1 cycle, so no delay slot is needed. If you cannot use the delay slot on the FPU to do meaningful work, then the CLA is faster when you are only counting cycles.
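The trade-off can be put into numbers with a small C sketch. It uses a deliberately simplified model that is an assumption for illustration, not a figure from the device documentation: each FPU multiply or conversion costs one issue cycle plus one wasted cycle whenever its delay slot holds only a NOP, and each CLA operation costs one cycle. The function names `fpu_cycles` and `cla_cycles` are hypothetical, and all other pipeline effects are ignored.

```c
#include <assert.h>

/*
 * Simplified model (assumption): cycles charged to n_ops FPU multiply or
 * conversion instructions (2p). Each instruction has one delay slot.
 * A slot filled with useful work costs nothing extra; a slot holding
 * only a NOP costs one wasted cycle.
 */
unsigned fpu_cycles(unsigned n_ops, unsigned filled_slots)
{
    assert(filled_slots <= n_ops);  /* at most one delay slot per instruction */
    return n_ops + (n_ops - filled_slots);
}

/*
 * Simplified model (assumption): the same operations on the CLA take
 * one cycle each, with no delay slot to fill.
 */
unsigned cla_cycles(unsigned n_ops)
{
    return n_ops;
}
```

Under this model, four multiplies with no useful work in the delay slots cost `fpu_cycles(4, 0)` = 8 cycles on the FPU against `cla_cycles(4)` = 4 on the CLA, while filling every delay slot gives `fpu_cycles(4, 4)` = 4, the same as the CLA.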

6.3. What are the main differences?

The following table summarizes some of the main differences between the C28x+FPU and the CLA at this time. Refer to the CLA documentation listed in the other resources for the latest information.

Table 6.1 CLA and C28x+FPU Comparison
Item	CLA	C28x+FPU
Execution	Independent, parallel execution with the C28x CPU	Part of the main C28x CPU
Floating-Point Registers	4 (MR0 - MR3)	8 (R0H - R7H)
Auxiliary Registers	2, 16-bit (MAR0, MAR1) – Can access all of CLA data	8, 32-bit (XAR0 - XAR7) – Shared with fixed-point instructions
Pipeline	8-stage pipeline – completely independent from C28x	8-stage pipeline – CPU fetch and decode shared with the fixed-point instructions
Single Step	Moves pipeline ahead 1 cycle	Completely flushes the pipeline
Addressing Modes	2: Direct and indirect with post increment. No data page pointer.	All C28x addressing modes
Interrupt Sources	Device dependent. Interrupts come directly to the CLA.	All available interrupts through the PIE.
Nesting Interrupts	No stack pointer. CLA Type 0 and Type 1: not supported. CLA Type 2: supports 1 background task.	Nesting enabled through software
Instruction Set	Independent subset of FPU instructions. Similar mnemonics to C28x+FPU but with a leading 'M', for example MMPYF32 MR0, MR1, MR2.	Floating-point instructions are in addition to (a superset of) the C28x fixed-point instructions.
Repeated Instructions	No single repeat or repeat block	Repeat MACF32 & repeat block (RPTB)
Communication with C28x	Through shared RAM, message RAM, and interrupts. C28x can read CLA registers.	One CPU. Can copy between fixed and float registers
Math and Conversion	Single cycle	2p cycles (2 pipelined cycles)
Integer Operations	Limited support. Native instructions for AND, OR, XOR, ADD, SUB, shifts, etc.	Uses fixed-point instructions
Flow Control	Native conditional delayed branch/call/return. The 3 instructions before and after the branch are always executed; performance can be improved by using the delay cycles.	Uses fixed-point flow control. Branches are not delayed – instructions after the branch are executed only if the branch is not taken. Requires a copy of the float flags to fixed-point ST0.
Memory Access	CLA program, data and message RAMs only. Refer to the memory map in the data manual.	All memory on the device
Register Access	Refer to the specific datasheet or TRM.	All peripherals on the device or specific C28x sub-system
Programming	CLA Assembly or CLA C Compiler. Requires C28x codegen 6.1.0 or later.	C or C++ or Assembly
Operating Frequency	Refer to the device datasheet	Refer to the device datasheet