6. Comparison to C28x+FPU
Note
This discussion is limited to the C28x + 32-bit FPU and does not include TMU or FPU64.
For this discussion, let us define the following instruction sets:
- C28x Instruction Set
  - This is the original fixed-point instruction set.
- C28x+FPU Instruction Set
  - This is the C28x Instruction Set plus additional instructions to support native single-precision (32-bit) floating-point operations. Most of the additional instructions support single-precision floating-point math, but a few other useful instructions, such as RPTB (repeat block), are also included. Because they are part of the superset and are only available on devices with the FPU, we still refer to them as FPU instructions.
- CLA Instruction Set
  - The CLA instruction set is largely a subset of the FPU instructions. A few FPU instructions, such as the repeat block (RPTB), are not supported on the CLA. The CLA also has a few instructions that the FPU does not have, for example some native integer math instructions and a native branch/call/return.
6.1. Are benchmarks for FPU div, sin, cos, etc. the same?
The CLA instructions are largely a subset of the C28x+FPU instructions, and the math instructions have equivalents on both, but there are still differences that affect benchmarks. For example:
- Cycle differences in multiply and conversion (see next question)
- Differences in branch and call instructions
- Resource differences (for example, 8 floating-point registers on the FPU vs. 4 on the CLA)
- Addressing modes
- CLA lacks RPTB (repeat block)
6.2. Is the CLA floating-point multiply faster?
Consider the following:
C28x 32-bit FPU:
Multiplies and conversions take 2p cycles (2 pipelined cycles). They take two cycles to complete, but another instruction, including another math instruction, can be placed in the delay slot.
CLA:
The math instructions and conversions take 1 cycle, with no delay slot needed. So if you cannot fill the delay slot on the FPU with meaningful work, the CLA is faster in terms of raw cycle count.
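The cycle-counting argument above can be sketched with a small model in C. This is only an illustrative simplification, not a benchmark: the function names `fpu_cycles` and `cla_cycles` are hypothetical, and the model assumes each FPU multiply or conversion has exactly one delay slot that is either filled with useful work or wasted. It ignores memory stalls, branches, and every other pipeline effect.

```c
#include <assert.h>

/*
 * Hypothetical cycle model, for illustration only.
 *
 * FPU: each 2p operation issues in 1 cycle and has 1 delay slot.
 * A filled slot holds an instruction that had to run anyway, so it
 * adds nothing to this count. An unfilled slot costs 1 wasted cycle.
 */
unsigned fpu_cycles(unsigned n_ops, unsigned filled_slots)
{
    /* There is one delay slot per operation, so at most n_ops can be filled. */
    assert(filled_slots <= n_ops);
    return n_ops + (n_ops - filled_slots);
}

/* CLA: each math or conversion operation completes in 1 cycle, no delay slot. */
unsigned cla_cycles(unsigned n_ops)
{
    return n_ops;
}
```

Under these assumptions, four multiplies with no filled delay slots cost `fpu_cycles(4, 0)` = 8 cycles on the FPU versus `cla_cycles(4)` = 4 on the CLA, while `fpu_cycles(4, 4)` = 4 shows the two matching when every delay slot does useful work.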
6.3. What are the main differences?
The following table summarizes some of the main differences between the C28x + FPU and the CLA CPU at this time. Refer to the CLA documentation listed in the other resources for the latest information.
Item | CLA | C28x+FPU |
---|---|---|
Execution | Independent, parallel execution with the C28x CPU | Part of the main C28x CPU |
Floating-Point Registers | 4 (MR0 - MR3) | 8 (R0H - R7H) |
Auxiliary Registers | 2, 16-bit (MAR0, MAR1) – Can access all of CLA data | 8, 32-bit (XAR0 - XAR7) – Shared with fixed-point instructions |
Pipeline | 8-stage pipeline – completely independent from C28x | 8-stage pipeline – CPU fetch and decode shared with the fixed-point instructions |
Single Step | Moves pipeline ahead 1 cycle | Completely flushes the pipeline |
Addressing Modes | 2: Direct and indirect with post increment. No data page pointer. | All C28x addressing modes |
Interrupt Sources | Device dependent. Interrupts come directly to the CLA. | All available interrupts through the PIE. |
Nesting Interrupts | No stack pointer. CLA Type 0 and Type 1: not supported. CLA Type 2: supports 1 background task | Nesting enabled through software |
Instruction Set | Independent subset of FPU instructions. Similar mnemonics to C28x+FPU but with a leading ‘M’, for example: MMPYF32 MR0, MR1, MR2. | Floating-point instructions are in addition to (a superset of) the C28x fixed-point instructions. |
Repeated Instructions | No single repeat or repeat block | Repeat MACF32 & repeat block (RPTB) |
Communication with C28x | Through shared RAM, message RAM, and interrupts. C28x can read CLA registers. | One CPU. Can copy between fixed and float registers |
Math and Conversion | Single cycle | 2p cycles (2 pipelined cycles) |
Integer Operations | Limited support. Native instructions for AND, OR, XOR, ADD, SUB, shifts, etc. | Uses fixed-point instructions |
Flow Control | Native conditional delayed branch/call/return. The 3 instructions before/after a branch are always executed; performance can be improved by using the delay cycles. | Uses fixed-point flow control. Branches are not delayed: instructions after the branch are only executed if the branch is not taken. Requires a copy of the float flags to fixed-point ST0. |
Memory Access | CLA program, data and message RAMs only. Refer to the memory map in the data manual. | All memory on the device |
Register Access | Refer to the specific datasheet or TRM. | All peripherals on the device or specific C28x sub-system |
Programming | CLA Assembly or CLA C Compiler. Requires C28x codegen 6.1.0 or later. | C or C++ or Assembly |
Operating Frequency | Refer to the device datasheet | Refer to the device datasheet |