C6000 Embedded Design Workshop

Student Guide

Day 3
10. Dynamic Memory
11. C6000 Introduction
12. C6000 Architecture
13. C6000 Optimizations

Day 4
14. C6000 Cache
15. Using EDMA3

GrabBag
Using DSP/BIOS
Flash Boot
Stream I/O & PSP
C66x Intro

Technical Training
Notice

Creation of derivative works unless agreed to in writing by the copyright owner is forbidden. No portion of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission from the copyright holder.

Texas Instruments reserves the right to update this Guide to reflect the most current product information for the spectrum of users. If there are any differences between this Guide and a technical reference manual, references should always be made to the most current reference manual. Information contained in this publication is believed to be accurate and reliable. However, responsibility is assumed neither for its use nor any infringement of patents or rights of others that may result from its use. No license is granted by implication or otherwise under any patent or patent right of Texas Instruments or others.

Copyright ©2013 by Texas Instruments Incorporated. All rights reserved.

Technical Training Organization
Semiconductor Group
Texas Instruments Incorporated
7839 Churchill Way, MS 3984
Dallas, TX  75251-1903

Revision History

Rev 1.00 – Oct 2013 – Re-formatted labs/ppts to fit alongside new TI-RTOS Kernel workshop
Rev 1.10 – Oct 2013 – Added chapter 10 (Dyn Memory) as first optional chapter
Rev 1.20 – Nov 2013 – upgraded all labs to use UIA/SA
Using Dynamic Memory

Introduction

In this chapter, you will learn how to pass data between threads and how to protect resources during critical sections of code – including using Events, MUTEXs, BIOS “containers” such as Mailboxes and Queues and other methods of helping threads (mainly Tasks) communicate with each other.

Objectives

- Compare/contrast static and dynamic systems
- Define heaps and describe how to configure the different types of heaps (std, HeapBuf, etc.)
- Describe how to eliminate the drawbacks of using std heaps (fragments, non-determinism)
- Implement dynamic object creation
- Lab – Using the previous Task/Sem lab, create our Semaphores and Tasks dynamically
Module Topics

Using Dynamic Memory

Static vs. Dynamic

Dynamic Memory Concepts
  Using Dynamic Memory
  Creating A Heap

Different Types of Heaps
  HeapMem
  HeapBuf
  HeapMultiBuf
  Default System Heap

Dynamic Module Creation

Custom Section Placement

Lab 10:  Using Dynamic Memory

Lab 10 – Procedure – Using Dynamic Task/Sem
  Import Project
  Check Dynamic Memory Settings
  Inspect New Code in main()
  Delete the Semaphore and Add It Dynamically
  Build, Load, Run, Verify
  Delete Task and Add It Dynamically

Additional Information

Notes

More Notes
Static vs. Dynamic

Static vs Dynamic Systems

- **Static Memory**
  - **Link Time:**
    - Allocate Buffers
  - **Execute:**
    - Read data
    - Process data
    - Write data
  - Allocated at **LINK** time
  - + Easy to manage (less thought/planning)
  - + Smaller code size, faster startup
  - + Deterministic, atomic (interrupts won’t mess it up)
  - + Fixed allocation of memory resources
  - + Optimal when most resources needed concurrently

- **Dynamic Memory (HEAP)**
  - **Create:**
    - Allocate Buffers
  - **Execute:**
    - R/W & Process
  - **Delete:**
    - FREE Buffers
  - Allocated at **RUN** time
  - + Limited resources are SHARED
  - + Objects (buffers) can be freed back to the heap
  - + Smaller RAM budget due to re-use
  - - Larger code size, more difficult to manage
  - - NOT deterministic, NOT atomic
  - + Optimal when multiple threads share the same resource or memory needs are not known until runtime

BIOS ➔ Runtime Cfg – Dynamic Memory

- **Memory Policies** – Dynamic or Static?
  - Dynamic is the default policy (recommended)
  - Static policy can save some code/data memory
  - Select via .CFG GUI:

- **MAU** – Minimum Addressable Unit
  - Memory allocation sizes are measured in MAUs
  - 8 bits: C6000, MSP430, ARM
  - 16 bits: C28x

Note: ~5K bytes savings on a C6000 choosing “static only” vs. “dynamic”
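Because allocation sizes are counted in MAUs rather than 8-bit bytes, the same buffer needs a different size argument on different targets. The sketch below shows the arithmetic only. The helper names `maus_for_bits()` and `maus_for_bits_here()` are made up for illustration; the only standard piece is `CHAR_BIT`, which reports how many bits the compiler's target packs into one `char` (one MAU).

```c
#include <assert.h>
#include <limits.h>   /* CHAR_BIT: bits per char, i.e. per MAU on this target */
#include <stddef.h>

/* Hypothetical helper: number of MAUs needed to hold `bits` bits when one
 * MAU is `mau_bits` wide (8 on C6000/MSP430/ARM, 16 on C28x). */
static size_t maus_for_bits(size_t bits, size_t mau_bits)
{
    assert(mau_bits > 0);
    return (bits + mau_bits - 1) / mau_bits;   /* round up to whole MAUs */
}

/* Same question, answered for the target this file is compiled for. */
static size_t maus_for_bits_here(size_t bits)
{
    return maus_for_bits(bits, (size_t)CHAR_BIT);
}
```

For example, a 512-bit buffer is 64 MAUs with 8-bit MAUs but only 32 MAUs with 16-bit MAUs, so a size that is right on one target can over-allocate on another.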
Dynamic Memory Concepts

Using Dynamic Memory

Dynamic Memory Usage (Heap)

Using Memory Efficiently
- Common memory reuse within C language
- A Heap (i.e. system memory) allocates, then frees chunks of memory from a common system block

Code Example...

```
#define SIZE 32
char x[SIZE]; /*allocate*/
char a[SIZE];
x={…}; /*initialize*/
a={…};
filter(…); /*execute*/
```

“Normal” (static) C Coding

“Dynamic” C Coding

- Create
  
  ```
  x=malloc(SIZE); // MAUs
  a=malloc(SIZE); // MAUs
  ```

- Execute
  
  ```
  filter(…);
  ```

- Delete
  
  ```
  free(x);
  free(a);
  ```

- High-performance DSP users have traditionally used static embedded systems

- As DSPs and compilers have improved, the benefits of dynamic systems often allow enhanced flexibility (more threads) at lower costs
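The create / execute / delete pattern above can be written out as a complete function. This is a minimal sketch in standard C, not code from the workshop labs: the `filter()` body is a stand-in that simply copies its input, and `run_filter_once()` is a made-up wrapper name. The part worth noticing is the NULL check after `malloc()`, which the slide leaves out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 32

/* Hypothetical stand-in for the slide's filter(): copies input to output. */
static void filter(const char *in, char *out, size_t n)
{
    assert(in != NULL && out != NULL);
    memcpy(out, in, n);
}

/* Create, execute, delete. Returns 0 on success, -1 on any failure. */
static int run_filter_once(void)
{
    /* Create: sizes are in MAUs (bytes on C6000) */
    char *x = malloc(SIZE);
    char *a = malloc(SIZE);
    if (x == NULL || a == NULL) {
        free(x);                /* free(NULL) is a harmless no-op */
        free(a);
        return -1;              /* heap exhausted */
    }

    memset(x, 1, SIZE);         /* initialize the input buffer */
    filter(x, a, SIZE);         /* Execute */
    int ok = (a[0] == 1 && a[SIZE - 1] == 1);

    free(x);                    /* Delete: return both buffers to the heap */
    free(a);
    return ok ? 0 : -1;
}
```

Once the buffers are freed, the same memory is available to the next thread that needs it, which is where the smaller RAM budget of a dynamic system comes from.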
Dynamic Memory (Heap)

What if I need two heaps?
- Say, a big image array off-chip, and
- Fast scratch memory heap on-chip?

Multiple Heaps

- BIOS enables multiple heaps to be created
- Create and name heaps in .CFG file or via C code
- Use `Memory_alloc()` function to allocate memory and specify which heap
Creating A Heap

Creating A Heap (HeapMem)

1. Use HeapMem (Available Products)

2. Create HeapMem (myHeap): size, alignment, name

Static

OR...

Dynamic

Usage
buf1 = Memory_alloc(myHeap, 64, 0, &eb);
Different Types of Heaps

Heap Types

- Users can choose from 3 different types of Heaps:
  1. HeapMem
     - Allocate variable-size blocks
     - Default system heap type
  2. HeapBuf
     - Allocate fixed-size blocks
  3. HeapMultiBuf
     - Specify variable-size blocks, but internally, allocate from a variety of fixed-size blocks

HeapMem

- Most flexible – allows allocation of variable-sized blocks (like malloc())
- Ideal when size of memory is not known until runtime
- Creation: .CFG (static) or C code (dynamic)
- Like malloc(), there are drawbacks:
  - NOT Deterministic – Memory Manager traverses linked list to find blocks
  - Fragmentation – After frequent allocate/free, fragments occur

Is there a heap type without these drawbacks?
HeapBuf

- Allows allocation of *fixed-size* blocks (no fragmentation)
- *Deterministic, no reentrancy problems*
- Ideal when using a varying number of fixed-size blocks (e.g. 4-6 buffers of 64 bytes each)
- Creation: .CFG (static) or C code (dynamic)
- For blockSize=64: Ask for 16, get 64. Ask for 66, get NULL
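The "ask for 16, get 64; ask for 66, get NULL" rule is easy to see in a model. The sketch below is a host-side illustration in plain C, not the HeapBuf source: `pool_alloc()` and `pool_free()` are made-up names, and the block size and count are assumed values.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 64    /* size of every block, in MAUs (assumed) */
#define NUM_BLOCKS 8     /* number of blocks in the pool (assumed) */

static unsigned char pool[NUM_BLOCKS][BLOCK_SIZE];
static unsigned char in_use[NUM_BLOCKS];

/* Any request up to BLOCK_SIZE consumes one whole block; larger ones fail. */
static void *pool_alloc(size_t size)
{
    if (size > BLOCK_SIZE) return NULL;          /* ask for 66, get NULL */
    for (size_t i = 0; i < NUM_BLOCKS; i++) {    /* bounded by NUM_BLOCKS */
        if (!in_use[i]) {
            in_use[i] = 1;
            return pool[i];                      /* ask for 16, get 64 */
        }
    }
    return NULL;                                 /* every block is taken */
}

/* Return a block obtained from pool_alloc() to the pool. */
static void pool_free(void *p)
{
    size_t i = (size_t)((unsigned char (*)[BLOCK_SIZE])p - pool);
    assert(i < NUM_BLOCKS && in_use[i]);
    in_use[i] = 0;
}
```

Because every block is the same size, a freed block always fits the next request, so the pool cannot fragment. The cost is wasted space: a 16-MAU request still ties up a full 64-MAU block.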

How do you create a HeapBuf?

Creating A *HeapBuf*

1. **Use HeapBuf** *(Available Products)*

2. **Create HeapBuf** *(myBuf)*: blk size, # of blocks, name

   **Static**

   OR...

   **Dynamic**

   ```c
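   HeapBuf_Params_init(&prms);   // load defaults before overriding
   prms.blockSize = 64;          // size of each block (MAUs)
   prms.numBlocks = 8;           // number of blocks
   prms.bufSize = 512;           // blockSize * numBlocks; prms.buf must point to this storage
   myHeapBuf = HeapBuf_create(&prms, &eb);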
   ```

   **Usage**

   ```c
   buf1 = Memory_alloc(myHeapBuf, 64, 0, &eb);
   ```

   What if I need multiple sizes (16, 32, 128)?
## Multiple HeapBufs

<table>
<thead>
<tr>
<th>Heap</th>
<th>Block size</th>
<th>Number of blocks</th>
</tr>
</thead>
<tbody>
<tr>
<td>heapBuf1</td>
<td>16</td>
<td>8</td>
</tr>
<tr>
<td>heapBuf2</td>
<td>32</td>
<td>16</td>
</tr>
<tr>
<td>heapBuf3</td>
<td>128</td>
<td>1</td>
</tr>
</tbody>
</table>

- Given this configuration, what happens when we allocate the 9th 16-byte location from heapBuf1?
- What “mechanism” would you want to exist to avoid the NULL return pointer?

### HeapMultiBuf

<table>
<thead>
<tr>
<th>Block size</th>
<th>Number of blocks</th>
</tr>
</thead>
<tbody>
<tr>
<td>16</td>
<td>8</td>
</tr>
<tr>
<td>32</td>
<td>16</td>
</tr>
<tr>
<td>128</td>
<td>1</td>
</tr>
</tbody>
</table>

- Allows variable-size allocation from a variety of fixed-size blocks
- Services requests for ANY memory size, but always returns the most efficient-sized available block
- Can be configured to “block borrow” from the “next size up”
- Creation: .CFG (static) or C code (dynamic)
- Ask for 17, get 32. Ask for 36, get 128.
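The size-matching and "block borrow" rules can be sketched as a lookup over size classes. This is an illustrative model only, not the HeapMultiBuf implementation: the `SizeClass` type and `multi_alloc()` are made-up names, and the model tracks block counts rather than real memory.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t block_size;    /* size of each block in this class, in MAUs */
    size_t free_blocks;   /* blocks still available in this class */
} SizeClass;

/* Classes must be sorted by ascending block_size. Returns the size of the
 * block granted, or 0 if nothing fits. With borrow != 0, an exhausted class
 * falls through to the next size up ("block borrowing"). */
static size_t multi_alloc(SizeClass *cls, size_t n, size_t size, int borrow)
{
    assert(cls != NULL);
    for (size_t i = 0; i < n; i++) {
        if (size > cls[i].block_size) continue;   /* too small, try next class */
        if (cls[i].free_blocks > 0) {
            cls[i].free_blocks--;
            return cls[i].block_size;             /* smallest block that fits */
        }
        if (!borrow) return 0;                    /* best-fit class is empty */
    }
    return 0;
}
```

With classes of 16, 32 and 128, a request for 17 is served from the 32 class and a request for 36 from the 128 class. Once all eight 16-MAU blocks are taken, the ninth request for 16 returns nothing unless borrowing is enabled, in which case it is handed a 32-MAU block instead.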
Default System Heap

- BIOS automatically creates a default system heap of type **HeapMem**
- How do you configure the default heap?
- In the .CFG GUI, of course:

[Image of BIOS and .CFG GUI configuration]

- How to USE this heap?

```c
buf1 = Memory_alloc(NULL, 128, 0, &eb);   // heap = NULL: uses default heap; 0 = align
myAlgo(buf1);
Memory_free(NULL, buf1, 128);
```
Dynamic Module Creation

Dynamically Creating SYS/BIOS Objects

- **Module_create**
  - Allocates memory for object out of heap
  - Returns a Module_Handle to the created object
- **Module_delete**
  - Frees the object’s memory
- **Example: Semaphore creation/deletion:**

```c
#define COUNT 0
Semaphore_Handle hMySem;
hMySem = Semaphore_create(COUNT, NULL, &eb);   // Create
Semaphore_post(hMySem);                        // eXecute
Semaphore_delete(&hMySem);                     // Delete
```

Note: always check return value of _create APIs !

Example – Dynamic Task API

```c
Task_Handle hMyTsk;
Task_Params taskParams;

Task_Params_init(&taskParams);
taskParams.priority = 3;

hMyTsk = Task_create(myCode, &taskParams, &eb);   // Create
  // "MyTsk" now active w/priority = 3 ...        // eXecute
Task_delete(&hMyTsk);                             // Delete
```

**taskParams includes:** heap location, priority, stack ptr/size, environment ptr, name
What is Error Block?

**Usage**

```c
buf1 = Memory_alloc (myBuf, 64, 0, &eb);
```

**Setup Code**

```c
Error_Block eb;
Error_init (&eb);
```

- Most SYS/BIOS APIs that expect an **error block** also return a handle to the created object or allocated memory.
- If NULL is passed instead of an initialized Error_Block and an error occurs, the application aborts and the error can be output using `System_printf()`.
- This may be the best behavior in systems where an error is fatal and you do not want to do any error checking.
- The main advantage of passing and testing Error_Block is that your program controls when it aborts.
- **Typically, systems pass Error_Block and check resource pointer to see if it is NULL, then make a decision...**

Can check Error_Block using: `Error_check()`
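The two behaviors described above (abort when no error block is passed, report and continue when one is) can be modeled in a few lines of standard C. Everything here is a stand-in invented for illustration: `ErrBlock`, `err_init()`, `err_check()` and `checked_alloc()` are NOT the SYS/BIOS types or APIs, and the `limit` parameter only simulates a heap of a given size.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for an error block; NOT the SYS/BIOS Error_Block. */
typedef struct { int code; } ErrBlock;          /* 0 = no error recorded */

static void err_init(ErrBlock *eb)        { assert(eb != NULL); eb->code = 0; }
static int  err_check(const ErrBlock *eb) { return eb->code != 0; }

/* Allocates `size` MAUs from a simulated heap of `limit` MAUs.
 * On failure: records the error if the caller passed an error block,
 * otherwise aborts, because the caller opted out of error checking. */
static void *checked_alloc(size_t size, size_t limit, ErrBlock *eb)
{
    assert(size > 0);
    void *p = (size <= limit) ? malloc(size) : NULL;
    if (p == NULL) {
        if (eb == NULL) {
            fprintf(stderr, "allocation of %lu failed\n", (unsigned long)size);
            abort();                            /* fatal: nobody is checking */
        }
        eb->code = 1;                           /* caller decides what to do */
    }
    return p;
}
```

A caller that passes an initialized error block gets NULL back on failure and can test `err_check()` before deciding how to recover. A caller that passes NULL never sees the failure, because the program stops at the point of the error.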
Custom Section Placement

Custom Placement of Data and Code

- Problem #1: You have a function or buffer that you want to place at a specific address in the memory map. How is this accomplished?

```plaintext
.myCode
myFxn

.myBuf
myBuffer

Mem1
Mem2
```

- Problem #2: You have two buffers and want one linked at Ram1 and the other at Ram2. How do you “split” the .bss (compiler’s default) section?

```plaintext
.bss
buf1
buf2

Ram1
buf1

Ram2
buf2
```

Making Custom Sections

- Create custom code & data sections using:

```plaintext
#pragma CODE_SECTION (myFxn, "myCode");
void myFxn(*ptr, *ptr2, ...){ }
#pragma DATA_SECTION (myBuffer, "myBuf");
int16_t myBuffer[32];
```

* `myFxn` & `myBuffer` are the names of the function and variable
* "myCode" & "myBuf" are the names of the custom sections

- Split default compiler section using SUB sections:

```plaintext
#pragma DATA_SECTION (buf1, ".bss:buf1");
int16_t buf1[8];
#pragma DATA_SECTION (buf2, ".bss:buf2");
int16_t buf2[8];
```

How do you LINK these custom sections?
Create your own linker.cmd file for custom sections

- CCS projects can have multiple linker CMD files
- May need to create custom MEMORY segments also (device-specific)
- ".bss:" used as protection against custom section not being linked (an unplaced subsection is linked with the default .bss section)
- –w warns if unexpected section encountered
Lab 10: Using Dynamic Memory

You might notice this system block diagram looks the same as what we used back in Lab 8 – that’s because it IS.

We’ll have the same objects and events, it’s just that we will create the objects dynamically instead of statically.

In this lab, you will delete the current STATIC configuration of the Task and Semaphore and create them dynamically. Then, if your LED blinks once again, you were successful.

Lab 10 – Creating Task/Sem Dynamically

main() {
    init_hw();
    Timer (500ms)
    BIOS_start();
}

main.c

Procedure

- Import archived (.zip) project (from Task lab)
- Delete Task/Sem objects (for ledToggle)
- Write code to create Task/Sem Dynamically
- Build, “Play”, Debug
- Use ROV/UIA to debug/analyze

ledToggle() {
    while(1) {
        Semaphore_pend(LedSem);
        Toggle_LED;
    }
}

ledToggleTask

Time: 30 min
Lab 10 – Procedure – Using Dynamic Task/Sem

In this lab, you will import the solution for the Task lab from before and modify it by DELETING the static declaration of the Task and Semaphore in the .cfg file and then add code to create them DYNAMICALLY in main().

Import Project

1. **Open CCS and make sure all existing projects are closed.**
   - Close any open projects (right-click Close Project) before moving on. With many main.c and app.cfg files floating around, it might be easy to get confused about WHICH file you are editing.
   - Also, make sure all file windows are closed.

2. **Import existing project from \Lab10.**
   Just like last time, the author has already created a project for you and it’s contained in an archived .zip file in your lab folder.
   - Import the following archive from your /Lab_10 folder:
     Lab_10_TARGET_STARTER_blink_Mem.zip
   - Click Finish.
   The project “blink_TARGET_MEM” should now be sitting in your Project Explorer. This is the SOLUTION of the earlier Task lab with a few modifications explained later.
   - Expand the project to make sure the contents look correct.

3. **Build, load and run the project to make sure it works properly.**
   We want to make sure the imported project runs fine before moving on. Because this is the solution from the previous lab, well, it should build and run.
   - Build – fix errors.
   - Then run it and make sure it works. If all is well, move on to the next step…
   If you’re having any difficulties, ask a neighbor for help…
Check Dynamic Memory Settings

4. Open BIOS ➔ Runtime and check settings.
   - Open app.cfg and click on BIOS ➔ Runtime.
   - Make sure the “Enable Dynamic Instance Creation” checkbox is checked (it should already be checked):

     ![Dynamic Instance Creation Support]

     - Enable Dynamic Instance Creation
     - A savings in code and data size can be achieved

   - Check the Runtime Memory Options and make sure the settings below are set properly for stack and heap sizes.

     ![Runtime Memory Options]

     - System (Hwi and Swi) stack size: 1024
     - Heap size: 256

   We need SOME heap to create the Semaphore and Task out of, so 256 is a decent number to start with. We will see if it is large enough as we go along.

   - Save app.cfg.

   The author also wants you to know that there is duplication of these numbers throughout the .cfg file which causes some confusion – especially for new users. First, BIOS ➔ Runtime is THE place to change the stack and heap sizes.

   Other areas of the app.cfg file are “followers” of these numbers – they reflect these settings. Sometimes they are displayed correctly in other “modules” and some show “zero”. No worries, just use the BIOS ➔ Runtime numbers and ignore all the rest.

   But, you need to see for yourself that these numbers actually show up in four places in the app.cfg file. Of course, BIOS ➔ Runtime is the first and ONLY place you should use.

   - However, click on the following modules and see where these numbers show up (don’t modify any numbers – just click and look):
     - Hwi
     - Memory
     - Program

   Yes, this can be confusing, but now you know. Just use BIOS ➔ Runtime and ignore the other locations for these settings.

   **Hint:** If you change the stack or heap sizes in any of these other windows, it may result in a BIOS CFG warning of some kind. So, the author will say this one more time – ONLY use BIOS Runtime to change stack and heap sizes.
Inspect New Code in main()

5. Open main.c and inspect the new code.

The author has already written some code for you in main(). Why? Well, instead of making you type the code and make spelling or syntax errors and deal with the build errors, it is just easier to provide commented code and have you uncomment it. Plus, when you create the Task dynamically, the casting of the Task function pointer is a bit odd.

► Open main.c and find main().

► Inspect the new code that creates the Semaphore and Task dynamically (DO NOT UNCOMMENT ANYTHING YET):

```c
void main(void)
{
    //------
    // [START] - DYNAMIC CREATION OF TASKS AND SEMAPHORES
    //------

    // Task_Params taskParams;
    // ??? = Semaphore_create(0, NULL, NULL); // create ledToggleSem Semaphore
    // Task_Params_init(&taskParams);
    // taskParams.priority = ???;
    // ??? = Task_create((Task_FuncPtr)ledToggle, &taskParams, NULL);

    //------
    // [END] - DYNAMIC CREATION OF TASKS AND SEMAPHORES
    //------
}
```

As you go through this lab, you will be uncommenting pieces of this code to create the Semaphore and Task dynamically and you’ll have to fill in the “???” with the proper names or values. Hey, we couldn’t do ALL the work for you. 😊

Also notice in the global variable declaration area that there are two handles for the Semaphore and Task also provided.

In order to use functions like Semaphore_create() and Task_create(), you will need to uncomment the necessary #include for the header files also.

Delete the Semaphore and Add It Dynamically


6. Delete the Semaphore in app.cfg.

   ► Remove ledToggleSem from the app.cfg file and save app.cfg.

7. Uncomment the two lines of code associated with creating ledToggleSem dynamically.

   ► In the global declaration area above main(), uncomment the line associated with the handle for the Semaphore and name the Semaphore ledToggleSem.

   ► In main(), uncomment the line of code for Semaphore_create() and use the same name for the Semaphore.

   ► In the #include section near the top of main.c, uncomment the #include for Semaphore.h.

   ► Save main.c.
Build, Load, Run, Verify

8. Build, load and run your code.
   ► Build the new code, load it and run it for 5 blinks.
   Is it working? If not, it is debug time. If it is working, you can move on…

9. Check heap in ROV.
   So, how much heap memory does a Semaphore take? Where do you find the heap sizes and how much was used? ROV, of course…
   ► Open ROV and click on HeapMem (the standard heap type), then click on Detailed:

   ![Heap Memory Table]

   So, in this example (C28x), the starting heap size was 0x100 (256) and 0xd0 is still free (208), so the Semaphore object took 48 16-bit locations on the C28x (assuming nothing else is on the heap). Ok. So, we didn’t run out of heap. Good thing.
   ► Write down how many bytes your Semaphore required here: _____________
   ► How much free size do you have left over? ____________

   So, when you create a Task, which has its own stack, if you create it with a stack larger than the free size left over, what might happen?

   ________________________________________________________

   Well, let’s go try it…
Delete Task and Add It Dynamically

10. Delete the Task in app.cfg.
   Remove the Task from the app.cfg file and save app.cfg.

11. Uncomment some lines of code and declarations.
   ► Uncomment the #include for Task.h.
   ► Uncomment the declaration of the Task_Handle.
   ► Uncomment the code in main() that creates the Task (ledToggleTask) and fill in the ??? properly.
   ► Create the Task at priority 2.
   ► Save main.c.

12. Build and run your code.
   ► Build and run your code for five blinks. No blink? Read further…
   ► Halt your code.

Your code is probably sitting at abort(). How would the author know that? Well, when you create a Task, it needs a stack. On the C6000, the default stack size is 2048 bytes. For C28x, it is 256.

You probably aborted with a message that looks similar to this:

```
abort() at exit.c:109 0xC3013820
```

What happened? Two things. First, your heap is not big enough to create a Task because the Task requires a stack that is larger than the entire heap.

Also, did you pass an error block in the Task_create() function? Probably not. So, what happens if you get a NULL pointer back and you do NOT pass an error block? BIOS aborts. Well, that’s what it looks like.

13. Open ROV to see the damage.
   ► Open ROV and click on Task. You should see something similar to this:

<table>
<thead>
<tr>
<th>address</th>
<th>label</th>
<th>priority</th>
<th>mode</th>
<th>addr</th>
<th>a.</th>
<th>a.</th>
<th>stackSize</th>
</tr>
</thead>
<tbody>
<tr>
<td>0x0000a180</td>
<td>ti.sysbios...</td>
<td>0</td>
<td>Running</td>
<td>ti.sysbios_knl_idle_loop</td>
<td>0..</td>
<td>0..</td>
<td>256</td>
</tr>
<tr>
<td>0x0000b1e4</td>
<td>ledToggle</td>
<td>2</td>
<td>Blocked</td>
<td>ledToggle</td>
<td>0..</td>
<td>0..</td>
<td>256</td>
</tr>
</tbody>
</table>

   ► Look at the size of “stackSize” for ledToggle (name may or may not show up). This screen capture was for C28x, so your size may be different (probably larger).
   ► What size did you set the heap to in BIOS Runtime? __________ bytes
   ► What is the size of the stack needed for ledToggle (shown in ROV)? __________ bytes

Get the picture? You need to increase the size of the heap…
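The comparison you just made by hand can be written as a one-line check. This is only a sketch of the arithmetic: `fits_in_heap()` is a made-up helper, not a BIOS call, and the Task object's own size is left as a parameter because it varies by target (ROV shows the real numbers for yours).

```c
#include <assert.h>
#include <stddef.h>

/* Returns nonzero if an object plus its stack fits in what is left of the
 * heap. All sizes are in MAUs. */
static int fits_in_heap(size_t heap_size, size_t heap_used,
                        size_t object_size, size_t stack_size)
{
    assert(heap_used <= heap_size);
    return object_size + stack_size <= heap_size - heap_used;
}
```

Using the C28x figures from the ROV example above, a 256-MAU heap with 48 already used has 208 free, so a 256-MAU Task stack does not fit. With the C6000 default stack of 2048 bytes and a 256-byte heap, the stack alone is larger than the whole heap, even before counting the Semaphore or the Task object.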
14. Go back and increase the size of the heap.
   Open BIOS → Runtime and use the following heap sizes:
   - C28x: 1024
   - C6000: 4096
   - MSP430: 1024
   - TM4C: 4096

   We probably don’t need THIS large of a heap for this application – it could be tuned better – we’re just using a larger number to see the application work.

   Save app.cfg.

15. Wait, what about Error Block?

   In a real application, the user has a choice whether to use Error Block or not. For debug purposes, maybe it is best to leave it off so that your program aborts when the handle to the requested resource is NULL. If you don’t like that, then use Error Block and check the return handle and deal with it however you choose – user preference.

   In our lab, we chose to ignore Error Block, but at least you know it is there, how to initialize one and how it works.

16. Rebuild and run again.

   Rebuild and run the new project with the larger heap. Run for 5 blinks – it should work fine now.

17. Terminate your debug session, close the project and close CCS.

You’re finished with this lab. Help a neighbor who is struggling – you know you KNOW IT when you can help someone else – and it’s being a good neighbor. But, if you want to be selfish and just leave the room because the workshop is OVER, no one will look at you funny !!
Additional Information

Placing a Specific Section into Memory

- **Via the Platform File (C6000 Only)** – hi-level, but works fine:

  - SYS/BIOS GUI now supports specific placements of sections (like .far, .bss, etc.) into specific memory segments (like IRAM, DDR, etc.):
More Notes…

*** the very end ***
Introduction

This is the first chapter that specifically addresses ONLY the C6000 architecture. All chapters from here on assume the student has already taken the 2-day TI-RTOS Kernel workshop.

During those past two days, some specific C6000 architecture items were skipped in favor of covering all TI EP processors with the same focus. Now, it is time to dive deeper into the C6000 specifics.

The first part of this chapter focuses on the C6000 family of devices. The 2nd part dives deeper into topics already discussed in the previous two days of the TI-RTOS Kernel workshop. In a way, this chapter is "catching up" all the C6000 users to understand this target environment specifically.

After this chapter, we plan to dive even deeper into specific parts of the architecture like optimizations, cache and EDMA.

Objectives

- Introduce the C6000 Core and the C6748 target device
- Highlight a few uncommon pieces of the architecture – e.g. the SCR and PRU
- “Catch up” on C6000-specific topics skipped in the TI-RTOS Kernel discussions, such as Interrupts, Platforms and Target Config Files
- Lab 11 – Create a custom platform and create an Hwi to respond to the audio interrupts
Module Topics

C6000 Introduction

Module Topics
TI EP Product Portfolio
DSP Core
Devices & Documentation
Peripherals
PRU
SCR / EDMA3
Pin Muxing
Example Device: C6748 DSP
Choosing a Device
C6000 Arch “Catchup”
C64x+ Interrupts
Event Combiner
Target Config Files
Creating Custom Platforms
Quiz
Quiz - Answers
Using Double Buffers
Lab 11: An Hwi-Based Audio System
Lab 11 – Procedure
Hack LogicPD’s BSL types.h
PART B (Optional) – Using the Profiler Clock
Additional Information
Notes
### TI EP Product Portfolio

[Image of the TI EP product portfolio, split into Microcontrollers (MCU) and Application (MPU) processors. Families and cores named on the slide: MSP430 (16-bit ultra low power & cost, ULP RISC MCU), C2000 (32-bit real-time, C28x MCU, ARM M3+C28), ARM Cortex-M3 and Cortex-M4F (32-bit all-around MCU), lock-step dual-core R4 with ECC memory (32-bit safety, SIL3 certified), ARM Cortex-A8 and Cortex-A9 (32-bit Linux/Android, 3D graphics), C5000 and C6000 DSP (16/32-bit all-around DSP, 32-bit fix/float), and Multicore (up to 12 cores, A15 + 8 C66x). Software listed: TI RTOS (SYS/BIOS), Linux, Android; C5x uses DSP/BIOS and C6x uses SYS/BIOS. Clock rates shown range from 25 MHz to 1.35 GHz and prices from $0.25 to $225.00.]
DSP Core

What Problem Are We Trying To Solve?

![Diagram showing digital sampling of an analog signal: A(t) → DAC, DSP → Y, ADC → X.](image)

Most DSP algorithms can be expressed with MAC:

\[ Y = \sum_{i=1}^{\text{count}} \text{coeff}_i \times x_i \]

`for (i = 0; i < count; i++) { Y += coeff[i] * x[i]; }`

How is the architecture designed to maximize computations like this?

'C6x CPU Architecture

- 'C6x Compiler excels at Natural C
- Multiplier (.M) and ALU (.L) provide up to 8 MACs/cycle (8x8 or 16x16)
- Specialized instructions accelerate intensive, non-MAC oriented calculations. Examples include:
  - Video compression, Machine Vision, Reed Solomon, …
- While MMACs speed math intensive algorithms, flexibility of 8 independent functional units allows the compiler to quickly perform other types of processing
- 'C6x CPU can dispatch up to eight parallel instructions each cycle
- All 'C6x instructions are conditional allowing efficient hardware pipelining

Note: More details later…
C6000 DSP Family CPU Roadmap

[Figure: C6000 DSP family CPU roadmap]

- **Cores shown:** C62x, C621x, C67x+, C64x+, C674, C66x
- **Features called out along the roadmap:**
  - Fixed Point; Floating Point; Fixed and Floating Point
  - EDMA; EDMA (v2); EDMA3
  - L1 Cache; L2 Cache/RAM; L1 RAM/Cache
  - 2x Register Set; SIMD Instr’s (Packed Data Proc); Compact Instr’s
  - 32x32 Int Multiply; Enhanced Instr for FFT/Complex; FFT enhancements
  - Video/Imaging Enhanced; 1GHz; Lower Cost; Lower power
  - Exceptions; Supervisor/User modes
  - DMAX (PRU); PRU
  - Available on the most recent releases
# Devices & Documentation

## DSP Generations: DSP and ARM+DSP

<table>
<thead>
<tr>
<th>Fixed-Point Cores</th>
<th>Floating-Point Cores</th>
<th>DSP</th>
<th>DSP+DSP (Multi-core)</th>
<th>ARM+DSP</th>
</tr>
</thead>
<tbody>
<tr>
<td>C62x</td>
<td>C67x</td>
<td>C620x, C670x</td>
<td></td>
<td></td>
</tr>
<tr>
<td>C621x</td>
<td>C67x</td>
<td>C6211, C671x</td>
<td></td>
<td></td>
</tr>
<tr>
<td>C64x</td>
<td></td>
<td>C641x DM642</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>C67x+</td>
<td>C672x</td>
<td></td>
<td></td>
</tr>
<tr>
<td>C64x+</td>
<td></td>
<td>DM643x DM645x</td>
<td>C647x</td>
<td>DM64xx, OMAP35x, DM37x</td>
</tr>
<tr>
<td>C674x</td>
<td></td>
<td>C6748</td>
<td></td>
<td>OMAP-L138*  C6A8168</td>
</tr>
<tr>
<td>C66x</td>
<td>Future</td>
<td>C667x C665x (new)</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

## Key C6000 Manuals

<table>
<thead>
<tr>
<th>Manual</th>
<th>C64x/C64x+</th>
<th>C674</th>
<th>C66x</th>
</tr>
</thead>
<tbody>
<tr>
<td>CPU Instruction Set Ref Guide</td>
<td>SPRU732</td>
<td>SPRUFE8</td>
<td>SPRUGH7</td>
</tr>
<tr>
<td>Megamodule/Corepac Ref Guide</td>
<td>SPRU971</td>
<td>SPRUFK5</td>
<td>SPRUGW0</td>
</tr>
<tr>
<td>Peripherals Overview Ref Guide</td>
<td>SPRUE52</td>
<td>SPRUFB9</td>
<td>N/A</td>
</tr>
<tr>
<td>Cache User's Guide</td>
<td>SPRU862</td>
<td>SPRUG82</td>
<td>SPRUGY8</td>
</tr>
<tr>
<td>Programmers Guide</td>
<td>SPRU198</td>
<td>SPRU198</td>
<td>SPRAB27</td>
</tr>
</tbody>
</table>

**DSP/BIOS Real-Time Operating System**
- SPRU423 - DSP/BIOS (v5) User's Guide
- SPRU403 - DSP/BIOS (v5) C6000 API Guide
- SPRUEX3 - SYS/BIOS (v6) User’s Guide
- SPRU186 - Assembly Language Tools User’s Guide

To find a manual, go to [www.ti.com](http://www.ti.com) and enter the document number in the Keyword field:

or...

[www.ti.com/lit/<litnum>](http://www.ti.com/lit/<litnum>)
## Peripherals

### Peripherals

<table>
<thead>
<tr>
<th>Serial</th>
<th>Storage</th>
<th>Master</th>
<th>Timing</th>
</tr>
</thead>
<tbody>
<tr>
<td>McBSP</td>
<td>DDR2</td>
<td>PCIe</td>
<td>Timers</td>
</tr>
<tr>
<td>McBSP</td>
<td>DDR3</td>
<td>USB 2.0</td>
<td>Watch</td>
</tr>
<tr>
<td>ASP</td>
<td>SDRAM</td>
<td>EMAC</td>
<td>PWM</td>
</tr>
<tr>
<td>UART</td>
<td>Async</td>
<td>uPP</td>
<td>eCAP</td>
</tr>
<tr>
<td>SPI</td>
<td>SD/MMC</td>
<td>HPI</td>
<td>RTC</td>
</tr>
<tr>
<td>I2C</td>
<td>ATA/CF</td>
<td>EDMA3</td>
<td></td>
</tr>
<tr>
<td>CAN</td>
<td>SATA</td>
<td>SCR</td>
<td>GPIO</td>
</tr>
</tbody>
</table>

### Video/Display Subsystem

- CAN
- UART
- What's Next?
- DIY...
- Capture
- Analog Display
- Digital Display
- LCD Controller

We'll look at just three of these: the PRU, the SCR and EDMA3.

### PRU

**Programmable Realtime Unit (PRU)**

**PRU consists of:**
- 2 Independent, Realtime RISC Cores
- Access to pins (GPIO)
- Its own interrupt controller
- Access to memory (master via SCR)
- Device power mgmt control (ARM/DSP clock gating)

- Use as a **soft peripheral** to implement add’l on-chip peripherals
- Example implementations include:
  - Soft UART
  - Soft CAN
- Create **custom peripherals** or setup non-linear DMA moves.
- No C compiler (ASM only)
- Implement smart power controller:
  - Allows switching off both ARM and DSP clocks
  - Maximize power down time by evaluating system events before waking up DSP and/or ARM
PRU SubSystem: IS / IS-NOT

<table>
<thead>
<tr>
<th>Is</th>
<th>Is Not</th>
</tr>
</thead>
<tbody>
<tr>
<td>Dual 32-bit RISC processor specifically designed for manipulation of packed memory-mapped data structures and implementing system features that have tight real-time constraints.</td>
<td>Is not a HW accelerator used to speed up algorithm computations.</td>
</tr>
<tr>
<td>Simple RISC ISA:<br>• Approximately 40 instructions<br>• Logical, arithmetic, and flow control ops all complete in a single cycle</td>
<td>Is not a general purpose RISC processor:<br>• No multiply hardware/instructions<br>• No cache or pipeline<br>• No C programming</td>
</tr>
<tr>
<td>Simple tooling:<br>• Basic command-line assembler/linker</td>
<td>Is not integrated with CCS. Doesn’t include advanced debug options.</td>
</tr>
<tr>
<td>Includes example code to demonstrate various features. Examples can be used as building blocks.</td>
<td>No Operating System or high-level application software stack.</td>
</tr>
</tbody>
</table>

SCR / EDMA3

System Architecture – SCR/EDMA

- SCR – Switched Central Resource
- Masters initiate accesses to/from slaves via the SCR
- Most Masters (requestors) and Slaves (resources) have their own port to the SCR
- Lower bandwidth masters (HPI, PCI66, etc) share a port
- There is a default priority (0 to 7) to SCR resources that can be modified.

Note: this picture is the "general idea". Every device has a different scheme for SCRs and peripheral muxing. In other words: "check your data sheet".
**Pin Muxing**

**What is Pin Multiplexing?**

- How many pins are on your device?
- How many pins would all your peripherals require?
- Pin Multiplexing is the answer – only so many peripherals can be used at the same time ... in other words, to reduce costs, peripherals must share available pins
- Which ones can you use simultaneously?
  - Designers examine app use cases when deciding best muxing layout
  - Read datasheet for final authority on how pins are muxed
  - Graphical utility can assist with figuring out pin-muxing...

**TMS320C6748 Interconnect Matrix**

Table 3-1. TMS320C6748 DSP System Interconnect Matrix

<table>
<thead>
<tr>
<th>Interface</th>
<th>C6748 Default</th>
<th>SCMA</th>
<th>EMIF A</th>
<th>DDB/DSR</th>
<th>128K RAM</th>
<th>EDM3A0 TC6/TC1</th>
<th>EDM3A1 TC0</th>
<th>Peripheral Group</th>
</tr>
</thead>
<tbody>
<tr>
<td>HPI</td>
<td>0</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td></td>
</tr>
<tr>
<td>uPP</td>
<td>2</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td></td>
</tr>
</tbody>
</table>

Note: not ALL connections are valid

Pin Muxing Example

**Pin Muxing Tools**

- Graphical Utilities For Determining which Peripherals can be Used Simultaneously
- Provides Pin Mux Register Configurations. Warns user about conflicts.
- ARM-based devices: [www.ti.com/tool/pinmuxtool](http://www.ti.com/tool/pinmuxtool) others: see product page
Example Device: C6748 DSP

**TMS320C674x Architecture - Overview**

- **Performance & Memory**
  - Up to 456MHz
  - 256K L2 (cache/SRAM)
  - 32K L1P/D Cache/SRAM
  - 16-bit DDR2-266
  - 16-bit EMIF (NAND Flash)

- **Communications**
  - 64-Channel EDMA 3.0
  - 10/100 EMAC
  - USB 1.1 & 2.0
  - SATA

- **Power/Packaging**
  - 13x13mm nPBGA & 16x16mm PBGA
  - Pin-to-pin compatible w/OMAP L138 (+ARM9), 361-pin pkg
  - Dynamic voltage/freq scaling
  - Total Power < 420mW
Choosing a Device

DSP & ARM MPU Selection Tool

[Diagram of the DSP & ARM MPU Selection Tool]

How do Interrupts Work?

1. An interrupt occurs
   - EDMA
   - McASP
   - Timer
   - Ext'l pins

2. Interrupt Selector

3. Sets flag in Interrupt Flag Register (IFR)

4. Is this specific interrupt enabled? (IER)

5. Are interrupts globally enabled? (GIE/NMIE)

6. • CPU Acknowledge
   • Auto hardware sequence
   • HWI Dispatcher (vector)
   • Branch to ISR

7. Interrupt Service Routine (ISR)
   • Context Save, ISR, Context Restore

◼ User is responsible for setting up the following:
   - #2 – Interrupt Selector (choose which 12 of 128 interrupt sources to use)
   - #4 – Interrupt Enable Register (IER) – individually enable the proper interrupt sources
   - #5 – Global Interrupt Enable (GIE/NMIE) – globally enable all interrupts

C64x+ Hardware Interrupts

◼ C6748 has 128 possible interrupt sources (but only 12 CPU interrupts)

◼ 4-Step Programming:
  1. Interrupt Selector – choose which of the 128 sources are tied to the 12 CPU ints
  2. IER – enable the individual interrupts that you want to “listen to” (in BIOS .cfg)
  3. GIE – enable global interrupts (turned on automatically if BIOS is used)
  4. Note: HWI Dispatcher performs “smart” context save/restore (automatic for BIOS Hwi)

Note: NMIE must also be enabled. BIOS automatically sets NMIE=1. If BIOS is NOT used, the user must turn on both GIE and NMIE manually.
Event Combiner

Event Combiner (ECM)
- Use only if you need more than 12 interrupt events
- ECM combines multiple events (e.g. 4-31) into one event (e.g. EVT0)
- EVT_x ISR must parse MEVTFLAG to determine which event occurred

Target Config Files

Creating a New Target Config File (.ccxml)
- **Target Configuration** – defines your “target” – i.e. emulator/device used, GEL scripts (replaces the old CCS Setup)
- Create user-defined configurations (select based on chosen board)

Advanced Tab

Specify GEL script here

More on GEL files...
What is a GEL File?

- GEL – General Extension Language (not much help, but there you go...)
- A GEL file is basically a “batch file” that sets up the CCS debug environment including:
  - Memory Map
  - Watchdog
  - UART
  - Other periphs

- The board manufacturer (e.g. SD or LogicPD) supplies GEL files with each board.
- To create a “stand-alone” or “bootable” system, the user must write code to perform these actions (optional chapter covers these details)

Creating Custom Platforms

Creating Custom Platforms - Procedure

- Most users will want to create their own custom platform package (Stellaris/c28X – maybe not – they will use a .cmd file directly)
- Here is the process:
  1. Create a new platform package
  2. Select repository, add to project path, select device
  3. Import the existing “seed” platform
  4. Modify settings
  5. [Save] – creates a custom platform pkg
  6. Build Options – select new custom platform

1. Create New Platform (via DEBUG perspective)

2. Configure New Platform
   - Platform Package Details
   - Package Name: evmc6748_TTO
   - Platform Package Repository: C:\SYSB0S\l4\Beans\SYSB0S_Platforms
   - "Add Repository to Project Package Path"
   - Custom Repository vs. XDC default location

3. New Device Page – Click “Import” (copy "seed" platform)

4. Customize Settings

5. [SAVE] New Platform (creates custom platform package)
6. Select New Platform in Build Options (RTSC tab)

![Diagram showing Custom Repository vs. XDC default location and path addition to find new platform.]

Custom Repository vs. XDC default location

With path added, the tools find new platform
Chapter Quiz

1. How many functional units does the C6000 CPU have?

2. What is the size of a C6000 instruction word?

3. What is the name of the main “bus arbiter” in the architecture?

4. What is the main difference between a bus “master” and “slave”?

5. Fill in the names of the following blocks of memory and bus:
Quiz - Answers

Chapter Quiz

1. How many functional units does the C6000 CPU have?
   • 8 functional units or “execution units”

2. What is the size of a C6000 instruction word?
   • 256 bits (8 units x 32-bit instructions per unit)

3. What is the name of the main “bus arbiter” in the architecture?
   • Switched Central Resource (SCR)

4. What is the main difference between a bus “master” and “slave”?
   • Masters can initiate a memory transfer (e.g. EDMA, CPU…)

5. Fill in the names of the following blocks of memory and bus:
Using Double Buffers

**Single vs Double Buffer Systems**

**Single buffer system: collect data or process data – not both!**

- Nowhere to store new data when prior data is being processed

**Double buffer system: process and collect data – real-time compliant!**

- One buffer can be processed while another is being collected
- When Swi/Task finishes buffer, it is returned to Hwi
- Task is now ‘caught up’ and meeting real-time expectations
- Hwi must have priority over Swi/Task to get new data while prior data is being processed – standard in SYS/BIOS
Lab 11: An Hwi-Based Audio System

In this lab, we will use an Hwi to respond to McASP interrupts. The McASP/AIC3106 init code has already been written for you. The McASP interrupts have been enabled. However, it is your challenge to create an Hwi and ensure all the necessary conditions to respond to the interrupt are set up properly.

This lab also employs double buffers – ping and pong. Both the RCV and XMT sides have a ping and pong buffer. The concept here is that when you are processing one, the other is being filled. A Boolean variable (pingPong) is used to keep track of which “side” you’re on.

Application: Audio pass-thru using Hwi and McASP/AIC3106
Key Ideas: Hwi creation, Hwi conditions to trigger an interrupt, Ping-Pong memory management

Pseudo Code:

- main() – init BSL, init LED, return to BIOS scheduler
- isrAudio() – responds to McASP interrupt, read data from RCV XBUF – put in RCV buffer, acquire data from XMT buffer, write to XBUF. When buffer is full, copy RCV to XMT buffer. Repeat.
- FIR_process() – memcpy RCV to XMT buffer. Dummy “algo” for FIR later on...

Procedure
1. Import existing project (Lab11)
2. Create your own CUSTOM PLATFORM
3. Config Hwi to respond to McASP interrupt
4. Debug Interrupt Problems

Time = 45min
Lab 11 – Procedure

If you can’t remember how to perform some of these steps, please refer back to the previous labs for help. Or, if you really get stuck, ask your neighbor. If you AND your neighbor are stuck, then ask the instructor (who is probably doing absolutely NOTHING important) for help.

Import Existing Project

1. Close ALL open projects and files and then open CCS.
2. Import Lab11 project.
   - As before, import the archived starter project from:
     
     `C:\TI-RTOS\C6000\Labs\Lab_11\`

     ![LAB_11_C6000_STARTER_audio_Hwi.zip]

     This starter file contains all the starting source files for the audio project including the setup code for the A/D and D/A on the OMAP-L138 target board. It also has UIA activated.

3. Check the Properties to ensure you are using the latest XDC, BIOS and UIA.

   For every imported project in this workshop, ALWAYS check to make sure the latest tools (XDC, BIOS and UIA) are being used. The author created these projects at time “x” and you may have updated the tools on your student PC at “x+1” – some time later. The author used the tools available at time “x” to create the starter projects and solutions which may or may not match YOUR current set of tools.

   Therefore, you may be importing a project that is NOT using the latest versions of the tools (XDC, BIOS, UIA) or the compiler.

   - Check ALL settings for the Properties of the project (XDC, BIOS, UIA) and the compiler and update the imported project to the latest tools before moving on and save all settings.

Hack LogicPD’s BSL types.h

4. Edit Logic PD’s types.h file (already done for you…but take a look at what the author did).

   Logic PD’s types.h contains typedefs that conflict with BIOS. So, in order for them to play together nicely, users need to “hack” this file (like the author did for you already).

   - Open the following file via CCS or any editor:
     
     `C:\TI_RTOS\Labs\LogicPD_BSL\DSP BSL\inc\types.h`

   - At the top of the file, notice the following two lines of code:

     ```
     #include <stdint.h>
     #define TYPES_H //hack to not use this
     ```

   - Close types.h.

   Now that this file is hacked, you will be able to use Logic PD’s types.h for all future labs without a ton of warnings when you build.
Application (Audio Pass-Thru) Overview

5. Let’s review what this audio pass-thru code is doing.

As discussed in the lab description, this application performs an audio pass-thru. The best way to understand the process is via I-P-O:

- **Input (RCV)** – each analog audio sample from the audio INPUT port is converted by the A/D and sent to the McASP port on the C6748. For each sample, the McASP generates an interrupt to the CPU. In the ISR, the CPU reads this sample and puts it in a buffer (RCV ping or pong). Once the buffer fills up (BUFFSIZE), processing begins…

- **Process** – Our algorithm is very fancy – it is a COPY from the RCV buffer to the XMT buffer.

- **Output (XMT)** – When the McASP transmit buffer is empty, it interrupts the CPU and asks for another sample. In the ISR (same ISR for the RCV side), the CPU reads a sample from the XMT buffer and writes it to the McASP transmit register. The McASP sends this sample to the D/A, which drives the audio OUTPUT port.

Several source files are needed to create this application. Let’s explore those briefly…

Source Code Overview

6. Inspect the source code.

Following is a brief description of the source code. Because this workshop can be targeted at many processors (MSP430, Stellaris-M3, C28x, C6000, ARM), some of the hardware details will be minimized and saved for the target-specific chapter.

- Feel free to open any of these files and inspect them as you read…

- **main.h** – same as before, but contains more function prototypes

- **aic3106_TTO.c** – initializes the analog interface chip (AIC) on the EVM – this is the A/D and D/A combo device.

- **fir.c** – this is a placeholder for the algorithm. Currently, it is simply a copy function – to copy RCV to XMT buffers.

- **isr.c** – This is the interrupt service routine (**isrAudio**). When the interrupt from the McASP fires (RCV or XMT), the BIOS HWI (soon to be set up) will call this routine to read/write audio samples.

- **main.c** – sets up the McASP and AIC and then calls BIOS_start().

- **mcasp_TTO.c** – init code for the McASP on the C6748 device.
More Detailed Code Analysis

7. Open main.c for editing.

Near the top of the file, you will see the buffer allocations:

```c
int16_t rcvPing[BUFFSIZE]; // ping/pong buffers
int16_t rcvPong[BUFFSIZE];
int16_t xmtPing[BUFFSIZE];
int16_t xmtPong[BUFFSIZE];
```

Notice that we have separate buffers for Ping and Pong for both RCV and XMT. Where is BUFFSIZE defined? In main.h. We'll look at it in a minute.

As you go into main(), you'll see the zeroing of the buffers to provide initial conditions of ZERO. Think about this for a minute. Is that ok? Well, it depends on your system. If BUFFSIZE is 256, that means 256 ZEROs will be transmitted to the DAC during the first 256 interrupts. What will that sound like? Do we care? Some systems require solid initial conditions – so keep that in mind. We will just live with the zeros for now.

Then, you'll see the calls to the init routines for the McASP and AIC3106. Previously, with DSP/BIOS, this is where an explicit call to init interrupts was located. However, with SYS/BIOS, this is done via the GUI. Lastly, there is a call to McASP_Start(). This is where the McASP is taken out of reset and the clocks start operating and data starts being shifted in/out. Soon thereafter, we will get the first interrupt.

8. Open mcasp_TTO.c for editing.

This file is responsible for initializing and starting the McASP – hence, two functions (init and start). In particular, look at line numbers 83 and 84 (approximately). This is where the serializers are chosen. This specifies XBUF11 (XMT) and XBUF12 (RCV). Also, look at line numbers 111-114. This is where the McASP interrupts are enabled. So, if they are enabled correctly, we should get these interrupts to fire to the CPU.
9. Open `isr.c` for editing.

Well, this is where all the real work happens – inside the ISR. This code should look pretty familiar to you already. There are 3 key concepts to understand in this code:

- **Ping/Pong buffer management** – notice that two “local” pointers are used to point to the RCV/XMT buffers. This was done as a precursor to future labs – but works just fine here too. Notice at the top of the function that the pointers are initialized only if `blkCnt` is zero (i.e. we’re done with the previous block and it is time to switch from ping to pong buffers or vice versa). `blkCnt` is used as an index into the buffers.

- **McASP reads/writes** – refer to the read/write code in the middle. When an interrupt occurs, we don’t know if it was the RRDY (RCV) or XRDY (XMT) bit that triggered the interrupt. We must first test those bits, then perform the proper read or write accordingly. On EVERY interrupt, we EITHER read one sample OR write one sample. All McASP reads and writes are 32 bits. Period. Even if your word length is 16 bits (like ours is). Because we are “MSB first”, the 16 bits of interest land in the UPPER half of the 32 bits. We turned on ROR (rotate-right) of 16 bits on rcv/xmt to make our code more readable (and to save time vs. doing a `>> 16` in compiled code).

- **At the end of the block** – what happens? Look at the bottom of the code. When `BUFFSIZE` is reached, `blkCnt` is zero’d and the pingPong Boolean switches. Then, a call to `FIR_process()` is made that simply copies RCV buffer to XMT buffer. Then, the process happens all over again for the “other” (PING or PONG) buffers.

10. Open `fir.c` for editing.

This is currently a placeholder for a future FIR algorithm to filter our audio. For now, we simply pass the data through from RCV to XMT. In future labs, a FIR filter written in C will magically appear and we’ll analyze its performance quite extensively.

11. Open `main.h` for editing.

`main.h` is actually a workhorse. It contains all of the `#includes` for BSL and other items, `#defines` for `BUFFSIZE` and `PING/PONG`, prototypes for all functions and externs for all variables that require them. Whenever you are asked to “change `BUFFSIZE`”, this is the file to change it in.
Creating A Custom Platform

12. Create a custom platform file.

In previous labs, we specified a platform file during creation of a new project. In this lab, we will create our own custom platform that we will use throughout the rest of the labs. Plus, this is a good skill to know how to do.

Whenever you create your own project, you should always IMPORT the seed platform file for the specific target board and then make changes. This is what we plan to do next…

► In Debug Perspective, select: Tools → RTSC Tools → Platform → New

When the following dialogue appears:

• ► Give your platform a name: evmc6748_student (the author used _TTO for his)
• ► Point the repository to the path shown (this is where the platform package is stored)
• ► Then select the Device Family/Name as shown
• ► Check the box “Add Repository to Project Package Path” (so we can find it later).

When you check this box, select your current project in the listing that pops up. This also adds this repository to the list of Repositories in the Properties → General → RTSC tab dialogue.

► Click Next.
When the new platform dialogue appears, ▶️ click the IMPORT button to copy the seed file we used before:

![Import button]

This will copy all of the initial default settings for the board and then we can modify them. A dialogue box should pop up and select the proper seed file as shown (▶️ select the _TTO version of the platform file that the author already created for you):

![Select Platform dialog]

▶️ Modify the memory settings to allocate all code, data and stacks into internal memory (IRAM) as shown. They may already be SET this way – just double check.

▶️ **BEFORE YOU SAVE – HAVE THE INSTRUCTOR CHECK THIS FILE.**

▶️ Then save the new platform. This will build a new platform package.

**13. Tell the tools to use this new custom platform in your project.**

We have created a new platform file, but we have not yet ATTACHED it to our project. When the project was created, we were asked to specify a platform file and we chose the default seed platform. How do we get back to the configuration screen?

▶️ Right-click on the project and select **Properties → General** and then select the **RTSC tab**.

▶️ Look near the bottom and you’ll see that the default seed platform is still specified. We need to change this.

▶️ Click on the down arrow next to the **Platform File**. The tools should access your new repository with your new custom platform file: `evmc6748_student`.

![Platform dialog]

▶️ Select **YOUR STUDENT PLATFORM FILE** and click **Ok**. Now, your project is using the new custom platform. Very nice…
Add Hwi to the Project

14. Use Hwi module and configure the hardware interrupt for the McASP.

Ok, FINALLY, we get to do some real work to get our code running. For most targets, an interrupt source (e.g. McASP) will have an interrupt EVENT ID (specified in the datasheet). This event id needs to be tied to a specific CPU interrupt. The details change based on the target device. For the C6748, the EVENT ID is #61 and the CPU interrupt we're using is INT5 (the C6748 CPU has 12 maskable interrupts, INT4 to INT15 – again, target specific).

So, we need to do two things: (1) tell the tools we want to USE the Hwi BIOS module; (2) configure a specific interrupt to point to our ISR routine (isrAudio).

During the 2-day TI-RTOS Kernel Workshop, you performed these actions – so this should be review – but that's ok. Review is good.

► First, make sure you are viewing the hwi.cfg file.

► In the list of Available Products, locate Hwi, right-click and select “Use Hwi”. It will now show up on the right-hand Outline View.

► Then, right click on Hwi in the Outline View and select “New Hwi”.

► When the dialogue appears, which is different than what you see below, click OK.

► Then click on the new Hwi (hwi0) (you'll see a new dialogue like below) and fill in the following:

![Required Settings](image)

Make sure “Enabled at startup” is NOT checked (this sets the IER bit on the C6748). This will provide us with something to debug later. Once again, you can click on the new HWI and see the corresponding Source script code.
Build, Load, Run.

15. Build, load and run the audio pass-thru application.

► Before you Run, make sure audio is playing into the board and your headphones are set up so you can hear the audio.

► Also, make sure that Windows Media Player is set to REPEAT forever. If the music stops (the input is air), and you click Run, you might think there is a problem with your code. Nope, there is no music playing. 😊

► Build and fix any errors. After a successful build, debug the application.

► Once the program is loaded, click Run.

Do you hear audio? If not, it’s debug time – it SHOULD NOT be working (by design). One quick tip for debug is to place a breakpoint in the `isrAudio()` routine and see if the program stops there. If not, no interrupt is being generated. Move on to the next steps to debug the problem…

Hint: The McASP on the C6748 cannot be restarted after a halt – i.e. you can’t just hit halt, then Run. Once you halt the code, you must click the restart button and then Play.

Debug Interrupt Problem

As we already know, we decided early on to NOT enable the IER bit in the static configuration of the Hwi. Ok. But debugging interrupt problems is a crucial skill. The next few steps walk you through HOW to do this. You may not know WHERE your interrupt problem occurred, so using these brief debug skills may help in the future.

16. Pause for a moment to reflect on the “dominos” in the interrupt game:

- An interrupt must occur (McASP init code should turn ON this source)
- The individual interrupt must be enabled (IER, BITx)
- Global Interrupts must be turned on (GIE = 1, handled by BIOS)
- HWI Dispatcher must be used to provide proper context save/restore
- Keep this all in mind as you do the following steps…

17. McASP interrupt firing – IFR bit set?

The McASP interrupt is set to fire properly, but is it setting the IFR bit? You configured `HWI_INT5`, so that would be a “1” in bit 5 of the IFR.

► Go there now (View → Registers → Core Registers). ► Look down the list to find the IFR and IER – the two of most interest at the moment (author note: could the IFR bit have been set, then auto-cleared already?). You can also leave the IER bit DISABLED (as it already is in the CFG file), build/run, and THEN look at the IFR (this is a nice trick).

Write your debug “checkmarks” here:

**IFR** bit set? □ Yes □ No

18. Is the **IER** bit set?

Interrupts must be individually enabled. When you look at **IER** bit 5, is it set to “1”? Probably NOT, because we didn’t check the “Enable at Start” checkbox.

► Open up the config for HWI_INT5 and check the proper checkbox. Then, hit build and your code will build and load automatically regardless of which perspective you are in.

**IER** bit set? □ Yes □ No

Do you hear audio now? You probably should. But let’s check one more thing…

19. Is **GIE** set?

The Global Interrupt Enable (**GIE**) Bit is located in the CPU’s **CSR** register. SYS/BIOS turns this on automatically and then manages it as part of the O/S. So, no need to check on this.

**GIE** bit set? □ Yes □ No

**Hint:** If you create a project that does NOT use SYS/BIOS, it is the responsibility of the user to turn on not only **GIE** (in the **CSR** register) but also **NMIE** (in the **IER** register). Otherwise, NO interrupts will be recognized. Ever. Did I say ever?

**Other Debug/Analysis Items**


Users often want to make a minor change to their code, then rebuild and run quickly. After you launch a debug session and connect to the target (which takes time), there is NO NEED to terminate the session to make code changes. After pausing (halting) the code execution, make a change to the code (using the Edit perspective or Debug perspective) and hit “Build”. CCS will build and load your new .out file WITHOUT taking the time to launch a new debug session or reconnect to the target. This is very handy. **TRY THIS NOW.**

Because we are using the McASP, any underrun will cause the McASP to crash (no more audio to the speaker/headphone). So, how can you halt and then start again quickly?

► Halt your code and then select **Run → Restart** or click the **Restart** button (arrow with PLAY):

![Restart Button](image)

So, try this now.

► Run your code and halt (pause). Run again. Do you hear audio? Nope. Click the restart button and run again. Now it should work.

These will be handy tips for all lab steps now and in the future.
That’s It. You’re Done!!

21. **Note about benchmarks, UIA and Logs in this lab.**

   There is really no extra work we can do in terms of UIA and Logs. These services will be used in all future labs. If you have time and want to add a Log or benchmark using Timestamp to the code, go ahead.

   You spent the past two days in the Kernel workshop playing with these tools. The point of this lab was to get you up to speed on Platforms and focusing more on C6000 as the specific target. In the future labs, though, you’ll have more chances to use UIA and Logs to test the compiler and optimizer and cache settings.

22. **Close the project and delete it from the workspace.**

   Terminate the debug session and close CCS. Power cycle the board.

   RAISE YOUR HAND and get the instructor’s attention when you have completed PART A of this lab. If time permits, you can quickly do the next optional part...
PART B (Optional) – Using the Profiler Clock

23. Turn on the Profiler Clock and perform a benchmark.
   ▶ Set two breakpoints anywhere you like (double click in left pane of code) – one at the “start” point and another at the “end” point that you want to benchmark.
   
   Turn on the Profiler clock by selecting: Run → Clock → Enable

   In the bottom right-hand part of the screen, you should see a little CLK symbol that looks like this:

   ![CLK symbol]

   Run to the first breakpoint, then double-click on the clock symbol to zero it. Run again and the number of CPU cycles will display.
Additional Information

**Exception Handling**

An “exception” can be used to:

- Trap an illegal instruction (code/data corruption, resource conflicts or invalid use of hardware)
- Handle general errors for different peripherals

**Exception Types**

- External (serious/fatal hardware problem, NMI pin)
- Internal (generated by CPU or software-triggered via “SWE” instruction)
- All of the exception types above use the NMI (Non-Maskable Interrupt) vector

```c
void uh_oh (void)
{
    /* "Houston...we have a..." */
}
```

**External/Internal Exception Causes**

- **External** – whatever is tied to the NMI pin (system dependent)
- **Internal (IERR register)** – includes the following:
  - Fetch error (branch to middle of 32-bit instruction or fetch packet header)
  - Illegal or reserved opcode
  - Simultaneous writes to the same register
  - Two branches taken in the same execute packet
  - SPLOOP buffer exception (e.g. unit conflict – attempt to use the same unit)
  - Software triggered (SWE instruction – uses NMI vector)

**IERR – Internal Exception Report Register**

- **MBX** – SPLOOP buffer
- **OPX** – Opcode
- **PRX** – Privilege
- **EPX** – Execute packet
- **RAX** – Resource access
- **FPX** – Fetch packet
- **RCX** – Resource conflict
- **IFX** – Instruction fetch

- To enable exceptions, user must enable GEE (Global Exception Enable)
- NMI ISR interrogates EFR (Exception Flag Register) to determine the type of exception
- If internal, the ISR can interrogate IERR to determine type of internal exception
- Return pointer placed in NRP (NMI Return Pointer). To return, you must execute “B NRP”.

*Note: Other context saved/restored (refer to SPRU32, Ch-2)*
Introduction

In this chapter, we will take a deeper look at the C64x+ architecture and assembly code. The point here is not to cover HOW to write assembly – it is just a convenient way to understand the architecture better.

Objectives

- Provide a detailed overview of the C64x+/C674x CPU architecture
- Describe the basic ASM language and hardware needed to solve a sum of products (SOP)
- Analyze how the hardware pipeline works
- Learn basics of software pipelining
Module Topics

C64x+/C674x CPU Architecture

Module Topics
What Does A DSP Do?
CPU – From the Inside – Out
Instruction Sets
“MAC” Instructions
C66x – “MAC” Instructions
Hardware Pipeline
Software Pipelining
Chapter Quiz
Quiz - Answers
What Does A DSP Do?

What Problem Are We Trying To Solve?

**Digital sampling of an analog signal:**

Most DSP algorithms can be expressed with MAC:

\[
Y = \sum_{i=1}^{\text{count}} \text{coeff}_i \times x_i
\]

for (i = 0; i < count; i++){
    Y += coeff[i] * x[i]; }
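
The loop above is only a fragment. A minimal self-contained sketch in C follows, assuming 16-bit samples and coefficients with a 32-bit accumulator; the types and the function name `dot_product` are illustrative assumptions, not taken from the workshop code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of products: one multiply-accumulate (MAC) per loop iteration. */
int32_t dot_product(const int16_t *coeff, const int16_t *x, size_t count)
{
    assert(coeff != NULL && x != NULL);
    int32_t y = 0;
    for (size_t i = 0; i < count; i++) {
        y += (int32_t)coeff[i] * x[i];   /* 16x16 multiply, 32-bit accumulate */
    }
    return y;
}
```

Written this way, each iteration needs two loads, one multiply and one add, which is the work the rest of the chapter maps onto the functional units.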

How is the architecture designed to maximize computations like this?

'C6x CPU Architecture

- 'C6x Compiler excels at Natural C
- Multiplier (M) and ALU (L) provide up to 8 MACs/cycle (8x8 or 16x16)
- Specialized instructions accelerate intensive, non-MAC oriented calculations. Examples include: Video compression, Machine Vision, Reed Solomon, ...
- While MMACs speed math intensive algorithms, flexibility of 8 independent functional units allows the compiler to quickly perform other types of processing
- 'C6x CPU can dispatch up to eight parallel instructions each cycle
- All 'C6x instructions are conditional allowing efficient hardware pipelining

Note: More details later…
The Core of DSP: Sum of Products

\[ y = \sum_{n=1}^{40} c_n \times x_n \]

The ‘C6000

Designed to handle DSP’s math-intensive calculations

Note:
You don’t have to specify functional units (.M or .L)

Where are the variables stored?

Working Variables: The Register File

Register File A

\[ y = \sum_{n=1}^{40} c_n \times x_n \]

How can we loop our ‘MAC’?
Making Loops

1. **Program flow:** the branch instruction
   
   ![Branch Instruction Diagram]

2. **Initialization:** setting the loop count
   
   MVK 40, cnt

3. **Decrement:** subtract 1 from the loop counter
   
   SUB cnt, 1, cnt

---

".S" Unit: Branch and Shift Instructions

- Register File A
  - c
  - x
  - cnt
  - prod
  - y
  - ...
  - ...

- 32-bits
- 16 or 32 registers

- MVK .S 40, cnt
- MPY .M c, x, prod
- ADD .L y, prod, y
- SUB .L cnt, 1, cnt
- B .S loop

How is the loop terminated?
Conditional Instruction Execution

To minimize branching, all instructions are conditional

    [condition]  B  loop

Execution based on [zero/non-zero] value of specified variable

<table>
<thead>
<tr>
<th>Code Syntax</th>
<th>Execute if:</th>
</tr>
</thead>
<tbody>
<tr>
<td>[ cnt ]</td>
<td>cnt ≠ 0</td>
</tr>
<tr>
<td>[ !cnt ]</td>
<td>cnt = 0</td>
</tr>
</tbody>
</table>

Note: If condition is false, execution is essentially replaced with nop

Loop Control via Conditional Branch

Register File A

- **c**
- **x**
- **cnt**
- **prod**
- **y**
- ... (equivalent to 16 or 32 registers)
- 32-bits

Mathematical expression:

\[ y = \sum_{n=1}^{40} c_n \cdot x_n \]

Instructions:

- **MVK .S 40, cnt**
- **MPY .M c, x, prod**
- **ADD .L y, prod, y**
- **SUB .L cnt, 1, cnt**
- **[cnt] B .S loop**

How are the c and x array values brought in from memory?
Memory Access via “.D” Unit

$$ y = \sum_{n=1}^{40} c_n x_n $$

MVK .S 40, cnt
loop:
LDH .D *cp , c
LDH .D *xp , x
MPY .M c, x, prod
ADD .L y, prod, y
SUB .L cnt, 1, cnt
[cnt] B .S loop

Data Memory: x(40), a(40), y

What does the “H” in LDH signify?

<table>
<thead>
<tr>
<th>Instr.</th>
<th>Description</th>
<th>C Type</th>
<th>Size</th>
</tr>
</thead>
<tbody>
<tr>
<td>LDB</td>
<td>load byte</td>
<td>char</td>
<td>8-bits</td>
</tr>
<tr>
<td>LDH</td>
<td>load half-word</td>
<td>short</td>
<td>16-bits</td>
</tr>
<tr>
<td>LDW</td>
<td>load word</td>
<td>int</td>
<td>32-bits</td>
</tr>
<tr>
<td>LDDW*</td>
<td>load double-word</td>
<td>double</td>
<td>64-bits</td>
</tr>
</tbody>
</table>

* Except C62x & C67x generations

How do we increment through the arrays?
Auto-Increment of Pointers

\[
y = \sum_{n=1}^{40} c_n \cdot x_n
\]

Register File A

- c
- x
- cnt
- prod
- y
- *cp
- *xp
- *yp

Data Memory: x(40), a(40), y

MVK .S 40, cnt
loop:
LDH .D *cp++, c
LDH .D *xp++, x
MPY .M c, x, prod
ADD .L y, prod, y
SUB .L cnt, 1, cnt
[cnt] B .S loop

How do we store results back to memory?

Storing Results Back to Memory

STW .D y, *yp
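
For readers more comfortable with C, the assembly sequence built up above can be mirrored line by line. This is a sketch under assumptions: the function name `sop_loop`, the parameter types, and the zero initial value of `y` are not in the original listing, and the loop count is a parameter instead of the constant 40.

```c
#include <assert.h>
#include <stdint.h>

/* C mirror of the MVK / LDH / MPY / ADD / SUB / [cnt] B / STW sequence. */
void sop_loop(const int16_t *cp, const int16_t *xp, int32_t *yp, int cnt)
{
    assert(cnt > 0);              /* like the assembly, the body runs at least once */
    int32_t y = 0;                /* assumption: the sum starts at zero */
    do {
        int16_t c = *cp++;        /* LDH .D *cp++, c    */
        int16_t x = *xp++;        /* LDH .D *xp++, x    */
        int32_t prod = c * x;     /* MPY .M c, x, prod  */
        y += prod;                /* ADD .L y, prod, y  */
        cnt -= 1;                 /* SUB .L cnt, 1, cnt */
    } while (cnt != 0);           /* [cnt] B .S loop    */
    *yp = y;                      /* STW .D y, *yp      */
}
```

The `do { } while` form is deliberate: the conditional branch sits at the bottom of the loop, so the test happens after the body, just as in the assembly.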

But wait - that's only half the story...
Dual Resources: Twice as Nice

Register File A

<table>
<thead>
<tr><th>Register</th><th>Contents</th></tr>
</thead>
<tbody>
<tr><td>A0</td><td>cn</td></tr>
<tr><td>A1</td><td>xn</td></tr>
<tr><td>A2</td><td>cnt</td></tr>
<tr><td>A3</td><td>prd</td></tr>
<tr><td>A4</td><td>sum</td></tr>
<tr><td>A5</td><td>*c</td></tr>
<tr><td>A6</td><td>*x</td></tr>
<tr><td>A7</td><td>*y</td></tr>
<tr><td>... A15 or A31</td><td></td></tr>
</tbody>
</table>

Each register is 32 bits wide.

Register File B

Registers B0 through B15 or B31, each 32 bits wide.

Functional units: .S1, .S2, .M1, .M2, .L1, .L2, .D1, .D2 (units ending in 1 belong to the A side, units ending in 2 to the B side)

Our final view of the sum of products example...

\[ y = \sum_{n=1}^{40} c_n \cdot x_n \]

Optional - Resource Specific Coding

Register File A

<table>
<thead>
<tr><th>Register</th><th>Contents</th></tr>
</thead>
<tbody>
<tr><td>A0</td><td>cn</td></tr>
<tr><td>A1</td><td>xn</td></tr>
<tr><td>A2</td><td>cnt</td></tr>
<tr><td>A3</td><td>prd</td></tr>
<tr><td>A4</td><td>sum</td></tr>
<tr><td>A5</td><td>*c</td></tr>
<tr><td>A6</td><td>*x</td></tr>
<tr><td>A7</td><td>*y</td></tr>
<tr><td>... A15 or A31</td><td></td></tr>
</tbody>
</table>

Each register is 32 bits wide.

It's easier to use symbols rather than register names, but you can use either method.
### Instruction Sets

#### ‘C62x RISC-like instruction set

<table>
<thead>
<tr>
<th>.S Unit</th>
<th>.L Unit</th>
</tr>
</thead>
<tbody>
<tr>
<td>ADD</td>
<td>ABS</td>
</tr>
<tr>
<td>ADDK</td>
<td>ADD</td>
</tr>
<tr>
<td>ADD2</td>
<td>AND</td>
</tr>
<tr>
<td>B</td>
<td>CMPEQ</td>
</tr>
<tr>
<td>CLR</td>
<td>CMPLT</td>
</tr>
<tr>
<td>EXT</td>
<td>LMBD</td>
</tr>
<tr>
<td>MV</td>
<td>MV</td>
</tr>
<tr>
<td>MVC</td>
<td>MVC</td>
</tr>
<tr>
<td>MVK</td>
<td>MVK</td>
</tr>
<tr>
<td>MVKH</td>
<td>MVK</td>
</tr>
<tr>
<td>.D Unit</td>
<td>.M Unit</td>
</tr>
<tr>
<td>ADD</td>
<td>ADD</td>
</tr>
<tr>
<td>ADDAB</td>
<td>ADD</td>
</tr>
<tr>
<td>B</td>
<td>ADD</td>
</tr>
<tr>
<td>CLR</td>
<td>ADD</td>
</tr>
<tr>
<td>EXT</td>
<td>ADD</td>
</tr>
<tr>
<td>MV</td>
<td>ADD</td>
</tr>
<tr>
<td>MVK</td>
<td>ADD</td>
</tr>
<tr>
<td>MVKH</td>
<td>ADD</td>
</tr>
</tbody>
</table>

#### ‘C67x: Superset of Fixed-Point

<table>
<thead>
<tr>
<th>.S Unit</th>
<th>.L Unit</th>
</tr>
</thead>
<tbody>
<tr>
<td>ADD</td>
<td>ABS</td>
</tr>
<tr>
<td>ADDK</td>
<td>ABSDP</td>
</tr>
<tr>
<td>ADD2</td>
<td>ABSDP</td>
</tr>
<tr>
<td>B</td>
<td>ADDDP</td>
</tr>
<tr>
<td>CLR</td>
<td>ADDDP</td>
</tr>
<tr>
<td>EXT</td>
<td>ADDDP</td>
</tr>
<tr>
<td>MV</td>
<td>ADDDP</td>
</tr>
<tr>
<td>MVC</td>
<td>ADDDP</td>
</tr>
<tr>
<td>MVK</td>
<td>ADDDP</td>
</tr>
<tr>
<td>MVKH</td>
<td>ADDDP</td>
</tr>
<tr>
<td>.D Unit</td>
<td>.M Unit</td>
</tr>
<tr>
<td>ADD</td>
<td>MPY</td>
</tr>
<tr>
<td>ADDAB</td>
<td>MPY</td>
</tr>
<tr>
<td>B</td>
<td>MPY</td>
</tr>
<tr>
<td>CLR</td>
<td>MPY</td>
</tr>
<tr>
<td>EXT</td>
<td>MPY</td>
</tr>
<tr>
<td>MV</td>
<td>MPY</td>
</tr>
<tr>
<td>MVK</td>
<td>MPY</td>
</tr>
<tr>
<td>MVKH</td>
<td>MPY</td>
</tr>
</tbody>
</table>

---

Note: The tables above list the instructions available on each functional unit (.S, .L, .D, and .M) for the C62x and C67x. The C67x adds floating-point instructions (for example ABSDP and ADDDP) to the fixed-point set.
### 'C64x: Superset of 'C62x Instruction Set

<table>
<thead>
<tr>
<th>Dual/Quad Arith</th>
<th>Data Pack/Unpack</th>
<th>Compares</th>
<th>Dual/Quad Arith</th>
<th>Data Pack/Unpack</th>
</tr>
</thead>
<tbody>
<tr>
<td>.S</td>
<td></td>
<td></td>
<td>.L</td>
<td></td>
</tr>
<tr>
<td>SADD2</td>
<td>PACK2</td>
<td>CMPEQ2</td>
<td>ABS2</td>
<td>PACK2</td>
</tr>
<tr>
<td>SADDUS2</td>
<td>PACKH2</td>
<td>CMPEQ4</td>
<td>ADD2</td>
<td>PACKH2</td>
</tr>
<tr>
<td>SADD4</td>
<td>PACKLH2</td>
<td>CMPGT2</td>
<td>ADD4</td>
<td>PACKHL2</td>
</tr>
<tr>
<td></td>
<td>UNPKH4</td>
<td>CMPGT4</td>
<td>MAX</td>
<td>PACKHL2</td>
</tr>
<tr>
<td></td>
<td>SWAP2</td>
<td></td>
<td>MIN</td>
<td>PACKH4</td>
</tr>
<tr>
<td></td>
<td>SPACK2</td>
<td></td>
<td>SUB2</td>
<td>UNPKH4</td>
</tr>
<tr>
<td></td>
<td>SPACKU4</td>
<td></td>
<td>SUB4</td>
<td>UNPKLU4</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>SUBABS4</td>
<td>SWAP2/4</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Bitwise Logical</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>ANDN</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
### C64x+ Additions

<table>
<thead>
<tr>
<th>Dual/Quad Arith</th>
<th>Mem Access</th>
<th>Comparisons</th>
<th>Dual/Quad Arith</th>
<th>Mem Access</th>
</tr>
</thead>
<tbody>
<tr>
<td>.S</td>
<td></td>
<td></td>
<td>.L</td>
<td></td>
</tr>
<tr>
<td>CALLP</td>
<td></td>
<td></td>
<td>ADDSUB</td>
<td></td>
</tr>
<tr>
<td>DMV</td>
<td></td>
<td></td>
<td>ADDSUB2</td>
<td></td>
</tr>
<tr>
<td>RPACK2</td>
<td></td>
<td></td>
<td>DPACK2</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>DPACKX2</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>SADDSUB</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>SADDSUB2</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>SADDSUB2</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>SHFL3</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>SUB</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
### Dual Arithmetic

<table>
<thead>
<tr>
<th>And</th>
<th>Mem Access</th>
<th>Mem Access</th>
</tr>
</thead>
<tbody>
<tr>
<td>ADD2</td>
<td>LDDW</td>
<td></td>
</tr>
<tr>
<td>SUB2</td>
<td>LDNW</td>
<td></td>
</tr>
<tr>
<td>ANDN</td>
<td>LDNDW</td>
<td></td>
</tr>
<tr>
<td>OR</td>
<td>STDW</td>
<td></td>
</tr>
<tr>
<td>XOR</td>
<td>STNDW</td>
<td></td>
</tr>
<tr>
<td>Address Calc.</td>
<td>Load Constant</td>
<td>Load Constant</td>
</tr>
<tr>
<td>ADDAD</td>
<td>MVK (5-bit)</td>
<td></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Bitwise Logical</th>
<th></th>
<th>Bitwise Logical</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>ANDN</td>
<td></td>
<td>ANDN</td>
<td></td>
</tr>
<tr>
<td>OR</td>
<td></td>
<td>SHLMB</td>
<td></td>
</tr>
<tr>
<td>XOR</td>
<td></td>
<td>SHRMB</td>
<td></td>
</tr>
<tr>
<td>None</td>
<td></td>
<td>Move</td>
<td></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Load Constant</th>
<th></th>
<th>Load Constant</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>AVG2</td>
<td></td>
<td>AVG4</td>
<td></td>
</tr>
<tr>
<td>AVG4</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

### Multiplies

<table>
<thead>
<tr><th>Multiplies</th></tr>
</thead>
<tbody>
<tr><td>MPY2</td></tr>
<tr><td>MPY2R</td></tr>
<tr><td>MPY3</td></tr>
<tr><td>MPY32</td></tr>
<tr><td>MPY32U</td></tr>
<tr><td>MPY32US</td></tr>
<tr><td>MPY32US</td></tr>
<tr><td>SMPY32</td></tr>
<tr><td>XORMPY</td></tr>
</tbody>
</table>
**“MAC” Instructions**

**DOTP2 with LDDW**

![DOTP2 with LDDW register data-flow diagram]

**Block Real FIR Example (DDOTPL2)**

```c
for (i = 0; i < ndata; i++) {
    sum = 0;
    for (j = 0; j < ncoef; j++) {
        sum = sum + (d[i+j] * c[j]);
    }
    y[i] = sum;
}
```

**Key Points:**
- Four 16x16 multiplies in each .M unit every cycle
- Adds up to 8 MACs/cycle, or 8000 MMACS
- Bottom Line: Two loop iterations for the price of one
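
To make the packed-data idea concrete, here is a portable C model of a two-way 16x16 dot product on packed 32-bit words. It is a sketch only: the helper names `pack2`, `dotp2_model` and `dot_packed` are hypothetical, and the model assumes the high halfword is multiplied with the high halfword and the low with the low. On the target, the TI compiler exposes the real instruction through an intrinsic (`_dotp2`); the model below does not depend on it.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word (hi in bits 31..16). */
uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Model of a packed dot product: a_hi*b_hi + a_lo*b_lo (two MACs at once). */
int32_t dotp2_model(uint32_t a, uint32_t b)
{
    int16_t a_hi = (int16_t)(a >> 16), a_lo = (int16_t)(a & 0xFFFFu);
    int16_t b_hi = (int16_t)(b >> 16), b_lo = (int16_t)(b & 0xFFFFu);
    return (int32_t)a_hi * b_hi + (int32_t)a_lo * b_lo;
}

/* Dot product over packed words: each iteration performs two 16x16 MACs. */
int32_t dot_packed(const uint32_t *c, const uint32_t *x, int n_words)
{
    assert(n_words >= 0);
    int32_t sum = 0;
    for (int i = 0; i < n_words; i++) {
        sum += dotp2_model(c[i], x[i]);
    }
    return sum;
}
```

Loading 64 bits at a time (LDDW) feeds two such words per load, which is how the MAC count per cycle multiplies up.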
Complex Multiply (CMPY)

A0 = r1 : i1 (packed 16-bit real and imaginary parts)

A1 = r2 : i2

`CMPY A0, A1, A3:A2` produces `r1*r2 - i1*i2 : i1*r2 + r1*i2` (two 32-bit results) in a single .M unit

- Four 16x16 multiplies per .M unit
- Using two CMPYs, a total of eight 16x16 multiplies per cycle
- Floating-point version (CMPYSP) uses:
  - 64-bit inputs (register pair)
  - 128-bit packed products (register quad)
  - You then need to add/subtract the products to get the final result
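
The arithmetic performed by the fixed-point complex multiply can be checked with a short portable model. This is a sketch, not the instruction itself: the type `cplx32` and the functions `cmpy_model` and `cmpy_array` are hypothetical names, and saturation or rounding behaviour of the real hardware is not modeled.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int32_t re; int32_t im; } cplx32;   /* models the A3:A2 result pair */

/* (r1 + j*i1) * (r2 + j*i2) using four 16x16 multiplies. */
cplx32 cmpy_model(int16_t r1, int16_t i1, int16_t r2, int16_t i2)
{
    cplx32 out;
    out.re = (int32_t)r1 * r2 - (int32_t)i1 * i2;   /* real part      */
    out.im = (int32_t)i1 * r2 + (int32_t)r1 * i2;   /* imaginary part */
    return out;
}

/* Element-wise complex multiply of interleaved (re, im) 16-bit arrays. */
void cmpy_array(const int16_t *a, const int16_t *b, cplx32 *out, int n)
{
    assert(n >= 0);
    for (int k = 0; k < n; k++) {
        out[k] = cmpy_model(a[2 * k], a[2 * k + 1], b[2 * k], b[2 * k + 1]);
    }
}
```

For example, (1 + 2j)(3 + 4j) gives a real part of 1*3 - 2*4 and an imaginary part of 2*3 + 1*4.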
C66x – “MAC” Instructions

C66x: QMPY32 (fixed), QMPYSP (float)

<table>
<thead>
<tr>
<th>A3:A2:A1:A0</th>
<th>c3 : c2 : c1 : c0</th>
</tr>
</thead>
<tbody>
<tr>
<td>A7:A6:A5:A4</td>
<td>x3 : x2 : x1 : x0</td>
</tr>
<tr>
<td>A11:A10:A9:A8</td>
<td>c3*x3 : c2*x2 : c1*x1 : c0*x0</td>
</tr>
</tbody>
</table>

32-bits 32-bits 32-bits 32-bits

QMPY32 or QMPYSP

single .M unit

- Four 32x32 multiplies per .M unit
- Total of eight 32x32 multiplies per cycle
- Fixed or floating-point versions
- Output is 128-bit packed result (register quad)

C66x: Complex Matrix Multiply (CMATMPY)

\[
\begin{bmatrix} M9 & M8 \end{bmatrix} =
\begin{bmatrix} M7 & M6 \end{bmatrix} \times
\begin{bmatrix}
M3 & M2 \\
M1 & M0
\end{bmatrix}
\]

\[
\begin{aligned}
M9 &= M7 \times M3 + M6 \times M1 \\
M8 &= M7 \times M2 + M6 \times M0
\end{aligned}
\]

Where Mx represents a packed 16-bit complex number

- Single .M unit implements complex matrix multiply using 16 MACs (all in 1 cycle)
- Achieve 32 16x16 multiplies per cycle using both .M units
Hardware Pipeline

Pipeline Phases

- Program Fetch: PG, PS, PW, PR
- Decode: DP, DC
- Execute: E1

![Pipeline diagram: each new fetch packet enters PG one cycle after the previous one; the pipeline is full once a packet reaches E1]
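
The staggered pattern of pipeline phases can be reasoned about with two small helper functions. This is a simplified sketch that assumes seven phases (PG, PS, PW, PR, DP, DC, E1), one new packet entering per cycle, and no stalls; the names `pipeline_cycles` and `phase_at` are hypothetical.

```c
#include <assert.h>

enum { PIPE_PHASES = 7 };   /* PG, PS, PW, PR, DP, DC, E1 */

/* Cycles until the last of n_packets finishes E1 (no stalls assumed). */
int pipeline_cycles(int n_packets)
{
    assert(n_packets >= 1);
    return PIPE_PHASES + (n_packets - 1);
}

/* Phase index (0 = PG ... 6 = E1) of packet p during cycle c,
 * or -1 if the packet is not in the pipeline during that cycle. */
int phase_at(int p, int c)
{
    int phase = c - p;      /* packet p enters PG in cycle p */
    return (phase >= 0 && phase < PIPE_PHASES) ? phase : -1;
}
```

The first result costs the full pipeline depth; after that, one packet completes every cycle, which is the point of keeping the pipeline full.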
Software Pipelining

**Instruction Delays**

All 'C64x instructions require only one cycle to execute, but some results are delayed...

<table>
<thead>
<tr>
<th>Description</th>
<th>Instructions</th>
<th>Delay Slots</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single Cycle</td>
<td>All instructions except ...</td>
<td>0</td>
</tr>
<tr>
<td>Multiply</td>
<td>MPY, SMPY</td>
<td>1</td>
</tr>
<tr>
<td>Load</td>
<td>LDB, LDH, LDW</td>
<td>4</td>
</tr>
<tr>
<td>Branch</td>
<td>B</td>
<td>5</td>
</tr>
</tbody>
</table>

Would This Code Work As Is ??

\[
y = \sum_{n = 1}^{40} c_n \cdot x_n
\]

Register File A

- A0: \(c_n\)
- A1: \(x_n\)
- A2: cnt
- A3: prd
- A4: sum
- A5: *c
- A6: *x
- A7: *y
- A15 or A31: 32-bits

MVK .S1 40, A2
loop:
LDH .D1 *A5++, A0
LDH .D1 *A6++, A1
MPY .M1 A0, A1, A3
ADD .L1 A4, A3, A4
SUB .S1 A2, 1, A2
[A2] B .S1 loop
STW .D1 A4, *A7

- Need to add NOPs to get this code to work properly...
- NOP = “Not Optimized Properly”
- How many instructions can this CPU execute every cycle?
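
To get a feel for what the NOPs cost, the delay-slot table can be turned into a small calculation. The sketch below is deliberately pessimistic: it assumes every instruction is followed by NOPs for all of its delay slots and that nothing runs in parallel. The type `instr`, the table `sop_body` and the function `serialized_cycles` are hypothetical names for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { const char *mnemonic; int delay_slots; } instr;

/* Loop body of the sum-of-products example, with delays from the table above. */
static const instr sop_body[] = {
    {"LDH", 4}, {"LDH", 4}, {"MPY", 1}, {"ADD", 0}, {"SUB", 0}, {"B", 5},
};

/* Cycles for a fully serialized sequence: one issue cycle plus its NOPs each. */
int serialized_cycles(const instr *seq, size_t n)
{
    assert(seq != NULL);
    int cycles = 0;
    for (size_t i = 0; i < n; i++) {
        cycles += 1 + seq[i].delay_slots;
    }
    return cycles;
}
```

Under these assumptions one loop iteration takes 20 cycles, almost all of them NOPs. Filling those empty slots with useful work is what software pipelining is for.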
Software Pipelined Algorithm

<table>
<thead>
<tr>
<th></th>
<th>PROLOG</th>
<th>LOOP</th>
</tr>
</thead>
<tbody>
<tr>
<td>.L1</td>
<td></td>
<td></td>
</tr>
<tr>
<td>.L2</td>
<td></td>
<td></td>
</tr>
<tr>
<td>.S1</td>
<td>B</td>
<td>add</td>
</tr>
<tr>
<td>.S2</td>
<td>sub</td>
<td></td>
</tr>
<tr>
<td>.M1</td>
<td>mpy</td>
<td></td>
</tr>
<tr>
<td>.M2</td>
<td>mpyh</td>
<td></td>
</tr>
<tr>
<td>.D1</td>
<td>ldw1</td>
<td></td>
</tr>
<tr>
<td>.D2</td>
<td>ldw2</td>
<td></td>
</tr>
</tbody>
</table>

Software Pipelined ‘C6x Code

```
c0:            ldw   .D1   *A4++,A5
     ||        ldw   .D2   *B4++,B5
     || [B0]   sub   .S2   B0,1,B0

c1:            ldw   .D1   *A4++,A5
     ||        ldw   .D2   *B4++,B5
     || [B0]   sub   .S2   B0,1,B0

c2_3_4:        ldw   .D1   *A4++,A5
     ||        ldw   .D2   *B4++,B5
     || [B0]   sub   .S2   B0,1,B0
     || [B0]   B     .S1   loop
     .
     .
     .
c5_6:          ldw   .D1   *A4++,A5
     ||        ldw   .D2   *B4++,B5
     || [B0]   sub   .S2   B0,1,B0
     ||        mpy   .M1x  A5,B5,A6
     ||        mpyh  .M2x  A5,B5,B6
```
Chapter Quiz

1. Name the four functional units and types of instructions they execute:

2. How many 16x16 MACs can a C674x CPU perform in 1 cycle? C66x?

3. Where are CPU operands stored and how do they get there?

4. What is the purpose of a hardware pipeline?

5. What is the purpose of s/w pipelining, which tool does this for you?
Quiz - Answers

Chapter Quiz

1. Name the four functional units and types of instructions they execute:
   • M unit – Multiplies (fixed, float)
   • L unit – ALU – arithmetic and logical operations
   • S unit – Branches and shifts
   • D unit – Data – loads and stores

2. How many 16x16 MACs can a C674x CPU perform in 1 cycle? C66x ?
   • C674x – 8 MACs/cycle, C66x – 32 MACs/cycle

3. Where are CPU operands stored and how do they get there?
   • Register Files (A and B), Load (LDx) data from memory

4. What is the purpose of a hardware pipeline?
   • To break up instruction execution enough to reach min cycle count thereby allowing single cycle execution when pipeline is FULL

5. What is the purpose of s/w pipelining, which tool does this for you?
   • Maximize performance – use as many functional units as possible in every cycle, the COMPILER/OPTIMIZER performs SW pipelining
C and System Optimizations

Introduction

In this chapter, we will cover the basics of optimizing C code and some useful tips on system optimization. Also included here are some other system-wide optimizations you can take advantage of in your own application – if they are necessary.

Outline

- Describe how to configure and use the various compiler/optimizer options
- Discuss the key techniques to increase performance or reduce code size
- Demonstrate how to use optimized libraries
- Overview key system optimizations
- Lab 13 – Use FIR algo on audio data, optimize using the compiler, benchmark
Module Topics

C and System Optimizations

Module Topics
Introduction – “Optimal” and “Optimization”
C Compiler and Optimizer
“Debug” vs. “Optimized”
Levels of Optimization
Build Configurations
Code Space Optimization (–ms)
File and Function Specific Options
Coding Guidelines
Data Types and Alignment
Data Types
Data Alignment
Using DATA_ALIGN
Upcoming Changes – ELF vs. COFF
Restricting Memory Dependencies (Aliasing)
Access Hardware Features – Using Intrinsics
Give Compiler MORE Information
Pragma – Unroll()
Pragma – MUST_ITERATE()
Keyword - Volatile
Setting MAX interrupt Latency (-mi option)
Compiler Directive - _nassert()
Using Optimized Libraries
Libraries – Download and Support
System Optimizations
BIOS Libraries
Custom Sections
Use Cache
Use EDMA
System Architecture – SCR
Chapter Quiz
Quiz - Answers
Lab 13 – C Optimizations
Lab 13 – C Optimizations – Procedure
PART A – Goals and Using Compiler Options
Determine Goals and CPU Min
Using Debug Configuration (–g, NO opt)
Using Release Configuration (–o2, no –g)
Using "Opt" Configuration
Part B – Code Tuning
Part C – Minimizing Code Size (–ms)
Part D – Using DSPLib
Conclusion
Additional Information
Notes
Introduction – “Optimal” and “Optimization”

What Does “Optimal” Mean?
- Every user will have a different definition of “optimal”:
  
  “When my processing keeps up with my I/O (real-time)…”

  “When my algo achieves theoretical minimum…”

  “When I’ve worked on it for 2 weeks straight, it is FAST ENOUGH…”

  “When my boss says GOOD ENOUGH…”

  “After I have applied all known (by me) optimization techniques, I guess this is as good as it gets…”

Know Your Goal and Your Limits…

\[ Y = \sum_{i=1}^{\text{count}} \text{coeff}_i \times x_i \]

for (i = 0; i < count; i++){
    Y += coeff[i] * x[i];
}

Goals:
- A typical goal of any system’s algo is to meet real-time
- You might also want to approach or achieve “CPU Min” in order to maximize #channels processed

CPU Min (the “limit”):
- The minimum # cycles the algo takes based on architectural limits (e.g. data size, #loads, math operations required)

Real-time vs. CPU Min
- Often, meeting real-time only requires setting a few compiler options (easy)
- However, achieving “CPU Min” often requires extensive knowledge of the architecture (harder, requires more time)
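
A first-order estimate of "CPU Min" is simply the total number of MACs divided by how many MACs the hardware can sustain per cycle. The sketch below is an assumption-laden back-of-the-envelope model: the function name `cpu_min_cycles` is hypothetical, and loop overhead, load bandwidth and pipeline prolog/epilog are all ignored.

```c
#include <assert.h>

/* Lower bound on cycles: ceil(total_macs / macs_per_cycle). */
long cpu_min_cycles(long total_macs, long macs_per_cycle)
{
    assert(total_macs >= 0 && macs_per_cycle > 0);
    return (total_macs + macs_per_cycle - 1) / macs_per_cycle;   /* ceiling division */
}
```

For example, a FIR producing 256 outputs from 64 taps needs 256 * 64 MACs; if one assumes 4 sustained MACs per cycle, the bound is 4096 cycles. The sustained rate is the assumption that matters most, and it depends on the data size and the number of loads the algorithm requires.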
Optimization – Intro

- **Optimization is:**
  Continuous process of refinement in which code being optimized executes faster and takes fewer cycles, until a specific objective is achieved (real-time execution).

- **When is it “fast enough”?** Depends on user’s definition.

- **Compiler’s personality?** *Paranoid.* Will ALWAYS make decisions to give you the RIGHT answer vs. the best optimization (unless told otherwise)

- **Bottom Line:**
  - Learn as many optimization techniques as possible – try them all (if necessary)
  - This is the GOAL of this chapter...

- **Keep in mind:** mileage may vary (highly system/arch dependent)

So, let’s jump right in...
C Compiler and Optimizer

“Debug” vs. “Optimized”

“Debug” vs. “Optimized” – Benchmarks

FIR

```
for (j = 0; j < nr; j++) {
    sum = 0;
    for (i = 0; i < nh; i++)
        sum += x[i + j] * h[i];
    r[j] = sum >> 15;
}
```
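For reference, the FIR loop above can be written as a complete C function. This is a minimal sketch and not the lab's `cfir()`: the name `fir_q15`, the `int16_t` types and the Q15 reading of the `>> 15` shift are assumptions made for illustration.

```c
#include <stdint.h>

/* Block FIR: r[j] = (sum over i of x[i + j] * h[i]) >> 15.
 * x must hold at least nr + nh - 1 samples.
 * fir_q15 is a hypothetical name; types are assumed to be 16-bit. */
void fir_q15(const int16_t *x, const int16_t *h, int16_t *r, int nh, int nr)
{
    int i, j;

    for (j = 0; j < nr; j++) {
        int32_t sum = 0;                 /* 32-bit accumulator */
        for (i = 0; i < nh; i++)
            sum += x[i + j] * h[i];      /* 16 x 16 -> 32-bit multiply-accumulate */
        r[j] = (int16_t)(sum >> 15);     /* scale the product back to 16 bits */
    }
}
```

With `nr = 256` and `nh = 64` this is the "FIR (256, 64)" case benchmarked in the table below.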

Dot Product

```
for (i = 0; i < count; i++)
    Y += coeff[i] * x[i];
```

Benchmarks:

<table>
<thead>
<tr>
<th>Algo</th>
<th>FIR (256, 64)</th>
<th>DOTP (256-term)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Debug (no opt, –g)</td>
<td>817K</td>
<td>4109</td>
</tr>
<tr>
<td>“Opt” (-o3, no –g)</td>
<td>18K</td>
<td>42</td>
</tr>
<tr>
<td>Add’l pragmas (DSPLib)</td>
<td>7K</td>
<td>42</td>
</tr>
<tr>
<td>CPU Min</td>
<td>4096</td>
<td>42</td>
</tr>
</tbody>
</table>

- Debug – get your code LOGICALLY correct first (no optimization)
- “Opt” – increase performance using compiler options (easier)
- “CPU Min” – it depends. Could require extensive time...

“Debug” vs. “Optimized” – Environments

“Debug” (–g, NO opt): Get Code Logically Correct

- Provides the best “debug” environment with full symbolic support, no “code motion”, easy to single step
- Code is NOT optimized – i.e. very poor performance
- Create test vectors on FUNCTION boundaries (use same vectors as Opt Env)

“Opt” (–o3, no –g): Increase Performance

- Higher levels of “opt” result in code motion – functions become “black boxes” (hence the use of FXN vectors)
- Optimizer can find “errors” in your code (use volatile)
- Highly optimized code (can reach “CPU Min” w/some algos)
- Each level of optimization increases optimizer’s “scope”...
Levels of Optimization

[Figure: FILE1.C contains several functions, each made up of blocks of code; FILE2.C contains another function. Each optimization level widens the optimizer's scope.]

- **-o0, -o1: LOCAL** single block
- **-o2: FUNCTION** across blocks
- **-o3: FILE** across functions
- **-pm -o3: PROGRAM** across files

Increasing levels of opt:
- ↑ scope, code motion
- ↑ build times
- ↓ visibility

**Program Level Optimization (-pm)**

Using -pm

Right-click on your Project and select:
Build Options...

Throttling -pm with -op_n

- -pm is **critical** in compiling for maximum performance (requires use of -o3)
- -pm creates a temp.c file which includes all C source files, thus giving the optimizer a program-level optimization context
- -op_n describes a program's external references (-op2 means NO ext'l refs) (-op is what "throttles" -pm …)
- Be careful with -op2 (no ext'l refs). BIOS scheduler calls are "external" to C
Build Configurations

Two Default Configurations

- For new projects, CCS always creates two default build configurations:

  ![Set Active](image1)

- "Debug" Options (OK for "Debug" Environment)

  ![Configuration](image2)

- "Release" Options (Ok for "first step" optimization)

  ![Configuration](image3)

Note: These are simply "sets" or "containers" for build options. If you set a path in one, it does NOT copy itself to the other (e.g. includes). Also, you can make your own!

Code Space Optimization (–ms)

Minimizing Space Option (–ms)

- The table shows the basic strategy employed by compiler and Asm-Opt when using the –ms options.
- % denotes how much you “care” about each:

<table>
<thead>
<tr>
<th>-ms level</th>
<th>Performance (%)</th>
<th>Code Size (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>none</td>
<td>100</td>
<td>0</td>
</tr>
<tr>
<td>-ms0</td>
<td>90</td>
<td>10</td>
</tr>
<tr>
<td>-ms1</td>
<td>60</td>
<td>40</td>
</tr>
<tr>
<td>-ms2</td>
<td>20</td>
<td>80</td>
</tr>
<tr>
<td>-ms3</td>
<td>0</td>
<td>100</td>
</tr>
</tbody>
</table>

- Any –ms will invoke compressed opcodes (16 bit)
- User must use the optimizer (-o) with –ms for the greatest effect. Suggestion: use on “init” code.
Additional Code Space Options

- Use program level optimization (-pm)
- Try -mh to reduce prolog/epilog code
- Use -oi0 to disable auto-inlining

- Inlining inserts a copy of a function into a C file rather than calling (i.e. branching) to it
- Auto-inlining is a compiler feature whereby small functions are automatically inlined
- Auto-inlining is enabled for small functions by -o3
- The -oi*size* option sets the size threshold for functions to be automatically inlined
  - size = function size * # of times inlined
  - Use -on1 or -on2 to report size
- Force function inlining with the `inline` keyword
  - `inline void func(void);`
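As a small illustration of the `inline` keyword, here is a sketch with two hypothetical functions (`scale_by_two` and `sum_scaled` are not from the workshop code). Declaring the helper `static inline` keeps it local to the file, so no separate external definition is needed if the compiler decides not to inline it.

```c
/* Hypothetical helper: the body is smaller than the cost of a call,
 * which makes it a natural candidate for inlining. */
static inline int scale_by_two(int v)
{
    return v * 2;
}

/* The call below sits inside a loop; once inlined, the loop no longer
 * contains a function call (see the coding guidelines later on). */
int sum_scaled(const int *a, int n)
{
    int i, sum = 0;

    for (i = 0; i < n; i++)
        sum += scale_by_two(a[i]);
    return sum;
}
```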

File and Function Specific Options

- Right-click on file and select “Build Options”
- Apply settings and click OK
- Little triangle ▲ on file denotes file-specific options applied

- Can also use FUNCTION-specific options via a pragma:
  ```c
  #pragma FUNCTION_OPTIONS(func, "additional options");
  ```

Note: most used are -o, -ms
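A function-specific option might look like the sketch below. Treat the details as assumptions: `init_tables` is a hypothetical function, and the option string `"--opt_for_space=3"` (the long form of -ms3) is one plausible choice for "init" code that runs once. The `#ifdef` guard relies on the TI compiler's predefined `__TI_COMPILER_VERSION__` macro so that other compilers skip the pragma.

```c
#ifdef __TI_COMPILER_VERSION__
/* Assumed usage: apply extra options to one function only.
 * Here the hypothetical init routine is compiled for minimum size. */
#pragma FUNCTION_OPTIONS(init_tables, "--opt_for_space=3");
#endif

/* Hypothetical one-time initialization routine. */
void init_tables(short *table, int n)
{
    int i;

    for (i = 0; i < n; i++)
        table[i] = 0;
}
```

Check the Compiler User's Guide for the exact option spellings your CGT version accepts.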
Coding Guidelines

Programming the ‘C6000

<table>
<thead>
<tr>
<th>Source</th>
<th>Efficiency*</th>
<th>Effort</th>
</tr>
</thead>
<tbody>
<tr>
<td>C</td>
<td>80 - 100%</td>
<td>Low</td>
</tr>
<tr>
<td>C ++</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Linear ASM</td>
<td>95 - 100%</td>
<td>Med</td>
</tr>
<tr>
<td>ASM</td>
<td>100%</td>
<td>High</td>
</tr>
</tbody>
</table>

Basic C Coding Guidelines

- In order for the compiler to create the most efficient code, it is best to follow these guidelines:

1. Use Minimum Complexity Code
   - If a human can’t understand and read it easily, neither can the compiler
   - Break up larger “logic” into smaller loops/pieces

2. No function calls in tight loops
   - The compiler cannot create a pipelined loop with fn calls present

3. Keep loops relatively small
   - Helps compiler generate tighter, more efficient pipelined loops

4. Create test vectors at FUNCTION boundaries
   - When optimization is turned on, it is nearly impossible to single-step inside functions

5. Look at the assembly file – SPLOOP ?
   - If curious, look at the disassembly. Was SPLOOP/LDDW used or not? Why?
   - Assembly optimizer generates comments as to what happened in the loop and why
   - Use –mw (verbose pipeline info), -os (interlist), -k (keep .asm file) to see all info
# Data Types and Alignment

## Data Types

### ‘C6000 C Data Types

<table>
<thead>
<tr>
<th>Type</th>
<th>Bits</th>
<th>Representation</th>
</tr>
</thead>
<tbody>
<tr>
<td>char</td>
<td>8</td>
<td>ASCII</td>
</tr>
<tr>
<td>short</td>
<td>16</td>
<td>Binary, 2's complement</td>
</tr>
<tr>
<td>int</td>
<td>32</td>
<td>Binary, 2's complement</td>
</tr>
<tr>
<td>long</td>
<td>40*</td>
<td>Binary, 2's complement</td>
</tr>
<tr>
<td>long long</td>
<td>64</td>
<td>Binary, 2's complement</td>
</tr>
<tr>
<td>float</td>
<td>32</td>
<td>IEEE 32-bit</td>
</tr>
<tr>
<td>double</td>
<td>64</td>
<td>IEEE 64-bit</td>
</tr>
<tr>
<td>long double</td>
<td>64</td>
<td>IEEE 64-bit</td>
</tr>
<tr>
<td>pointers</td>
<td>32</td>
<td>Binary</td>
</tr>
</tbody>
</table>

\* `long` is 40 bits under COFF; it is 32 bits under EABI (ELF)

- Device ALWAYS accesses data on aligned boundaries
Data Alignment

Data Alignment in Memory

```c
/* DataType.C */
char z = 1;
short x = 7;
int y;
double w;

int child(short a, int b);   /* assumed prototype; child() is defined elsewhere */

void main (void)
{
    y = child(x, 5);
}
```

Byte (LDB) Boundaries

<table>
<thead>
<tr>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>z</td>
</tr>
<tr>
<td>1</td>
<td></td>
</tr>
<tr>
<td>2</td>
<td></td>
</tr>
<tr>
<td>3</td>
<td></td>
</tr>
<tr>
<td>4</td>
<td></td>
</tr>
<tr>
<td>5</td>
<td></td>
</tr>
<tr>
<td>6</td>
<td></td>
</tr>
<tr>
<td>7</td>
<td></td>
</tr>
<tr>
<td>8</td>
<td></td>
</tr>
<tr>
<td>9</td>
<td></td>
</tr>
</tbody>
</table>

Hint: all single data items are aligned on ‘type’ boundaries...

Alignment of Structures

- **Structures** are aligned to the largest type they contain
- For **data space efficiency**, start with larger types first to minimize holes
- **Arrays within structures** are only aligned to their typesize
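The "larger types first" advice can be checked with `sizeof`. The sketch below uses two hypothetical structs with the same members in different orders; exact sizes depend on the target ABI, so the code only relies on the ordered version being no larger.

```c
#include <stddef.h>

/* Small types first: the compiler inserts padding ("holes") before
 * each larger member so that it lands on its own type boundary. */
struct small_first {
    char   c;
    double d;
    short  s;
    int    i;
};

/* Same members, largest type first: members pack with fewer holes. */
struct large_first {
    double d;
    int    i;
    short  s;
    char   c;
};

/* Bytes saved by reordering (0 if the ABI happens to pack both the same). */
size_t bytes_saved(void)
{
    return sizeof(struct small_first) - sizeof(struct large_first);
}
```

Printing `sizeof` for both structs on your own target is a quick way to see how many bytes the holes cost.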

Aligning arrays within structs...
Forcing Alignment within Structures

While arrays are aligned to 32 or 64-bit boundaries, arrays within structures are not, which might affect optimization.

Here are a couple ideas to force arrays to 8-byte alignment:

1. Use dummy variable to force alignment

```c
typedef struct ex1_t{
    short b;
    long long dummy1;
    short a[40];
} ex1;
```

2. Use unions

```c
typedef union align_u{
    short a2[80];
    long long a8[10];
} align_t;   /* typedef name was missing; ex2 below refers to align_t */
```

```c
typedef struct ex2_t{
    short b;
    align_t a3;
} ex2;
```

How can we force alignments of scalars or structs?

Using DATA_ALIGN

```
#pragma DATA_ALIGN(x, 4)
short z;
short x;
```

Data Align pragma can align to any 2^n boundary

- They would have been placed here ...
- but pragma forces them to next 4 byte (int) boundary
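To use DATA_ALIGN in code that must also build with other compilers, one option is to guard it, as in this sketch. The buffer name `alignedBuf` and the helper `is_aligned8` are hypothetical; the non-TI branch uses the C11 `_Alignas` specifier as a stand-in.

```c
#include <stdint.h>

#ifdef __TI_COMPILER_VERSION__
#pragma DATA_ALIGN(alignedBuf, 8);       /* TI pragma: force 8-byte alignment */
short alignedBuf[64];
#else
_Alignas(8) short alignedBuf[64];        /* C11 equivalent for other compilers */
#endif

/* Returns 1 if p sits on an 8-byte boundary (lowest 3 address bits are 0). */
int is_aligned8(const void *p)
{
    return ((uintptr_t)p & 0x7u) == 0;
}
```

An 8-byte aligned array of shorts is what allows the compiler to consider double-word loads (LDDW) for it.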
## Upcoming Changes – ELF vs. COFF

### EABI : ELF ABI

- Starting with v7.2.0 the C6000 Code Gen Tools (CGT) will begin shipping **two versions of the Linker**:
  1. COFF: Binary file-format used by TI tools for over a decade
  2. ELF: New binary file-format which provides additional features like dynamic/relocatable linking

- You can **choose either format**
  - v7.3.x default may become ELF (prior to this, choose ELF for new features)
  - Continue using COFF for projects already in progress using `--abi=coffabi` compiler option (support will continue for a long time)

- **Formats are not compatible**
  - Your program's binary files (.obj, .lib) must all be built with the same format
  - If building libraries used for multiple projects, we recommend building two libraries – one with each format

- **Migration Issues**
  - EABI long's are 32 bits; new TI type (__int40_t) created to support 40-bit data
  - COFF adds a leading underscore to symbol names, but the EABI does not
Restricting Memory Dependencies (Aliasing)

What is Aliasing?

```c
int x;
int *p;

main()
{
    p = &x;

    x = 5;
    *p = 8;
}
```

One memory location, two ways to access it: `x` and `*p`

Note: This is a very simple alias example. The compiler doesn't have any problem disambiguating an alias condition like this.

Aliasing?

```c
void fcn(*in, *out)
{
    LDW *in++, A0
    ADD A0, 4, A1
    STW A1, *out++
}
```

- Intent: no aliasing (ASM code?)
- `*in` and `*out` point to different memory locations
- Reads are not the problem, WRITES are. `*out` COULD point anywhere
- Compiler is paranoid – it assumes aliasing unless told otherwise.
  ASM code is the key (pipelining)
- Use `restrict` keyword (more soon...)
Restricting Memory Dependencies (Aliasing)

Aliasing?

What happens if the function is called like this?

```
fcn(*myVector, *myVector+1)
```

```c
void fcn(*in, *out)
{
    LDW *in++, A0
    ADD A0, 4, A1
    STW A1, *out++
}
```

- Definitely Aliased pointers
- *in and *out could point to the same address
- But how does the compiler know?
- If you tell the compiler there is no aliasing, this code will break (LDs in software pipelined loop)
- One solution is to “restrict” the writes - *out (see next slide...)
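The effect of the aliased call can be seen in plain C. The sketch below uses a hypothetical function `add4` that does what the slide's pseudo-code suggests (load, add 4, store) over `n` elements.

```c
/* Hypothetical C version of the slide's fcn(): out[i] = in[i] + 4. */
void add4(const int *in, int *out, int n)
{
    int i;

    for (i = 0; i < n; i++)
        out[i] = in[i] + 4;   /* this store may feed the NEXT iteration's load */
}
```

Called with separate buffers, every output is simply input + 4. Called as `add4(v, v + 1, 3)` on `v = {0, 0, 0, 0}`, each store lands on the element the next iteration reads, so `v` ends up as `{0, 4, 8, 12}`. The compiler must preserve that behavior unless told the pointers never overlap, which is why it cannot freely hoist loads ahead of stores in a pipelined loop.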

Alias Solutions

1. **Compiler solves most aliasing on its own.**
   - If in doubt, the result will be correct even if the most optimal method won’t be used

2. **Program Level Optimization (-pm -o3)**
   - Provide compiler visibility to entire program

3. **No Bad Aliasing Option (-mt)**
   - Tell the compiler that no bad aliases exist in entire project
   - See Compiler User’s Guide for definition of “bad”
   - Previous weighted vector summation example performance was increased by 5x (by using -mt)

4. **“Restrict” Keyword (ANSI C)**
   - Similar to -mt, but on an array-level basis
   - `void fcn(short * in, short * restrict out)`
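A complete function using `restrict` might look like this sketch. The name `add4_restrict` and the loop body are hypothetical. The keyword is a promise from you to the compiler: nothing written through `out` is read through `in`. Calling it with overlapping buffers breaks that promise and the result is undefined.

```c
/* restrict (C99) promises that in and out never refer to the same memory,
 * so loads for later iterations may be issued before earlier stores. */
void add4_restrict(const short * restrict in, short * restrict out, int n)
{
    int i;

    for (i = 0; i < n; i++)
        out[i] = (short)(in[i] + 4);
}
```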

Along with these suggestions, we highly recommend you check out:
- TMS320C6000 Programmer’s Guide
- TMS320C6000 Optimizing C Compiler User’s Guide
Access Hardware Features – Using Intrinsics

Comparing the Coding Methods

C Code
y = a * b;

C Code Using Intrinsics
y = _mpyh (a, b);

Intrinsics...
- Can use C variable names instead of register names
- Are compatible with the C environment
- Adhere to C's function call syntax
- Do NOT use in-line assembly!

Intrinsics - Examples

- _add2() _sadd()
- _clr() _set()
- _ext/u() _smpy()
- _lmbd() _smpyh()
- _mpy() _sshl()
- _mpyh() _ssub()
- _mpylh() _subc()
- _mpyhl() _sub2()
- _nassert() _sat()
- _norm()

Refer to C Compiler User’s Guide for more information

- Think of intrinsic functions as a specialized function library written by TI
- #include <c6x.h> has prototypes for all the intrinsic functions
- Intrinsics are great for accessing the hardware functionality which is unsupported by the C language
- To run your C code on another compiler, download intrinsic C-source:
  spra616.zip
- int x, y, z;
  z = _lmbd(x, y);
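To see what an intrinsic such as `_mpyh()` computes, it can help to write a plain C model of it. The sketch below is a hypothetical model (`mpyh_model` is not a TI function): `_mpyh()` multiplies the signed upper 16 bits of its two 32-bit operands, which is not the same as the full `a * b` product.

```c
#include <stdint.h>

/* Hypothetical portable model of _mpyh():
 * signed upper halfword of a times signed upper halfword of b. */
int32_t mpyh_model(int32_t a, int32_t b)
{
    int16_t ah = (int16_t)((uint32_t)a >> 16);   /* upper 16 bits of a */
    int16_t bh = (int16_t)((uint32_t)b >> 16);   /* upper 16 bits of b */

    return (int32_t)ah * bh;
}
```

A model like this is also handy for checking results on a host PC before moving to the intrinsic on the target.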
Give Compiler MORE Information

Provide Compiler with More Insight

1. Program Level Optimization: -pm -op2 -o3
2. #pragma DATA_ALIGN (var, byte align)
3. #pragma UNROLL(# of times to unroll);
4. #pragma MUST_ITERATE(min, max, #factor);
5. Use volatile keyword
6. Set MAX interrupt threshold
7. Use _nassert() to tell optimizer about pointer alignment

- Like -pm, #pragmas are an easy way to pass more information to the compiler
- The compiler uses this information to create “better” code
- #pragmas are ignored by other C compilers if they are not supported

Pragma – Unroll()

3. UNROLL(# of times to unroll)

    #pragma UNROLL(2);
    for(i = 0; i < count ; i++) {
        sum += a[i] * x[i];
    }

- Tells the compiler to unroll the for() loop twice
- The compiler will generate extra code to handle the case that count is odd
- The #pragma must come right before the for() loop
- UNROLL(1) tells the compiler not to unroll a loop
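Unrolling by 2 is roughly equivalent to the hand-written loop below. This is an illustrative sketch (the name `dot_unrolled2` is hypothetical), not the code the compiler actually emits. Note the extra statement that handles an odd `count`.

```c
/* Manual equivalent of unrolling the dot-product loop by 2. */
int dot_unrolled2(const short *a, const short *x, int count)
{
    int i;
    int sum0 = 0, sum1 = 0;              /* two independent accumulators */

    for (i = 0; i + 1 < count; i += 2) {
        sum0 += a[i]     * x[i];
        sum1 += a[i + 1] * x[i + 1];
    }
    if (i < count)                       /* leftover element when count is odd */
        sum0 += a[i] * x[i];

    return sum0 + sum1;
}
```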
Pragma – MUST_ITERATE()

4. MUST_ITERATE(min, max, %factor)

```c
#pragma UNROLL(2);
#pragma MUST_ITERATE(10, 100, 2);
for(i = 0; i < count ; i++) {
    sum += a[i] * x[i];
}
```

- Gives the compiler information about the trip (loop) count
  In the code above, we are promising that:
  ```c
count >= 10, count <= 100, and count % 2 == 0
```
- If you break your promise, you might break your code
- MIN helps with code size and software pipelining
- MULT allows for efficient loop unrolling (and “odd” cases)
- The #pragma must come right before the for() loop
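Because a broken MUST_ITERATE promise can silently break the code, one defensive habit is to check the same conditions with `assert()` in debug builds. The sketch below is an assumption-laden example: `dot_promised` is a hypothetical function, and the pragma is guarded so other compilers skip it.

```c
#include <assert.h>

/* Dot product with a trip-count promise of 10..100, multiple of 2. */
int dot_promised(const short *a, const short *x, int count)
{
    int i, sum = 0;

    /* Debug-build check of exactly what the pragma promises below. */
    assert(count >= 10 && count <= 100 && count % 2 == 0);

#ifdef __TI_COMPILER_VERSION__
#pragma MUST_ITERATE(10, 100, 2);
#endif
    for (i = 0; i < count; i++)
        sum += a[i] * x[i];

    return sum;
}
```

Building the release version with `NDEBUG` defined removes the check, leaving only the pragma.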

Keyword - Volatile

5. Use Volatile Keyword

- If a variable is changed only OUTSIDE the optimizer’s scope (e.g. by hardware or an ISR), the optimizer sees no change and may remove/delete the variable and any associated code.
- For example, let’s say `*ctrl` points to an EMIF address:

```c
int *ctrl;
while (*ctrl == 0);
```

- Use volatile keyword to tell compiler to “leave it alone”:

```c
volatile int *ctrl;
while (*ctrl == 0);
```
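A slightly fuller polling routine is sketched below. The function name `wait_until_set` and the poll limit are additions for illustration (the slide's loop waits forever). The `volatile` qualifier forces a fresh read of `*ctrl` on every pass instead of a single read hoisted out of the loop.

```c
/* Poll a status word until it becomes non-zero or the poll budget runs out.
 * Returns 1 if the word was seen non-zero, 0 on timeout. */
int wait_until_set(volatile int *ctrl, long max_polls)
{
    while (max_polls-- > 0) {
        if (*ctrl != 0)      /* volatile: re-read from memory every time */
            return 1;
    }
    return 0;
}
```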
Setting MAX interrupt Latency (-mi option)

6. Set MAX Interrupt Threshold

- Loops using SPLOOP buffer are interruptible. However, loops that do not meet the criteria for SPLOOP are NOT generally interruptible
- Use the -mi option to set the MAX #cycles that interrupts are disabled (n = 1000 is a good starting number)
- This option does NOT comprehend slow memory cycles or stalls
- #pragma FUNC_INTERRUPT_THRESHOLD(func, threshold);

-mi Details

- -mi0
  - Compiler’s code is not interruptible
  - User must guarantee no interrupts will occur
- -mi1
  - Compiler uses single assignment and never produces a loop less than 6 cycles
- -mi1000 (or any number > 1)
  - Tells the compiler your system must be able to see interrupts every 1000 cycles
- When not using -mi (compiler’s default)
  - Compiler will software pipeline (when using -o2 or -o3)
  - Interrupts are disabled for s/w pipelined loops

Notes:
- Be aware that the compiler is unaware of issues such as memory wait-states, etc.
- Using -mi, the compiler only counts instruction cycles
MUST_ITERATE Example

```c
int dot_prod(short *a, short *b, int n)
{
    int i, sum = 0;
    #pragma MUST_ITERATE ( ,512)
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

- **Provided:**
  - If interrupt threshold was set at 1000 cycles (-mi 1000),
  - Assuming this can compile as a single-cycle loop,
  - And 512 = max# for Loop count (per MUST_ITERATE pragma).

- **Result:**
  - The compiler knows a 1-cycle kernel will execute no more than 512 times which is less than the 1000 cycle interrupt disable option (-mi1000)
  - Uninterruptible loop works fine

- **Verdict:**
  - 3072 cycle loop (512 x 6) can become a 512 cycle loop
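The arithmetic behind this example can be sketched as a rough model. This is a simplification for illustration only, not the compiler's actual decision logic: it multiplies cycles per iteration by the maximum trip count and compares the result with the -mi threshold, ignoring prolog/epilog and memory stalls. Both function names are hypothetical.

```c
/* Worst-case cycles a loop runs with interrupts disabled (simplified). */
long loop_dark_cycles(int cycles_per_iter, int max_trips)
{
    return (long)cycles_per_iter * max_trips;
}

/* 1 if the loop fits under the interrupt threshold, else 0. */
int fits_threshold(int cycles_per_iter, int max_trips, long threshold)
{
    return loop_dark_cycles(cycles_per_iter, max_trips) <= threshold;
}
```

With the numbers above, a 1-cycle kernel gives 1 x 512 = 512 cycles, which fits under 1000, while a 6-cycle loop gives 6 x 512 = 3072 cycles, which does not.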

Compiler Directive - _nassert()

7. _nassert()

```
_nassert(((int)ptr & 0x7) == 0);
```

- Generates no code, evaluated at compile time
- Tells the optimizer that the expression declared with the ‘assert’ function is true
- Above example declares that `ptr` is aligned on an 8-byte boundary (i.e. the lowest 3-bits of the address in ptr are 000b)
- In the next lab, _nassert() is used to tell the compiler that “history” pointer is aligned on an 8-byte boundary
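One way to keep `_nassert()` honest is to map it to a real `assert()` when building with another compiler (or for a host-side test). The macro `ASSUME_ALIGNED8` and the function `sum_history` below are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __TI_COMPILER_VERSION__
/* TI compiler: no code generated, the optimizer just trusts the statement. */
#define ASSUME_ALIGNED8(p) _nassert(((int)(p) & 0x7) == 0)
#else
/* Other compilers: actually check the claim at run time. */
#define ASSUME_ALIGNED8(p) assert(((uintptr_t)(p) & 0x7u) == 0)
#endif

/* Sum n shorts from a buffer claimed to be 8-byte aligned. */
long sum_history(const short *history, int n)
{
    int i;
    long sum = 0;

    ASSUME_ALIGNED8(history);
    for (i = 0; i < n; i++)
        sum += history[i];
    return sum;
}
```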
Using Optimized Libraries

**DSPLIB**
- Optimized **DSP Function Library** for C programmers using C62x/C67x and C64x devices
- These routines are typically used in computationally intensive real-time applications where optimal execution speed is critical.
- By using these routines, you can achieve execution speeds considerably faster than equivalent code written in standard ANSI C language. And these ready-to-use functions can significantly shorten your development time.
- The DSP library features:
  - C-callable
  - Hand-coded assembly-optimized
  - Tested against C model and existing run-time-support functions

**Adaptive filtering**
- DSP_firlms2

**Correlation**
- DSP_autocor

**FFT**
- DSP_bitrev_cplx
- DSP_radix2
- DSP_r4fft
- DSP_fft
- DSP_fft16x16r
- DSP_fft16x16t
- DSP_fft16x32
- DSP_fft32x32
- DSP_fft32x32s
- DSP_ifft16x32
- DSP_ifft32x32

**Matrix**
- DSP_mat_mul
- DSP_mat_trans

**Miscellaneous**
- DSP_bexp
- DSP_blk_eswap16
- DSP_blk_eswap32
- DSP_blk_eswap64
- DSP_blk_move
- DSP_fltoq15
- DSP_minerror
- DSP_q15tofl

**IMGLIB**
- Optimized **Image Function Library** for C programmers using C62x/C67x and C64x devices
- The Image library features:
  - C-callable
  - C and linear assembly src code
  - Tested against C model

**Compression / Decompression**
- IMG_fdct_8x8
- IMG_idct_8x8
- IMG_idct_8x8_12q4
- IMG_mad_8x8
- IMG_mad_16x16
- IMG_mpeg2_vld_intra
- IMG_mpeg2_vld_inter
- IMG_quantize
- IMG_wave_horz
- IMG_wave_vert

**Picture Filtering / Format Conversions**
- IMG_conv_3x3
- IMG_corr_3x3
- IMG_corr_gen
- IMG_errdif_bin
- IMG_median_3x3
- IMG_pix_expand
- IMG_pix_sat
- IMG_yc_demux_be16
- IMG_yc_demux_le16
- IMG_ycbcr422_rgb565
- IMG_boundary
- IMG_dilate_bin
- IMG_erode_bin
- IMG_histogram
- IMG_perimeter
- IMG_sobel
- IMG_thr_gt2max
- IMG_thr_gt2thr
- IMG_thr_le2min
- IMG_thr_le2thr
FastRTS (C67x)

- Optimized floating-point math function library for C programmers using TMS320C67x devices
- Includes all floating-point math routines currently in existing C6000 runtime-support libraries
- The FastRTS library features:
  - C-callable
  - Hand-coded assembly-optimized
  - Tested against C model and existing run-time-support functions
- FastRTS must be installed per directions in its Users Guide (SPRU100a.PDF)

<table>
<thead>
<tr>
<th>Single Precision</th>
<th>Double Precision</th>
</tr>
</thead>
<tbody>
<tr>
<td>atanf</td>
<td>atan</td>
</tr>
<tr>
<td>atan2f</td>
<td>atan2</td>
</tr>
<tr>
<td>cosf</td>
<td>cos</td>
</tr>
<tr>
<td>expf</td>
<td>exp</td>
</tr>
<tr>
<td>exp2f</td>
<td>exp2</td>
</tr>
<tr>
<td>exp10f</td>
<td>exp10</td>
</tr>
<tr>
<td>logf</td>
<td>log</td>
</tr>
<tr>
<td>log2f</td>
<td>log2</td>
</tr>
<tr>
<td>log10f</td>
<td>log10</td>
</tr>
<tr>
<td>powf</td>
<td>pow</td>
</tr>
<tr>
<td>recipf</td>
<td>recip</td>
</tr>
<tr>
<td>rsqrtf</td>
<td>rsqrt</td>
</tr>
<tr>
<td>sinf</td>
<td>sin</td>
</tr>
</tbody>
</table>

FastRTS (C62x/C64x)

- Optimized floating-point math function library for C programmers
  enhances floating-point performance on C62x and C64x fixed-point devices
- The FastRTS library features:
  - C-callable
  - Hand-coded assembly-optimized
  - Tested against C model and existing run-time-support functions
- FastRTS must be installed per directions in its Users Guide (SPRU653.PDF)

<table>
<thead>
<tr>
<th>Single Precision</th>
<th>Double Precision</th>
<th>Others</th>
</tr>
</thead>
<tbody>
<tr>
<td>_addf</td>
<td>_addd</td>
<td>_cvtdf</td>
</tr>
<tr>
<td>_divf</td>
<td>_divd</td>
<td>_cvtfd</td>
</tr>
<tr>
<td>_fixfi</td>
<td>_fixdi</td>
<td></td>
</tr>
<tr>
<td>_fixfli</td>
<td>_fixdli</td>
<td></td>
</tr>
<tr>
<td>_fixfu</td>
<td>_fixdu</td>
<td></td>
</tr>
<tr>
<td>_fixful</td>
<td>_fixdul</td>
<td></td>
</tr>
<tr>
<td>_fltif</td>
<td>_fltid</td>
<td></td>
</tr>
<tr>
<td>_fltlif</td>
<td>_fltlid</td>
<td></td>
</tr>
<tr>
<td>_fltuf</td>
<td>_fltud</td>
<td></td>
</tr>
<tr>
<td>_fltulf</td>
<td>_fltuld</td>
<td></td>
</tr>
<tr>
<td>_mpyf</td>
<td>_mpyd</td>
<td></td>
</tr>
<tr>
<td>recipf</td>
<td>recip</td>
<td></td>
</tr>
<tr>
<td>_subf</td>
<td>_subd</td>
<td></td>
</tr>
</tbody>
</table>
 Libraries – Download and Support

- Download via TI Wiki
- Source code available
- Includes doc folders which contain useful API guides
- Other docs:
  - SPRU565 – DSP API User Guide
  - SPRU023 – Imaging API UG
  - SPRU100 – FastRTS Math API UG
  - SPRA885 – DSPLIB app note
  - SPRA886 – IMGLIB app note
System Optimizations

BIOS Libraries

BIOS Library Types

- **Instrumented** – all logs and assert checking enabled. Use: dev, debug, OK to deploy.
- **Non-Instrumented** – NO logs or assert checking. Use: if not meeting real-time with instrumented ver.
- **Custom** – uses the app’s .cfg to rebuild BIOS according to those settings.
- **Debug** – Use: stepping into and debugging BIOS itself – not generally useful for customers.

Using “Custom”

- Compiles BIOS C Source Code along with your project. Can customize compiler options and perform source-level debug of BIOS.
- Uncheck Enable boxes to remove Assert/Log.
- Build error generated if you try to use a feature not supported by non/instrumented options.
Build-profile Options

- For pure SYS/BIOS apps, release and debug profiles are the same – both provide same set of libraries for linking your app.
- If the user installs other packages (e.g., Codec Engine) and they contain debug and release versions, this selection can be used to choose between those libraries.
- "release" profile is the default – recommended for SYS/BIOS apps.

Reference: BIOS User’s Guide Appendix E (Minimizing the Application Footprint)

Build Configurations (C Compiler)

- "Debug" and "Release" profiles specified here have NO impact on the BIOS library selection or BIOS performance.
- These settings are ONLY used for the application .c files and libraries built with CCS.
Custom Sections

**Custom Placement of Data and Code**

- Problem #1: have three arrays, two have to be linked into L1D and one can be linked to DDR2. How do you “split” the .far section??

  ![Far Placement Diagram]

- Problem #2: have two fnxs, one has to be linked into L1P and the other can be linked to DDR2. How do you “split” the .text section??

  ![Text Placement Diagram]

**Making Custom Sections**

- Create custom **data section** using:

  ```c
  #pragma DATA_SECTION (rcvPing, ".far:rcvBuff");
  int rcvPing[32];
  #pragma DATA_SECTION (rcvPong, ".far:rcvBuff");
  int rcvPong[32];
  ```

  * rcvPing is the name of the buffer
  * ".far:rcvBuff" is the name of the custom section

- Create custom **code section** using:

  ```c
  #pragma CODE_SECTION(filter, ".text:_filter");
  void filter(*rcvPing, *coeffs, ...){
  ```

  How do we link these custom sections?
Linking Custom Sections

- Create your own linker.cmd file for custom sections
- CCS projects can have multiple linker CMD files
- Results of the linker are written to the .map file
- The ".far:" prefix is used so that, if linker.cmd forgets to place the custom section, it is still linked along with .far
- -mo creates subsection for every fxn (great for libraries)
- -w warns if unexpected section encountered

Use Cache

- Cache hardware automatically transfers code/data to internal memory, as needed
- Addresses in the Memory Map are associated with locations in cache
- Cache locations do not have their own addresses

Note: we have an entire chapter dedicated to cache later on…
Use EDMA

Program the EDMA to automatically transfer data/code from one location to another.
Operation is performed WITHOUT CPU intervention
All details covered in a later chapter…

Multiple DMA’s: EDMA3 and QDMA

- EDMA3 (System DMA)
  - DMA (sync)
  - Enhanced DMA (version 3)
  - DMA to/from peripherals
  - Can be sync’d to peripheral events
  - Handles up to 64 events

- QDMA (async)
  - Quick DMA
  - DMA between memory
  - Async – must be started by CPU
  - 4-8 channels available

Both Share (number depends upon specific device)
- 128-256 Parameter RAM sets (PARAMs)
- 64 transfer complete flags
- 2-4 Pending transfer queues
System Architecture – SCR

- SCR – Switched Central Resource
- Masters initiate accesses to/from slaves via the SCR
- Most Masters (requestors) and Slaves (resources) have their own port to SCR
- Lower bandwidth masters (HPI, PCIe, etc) share a port
- There is a default priority (0 to 7) to SCR resources that can be modified:
  - SRIO, HOST (PCI/HPI), EMAC
  - TC0, TC1, TC2, TC3
  - CPU accesses (cache misses)
  - Priority Register: MSTPRI

Note: refer to your specific datasheet for register names…
Chapter Quiz

1. How do you turn ON the optimizer?

2. Why is there such a performance delta between “Debug” and “Opt”?

3. Name 4 compiler techniques to increase performance besides -o?

4. Why is data alignment important?

5. What is the purpose of the –mi option?

6. What is the BEST feedback mechanism to test compiler’s efficiency?
Quiz - Answers

Chapter Quiz

1. How do you turn ON the optimizer?
   • Project -> Properties, use -o2 or -o3 for best performance

2. Why is there such a performance delta between “Debug” and “Opt”?
   • Debug allows for single-step (NOPs), “Opt” fills delay slots optimally

3. Name 4 compiler techniques to increase performance besides -o?
   • Data alignment, MUST_ITERATE, restrict, –mi, intrinsics, _nassert()

4. Why is data alignment important?
   • Performance. The CPU can only perform 1 non-aligned LD per cycle

5. What is the purpose of the –mi option?
   • To specify the max # cycles a loop will go “dark” responding to INTs

6. What is the BEST feedback mechanism to test compiler’s efficiency?
   • Benchmarks, then LOOK AT THE ASSEMBLY FILE. Look for LDDW & SPLOOP
Lab 13 – C Optimizations

In the following lab, you will gain some experience benchmarking the use of optimizations using the C optimizer switches. While your own mileage may vary greatly, you will gain an understanding of how the optimizer works, where the switches are located and their possible effects on speed and size.

Lab 13 – FIR Algo & Buffer Management

- Lab 13 uses a double-buffered (PING/PONG) channel-sorted (L/R) buffering scheme.
- A FIR algorithm requires “history” to be preserved over calls to the algo.
- FIR_process() must first copy the history, then process the data

Lab 13 – FIR Audio – Optimizations Galore

- Part A – Determine goal/CPU Min
  Apply Compiler Options
- Part B – Code Tuning (pragmas)
- Part C – Optimize for Space
- Part D – Use DSPLib

Note: this lab uses NEW i2c code for LED.toggle() – 4 new files (i2c/led)
Lab 13 – C Optimizations – Procedure

PART A – Goals and Using Compiler Options

Determine Goals and CPU Min

1. Determine Real-Time Goal

Because we are running audio, our “real-time” goal is for the processing (using a low-pass FIR filter) to keep up with the I/O, which is sampling at 48 kHz. So, if we were doing a “single sample” FIR, our processing time would have to be less than 1/48 kHz = 20.8 µs. However, we are using double buffers, so our time requirement is relaxed to 20.8 µs * BUFFSIZE = 20.8 µs * 256 = 5.33 ms. Alright, any DSP worth its salt should be able to do this work inside 5 ms. Right? Hmmm…

► Real-time goal: music sounds fine.
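The timing budget worked out in step 1 can be reproduced with a few lines of C. This is just the arithmetic from the text; the macro and function names are chosen for this sketch.

```c
#define SAMPLE_RATE_HZ 48000.0   /* audio sample rate from the text */
#define BUFFSIZE       256       /* samples per buffer from the text */

/* Time between two samples, in microseconds: 1/48000 s = ~20.8 us. */
double sample_period_us(void)
{
    return 1.0e6 / SAMPLE_RATE_HZ;
}

/* Time to fill one buffer, in milliseconds: 256/48000 s = ~5.33 ms. */
double buffer_period_ms(void)
{
    return 1.0e3 * BUFFSIZE / SAMPLE_RATE_HZ;
}
```

Multiplying the buffer period by your CPU clock rate turns this budget into a cycle count you can compare against the cfir() benchmarks.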

2. Determine CPU Min.

What is the theoretical minimum based on the C674x architecture? This is based on several factors – data type (16-bit), #loads required and the type of mathematical operations involved. What kind of algorithm are we using? FIR. So, let's figure this out:

- 256 data samples * 64 coeffs = 16384 cycles. This assumes 1 MAC/cycle
- Data type = 16-bit data
- # loads possible = 8 16-bit values (aligned). Two LDDW (load double words).
- Mathematical operation – DDOTP (cross multiply/accumulate) = 8 per cycle

So, the CPU Min = 16384/8 = ~2048 cycles + overhead.

If you look at the inner loop (which is a simple dot product), it will take 64/8 = 8 cycles per inner loop. Add 8 cycles of overhead for prologue and epilogue (pre-loop and post-loop code), so the inner loop is 16 cycles. Multiply that by the buffer size (256), so the approximate CPU Min = 16 * 256 = 4096.

CPU Min = 4096 cycles.
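The CPU Min estimate in step 2 is simple enough to capture in code, which makes it easy to redo for a different buffer size or filter length. The numbers come straight from the text; the macro and function names are chosen for this sketch.

```c
#define DATA_SIZE       256   /* samples per buffer */
#define NUM_COEFFS      64    /* FIR taps */
#define MACS_PER_CYCLE  8     /* 16-bit multiply-accumulates per cycle (from the text) */
#define LOOP_OVERHEAD   8     /* prologue + epilogue cycles per inner loop (from the text) */

/* Ideal case, no overhead: 256 * 64 / 8 = 2048 cycles. */
int cpu_min_ideal(void)
{
    return (DATA_SIZE * NUM_COEFFS) / MACS_PER_CYCLE;
}

/* With per-inner-loop overhead: (64/8 + 8) * 256 = 4096 cycles. */
int cpu_min_with_overhead(void)
{
    int inner = NUM_COEFFS / MACS_PER_CYCLE + LOOP_OVERHEAD;   /* 16 cycles */

    return inner * DATA_SIZE;
}
```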

3. Import Lab 13 Project.

► Import Lab 13 Project from \Labs\Lab13 folder. Change the build properties to use YOUR student platform file and ensure the latest BIOS/XDC/UIA tools are selected.

4. Analyze new items – FIR_process and COEFFs

► Open fir.c. You will notice that this file is quite different. It has the same overall TSK structure (Semaphore_pend, if ping/pong, etc). Notice that after the if(pingPong), we process the data using a FIR filter.

► Scroll on down to cfir(). This is a simple nested for() loop. The outer loop runs once for every block size (in our case, this is DATA_SIZE). The inner loop runs the size of COEFFS[] times (in our case, 64).

► Open coeffs.c. Here you will see the coefficients for the symmetric FIR filter. There are 3 sets – low-pass, hi-pass and all-pass. We'll use the low-pass for now.
Using **Debug Configuration (−g, NO opt)**

5. **Using the Debug Configuration, build and play.**
   - Build your code and run it. The audio sounds terrible (if you can hear it at all). What is happening?

6. **Analyze poor audio.**
   - The first thing you might think is that the code is not meeting real-time. And, you’d be right. Let’s use some debugging techniques to find out what is going on.

7. **Check CPU load.**
   - Make sure you clicked **Restart**. Run again. What do the CPU loads and Log_info’s report?
   - Hmm. The CPU Load graph (for the author), showed NOTHING – no line at all.
   - Right now, the CPU is overloaded (> 100%). In that condition, results cannot be sent to the tools because the Idle thread is never run.
   - But, if you look at Raw Logs, you can see the CPU load reported as ZERO (which we know is not the case) and benchmark is:

   ![Raw Logs Table]

   About 913K cycles. Whoa. Maybe we need to OPTIMIZE this thing. 😊

   What were your results? Write them down below:

   ```plaintext
   Debug (-g, no opt) benchmark for cfir()? ______________________ cycles
   Did we meet our real-time goal (music sounding fine?): ____________
   ```

   Can anyone say “heck no”? The audio sounds terrible. We have failed to meet our only real-time goal.

   But hey, it’s using the Debug Configuration. And if we wanted to single step our code, we can. It is a very nice debug-friendly environment – although the performance is abysmal. This is to be expected.
8. Check Semaphore count of mcaspReadySem.

If the semaphore count for `mcaspReadySem` is anything other than ZERO after the `Semaphore_pend` in `FIR_process()`, we have troubles. This will indicate that we are NOT keeping up with real time. In other words, the Hwi is posting the semaphore but the processing algorithm is NOT keeping up with these posts. Therefore, if the count is higher than 0, then we are NOT meeting realtime.

► Use ROV and look at the Semaphore module. Your results may vary, but you’ll see the semaphore counts pretty high (darn, even `ledToggleSem` is out of control):

![Semaphore module in ROV](image)

My goodness – a number WELL greater than zero. We are definitely not meeting realtime.


► FYI – if you looked at the options for the Debug configuration, you’d see the following:

![Debug compiler options](image)

Full symbolic debug is turned on and NO optimizations. Ok, nice fluffy debug environment to make sure we’re getting the right answers, but not good enough to meet realtime. Let’s “kick it up a notch”…

**Using Release Configuration (–o2, no –g)**

10. Change the build configuration from Debug to Release.

Next, we’ll use the Release build configuration.

► In the project view, right-click on the project and choose “Build Configuration” and select Release:

![Build configurations](image)

► Check Properties → Include directory. Make sure the BSL `\inc` folder is specified.

**Also, double-check your PLATFORM file**. Make sure all code/data/stacks are in internal memory and that your project is USING the proper platform in this NEW build configuration. Once again, these configurations are containers of options. Even though `Debug` had the proper platform file specified, `Release` might NOT!!
   ► Build and Run. If you get errors, did you remember to set the INCLUDE path for the BSL library? Remember, the Debug configuration is a container of options – including your path statements and platform file. So, if you switch configs (Debug to Release), you must also add ALL path statements and other options you want. Don’t forget to modify the RTSC settings to point to your _student platform AGAIN!
   Once built and loaded, your audio should sound fine now – that is, if you like to hear music with no treble…

   ► Using the same method as before, observe the benchmark for cfir().

   ![Benchmark Table]

   Release (-o2, no -g) benchmark for cfir()? _________ cycles

   Meet real-time goal? Music sound better? _________

   Here’s our picture:

   ![Benchmark Graph]

   Ok, now we’re talkin’ – it went from 913K to 37K – just by switching to the release configuration. So, the bottom line is TURN ON THE OPTIMIZER!!

13. Study release configuration build properties.
   ► Here’s a picture of the build options for release:

   ![Build Options]

   The “biggie” is that -o2 is selected.

   Can we improve on this benchmark a little? Maybe…
Using “Opt” Configuration

14. Create a NEW build configuration named “Opt”.

Really? Yep. And it’s easy to do. ► Using the Release configuration, right-click on the project and select properties (where you’ve been many times already).

► Click on Basic Options and notice they are currently set to –o2 –g. ► Look up a few inches and you’ll see the “Configuration:” drop-down dialogue. ► Click on the down arrow and you’ll see “Debug” and “Release”.

► Click on the “Manage” button:

![Manage Configurations dialog]

► Click New:

![Create New Configuration dialog]

(also note the Remove button – where you can delete build configurations).

► Give the new configuration a name: “Opt” and choose to copy the existing configuration from “Release”. Click Ok.

► Change the Active Configuration to “Opt”
15. Change the “Opt” build properties to use –o3 and NO –g (the “blank” choice).

- The only change that needs to be made is to turn UP the optimization level to –o3 vs. –o2 which was used in the Release Configuration. Also, make sure –g is turned OFF (which it should already be).

- Open the “Opt” Config Build Properties and verify it contains NO –g (blank) and optimization level of –o3. Rebuild your code and benchmark (FYI – LED may stop blinking...don’t worry).

- Follow the same procedure as before to benchmark cfir:

```
Opt (-o3, no -g) benchmark for cfir()? __________ cycles
```

The author’s number was about 18K cycles – another pretty significant performance increase over –o2, -g. We simply went to –o3 and killed –g and WHAM, we went from 37K to 18K. This is why the author has stated before that the Opt settings we used in this lab SHOULD be the RELEASE settings. But I am not king.

So, as you can see, we went from 913K to 18K in about 30 minutes. Wow. But what was the CPU Min? About 7K? Ok...we still have some room for improvement...

Just for kicks and grins, try single stepping your code and/or adding breakpoints in the middle of a function (like cfir). Is this more difficult with –g turned OFF and –o3 applied? Yep.

---

**Note:** With –g turned OFF, you still get symbol capability – i.e. you can enter symbol names into the watch and memory windows. However, it is nearly impossible to single step C code – hence the suggestion to create test vectors at function boundaries to check the LOGICAL part of your code when you build with the Debug Configuration. When you turn off –g, you need to look at the answers on function boundaries to make sure it is working properly.
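A function-boundary test vector check might look like the sketch below. Both `first_mismatch()` and the toy function under test, `halve()`, are hypothetical names for illustration; the lab does not ship this code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first sample that differs, or -1 if all n match. */
long first_mismatch(const int16_t *got, const int16_t *expected, size_t n)
{
    assert(got != NULL && expected != NULL);
    for (size_t i = 0; i < n; i++) {
        if (got[i] != expected[i]) {
            return (long)i;
        }
    }
    return -1;
}

/* Hypothetical function under test: scales every sample by 1/2. */
void halve(const int16_t *in, int16_t *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i] = (int16_t)(in[i] / 2);
    }
}
```

The idea is to capture a known-good output while building with the Debug configuration, then run the same input through the optimized build and compare at the function exit. A breakpoint at the function boundary plus one return value to inspect is all you need.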

16. Turn on verbose and interlist – and then see what the .asm file looks like for fir.asm.

As noted in the discussion material, to “see it all”, you need to turn on three switches. Turn them on now, then build, then peruse the fir.asm file. You will see some interesting information about software pipelining for the loops in fir.c.

- Turn on:
  RunTime Model Options → Verbose pipeline info (-mw)

```
[ ] Generate verbose software pipelining information (--debug_software_pipeline, -mw)
```

Optimizations → Interlist (-os)

```
[ ] Generate optimized source interlisted assembly (--optimizer_interlist, -os)
```

Assembler Options → Keep .ASM file (-k)

```
[ ] Keep the generated assembly language (.asm) file (--keep_asm, -k)
```
Part B – Code Tuning

17. Use #pragma MUST_ITERATE in cfir().
   ▶ Uncomment the #pragmas for MUST_ITERATE on the two for loops. This pragma gives
   the compiler some information about the loops – and how to unroll them efficiently. As
   always, the more info you can provide to the compiler, the better.
   ▶ Use the “Opt” build configuration. Rebuild (use the Build button – it is an incremental build
   and WAY faster when you’re making small code changes like this). Then Run.

   Opt + MUST_ITERATE (-o3, no –g) cfir()? _________ cycles

   The author’s results were close to the previous results – about 15K. Well, this code tuning
   didn’t help THIS algo much, but it might help yours. At least you know how to apply it now.
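For reference, here is a minimal sketch of how the pragma sits on a loop. The `dotp()` function and the trip counts (minimum 8, maximum 1024, multiple of 4) are assumptions chosen for illustration, not the values used in the lab's fir.c. The pragma is guarded so the same file also builds with a host compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Dot product of two 16-bit arrays; count must honor the promise below. */
int32_t dotp(const int16_t *a, const int16_t *x, int count)
{
    int32_t sum = 0;
    int i;

    /* The promise made to the compiler: 8..1024 trips, a multiple of 4. */
    assert(count >= 8 && count <= 1024 && count % 4 == 0);

#ifdef __TI_COMPILER_VERSION__
    #pragma MUST_ITERATE(8, 1024, 4)   /* min, max, multiple */
#endif
    for (i = 0; i < count; i++) {
        sum += a[i] * x[i];
    }
    return sum;
}
```

Keep in mind that the pragma is a promise, not a check. If the loop ever runs with a trip count outside what you declared, the optimized code can give wrong answers, which is why the sketch asserts the same limits.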

18. Use restrict keyword on the results array.
   You actually have a few options to tell the compiler there is NO ALIASING. The first method
   is to tell the compiler that your entire project contains no aliasing (using the –mt compiler
   option). However, it is best to narrow the scope and simply tell the compiler that the results
   array has no aliasing (because the WRITES are destructive, we RESTRICT the output array).
   ▶ So, in fir.c, add the following keyword (restrict) to the results (r) parameter of the fir
   algorithm as shown:

   ```c
   void cfir(int16_t * x, int16_t * h, int16_t * restrict r, int16_t nh, int16_t nr)
   ```

   ▶ Build, then run again. Now benchmark your code again. Did it improve?

   Opt + MUST_ITERATE + restrict (-o3, no –g) cfir()? _________ cycles

   Here is what the author got:

   Well, getting rid of ALIASING was a big help to our algo. We went from about 15K down to
   7K cycles. You could achieve the same result by using “-mt” compiler switch, but that tells the
   compiler that there is NO aliasing ANYWHERE – scope is huge. Restrict is more restricted.
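To see where the keyword goes in a complete function, here is a minimal sketch. The signature is the one shown above, but the body is an assumption: a plain Q15 FIR written for illustration, since the workshop's actual cfir() body is not reproduced in this guide.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative Q15 FIR (assumed arithmetic, not the lab's implementation).
 * x must hold at least nr + nh - 1 samples. Because r is restrict-qualified,
 * the compiler may assume writes to r[] never change x[] or h[].
 */
void cfir(int16_t *x, int16_t *h, int16_t * restrict r, int16_t nh, int16_t nr)
{
    assert(nh > 0 && nr > 0);

    for (int i = 0; i < nr; i++) {
        int32_t sum = 0;
        for (int j = 0; j < nh; j++) {
            sum += (int32_t)x[i + j] * h[j];
        }
        r[i] = (int16_t)(sum >> 15);   /* scale the Q30 product sum back to Q15 */
    }
}
```

Without `restrict`, the compiler has to assume that each store to `r[i]` might modify a sample it is about to load, so it cannot overlap loads and stores from different iterations. As with MUST_ITERATE, the keyword is a promise: if the output buffer really does overlap the input, the results are undefined.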
19. Use _nassert() to tell optimizer about data alignment.

Because the receive buffers are set up using STRUCTURES, the compiler may or may not be able to determine the alignment of an ELEMENT (i.e. rcvPingL.hist) inside that structure – thus causing the optimizer to be conservative and use redundant loops. You may have seen the benchmarks have two results the same, and one larger. Or, you may not have. It usually happens on Thursdays….

It is possible that using _nassert() may help this situation. Again, this “fix” is only needed in this specific case where the memory buffers were allocated using structures (see main.h if you want a looksy).

► Uncomment the two _nassert() intrinsics in fir.c inside the cfir() function and rebuild/run and check the results.

Here is what the author got (same as before…but hey, worth a try):

![Benchmark results]
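If you want to see the shape of such a hint outside the lab files, here is a minimal sketch. `sum_aligned()` is a hypothetical function and the 8-byte alignment is an assumed example. The host fallback macro is also an assumption, added only so the file builds and checks the promise when you are not using the TI compiler.

```c
#include <assert.h>
#include <stdint.h>

#ifndef __TI_COMPILER_VERSION__
/* Host fallback (assumption): verify the promise at run time instead. */
#define _nassert(cond) assert(cond)
#endif

/* Sums n samples from a buffer that the caller promises is 8-byte aligned. */
int32_t sum_aligned(const int16_t *buf, int n)
{
    int32_t sum = 0;

    _nassert(((uintptr_t)buf & 7) == 0);   /* low 3 address bits are zero */

    for (int i = 0; i < n; i++) {
        sum += buf[i];
    }
    return sum;
}
```

On the target, `_nassert()` generates no code. It only tells the optimizer that the expression is true, which lets it use wide aligned loads without emitting a second, conservative version of the loop.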

20. Turn on symbolic debug with FULL optimization.

This is an important little trick that you need to know. As we have stated before, it is impossible to single step your code when you have optimization turned on to level –o3. You are able to place breakpoints at function entry/exit points and check your answers, but that’s it. This is why FUNCTION LEVEL test vectors are important.

There are two ways to accomplish this. Some companies use script code to place breakpoints at specific disassembly symbols (function entry/exit) and run test vectors through automatically. Others simply want to manually set breakpoints in their source code and hit RUN and see the results.

► While still in the Debug perspective with your program loaded, select:

  Restart

The execution pointer is at main, but do you see your main() source file? Probably not. Ok, pop over to Edit perspective and open fir.c. Set a breakpoint at the beginning of the function. Hit RUN. Your program will stop at that breakpoint, but in the Debug perspective, do you see your source file associated with the disassembly? Again, probably not.

► Again, hit Restart to start your program at main() again.

How do you tell the compiler to add JUST ENOUGH debug info to allow your source files to SYNC with the disassembly but not affect optimization? There is a little known option that allows this…
► Make sure you have the Opt configuration selected, right click and choose Properties.

► Next, check the box below (at C6000 Compiler → Runtime Model Options) to turn on symbolic debug with FULL Optimization (-mn):

► TURN ON –g (symbolic debug). –mn only makes sense if –g is turned ON. Go back to the basic options and select Full Symbolic Debug.

► Rebuild and load your program. The execution pointer should now show up along with your main.c file.

► Hit Restart again.

► Set a breakpoint in the middle of FIR_process() function inside fir.c. You can’t do it. The breakpoint snaps to the beginning or end of the function, right?

► Make sure the breakpoint is at the beginning of FIR_process() and hit RUN. You can now see your source code synced with the disassembly. Very nice.

But did this affect your optimization and your benchmark? Go try it.

► Hit Restart again and remove all breakpoints.

► Then RUN. Halt your program and check your benchmark. Is it about the same? It should be…
Part C – Minimizing Code Size (–ms)

   ► Select the “Opt” configuration and also make sure MUST_ITERATE and restrict are used in your code (this is the same setting as the previous lab step).
   ► Rebuild and Run.
   ► Write down your fastest benchmark for cfir:

   Opt (-o3, NO –g, NO –ms3) cfir? _________ cycles

   .text (NO –ms) = _________ h

   ► Open the .map file generated by the linker. Hmmm. Where is it located?

   ► Try to find it yourself without asking anyone else. Hint: which build config did you use when you hit “build”?

22. Add –ms3 to Opt Config.
   ► Open the build properties and add –ms3 to the compiler options (under Basic Options). We will just put the “pedal to the metal” for code size optimizations and go all the way to –ms3 first. Note that we also have –o3 set (which is required for the –ms option).
   In this scenario, the compiler may choose to keep the “slow version” of the redundant loops (fast or slow) due to the presence of –ms.
   ► Rebuild and run.

   Opt + -ms (-o3, NO –g, –ms3) cfir? _________ cycles

   .text (-ms3) = _________ h

Did your benchmark get worse with –ms3? How much code size did you save? What conclusions would you draw from this?

____________________________________________________________________
____________________________________________________________________

Keep in mind that you can also apply –ms3 (or most of the basic options) to a specific function using #pragma FUNCTION_OPTIONS( ).

FYI – the author saved about 2.2K bytes total out of the .text section and the benchmark was about 33K. HOWEVER, most of the .text section is LIBRARY code which is not affected by –ms3. So, of the NON .lib code which IS affected by –ms3, using –ms3 saved 50% on code size (original byte count was 6881 bytes and was reduced to 3453 bytes). This is pretty significant. Yes, the benchmark ended up being 33K, but now you know the tradeoff.

Also remember that you can apply –ms3 on a FILE BY FILE basis. So, a smart way to apply this is to use it on init routines – and keep it far away from your algos that require the best performance.
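As a sketch of the per-function approach, the pragma can be attached to an init routine like this. `init_ramp()` is a hypothetical function, and the option string assumes that `--opt_for_space=3` is the long-form spelling of –ms3; check your compiler's documentation before relying on it. The pragma is guarded so the file also builds on a host compiler.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __TI_COMPILER_VERSION__
/* Assumed long form of -ms3; applies to this one function only. */
#pragma FUNCTION_OPTIONS(init_ramp, "--opt_for_space=3")
#endif

/* Hypothetical init routine: runs once at startup, so size beats speed. */
void init_ramp(int16_t *table, int n, int16_t step)
{
    assert(table != 0 && n > 0);
    for (int i = 0; i < n; i++) {
        table[i] = (int16_t)(i * step);
    }
}
```

This keeps the rest of the file, including any time-critical loops in it, at whatever options the build configuration specifies.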
Part D – Using DSPLib

23. Download and install the appropriate DSP Library.
   
   This, fortunately for you, has already been done for you. This directory is located at:
   
   `C:\SYSBIOSv4\Labs\dsplib64x+\lib`

24. Link the appropriate library to your project.
   
   ► Find the lib file in the above folder and link it to your project (non ELF version).
   
   ► Also, add the include path for this library to your build properties.

25. Add #include to the fir.c file.
   
   ► Add the proper #include for the header file for this library to fir.c.

26. Replace the calls to the fir function in fir.c.
   
   ► THIS MUST BE DONE 4 TIMES (Ping, Pong, L and R = 4). Should I say it again? There are FOUR calls to the fir routine that need to be replaced by something new. Ok, twice should be enough. ;-) 
   
   ► Replace:
   
   ```c
   cfir(rcvPongL.hist, COEFFS, xmt.PongL, ORDER, DATA_SIZE);
   ```
   
   with
   
   ```c
   DSP_fir_gen(rcvPongL.hist, COEFFS, xmt.PongL, ORDER, DATA_SIZE);
   ```

27. Build, load, verify and BENCHMARK the new FIR routine in DSPLib.

28. What are the best-case benchmarks?

   Yours (compiler/optimizer):___________    DSPLib: ___________

   Wow, for what we wanted in THIS system (a fast simple FIR routine), we would have been better off just using DSPLib. Yep. But, in the process, you’ve learned a great deal about optimization techniques across the board that may or may not help your specific system. Remember, your mileage may vary.
Conclusion

Hopefully this exercise gave you a feel for how to use some of the basic compiler/optimizer switches for your own application. Everyone’s mileage may vary and there just might be a magic switch that helps your code and doesn’t help someone else’s. That’s the beauty of trial and error.

Conclusion? TURN ON THE OPTIMIZER! Was that loud enough?

Here’s what the author came up with – how did your results compare?

<table>
<thead>
<tr>
<th>Optimizations</th>
<th>Benchmark</th>
</tr>
</thead>
<tbody>
<tr>
<td>Debug Bld Config – No opt</td>
<td>913K</td>
</tr>
<tr>
<td>Release (-o2, -g)</td>
<td>37K</td>
</tr>
<tr>
<td>Opt (-o3, no -g)</td>
<td>18K</td>
</tr>
<tr>
<td>Opt + MUST_ITERATE</td>
<td>15K</td>
</tr>
<tr>
<td>Opt + MUST_ITERATE + restrict</td>
<td>7K</td>
</tr>
<tr>
<td>DSPLib (FIR)</td>
<td>7K</td>
</tr>
</tbody>
</table>
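To put the table in perspective, the speedup of each row over the Debug build is just a ratio of cycle counts. The helper below is a trivial sketch (`speedup()` is a hypothetical name) that you can feed your own numbers.

```c
#include <assert.h>

/* Speedup of an optimized benchmark relative to a baseline, both in cycles. */
double speedup(double baseline_cycles, double optimized_cycles)
{
    assert(baseline_cycles > 0.0 && optimized_cycles > 0.0);
    return baseline_cycles / optimized_cycles;
}
```

Using the author's numbers, `speedup(913e3, 37e3)` is roughly 25x for Release and `speedup(913e3, 7e3)` is roughly 130x for Opt + MUST_ITERATE + restrict.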

Regarding –ms3, use it wisely. It is more useful to add this option to functions that are large but not time critical – like IDL functions, init code, maintenance type items. You can save some code space (important) and lose some performance (probably a don’t care). For your time-critical functions, do not use –ms ANYTHING. This is just a suggestion – again, your mileage may vary.

CPU Min was 4K cycles. We got close, but didn’t quite reach it. The authors believe that it is possible to get closer to the 4K benchmark by using intrinsics and the DDOTP instruction.

The biggest limiting factor in optimizing the cfir routine is the “sliding window”. The processor is only allowed ONE non-aligned load each cycle. This would happen 75% of the time. So, the compiler is already playing some games and optimizing extremely well given the circumstances. It would require “hand-tweaking” via intrinsics and intimate knowledge of the architecture to achieve much better.

29. Terminate the Debug session, close the project and close CCS. Power-cycle the board.

Throw something at the instructor to let him know that you’re done with the lab. Hard, sharp objects are most welcome...

Additional Information

IDMA0 – Programming Details

- IDMA0 operates on a block of 32 contiguous 32-bit registers (both src/dst blocks must be aligned on a 32-word boundary). Optionally generate CPU interrupt if needed.
- User provides: Src, Dst, Count and “mask” (Reference: SPRU871)

![Linear Assembly](image)

- Linear assembly abstracts the user from having to learn how to software pipeline C64x+ assembly code (NO NOPs, functional units, parallel bars, register specifications req’d)
- This linear assembly routine performs this function:
  ```
  int dotp ( short *a, short *x, int count )
  ```
- Can specify arguments (pm, pn, count), variables (m, n, prod, sum), return values (sum)
- .proc/.endproc are assembly directives that specify the start/end of the procedure
- Reference: SPRU187, Chapter 4

![Example Transfer using MASK](image)

- Example Transfer using MASK (not all regs typically need to be programmed):


<table>
<thead>
<tr>
<th>Source address</th>
<th>Destination address</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 1 8 15 22 29</td>
<td>0 1 8 10 12 22 23</td>
</tr>
</tbody>
</table>

| Mask | 01010111101111111001010001100 |

- User must write to IDMA0 registers in the following order (COUNT written – triggers transfer):

| IDMA0_MASK = 0x573FEA8C; //set mask for 13 regs above |
| IDMA0_SOURCE = reg_ptr; //set src addr in L1D/L2 |
| IDMA0_DEST = MMR_ADDRESS; //set dst addr to config location |
| IDMA0_COUNT = 0; //set count for 1 block of 32 registers |
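The mask value above can be checked on a host with the sketch below. It assumes that a mask bit of 1 suppresses the transfer of the corresponding register and a 0 lets it through, which is consistent with the "13 regs" comment. Both function names are hypothetical, and the copy routine only models the behavior; it does not touch real IDMA0 registers.

```c
#include <assert.h>
#include <stdint.h>

#define IDMA0_WINDOW 32   /* one block = 32 contiguous 32-bit registers */

/* Number of registers in the window that the mask lets through (0 bits). */
int idma0_regs_selected(uint32_t mask)
{
    int n = 0;
    for (int bit = 0; bit < IDMA0_WINDOW; bit++) {
        if (((mask >> bit) & 1u) == 0) {
            n++;
        }
    }
    return n;
}

/* Host model of one masked block transfer: copy only the unmasked words. */
void idma0_model_copy(const uint32_t *src, uint32_t *dst, uint32_t mask)
{
    assert(src != 0 && dst != 0);
    for (int bit = 0; bit < IDMA0_WINDOW; bit++) {
        if (((mask >> bit) & 1u) == 0) {
            dst[bit] = src[bit];
        }
    }
}
```

For `0x573FEA8C` the zero bits are 0, 1, 4, 5, 6, 8, 10, 12, 22, 23, 27, 29 and 31, which gives the 13 registers mentioned in the comment.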
IDMA1 – Programming Details

- IDMA1 is optimized for LINEAR burst transfers between L1P, L1D and L2

  ![Diagram showing IDMA1 source and destination addresses]

- Cannot access CFG port registers (only used for internal memory transfers)
- User provides: Src, Dst, Count (Reference: SPRU871)
- All src/dest addresses increment linearly throughout the transfer
- IDMA1_COUNT = #bytes to transfer
- Example:

  ```
  IDMA1_SOURCE = outBuffFast; // set src addr in L1D
  IDMA1_DEST = outBuff; // set dst addr to L2
  IDMA1_COUNT = 7 << IDMA_PRI_SHIFT | // PRI low vs. cache/EDMA
  1 << IDMA_INT_SHIFT | // interrupt CPU on completion
  buffsize; // set count to buffer size (bytes)
  ```

Cache & Internal Memory

Introduction

In this chapter the memory options of the C6000 will be considered. By far, the easiest – and highest performance – option is to place everything in on-chip memory. In systems where this is possible, it is the best choice. To place code and initialize data in internal RAM in a production system, refer to the chapters on booting and DMA usage.

Most systems will have more code and data than the internal memory can hold. As such, placing everything off-chip is another option, and can be implemented easily, but most users will find the performance degradation to be significant. As such, the ability to enable caching to accelerate the use of off-chip resources will be desirable.

For optimal performance, some systems may benefit from a mix of on-chip memory and cache. Fine tuning of code for use with the cache can also improve performance, and assure reliability in complex systems. Each of these constructs will be considered in this chapter.

Objectives

- Compare/contrast different uses of memory (internal, external, cache)
- Define cache terms
- Describe C6000 cache architecture
- Demonstrate how to configure and use cache optimally
- Lab 14 – modify an existing system to use cache – benchmark solutions
Why Cache?

Parking Dilemma

Sports Arena

- Close Parking
  - 0 minute walk
  - 10 spaces
  - $100/space

- Distant Parking-Ramp
  - 10 minute walk
  - 1000 spaces
  - $5/space

Parking Choices:
- 0 minute walk @ $100 for close-in parking
- 10 minute walk @ $5 for distant parking
or ...
- Valet parking: 0 minute walk @ only $6.00

How does this compare to cache memory?

Why Cache?

Sports Arena

- Cache Memory
  - Fast
  - Small
  - Works like Big, Fast Memory

- Bulk Memory
  - Slower
  - Larger
  - Cheaper

Memory Choices:
- Small, fast memory
- Large, slow memory
or ... Use Cache:
- Combines advantages of both
- Like valet, data movement is automatic
Cache Basics – Terminology

Using Cache Memory

- Cache hardware automatically transfers code/data to internal memory, as needed
- Addresses in the Memory Map are *associated* with locations in cache
- Cache locations do not have their own addresses

Cache: Block, Line, Index

- Conceptually, a cache divides the entire memory into *blocks* equal to its size
- A cache is divided into smaller storage locations called *lines*
- The term *index* or *Line-Number* is used to specify a specific cache line

How do we know which block is cached?

Cache Tags

- A **Tag** value keeps track of which memory block is associated with a cache line.
- **Each line has its own tag** – thus, the whole cache block won't be erased when lines from different memory blocks need to be cached simultaneously.

How do we know a cache line is valid (or not)?

Valid Bits

- A **Valid** bit keeps track of which lines contain “real” information.
- They are set by the cache hardware whenever new code or data is stored.

This type of cache is called ...
Direct-Mapped Cache

- **Direct-Mapped Cache** associates an address within each block with one cache line.
- Thus … there will be only one unique cache index for any address in the memory-map.
- Only one block can have information in a cache line at any given time.

Let's look at an example ...
Cache Example

Let's examine an arbitrary direct-mapped cache example:

- A 16-line, direct-mapped cache requires a 4-bit index
- If our example μP used 16-bit addresses, this leaves us with a 12-bit tag

Arbitrary Direct-Mapped Cache Example

- The following example uses:
  - 16-line cache
  - 16-bit addresses, and
  - Stores one 32-bit instruction per line
- C6000 caches have different cache and line sizes than this example
- It is only intended as a simple cache example to reinforce cache concepts
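The address split for this example can be written out as a short sketch. The helper names are hypothetical; the 4-bit index and 16-bit address come from the example above.

```c
#include <assert.h>
#include <stdint.h>

/* Low `index_bits` bits of the address select the cache line. */
unsigned cache_index(uint16_t addr, unsigned index_bits)
{
    assert(index_bits > 0 && index_bits < 16);
    return addr & ((1u << index_bits) - 1u);
}

/* The remaining upper bits form the tag stored alongside the line. */
unsigned cache_tag(uint16_t addr, unsigned index_bits)
{
    assert(index_bits > 0 && index_bits < 16);
    return (unsigned)addr >> index_bits;
}
```

With `index_bits = 4`, address 0003h gives index 3 and tag 000h, while address 0026h gives index 6 and tag 002h. Address 0006h also gives index 6 (with tag 000h), so those two addresses compete for the same cache line.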
### Conceptual Example Code

<table>
<thead>
<tr>
<th>Address</th>
<th>Code</th>
</tr>
</thead>
<tbody>
<tr>
<td>0003h</td>
<td>L1 LDH</td>
</tr>
<tr>
<td>0004h</td>
<td>MPY</td>
</tr>
<tr>
<td>0005h</td>
<td>ADD</td>
</tr>
<tr>
<td>0006h</td>
<td>B L2</td>
</tr>
<tr>
<td>0026h</td>
<td>L2 ADD</td>
</tr>
<tr>
<td>0027h</td>
<td>SUB cnt</td>
</tr>
<tr>
<td>0028h</td>
<td>![cnt] B L1</td>
</tr>
</tbody>
</table>

### Direct-Mapped Cache Example

<table>
<thead>
<tr>
<th>Valid</th>
<th>Tag</th>
<th>Index</th>
<th>Cache</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>0</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>1</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>2</td>
<td></td>
</tr>
<tr>
<td>✓</td>
<td>000</td>
<td>3</td>
<td>LDH</td>
</tr>
<tr>
<td>✓</td>
<td>000</td>
<td>4</td>
<td>MPY</td>
</tr>
<tr>
<td>✓</td>
<td>000</td>
<td>5</td>
<td>ADD</td>
</tr>
<tr>
<td>✓</td>
<td>000 → 002 → 000</td>
<td>6</td>
<td>B → ADD → B</td>
</tr>
<tr>
<td>✓</td>
<td>002</td>
<td>7</td>
<td>SUB</td>
</tr>
<tr>
<td>✓</td>
<td>002</td>
<td>8</td>
<td>B</td>
</tr>
<tr>
<td></td>
<td></td>
<td>9</td>
<td></td>
</tr>
</tbody>
</table>

### Address, Code Table

<table>
<thead>
<tr>
<th>Address</th>
<th>Code</th>
</tr>
</thead>
<tbody>
<tr>
<td>0003h</td>
<td>L1 LDH</td>
</tr>
<tr>
<td>0026h</td>
<td>L2 ADD</td>
</tr>
<tr>
<td>0027h</td>
<td>SUB cnt</td>
</tr>
<tr>
<td>0028h</td>
<td>![cnt] B L1</td>
</tr>
</tbody>
</table>
### Direct-Mapped Cache Example

<table>
<thead>
<tr>
<th>Valid</th>
<th>Tag</th>
<th>Index</th>
<th>Cache</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>0</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>1</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>2</td>
<td></td>
</tr>
<tr>
<td>✔</td>
<td>000</td>
<td>3</td>
<td>LDH</td>
</tr>
<tr>
<td></td>
<td></td>
<td>4</td>
<td>MPY</td>
</tr>
</tbody>
</table>

**Notes:**
- This example was contrived to show how cache lines can thrash
- Code thrashing is minimized on the C6000 due to relatively large cache sizes
- Keeping code in contiguous sections also helps to minimize thrashing
- Let’s review the two types of misses that we encountered

### Types of Misses

- **Compulsory**
  - Miss when first accessing a new address

- **Conflict**
  - Line is evicted upon access of an address whose index is already cached
  - Solutions:
    - Change memory layout
    - Allow more lines for each index

- **Capacity** (we didn’t see this in our example)
  - Line is evicted before it can be re-used because capacity of the cache is exhausted
  - Solution: Increase cache size
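The miss types can be counted by replaying the example code's fetch addresses through a small model of the 16-line direct-mapped cache. This is a sketch with hypothetical names. It labels every repeat miss as a conflict miss, which is fair for this tiny trace but does not try to separate out capacity misses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_LINES 16u                    /* 16-line cache -> 4-bit index */

struct miss_counts { unsigned hits, compulsory, conflict; };

/* Replays a list of 16-bit addresses through a direct-mapped cache model. */
struct miss_counts run_trace(const uint16_t *trace, size_t n)
{
    static bool seen[1u << 16];          /* has this address been touched before? */
    bool     valid[NUM_LINES] = { false };
    uint16_t tag[NUM_LINES]   = { 0 };
    struct miss_counts c = { 0, 0, 0 };

    assert(trace != NULL);
    memset(seen, 0, sizeof seen);

    for (size_t i = 0; i < n; i++) {
        uint16_t addr = trace[i];
        unsigned idx  = addr & (NUM_LINES - 1u);   /* low 4 bits    */
        uint16_t t    = (uint16_t)(addr >> 4);     /* upper 12 bits */

        if (valid[idx] && tag[idx] == t) {
            c.hits++;
        } else {
            if (seen[addr]) c.conflict++;          /* was cached once, then evicted */
            else            c.compulsory++;        /* first touch of this address   */
            valid[idx] = true;
            tag[idx]   = t;
        }
        seen[addr] = true;
    }
    return c;
}
```

Feeding it two passes of the example loop (0003h to 0006h, then 0026h to 0028h, twice) gives 7 compulsory misses on the first pass, and on the second pass 5 hits plus 2 conflict misses, both at index 6 where 0006h and 0026h keep evicting each other.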
L1P – Program Cache

Internal Memory Hierarchy

- We often refer to a system’s memory in hierarchical levels
- Higher levels (L1) are closer to the CPU
- CPU always requests from highest level memory...
  ... If address isn’t present in L1, cache h/w gets it from lower level

L1P Cache

- Zero-waitstate Program Memory
- Direct-Mapped Cache
  - Works exceptionally well for DSP code (which tends to have many loops)
  - Can be placed to minimize thrashing

Looking more closely at L1P...
### L1P Cache Comparison

<table>
<thead>
<tr>
<th>Device</th>
<th>Scheme</th>
<th>Size</th>
<th>Linesize</th>
<th>New Features</th>
</tr>
</thead>
<tbody>
<tr>
<td>C62x/C67x</td>
<td>Direct Mapped</td>
<td>4K bytes</td>
<td>64 bytes (16 instr)</td>
<td>N/A</td>
</tr>
<tr>
<td>C64x</td>
<td>Direct Mapped</td>
<td>16K bytes</td>
<td>32 bytes (8 instr)</td>
<td>N/A</td>
</tr>
<tr>
<td>C64x+</td>
<td>Direct Mapped</td>
<td>32K bytes</td>
<td>32 bytes (8 instr)</td>
<td>Cache/RAM, Cache Freeze, Memory Protection</td>
</tr>
<tr>
<td>C66x</td>
<td>Direct Mapped</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

- All L1P memories provide zero waitstate access

Next two slides discuss Cache/RAM and Freeze features. Memory Protection is not discussed in this workshop.
C64x+ L1P – Cache vs. Addressable RAM

- RAM Memory
- Cache

- Can be configured as Cache or Addressable RAM
- Five cache sizes are available: 0K, 4K, 8K, 16K, 32K
- Allows critical loops to be put into L1P, while still affording room for cache memory

Cache Freeze (C64x+)

- Freezing cache prevents data that is currently cached from being evicted
- Cache Freeze
  - Responds to read and write hits normally
  - No updating of cache on miss
  - Freeze supported on C64x+ L2/L1P/L1D
- Commonly used with Interrupt Service Routines so that one-use code does not replace realtime algo code
- Other cache modes: Normal, Bypass
- Cache_xyz: BIOS Cache management module

Cache Mode Management

```c
typedef enum {
  CACHE_L1D,
  CACHE_L1P,
  CACHE_L2
} CACHE_Level;

typedef enum {
  CACHE_NORMAL,
  CACHE_FREEZE,
  CACHE_BYPASS
} CACHE_Mode;
```
L1D – Data Cache

Caching Data

- One instruction may access multiple data elements:
  ```
  for( i = 0; i < 4; i++ ) {
      sum += x[i] * y[i];
  }
  ```

- What would happen if x and y ended up at the following addresses?
  - x = 0x0000
  - y = 0x8000

  They would end up overwriting each other in the cache --- called thrashing

- Increasing the associativity of the cache will reduce this problem

Increased Associativity

- Split a Direct-Mapped Cache in half
  - Each half is called a cache way
  - Multiple ways make data caches more efficient

How do you increase associativity?

What is a set?
What is a Set?

- The lines from each way that map to the same index form a set.
- The number of lines per set defines the cache as an N-way set-associative cache.
- With 2 ways, there are now 2 unique cache locations for each memory address.
- How do you determine WHICH line gets replaced? (LRU algo)
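To see why the second way helps, the sketch below counts misses for the x/y dot-product access pattern with one way and with two. The geometry (32K bytes total, 64-byte lines) is an assumption borrowed from the C64x+ L1D row of the summary table, and all names are hypothetical. The LRU bookkeeping is the simple one-entry-per-set form that only works for two ways.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES  64u
#define CACHE_BYTES 32768u
#define MAX_SETS    (CACHE_BYTES / LINE_BYTES)   /* 512 sets when direct-mapped */
#define MAX_WAYS    2u

/* Counts misses for a byte-address trace; ways = 1 (direct-mapped) or 2. */
unsigned count_misses(unsigned ways, const uint32_t *addrs, size_t n)
{
    static bool     valid[MAX_SETS][MAX_WAYS];
    static uint32_t tag[MAX_SETS][MAX_WAYS];
    static unsigned lru[MAX_SETS];       /* way to replace next in each set */
    unsigned sets, misses = 0;

    assert(addrs != NULL && (ways == 1u || ways == 2u));
    memset(valid, 0, sizeof valid);
    memset(tag, 0, sizeof tag);
    memset(lru, 0, sizeof lru);
    sets = CACHE_BYTES / (LINE_BYTES * ways);

    for (size_t i = 0; i < n; i++) {
        uint32_t line = addrs[i] / LINE_BYTES;
        unsigned set  = line % sets;
        uint32_t t    = line / sets;
        unsigned way  = ways;            /* "ways" means not found yet */

        for (unsigned w = 0; w < ways; w++) {
            if (valid[set][w] && tag[set][w] == t) way = w;
        }
        if (way == ways) {               /* miss: replace the LRU way */
            misses++;
            way = lru[set];
            valid[set][way] = true;
            tag[set][way]   = t;
        }
        /* With two ways, the way not just used becomes the LRU one. */
        lru[set] = (ways == 2u) ? 1u - way : 0u;
    }
    return misses;
}
```

For the trace x[0], y[0], x[1], y[1], ... with 16-bit elements, x at 0x0000 and y at 0x8000, the direct-mapped model misses on all 8 accesses, while the 2-way model misses only twice (once per array) and hits from then on.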

L1D Summary

<table>
<thead>
<tr>
<th>Device</th>
<th>Scheme</th>
<th>Size</th>
<th>Linesize</th>
<th>New Features</th>
</tr>
</thead>
<tbody>
<tr>
<td>C62x/C67x</td>
<td>2-Way Set Assoc.</td>
<td>4K bytes</td>
<td>32 bytes</td>
<td>N/A</td>
</tr>
<tr>
<td>C64x</td>
<td>2-Way Set Assoc.</td>
<td>16K bytes</td>
<td>64 bytes</td>
<td>N/A</td>
</tr>
<tr>
<td>C64x+ C674x C66x</td>
<td>2-Way Set Assoc.</td>
<td>C6455: 32K DM64xx: 80K</td>
<td>64 bytes</td>
<td>Cache/RAM, Cache Freeze, Memory Protection</td>
</tr>
</tbody>
</table>

- All L1D memories provide zero waitstate access.
- Cache/RAM configuration and Cache Freeze work similarly to L1P.
- L1 caches are ‘Read Allocate’, thus only updated on memory read misses.
L2 – RAM or Cache?

**Internal Memory (L2)**

<table>
<thead>
<tr>
<th>Device</th>
<th>Size</th>
<th>L2 Features</th>
</tr>
</thead>
<tbody>
<tr>
<td>C671x</td>
<td>64KB - 128K</td>
<td>• Unified (code or data)</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Config as Cache or RAM: none, or 1- to 4-way cache</td>
</tr>
<tr>
<td>C64x</td>
<td>64KB - 1MB</td>
<td>• Unified (code or data)</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Config as Cache or RAM</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Cache is always 4-way</td>
</tr>
<tr>
<td>C64x+</td>
<td>64KB - 2MB</td>
<td>• Unified (code or data)</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Config as Cache or RAM</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Cache is always 4-way</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Cache Freeze</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Memory Protection</td>
</tr>
</tbody>
</table>

- L2 linesize for all devices is 128 bytes
- L2 caches are 'Read/Write Allocate' memories

L2 Cache Configuration...

**C64x+/C674x – L2 Memory Configuration**

- **Configuration**
  - 2MB on C6455
  - When enabled, it's always 4-Way (same as C64x)

- **Linesize**
  - Linesize = 128 bytes
  - Same linesize as C671x & C64x

- **Performance**
  - L2 → L1P
    - 1-8 Cycles
  - L2 → L1D
    - L2 SRAM hit: 12.5 cycles
    - L2 Cache hit: 14.5 cycles
    - Pipeline: 4 cycles
    - When required, minimize latency by using L1D RAM

Using the Config Tool...

Configuring L1/L2 Cache with the Config Tool

- Use the Platform Package to specify the sizes of L1, L2 caches:

  ![Config Tool Screenshot]

- The default settings are:
  - L1D: 32K
  - L1P: 32K
  - L2: 0K

Cache Performance Summary

<table>
<thead>
<tr>
<th>Device</th>
<th>L1P</th>
<th>L1D</th>
<th>L2 Performance</th>
</tr>
</thead>
<tbody>
<tr>
<td>C62x/C67x</td>
<td>Zero Waitstate Cache</td>
<td>Zero Waitstate Cache</td>
<td>L2 → L1P: 16 instr in 5 cycles</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>L2 → L1D: 32 bytes in 4 cycles</td>
</tr>
<tr>
<td>C64x</td>
<td>Zero Waitstate Cache</td>
<td>Zero Waitstate Cache</td>
<td>L2 → L1P: 8 instr in 1-8 cycles</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>L2 → L1D: 64 bytes in:</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>L2 SRAM: 6 cycles</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>L2 Cache: 8 cycles</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Pipelined: 2 cycles</td>
</tr>
<tr>
<td>C64x+ C674x C66x</td>
<td>Zero Waitstate Cache/RAM</td>
<td>Zero Waitstate Cache/RAM</td>
<td>L2 → L1P: 8 instr in 1-8 cycles</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>L2 → L1D: 64 bytes in:</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>L2 SRAM: 12.5 cycles</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>L2 Cache: 14.5 cycles</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Pipelined: 4 cycles</td>
</tr>
</tbody>
</table>
Cache Coherency (or Incoherency?)

Coherency Example

- For this example, L2 is set up as cache
- Example’s Data Flow:
  - EDMA fills RcvBuf in DDR
  - CPU reads RcvBuf, processes data, and writes to XmtBuf
  - EDMA moves data from XmtBuf (e.g. to a serial port xmt)
Coherency – Reads & Writes

EDMA Writes Buffer - RCV

- Buffer (in external memory) written by the EDMA

CPU Reading Buffers - RCV

- CPU reads the buffer for processing
- This read causes a cache miss in L1D and L2
- The RcvBuf is added to both caches
  - Space is allocated in each cache
  - L2 is R/W allocate, L1 is read-allocate only

What happens on the WRITE?

Where Does the CPU Write To?

- After processing, the CPU writes to XmtBuf.
- Write misses to L1D are written directly to the next level of memory (L2).
- Thus, the write does not go directly to external memory.
- Cache line Allocated:
  - L1D on Read only
  - L2 on Read or Write

Coherency Issue – Write

- EDMA is set up to transfer the buffer from ext. mem.
- The buffer resides in cache, not in ext. memory.
- So, the EDMA transfers whatever is in ext. memory, probably not what you wanted.

What is the solution?
When the CPU is finished with the data (and has written it to XmtBuf in L2), it can be sent to ext. memory with a cache writeback. A writeback is a copy operation from cache to memory, writing back the modified (i.e. dirty) memory locations – all writebacks operate on full cache lines.

Use BIOS Cache APIs to force a writeback:

BIOS: `Cache_wb (XmtBuf, BUFFSIZE, CACHE_NOWAIT);`

What happens with the "next" RCV buffer?

EDMA writes a new RcvBuf buffer to ext. memory. When the CPU reads RcvBuf a cache hit occurs since the buffer (with old “stale” data) is still valid in cache. Thus, the CPU reads the old data instead of the new.

Coherency Solution – Read

- To get the new data, you must first invalidate the old data before trying to read the new data (clears cache line’s valid bits)
- Again, cache operations (writeback, invalidate) operate on cache lines
- BIOS provides an invalidate option:

  ```
  BIOS: Cache_inv (RcvBuf, BUFFSIZE, CACHE_WAIT);
  ```

Cache Functions – Summary

**BIOS Cache Functions Summary**

<table>
<thead>
<tr>
<th>Function</th>
<th>Function Call</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cache Invalidate</td>
<td><code>Cache_inv(blockPtr, byteCnt, wait)</code></td>
</tr>
<tr>
<td></td>
<td><code>Cache_invL1pAll()</code></td>
</tr>
<tr>
<td>Cache Writeback</td>
<td><code>Cache_wb(blockPtr, byteCnt, wait)</code></td>
</tr>
<tr>
<td></td>
<td><code>Cache_wbAll()</code></td>
</tr>
<tr>
<td>Invalidate &amp; Writeback</td>
<td><code>Cache_wbInv(blockPtr, byteCnt, wait)</code></td>
</tr>
<tr>
<td></td>
<td><code>Cache_wbInvAll()</code></td>
</tr>
<tr>
<td>Sync waiting for Cache</td>
<td><code>Cache_wait()</code></td>
</tr>
</tbody>
</table>

- blockPtr: start address of the range to be invalidated or written back
- byteCnt: number of bytes in the range
- wait: if set, wait until the operation is completed

What if the EDMA is reading/writing INTERNAL memory (L2)?
Coherency – Use Internal RAM!

Another Solution: Place Buffers in L2

- Configure some of L2 as RAM
- Locate buffers in this RAM space
- Coherency issues do not exist between L1D and L2

Coherency – Summary

Coherence Summary

Internal (L1/L2) Cache Coherency is Maintained
- Coherence between L1D and L2 is maintained by cache controller
- No Cache_fxn operations needed for data stored in L1D or L2 RAM
- L2 coherence operations implicitly operate upon L1, as well

Simple Rules for Error Free Cache (for DDR, L3)
- TAKING OWNERSHIP – Before the DSP begins reading a shared external INPUT buffer, it should first BLOCK INVALIDATE the buffer
- GIVING OWNERSHIP – After the DSP finishes writing to a shared external OUTPUT buffer, it should initiate an L2 BLOCK WRITEBACK
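
The two ownership rules can be seen in a toy software model. This is only a sketch: `ToyCache`, `toy_inv` and `toy_wb` are hypothetical names (not BIOS APIs), and the model assumes a single read/write-allocate, write-back line sitting over one line of external memory, roughly how L2 behaves in front of DDR.

```c
#include <assert.h>
#include <string.h>

#define LINE 128                        /* L2 line size from the slides  */

/* Toy model: one line of external memory and one cache line over it. */
typedef struct {
    unsigned char mem[LINE];            /* external memory (DDR)         */
    unsigned char line[LINE];           /* cached copy of that memory    */
    int valid;                          /* line holds a copy of mem      */
    int dirty;                          /* line was modified by the CPU  */
} ToyCache;

/* CPU read: a miss allocates the line, a hit returns the cached copy. */
unsigned char cpu_read(ToyCache *c, int i)
{
    assert(i >= 0 && i < LINE);
    if (!c->valid) {
        memcpy(c->line, c->mem, LINE);
        c->valid = 1;
        c->dirty = 0;
    }
    return c->line[i];
}

/* CPU write: lands in the cache only (write-back behaviour). */
void cpu_write(ToyCache *c, int i, unsigned char v)
{
    (void)cpu_read(c, i);               /* allocate the line first       */
    c->line[i] = v;
    c->dirty = 1;
}

/* DMA accesses bypass the cache and touch external memory directly. */
void dma_write(ToyCache *c, int i, unsigned char v) { c->mem[i] = v; }
unsigned char dma_read(const ToyCache *c, int i)    { return c->mem[i]; }

/* Stand-ins for the block invalidate and block writeback operations. */
void toy_inv(ToyCache *c) { c->valid = 0; c->dirty = 0; }

void toy_wb(ToyCache *c)
{
    if (c->valid && c->dirty) {
        memcpy(c->mem, c->line, LINE);
        c->dirty = 0;
    }
}
```

In this model a `cpu_read` after a `dma_write` keeps returning the old byte until `toy_inv` is called (taking ownership), and a `dma_read` after a `cpu_write` sees the old memory contents until `toy_wb` is called (giving ownership).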

DEBUG NOTE: An easy way to identify cache coherency problems is to allocate your buffers in L2. Problem goes away? It’s probably a cache coherency issue.

What about "cache alignment"?
Cache Alignment

**Problem:** How can I invalidate (or writeback) just the buffer?

*In this case, you can’t*

**Definition:** False Addresses are ‘neighbor’ data in the cache line, but outside the buffer range

**Why Bad:** Writing data to buffer marks the line ‘dirty’, which will cause entire line to be written to external memory, thus

*External neighbor memory could be overwritten with old data*

Avoid “False Address” problems by aligning buffers to cache lines (and filling entire line)

- Align memory to 128 byte boundaries
- Allocate memory in multiples of 128 bytes

```c
#define BUF 128
#pragma DATA_ALIGN (in, BUF)
short in[256];
```
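
To check whether a buffer drags "false addresses" along with it, you can compute the span of whole cache lines that a block operation would cover. The helpers below are a minimal sketch with hypothetical names, and they assume the line size is a power of two (128 bytes for L2 in this chapter).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round an address down to a multiple of the line size. */
uintptr_t line_floor(uintptr_t addr, uintptr_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0);   /* power of two */
    return addr & ~(line - 1);
}

/* Round an address up to a multiple of the line size. */
uintptr_t line_ceil(uintptr_t addr, uintptr_t line)
{
    return line_floor(addr + line - 1, line);
}

/* Bytes outside [addr, addr + size) that a block operation also covers. */
size_t false_bytes(uintptr_t addr, size_t size, uintptr_t line)
{
    uintptr_t first = line_floor(addr, line);
    uintptr_t last  = line_ceil(addr + size, line);
    return (size_t)(last - first) - size;
}
```

A result of zero means the buffer starts on a line boundary and fills whole lines, which is what the alignment and size rules above are meant to guarantee.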
Turning OFF Cacheability (MAR)

"Turn Off" the DATA Cache (MAR)

- Memory Attribute Registers (MARs) enable/disable DATA caching memory ranges
- Don't use MAR to solve basic cache coherency – performance will be too slow
- Use MAR when you have to always read the latest value of a memory location, such as a status register in an FPGA, or switches on a board.
- MAR is like “volatile”. You must use both to always read a memory location: MAR for cache; volatile for the compiler
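
The compiler half of that pairing might look like the sketch below. `wait_for_flag` is a hypothetical helper, not a BIOS call. The `volatile` qualifier only stops the compiler from reusing an earlier load; the address range must still be marked non-cacheable through its MAR bit for the read to reach the device.

```c
#include <assert.h>
#include <stdint.h>

/* Poll a memory-mapped status word until a bit in mask is set, or give up. */
int wait_for_flag(const volatile uint32_t *status, uint32_t mask, int max_polls)
{
    int polls;

    assert(status != 0);
    for (polls = 0; polls < max_polls; polls++) {
        if (*status & mask)      /* volatile forces a fresh load each pass */
            return 1;
    }
    return 0;                    /* flag never appeared */
}
```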

Looking more closely at the MAR registers ...

Memory Attribute Regs (MAR) – DATA

- Use MAR registers to enable/disable caching of external DATA ranges
- Useful when external data is modified outside the scope of the CPU
- You can specify MAR values in Config Tool

- C671x:
  - 16 MARs
  - 4 per CE space
  - Each handles 16MB
- C64x/C64x+/C674x:
  - Each handles 16MB
  - 256/224 MARs
  - 16 per CS space (on current C64x, some are rsvd)

Setting MARs in CFG files ...

Configure MAR via GCONF (C6748)

- First, add this line of script code to your .cfg file:
  ```
  var Cache = xdc.useModule('ti.sysbios.family.c64p.Cache');
  ```

- Then, modify the MAR settings:

Example: C6748 EVM MAR 192-223 (DDR2) turned ‘on’ (starting at address 0xC000_0000)

Memory Attribute Registers : MARs

- 256 MAR bits define cache-ability of 4G of addresses as 16MB groups
- Many 16MB areas not used or present on given board
- Example: Usable 6748 EMIF addresses at right
- EVM6748 memory is:
  - 128MB of DDR2 starting at 0xC000_0000
  - FLASH, NAND Flash, or SRAM in CS2 space at 0x6000_0000
- Note: with the C64x+ program memory is always cached regardless of MAR settings

<table>
<thead>
<tr>
<th>Start Address</th>
<th>End Address</th>
<th>Size</th>
<th>Space</th>
</tr>
</thead>
<tbody>
<tr>
<td>0x6000 0000</td>
<td>0x60FF FFFF</td>
<td>16MB</td>
<td>CS2</td>
</tr>
<tr>
<td>0x6200 0000</td>
<td>0x62FF FFFF</td>
<td>16MB</td>
<td>CS3</td>
</tr>
<tr>
<td>0x6400 0000</td>
<td>0x64FF FFFF</td>
<td>16MB</td>
<td>CS4</td>
</tr>
<tr>
<td>0x6600 0000</td>
<td>0x66FF FFFF</td>
<td>16MB</td>
<td>CS5</td>
</tr>
<tr>
<td>0xC000 0000</td>
<td>0xDFFF FFFF</td>
<td>512MB</td>
<td>DDR2</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>MAR</th>
<th>MAR Address</th>
<th>EMIF Address Range</th>
</tr>
</thead>
<tbody>
<tr>
<td>192</td>
<td>0x0184 8300</td>
<td>C000 0000 - C0FF FFFF</td>
</tr>
<tr>
<td>193</td>
<td>0x0184 8304</td>
<td>C100 0000 - C1FF FFFF</td>
</tr>
<tr>
<td>194</td>
<td>0x0184 8308</td>
<td>C200 0000 - C2FF FFFF</td>
</tr>
<tr>
<td>195</td>
<td>0x0184 830C</td>
<td>C300 0000 - C3FF FFFF</td>
</tr>
<tr>
<td>196</td>
<td>0x0184 8310</td>
<td>C400 0000 - C4FF FFFF</td>
</tr>
<tr>
<td>197</td>
<td>0x0184 8314</td>
<td>C500 0000 - C5FF FFFF</td>
</tr>
<tr>
<td>...</td>
<td></td>
<td></td>
</tr>
<tr>
<td>223</td>
<td></td>
<td>DF00 0000 - DFFF FFFF</td>
</tr>
</tbody>
</table>
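
Because each MAR covers one 16MB group of the 4G address space, the MAR number for an address is simply its top 8 bits. The helpers below are an illustrative sketch. The function names are hypothetical and are not part of the BIOS Cache module.

```c
#include <assert.h>
#include <stdint.h>

#define MAR_REGION_SHIFT 24u    /* each MAR covers 2^24 bytes = 16MB */

/* MAR number that controls the cacheability of a given address. */
unsigned mar_index(uint32_t addr)
{
    return (unsigned)(addr >> MAR_REGION_SHIFT);
}

/* First address covered by a given MAR. */
uint32_t mar_region_start(unsigned mar)
{
    assert(mar < 256u);
    return (uint32_t)mar << MAR_REGION_SHIFT;
}

/* Last address covered by a given MAR. */
uint32_t mar_region_end(unsigned mar)
{
    return mar_region_start(mar) | 0x00FFFFFFu;
}
```

For example, `mar_index(0xC0000000u)` gives 192 and `mar_index(0xDFFFFFFFu)` gives 223, matching the MAR 192-223 range quoted for DDR2.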
Additional Topics

**L1D: DATA_MEM_BANK Example**
- Only one L1D access per bank per cycle
- Use DATA_MEM_BANK pragma to begin paired arrays in different banks
- Note: sequential data do not run down a single bank; they run along a horizontal line across the banks, then continue on the next horizontal line
- Only even banks (0, 2, 4, 6) can be specified

```c
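#pragma DATA_MEM_BANK(a, 4);
short a[256];                    /* a[] starts in bank 4 */
#pragma DATA_MEM_BANK(x, 0);
short x[256];                    /* x[] starts in bank 0 */

int dotp(int count)              /* paired loads hit different banks */
{
    int i, sum = 0;
    for (i = 0; i < count; i++)
        sum += a[i] * x[i];
    return sum;
}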
```
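
To see why starting the two arrays four banks apart helps, here is a small model of the bank interleaving. It is a sketch under stated assumptions: eight banks, each 32 bits wide, and 16-bit elements. Those constants and the function names are illustrative, so check the bank organization of your device before relying on them.

```c
#include <assert.h>
#include <stdint.h>

#define BANK_WIDTH 4u   /* bytes per bank word (assumed 32-bit banks) */
#define NUM_BANKS  8u   /* assumed number of L1D banks                */

/* Bank that a byte address falls into: words go across the banks. */
unsigned l1d_bank(uintptr_t addr)
{
    return (unsigned)((addr / BANK_WIDTH) % NUM_BANKS);
}

/* Bank of element i of a 16-bit array whose first element is in base_bank. */
unsigned short_elem_bank(unsigned base_bank, unsigned i)
{
    assert(base_bank < NUM_BANKS);
    return (base_bank + (i * 2u) / BANK_WIDTH) % NUM_BANKS;
}
```

Under these assumptions, `short_elem_bank(4, i)` and `short_elem_bank(0, i)` always differ by four banks, so `a[i]` and `x[i]` can be loaded in the same cycle without a bank conflict.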

---

**Cache Optimization**

- Optimize for Level 1
- Multiple Ways and wider lines maximize efficiency –
  - *TI did this for you!*
- Main Goal - maximize line reuse before eviction
  - *Algorithms can be optimized for cache*
- “Touch Loops” can help with compulsory misses
  - *Run once thru loop in init code*
  - *Touch buffers to “pre-load” data cache*
- Up to 4 write misses can happen sequentially, but the next read or write will stall
  - *Bus has 4 deep buffer between CPU/L1 and beyond*
- Be smart about data output by one function then read by another (touch it first)
  - *When data is output by first function, where does it go?*
  - *If you touch output buffer first, then where will output data go?*
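
A touch loop can be as simple as reading one byte from every cache line of the buffer. The sketch below uses a hypothetical helper name and assumes the 64-byte L1D line size and a line-aligned buffer (an unaligned buffer could leave its last line untouched).

```c
#include <assert.h>
#include <stddef.h>

#define L1D_LINE 64u   /* L1D line size in bytes (assumed) */

/* Read one byte per cache line so each line is allocated before real use.
   Returns the number of lines touched. */
size_t touch(const void *buf, size_t bytes)
{
    const volatile unsigned char *p = (const volatile unsigned char *)buf;
    size_t i, lines = 0;

    assert(buf != NULL || bytes == 0);
    for (i = 0; i < bytes; i += L1D_LINE) {
        (void)p[i];              /* volatile read is not optimized away */
        lines++;
    }
    return lines;
}
```

Calling this once in init code on an input buffer moves the compulsory read misses out of the time-critical loop. Touching an output buffer first means later writes hit in L1D instead of passing through the write buffer.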
Updated Cache Documentation

- **Cache Reference**
  - More comprehensive description of C6000 cache
  - Revised terminology for cache coherence operations

- **Cache User’s Guide**
  - Cache Basics
  - Using C6000 Cache
  - Optimization for Cache Performance

---

Cache Aware Linking

**Goal**
Re-arrange functions to reduce L1P conflict misses

**How it works**
CGT v7.0 contains a new cache layout tool (clt6x). It takes dynamic profile info and creates a linker command file with a preferred function ordering, which guides the placement of function subsections.

**More Info**
http://processors.wiki.ti.com/index.php/Program_Cache_Layer

**Procedure**
1. Profile code for L1P cache misses - don’t solve a problem that doesn’t exist
2. Instrument your app by building with compiler option --gen_profile_info
3. Run instrumented app to generate profile data (.ppd)
4. Decode profile data file (.prf)
5. Generate WCG data (.csv) for each source file
6. Generate linker command file (.cmd file)
7. Re-build the app with optimized function ordering
Cache – General Terminology

- **Associativity**: The # of places a piece of data can map to inside the cache.
- **Coherence**: assuring that the most recent data gets written back from a cache when there is different data in the levels of memory
- **“Dirty”**: When an allocated cache line gets changed/updated by the CPU

- **Read-allocate cache**: only allocates space in the cache during a read miss. C64x+ L1 cache is read-allocate only.
- **Write-allocate cache**: only allocates space in the cache during a write miss.
- **Read-write-allocate cache**: allocates space in the cache for a read miss or a write miss. C64x+ L2 cache is read-write allocate.

- **Write-through cache**: updates to cache lines will go to ALL levels of memory such that a line is never “dirty” (less efficient than WB cache – more DDR xfrs).
- **Write-back cache**: updates occur only in the cache. The line is marked as “dirty” and if it is evicted, updates are pushed out to lower levels of memory. All C64x+ cache is write-back.
Chapter Quiz

1. How do you turn ON the cache?

2. Name the three types of caches & their associated memories:

3. All cache operations affect an aligned cache line. How big is a line?

4. Which bit(s) turn on/off “cacheability” and where do you set these?

5. How do you fix coherency when two bus masters access ext’l mem?

6. If a dirty (newly written) cache line needs to be evicted, how does that dirty line get written out to external memory?
Quiz – Answers

Chapter Quiz

1. How do you turn ON the cache?
   - *Set size > 0 in platform package* (or via `Cache_setSize()` during runtime)

2. Name the three types of caches & their associated memories:
   - *Direct Mapped (L1P), 2-way (L1D), 4-way (L2)*

3. All cache operations affect an aligned cache line. How big is a line?
   - *L1P – 32 bytes (256 bits), L1D – 64 bytes, L2 – 128 bytes*

4. Which bit(s) turn on/off “cacheability” and where do you set these?
   - *MAR (Mem Attribute Register), affects 16MB Ext’l data space, .cfg*

5. How do you fix coherency when two bus masters access ext’l mem?
   - *Invalidate before a read, writeback after a write (or use L2 mem)*

6. If a dirty (newly written) cache line needs to be evicted, how does that dirty line get written out to external memory?
   - *Cache controller takes care of this*
Lab 14 – Using Cache

In the following lab, you will gain some experience benchmarking the use of cache in the system. First, we’ll run the code with EVERYTHING (buffers, code, etc) off chip with NO cache. Then, we’ll turn on the cache and compare the results. Then, we’ll move everything ON chip and compare the cache results with using on-chip memory only.

This will provide a decent understanding of what you can expect when using cache in your own application.

Lab Overview:

There are two goals in this lab: (1) to learn how to turn on and off cache and the effects of each on the data buffers and program code; (2) to optimize a hi-pass FIR filter written in C. To gain this basic knowledge you will:

A. Learn to use the platform and CFG files to setup cache memory address range (MAR bits) and turn on L2 and L1 caches.
B. Benchmark the system performance with running code/data externally (DDR2) vs. with the cache on vs. internal (IRAM).
Lab 14 – Using Cache – Procedure

A. Run System From Internal RAM

1. Close all previous projects and import Lab14.
   
   This project is actually the solution for Lab 13 (OPT) – with all optimizations in place.
   ▶ Ensure the proper platform (student) and the latest XDC/BIOS/UIA versions are being used.

   **Note:** For all benchmarks throughout this lab, use the “Opt” build configuration when you build. Do NOT use the Debug or Release config.

2. Ensure BUFFSIZE is 256 in main.h.
   
   In order to compare our cache lab to the OPT lab, ▶ we need to make sure the buffer sizes are the same – which is 256.

3. Find out where code and data are mapped to in memory.
   
   ▶ First, check Build Properties for the Opt configuration. Make sure you are using YOUR student platform file in this configuration. ▶ Then, view the platform file and determine which memory segments (like IRAM) contain the following sections:

<table>
<thead>
<tr>
<th>Section</th>
<th>Memory Segment</th>
</tr>
</thead>
<tbody>
<tr>
<td>.text</td>
<td></td>
</tr>
<tr>
<td>.bss</td>
<td></td>
</tr>
<tr>
<td>.far</td>
<td></td>
</tr>
</tbody>
</table>

   It’s not so simple, is it? .bss and .far sections are “data” and .text is “code”. If you didn’t know that, you couldn’t answer the question. So, they are all allocated in IRAM – if not, please make sure they are before moving on.

4. Which cache areas are turned on/off (circle your answer)?
   
   L1P OFF/ON
   L1D OFF/ON
   L2 OFF/ON

   ▶ Leave the settings as is.

5. Build, load.
   
   **BEFORE YOU RUN,** ▶ open up the Raw Logs window.
   
   ▶ Click Run and write down below the benchmarks for cfir():

   **Data Internal (L1P/D cache ON): __________ cycles**

   The benchmark from the Log_info should be around 8K cycles. We’ll compare this “internal RAM” benchmark to “all external” and “all external with cache ON” numbers. You just might be surprised…
6. Place the buffers (data) in external DDR2 memory and turn OFF the cache.
   ► Edit your platform file and place the data external (DDR). Leave stacks and code in IRAM. Modify the L1P/D cache sizes to ZERO (0K).
   In this scenario, the audio data buffers are all external. Cache is not turned on. This is the worst case situation.
   Do you expect the audio to sound ok? ____________________
   ► Match the settings you see below (0K for all cache sizes, Data Memory in DDR):

   ![Memory Configuration Diagram]

   ► Select Project → Clean (this will ensure your platform file is correct). ► Then Build and load your code. ► Run your code. ► Listen to the audio – how does it sound? It’s DEAD – that’s how it sounds – just air – bad air – it is the absence of noise. Plus, we can’t see anything because the CPU is overloaded and therefore no RTA tools.
   Ah, but Log_info() just might save us again. ► Go look at the Raw Logs and see if the benchmark is getting reported.

   All Code/Data External: ___________ cycles

   Did you get a cycle count? The author experienced a total loss – absolute NOTHING. I think the system is so out of it, it crashes. In fact, CCS crashed a few times in this mode. Yikes. I vote for calling it “the national debt” #cycles – uh, what is it now – $15 Trillion? Ok, 15 trillion cycles... ;-(
C. Run System From DDR2 (cache ON)

8. Turn on the cache (L1P/D, L2) in the platform file.
   ► Choose the following settings for the cache (L2=64K, L1P/D = 32K):

   ![Cache Settings]

   ▶ Set L1D/P to 32K and L2 to 64K — **IF YOU DON’T SET L2 CACHE ON, YOU WILL CACHE IN L1 ONLY.** Watch it, though, when you reconfigure cache sizes, it wipes your memory sections selections. Redo those properly after you set the cache sizes.

   These sizes are larger than we need, but it is good enough for now. Leave code/data in DDR and stacks in IRAM. ▶ Click Ok to rebuild the platform package.

   The system we now have is identical to one of the slides in the discussion material.

9. Wait – what about the MAR bits?

   In the discussion material, we talked about the MAR bits specifying which regions were cacheable and which were not. Don’t we have to set the MAR bits for the external region of DDR for them to get cached? Yep.

   In order to modify (or even SEE) the MAR bits OR use any BIOS Cache APIs (like invalidate or writeback), ▶ you need to add the C64p Cache Module to your .cfg file. Or, you can simply right-click (and Use) the Cache module listed under: Available Products → SYS/BIOS Target Specific Support → C674 →Cache (as shown in the discussion material).
► **Save the .cfg file.** This SHOULD add the module to your outline view. When it shows up in the outline view, click on it. Do you see the MAR bits?

The MAR region we are interested in, by the way, for DDR2 is MAR 192-223. As a courtesy to users, the platform file already turned on the proper MAR bits for us for the DDR2 region.

Check it out:

![MAR Region Table](image)

The good news is that we don’t need to worry about the MAR bits for now.

10. **Build, load, run –using the Opt (duh) Configuration.**

► Run the program. View the CPU load graph and benchmark stat and write them down below:

**All Code/Data External (cache “ON”): __________ cycles**

With code/data external AND the cache ON, the benchmark should be close to 8K cycles – the SAME as running from internal IRAM (L2). In fact, what you’re seeing is the L1D/P numbers. Why? Because L2 is cached in L1D/P – the closest memory to the CPU. This is what a cache does for you – especially with this architecture.

Here’s what the author got:
11. What about cache coherency?

So, how does the audio sound with the buffers in DDR2 and the cache on? Shouldn’t we be experiencing cache coherency problems with data in DDR2? Well, the audio sounds great, so why bother? Think about this for a while. What is your explanation as to why there are NO cache coherency problems in this lab?

Answer: _______________________________________________________________

12. Conclusion and Summary – long read – but worth it…

It is amazing that you get the same benchmarks from all code/data in internal IRAM (L2) and L1 cache turned on as you do with code/data external and L2/L1 cache turned on. In fact, if you place the buffers DIRECTLY in L1D as SRAM, the benchmark is the same. How can this be? That’s an efficient cache, eh? Just let the cache do its thing. Place your buffers in DDR2, turn on the cache and move on to more important jobs.

Here’s another way to look at this. Cache is great for looping code (program, L1P) and sequentially accessed data (e.g. buffers). However, cache is not as effective at random access of variables. So, what would be a smart choice for part of L1D as SRAM? Coefficient tables, algorithm tables, globals and statics that are accessed frequently, but randomly (not sequential) and even frequently used ISRs (to avoid cache thrashing). The random data items would most likely fall into the .bss compiler section. Keep that in mind as you design your system.

Let’s look at the final results:

<table>
<thead>
<tr>
<th>System</th>
<th>benchmark</th>
</tr>
</thead>
<tbody>
<tr>
<td>Buffers in IRAM (internal)</td>
<td>8K cycles</td>
</tr>
<tr>
<td>All External (DDR2), cache OFF</td>
<td>~4M</td>
</tr>
<tr>
<td>All External (DDR2), cache ON</td>
<td>8K cycles</td>
</tr>
<tr>
<td>Buffers in L1D SRAM</td>
<td>7K cycles</td>
</tr>
</tbody>
</table>

So, will you experience the same results? 150x improvement with cache on and not much difference between internal memory only and external with cache on? Probably something similar. The point here is that turning the cache ON is a good idea. It works well – and there is little thinking that is required unless you have peripherals hooked to external memory (coherency). For what it is worth, you’ve seen the benefits in action and you know the issues and techniques that are involved. Mission accomplished.

RAISE YOUR HAND and get the instructor’s attention when you have completed PART A of this lab. If time permits, move on to the next OPTIONAL part…

STOP

You're finished with this lab. If time permits, you may move on to additional “optional” steps on the following pages if they exist.
Using EDMA3

Introduction

In this chapter, you will learn the basics of the EDMA3 peripheral. This transfer engine in the C64x+ architecture can perform a wide variety of tasks within your system from memory to memory transfers to event synchronization with a peripheral and auto sorting data into separate channels or buffers in memory. No programming is covered. For programming concepts, see ACPY3/DMAN3, LLD (Low Level Driver – covered in the Appendix) or CSL (Chip Support Library). Heck, you could even program it in assembly, but don’t call ME for help. 😊

Objectives

At the conclusion of this module, you should be able to:

• Understand the basic terminology related to EDMA3
• Be able to describe how a transfer starts, how it is configured and what happens after the transfer completes
• Understand how EDMA3 interrupts are generated
• Be able to easily read EDMA3 documentation and have a great context to work from to program the EDMA3 in your application
Module Topics

Using EDMA3

Module Topics

Overview
  What is a “DMA”?
  Multiple “DMAs”
  EDMA3 in C64x+ Device

Terminology
  Overview
  Element, Frame, Block – ACNT, BCNT, CCNT
  Simple Example
  Channels and PARAM Sets

Examples

Synchronization

Indexing

Events – Transfers – Actions
  Overview
  Triggers
  Actions – Transfer Complete Code

EDMA Interrupt Generation

Linking

Chaining

Channel Sorting

Architecture & Optimization

Programming EDMA3 – Using Low Level Driver (LLD)

Chapter Quiz
  Quiz – Answers

Additional Information

Notes
Overview

What is a “DMA”?

- **EDMA3** – “Enhanced” DMA handles 64 DMA CHs and 8 QDMA CHs
  - DMA – 64 channels that can be triggered manually or by events/chaining
  - QDMA – 8 channels of “Quick” DMA triggered by writing to a “trigger word”

- **IDMA** – 2 CHs of “Internal” DMA (Periph Cfg, Xfr L1 ↔ L2)

- **Peripheral “DMA”s** – Each master device hooked to the Switched Central Resource (SCR) has its own DMA (e.g. SRIO, EMAC, etc.)
Multiple “DMAs”

**Multiple DMA’s : EDMA3 and QDMA**

- **VPSS**
  - Master Periph
- **EDMA3** (System DMA)
  - DMA (sync)
  - QDMA (async)
- **C64x+ DSP**
  - L1P
  - L1D
  - L2

**DMA**
- Enhanced DMA (version 3)
- DMA to/from peripherals
- Can be sync’d to peripheral events
- Handles up to 64 events

**QDMA**
- Quick DMA
- DMA between memory
- Async – must be started by CPU
- 4-16 channels available

**Both Share** (number depends upon specific device)
- 128-256 Parameter RAM sets (PARAMs)
- 64 transfer complete flags
- 2-4 Pending transfer queues

**Multiple DMA’s : Master Periph & C64x+ IDMA**

- **VPSS**
  - Front End (capture)
  - Back End (display)
- **Master Periph’s**
  - USB
  - ATA
  - Ethernet
  - VLYNQ
  - Master Peripherals
    - VPSS (and other master periph’s) include their own DMA functionality
    - USB, ATA, Ethernet, VLYNQ share bus access to SCR
- **EDMA3** (System DMA)
  - DMA (sync)
  - QDMA (async)
- **C64x+ DSP**
  - L1P
  - L1D
  - L2

**IDMA**
- Built into all C64x+ DSPs
- Performs moves between internal memory blocks and/or config bus
- Don’t confuse with iDMA API

**Notes:**
- Both ARM and DSP can access the EDMA3
- Only DSP can access hardware IDMA
EDMA3 in C64x+ Device

- EDMA3 is a master on the DATA SCR – it can initiate data transfers
- EDMA3’s configuration registers are accessed via the CFG SCR (by the CPU)
- Each TC has its own connection (and priority) to the DATA SCR. Refer to the connection matrix to determine valid connections
Terminology

Overview

DMA: Direct Memory Access

Goal:
- Copy from memory to memory – HARDWARE memcpy(dst, src, len);
- Faster than CPU LD/ST. One INT per block vs. one INT per sample

Examples:
- Import raw data from off-chip to on-chip before processing
- Export results from on-chip to off-chip afterward

Controlled by:
- Transfer Configuration (i.e. Parameter Set - aka PaRAM or PSET)
- Transfer configuration primarily includes 8 control registers

Transfer Configuration

Source
Length
Destination
Element, Frame, Block – ACNT, BCNT, CCNT

How Much to Move?

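Transfer Configuration

<table>
<thead>
<tr>
<th>Bits 31–16</th>
<th>Bits 15–0</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="2">Options</td>
</tr>
<tr>
<td colspan="2">Source</td>
</tr>
<tr>
<td>B Count (# Elements)</td>
<td>A Count (Element Size)</td>
</tr>
<tr>
<td colspan="2">Destination</td>
</tr>
<tr>
<td>Index (DSTBIDX)</td>
<td>Index (SRCBIDX)</td>
</tr>
<tr>
<td>Cnt Reload (BCNTRLD)</td>
<td>Link Addr (LINK)</td>
</tr>
<tr>
<td>Index (DSTCIDX)</td>
<td>Index (SRCCIDX)</td>
</tr>
<tr>
<td>Rsvd</td>
<td>C Count (# Frames)</td>
</tr>
</tbody>
</table>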

Let's look at a simple example...

Simple Example

Example – How do you VIEW the transfer?

Let's start with a simple example – or is it simple?
We need to transfer 12 bytes from “here” to “there”.

Note: these are contiguous memory locations

What is ACNT, BCNT and CCNT? Hmmmm....
You can “view” the transfer several ways:

Which “view” is the best? Well, that depends on what your system needs and the type of sync and indexing (covered later...)
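
The different "views" of the same 12 bytes can be tried out in a small software model. This is a hypothetical sketch, not a TI API: `Xfer` and `xfer_run` are made-up names, the whole transfer runs in one call (no sync events), and the C indexes are assumed to be measured from the start of one frame to the start of the next.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical software model of one 3-D transfer (not a TI API). */
typedef struct {
    unsigned acnt, bcnt, ccnt;  /* bytes per array, arrays per frame, frames */
    int srcbidx, dstbidx;       /* byte step between arrays                  */
    int srccidx, dstcidx;       /* byte step between frames                  */
} Xfer;

/* Copy ccnt frames of bcnt arrays of acnt bytes; returns bytes moved. */
unsigned xfer_run(const Xfer *x, const unsigned char *src, unsigned char *dst)
{
    unsigned b, c, moved = 0;

    assert(x->acnt > 0);
    for (c = 0; c < x->ccnt; c++) {
        for (b = 0; b < x->bcnt; b++) {
            memcpy(dst + (long)c * x->dstcidx + (long)b * x->dstbidx,
                   src + (long)c * x->srccidx + (long)b * x->srcbidx,
                   x->acnt);
            moved += x->acnt;
        }
    }
    return moved;
}
```

With this model, `{12, 1, 1, 0, 0, 0, 0}`, `{4, 3, 1, 4, 4, 0, 0}` and `{1, 4, 3, 1, 1, 4, 4}` all move the same 12 contiguous bytes. What differs on real hardware is how many sync events are needed and how the indexes can be used.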
Channels and PARAM Sets

**C6748 – EDMA Channel/Parameter RAM Sets**

- EDMA3 has 128-256 Parameter RAM sets (PSETs) that contain configuration information about a transfer.
- 64 DMA CHs and 8 QDMA CHs can be mapped to any one of the 256 PSETs and then triggered to run (by various methods).

![Diagram showing EDMA Channel/Parameter RAM Sets]

- Each PSET contains 12 fields (packed into eight 32-bit registers):
  - Options (interrupt, chaining, sync mode, etc)
  - 4 SRC/DST Indexes (bump addr after xfr)
  - SRC/DST addresses
  - BCNTRLD (BCNT reload for 3D xfrs)
  - ACNT/BCNT/CCNT (size of transfer)
  - LINK (pointer to another PSET)

*Note: PSETs are dedicated EDMA RAM (not part of IRAM)*
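
One way to picture a PSET is as a 32-byte C structure. The sketch below is illustrative only: the type and field names are hypothetical (a real project would use the definitions from the LLD or CSL headers), and the field order assumes a little-endian view in which the low half of each 32-bit word comes first.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C view of one PSET: eight 32-bit words, 32 bytes total. */
typedef struct {
    uint32_t opt;        /* options: interrupt, chaining, sync mode ... */
    uint32_t src;        /* source address                              */
    uint16_t acnt;       /* bytes per array (element size)              */
    uint16_t bcnt;       /* arrays per frame                            */
    uint32_t dst;        /* destination address                         */
    int16_t  srcbidx;    /* source step between arrays                  */
    int16_t  dstbidx;    /* destination step between arrays             */
    uint16_t link;       /* pointer to the next PSET                    */
    uint16_t bcntrld;    /* BCNT reload value                           */
    int16_t  srccidx;    /* source step between frames                  */
    int16_t  dstcidx;    /* destination step between frames             */
    uint16_t ccnt;       /* number of frames                            */
    uint16_t rsvd;       /* reserved                                    */
} Pset;

/* Total bytes moved when the whole transfer runs to completion. */
uint32_t pset_total_bytes(const Pset *p)
{
    assert(p != 0);
    return (uint32_t)p->acnt * p->bcnt * p->ccnt;
}
```

On common ABIs `sizeof(Pset)` is 32, which is how the twelve fields fit into the eight control registers mentioned in the Terminology overview.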
Examples

EDMA Example: Simple (Horizontal Line)

Goal:
Transfer 4 elements from loc_8 to myDest

<table>
<thead>
<tr>
<th></th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>10</td>
<td>11</td>
<td>12</td>
</tr>
<tr>
<td></td>
<td>13</td>
<td>14</td>
<td>15</td>
<td>16</td>
<td>17</td>
<td>18</td>
</tr>
<tr>
<td></td>
<td>19</td>
<td>20</td>
<td>21</td>
<td>22</td>
<td>23</td>
<td>24</td>
</tr>
<tr>
<td></td>
<td>25</td>
<td>26</td>
<td>27</td>
<td>28</td>
<td>29</td>
<td>30</td>
</tr>
</tbody>
</table>

- DMA always increments across ACNT fields
- B and C counts must be 1 (or more) for any actions to occur
- Any indexing needed?

Source = &loc_8
ACNT = 4
BCNT = 1
CCNT = 1
Destination = &myDest
Is there another way to set this up?

- Here, ACNT was defined as element size: 1 byte
- Therefore, BCNT will now be framesize: 4 bytes
- B indexing (after ACNT is transferred) must now be specified as well
- BIDX often = ACNT for contiguous operations

Why is this a ‘less efficient’ version?
EDMA Example: Indexing (Vertical Line)

Goal:
Transfer 4 vertical elements from loc_8 to a port

- ACNT is again defined as element size: 1 byte
- Therefore, BCNT is still framesize: 4 bytes
- SRCBIDX now will be 6 – skipping to the same column in the next row
- DSTBIDX now will be 2

<table>
<thead>
<tr>
<th>Source</th>
<th>Destination</th>
</tr>
</thead>
<tbody>
<tr>
<td>4 = BCNT</td>
<td>2 = DSTBIDX</td>
</tr>
<tr>
<td>2 = ACNT</td>
<td>2 = SRCBIDX</td>
</tr>
<tr>
<td>0 = CCNT</td>
<td>0 = CCNT</td>
</tr>
</tbody>
</table>

myDest:

8
14
20
26

- &loc_8
- &myDest
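
The vertical-line setup can be checked on a host PC with a small model of a single frame transfer. This is only a sketch: `Frame2D`, `frame2d_copy` and `grid_fill` are hypothetical names invented for illustration (not part of any TI library), and the 6-column grid is assumed to be a 30-byte array whose entry `i` holds the label `i + 1`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side model of one EDMA frame (2D) transfer; not a TI API. */
typedef struct {
    const uint8_t *src;  /* source start address              */
    uint8_t *dst;        /* destination start address         */
    uint16_t acnt;       /* bytes per array                   */
    uint16_t bcnt;       /* arrays per frame                  */
    int16_t srcbidx;     /* bytes between source arrays       */
    int16_t dstbidx;     /* bytes between destination arrays  */
} Frame2D;

/* Copy bcnt arrays of acnt bytes, advancing each side by its own B index. */
static void frame2d_copy(const Frame2D *f)
{
    ptrdiff_t soff = 0, doff = 0;

    assert(f->acnt >= 1 && f->bcnt >= 1);  /* a count of 0 moves nothing */
    for (uint16_t b = 0; b < f->bcnt; b++) {
        memcpy(f->dst + doff, f->src + soff, f->acnt);
        soff += f->srcbidx;
        doff += f->dstbidx;
    }
}

/* Fill a 30-entry grid so that grid[i] holds the label i + 1. */
static void grid_fill(uint8_t grid[30])
{
    for (int i = 0; i < 30; i++) {
        grid[i] = (uint8_t)(i + 1);
    }
}
```

With `Frame2D f = { &grid[7], dst, 1, 4, 6, 2 };` the model reads labels 8, 14, 20 and 26 and writes them to `dst[0]`, `dst[2]`, `dst[4]` and `dst[6]`, because the destination index of 2 leaves a one-byte gap between writes.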

EDMA Example: Block Transfer (less efficient)

Goal:
Transfer a 5x4 subset from loc_8 to myDest

- ACNT is defined here as ‘short’ element size: 2 bytes
- BCNT is again framesize: 4 elements
- CCNT now will be 5 – as there are 5 frames
- SRCCIDX skips to the next frame

<table>
<thead>
<tr>
<th>Source</th>
<th>Destination</th>
</tr>
</thead>
<tbody>
<tr>
<td>4 = BCNT</td>
<td>2 = DSTCIDX</td>
</tr>
<tr>
<td>2 = ACNT</td>
<td>2 = SRCCIDX</td>
</tr>
</tbody>
</table>

Frame

myDest:

8
9
10
11
14
15

- &loc_8
- &myDest

(2 bytes going from block 8 to 9)

(3 elements from block 11 to 14)
### EDMA Example: Block Transfer (more efficient)

**Goal:**
Transfer a 5x4 subset from `loc_8` to `myDest`

<table>
<thead>
<tr>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
</tr>
</thead>
<tbody>
<tr>
<td>7</td>
<td>Elem 1</td>
<td>12</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>8</td>
<td>Elem 2</td>
<td>18</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>9</td>
<td>Elem 3</td>
<td>24</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>10</td>
<td>Elem 4</td>
<td>30</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>11</td>
<td>Elem 5</td>
<td>36</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

16-bit Pixels

myDest:
<table>
<thead>
<tr>
<th>8</th>
</tr>
</thead>
<tbody>
<tr>
<td>9</td>
</tr>
<tr>
<td>10</td>
</tr>
<tr>
<td>11</td>
</tr>
<tr>
<td>14</td>
</tr>
<tr>
<td>15</td>
</tr>
<tr>
<td>...</td>
</tr>
</tbody>
</table>

- ACNT is defined here as the entire frame: 4 * 2 bytes
- BCNT is the number of frames: 5
- CCNT now will be 1
- SRCBIDX skips to the next frame

```
Source = &loc_8
Destination = &myDest

ACNT = 8 (4*2)
BCNT = 5
CCNT = 1

DSTBIDX = 8 (4*2)
SRCBIDX = 12 (6*2, from block 8 to 14)

DSTCIDX = 0, SRCCIDX = 0
```
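
The "one array per line" values can be derived from the image geometry instead of being counted by hand. The sketch below assumes a packed destination buffer and uses a hypothetical helper, `block_params`, written only for this illustration; it is not part of the EDMA3 LLD.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the counts and B indexes of one transfer. */
typedef struct {
    uint16_t acnt, bcnt, ccnt;
    int16_t srcbidx, dstbidx;
} BlockParams;

/* Derive the setup that moves one whole line of the sub-block per array.
 * Assumes the destination buffer is packed (lines stored back to back). */
static BlockParams block_params(uint16_t image_width_px, uint16_t pixel_bytes,
                                uint16_t block_width_px, uint16_t block_lines)
{
    BlockParams p;

    assert(block_width_px <= image_width_px);
    p.acnt = (uint16_t)(block_width_px * pixel_bytes);    /* one whole line    */
    p.bcnt = block_lines;                                 /* one array per line */
    p.ccnt = 1;                                           /* a single frame    */
    p.srcbidx = (int16_t)(image_width_px * pixel_bytes);  /* source line pitch */
    p.dstbidx = (int16_t)p.acnt;                          /* packed destination */
    return p;
}
```

For the 6-pixel-wide image of 16-bit pixels and the 4-pixel by 5-line subset used here, `block_params(6, 2, 4, 5)` gives ACNT = 8, BCNT = 5, CCNT = 1, SRCBIDX = 12 and DSTBIDX = 8.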
Synchronization

“A” – Synchronization
- An event (like the McBSP receive register full) triggers the transfer of exactly 1 array of ACNT bytes (2 bytes)
- Example: McBSP tied to a codec (you want to sync each transfer of a 16-bit word to the receive buffer being full or the transmit buffer being empty).

“AB” – Synchronization
- An event triggers a two-dimensional transfer of BCNT arrays of ACNT bytes (A*B)
- Example: Line of video pixels (each line has BCNT pixels consisting of 3 bytes each – Y, Cb, Cr)
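
To make the difference between the two sync modes concrete, the sketch below counts how many sync events a complete transfer needs and how many bytes each event moves. `SyncMode`, `SyncCost` and `sync_cost` are hypothetical names for this illustration only, and the 720 x 480 video geometry used in the note after the code is an assumed example, not a value from this workshop.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { SYNC_A, SYNC_AB } SyncMode;

typedef struct {
    uint32_t events;           /* sync events needed for the whole transfer */
    uint32_t bytes_per_event;  /* bytes moved by the TR each event submits  */
} SyncCost;

/* A-sync: one event per array. AB-sync: one event per frame. */
static SyncCost sync_cost(SyncMode mode, uint16_t acnt, uint16_t bcnt,
                          uint16_t ccnt)
{
    SyncCost c;

    assert(acnt >= 1 && bcnt >= 1 && ccnt >= 1);
    if (mode == SYNC_A) {
        c.events = (uint32_t)bcnt * ccnt;
        c.bytes_per_event = acnt;
    } else {
        c.events = ccnt;
        c.bytes_per_event = (uint32_t)acnt * bcnt;
    }
    return c;
}
```

With ACNT = 3 (Y, Cb, Cr), BCNT = 720 pixels per line and CCNT = 480 lines, AB-sync needs 480 events of 2160 bytes each, while A-sync would need 345600 events of 3 bytes each. The total byte count is the same in both cases.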
Indexing

Indexing – ‘BIDX, ‘CIDX

- EDMA3 has two types of indexing: ‘BIDX and ‘CIDX
- Each index can be set separately for SRC and DST (next slide...)
- BIDX = index in bytes between ACNT arrays (same for A-sync and AB-sync)
- CIDX = index in bytes between BCNT frames (different for A-sync vs. AB-sync)
- ‘BIDX/‘CIDX: signed 16-bit, -32768 to +32767

- ‘CIDX distance is calculated from the starting address of the previously transferred block (array for A-sync, frame for AB-sync) to the next frame to be transferred.

Indexed Transfers

- EDMA3 has 4 indexes allowing higher flexibility for complex transfers:
  - SRCBIDX = # bytes between arrays (Ex: SRCBIDX = 2)
  - SRCCIDX = # bytes between frames (Ex: SRCCIDX_A = 2, SRCCIDX_AB = 4)
  - Note: ‘CIDX depends on the synchronization used – “A” or “AB”
  - DSTBIDX = # bytes between arrays (Ex: DSTBIDX = 3)
  - DSTCIDX = # bytes between frames (Ex: DSTCIDX_A = 5, DSTCIDX_AB = 8)

Note: ACNT = 1, BCNT = 2, CCNT = ____
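
The dependence of ‘CIDX on the sync mode can be written as a small calculation. The sketch below assumes that frames are evenly spaced by a "frame pitch" (bytes from the start of one frame to the start of the next); the function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* A-sync: CIDX is measured from the start of the LAST array in the frame,
 * so the (bcnt - 1) B-index steps already taken are subtracted. */
static int16_t cidx_for_a_sync(int32_t frame_pitch, uint16_t bcnt, int16_t bidx)
{
    int32_t cidx;

    assert(bcnt >= 1);
    cidx = frame_pitch - (int32_t)(bcnt - 1) * bidx;
    assert(cidx >= INT16_MIN && cidx <= INT16_MAX);  /* signed 16-bit field */
    return (int16_t)cidx;
}

/* AB-sync: CIDX is measured from the start of the frame itself. */
static int16_t cidx_for_ab_sync(int32_t frame_pitch)
{
    assert(frame_pitch >= INT16_MIN && frame_pitch <= INT16_MAX);
    return (int16_t)frame_pitch;
}
```

Checked against the Indexed Transfers example values: a source frame pitch of 4 with BCNT = 2 and SRCBIDX = 2 gives SRCCIDX_A = 2 and SRCCIDX_AB = 4; a destination frame pitch of 8 with DSTBIDX = 3 gives DSTCIDX_A = 5 and DSTCIDX_AB = 8.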
Example – Using Indexing

- Remember this example? Ok, so for each “view”, fill in the proper SOURCE index values:

- Which “view” is the best? Well, that depends on what you are transferring from/to and which sync mode is used.
Events – Transfers – Actions

Overview

EDMA3 Basics Review

- **Count** – How many items to move
  - A, B, and C counts
- **Addresses** – the source & destination addresses
- **Index** – How far to increment the src/dst after each transfer

**T** (transfer config)

- **Event** – triggers the transfer to begin
- **Transfer** – the transfer config describes the transfers to be executed when triggered
- **Resulting Action** – what do you want to happen after the transfer is complete?

Let's look at triggers (events) and actions in more detail...
Triggers

How to TRIGGER a Transfer

- There are 3 ways to trigger an EDMA transfer:
  1. Event Sync from peripheral
  2. Manually Trigger the Channel to Run
  3. Chain Event from another channel (more details later...)

Actions – Transfer Complete Code

- TCC is generated when a transfer completes. This is referred to as the “Final TCC”.
- TCC can be used to trigger an EDMA interrupt and/or another transfer (chaining)
- Each TR below is a “transfer request” which can be either ACNT bytes (A-sync) or ACNT * BCNT bytes (AB-sync). Final TCC only occurs after the LAST TR.
EDMA Interrupt Generation

Generate EDMA Interrupt (Setting IER bit)

<table>
<thead>
<tr>
<th>EDMA Channels</th>
<th>EDMA Interrupt Generation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Channel #</td>
<td>Options</td>
</tr>
<tr>
<td>0</td>
<td>TCINTEN=0</td>
</tr>
<tr>
<td>1</td>
<td>TCINTEN=0</td>
</tr>
<tr>
<td>...</td>
<td>TCINTEN=1</td>
</tr>
<tr>
<td>63</td>
<td>TCINTEN=0</td>
</tr>
</tbody>
</table>

- Use EDMA3 Low-Level Driver (LLD) to program EDMA’s IER bits
- 64 Channels and ONE interrupt? How do you determine WHICH channel completed?

EDMA Interrupt Dispatcher

Here’s the interrupt chain from beginning to end:

1. An interrupt occurs
2. Interrupt Selector
3. HWI_INT5 Properties
   - General
   - Dispatcher
   - Interrupt selection number
   - Function: _edma_evt_dispatcher_
4. EDMA Interrupt Dispatcher
   - Read IPR bits
   - Determine which one is set
   - Call corresponding handler (ISR) in Fxn Table
5. ISR (interrupt handler)
   - void edma_rcv_ierv (void)
   - SEM_post (&semaphore);

How does the ISR Fxn Table (in #4 above) get loaded with the proper handler Fxn names?

Use EDMA3 LLD to program the proper callback fxn for this HWI.
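
The dispatcher logic in step 4 (read IPR, find the set bits, call the matching handler) can be modeled in plain C. This is a host-side sketch, not the LLD implementation: the names are hypothetical, the two 32-bit pending registers are folded into one 64-bit variable, and clearing a bit in that variable stands in for whatever clear mechanism the hardware uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TCC 64

typedef void (*EdmaHandler)(void);

/* Function table: one optional callback per transfer complete code. */
static EdmaHandler fxn_table[NUM_TCC];

static void register_handler(unsigned tcc, EdmaHandler handler)
{
    assert(tcc < NUM_TCC);
    fxn_table[tcc] = handler;
}

/* Scan the pending bits, clear each one found, call its handler.
 * Returns the number of handlers that were called. */
static unsigned edma_dispatch(uint64_t *ipr)
{
    unsigned called = 0;

    for (unsigned tcc = 0; tcc < NUM_TCC; tcc++) {
        uint64_t mask = (uint64_t)1 << tcc;
        if (*ipr & mask) {
            *ipr &= ~mask;               /* model of clearing the pending bit */
            if (fxn_table[tcc] != NULL) {
                fxn_table[tcc]();
                called++;
            }
        }
    }
    return called;
}

/* Example handler: counts calls where a real ISR would post a semaphore. */
static int rcv_count;
static void edma_rcv_isr(void)
{
    rcv_count++;
}
```

Registering `edma_rcv_isr` for TCC 8 and dispatching with only bit 8 pending calls the handler once and leaves the pending word at zero.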
Linking

Linking – “Action” – Overview

- Need: auto-reload channel with new config
  - Ex1: do the same transfer again
  - Ex2: ping/pong system (covered later)
- Solution: use linking to reload Ch config
- Concept:
  - Linking two or more channels together allows the EDMA to auto-reload a new configuration when the current transfer is complete.
  - Linking still requires a “trigger” to start the transfer (manual, chain, event).
  - You can link as many PSETs as you like – it is only limited by the #PSETs on a device.

- How does linking work?
  - User must specify the LINK field in the config to link to another PSET.
  - When the current xfr (0) is complete, the EDMA auto reloads the new config (1) from the linked PSET.

Note: Does NOT start xfr!!
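
The reload behavior can be sketched as a copy between parameter sets. In this simplified model the `link` field is an array index (an assumption made for readability; it is not the hardware encoding), `LINK_NONE` is a hypothetical marker for "nothing linked", and a channel with no link is simply left with zero counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINK_NONE 0xFFFFu   /* hypothetical "no link" marker for this model */

typedef struct {
    uint32_t src, dst;
    uint16_t acnt, bcnt, ccnt;
    uint16_t link;          /* index of the PSET to reload from */
} PSet;

/* Model of what happens when the transfer described by psets[ch] completes:
 * the linked PSET is copied over the channel's PSET. Nothing is triggered. */
static void pset_reload_on_complete(PSet *psets, size_t num_psets, size_t ch)
{
    uint16_t link;

    assert(ch < num_psets);
    link = psets[ch].link;
    if (link == LINK_NONE) {
        /* Assumption for this model: an unlinked channel ends up empty. */
        psets[ch].acnt = 0;
        psets[ch].bcnt = 0;
        psets[ch].ccnt = 0;
        return;
    }
    assert(link < num_psets);
    psets[ch] = psets[link];   /* the copy brings its own link field along */
}
```

Because the copied PSET carries its own link field, two reload PSETs that link to each other make the channel alternate between two configurations, which is the basis of a ping/pong system. The channel still needs a trigger before the reloaded transfer runs.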
Chaining

Reminder – Triggering Transfers

- There are 3 ways to trigger an EDMA transfer:
  1. Event Sync from peripheral
     ![Diagram of Event Sync]
     - ER = Event Register (flag)
     - EER = Event Enable Register (user)
     - Start Ch Xfr
  2. Manually Trigger the Channel to Run
     ![Diagram of Manual Trigger]
     - ESR = Event Set Register (user)
     - Start Ch Xfr
  3. Chain Event from another channel (next example...)
     ![Diagram of Chain Event]
     - TCCHEN = TC Chain Enable (OPT)
     - Start Ch Xfr

Let’s do a simple example on chaining...

Chaining – “Action” & “Event” – Overview

- Need: When one transfer completes, trigger another transfer to run
  - Ex: ChX completes, kicks off ChY
- Solution: Use chaining to kick off next xfr
- Concept:
  - Chaining actually refers to both an action and an event – the completed ‘action’ from the 1st channel is the ‘event’ for the next channel
  - You can chain as many Chans as you like – it is only limited by the #Ch’s on a device
  - Chaining does NOT reload current Chan config – that can only be accomplished by linking. It simply triggers another channel to run.
Example #1 – Simple Chaining

### Channel #5
- Triggered manually by ESR
- Chains to Ch #7 (Ch #5’s TCC = 7)

### Channel #7
- Triggered by chaining from Ch #5
- Interrupts the CPU when finished (sets TCC = 8)
- ISR checks IPR (TCC=8) to determine which channel generated the interrupt

**Notes:**
- Any Ch can chain to any other Ch by enabling OPT.TCCHEN and specifying the next TCC
- Any Ch can interrupt the CPU by enabling its OPT.TCINTEN option (and specifying the TCC)
- IPR bit set depends on previous Ch’s TCC setting
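
Example #1 can be traced with a small model of the three OPT settings involved. The sketch below is a simplification under stated assumptions: `ChanOpt`, `EdmaModel` and the functions are hypothetical names, each "run" stands for a complete transfer, and a single `pending` word stands in for all trigger sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CH 64

typedef struct {
    uint8_t tcc;    /* OPT.TCC: code reported when the transfer completes  */
    bool tcchen;    /* OPT.TCCHEN: completion triggers channel number tcc  */
    bool tcinten;   /* OPT.TCINTEN: completion sets IPR bit number tcc     */
} ChanOpt;

typedef struct {
    ChanOpt opt[NUM_CH];
    uint64_t pending;        /* channels waiting to run (model only) */
    uint64_t ipr;            /* interrupt pending bits               */
    unsigned runs[NUM_CH];   /* how many times each channel has run  */
} EdmaModel;

static void edma_manual_trigger(EdmaModel *m, unsigned ch)
{
    assert(ch < NUM_CH);
    m->pending |= (uint64_t)1 << ch;
}

/* Run pending channels until none remain or max_steps transfers have run.
 * The step limit guards against a channel that chains to itself. */
static unsigned edma_run(EdmaModel *m, unsigned max_steps)
{
    unsigned steps = 0;

    while (m->pending != 0 && steps < max_steps) {
        unsigned ch = 0;
        while (((m->pending >> ch) & 1u) == 0) {
            ch++;                                /* lowest pending channel */
        }
        m->pending &= ~((uint64_t)1 << ch);
        m->runs[ch]++;
        steps++;

        const ChanOpt *o = &m->opt[ch];
        assert(o->tcc < NUM_CH);
        if (o->tcchen) {
            m->pending |= (uint64_t)1 << o->tcc;  /* chain to next channel */
        }
        if (o->tcinten) {
            m->ipr |= (uint64_t)1 << o->tcc;      /* flag CPU interrupt    */
        }
    }
    return steps;
}
```

Configuring Ch 5 with TCC = 7 and chaining enabled, and Ch 7 with TCC = 8 and interrupts enabled, then manually triggering Ch 5, runs both channels once and leaves IPR bit 8 set.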
Channel Sorting

Channel Sort – “Transfer Config” – Overview

- **Need:** De-interleave (sort) two (or more) channels
  - Ex: stereo audio (LRLR) into L & R buffers
- **Solution:** Use DMA indexing to perform sorting automatically
- **Concept:**
  - In many applications, data comes from the peripheral as interleaved data (LRLR, etc.)
  - Most algo's that run on data require these channels to be de-interleaved
  - Indexing, built into the EDMA3, can auto-sort these channels with no time penalty

- **How does channel sorting work?**
  - User can specify the 'BIDX' and 'CIDX' values to accomplish auto sorting
EDMA consists of two parts: Channel Controller (CC) and Transfer Controller (TC).

An event (from periph-ER/EER, manual-ESR or via chaining-CER) sends the transfer to 1 of 4 queues (Q0 is mapped to TC0, Q1-TC1, etc. Note: McBSP can use TC1 only).

Xfr mapped to 1 of 256 PSETs and submitted to the TC (1 TR – transfer request – per ACNT bytes or “A*B” CNT bytes – based on sync). Note: Dst FIFO allows buffering of writes while more reads occur.

The TC performs the transfer (read/write) and then sends back a transfer completion code (TCC).

The EDMA can then interrupt the CPU and/or trigger another transfer (chaining – Chap 6)

EDMA Performance – Tips, References

Spread Out the Transfers Among all Q’s
- Don’t use the same Q for too many transfers (causes congestion)
- Break long non-realtime transfers into smaller xfrs using self-chaining

Manage Priorities
- Can adjust TC0-3 priority to the SCR (MSTPRI register)
- In general, place small transfers at higher priorities

Tune transfer size to FIFO length and bus width
- Place large transfers on TCs w/larger FIFOs (typically TC2/3)
- Place smaller, real-time transfers on TC0/1
- Match transfer sizes (A, A*B) to bus width (16 bytes)
- Align src/dst on 16-byte boundaries

References
- Programming EDMA3 using LLD (wiki) + examples (see next slide...)
- TC Optimization Rules (SPRUE23)
- EDMA3 User Guide (SPR965)
- EDMA3 Controller (SPRU234)
- EDMA3 Migration Guide (SPRA486)
- EDMA Performance (SPRAG98)
Programming EDMA3 – Using Low Level Driver (LLD)

EDMA3 LLD Wiki...

- Download the detailed app note...
- Use the examples to learn the APIs...

Chapter Quiz

1. Name the 4 ways to trigger a transfer?

2. Compare/contrast linking and chaining

3. Fill out the following values for this channel sorting example (5 min):
   - 16-bit stereo audio (interleaved)
   - Use EDMA to auto “channel sort” to memory
   
<table>
<thead>
<tr>
<th>PERIPH</th>
<th>MEM</th>
</tr>
</thead>
<tbody>
<tr>
<td>L0</td>
<td>L0</td>
</tr>
<tr>
<td>R0</td>
<td>L1</td>
</tr>
<tr>
<td>L1</td>
<td>L2</td>
</tr>
<tr>
<td>R1</td>
<td>L3</td>
</tr>
<tr>
<td>L2</td>
<td>R0</td>
</tr>
<tr>
<td>R2</td>
<td>R1</td>
</tr>
<tr>
<td>L3</td>
<td>R2</td>
</tr>
<tr>
<td>R3</td>
<td>R3</td>
</tr>
</tbody>
</table>

   ACNT: _____
   BCNT: _____
   CCNT: _____
   ‘BIDX: _____
   ‘CIDX: _____

   Could you calculate these?
Quiz – Answers

1. Name the 4 ways to trigger a transfer?
   - Manual start, Event sync, chaining and (QDMA trigger word)

2. Compare/contrast linking and chaining
   - linking – copy new configuration from existing PARAM (link field)
   - chaining – completion of one channel triggers another (TCC) to start

3. Fill out the following values for this channel sorting example (5 min):

   - 16-bit stereo audio (interleaved)
   - Use EDMA to auto “channel sort” to memory

   - ACNT: 2
   - BCNT: 2
   - CCNT: 4
   - ‘BIDX: 8
   - ‘CIDX: -6

Could you calculate these?
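
One way to check the answer values is to walk the destination address the way an A-synchronized transfer would. The sketch below is a host-side model only: `edma_a_sync_scatter` is a hypothetical name, the peripheral is modeled as an in-order byte stream, and the index rule assumed is "add ‘BIDX after every array except the last in a frame, add ‘CIDX after the last".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of the destination side of an A-synchronized transfer.
 * stream: bytes delivered by the peripheral, in arrival order.
 * dst/dst_size: the memory buffer being filled. */
static void edma_a_sync_scatter(const uint8_t *stream, uint8_t *dst,
                                size_t dst_size,
                                uint16_t acnt, uint16_t bcnt, uint16_t ccnt,
                                int16_t dstbidx, int16_t dstcidx)
{
    ptrdiff_t off = 0;      /* current destination offset in bytes */
    size_t consumed = 0;    /* bytes taken from the stream so far  */

    for (uint16_t c = 0; c < ccnt; c++) {
        for (uint16_t b = 0; b < bcnt; b++) {
            assert(off >= 0 && (size_t)off + acnt <= dst_size);
            memcpy(dst + off, stream + consumed, acnt);
            consumed += acnt;
            /* B index inside a frame, C index after the frame's last array */
            off += (b + 1 < bcnt) ? dstbidx : dstcidx;
        }
    }
}
```

Feeding eight interleaved 16-bit samples (L0, R0, L1, R1, ...) through the model with ACNT = 2, BCNT = 2, CCNT = 4, ‘BIDX = 8 and ‘CIDX = -6 produces the destination offsets 0, 8, 2, 10, 4, 12, 6, 14, so the four left samples land first and the four right samples follow.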
### Options Register Details

![Diagram of Options Register bit fields]

### EDMA3 Terminology

- **3-dimensional transfer consisting of ACNT, BCNT and CCNT:**
  - ACNT = Array = # of contiguous ACNT bytes (16-bit unsigned, 0-65535)
  - BCNT = Frame = # of ACNT arrays (16-bit unsigned, 0-65535)
  - CCNT = Block = # of BCNT frames (16-bit unsigned, 0-65535)

- Minimum transfer is an array of ACNT bytes
- Total transfer count = ACNT * BCNT * CCNT

![Diagram of ACNT, BCNT, CCNT Arrays and Frames]

**ACNT Bytes**
- Frame 1: Array1, Array2, Array BCNT
- Frame 2: Array1, Array2, Array BCNT
- Frame CCNT: Array1, Array2, Array BCNT
**Triggering an EDMA Transfer to Start**

- Each of the 64 DMA channels can be triggered by any of the following:
  
  **Event Triggering (from a peripheral) – EER/ER** *(6455 values given: Check your datasheet)*

  - **Registers:** ER = Event Register, EER = Event Enable Register
  - **Example events:**
    - XEVT0 (McBSP0 Transmit Event)
    - XEVT1 (McBSP1 Transmit Event)
    - McBSP0 Receive Event
    - McBSP1 Receive Event

- **Each event is tied to a specific DMA channel** (e.g. XEVT1 is tied to Ch 14) and can be enabled/disabled via the EER register

**Manual Triggering - ESR**

- CPU writes a “1” to the corresponding bit of the Event Set Register (ESR)
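
A manual trigger comes down to selecting the right bit for the channel. The sketch below models this with a plain struct so it can run on a host; `EdmaTrigRegs` and `edma_manual_start` are hypothetical names, the struct is not the real register map, and the split of 64 channels into a low word (channels 0-31) and a high word (channels 32-63) is an assumption to verify against your device's EDMA3 documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the event-set registers; not the real layout. */
typedef struct {
    volatile uint32_t esr;    /* channels 0-31  */
    volatile uint32_t esrh;   /* channels 32-63 */
} EdmaTrigRegs;

/* Write a single 1 for the requested channel. A plain write is used rather
 * than read-modify-write, on the assumption that writing 0 to the other
 * bits has no effect. */
static void edma_manual_start(EdmaTrigRegs *regs, unsigned ch)
{
    assert(ch < 64);
    if (ch < 32) {
        regs->esr = (uint32_t)1 << ch;
    } else {
        regs->esrh = (uint32_t)1 << (ch - 32);
    }
}
```

In this model, starting channel 5 writes 0x20 to the low word and starting channel 40 writes 0x100 to the high word.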

**Chain Triggering - CER**

- Used to execute multiple TRs upon receipt of a single event
  - Ex: EVT1 triggers Ch0, Ch0 completes and triggers Ch1 (TCC=1)
  - Chained events are captured in the Chain Event Register (CER)

---

**Transfer Complete Code (TCC)**

**Options Reg**

- **TCC**
- **TCC MODE**

- **Ch 0-63**
  - NORMAL
  - EARLY

- TCC is generated when a transfer completes. This is referred to as the “Final TCC”.
- TCC can be used to trigger an EDMA interrupt and/or another transfer (chaining)
- Each TR below is a “transfer request” which can be either ACNT bytes (A-sync) or ACNT * BCNT bytes (AB-sync). Final TCC only occurs after the LAST TR.
- Final TCC can be generated at either of two different times:
  - **NORMAL mode (after peripheral acknowledgement)**
  - **EARLY mode (after submitting the last TR to TC)**
### Counter Reload

**Left:**

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>10</td>
</tr>
</tbody>
</table>

**Right:**

|   |   |   |   |   |   |   |   |   |
|---|---|---|---|---|---|---|---|
| 1 | 2 |   |   |   |   |   |   |

---

What happens when BCNT goes to zero?

There's a register for this

<table>
<thead>
<tr>
<th>Field</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>ACNT</td>
<td>2</td>
</tr>
<tr>
<td>BCNT</td>
<td>2</td>
</tr>
<tr>
<td>CCNT</td>
<td>10</td>
</tr>
<tr>
<td>DSTBIDX</td>
<td>20</td>
</tr>
<tr>
<td>DSTCIDX</td>
<td>-18</td>
</tr>
<tr>
<td>BCNTRLD</td>
<td>2</td>
</tr>
<tr>
<td>Src Addr</td>
<td>McBSP</td>
</tr>
<tr>
<td>Dst Addr</td>
<td>Left</td>
</tr>
</tbody>
</table>

"Grab Bag" Topics

“Grab Bag” Explanation

Several other topics of interest remain. However, there is not enough time to cover them all. Most topics take over an hour to complete especially if the labs are done. Students can vote which ones they’d like to see first, second, third in the remaining time available.

Shown below is the current list of topics. Vote for your favorite two and the instructor will tally the results and make any final changes to the remaining agenda.

While all of these topics cannot be covered, the notes are in your student guide. So, at a minimum, you have some reference material on the topics not covered live to take home with you.

Topic Choices

<table>
<thead>
<tr>
<th>Vote</th>
<th>Chap #</th>
<th>Title</th>
<th>~Time (min)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>16a</td>
<td>Intro to DSP/BIOS</td>
<td>45 + 60 (lab)</td>
</tr>
<tr>
<td></td>
<td>16b</td>
<td>Booting from Flash</td>
<td>45 + 45 (lab)</td>
</tr>
<tr>
<td></td>
<td>16c</td>
<td>Drivers – SIO/PSP/IOM</td>
<td>60</td>
</tr>
<tr>
<td></td>
<td>16d</td>
<td>Introduction to C66x</td>
<td>75</td>
</tr>
</tbody>
</table>

All material is IN your student guide or \techdocs
Introduction

This chapter introduces the general nature of real-time systems and the DSP/BIOS operating system. Each of the concepts noted here will be studied in greater depth in succeeding chapters.

Objectives

- Grab bag chapter assumes students have already been through *Intro to SYS/BIOS*
- Describe how to create a new BIOS project
- Learn how to configure BIOS using TCF files
- *Lab 16a* – Create and debug a simple DSP/BIOS application
Module Topics

- Intro to DSP/BIOS
- Module Topics
- DSP/BIOS Overview
- Threads and Scheduling
- Real-Time Analysis Tools
- DSP/BIOS Configuration – Using TCF Files
- Creating A DSP/BIOS Project
- Memory Management – Using the TCF File
- Lab 16a: Intro to DSP/BIOS
- Lab 16a – Procedure
  - Create a New Project
  - Add a New TCF File and Modify the Settings
  - Build, Load, Play, Verify
  - Benchmark and Use Runtime Object Viewer (ROV)
- Additional Information & Notes
- Notes
DSP/BIOS Overview

Module names are DIFFERENT
Therefore, API names are DIFFERENT

DSP/BIOS is an interface and library, therefore its API is modular.

Application Program Interfaces (API) define the interactions (methods) with a module and data structures (objects).

Objects - are structures that define the state of a component
- Pointers to objects are called handles
- Object based programming offers:
  - Better encapsulation and abstraction
  - Multiple instance ability
Threads and Scheduling

**DSP/BIOS Thread Types**

- **HWI** – Hardware Interrupts
- **SWI** – Software Interrupts
  - Performs HWI *follow-up* activity
  - *posted* by software
  - PRDs (periodic functions) are prioritized as SWIs
  - 14 priority levels
- **TSK** – Tasks
  - Runs programs concurrently under separate contexts
  - Usually enabled to run by posting a *semaphore* (a task signaling mechanism)
  - 15 priority levels
- **IDL** – Background
  - Multiple IDL functions
  - Runs as an infinite loop (like traditional *while* loop)
  - Single priority level

Compared with SYS/BIOS:

- Module names are DIFFERENT
- Same implicit "group" priorities
- Fewer SWI/TSK priority levels

**SWIs and TSKs**

- **SWI** – posted with **SWI_post(&swiObj);**
  - "run to completion"
- **TSK** – unblocked with **SEM_post(&semObj);**
  - **SEM_pend** – Pause (blocked state)
  - Unblocking triggers execution
  - Each TSK has its own stack, which allows them to pause (i.e. block)
  - Topology: prologue, loop, epilogue...

Compared with SYS/BIOS:

- API Names are different
- Same behavior of threads
DSP/BIOS: Priority-Based Scheduling

SWI_post(&swi_name);

Must "return" from main() to start DSP/BIOS Scheduler
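
The relationship between returning from main() and the background (IDL) thread can be pictured with a host-side model. This is not DSP/BIOS code and uses no BIOS APIs: every name below is hypothetical, and the real IDL loop never exits, whereas this model runs a fixed number of passes so it can be executed on a PC.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IDL 4

typedef void (*IdlFxn)(void);

/* Table of background functions, filled in before the loop starts. */
static IdlFxn idl_table[MAX_IDL];
static size_t idl_count;

static void idl_add(IdlFxn fxn)
{
    assert(idl_count < MAX_IDL);
    idl_table[idl_count++] = fxn;
}

/* Model of the background loop: call every IDL function, then repeat. */
static void idl_run(unsigned passes)
{
    for (unsigned p = 0; p < passes; p++) {
        for (size_t i = 0; i < idl_count; i++) {
            idl_table[i]();
        }
    }
}

/* Example background function: flips a flag in place of a real LED. */
static int led_on;
static void ledToggle(void)
{
    led_on = !led_on;
}

/* Model of main(): do the one-time setup, then RETURN so the loop can run. */
static int app_main(void)
{
    led_on = 0;
    return 0;
}
```

If `app_main()` contained an endless `while` loop instead of returning, `idl_run()` would never be reached and `ledToggle()` would never be called.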
Real-Time Analysis Tools

### Built-in Real-Time Analysis Tools

- Gather data on target (3-10 CPU cycles)
- Send data during run (30-40 CPU cycles)
- Format data on PC
- Data gathering done during run

#### RunTime Obj View (ROV)

- Halt to see results
- Displays stats about all threads in system

#### CPU Load Graph

- Analyze time NOT spent in IDL

### Printf LOGs

- Send Dbg Msgs to PC
- Data displayed during runtime
- Deterministic, low DSP cycle count
- WAY more efficient than traditional printf()

#### Statistics (STS)

- Gather benchmarks during runtime
- Set "start/end" points in code (more later...)

---

Printf LOGs work well
STS is GREAT – not available in SYS/BIOS.
DSP/BIOS Configuration – Using TCF Files

Textual Config File (TCF) Contents

System Config
- Clock & Cache
  - BIOS Clk freq, cache settings
- MEM
  - Memory Areas (origin, length, ...)
  - Stack/heap sizes

BIOS Config
- Instrumentation
  - LOG and Statistics (STS) Objects
- Scheduling
  - CLK objects (tick rate)
- PRD, HWI, SWI, TSK, IDL fxns
- Synchronization
  - Semaphores (SEM)

The GUI creates a TCF script...

GUI Creates TCF Script...

To create a TCF file, we first need a new BIOS project...
Creating A DSP/BIOS Project

Creating a New BIOS Project (1)

You have two options:

Start with a standard EVM6748 BIOS Example

Done For You...
- CGT/BIOS include paths added
- TCF file with proper memory map added

Modifications...
- Delete unused source files
- [Optional] Rename TCF file to match project name (explorer)

Creating a New BIOS Project (2)

You have two options:

Start with standard EVM6748 BIOS Example

Use an EMPTY example and add a TCF file to it

Done For You...
- CGT/BIOS include paths added
- TCF file NOT ADDED

Modifications...
- ADD TCF file to your project:
  - File → New...(next...)
  - BIOS Examples
  - Elsewhere...
Adding a New TCF File to Your Project

You have several options – however the easiest way is simply to:

1. Select: File → New → DSP/BIOS v5.x Config File

2. Give the new file a name:

3. Pick the proper platform (e.g. evm6748)

Platform file sets up...
- Clock settings
- Memory Map & Cache settings

The TCF file does some work for us...

TCF Generates Key Files...

- **file.tcf** file generates (when saved) two **very important** files:
  - **filecfg.h**: header file for all BIOS libraries (must #include in project)
  - **filecfg.cmd**: linker.cmd file for your project (add to project)

Other files... Covered later
Memory Management – Using the TCF File

Remember?

- How do you define the memory segments (e.g. IRAM, FLASH, DDR2)?
- How do you place the sections into these memory segments?

How do we accomplish this with a .tcf file?

MEM – Memory Section Manager

- Similar to a linker.cmd file, the .tcf defines two pieces:
  - Memory Segments: name, base, len
  - Sections: name, which segment to link to
  - Note: seed file has default mem settings

Memory Segments
- Right-click on name, select Properties

Sections
- Right-click on MEM and select Properties

MEM Mgmt – WAY easier using TCF

SYS/BIOS has some “catching up” to do...
Lab 16a: Intro to DSP/BIOS

Now that you’ve been through creating projects, building and running code, we now turn the page to learn about how DSP/BIOS-based projects work. This lab, while quite simple in nature, will help guide you through the steps of creating (possibly) your first BIOS project in CCSv4.

This lab will be used as a "seed" for future labs.

**Application:** blink USER LED_1 on the EVM every second

**Key Ideas:** main() returns to BIOS scheduler, IDL fxn runs to blink LED

**What will you learn?** .tcf file mgmt, IDL fxn creation/use, creation of BIOS project, benchmarking code, ROV

**Pseudo Code:**

- **main()** – init BSL, init LED, return to BIOS scheduler
- **ledToggle()** – IDL fxn that toggles LED_1 on EVM

---

**Lab 16a – Intro to DSP/BIOS**

**Procedure**

- Create a new **BIOS project** (empty)
- Add files (main.c, led.c)
- Link BSL library
- Add a new **BIOS TCF file** (mem map, scheduler, cmd)
- **Create IDL object**
- Build, “Play”, Debug

**Scheduler**

- **HWI**
- **SWI**
- **IDL**

**Time:** 45min

**led.c**

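```c
/* Pseudo code: toggle() and delay() stand in for the actual calls in led.c */
void ledToggle(void) {
    toggle(LED_1);   /* flip the state of LED_1 */
    delay(500);      /* wait 500 ms */
}
```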
Lab 16a – Procedure

If you can’t remember how to perform some of these steps, please refer back to the previous labs for help. Or, if you really get stuck, ask your neighbor. If you AND your neighbor are stuck, then ask the instructor (who is probably doing absolutely NOTHING important) for help. 😊

Create a New Project

1. Create a new project named “bios_led”.

   Create your new project in the following directory:

   C:\TI-RTOS\C6000\Labs\Lab16a\Project

When the following screen appears, make sure you click **Next** instead of Finish:
2. **Choose a Project template.**

   This screen was brand new in CCSv4.2.2. And it is not intuitive to the casual observer that the Next button above even exists – you see Finish, you click it. Ah, but the hidden secret is the Next button. The CCS developers are actually trying to do us a favor IF you understand what a BIOS template is.

   As you can see, there are many choices. Empty Projects are just that – empty – just a path to the include files for the selected processor. Go ahead and click on “Basic Examples” to see what’s inside. Click on all the other + signs to see what they contain. Ok, enough playing around. We are using BIOS 5.41.xx.xx in this workshop. So, the correct + sign to choose in the end is the one that is highlighted above.

3. **Choose the specific BIOS template for this workshop.**

   Next, you’ll see the following screen:

   Select “Empty Example”. This will give us the paths to the BIOS include directories. The other examples contain example code and .tcf files. NOW you can click Finish.
4. Add files to your project.
   From the lab’s Files directory, ADD the following files:
   - led.c, main.c, main.h

   Open each and inspect them. They should be pretty self explanatory.

5. Link the LogicPD BSL library to your project as before.

6. Add an include path for the BSL library \inc directory.
   Right-click on the project and select “Build Properties”. Select C6000 Compiler, then Include Options (you’ve done this before). Add the proper path for the BSL include dir (else you will get errors when you build).

At this point in time, what files are we missing?  There are 3 of them. Can you name them?
________________________  __________________________  __________________________

Add a New TCF File and Modify the Settings

   As discussed earlier, you have several options available to you regarding the TCF file. In this lab, we chose to use an EMPTY BIOS example from the project templates. Therefore, no TCF file exists.

   Referring back to the material in this chapter, create a NEW TCF file (File → New → DSP/BIOS v5.x Config File). Name it: bios_led.tcf. When prompted to pick a platform seed tcf file, type “evm6748” into the filter field and choose the tcf that pops up.

   CCS should have placed your new TCF file in the project directory AND added it to your project. Check to make sure both of these statements are true.

   If the new TCF file did not open automatically when created, double-click on the new TCF file (bios_led.tcf) to open it.

8. Create a HEAP in memory.
   All BIOS projects need a heap. Why this doesn’t get created for you in the “seed” tcf file is a good question. The fact that it doesn’t causes a heap full of troubles. If you ever get any strange unexplainable errors when you build BIOS projects, check THIS first.

   Open the TCF file (if it’s not already) and click on System. Right-click on MEM and select Properties. The checkbox for “No Dynamic Heaps” is most likely not checked (because we used an existing TCF file that had this selection as default).

   UNCHECK this box (if not already done) to specify that you want a heap created. A warning will bark at you that you haven’t defined a memory segment yet – no kidding. Just ignore the warning and click OK. (Note: this warning probably won’t occur because we used an existing TCF file).

   Click the + next to MEM. This will display the “seed” TCF memory areas already defined. Thank you.
Right-click IRAM and select properties.

Check the box that says “create a heap in this memory” (if not already checked) and change the heap size to 4000h.

Click Ok.

Now that we HAVE a heap in IRAM (that’s another name for L2 by the way), we need to tell the mother ship (MEM) where our heap is.

Right-click on MEM and select Properties. Click on both down arrows and select IRAM for both (again, this is probably already done for you). Click OK. Now she’s happy...

Save the TCF file.

**Note:** FYI – throughout the labs, we will throw in the “top 10 or 20” tips that cause Debug nightmares during development. Here’s your first one...

**Hint:** TIP #1 – Always create a HEAP when working with BIOS projects.
Build, Load, Play, Verify…

9. Ensure you have the proper target config file selected as Default.

10. Build your project.

   Fix any errors that occur (and there will be some, just keep reading…). You didn’t make errors, did you? Of course you did. Remember when we said that ANY BIOS project needs the cfg.h file included in one of the source files? Yep. And it was skipped on purpose to drive the point home.

   Open main.h for editing and add the following line as the FIRST include in main.h:
   ```
   #include "bios_ledcfg.h"
   ```

   Rebuild and see if the errors go away. They should. If you have more, then you really DO need to debug something. If not, move on…

---

**Hint:** TIP #2 – When using BIOS, always #include the cfg.h file in your application code as the FIRST included header file.

---

11. Inspect the “generated” files resulting from our new TCF file.

   In the project view, locate the following files and inspect them (actually, you’ll need to BUILD the project before these show up):

   - bios_ledcfg.h
   - bios_ledcfg.cmd

   There are other files that get generated by the existence of .tcf which we will cover in later labs. The .cmd file is automatically added to your project as a source file. However, your code must #include the cfg.h file or the compiler will think all the BIOS stuff is “declared implicitly”.
12. Debug and “Play” your code.

Click the Debug “Bug” – this is equivalent to “Debug Active Project”. Remember, this code blinks LED_1 near the bottom of the board. When you Play your code and the LED blinks, you’re done.

When the execution arrow reaches `main()`, hit “Play”. Does the LED blink?

No? What is going on?

Think back to the scheduling diagram and our discussions. To turn BIOS ON, what is the most important requirement? `main()` must RETURN or fall out via a brace `}`. Check `main.c` and see if this is true. Many users still have `while()` loops in their code and wonder why BIOS isn’t working. If you never return from `main()`, BIOS will never run.

**Hint:** TIP #3 – BIOS will NOT run if you don’t exit `main()`. Ok, so no funny tricks there - that checks out.

Next question: how is the function `ledToggle()` getting called? Was it called in `main()`? Hmm. Who is supposed to call `ledToggle()`?

When your code returns from `main()`, where does it go? The BIOS scheduler. And, according to our scheduling diagram and the threads we have in the system, which THREAD will the scheduler run when it returns from `main()`?

Can you explain what needs to be done? ____________________________________________
13. Add IDL object to your TCF.

The answer is: the scheduler will run the IDL thread when nothing else exists. All other thread types are higher priority. So, how do you make the IDL thread call ledToggle()?

Simple. Add an IDL object and point it to our function.
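To picture what the scheduler does once `main()` returns, here is a minimal host-side sketch in plain C. It is not the BIOS implementation: `idl_insert()`, `idl_run_pass()`, `MAX_IDL_FXNS` and `fake_led_toggle()` are hypothetical names that only mimic the idea of registering a function with the idle loop and having it called on every pass.

```c
#include <stddef.h>

#define MAX_IDL_FXNS 4              /* hypothetical table size */

typedef void (*IdlFxn)(void);

IdlFxn idl_table[MAX_IDL_FXNS];
size_t idl_count = 0;
int toggle_count = 0;               /* stands in for the LED state */

/* Register a function with the idle loop (the role "Insert IDL" plays in the TCF). */
int idl_insert(IdlFxn fxn)
{
    if (fxn == NULL || idl_count >= MAX_IDL_FXNS) {
        return -1;
    }
    idl_table[idl_count++] = fxn;
    return 0;
}

/* One pass of the idle loop: call every registered function in order. */
void idl_run_pass(void)
{
    size_t i;
    for (i = 0; i < idl_count; i++) {
        idl_table[i]();
    }
}

/* Stand-in for ledToggle(); the real one calls the BSL. */
void fake_led_toggle(void)
{
    toggle_count++;
}
```

If nothing is registered, each pass does no work, which matches the symptom in the previous step: BIOS is running, but nobody calls `ledToggle()`.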

Open the TCF file and click on Scheduling. Right-click on IDL and select “Insert IDL”. Name the IDL Object “IDL_ledToggle”.

Now that we have the object, we need to tell the object what to do – which fxn to run. Right-click on IDL_ledToggle and select Properties. You’ll notice a spot to type in the function name.

Ok, make room for another important tip. BIOS is written in ASSEMBLY. The ledToggle() function is written in C. How does the compiler distinguish between an assembly label or symbol and a C label? The magic underscore “_”. All C symbols and labels (from an assembly point of view) are preceded with an underscore.

Hint: TIP #4 – When entering a fxn name into BIOS objects, precede the name with an underscore – “_”. Otherwise you will get a symbol referencing error which is difficult to locate.
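As a small illustration of the naming rule in TIP #4, the sketch below builds the assembly-level symbol from a C function name by prepending an underscore. The helper `c_to_asm_symbol()` is a hypothetical name made up for this example. The tools do this for you, so the code only shows what string ends up in the BIOS object.

```c
#include <stddef.h>
#include <string.h>

/* Build the assembly-level name for a C function by prepending '_'.
 * Returns 0 on success, -1 if the output buffer is too small. */
int c_to_asm_symbol(const char *c_name, char *out, size_t out_size)
{
    size_t len = strlen(c_name);
    if (out_size < len + 2) {       /* '_' + name + '\0' */
        return -1;
    }
    out[0] = '_';
    memcpy(out + 1, c_name, len + 1);   /* copies the terminating '\0' too */
    return 0;
}
```

Passing `"ledToggle"` yields `"_ledToggle"`, which is the text to enter in the function field.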

SO, the fxn name you type in here must be preceded by an underscore: `_ledToggle`

You have now created an IDL object that is associated with a fxn. By the way, when you create HWI, SWI and TSK objects later on, guess what? It is the SAME procedure. You’ll get sick of this by the end of the week – right-click, insert, rename, right-click and select Properties, type some stuff. There – that is DSP/BIOS in a nutshell.

14. Build and Debug AGAIN.

When the execution arrow hits main(), click “Play”. You should now see the LED blinking. If you ever HALT/PAUSE, it will probably pause inside a library fxn that has no source associated with it. Just X that thing.

At this point, your first BIOS project is working. Do NOT “terminate all” yet. Simply click on the C/C++ perspective and move on to a few more goodies…
Benchmark and Use Runtime Object Viewer (ROV)

15. Benchmark LED BSL call.

So, how long does it take to toggle an LED? 10, 20, 50 instruction cycles? Well, you would be off by several orders of magnitude. So, let’s use the CLK module in BIOS to determine how long the LED_toggle() BSL call takes.

This same procedure can be used quickly and effectively to benchmark any area in code and then display the results either via a local variable (our first try) or via another BIOS module called LOG (our 2nd try).

BIOS uses a hardware timer for all sorts of things which we will investigate in different labs. The high-resolution time count can be accessed through a call to CLK_gethtime() API. Let’s use it...

Open led.c for editing.

Allocate three new variables: start, finish and time. First, we’ll get the CLK value just before the BSL call and then again just after. Subtract the two numbers and you have a benchmark – called time. This will show up as a local variable when we use a breakpoint to pause execution.

Your new code in led.c should look something like this:

```c
void ledToggle(void) //called by IDL thread or PRD
{
    uint32_t start, finish, time;
    start = CLK_gethtime();
    LED_toggle(LED_1); //toggle LED_1 on C6748 EVM
    finish = CLK_gethtime();
    time = finish - start;
    LOG_printf(&trace, "Toggle time = %d\n", time);
    USTIMER_delay(DELAY_HALF_SEC); //wait half-second
}
```

Don’t type in the call to LOG_printf() just yet. We’ll do that in a few moments...
16. **Build, Debug, Play.**

When finished, build your project – it should auto-download to the EVM. Switch to the Debug perspective and set a breakpoint as shown in the previous diagram. Click “Play”.

When the code stops at your breakpoint, select View → Local. Here’s the picture of what that will look like:

![Local View Example]

Are you serious? 1.57M CPU cycles. Of course. This mostly has to do with going through I2C and a PLD and waiting forever for acknowledge signals (can anyone say “BUS HOLD”?). Also, don’t forget we’re using the “Debug” build configuration with no optimization. More on that later. Nonetheless, we have our benchmark.
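To put that number in perspective, the sketch below converts a cycle count into microseconds. It assumes the count really is in CPU cycles, as stated above, and that the core runs at 300 MHz (the GEL function name `Core_300MHz_mDDR_132MHz` suggests this, but check your own setup). The helper names `elapsed_counts()` and `cycles_to_us()` are hypothetical.

```c
#include <stdint.h>

/* Elapsed counts between two 32-bit timer reads.
 * Unsigned subtraction stays correct across a single counter wrap. */
uint32_t elapsed_counts(uint32_t start, uint32_t finish)
{
    return finish - start;
}

/* Convert a cycle count to whole microseconds for a given clock in Hz. */
uint32_t cycles_to_us(uint32_t cycles, uint32_t clock_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000u) / clock_hz);
}
```

Under those assumptions, 1.57M cycles works out to roughly 5.2 ms per LED toggle.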

17. **Open up TWO .tcf files – is this a problem?**

The author has found a major “uh oh” that you need to be aware of. Open your .tcf file and keep it open. Double-click on the project’s TCF file AGAIN. Another “instance” of this window opens. Nuts. If you change one and save the other, what happens? Oops. So, we recommend you NOT minimize TCF windows and then forget you already have one open and open another. Just BEWARE...

18. **Add LOG Object and LOG_printf() API to display benchmark.**

Open `led.c` for editing and add the `LOG_printf()` statement as shown in a previous diagram.

Open the TCF for editing. Under Instrumentation, add a new LOG object named “trace”. Remember? Right-click on LOG, insert log, rename to trace, click OK.

![Instrumentation Diagram]

Save the TCF.
19. Pop over to Windows Explorer and analyse the `Project` folder.

Remember when we said that another folder would be created if you were using BIOS? It was called `.gconf`. This is the GRAPHICAL config tool in action that is fed by the .cdb file. When you add a .tcf file, the graphical and textual tools must both exist and follow each other. Go check it out. Is it there? Ok…back to the action…

20. Build, Debug, Play – use ROV.

When the code loads, remove the breakpoint in led.c. Then, click Play. PAUSE the execution after about 5 seconds. Open the ROV tool via `Tools → ROV`. When ROV opens, select LOG and one of the sequence numbers – like 2 or 3:

Notice the result of the LOG_printf() under “message”. You can choose other sequence numbers and see what their times were.

You can also choose to see the LOG messages via `Tools → RTA → Printf Logs`. Try that now and see what you get. If you’d like to change the behaviour of the LOGging, go back to the LOG object and try a bigger buffer, circular (last N samples) or fixed (first N samples). Experiment away…
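The difference between a circular and a fixed log buffer can be sketched in a few lines of host-side C. This is not how the BIOS LOG module is implemented: `SketchLog`, `LOG_DEPTH`, `sketch_log_init()` and `sketch_log_write()` are hypothetical names used only to show which samples survive in each mode.

```c
#include <stdint.h>

#define LOG_DEPTH 4                 /* hypothetical buffer length */

typedef enum { LOG_MODE_CIRCULAR, LOG_MODE_FIXED } LogMode;

typedef struct {
    LogMode  mode;
    uint32_t entries[LOG_DEPTH];
    uint32_t next;                  /* index of the next slot to write */
    uint32_t count;                 /* number of valid entries */
} SketchLog;

void sketch_log_init(SketchLog *log, LogMode mode)
{
    log->mode = mode;
    log->next = 0;
    log->count = 0;
}

/* Returns 1 if the value was stored, 0 if a full fixed log dropped it. */
int sketch_log_write(SketchLog *log, uint32_t value)
{
    if (log->mode == LOG_MODE_FIXED && log->count == LOG_DEPTH) {
        return 0;                   /* fixed: keep the first N, drop the rest */
    }
    log->entries[log->next] = value;
    log->next = (log->next + 1) % LOG_DEPTH;    /* circular: wrap, overwrite oldest */
    if (log->count < LOG_DEPTH) {
        log->count++;
    }
    return 1;
}
```

Writing six values into each buffer leaves the fixed log holding values 1 to 4 and the circular log holding values 3 to 6.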

When we move on to a TSK-based system, the ROV will come in very handy. This tool actually replaced the older KOV (kernel object viewer) in the previous CCS. Also, in future labs, we’ll use the RTA (Real-time Analysis) tools to view Printf logs directly. By then, you’ll know two different ways to access debug info.

**Note:** Explain this to me – so, the tool is called ROV which stands for RUNTIME Object Viewer. But the only way to VIEW the OBJECT is in STOP time. Hmmm. Marketing? Illegal drug use? Ok, so it "collects" the data during runtime…but still…to the author, this is a stretch and confuses new users. Ah, but now you know the “rest of the story”…

Terminate the Debug Session and close the project.

You’re finished with this lab. Please raise your hand and let the instructor know you are finished with this lab (maybe throw something heavy at them to get their attention or say “CCS crashed – AGAIN!” – that will get them running…)
Additional Information & Notes

**DSP/BIOS – TI’s Real-Time O/S**

**DSP/BIOS is:**
- Library of essential application services
- Pre-emptive scheduler to manage threads, memory, I/O, timers
- Hardware (peripheral) abstraction

**Benefits:**
- Provides an interactive config tool
- Consumes minimal MIPS, memory
- Supports specialized DSP drivers
- Integrates real-time analysis tools

**Files Generated by the Config Tool**

- **myWork.tcf**  Textual configuration script file
- **myWorkcfg.cmd**  Linker command file
- **myWorkcfg_c.c**  C file to s/u BIOS obj’s, etc
- **myWorkcfg.s##**  ASM init file for series ## DSP
- **myWorkcfg.h##**  Header file for above
- **myWork.cdb**  I/F to GCONF display
- **myWorkcfg.h**  header file for config inclusions
Polling vs Interrupt (Event) Driven

Polling:
- Overhead of repeated checking
- Wastes MIPS, Watts
- Doesn’t allow other threads to run in the mean time

Interrupts:
+ No checking – launch on event
  + no wasted time or power
+ Allows other threads to run independently
+ Represent response to priority events
- Small number of interrupt sources to post ISRs

“Software” Interrupts and Semaphore posting
+ Allows interrupt/event launch of threads beyond ISRs
+ BIOS HWI & SWI are both posted to run, like an ISR
+ BIOS tasks (TSK) can be synchronized via SEMaphores
+ Improved Modularity

System Design Options and Tradeoffs

- Dynamic vs Static
  - Static systems – are smaller and faster code solutions, simpler to create and manage
  - Dynamic systems – allow blocks of RAM to be ‘borrowed’ from heap when needed, and returned afterward for reuse by subsequent requestors; add the create & delete phases

- MIPS vs Mbytes – system designer can often trade one for the other to optimize performance and cost

- Number of Buffers: Latency (input to output time) vs flexibility (improved ability to tolerate preemption)

- What is speed? MIPS vs TTM (Time To Market)
  - Faster DSP processing rates offer performance that exceeds minimum requirements of many systems
  - More sophisticated features can be employed to simplify coding effort, improve speed of coding and time to market

- Cost: Device vs TTM
  - Price of DSP HW and development should be weighed against the value of time to market
Booting From Flash

Introduction

In this chapter, we consider the steps required to migrate code from being loaded and run via CCS to running autonomously from flash. Given the AISgen and SPIWriter tools, this is a simple process that is usually done toward the end of the design cycle.

Objectives

- Compare/contrast the startup events of CCS (GEL) vs. booting from flash
- Describe how to use AISgen and SPI Flash Writer utilities to create and burn a flash boot image
- Lab 16b – Convert the keystone lab to a bootable flash image, POR, run
Module Topics

Booting From Flash

Lab 16b: Booting From Flash

Notes
Booting From Flash

Boot Modes – Overview

'C6748 Boot Modes - Overview

On RESET:
- BOOT[x] pins are sampled
- Corresponding boot routine is executed

Boot Loader (ARM or DSP):
- Runs out of L2 ROM
- Copies FLASH → RAM
- Execution begins at specified “entry point” (reset vector)

ROM Code
- 0x11700000

BOOT Modes
- NAND
- NOR
- HPI
- I2C
- SPI
- UART

Questions
- What else does the user need to configure? (GEL vs. Boot)
- How is the “flash image” created? (AIS)
- How is the EVM6748 Flash programmed? (SPIWriter)
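The "copy FLASH → RAM, then start at the entry point" idea from this slide can be sketched on a host PC as follows. The section record below is a hypothetical layout invented for illustration, and it is NOT the real AIS format. `SketchSection`, `SKETCH_RAM_SIZE` and `sketch_boot_copy()` are made-up names.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_RAM_SIZE 64u         /* tiny stand-in for target RAM */

/* Hypothetical section record; NOT the real AIS layout. */
typedef struct {
    uint32_t       dest_offset;     /* where the section lands in RAM */
    uint32_t       size;            /* section length in bytes */
    const uint8_t *data;            /* section contents held in "flash" */
} SketchSection;

/* Copy every section into RAM and return the entry point offset,
 * or UINT32_MAX if a section does not fit. */
uint32_t sketch_boot_copy(uint8_t *ram, const SketchSection *sections,
                          uint32_t num_sections, uint32_t entry_offset)
{
    uint32_t i;
    for (i = 0; i < num_sections; i++) {
        const SketchSection *s = &sections[i];
        if (s->dest_offset > SKETCH_RAM_SIZE ||
            s->size > SKETCH_RAM_SIZE - s->dest_offset) {
            return UINT32_MAX;
        }
        memcpy(ram + s->dest_offset, s->data, s->size);
    }
    return entry_offset;            /* a real loader would branch here next */
}
```

A real boot ROM also has to bring up clocks and external memory before it can copy anything into them, which is what the questions above lead into.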
System Startup

System Startup – CCS vs. Boot

<table>
<thead>
<tr>
<th>Required Task</th>
<th>CCS</th>
<th>Boot</th>
</tr>
</thead>
<tbody>
<tr>
<td>PLL Init</td>
<td>GEL file</td>
<td>AISgen.cfg</td>
</tr>
<tr>
<td>DDR Config</td>
<td>GEL file</td>
<td>AISgen.cfg</td>
</tr>
<tr>
<td>PINMUX</td>
<td>GEL file</td>
<td>AISgen.cfg</td>
</tr>
<tr>
<td>PSC</td>
<td>GEL file</td>
<td>AISgen.cfg</td>
</tr>
<tr>
<td>Load Program</td>
<td>CCS loader</td>
<td>ROM code</td>
</tr>
</tbody>
</table>

AIS = Application Image Script

- When using CCS, the GEL file takes care of important setup FOR YOU
- When using a boot loader, the user is responsible for writing code to accomplish the same tasks (e.g., AIS...)

Init Files

CCS GEL File

- The GEL script runs every time you connect to your target (C6748.gel).
- This script sets up the target environment:

<table>
<thead>
<tr>
<th>Mem Map</th>
<th>Core Freq</th>
<th>EMIF</th>
<th>PLL0</th>
<th>PSC</th>
<th>DDR</th>
<th>PINMUX</th>
<th>PLL1</th>
</tr>
</thead>
</table>

Runs at “Connect To Target” .GEL Snippets

```c
OnTargetConnect() {
    Clear_Memory_Map();
    Setup_Memory_Map();
    PSC_All_On_Experimenter();
    Core_300MHz_mDDR_132MHz();
}
```

```c
Setup_Memory_Map() {
    /* DSP */
    GEL_MapAddStr(...); //DSP L2 ROM
    GEL_MapAddStr(...); //DSP L2 RAM
    GEL_MapAddStr(...); //DSP L1P RAM
}
```

```c
Set_Core_300MHz();
Set_mDDR_132MHz();
```
AISgen Conversion

AISgen Conversion (.OUT → .BIN)
- AISgen converts your .OUT file to a “flash”-able boot image (.bin)
- Contains all of your app’s code/data sections
- Can include user-defined code to set up environment:

Build Process

Build Steps: CCS/Debug vs FLASH

CCS
- C6748.gel
- CCS: File / Ld Pgm
- CCS: Project / Build
- file.out
- CCS forces entry pt to _c_int00
- Flash
- DDR2
- L1, L2, L3
- 6748 EVM

FLASH
- AISgen.cfg
- AISgen
- file.bin
- SPIWriter.cmd
- SPIFlash
- DDR2
- L1, L2, L3
- 6748 EVM
- User specifies Entry point
**SPIWriter Utility (Flash Programmer)**

**SPIWriter Flash Utility - Procedure**
- SPIWriter is the “flash programming utility” that runs on the target and programs the flash with your .bin file
- SIMPLE procedure:
  1. Create your app.OUT file
  2. Use AISgen to convert .OUT → .BIN using proper settings
  3. Load/run SPIWriter_OMAP-L138.OUT file in CCS
  4. Respond “no” to “UBL boot?”
  5. Provide path to .BIN file (then flash erase/program occurs)
  6. Terminate debug session, power-cycle, DONE.

**Using SPIWriter**
- SPIWriter is available for download at:

- Part of a larger package of utils that includes writers for NAND, NOR, UBL_ARM, UBL_DSP

![SPlWri.png](attachment:SPlWri.png)
ARM + DSP Boot

ARM + DSP Boot (OMAP-L138)

- Unlock DSP
- Set DSP reset vec
- Wake DSP
- DSP PC = reset vec
- while(1)

DSP.out

- Audio Application
- Link reset vector to specific addr (add reset vector to CFG)

AISgen

- ARM code programs DSP’s entry point to L2 addr
- SYS/BIOS CFG file specifies exact entry point for boot
- AISgen combines .out files into one bootable image

flash.bin

ARM
sect_1
sect_2
...
DSP
sect_1
sect_2
...

ARM + DSP Boot (OMAP-L138)

- ARM Bootloader runs at reset, copies ARM/DSP sections to RAM
- ARM App runs, wakes DSP, sets DSP PC = entry point, DSP runs
- Both ARM and DSP programs are running simultaneously
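The ordering of the ARM-side steps (unlock, set reset vector, wake) can be modeled with a small simulation. Everything here is hypothetical: `SketchSoc` is a plain struct standing in for device registers, and the function names are made up. On real hardware each step is a register write described in the device documentation.

```c
#include <stdint.h>

/* Simulated chip state; on real hardware these would be device registers. */
typedef struct {
    int      dsp_cfg_unlocked;      /* 1 after the "unlock DSP" step */
    uint32_t dsp_reset_vector;      /* 0 means "not set yet" */
    int      dsp_running;
    uint32_t dsp_pc;
} SketchSoc;

void sketch_unlock_dsp(SketchSoc *soc)
{
    soc->dsp_cfg_unlocked = 1;
}

/* Fails (-1) if the DSP configuration has not been unlocked first. */
int sketch_set_dsp_reset_vector(SketchSoc *soc, uint32_t entry_point)
{
    if (!soc->dsp_cfg_unlocked) {
        return -1;
    }
    soc->dsp_reset_vector = entry_point;
    return 0;
}

/* Fails (-1) if no reset vector was programmed; otherwise the DSP
 * starts fetching from the reset vector. */
int sketch_wake_dsp(SketchSoc *soc)
{
    if (soc->dsp_reset_vector == 0u) {
        return -1;
    }
    soc->dsp_running = 1;
    soc->dsp_pc = soc->dsp_reset_vector;
    return 0;
}
```

After the DSP is awake, the ARM code in the slide simply sits in `while(1)` while the DSP application runs.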
Booting From Flash

Additional Info…

For Add’l Info…(Wiki & App Notes)

OMAP-L1x Debug GEL Files

Debug THIS !
- ROM ID
- Si Revision
- Boot Mode
- ROM Status Code
- Boot ROM Errors
- Current PC
- Device Info
- Clock Info
- PSC States

Outputs results to the Console Window
C6748 Boot Modes (S7, DIP_x)

Table 2.10 – S7 DIP Switch Functions

<table>
<thead>
<tr>
<th>Switch</th>
<th>OFF Position</th>
<th>ON Position</th>
</tr>
</thead>
<tbody>
<tr>
<td>S7:1*</td>
<td>Baseboard LCD drive enabled</td>
<td>Baseboard LCD drive disabled</td>
</tr>
<tr>
<td>S7:2</td>
<td>Baseboard audio enabled. Associated McASP lines connect to baseboard audio only.</td>
<td>Baseboard audio disabled. Associated McASP lines are available on audio expansion connector.</td>
</tr>
<tr>
<td>S7:3</td>
<td>OMAP-L138 I/O runs at 3.3V</td>
<td>OMAP-L138 I/O runs at 1.8V</td>
</tr>
<tr>
<td>S7:4</td>
<td>Normal connection</td>
<td>Normal connection</td>
</tr>
<tr>
<td>S7:5</td>
<td>BOOT[1]</td>
<td>BOOT[1]</td>
</tr>
</tbody>
</table>

Table 2.11 – S7 DIP Switch Boot Modes

<table>
<thead>
<tr>
<th>Boot Mode</th>
<th>DIP Switch Setting – S7[5:8]</th>
</tr>
</thead>
<tbody>
<tr>
<td>NOR EMIFA</td>
<td>OFF</td>
</tr>
<tr>
<td>NAND-6 EMIFA</td>
<td>OFF</td>
</tr>
<tr>
<td>Default: SPI Flash</td>
<td>OFF</td>
</tr>
<tr>
<td>SPI Debug</td>
<td>ON</td>
</tr>
</tbody>
</table>

Flash Pin Settings – C6748 EVM

EMU MODE

<table>
<thead>
<tr>
<th>SW7</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>8</td>
<td>ON</td>
</tr>
<tr>
<td>7</td>
<td></td>
</tr>
<tr>
<td>6</td>
<td></td>
</tr>
<tr>
<td>5</td>
<td>ON</td>
</tr>
<tr>
<td>4</td>
<td></td>
</tr>
<tr>
<td>3</td>
<td></td>
</tr>
<tr>
<td>2</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td></td>
</tr>
</tbody>
</table>

Default = SPI BOOT

SPI BOOT

<table>
<thead>
<tr>
<th>SW7</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>8</td>
<td>OFF</td>
</tr>
<tr>
<td>7</td>
<td></td>
</tr>
<tr>
<td>6</td>
<td></td>
</tr>
<tr>
<td>5</td>
<td></td>
</tr>
<tr>
<td>4</td>
<td></td>
</tr>
<tr>
<td>3</td>
<td></td>
</tr>
<tr>
<td>2</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td></td>
</tr>
</tbody>
</table>

Default = SPI BOOT
Lab 16b: Booting From Flash

In this lab, a .out file will be loaded to the on-board flash memory so that the program may be run when the board is powered up, with no connection to CCS.

Any lab solution would work for this lab, but again we'll standardize on the “keystone” lab so that we ensure a known quantity.

Lab 16b – ARM+DSP SPI FLASH Boot

- Using AISgen & SPIWriter
  - Select “Keystone” Solution
  - Build “Release” config (.out)
  - Convert ARM and DSP .out files to .bin using AISgen
  - Run SPIWriter.OUT (burn flash)
  - Provide path to .bin
  - Success?
  - Disconnect CCS
  - Power off/on – code runs

- Time: 45 min

- Workshop Students: Skip Lab Steps 1-6 (lab setup only)
Lab16b – Booting From Flash - Procedure

Hint: This lab procedure will work with either the C6748 SOM or OMAP-L138 SOM. The basic procedure is the same but a few steps are VERY different. These will be noted clearly in this document. So, please pay attention to the HINTS and grey boxes like this one along the way.

Tools Download and Setup (Students: SKIP STEPS 1-6 !!)

The following steps in THIS SECTION ONLY have already been performed. So, workshop attendees can skip to the next section. These steps are provided in order to show exactly where and how the flash/boot environment was set up (for future reference).

1. Download AISgen utility – SPRAB41c.

   Download the pdf file from here:

   [http://focus.ti.com/dsp/docs/litabsmultiplefilelist.tsp?docCategoryId=1&familyId=1621&literatureNumber=sprab41c&sectionId=3&tabId=409](http://focus.ti.com/dsp/docs/litabsmultiplefilelist.tsp?docCategoryId=1&familyId=1621&literatureNumber=sprab41c&sectionId=3&tabId=409)

   A screen cap of the pdf file is here:

   ![Using the OMAP-L1x8 Bootloader](http://focus.ti.com/dsp/docs/litabsmultiplefilelist.tsp?docCategoryId=1&familyId=1621&literatureNumber=sprab41c&sectionId=3&tabId=409)

   The contents of this zip are shown here:
2. Create directories to hold tools and projects.

Four directories need to be created:

- C:\TI-RTOS\C6000\Labs\Lab16b_keystone – will contain the audio project (keystone) to build into a .OUT file.
- C:\TI-RTOS\C6000\Labs\Lab13b_ARM_Boot – will contain the ARM boot code required to start up the DSP after booting.
- C:\TI-RTOS\C6000\Labs\Lab13b_SPIWriter – will contain the SPIWriter.out file used to program the flash on the EVM.
- C:\TI-RTOS\C6000\Labs\Lab13b_AIS – contains the AISgen.exe file (shown above) and is where the resulting AIS script (bin) will be located after running the utility (.OUT → .BIN)

Place the “keystone” files into the Lab16b_keystone\Files directory. Users will build a new project to get their .OUT file.

Place the recently downloaded AISgen.exe file into Lab16b_AIS directory.
3. **Download SPI Flash Utilities.**

You can find the SPI Flash Utility here:


This is actually a TI wiki page:


From here, locate the following and click "here" to go to the download page:

![Download Page](http://processors.wiki.ti.com/index.php/Serial_Boot_and_Flash_Loading_Utility_for_OMAP-L138)

This will take you to a SourceForge site that will contain the tools you need to download.

![SourceForge Site](http://processors.wiki.ti.com/index.php/Serial_Boot_and_Flash_Loading_Utility_for_OMAP-L138)

Click on the latest version under OMAP-L138 and download the tar.gz file. UnTAR the contents and you'll see this:

![UnTAR Contents](http://processors.wiki.ti.com/index.php/Serial_Boot_and_Flash_Loading_Utility_for_OMAP-L138)

The path we need is `\OMAP-L138`. If we dive down a bit, we will find the `SPIWriter.out` file that is used to program the flash with our boot image `.bin`.
4. **Copy the SPIWriter.out file to \Lab13b_SPIWriter\ directory.**
   
   Shown below is the initial contents of the Flash Utility download:

   ![Flash Utility directory structure]

   Copy the following file to the \Lab13b_SPIWriter\ directory:

   SPIWriter_OMAP-L138.out

5. **Install AISgen.**

   Find the download of the AISgen.exe file and double-click it to install. After installation, copy a shortcut to the desktop for this program:

   ![AISgen shortcut]

6. **Create the keystone project.**

   Create a new CCSv5 SYS/BIOS project with the source files listed in C:\SYSBIOSv4\Lab13b_keystone\Files. Create this project in the neighboring \Project folder. Also, don’t forget to add the BSL library and BSL includes (as normal). Make sure you use the RELEASE configuration only.
Hint: [workshop students: START HERE]

Build Keystone Project: [Src → .OUT File]

7. Import keystone audio project and make a few changes.
   Import “keystone_flash” project from the following directory:
   
   C:\TI-RTOS\C6000\Labs\Lab16b_keystone\Project

   This project was built for emulation with CCSv5 – i.e., there is a GEL file that sets up our PLL, DDR2, etc. This is actually the SOLUTION to the clk_rta_audio lab (with the platform file set to all data/code INTERNAL). In creating a boot image, as discussed in the chapter, we have to perform these actions in code vs. the GEL creating this nice environment for us.

   So, we have a choice here – write code that runs in main to set up PLL0, PLL1, DDR, etc. OR have the bootloader do it FOR US. Having the bootloader perform these actions offers several advantages – fewer mistakes by human programmers AND, these settings are done at bootload time vs waiting all the way until main() for the settings to take effect.

Hint: The following step is for OMAP-L138 SOM Users ONLY !!

8. Set address of reset vector for DSP

   Here is one of the “tricks” that must be employed when using both the ARM and DSP. The ARM code has to know the entry point (reset vector, c_int00) of the DSP. Well, if you just compile and link, it could go anywhere in L2. If your class is based on SYS/BIOS, please follow those instructions. If you’re interested in how this is done with DSP/BIOS, that solution is also provided for your reference.

SYS/BIOS Users – must add two lines of script code to the CFG file as shown. This script forces the reset vector address for the DSP to 0x11830000. Locate this in the given .cfg file and UNCOMMENT these two lines of code.

```c
var Hwi = xdc.useModule('ti.sysbios.family.c64p.Hwi');
Hwi.resetVectorAddress = 0x11830000;
```

DSP/BIOS Users – must create a linker.cmd file as shown below to force the address of the reset vector. This little command file specifies EXACTLY where the .boot section should go for a BIOS project (this is not necessary for a non-BIOS program).

```cmd
SECTIONS
{
    .boot > 0x11830000
    -l bios.c674<boot.c674>[:.sysinit]
}
```
9. **Examine the platform file.**

In the previous step, we told the tools to place the DSP reset vector specifically at address 0x11830000. This is the upper 64K of the 256K block of L2 RAM. One of our labs in the workshop specified L2 cache as 64K. Guess what? If that setting is still true, L2 cache effectively starts at the same address – which means that this address is NOT available for the reset vector. WHOOPS.

Select Build Options and determine WHICH platform file is associated with this project. Once you have determined which platform it is, open it and examine it. Make sure L2 cache is turned off – or ZERO – and that all code/data/stack segments are allocated in iRAM. If this is not true, then “make it so”.
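The overlap described in this step is simple address arithmetic, sketched below. It uses the L2 base address (0x11800000) and 256K size quoted in this lab and assumes, as the text above implies, that the cache takes the highest addresses of L2. The helper name `addr_in_l2_cache()` is hypothetical.

```c
#include <stdint.h>

#define L2_RAM_BASE 0x11800000u     /* L2 address quoted in this lab */
#define L2_RAM_SIZE (256u * 1024u)  /* 256K block of L2 RAM */

/* Returns 1 if addr falls in the top cache_bytes of L2, else 0.
 * Assumes the cache occupies the highest addresses of L2. */
int addr_in_l2_cache(uint32_t addr, uint32_t cache_bytes)
{
    uint32_t l2_end = L2_RAM_BASE + L2_RAM_SIZE;        /* one past last byte */
    uint32_t cache_start = l2_end - cache_bytes;
    return (addr >= cache_start && addr < l2_end) ? 1 : 0;
}
```

With a 64K cache the check flags 0x11830000 as unavailable; with the cache set to zero the same address is ordinary RAM.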

10. **Build the keystone project.**

Update all tools for XDC, BIOS, UIA. Kill Agent. Update Compiler – basically update everything to your latest tool set to get rid of errors and warnings.

Using the DEBUG build configuration, build the project. This should create the .OUT file. Go check the Debug directory and locate the .OUT file:

```
keystone_flash.out
```

Load the .OUT file and make sure it executes properly. We don’t want to flash something that isn’t working. 😊

Do not close the Debug session yet.
11. Determine silicon rev of the device you are currently using.

AISgen will want to know which silicon rev you are using. Well, you can either attempt to read it off the device itself (which is nearly impossible) or you can visit a convenient place in memory to see it.

Now that you have the Debug perspective open, this should be relatively straightforward. Open a memory view window and type in the following address:

0x11700000

Can you see it? No? Shame on you. Ok. Try changing the style view to "Character" instead. See something different?

Like this?

![Memory view screenshot]

That says “d800k002”, which means Rev 1.0 of the silicon. That’s an older rev…but whatever yours is…write it down below:

Silicon REV: ____________________

FYI – for OMAP-L138 (and C6748), note the following:

- d800k002 = Rev 1.0 silicon (common, but old)
- d800k004 = Rev 1.1 silicon (fairly common)
- d800k006 = Rev 2.0 silicon (if you have a newer board, this is the latest)
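If you ever want to automate this check, the list above maps directly onto a lookup table. The sketch below covers only the three IDs listed in this step. `RomIdEntry` and `silicon_rev_from_rom_id()` are hypothetical names, and any ID not in the list returns `NULL` rather than a guess.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *rom_id;
    const char *silicon_rev;
} RomIdEntry;

/* Table copied from the list in this step. */
static const RomIdEntry rom_id_table[] = {
    { "d800k002", "1.0" },
    { "d800k004", "1.1" },
    { "d800k006", "2.0" },
};

/* Returns the silicon revision string, or NULL if the ID is not listed. */
const char *silicon_rev_from_rom_id(const char *rom_id)
{
    size_t i;
    for (i = 0; i < sizeof rom_id_table / sizeof rom_id_table[0]; i++) {
        if (strcmp(rom_id, rom_id_table[i].rom_id) == 0) {
            return rom_id_table[i].silicon_rev;
        }
    }
    return NULL;
}
```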

There ARE some differences between Rev1 and Rev2 silicon that we’ll mention later in this lab – very important in terms of how the ARM code is written.

You will probably NEVER need to change the memory view to “Character” ever again – so enjoy the moment. 😊

Next, we need to convert this .out file and combine it with the ARM .out file and create a single flash image for both using the AIS script via AISgen…
12. Use the Debug GEL script to locate the Silicon Rev.

This script can be run at any time to debug the state of your silicon and all of the important registers and frequencies your device is running at. This file works for both OMAP-L137/8 and C6747/8 devices. It is a great script to provide feedback for your hardware engineer.

It goes kind of like this: we want a certain frequency for PLL1. We read the documentation and determine that these registers need to be programmed to a, b and c. You write the code, program them and then build/run. Well, is PLL1 set to the frequency you thought it should be? Run the debug script and find out what the processor is “reporting” the setting is. Nice.

This script outputs its results to the Console window.

Let’s use the debug script to determine the silicon rev as in the previous step.

First, we need to LOAD the gel file. This file can be downloaded from the wiki shown in the chapter. We have already done that for you and placed that GEL file in the \gel directory next to the GEL file you’ve been using for CCS.

Select Tools → GEL Files.

Right-click in the empty area under the currently loaded GEL file and select: Load Gel.

The \gel directory should show up and the file OMAPL1x_debug.gel should be listed. If not, browse to C:\SYSBIOSv4\Labs\DSP_BSL\gel.

Click Open.
This will load the new GEL file and place the scripts under the “Scripts” menu.
Select “Scripts” → Diagnostics → Run All:

![Screenshot of Diagnostics menu]

You can choose to run only a specific script or “All” of them. Notice the output in the Console window. Scroll up and find the silicon revision. Also make note of all of the registers and settings this GEL file reports. Quite extensive.

![Image of BootROM Info]

Does your report show the same rev as you found in the previous step? Let’s hope so…

Write down the Si Rev again here:

Silicon Rev (again): ____________________
Use AISgen To Convert [.OUT → .BIN]

AISgen (Application Image Script Generator) is a free downloadable tool from TI – check out the beginning of this lab for the links to get this tool.

13. Locate AISgen.exe (only if requiring installation…if not, see next step).

   The installation file has already been downloaded for you and is sitting in the following directory:

   C:\SYSBIOSv4\Labs\Lab13b_AIS

   Here, you will find the following install file:

   ![Install File](image)

   This is the INSTALL file (fyi). You don’t need to use this if the tool is already installed on your computer…


   There should be an icon on your desktop that looks like this:

   ![Desktop Icon](image)

   If not, you will need to install the tool by double-clicking on the install file, installing it and then creating a shortcut to it on the desktop (you’ll find it in Programs → Texas Instruments → AISgen).

   Double-click on the icon to launch AISgen and fill out the dialogue box as shown on the next page…there are several settings you need…so be careful and go SLOWLY here…

   It is usually BEST to place all of your PLL and DDR settings in the flash image and have the bootloader set these up vs. running code on the DSP to do it. Why? Because the DSP then comes out of reset READY to go at the top speeds vs. running “slow” until your code in main() is run. So, that’s what we plan to do…. 

---

**Note:** Each dialogue has its own section below. It is quite a bit of setup…but hey, you are enabling the bootloader to set up your entire system. This is good stuff…but it takes some work…

---

**Hint:** When you actually use the DSP to burn the flash in a later step, the location you store your .bin file to (name of the .bin file AND the directory path you place the .bin file in) CANNOT have ANY SPACES IN THE PATH OR FILENAME.
Main dialogue – basic settings.

Fill out the following on this page:

- Device Type (match it up with what you determined before)
  - For OMAP-L138 SOM (ARM + DSP), choose “ARM”. If you’re using the 6748 SOM, choose “DSP”.
  - Boot Mode: SPI1 Flash. On the OMAP-L138, the SPI1 port and UART2 ports are connected to the flash.
  - For now, wait on filling in the Application and Output files.

**Hint:** For C6748 SOM, choose “DSP” as the Device type

**Hint:** For OMAP-L138 SOM, choose “ARM” as the Device type

Note: you will type in these paths in a future step – do NOT do it now...
Configure PLL0, PLL0 Tab

On the "General" tab, check the box for "Configure PLL0" as shown:

![Configure PLL0](image)

Then click on the PLL0 tab and view these settings. You will see the defaults show up. Make the following modifications as shown below.

Change the multiplier value from 20 to 25 and notice the values in the bottom RH corner change.
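To see why the numbers in the corner change, the generic PLL relationship can be written as a one-line function. The input clock and divider values used in the note after the code (24 MHz input, pre-divider 1, post-divider 2) are assumptions for illustration only and are not stated in this guide, so read the real values from your own AISgen dialog. `pll_output_hz()` is a hypothetical helper name.

```c
#include <stdint.h>

/* Generic PLL output: (input / pre-divider) * multiplier / post-divider.
 * All parameters are the actual ratios, not register encodings. */
uint32_t pll_output_hz(uint32_t clkin_hz, uint32_t prediv,
                       uint32_t mult, uint32_t postdiv)
{
    return (uint32_t)(((uint64_t)clkin_hz / prediv) * mult / postdiv);
}
```

Under those assumed values, a multiplier of 20 gives 240 MHz and a multiplier of 25 gives 300 MHz.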

Peripheral Tab

Next, click on the Peripheral tab. This is where you will set the SPI Clock. It is a function (divide down) from the CPU clock. If you leave it at 1MHz, well, it will work, but the bootload will take WAY longer. So, this is a "speed up" enhancement.

Type "20" into the SPI Clock field as shown:

![Peripheral Tab](image)

Also check the "Enable Sequential Read" checkbox. Why is this important? Speed of the boot load. If this box is unchecked, the ROM code will send out a read command (0x03) plus a 24-bit address before every single BYTE. That is a TON of read commands.

However, if we CHECK this box, the ROM code will send out a single read command with one 24-bit address (0x000000) and then proceed to read out the ENTIRE boot image. WAY WAY faster.
Configure PLL1

Just in case you EVER want to put code or data into the DDR, PLL1 needs to be set in the flash image and therefore configured by the bootloader.

So, click the checkbox next to “Configure PLL1”, click on that tab, and use the following settings:

![Configure PLL1 Settings](image)

This will clock the DDR at 300MHz. This is equivalent to what our GEL file sets the DDR frequency to. We don’t have any code in DDR at the moment – but now we have it setup just in case we ever do later on. Now, we need to write values to the DDR config registers…

Configure DDR

You know the drill. Click the proper checkbox on the main dialogue page and click on the DDR tab. Fill in the following values as shown. If you want to know what each of the values are on the right, look it up in the datasheet. 😊
Configure PSC0, PSC0 Tab

Next, we need to configure the Power and Sleep Controller (PSC) to allow the ARM to write to the DSP’s L2 memory. Each module is controlled by its own Local PSC (LPSC). If both the ARM and DSP code resided in L3, well, the ARM bootloader could then easily write to L3. But, with a BIOS program, BIOS wants to live in L2 DSP memory (around 0x11800000). In order for the ARM bootloader code to write to this address, we need to have the DSP clocks powered up. Enabling the right LPSC module in PSC0 does this for us.

On the main page, “check” the box next to “Configure PSC” and go to the PSC tab.

In the GEL file we’ve been using in the workshop, a function named PSC_All_On_Full_EVM() runs to set all the PSC values. We could cheat and just type in “15” as shown below:

Minimum Setting (don’t use this for the lab):

<table>
<thead>
<tr>
<th>PSC0</th>
<th>PSC1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Enable LPSC: 15</td>
<td>Disable LPSC:</td>
</tr>
<tr>
<td>Sync Rst LPSC:</td>
<td></td>
</tr>
</tbody>
</table>

This would Enable module 15 of the PSC which says “de-assert the reset on the DSP megamodule” and enable the clocks so that the ARM can write to the DSP memory located in L2. However, this setting does NOT match what the GEL file did for us. So, we need to enable MORE of the PSC modules so that we match the GEL file.

Note: When doing this for your own system, you’ll need to pick and choose the PSC modules that are important to your specific system.

Better Setting (USE THIS ONE for the lab – or as a starting point for your own system)

<table>
<thead>
<tr>
<th>PSC0</th>
<th>PSC1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Enable LPSC: 0;1;2;3;4;5;9;10;11;12;13;15</td>
<td>Disable LPSC:</td>
</tr>
<tr>
<td>Sync Rst LPSC:</td>
<td></td>
</tr>
</tbody>
</table>

The numbers scroll out of sight, so here are the values:

PSC0: 0;1;2;3;4;5;9;10;11;12;13;15
PSC1: 0;1;2;3;4;5;6;7;9;10;11;12;13;14;15;16;17;18;19;20;21;24;25;26;27;28;29;30;31

Note: PSC1 is MISSING modules 8, 22-23 (see datasheet for more details on these).
Notice for SATA users:

PSC1 Module 8 (SATA) is specifically NOT being enabled. There is a note in the System Reference Guide saying that you need to set the FORCE bit in MDCTL when enabling SATA. That's not an option in the GUI/bootROM so we simply cannot enable it. If you ignore the author’s advice and enable module 8 in PSC1, you’ll find the boot ROM gets stuck in a spin loop waiting for SATA to transition and so ultimately your boot fails as a result.

So, there are really two pieces to this puzzle if using SATA:

A. Make sure you do NOT try to enable PSC1 Module 8 through AISgen

B. If you need SATA, make sure you enable this through your application code and be sure to set the FORCE bit in MDCTL when doing so.

FINAL CHECK - SUMMARY

So, your final main dialogue should look like this with all of these tabs showing. Please double-check you didn’t forget something:

Save your .cfg file in the Lab13b_AIS folder for potential use later on – you don’t want to have to re-create all of these steps again if you can avoid it. If you look in that folder, it already contains this .cfg file done for you. Ok, so we could have told you that earlier, but then the learning would have been crippled.

The author named the solution’s config file:

OMAP-L138-ARM-DSP-LAB13B_TTO.cfg

Hint: C6748 Users: You will only specify ONE output file (DSP.out)

Hint: OMAP-L138 Users: You will specify TWO files (an ARM.out and a DSP.out).
ARM/DSP Application & Output Files

Ok, we’re almost done with the AISgen settings.

**Hint:** 6748 SOM Users – follow THESE directions (OMAP Users can skip this part)

For the “DSP Application File”, browse to the .OUT file that was created when you built your keystone project: `keystone_flash.out`

**Hint:** OMAP-L138 SOM Users – follow THESE directions:

For OMAP-L138 users: you will enter the paths to both files and AISgen will combine them into ONE image (.bin) to burn into the flash. You must **FIRST specify the ARM.out file** followed by the DSP.out file – this order MATTERS.

Follow these steps in order carefully.

Click the “…” button shown above next to “ARM Application File” to browse to (use \Lab13b instead):

Click Open.

Your screen should now look like this (except for using \Lab13b…):

This ARM code is for rev1 silicon. It should also work on Rev2 silicon – but not tested.
Next, click on the “+” sign (yours will say \Lab13b):

```
| ARM Application File: | C:\SYSBIOSv4\Labs\Lab12b_ARM_Boot\OMAPL138-AR |
| AIS Output File:      |                                                 |
```

and browse to your `keystone_flash.out` file you built earlier. You should now have two .out files listed under “ARM Application File” – first the ARM.out, then the DSP.out files separated by a semicolon. Double-check this is the case.

The AISgen software won’t allow you to see both paths at once in that tiny box, but here is a picture of the “middle” of the path showing the “semicolon” in the middle of the two .out files – again, the ARM.out file needs to be first followed by the DSP.out file (use \Lab13b instead):

```
| ARM Application File: | \Lab12b_Keystone\Project \Lab12b_ARM_Boot\OMAPL138-AR |
| AIS Output File:      |                                                 |
```

**Hint:** ALL SOM Users – Follow THIS STEP…

For the Output file, name it “flash.bin” and use the following path:

```
C:\SYSBIOSv4\Labs\Lab13b_AIS\flash.bin
```

**Hint:** Again, the path and filename CANNOT contain any spaces. When you run the flash writer later on, that program will barf on the file if there are any spaces in the path or filename.

Before you click the “Generate AIS” button, notice the other configuration options you have here. If you wanted AIS to write the code to configure any of these options, simply check them and fill out the info on the proper tab. This is a WAY cool interface. And, the bootloader does “system” setup for you instead of writing code to do it – and making mistakes and debugging those mistakes…and getting frustrated…like getting tired of reading this rambling text from the author….
15. Generate AIS script (flash.bin).

Click the “Generate AIS” button. When complete, it will provide a little feedback as to how many bytes were written. Like this:

```
Wrote 63388 bytes to file C:\BIOSv4\Sols\Lab14a_keysto... Generate AIS
```

So, what did you just do?

For OMAP-L138 (ARM+DSP) users, you just combined the ARM.out and DSP.out files into one flash image – flash.bin. For C6748 Users, you simply converted your .out file to a flash image.

The next step is to burn the flash with this image and then let the bootloader do its thing...

**Program the Flash: [.BIN → SPI1 Flash]**

16. Check target config and pin settings.

Use the standard XDS510 Target Config file that uses one GEL file (like all the other labs in this workshop). Make sure it is the default.

Also, make sure pins 5 and 8 on the EVM (S7 – switch 7) are ON/UP – so that we are in EMU mode – NOT flash boot mode.

17. Load SPIWriter.out into CCS.

The SPIWriter.out file should already be copied into a convenient place:

```
C:\SYSBIOSv4\Labs\Lab13b_SPIWriter
```

In CCS,

- Launch a debug session (right-click on the target config file and click “launch”)
- Connect to target
- Select “Load program” and browse to this location:

```
C:\SYSBIOSv4\Labs\Lab13b_SPIWriter\SPIWriter_OMAP-L138.out
```
18. PLAY!

Click Play. The console window will pop up and ask you a question about whether this is a UBL image. The answer is NO. Only if you were using a TI UBL, which would then boot U-Boot, would the answer be yes. That case assumes Linux is running. Our ARM code has no O/S.

Type a lowercase "n" and hit [ENTER]. To respond to the next question, provide the path name for your .BIN file (flash.bin) created in a previous step, i.e.:

```
C:\SYSBIOSv4\Labs\Lab13b_AIS\flash.bin
```

**Hint:** Do NOT have any spaces in this path name for SPIWriter – it NO WORK that way.

Here’s a screen capture from the author (although, you are using the \Lab13b_ais dir, not \Lab12b):

![Screen capture showing SPI boot preparation was successful](image)

Let it run – shouldn’t take too long. 15-20 seconds (with an XDS510 emulator). You will see some progress msgs and then see “success” – like this:

```
SPI boot preparation was successful!
```

19. Terminate the Debug session, close CCS.
20. Ensure DIP switches are set correctly and get music playing, then power-cycle!

Make sure ALL DIP switches on S7 are DOWN [OFF]. This will place the EVM into the SPI-1 boot mode. Get some music playing. Power cycle the board and THERE IT GOES…

No need to re-flash anything like a POST – just leave your neat little program in there for some unsuspecting person to stumble on one day when they forget to set the DIP switches back to EMU mode and they automagically hear audio coming out of the speakers when they turn on the power. Freaky. You should see the LED blinking as well…great work !!

**Hint:**  DO NOT SKIP THE FOLLOWING STEP.

21. Change the boot mode pins on the EVM back to their original state.

Please ensure DIP_5 and DIP_8 of S7 (the one on the right) are UP [ON].

RAISE YOUR HAND and get the instructor’s attention when you have completed this lab. If time permits, move on to the next OPTIONAL part…
Optional – DDR Usage

Go back to your keystone project and link the *data buffers* into DDR memory (just like we did in the cache lab) via the platform file. Re-compile and generate a new .out file. Then, use AISgen to create a new flash.bin file and flash it with SPIWriter. Then reset the board and see if it worked. Did it?

FYI – to make things go quicker, we have a .cfg file pre-loaded for AISgen. It is located at (use `\Lab13b_AIS`): 

When running AISgen, you can simply load this config file and it contains ALL of the settings from this lab. Edit, recompile, load this cfg, generate .bin, burn, reset. Quick.

Or, you can simply use the .cfg file you saved earlier in this lab...
Additional Information

AIS – Boot Script

Application Image Script (AIS) Boot

AIS is a format for storing the boot image. Apart from the HPI and two NOR-boot modes described above, all boot modes supported by the OMAP-L1x8 bootloader use AIS for boot purposes. AIS is a binary language, accessed in terms of 32-bit (4-byte) words in little-endian format. AIS starts with a magic word (0x41504954) and contains a series of AIS commands, which are executed by the bootloader in a sequential manner. The Jump & Close (J&C) command marks the end of AIS.

Figure 4. Structure of AIS

Each AIS command consists of an opcode, optionally followed by one or more arguments, followed by optional data.

Figure 5. Structure of an AIS Command
Stream I/O and Drivers (PSP/IOM)

Introduction

In this chapter a technique to exchange buffers of data between input/output devices and processing threads will be considered. The BIOS 'stream' interface will be seen to provide a universal interface between I/O and processing threads, making coding easier and more easily reused.

Objectives

- Analyze **BIOS streams** – SIO – and the key APIs used
- **Adapt a TSK** to use SIO (Stream I/O)
- Describe the **benefits** of multi-buffer streams
- Learn the basics of PSP drivers
Module Topics

Stream I/O and Drivers (PSP/IOM)

- Module Topics
- Driver I/O - Intro
- Using Double Buffers
- PSP/IOM Drivers
- Additional Information
- Notes
Driver I/O - Intro

Basic Real-time System Design (IPO)

Input
- Input (Driver)
- Input source data

Process
- User Program
- “master thread”
- Process (algorithm)
- Convert input data to desired results

Output
- Output (Driver)
- Export results

SYS/BIOS System

User Program

// "Master Thread"
// Create Phase
Initialize Drivers
Create Algo Instance
// Execute Phase
while (run)
- Input (exch bufs)
- Process
- Output (exch bufs)
// Delete Phase
Delete Algo Instance

GUI
- Sort
- Volume
- Bass
- Treble

Process
- (xDM algo)

BIOS
- Driver API
- Input Driver
- Output Driver

VISA = Algorithm Interface
- Video
- Imaging
- Speech
- Audio
- Etc. (same for every OS)

Driver = Varies from OS-to-OS

VISA API
- create
- process
- control
- delete

BIOS Driver API
- Stream_Create
- Stream_Issue
- Stream_Reclaim
- Stream_Control
- Stream_Delete
**Basic Driver API (SYS/BIOS Stream I/O)**

- **Stream I/O**: interface between TSKs and Devices
  - Universal interface to I/O devices
  - # of buffers and buffer size are user selectable
- **Unidirectional**: streams are input or output - not both
- **Efficiency**: uses pointer exchange instead of buffer copy

**APIs:**
- `Stream_issue()` – passes buffer to the stream
- `Stream_reclaim()` – requests buffer from stream, blocks until available

---

**Master Thread – Accessing I/O (BIOS)**

```c
// Slide pseudocode: argument lists are abbreviated

// Create Phase (single buffer)
// issue EMPTY buffer to inStream
// issue EMPTY buffer to outStream

// Execute phase
while( doRecordVideo == 1 ) {
    inSize = Stream_reclaim(inStream, pBufIn, size);    // get FULL input buffer
    outSize = Stream_reclaim(outStream, pBufOut);       // get EMPTY output buffer
    ...
    ... DO DSP ...                                      // Algo goes here
    ...
    status = Stream_issue(inStream, pBufIn, size);      // issue EMPTY buffer to inStream
    status = Stream_issue(outStream, pBufOut, size);    // issue FULL buffer to outStream
}

// Delete Phase
// retrieve buffers back from stream
Stream_reclaim(inStream, pBufIn);
Stream_reclaim(outStream, pBufOut);
```

---

Using Double Buffers

---

**Double Buffer Example – Input Only**

```c
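// prolog – prime the process...
// (only the input stream is primed on this slide; the output stream must
//  also be primed before SIO_reclaim(&sioOut, ...) in the loop can return)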
status = SIO_issue(&sioIn, pIn1, SIZE, NULL);
status = SIO_issue(&sioIn, pIn2, SIZE, NULL);

// while loop – iterate the process...
while (condition == TRUE)
{
    size = SIO_reclaim(&sioIn, (Ptr *)&pInX, NULL);
    size = SIO_reclaim(&sioOut, (Ptr *)&pOutX, NULL);
    // DSP... to pOut
    status = SIO_issue(&sioIn, pInX, SIZE, NULL);
    status = SIO_issue(&sioOut, pOutX, SIZE, NULL);
}

// epilog – wind down the process...
status = SIO_flush(&sioIn); // stop input
status = SIO_idle(&sioOut); // idle output, then stop
size = SIO_reclaim(&sioIn, (Ptr *)&pIn1, NULL);
size = SIO_reclaim(&sioIn, (Ptr *)&pIn2, NULL);
size = SIO_reclaim(&sioOut, (Ptr *)&pOut1, NULL);
size = SIO_reclaim(&sioOut, (Ptr *)&pOut2, NULL);
```

---

Double Buffer Stream TSK Coding Example

//prolog – prime the process...
status = SIO_issue(&sioIn, pIn1, SIZE, NULL);
status = SIO_issue(&sioIn, pIn2, SIZE, NULL);
size = SIO_reclaim(&sioIn, (Ptr *)&pInX, NULL);
// DSP... To pOut1
status = SIO_issue(&sioIn, pInX, SIZE, NULL);
size = SIO_reclaim(&sioIn, (Ptr *)&pInX, NULL);
// DSP... To pOut2
status = SIO_issue(&sioOut, pOut1, SIZE, NULL);
status = SIO_issue(&sioOut, pOut2, SIZE, NULL);

//while loop – iterate the process...
while (condition == TRUE){
    size = SIO_reclaim(&sioIn, (Ptr *)&pInX, NULL);
    size = SIO_reclaim(&sioOut, (Ptr *)&pOutX, NULL);
    // DSP... To pOut
    status = SIO_issue(&sioIn, pInX, SIZE, NULL);
    status = SIO_issue(&sioOut, pOutX, SIZE, NULL);
}

//epilog – wind down the process...
status = SIO_flush(&sioIn); //stop input
status = SIO_idle(&sioOut); //idle output, then stop
size = SIO_reclaim(&sioIn, (Ptr *)&pIn1, NULL);
size = SIO_reclaim(&sioIn, (Ptr *)&pIn2, NULL);
size = SIO_reclaim(&sioOut, (Ptr *)&pOut1, NULL);
size = SIO_reclaim(&sioOut, (Ptr *)&pOut2, NULL);
PSP/IOM Drivers

Using a PSP/IOM Driver – Procedure (1)

♦ Procedure: Using an IOM/PSP driver

1. Register the IOM-compliant PSP Driver
   • The heart of each PSP driver is an IOM-compliant mini-driver.
   • Register via GUI or .tcf. Refer to driver’s sample app and U/G for parameters.

2. Define DIO Adapter (BIOS Class Driver)
   • Doorway between PSP/IOM and SIO/GIO stream
   • Define via GUI, tie to specific “device” – i.e. what was created in Step 1.

3. Define Stream (SIO)
   • Create the stream statically (GUI) or dynamically
   • Tie the stream to a DIO Adapter

4. Add PSP/IOM Driver Library to Your Project
Using a PSP/IOM Driver – Procedure (2)

1. Register IOM/PSP Driver
   - Register via GUI or .tcf. Refer to driver’s sample app and User Guide for parameters.

2. Define DIO Adapter
   - Define via GUI and tie to specific “device”

Using a PSP/IOM Driver – Procedure (3)

3. Define Stream (SIO)
   - Create the stream statically (GUI) or dynamically
   - Tie the stream to a DIO Adapter

4. Add Library to Project
   - Either add the library directly to your project or via Build Options

Dynamic Stream Config

```c
sioIn = SIO_create("sioIn", SIO_INPUT, 4*BUF, &attrs);
sioOut = SIO_create("sioOut", SIO_OUTPUT, 4*BUF, &attrs);
```
**PSP – For More Information (TI Wiki)**

- Check out the PSP Tutorial on the TI Wiki...

**PSP Drivers – Where Are They?**

- PSP/IOM drivers are part of the SDK download of your specific device
- Questions galore:
  - **Where are they?**
  - **Any examples?**
  - **Are they documented?**

**Driver Docs (e.g.)**

- BIOSPSP_MKASP_Driver_Design.pdf

**BIOS PSP Docs**

- C6748_BIOSPSP_Datasheet.pdf
- C6748_BIOSPSP_ReleaseNotes.pdf
- C6748_BIOSPSP_Userguide.pdf
Additional Information

**SIO API Summary**

**Buffer Passing**
- `SIO_issue` Send a buffer to a stream
- `SIO_reclaim` Request a buffer back from a stream
- `SIO_ready` Test to see if stream has buffer available for reclaim
- `SIO_select` Wait for any of a specified group of streams to be ready

**Stream Management**
- `SIO_staticbuf` Obtain pointer to statically created buffer
- `SIO_flush` Idle a stream by flushing buffers
- `SIO_idle` Idle a stream
- `SIO_ctrl` Perform a device-dependent control operation

**Stream Properties Interrogation**
- `SIO_bufsize` Returns size of the buffers specified in stream object
- `SIO_nbufs` Returns number of buffers specified in stream object
- `SIO_segid` Memory segment used by a stream as per stream object

**Dynamic Stream Management (mod.11)**
- `SIO_create` Dynamically create a stream (malloc fxn)
- `SIO_delete` Delete a dynamically created stream (free fxn)

**Archaic Stream API**
- `SIO_get` Get buffer from stream
- `SIO_put` Put buffer to a stream

---

**Triple Buffer Stream Coding Example**

```c
//prolog - prime the process...
status = SIO_issue(&sioIn, pIn1, SIZE, NULL);
status = SIO_issue(&sioIn, pIn2, SIZE, NULL);
status = SIO_issue(&sioIn, pIn3, SIZE, NULL);
size = SIO_reclaim(&sioIn, (Ptr *)&pInX, NULL);
// DSP ... to pOut1
status = SIO_issue(&sioIn, pInX, SIZE, NULL);
size = SIO_reclaim(&sioIn, (Ptr *)&pInX, NULL);
// DSP ... to pOut2
status = SIO_issue(&sioIn, pInX, SIZE, NULL);
size = SIO_reclaim(&sioIn, (Ptr *)&pInX, NULL);
// DSP ... to pOut3
status = SIO_issue(&sioIn, pInX, SIZE, NULL);
status = SIO_issue(&sioOut, pOut1, SIZE, NULL);
status = SIO_issue(&sioOut, pOut2, SIZE, NULL);
status = SIO_issue(&sioOut, pOut3, SIZE, NULL);

//while loop - iterate the process... No change here!
while (condition == TRUE){
  size = SIO_reclaim(&sioIn, (Ptr *)&pInX, NULL);
  size = SIO_reclaim(&sioOut, (Ptr *)&pOutX, NULL);
  // DSP ... to pOut
  status = SIO_issue(&sioIn, pInX, SIZE, NULL);
  status = SIO_issue(&sioOut, pOutX, SIZE, NULL);
}

//epilog - wind down...
status = SIO_flush(&sioIn);   // stop input
status = SIO_idle(&sioOut);   // idle output, then stop
size = SIO_reclaim(&sioIn, (Ptr *)&pIn1, NULL);
size = SIO_reclaim(&sioIn, (Ptr *)&pIn2, NULL);
size = SIO_reclaim(&sioIn, (Ptr *)&pIn3, NULL);
size = SIO_reclaim(&sioOut, (Ptr *)&pOut1, NULL);
size = SIO_reclaim(&sioOut, (Ptr *)&pOut2, NULL);
size = SIO_reclaim(&sioOut, (Ptr *)&pOut3, NULL);
```

---
"N" Buffer Stream Coding Example

//prolog – prime the process...
for (n=0;n<SIO_nbufs(&sioIn);n++)
    status = SIO_issue(&sioIn, pIn[n], SIZE, NULL);

for (n=0;n<SIO_nbufs(&sioOut);n++) {
    size = SIO_reclaim(&sioIn, (Ptr *)&pInX, NULL);
    // DSP... To pOut[n]
    status = SIO_issue(&sioIn, pInX, SIZE, NULL );
}

for (n=0;n<SIO_nbufs(&sioOut);n++)
    status = SIO_issue(&sioOut, pOut[n], SIZE, NULL);

//while loop – iterate the process... NO CHANGE HERE !!!!!
while (condition == TRUE) {
    size = SIO_reclaim(&sioIn, (Ptr *)&pInX, NULL);
    size = SIO_reclaim(&sioOut, (Ptr *)&pOutX, NULL);
    // DSP... To pOut
    status = SIO_issue(&sioIn, pInX, SIZE, NULL );
    status = SIO_issue(&sioOut, pOutX, SIZE, NULL);
}

//epilog – wind down...
status = SIO_flush(&sioIn);
status = SIO_idle(&sioOut);
for (n=0;n<SIO_nbufs(&sioIn);n++)
    size = SIO_reclaim(&sioIn, (Ptr *)&pIn[n], NULL);
for (n=0;n<SIO_nbufs(&sioOut);n++)
    size = SIO_reclaim(&sioOut, (Ptr *)&pOut[n], NULL);
C66x Introduction

Introduction

This chapter provides a high-level overview of the architecture of the C66x devices along with a brief overview of the MCSDK (Multicore Software Development Kit).

Objectives

- Describe the basic architecture of the C66x family of devices
- Provide an overview of each device subsystem
- Describe the basic features of the Multicore Software Development Kit (MCSDK)
## Module Topics

- C66x Introduction
- Module Topics
- C66x Family Overview
- C6000 Roadmap
- C667x Architecture Overview
- C665x Low-Power Devices
- MCSDK Overview
- What is the MCSDK ?
- Software Architecture
- For More Info...
- Notes
- More Notes...
C66x Family Overview

C6000 Roadmap

Enhanced DSP core

- C66x ISA
  - 100% upward object code compatible
  - 4x performance improvement for multiply operation
  - 32 16-bit MACs
  - Improved support for complex arithmetic and matrix computation

- C674x
  - 100% upward object code compatible with C64x, C64x+, C67x, and C67x+
  - Best of fixed-point and floating-point architecture for better system performance and faster time-to-market.

- C67x+
  - 2x registers
  - Enhanced floating-point add capabilities

- C67x
  - IEEE 754 Native Instructions for SP & DP
  - Advanced VLIW architecture

- C64x+
  - SPLOOP and 16-bit instructions for smaller code size
  - Four 16-bit or eight 8-bit MACs
  - Flexible level one memory architecture
  - iDMA for rapid data transfers between local memories

- C64x
  - Advanced fixed-point instructions
  - Four 16-bit or eight 8-bit MACs
  - Two-level cache
C667x Architecture Overview

CorePac

- 1 to 8 C66x CorePac DSP Cores operating at up to 1.25 GHz
  - Fixed/Floating-pt operations
  - Code compatible with other C64x+ and C67x+ devices
- L1 Memory
  - Partition as Cache or RAM
  - 32KB L1P/D per core
- Dedicated L2 Memory
  - Partition as Cache or RAM
  - 512 KB to 1 MB per core
- Direct connection to memory subsystem

Memory Subsystem

- Multicore Shared Memory (MSM SRAM)
  - 2 to 4MB (Program or Data)
  - Available to all cores
- Multicore Shared Memory Controller (MSMC)
  - Arbitrates access to shared memory and DDR3 EMIF
  - Provides CorePac access to coprocessors and I/O
  - Provides address extension to 64G (36 bits)
- DDR3 External Memory Interface (EMIF) – 8GB
  - Support for 16/32/64-bit modes
  - Specified at up to 1600 MT/s
Multicore Navigator

- Provides seamless inter-core communications (msgs and data) between cores, IP, and peripherals. “Fire and forget”
- Low-overhead processing and routing of packet traffic to/from cores and I/O
- Supports dynamic load optimization
- Consists of a Queue Manager Subsystem (QMSS) and multiple, dedicated Packet DMA engines

Multicore Navigator Architecture

[Diagram showing the architecture of Multicore Navigator]
Network Coprocessor

- Provides H/W accelerators to perform L2, L3, L4 processing and encryption (often done in S/W)
- Packet Accelerator (PA)
  - 8K multi-in/out HW queues
  - Single IP address option
  - UDP/TCP checksum and CRCs
  - Quality of Service (QoS) support
  - Multi-cast to multiple queues
- Security Accelerator (SA)
  - HW encryption, decryption
  - Supports protocols: IPsec ESP, IPsec AH, SRTP, 3GPP

External Interfaces

- 2x SGMII ports – support 10/100/1000 Ethernet
- 4x SRIO lanes for inter-DSP xfrs
- SPI for boot operations
- UART for development/test
- 2x PCIe at 5Gbps
- I2C for EEPROM at 400 Kbps
- GPIO
- App-specific interfaces
TeraNet Switch Fabric

- Non-blocking switch fabric that enables fast and contention-free data movement
- Can configure/manage traffic queues and priorities of xfers while minimizing core involvement
- High-bandwidth transfers between cores, subsystems, peripherals and memory

TeraNet Data Connections

- Facilitates high-bandwidth communication links between DSP cores, subsystems, peripherals, and memories.
- Supports parallel orthogonal communication links
**Diagnostic Enhancements**

- Embedded Trace Buffers (ETB) enhance CorePac’s diagnostic capabilities
- CP Monitor provides diagnostics on TeraNet data traffic
- Automatic statistics collection and exporting (non-intrusive)
- Can monitor individual events
- Monitor all memory transactions
- Configure triggers to determine when data is collected

**HyperLink Bus**

- Expands the TeraNet Bus to external devices
- Supports 4 lanes with up to 12.5Gbaud per lane

**Miscellaneous Elements**

- **Boot ROM**
- HW Semaphore provides atomic access to shared resources
- Power Management
- PLL1 (Corepacs), PLL2 (DDR3), PLL3 (Packet Acceleration)
- Three EDMA Controllers
- 16 64-bit Timers
- Inter-Processor Communication (IPC) Registers

**App-Specific: Wireless Applications**

- Wireless-specific Coprocessors
  - 2x FFT Coprocessor (FFTC)
  - Turbo Dec/Enc (TCP3D/3E)
  - 4x Viterbi Coprocessor (VCP2)
  - Bit-rate Coprocessor (BCP)
  - 2x Rake Search Accel (RSA)

- Wireless-specific Interfaces
  - 6x Antenna Interface (AIF2)
App-Specific: General Purpose

- 2x Telecom Serial Port (TSIP)
- EMIF 16 (EMIF-A):
  - Connects memory up to 256MB
  - Three modes:
    - Synchronized SRAM
    - NAND Flash
    - NOR Flash

[Diagram showing the C667x blocks for general purpose applications: CorePac, Memory Subsystem, Multicore Navigator, Network Coprocessor, External Interfaces, TeraNet Switch Fabric, Diagnostic Enhancements, HyperLink Bus, Miscellaneous, and common and app-specific I/O]
C665x Low-Power Devices

Keystone C6655/57 – Device Features

- **C66x CorePac**
  - C6655 (1 core) @ 1/1.25 GHz
  - C6657 (2 cores) @ 0.85, 1.0 or 1.25 GHz
- **Memory Subsystem**
  - 1MB Local L2 per core
  - MSMC, 32-bit DDR3 I/F
- **Hardware Coprocessors**
  - TCP3d, VCP2
- **Multicore Navigator**
- **Interfaces**
  - 2x McBSP, SPI, I2C, UPP, UART
  - 1x 10/100/1000 SGMII port
  - Hyperlink, 4x SRIO, 2x PCIe
  - EMIF 16, GPIO
- **Debug and Trace (ETB/STB)**

Keystone C6654 – Power Optimized

- **C66x CorePac**
  - C6654 (1 core) @ 850 MHz
- **Memory Subsystem**
  - 1MB Local L2
- **Multicore Navigator**
- **Interfaces**
  - 2x McBSP, SPI, I2C, UPP, UART
  - 1x 10/100/1000 SGMII port
  - EMIF 16, GPIO
- **Debug and Trace (ETB/STB)**
# Keystone C665x – Comparisons

<table>
<thead>
<tr>
<th>HW Feature</th>
<th>C6654</th>
<th>C6655</th>
<th>C6657</th>
</tr>
</thead>
<tbody>
<tr>
<td>CorePacs @ Frequency (GHz)</td>
<td>1 @ 0.85</td>
<td>1 @ 1.0, 1.25</td>
<td>2 @ 0.85, 1.0, 1.25</td>
</tr>
<tr>
<td>Multicore Shared Mem (MSM)</td>
<td>No</td>
<td>1MB SRAM</td>
<td>1MB SRAM</td>
</tr>
<tr>
<td>DDR3 Maximum Data Rate (MT/s)</td>
<td>1066</td>
<td>1333</td>
<td>1333</td>
</tr>
<tr>
<td>Serial Rapid I/O (SRIO) Lanes</td>
<td>No</td>
<td>4x</td>
<td>4x</td>
</tr>
<tr>
<td>HyperLink</td>
<td>No</td>
<td>Yes</td>
<td>Yes</td>
</tr>
<tr>
<td>Viterbi CoProcessor (VCP)</td>
<td>No</td>
<td>2x</td>
<td>2x</td>
</tr>
<tr>
<td>Turbo Decoder (TCP3d)</td>
<td>No</td>
<td>Yes</td>
<td>Yes</td>
</tr>
</tbody>
</table>
What is MCSDK?

- The Multicore Software Development Kit (MCSDK) provides the core foundational building blocks for customers to quickly start developing embedded applications on TI high performance multicore DSPs.
- Uses the SYS/BIOS or Linux real-time operating system
- Accelerates customer time to market by focusing on ease of use and performance
- Provides multicore programming methodologies
- Available for free on the TI website, bundled in one installer. All the software in the MCSDK is in source form along with pre-built libraries.
Software Architecture

Migrating Development Platform

- TI Demo Application on TI Evaluation Platform
- TI Demo Application on Customer Platform
- Customer Application on Customer Platform
- Customer App on Next Generation TI SOC Platform

Legend:
- No modifications required
- May be used "as is" or customer can implement value-add modifications
- Needs to be modified or replaced with customer version

Software may be different, but API remains the same (CSL, LLD, etc.)

BIOS-MCSDK Software

- Demonstration Applications
- Software Framework Components
  - Interprocessor Communication
  - Instrumentation (MCSA)
- Communication Protocols
  - TCP/IP Networking (NDK)
- Algorithm Libraries
  - DSPLIB
  - IMGLIB
  - MATHLIB
- Low-Level Drivers (LLDs)
  - EDMA3, PA, SRIO, FFTC, TSIP, PCIe, QMSS, CPPI, HyperLink
- Platform/EVM Software
  - Platform Library
  - Transports - IPC - NDK
  - Resource Manager
  - OSAL
  - Bootloader
- Chip Support Library (CSL)
- Hardware
Interprocessor Communication (IPC)

<table>
<thead>
<tr>
<th>Device 1</th>
<th>Device 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>Core 1</td>
<td>Core 2</td>
</tr>
<tr>
<td>BIOS</td>
<td>BIOS</td>
</tr>
<tr>
<td>Process 1</td>
<td>Process 1</td>
</tr>
<tr>
<td>IPC</td>
<td>IPC</td>
</tr>
<tr>
<td>SoC Hardware and Peripherals</td>
<td>SoC Hardware and Peripherals</td>
</tr>
</tbody>
</table>

IPC Transports
- Task to Task
- Core to Core
- Device to Device

<table>
<thead>
<tr>
<th>Transport</th>
<th>Task to Task</th>
<th>Core to Core</th>
<th>Device to Device</th>
</tr>
</thead>
<tbody>
<tr>
<td>Shared Memory</td>
<td>x</td>
<td>x</td>
<td></td>
</tr>
<tr>
<td>Navigator/QMSS</td>
<td>x</td>
<td>x</td>
<td></td>
</tr>
<tr>
<td>SRIO</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>PCIe</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>HyperLink</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
</tbody>
</table>
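
The transport table above can also be read as a lookup structure. The following is a minimal sketch, assuming the "x" marks mean "supported"; the names `SCOPE_*`, `transport_info`, `transport_supports`, and `transports_for_scope` are hypothetical helpers invented for this illustration and are not part of the IPC or MCSDK APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Communication scopes from the table, encoded as bit flags. */
enum {
    SCOPE_TASK   = 1 << 0,  /* task to task     */
    SCOPE_CORE   = 1 << 1,  /* core to core     */
    SCOPE_DEVICE = 1 << 2   /* device to device */
};

typedef struct {
    const char *name;   /* transport name as listed in the table */
    unsigned scopes;    /* OR of the SCOPE_* flags marked "x"     */
} transport_info;

/* One entry per table row, in the table's order. */
static const transport_info transports[] = {
    { "Shared Memory",  SCOPE_TASK | SCOPE_CORE },
    { "Navigator/QMSS", SCOPE_TASK | SCOPE_CORE },
    { "SRIO",           SCOPE_TASK | SCOPE_CORE | SCOPE_DEVICE },
    { "PCIe",           SCOPE_TASK | SCOPE_CORE | SCOPE_DEVICE },
    { "HyperLink",      SCOPE_TASK | SCOPE_CORE | SCOPE_DEVICE },
};
#define NUM_TRANSPORTS (sizeof transports / sizeof transports[0])

/* Returns 1 if the named transport covers the scope, 0 if it does not,
 * and -1 if the name is not in the table. */
int transport_supports(const char *name, unsigned scope)
{
    assert(name != NULL);
    for (size_t i = 0; i < NUM_TRANSPORTS; i++) {
        if (strcmp(transports[i].name, name) == 0)
            return (transports[i].scopes & scope) == scope ? 1 : 0;
    }
    return -1;
}

/* Copies up to max names of transports covering the scope into out,
 * in table order, and returns how many were written. */
size_t transports_for_scope(unsigned scope, const char **out, size_t max)
{
    size_t n = 0;
    for (size_t i = 0; i < NUM_TRANSPORTS && n < max; i++) {
        if ((transports[i].scopes & scope) == scope)
            out[n++] = transports[i].name;
    }
    return n;
}
```

For example, `transports_for_scope(SCOPE_DEVICE, names, 8)` selects the three rows with an "x" in the Device to Device column (SRIO, PCIe, HyperLink), while `transport_supports("Shared Memory", SCOPE_DEVICE)` reports that shared memory does not reach across devices.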

Core 1
- Linux
- SysLink

Core N
- BIOS
- Process 1
- Process 2
Linux/BIOS MCSDK C66x Lite EVM Details

**DVD Contents**
- Factory default recovery
  - EEPROM: POST, IBL
  - NOR: BIOS MCSDK Demo
  - NAND: Linux MCSDK Demo
  - EEPROM/Flash writers
- CCS 5.0
  - IDE
  - C66x EVM GEL/XML files
- BIOS MCSDK 2.0
  - Source/binary packages
- Linux MCSDK 2.0
  - Source/binary packages

**EVM Flash Contents**
- EEPROM 128 KB
- POST
- IBL
- BIOS MCSDK "Out of Box" Demo
- Linux MCSDK Demo

**Online Collateral**
- TMS320C66x processor website:
  - http://focus.ti.com/docs/prod/folders/print/tms320c6670.html
  - http://focus.ti.com/docs/prod/folders/print/tms320c6678.html
- MCSDK website for updates:
  - http://focus.ti.com/docs/toolsw/folders/print/bioslinuxmcsdk.html
- CCS v5:
- Developer’s website:
  - Linux: http://linux-c6x.org/

**For More Information**

Download MCSDK software:
http://focus.ti.com/docs/toolsw/folders/print/bioslinuxmcsdk.html

Refer to the MCSDK User’s Guide:

For questions regarding topics covered in this training, visit the following e2e support forums:
http://e2e.ti.com/support/CHIP/6000_multi-core_DSPs/f/635.aspx