TMS320C62x/C67x Programmer's Guide
Literature Number: SPRU198B
February 1998
IMPORTANT NOTICE

Texas Instruments (TI) reserves the right to make changes to its products or to discontinue any semiconductor product or service without notice, and advises its customers to obtain the latest version of relevant information to verify, before placing orders, that the information being relied on is current.

TI warrants performance of its semiconductor products and related software to the specifications applicable at the time of sale in accordance with TI's standard warranty. Testing and other quality control techniques are utilized to the extent TI deems necessary to support this warranty. Specific testing of all parameters of each device is not necessarily performed, except those mandated by government requirements.

Certain applications using semiconductor products may involve potential risks of death, personal injury, or severe property or environmental damage (Critical Applications).

TI SEMICONDUCTOR PRODUCTS ARE NOT DESIGNED, INTENDED, AUTHORIZED, OR WARRANTED TO BE SUITABLE FOR USE IN LIFE-SUPPORT APPLICATIONS, DEVICES OR SYSTEMS OR OTHER CRITICAL APPLICATIONS.

Inclusion of TI products in such applications is understood to be fully at the risk of the customer. Use of TI products in such applications requires the written approval of an appropriate TI officer. Questions concerning potential risk applications should be directed to TI through a local SC sales office.

In order to minimize risks associated with the customer's applications, adequate design and operating safeguards should be provided by the customer to minimize inherent or procedural hazards.

TI assumes no liability for applications assistance, customer product design, software performance, or infringement of patents or services described herein. Nor does TI warrant or represent that any license, either express or implied, is granted under any patent right, copyright, mask work right, or other intellectual property right of TI covering or relating to any combination, machine, or process in which such semiconductor products or services might be or are used.
Preface
Part I: Introduction includes a brief description of the C6x architecture and code development flow. It also includes a tutorial that introduces you to the tools you will use in each phase of development and an optimization checklist to help you achieve optimal performance from your code.

Part II: C Code includes C code examples and discusses optimization methods for the code. This information can help you choose the most appropriate optimization techniques for your code.

Part III: Assembly Code describes the structure of assembly code. It provides examples and discusses optimizations for assembly code. It also includes a chapter on interrupt subroutines.

Part IV: Appendix provides extensive code examples from the GSM EFR vocoder.
TMS320C6x Assembly Language Tools User's Guide (literature number SPRU186) describes the assembly language tools (assembler, linker, and other tools used to develop assembly language code), assembler directives, macros, common object file format, and symbolic debugging directives for the C6x generation of devices.

TMS320C6x Optimizing C Compiler User's Guide (literature number SPRU187) describes the C6x C compiler and the assembly optimizer. This C compiler accepts ANSI standard C source code and produces assembly language source code for the C6x generation of devices. The assembly optimizer helps you optimize your assembly code.

TMS320C6x C Source Debugger User's Guide (literature number SPRU188) tells you how to invoke the C6x simulator and emulator versions of the C source debugger interface. This book discusses various aspects of the debugger, including command entry, code execution, data management, breakpoints, profiling, and analysis.

TMS320C62x/C67x CPU and Instruction Set Reference Guide (literature number SPRU189) describes the C62x/C67x CPU architecture, instruction set, pipeline, and interrupts for these digital signal processors.

TMS320 DSP Designer's Notebook: Volume 1 (literature number SPRT125) presents solutions to common design problems using C2x, C3x, C4x, C5x, and other TI DSPs.

TMS320C6201/C6701 Peripherals Reference Guide (literature number SPRU190) describes common peripherals available on the TMS320C6201/C6701 digital signal processors. This book includes information on the internal data and program memories, the external memory interface (EMIF), the host port, serial ports, direct memory access (DMA), clocking and phase-locked loop (PLL), and the power-down modes.

TMS320C6201 Digital Signal Processor Data Sheet (literature number SPRS051) describes the features of the TMS320C6201 and provides pinouts, electrical specifications, and timings for the device.
Trademarks
Solaris and SunOS are trademarks of Sun Microsystems, Inc. VelociTI is a trademark of Texas Instruments Incorporated. Windows and Windows NT are registered trademarks of Microsoft Corporation.
Email: dsph@ti.com
Fax: +49 81 61 80 40 10
Asia-Pacific
Literature Response Center: +852 2 956 7288, Fax: +852 2 956 2200
Hong Kong DSP Hotline: +852 2 956 7268, Fax: +852 2 956 1002
Korea DSP Hotline: +82 2 551 2804, Fax: +82 2 551 2828
Korea DSP Modem BBS: +82 2 551 2914
Singapore DSP Hotline: Fax: +65 390 7179
Taiwan DSP Hotline: +886 2 377 1450, Fax: +886 2 377 2718
Taiwan DSP Modem BBS: +886 2 376 2592
Taiwan DSP Internet BBS: via anonymous ftp to ftp://dsp.ee.tit.edu.tw/pub/TI/
Japan
Product Information Center: +0120-81-0026 (in Japan)
+03-3457-0972 or (INTL) 813-3457-0972
DSP Hotline: +03-3769-8735 or (INTL) 813-3769-8735
DSP BBS via Nifty-Serve: Type Go TIASP
Fax: +0120-81-0036 (in Japan)
Fax: +03-3457-1259 or (INTL) 813-3457-1259
Fax: +03-3457-7071 or (INTL) 813-3457-7071
Documentation
When making suggestions or reporting errors in documentation, please include the following information that is on the title page: the full title of the book, the publication date, and the literature number.

Mail: Texas Instruments Incorporated
Technical Documentation Services, MS 702
P.O. Box 1443
Houston, Texas 77251-1443

Email: dsph@ti.com
Note:
When calling a Literature Response Center to order documentation, please specify the literature number of the book.
Contents
1 Introduction
  Introduces some features of the C6x microprocessor and discusses the basic process for creating code.
  1.1 TMS320C6x Architecture
  1.2 TMS320C6x Pipeline
  1.3 Code Development Flow to Increase Performance

2 Code Development Flow Tutorial
  Uses example code to walk you through the code development flow for the TMS320C6x.
  2.1 Before You Begin
  2.2 Introduction to the Example Code
  2.3 Lesson 1: Compiling, Assembling, and Linking the Example Code
  2.4 Lesson 2: Profiling the Example Code
  2.5 Lesson 3: Phase 1 of the Code Development Flow
  2.6 Lesson 4: Phase 2 of the Code Development Flow
  2.7 Lesson 5: Phase 3 of the Code Development Flow
  2.8 Summary

3 TMS320C6x Optimization Checklist
  Provides a code development flow and checklist for optimizing loops.

4 Optimizing C Code
  Explains how to maximize C performance by using compiler options, intrinsics, and code transformations.
  4.1 Writing C Code
    4.1.1 Tips on Data Types
    4.1.2 Analyzing C Code Performance
  4.2 Compiling C Code
    4.2.1 Compiler Options
    4.2.2 Memory Dependencies
  4.3 Refining C Code
    4.3.1 Using Intrinsics
    4.3.2 Using Word Access for Short Data
    4.3.3 Software Pipelining

5 Structure of Assembly Code
  Describes the structure of the assembly code, including labels, conditions, instructions, functional units, operands, and comments.
  5.1 Labels
  5.2 Parallel Bars
  5.3 Conditions
  5.4 Instructions
  5.5 Functional Units
  5.6 Operands
  5.7 Comments

6 Optimizing Assembly Code via Linear Assembly
  Describes methods that help you develop more efficient assembly language programs.
  6.1 Assembly Code
  6.2 Writing Parallel Code
    6.2.1 Dot Product C Code
    6.2.2 Translating C Code to Linear Assembly
    6.2.3 Linear Assembly Resource Allocation
    6.2.4 Drawing a Dependency Graph
    6.2.5 Nonparallel Versus Parallel Assembly Code
    6.2.6 Comparing Performance
  6.3 Using Word Access for Short Data and Doubleword Access for Floating-Point Data
    6.3.1 Unrolled Dot Product C Code
    6.3.2 Translating C Code to Linear Assembly
    6.3.3 Drawing a Dependency Graph
    6.3.4 Linear Assembly Resource Allocation
    6.3.5 Final Assembly
    6.3.6 Comparing Performance
  6.4 Software Pipelining
    6.4.1 Modulo Iteration Interval Scheduling
    6.4.2 Using the Assembly Optimizer to Create Optimized Loops
    6.4.3 Final Assembly
    6.4.4 Comparing Performance
  6.5 Modulo Scheduling of Multicycle Loops
    6.5.1 Weighted Vector Sum C Code
    6.5.2 Translating C Code to Linear Assembly
    6.5.3 Determining the Minimum Iteration Interval
    6.5.4 Drawing a Dependency Graph
    6.5.5 Linear Assembly Resource Allocation
    6.5.6 Modulo Iteration Interval Scheduling
    6.5.7 Using the Assembly Optimizer for the Weighted Vector Sum
    6.5.8 Final Assembly
  6.6 Loop Carry Paths
    6.6.1 IIR Filter C Code
    6.6.2 Translating C Code to Linear Assembly (Inner Loop)
    6.6.3 Drawing a Dependency Graph
    6.6.4 Determining the Minimum Iteration Interval
    6.6.5 Linear Assembly Resource Allocation
    6.6.6 Modulo Iteration Interval Scheduling
    6.6.7 Using the Assembly Optimizer for the IIR Filter
    6.6.8 Final Assembly
  6.7 If-Then-Else Statements in a Loop
    6.7.1 If-Then-Else C Code
    6.7.2 Translating C Code to Linear Assembly
    6.7.3 Drawing a Dependency Graph
    6.7.4 Determining the Minimum Iteration Interval
    6.7.5 Linear Assembly Resource Allocation
    6.7.6 Final Assembly
    6.7.7 Comparing Performance
  6.8 Loop Unrolling
    6.8.1 Unrolled If-Then-Else C Code
    6.8.2 Translating C Code to Linear Assembly
    6.8.3 Drawing a Dependency Graph
    6.8.4 Determining the Minimum Iteration Interval
    6.8.5 Linear Assembly Resource Allocation
    6.8.6 Final Assembly
    6.8.7 Comparing Performance
  6.9 Live-Too-Long Issues
    6.9.1 C Code With Live-Too-Long Problem
    6.9.2 Translating C Code to Linear Assembly
    6.9.3 Drawing a Dependency Graph
    6.9.4 Determining the Minimum Iteration Interval
    6.9.5 Linear Assembly Resource Allocation
    6.9.6 Final Assembly With Move Instructions
  6.10 Redundant Load Elimination
    6.10.1 FIR Filter C Code
    6.10.2 Translating C Code to Linear Assembly
    6.10.3 Drawing a Dependency Graph
    6.10.4 Determining the Minimum Iteration Interval
    6.10.5 Linear Assembly Resource Allocation
    6.10.6 Final Assembly
  6.11 Memory Banks
    6.11.1 FIR Filter Inner Loop
    6.11.2 Unrolled FIR Filter C Code
    6.11.3 Translating C Code to Linear Assembly
    6.11.4 Drawing a Dependency Graph
    6.11.5 Linear Assembly for Unrolled FIR Inner Loop With .mptr Directive
    6.11.6 Linear Assembly Resource Allocation
    6.11.7 Determining the Minimum Iteration Interval
    6.11.8 Final Assembly
    6.11.9 Comparing Performance
  6.12 Software Pipelining the Outer Loop
    6.12.1 Unrolled FIR Filter C Code
    6.12.2 Making the Outer Loop Parallel With the Inner Loop Epilog and Prolog
    6.12.3 Final Assembly
    6.12.4 Comparing Performance
  6.13 Outer Loop Conditionally Executed With Inner Loop
    6.13.1 Unrolled FIR Filter C Code
    6.13.2 Translating C Code to Linear Assembly (Inner Loop)
    6.13.3 Translating C Code to Linear Assembly (Outer Loop)
    6.13.4 Unrolled FIR Filter C Code
    6.13.5 Translating C Code to Linear Assembly (Inner Loop)
    6.13.6 Translating C Code to Linear Assembly (Inner Loop and Outer Loop)
    6.13.7 Determining the Minimum Iteration Interval
    6.13.8 Final Assembly
    6.13.9 Comparing Performance

7 Interrupts
  Describes interrupts from a software programming point of view.
  7.1 Overview of Interrupts
  7.2 Single Assignment vs. Multiple Assignment
  7.3 Interruptible Loops
  7.4 Interruptible Code Generation
    7.4.1 Level 0 - Specified Code is Guaranteed to Not Be Interrupted
    7.4.2 Level 1 - Specified Code Interruptible at All Times
    7.4.3 Level 2 - Specified Code Interruptible Within Threshold Cycles
  7.5 Interrupt Subroutines
    7.5.1 ISR with the C Compiler
    7.5.2 ISR with Hand-Coded Assembly
    7.5.3 Nested Interrupts

A Applications Programming
  Provides extensive code examples from the GSM EFR vocoder.
  A.1 Summary of Major Programming Methods
  A.2 Implementation of the GSM EFR Vocoder
    A.2.1 Implementation of the Multiply-Accumulate Loop
    A.2.2 Implementation of the Windowing and Scaling Part of autocorr.c
    A.2.3 Implementation of cor_h
    A.2.4 Implementation of the rrv Computation in search_10i40
    A.2.5 Implementation of the Index Search in search_10i40
    A.2.6 Implementation of the FIR Filter, residu.c, in GSM EFR Vocoder
    A.2.7 Implementation of the Lag Search in the lag_max() Routine
Figures
4-1 Dependency Graph for Vector Sum #1
4-2 Dependency Graph for Vector Sum #2
4-3 Software-Pipelined Loop
5-1 Labels in Assembly Code
5-2 Parallel Bars in Assembly Code
5-3 Conditions in Assembly Code
5-4 Instructions in Assembly Code
5-5 TMS320C6x Functional Units
5-6 Units in the Assembly Code
5-7 Operands in the Assembly Code
5-8 Operands in Instructions
5-9 Comments in Assembly Code
6-1 Dependency Graph of Fixed-Point Dot Product
6-2 Dependency Graph of Floating-Point Dot Product
6-3 Dependency Graph of Fixed-Point Dot Product with Parallel Assembly
6-4 Dependency Graph of Floating-Point Dot Product with Parallel Assembly
6-5 Dependency Graph of Fixed-Point Dot Product With LDW
6-6 Dependency Graph of Floating-Point Dot Product With LDDW
6-7 Dependency Graph of Fixed-Point Dot Product With LDW (Showing Functional Units)
6-8 Dependency Graph of Floating-Point Dot Product With LDDW (Showing Functional Units)
6-9 Dependency Graph of Fixed-Point Dot Product With LDW (Showing Functional Units)
6-10 Dependency Graph of Floating-Point Dot Product With LDDW (Showing Functional Units)
6-11 Dependency Graph of Weighted Vector Sum
6-12 Dependency Graph of Weighted Vector Sum (Showing Resource Conflict)
6-13 Dependency Graph of Weighted Vector Sum (With Resource Conflict Resolved)
6-14 Dependency Graph of Weighted Vector Sum (Scheduling ci+1)
6-15 Dependency Graph of IIR Filter
6-16 Dependency Graph of IIR Filter (With Smaller Loop Carry)
6-17 Dependency Graph of If-Then-Else Code
6-18 Dependency Graph of If-Then-Else Code (Unrolled)
6-19 Dependency Graph of Live-Too-Long Code
6-20 Dependency Graph of Live-Too-Long Code (Split-Join Path Resolved)
6-21 Dependency Graph of FIR Filter (With Redundant Load Elimination)
6-22 4-Bank Interleaved Memory
6-23 4-Bank Interleaved Memory With Two Memory Blocks
6-24 Dependency Graph of FIR Filter (With Even and Odd Elements of Each Array on Same Loop Cycle)
6-25 Dependency Graph of FIR Filter (With No Memory Hits)
A-1 Flow Diagram for the Windowing and Scaling Part of autocorr.c
A-2 Flow Diagram for autocorr.c With Loop Unrolling
A-3 Flow Diagram for autocorr.c With Rearranged C Code
Tables
2-1 Using the C_OPTIONS Environment Variable
2-2 Cycle Counts
2-3 Revised Cycle Counts for vec_mpy()
2-4 Revised Cycle Counts for iir()
2-5 Revised Cycle Counts
2-6 Revised Cycle Counts for iir()
2-7 Revised Cycle Counts
3-1 Code Development Steps
3-2 TMS320C6x Optimization Checklist
4-1 Subset of Compiler Options
4-2 TMS320C6x C Compiler Intrinsics
5-1 Selected TMS320C6x Directives
5-2 Selected TMS320C6x Instruction Mnemonics
5-3 Functional Units and Descriptions
6-1 Comparison of Nonparallel and Parallel Assembly Code for Fixed-Point Dot Product
6-2 Comparison of Nonparallel and Parallel Assembly Code for Floating-Point Dot Product
6-3 Comparison of Fixed-Point Dot Product Code With Use of LDW
6-4 Comparison of Floating-Point Dot Product Code With Use of LDDW
6-5 Modulo Iteration Interval Scheduling Table for Fixed-Point Dot Product (Before Software Pipelining)
6-6 Modulo Iteration Interval Scheduling Table for Floating-Point Dot Product (Before Software Pipelining)
6-7 Modulo Iteration Interval Table for Fixed-Point Dot Product (After Software Pipelining)
6-8 Modulo Iteration Interval Table for Floating-Point Dot Product (After Software Pipelining)
6-9 Software Pipeline Accumulation Staggered Results Due to Three-Cycle Delay
6-10 Comparison of Fixed-Point Dot Product Code Examples
6-11 Comparison of Floating-Point Dot Product Code Examples
6-12 Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop)
6-13 Modulo Iteration Interval Table for Weighted Vector Sum With SHR Instructions
6-14 Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop)
6-15 Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop)
6-16 Resource Table for IIR Filter
6-17 Modulo Iteration Interval Table for IIR (4-Cycle Loop)
6-18 Resource Table for If-Then-Else Code
Tables
619 620 621 622 623 624 625 626 627 628
Comparison of If-Then-Else Code Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 89 Resource Table for Unrolled If-Then-Else Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 93 Comparison of If-Then-Else Code Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 96 Resource Table for Live-Too-Long Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 100 Resource Table for FIR Filter Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 110 Resource Table for FIR Filter Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 124 Comparison of FIR Filter Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 124 Comparison of FIR Filter Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 131 Resource Table for FIR Filter Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 142 Comparison of FIR Filter Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 145
Examples

2-1  The Code Example demo1.c . . . . . 2-3
2-2  The Multiply Accumulate Function: mac1.c . . . . . 2-3
2-3  The Vector Multiply Function: vec_mpy1.c . . . . . 2-4
2-4  The Biquad Filter: iir1.c . . . . . 2-4
2-5  Including the clock( ) Function in demo1.c (count.c) . . . . . 2-12
2-6  Inner Loop Kernel of mac1.asm . . . . . 2-14
2-7  Inner Loop Kernel of vec_mpy1.asm . . . . . 2-15
2-8  Inner Loop Kernel of iir1.asm . . . . . 2-16
2-9  The Vector Multiply Function: vec_mpy1.c . . . . . 2-17
2-10  Inner Loop Kernel of vec_mpy1.asm . . . . . 2-17
2-11  The Revised Vector Multiply Function: vec_mpy2.c . . . . . 2-18
2-12  The Biquad Filter: iir1.c . . . . . 2-19
2-13  The Revised Biquad Filter: iir2.c . . . . . 2-20
2-14  The Revised Example: demo2.c . . . . . 2-21
2-15  Inner Loop Kernel of vec_mpy2.asm . . . . . 2-22
2-16  Inner Loop Kernel of iir2.asm . . . . . 2-23
2-17  The Revised Biquad Filter: iir2.c . . . . . 2-26
2-18  The Biquad Filter, Revised and Assembly-Optimized: iir3.sa . . . . . 2-27
2-19  The Revised Example demo3.c . . . . . 2-28
2-20  Inner Loop Kernel of iir3.asm . . . . . 2-29
3-1  Compiler and/or Assembly Optimizer Feedback . . . . . 3-3
4-1  Basic Vector Sum . . . . . 4-5
4-2  Vector Sum With const Keywords . . . . . 4-7
4-3  Compiler Output for Vector Sum Code . . . . . 4-8
4-4  Saturated Add Without Intrinsics . . . . . 4-9
4-5  Saturated Add With Intrinsics . . . . . 4-10
4-6  Vector Sum With const Keywords, _nassert, Word Reads . . . . . 4-14
4-7  Vector Sum With const Keywords, _nassert, Word Reads (Generic Version) . . . . . 4-15
4-8  Dot Product Using Intrinsics . . . . . 4-16
4-9  FIR Filter Original Form . . . . . 4-16
4-10  FIR Filter Optimized Form . . . . . 4-17
4-11  Basic Float Dot Product . . . . . 4-18
4-12  Float Dot Product Using Intrinsics . . . . . 4-18
4-13  Float Dot Product With Peak Performance . . . . . 4-19
4-14  Trip Counters . . . . . 4-21
4-15  Vector Sum With const Keywords and _nassert . . . . . 4-22
4-16  Vector Sum With Three Memory Operations . . . . . 4-23
4-17  Word-Aligned Vector Sum . . . . . 4-23
4-18  Vector Sum Using const Keywords, _nassert, Word Reads, and Loop Unrolling . . . . . 4-24
4-19  FIR_Type2: Original Form . . . . . 4-25
4-20  FIR_Type2: Inner Loop Completely Unrolled . . . . . 4-26
6-1  Fixed-Point Dot Product C Code . . . . . 6-4
6-2  Floating-Point Dot Product C Code . . . . . 6-4
6-3  List of Assembly Instructions for Fixed-Point Dot Product . . . . . 6-5
6-4  List of Assembly Instructions for Floating-Point Dot Product . . . . . 6-5
6-5  Nonparallel Assembly Code for Fixed-Point Dot Product . . . . . 6-10
6-6  Parallel Assembly Code for Fixed-Point Dot Product . . . . . 6-11
6-7  Nonparallel Assembly Code for Floating-Point Dot Product . . . . . 6-12
6-8  Parallel Assembly Code for Floating-Point Dot Product . . . . . 6-13
6-9  Fixed-Point Dot Product C Code (Unrolled) . . . . . 6-15
6-10  Floating-Point Dot Product C Code (Unrolled) . . . . . 6-16
6-11  Linear Assembly for Fixed-Point Dot Product Inner Loop With LDW . . . . . 6-16
6-12  Linear Assembly for Floating-Point Dot Product Inner Loop With LDDW . . . . . 6-17
6-13  Linear Assembly for Fixed-Point Dot Product Inner Loop With LDW (With Allocated Resources) . . . . . 6-20
6-14  Linear Assembly for Floating-Point Dot Product Inner Loop With LDDW (With Allocated Resources) . . . . . 6-21
6-15  Assembly Code for Fixed-Point Dot Product With LDW (Before Software Pipelining) . . . . . 6-22
6-16  Assembly Code for Floating-Point Dot Product With LDDW (Before Software Pipelining) . . . . . 6-23
6-17  Linear Assembly for Fixed-Point Dot Product Inner Loop (With Conditional SUB Instruction) . . . . . 6-26
6-18  Linear Assembly for Floating-Point Dot Product Inner Loop (With Conditional SUB Instruction) . . . . . 6-27
6-19  Pseudo-Code for Single-Cycle Accumulator With ADDSP . . . . . 6-33
6-20  Linear Assembly for Full Fixed-Point Dot Product . . . . . 6-35
6-21  Linear Assembly for Full Floating-Point Dot Product . . . . . 6-36
6-22  Assembly Code for Fixed-Point Dot Product (Software Pipelined) . . . . . 6-38
6-23  Assembly Code for Floating-Point Dot Product (Software Pipelined) . . . . . 6-39
6-24  Assembly Code for Fixed-Point Dot Product (Software Pipelined With No Extraneous Loads) . . . . . 6-42
6-25  Assembly Code for Floating-Point Dot Product (Software Pipelined With No Extraneous Loads) . . . . . 6-44
6-26  Assembly Code for Fixed-Point Dot Product (Software Pipelined With Removal of Prolog and Epilog) . . . . . 6-48
6-27  Assembly Code for Floating-Point Dot Product (Software Pipelined With Removal of Prolog and Epilog) . . . . . 6-49
6-28  Assembly Code for Fixed-Point Dot Product (Software Pipelined With Smallest Code Size) . . . . . 6-51
6-29  Assembly Code for Floating-Point Dot Product (Software Pipelined With Smallest Code Size) . . . . . 6-52
6-30  Weighted Vector Sum C Code . . . . . 6-54
6-31  Linear Assembly for Weighted Vector Sum Inner Loop . . . . . 6-54
6-32  Weighted Vector Sum C Code (Unrolled) . . . . . 6-55
6-33  Linear Assembly for Weighted Vector Sum Using LDW . . . . . 6-56
6-34  Linear Assembly for Weighted Vector Sum With Resources Allocated . . . . . 6-58
6-35  Linear Assembly for Weighted Vector Sum . . . . . 6-69
6-36  Assembly Code for Weighted Vector Sum . . . . . 6-71
6-37  IIR Filter C Code . . . . . 6-73
6-38  Linear Assembly for IIR Inner Loop . . . . . 6-74
6-39  Linear Assembly for IIR Inner Loop With Reduced Loop Carry Path . . . . . 6-78
6-40  Linear Assembly for IIR Inner Loop (With Allocated Resources) . . . . . 6-78
6-41  Linear Assembly for IIR Filter . . . . . 6-80
6-42  Assembly Code for IIR Filter . . . . . 6-81
6-43  If-Then-Else C Code . . . . . 6-82
6-44  Linear Assembly for If-Then-Else Inner Loop . . . . . 6-83
6-45  Linear Assembly for Full If-Then-Else Code . . . . . 6-86
6-46  Assembly Code for If-Then-Else . . . . . 6-87
6-47  Assembly Code for If-Then-Else With Loop Count Greater Than 3 . . . . . 6-88
6-48  If-Then-Else C Code (Unrolled) . . . . . 6-90
6-49  Linear Assembly for Unrolled If-Then-Else Inner Loop . . . . . 6-91
6-50  Linear Assembly for Full Unrolled If-Then-Else Code . . . . . 6-94
6-51  Assembly Code for Unrolled If-Then-Else . . . . . 6-95
6-52  Live-Too-Long C Code . . . . . 6-97
6-53  Linear Assembly for Live-Too-Long Inner Loop . . . . . 6-98
6-54  Linear Assembly for Full Live-Too-Long Code . . . . . 6-103
6-55  Assembly Code for Live-Too-Long With Move Instructions . . . . . 6-104
6-56  FIR Filter C Code . . . . . 6-106
6-57  FIR Filter C Code With Redundant Load Elimination . . . . . 6-107
6-58  Linear Assembly for FIR Inner Loop . . . . . 6-108
6-59  Linear Assembly for Full FIR Code . . . . . 6-110
6-60  Final Assembly Code for FIR Filter With Redundant Load Elimination . . . . . 6-112
6-61  Final Assembly Code for Inner Loop of FIR Filter . . . . . 6-116
6-62  FIR Filter C Code (Unrolled) . . . . . 6-118
6-63  Linear Assembly for Unrolled FIR Inner Loop . . . . . 6-119
6-64  Linear Assembly for Full Unrolled FIR Filter . . . . . 6-121
6-65  Final Assembly Code for FIR Filter With Redundant Load Elimination and No Memory Hits . . . . . 6-125
6-66  Unrolled FIR Filter C Code . . . . . 6-127
6-67  Final Assembly Code for FIR Filter With Redundant Load Elimination and No Memory Hits With Outer Loop Software-Pipelined . . . . . 6-129
6-68  Unrolled FIR Filter C Code . . . . . 6-132
6-69  Linear Assembly for Unrolled FIR Inner Loop . . . . . 6-133
6-70  Linear Assembly for FIR Outer Loop . . . . . 6-134
6-71  Unrolled FIR Filter C Code . . . . . 6-135
6-72  Linear Assembly for FIR With Outer Loop Conditionally Executed With Inner Loop . . . . . 6-137
6-73  Linear Assembly for FIR With Outer Loop Conditionally Executed With Inner Loop (With Functional Units) . . . . . 6-139
6-74  Final Assembly Code for FIR Filter . . . . . 6-143
7-1  Code With Multiple Assignment of A1 . . . . . 7-3
7-2  Code Using Single Assignment . . . . . 7-4
7-3  Hand-Coded Assembly ISR . . . . . 7-9
7-4  Hand-Coded Assembly ISR Allowing Nesting of Interrupts . . . . . 7-10
A-1  C Code for the Typical MAC Loop . . . . . A-4
A-2  Linear Assembly for the MAC Loop . . . . . A-4
A-3  C Code for MAC Loop With Loop Unrolling . . . . . A-5
A-4  C Code for Energy Computation MAC Loop . . . . . A-5
A-5  Linear Assembly for Energy Computation MAC Loop . . . . . A-5
A-6  Assembly Code for the Energy Computation MAC Loop . . . . . A-6
A-7  C Code for the Windowing and Scaling Part of autocorr.c . . . . . A-8
A-8  Linear Assembly for One Iteration of autocorr.c (Loop 1) . . . . . A-9
A-9  Linear Assembly for Loop 1 of autocorr.c (Using LDW) . . . . . A-10
A-10  Linear Assembly for Loop 2 of autocorr.c (No Loop Unrolling) . . . . . A-10
A-11  Linear Assembly for Loop 2 of autocorr.c (With Loop Unrolling) . . . . . A-11
A-12  Linear Assembly for Loop 3 of autocorr.c . . . . . A-11
A-13  Linear Assembly for Loop I of autocorr.c (Modified) . . . . . A-14
A-14  Linear Assembly for Loop II of autocorr.c (Modified) . . . . . A-15
A-15  Implemented C Code for autocorr.c . . . . . A-16
A-16  Assembly Code for Windowing and Scaling Part of autocorr.c . . . . . A-17
A-17  C Code for cor_h . . . . . A-20
A-18  Linear Assembly for cor_h (One Inner Loop Iteration) . . . . . A-21
A-19  C Code for cor_h (With Inner Loop Unrolling) . . . . . A-22
A-20  Linear Assembly for cor_h (With Inner Loop Unrolling) . . . . . A-23
A-21  Assembly Code for cor_h With Reduced Code Size . . . . . A-24
A-22  C Code for the rrv Computation in search_10i40 . . . . . A-27
A-23  Linear Assembly for the rrv Computation in Search_10i40 (One Loop Iteration) . . . . . A-28
A-24  C Code for the rrv Computation in search_10i40 (Unrolled Loop) . . . . . A-30
A-25  Linear Assembly for rrv Computation in search_10i40 (One Loop Iteration) . . . . . A-31
A-26  Assembly Code for the rrv Computation in search_10i40 . . . . . A-33
A-27  C Code for the Index Search for search_10i40 . . . . . A-38
A-28  Linear Assembly for the Index Search for search_10i40 (Inner Loop) . . . . . A-41
A-29  Modified C Code for the Index Search . . . . . A-42
A-30  Assembly Code for the search_10i40 Index Search . . . . . A-44
A-31  C Code for residu.c . . . . . A-51
A-32  C Code for residu.c After Rearrangement Using Intrinsics . . . . . A-52
A-33  Implemented C Code for residu.c . . . . . A-53
A-34  Assembly Code for residu.c . . . . . A-54
A-35  C Code for the Lag Search in lag_max( ) . . . . . A-57
A-36  C Code for the Lag Search in lag_max( ) (Comparison Order Changed) . . . . . A-58
A-37  C Code for the Lag Search in lag_max( ) With Outer Loop Unrolling . . . . . A-59
A-38  Linear Assembly for the Lag Search in lag_max( ) Inner Loop . . . . . A-59
A-39  C Code for the Lag Search in lag_max( ) With Inner and Outer Loops Unrolled . . . . . A-60
A-40  Linear Assembly for the Lag Search in lag_max( ) Inner Loop . . . . . A-61
A-41  Assembly Code for the Lag Search in lag_max( ) . . . . . A-62
Part I: Introduction
Part II: C Code
Part III: Assembly Code
Part IV: Appendix
Chapter 1
Introduction
This chapter introduces some features of the C6x microprocessor and discusses the basic process for creating code. Any reference to C6x pertains to both the C62x (fixed-point) and the C67x (floating-point) devices. All techniques are applicable to both devices, even though most of the examples shown are fixed-point specific.
Topic  Page

1.1  TMS320C6x Architecture . . . . . 1-2
1.2  TMS320C6x Pipeline . . . . . 1-2
1.3  Code Development Flow to Increase Performance . . . . . 1-3
The C67x is a floating-point DSP with the same features. It is the second DSP to use the VelociTI architecture.
The C6x DSPs are based on the C6x CPU, which consists of:
Program fetch unit
Instruction dispatch unit
Instruction decode unit
Two data paths, each with four functional units
Thirty-two 32-bit registers
Control registers
Control logic
Test, emulation, and interrupt logic
Increased pipelining eliminates traditional architectural bottlenecks in program fetch, data access, and multiply operations. Pipeline control is simplified by eliminating pipeline locks. The pipeline can dispatch eight parallel instructions every cycle. Parallel instructions proceed simultaneously through the same pipeline phases.
[Figure: code development flow chart. Recoverable labels: Phase 3: Write Linear Assembly; steps "Write linear assembly", "Assembly optimize", and "Profile"; decisions "Efficient?" and "More C optimization?"; end state "Complete".]
The following lists the phases in the 3-step software development flow shown on page 1-3, and the goal for each phase:
Phase  Goal

1  You can develop your C code for phase 1 without any knowledge of the C6x. Use the C6x profiling tools that are described in the TMS320C6x C Source Debugger User's Guide to identify any inefficient areas that you might have in your C code. To improve the performance of your code, proceed to phase 2.

2  Use the intrinsics, shell options, and techniques that are described in Chapter 4 of this book to improve your C code. Use the C6x profiling tools to check its performance. If your code is still not as efficient as you would like it to be, proceed to phase 3.

3  Extract the time-critical areas from your C code and rewrite the code in linear assembly. You can use the assembly optimizer to optimize this code.
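As a rough illustration of checking performance from within C, the following sketch times a function with the standard clock( ) routine. It is a minimal sketch under stated assumptions, not the tutorial's count.c: the names ticks_for, sample_kernel, and sample_sum are hypothetical, and what one clock( ) tick represents depends on the runtime-support library and the target.

```c
#include <time.h>

/* Hypothetical workload: sums 0..999 into a global so that the
   loop is not optimized away. */
volatile long sample_sum;

void sample_kernel(void)
{
    long s = 0;
    int i;
    for (i = 0; i < 1000; i++)
        s += i;
    sample_sum = s;
}

/* Hypothetical helper: returns the clock() ticks consumed by one
   call to fn, minus the measured cost of calling clock() itself. */
clock_t ticks_for(void (*fn)(void))
{
    clock_t start, stop, overhead, elapsed;

    /* Two back-to-back calls estimate the overhead of clock(). */
    start = clock();
    stop = clock();
    overhead = stop - start;

    start = clock();
    fn();
    stop = clock();
    elapsed = stop - start;

    /* Clamp at zero in case the workload is shorter than the overhead. */
    return (elapsed > overhead) ? (elapsed - overhead) : 0;
}
```

Calling ticks_for(sample_kernel) before and after an optimization step gives a simple way to compare the two versions; the profiling tools named above remain the documented method.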
Chapter 2
This chapter walks you through the code development flow that was introduced in Chapter 1. It uses step-by-step instructions and code examples to show you how to use the software development tools in each phase of development. Before you start this tutorial, you should install the code generation tools and the C source debugger. If you do not have a Texas Instruments C source debugger, use your own debugger to check your results. The sample code that is used in this tutorial is included on the code generation tools CD-ROM. When you install your code generation tools, the example code is installed in the c6xtools directory. Use the code in that directory to go through the examples in this chapter. The examples in this chapter were run on the most recent version of the software development tools that were available as of the publication of this book. Because the tools are being continuously improved, you may get different results if you are using a more recent version of the tools.
Topic  Page

2.1  Before You Begin . . . . . 2-2
2.2  Introduction to the Example Code . . . . . 2-3
2.3  Lesson 1: Compiling, Assembling, and Linking the Example Code . . . . . 2-5
2.4  Lesson 2: Profiling the Example Code . . . . . 2-8
2.5  Lesson 3: Phase 1 of the Code Development Flow . . . . . 2-14
2.6  Lesson 4: Phase 2 of the Code Development Flow . . . . . 2-17
2.7  Lesson 5: Phase 3 of the Code Development Flow . . . . . 2-25
2.8  Summary . . . . . 2-31
Important information
In addition to primary actions, important information ensures that the tutorial works correctly. Important information is marked like this:

Important! If you are using SunOS, be sure you reinitialize your shell before continuing with this tutorial.

Optional tasks
Optional tasks allow you to learn more about the C6x tools; however, you do not need to perform the optional tasks to complete the tutorial successfully. Optional tasks are marked like this:

Try This: The stand-alone simulator (load6x) is another tool that you can use to find out what the cycle count for each function is.

This tutorial is divided into lessons. Each lesson builds on the previous lesson. To get the most benefit from the tutorial, you should start at the beginning and work your way through each lesson in order to the end.
The mac1( ) function, a multiply-accumulate and squaring-accumulate example, is shown in Example 2-2. It computes the dot product of vector a with vector b and also sums the squares of the elements of vector b.
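The two accumulations described above can be sketched as follows. This is an illustrative sketch only: the function name mac_sketch, its parameter list, and the use of an output pointer for the sum of squares are assumptions, and the tutorial's mac1.c may be written differently.

```c
/* Hypothetical signature; not taken from mac1.c.
   Returns the dot product of a and b over n elements and stores
   the sum of the squares of b through sum_sq. */
int mac_sketch(const short *a, const short *b, int n, int *sum_sq)
{
    int dot = 0;   /* running sum of a[i] * b[i] */
    int sq = 0;    /* running sum of b[i] * b[i] */
    int i;

    for (i = 0; i < n; i++) {
        dot += a[i] * b[i];
        sq += b[i] * b[i];
    }

    *sum_sq = sq;
    return dot;
}
```

Both accumulations share one loop, so each element of b is loaded once and used in two multiplies.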
The vec_mpy( ) function shown in Example 2-3 is a vector multiply: each element is multiplied by a scalar, and the product is then shifted right. The result is stored to a second vector.
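A minimal sketch of this kind of vector multiply is shown here. The name vec_mpy_sketch, the parameter list, and the shift amount of 15 (which treats the scalar as a Q15 fraction) are assumptions made for illustration; they are not taken from vec_mpy1.c.

```c
/* Hypothetical signature; not taken from vec_mpy1.c.
   Multiplies each element of x by scaler, shifts the 32-bit
   product right by 15 bits (an assumed Q15 scaling), and stores
   the result in y. */
void vec_mpy_sketch(short *y, const short *x, short scaler, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        /* The operands are promoted to int before the multiply.
           Right-shifting a negative product is
           implementation-defined in standard C. */
        y[i] = (short)((scaler * x[i]) >> 15);
    }
}
```

With scaler set to 16384 (0.5 in the assumed Q15 format), each output element is half of the corresponding input, rounded toward negative infinity on compilers that use an arithmetic shift.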
The third function, iir1( ), is a typical infinite impulse response (IIR) biquad filter. The code for this function is shown in Example 2-4.
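For readers unfamiliar with the term, a biquad computes each output from the current input, the two previous inputs, and the two previous outputs. The sketch below shows one common arrangement (direct form I) in floating point. It is not the tutorial's iir1.c, and the type biquad_t, the function biquad_step, and the coefficient names are all hypothetical.

```c
/* Illustrative biquad section, direct form I. All names are
   assumptions for this sketch, not taken from iir1.c. */
typedef struct {
    float b0, b1, b2;   /* feedforward coefficients */
    float a1, a2;       /* feedback coefficients (a0 assumed to be 1) */
    float x1, x2;       /* previous two inputs */
    float y1, y2;       /* previous two outputs */
} biquad_t;

/* Processes one input sample and returns one output sample:
   y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2] */
float biquad_step(biquad_t *f, float x)
{
    float y = f->b0 * x + f->b1 * f->x1 + f->b2 * f->x2
            - f->a1 * f->y1 - f->a2 * f->y2;

    /* Shift the delay lines for the next sample. */
    f->x2 = f->x1;
    f->x1 = x;
    f->y2 = f->y1;
    f->y1 = y;

    return y;
}
```

The dependence of each output on the previous outputs is what makes the filter recursive; a fixed-point version would add scaling shifts after the multiplies.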
Compiling for the C67x: On a command line, enter the following on a single line:
cl6x -g -o -k -mg -mv6700 demo1.c mac1.c vec_mpy1.c iir1.c -z lnk.cmd -l rts6701.lib -o demo1.out
You should not receive any errors, and the file demo1.out should be created. If you receive an error message, look up that error message in the appropriate user's guide. Here is a description of what you told the shell program (cl6x) to do:

cl6x  Run the compiler and the assembler.

-g  Generate symbolic debugging directives that are used by the debugger.

-o  Invoke the optimizer at the default level (-o is the same as -o2). Not all optimizations work well with debugging because the optimizer's rearrangement of code can make it difficult for you to correlate source code with object code. Using the -g option with the -o option allows for the maximum amount of optimization that is compatible with debugging.

-k  Keep the assembly output files. Notice that you now have the following .asm files in your current directory: demo1.asm, mac1.asm, vec_mpy1.asm, and iir1.asm. When the -k option is not used, the shell program deletes the assembly output files after assembly is complete.

-mg  Turn on the maximum amount of optimization that is compatible with profiling. The -mg option allows you to profile optimized code.

-mv6700  Invoke the compiler to target C67x devices. If this switch is not used, the compiler defaults to the C62x device. Code compiled without this switch still runs on a C67x device, but it runs slower if it uses floating-point operations, because it was compiled for the C62x device.

-z  Invoke the linker. The addition of this option to the cl6x command line means that the code is compiled, assembled, and linked in one step.

lnk.cmd  Use lnk.cmd as the linker command file. Linker command files allow you to put linking information into a file, which is useful when you invoke the linker often with the same information. Linker command files are also useful because they allow you to use the MEMORY directive, which defines the target memory configuration, and the SECTIONS directive, which controls how sections are built and allocated.

-l rts6201.lib  Include the runtime-support library for the C62x device, rts6201.lib, which is included on your CD-ROM. The runtime-support functions in rts6201.lib were compiled for little-endian mode. For big-endian mode, use the runtime-support functions in rts6201e.lib.

-l rts6701.lib  Include the runtime-support library for the C67x device, rts6701.lib, which is included on your CD-ROM. The runtime-support functions in rts6701.lib were compiled for little-endian mode. For big-endian mode, use the runtime-support functions in rts6701e.lib.

-o demo1.out  Name the output file demo1.out. (The default is a.out.) Because this option comes after the -z option, it is considered a linker option and is interpreted differently than the -o option that you entered before -z.
Try This: The options above are used throughout the rest of this tutorial. They are fairly common and might be ones that you want to use repeatedly. To avoid having to retype them each time you run the code development tools, you can use the C_OPTIONS environment variable. The shell program uses the default options and/or input filenames that you name with the C_OPTIONS environment variable every time you run the shell. Use the commands in Table 2-1 to set up the C_OPTIONS environment variable with the options used on page 2-5.
Notice that the -o demo1.out linker option was not included. If it were included, running the second tutorial example, demo2.c, would result in an output file named demo1.out instead of a more logical name such as demo2.out. Files must be named explicitly on the command line, not in the environment variable. To compile all of the C files in a directory, use the cl6x command with the appropriate options and use *.c where the files are normally indicated. For example:

cl6x -g -mg *.c -z lnk.cmd -l rts6201.lib -o demo1.out
Important! If you are using SunOS, be sure you reinitialize your shell before
To select the areas of demo1 that you want profiled, follow these steps:
1) From the Profile menu, select Select Areas. This displays the Profile Marking dialog box.
2) In the Level box, select C.
3) In the Area box, select Functions. This indicates that the C functions in demo1.out will be your profile areas.
4) Click Mark.
5) Click Close. The Profile window is updated to include a line for each C function in demo1.
To start the profiling session, follow these steps:
1) Click the run icon on the toolbar. This displays the Profile Run dialog box.
2) In the Run Method box, select Quick, no exclusive fields. This will show you the total execution time (cycle count) of a profile area, including the execution time of any subroutines called within the functions.
3) If main( ) is not already selected as your starting point, choose it from the list of starting points.
4) Click OK. The Run Method dialog box closes and the status bar reads Target: Profiling to indicate that the profiling session has started.
The program restarts and runs to main( ) without profiling. Profiling begins when main( ) is reached and continues until the exit point of main( ) is reached. When profiling is complete, the status bar reads Target: Halted and your Profile window looks like this:
The Inclusive column indicates the cycle counts for each function, including any function that it calls. Because these functions do not call any other functions, the inclusive cycle counts are the same as the exclusive cycle counts. Notice that the cycle count for the mac1( ) function is 167, and that the cycle counts for the vec_mpy1( ) and iir1( ) functions are much higher: 316 and 270, respectively. To interpret the cycle counts in the Profile window, you need to understand how they are calculated. The calculation is based on execute packets.
An execute packet is a group of parallel instructions. You can have up to eight instructions executing in parallel; therefore, each execute packet can contain up to eight instructions. An example of execute packets is shown in Example 2-7 on page 2-15. Table 2-2 shows how the cycle counts were calculated for each function.
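As a rough illustration of how a loop's cycle count relates to its execute packets, the sketch below assumes that each execute packet in the loop body takes one cycle per iteration and that a fixed number of cycles is spent outside the loop. The function name estimate_cycles and the choice of 50 iterations with 20 cycles of overhead are assumptions made for this sketch; they were picked because they agree with the 270-cycle and 220-cycle figures reported for the IIR filter in this tutorial, and they are not output from the profiler.

```c
/* Hypothetical helper (not part of the TI tools): estimate the cycle count
 * of a loop, assuming one cycle per execute packet per iteration and no
 * memory stalls. */
unsigned estimate_cycles(unsigned execute_packets,
                         unsigned iterations,
                         unsigned overhead)
{
    return execute_packets * iterations + overhead;
}
```

Under these assumptions, estimate_cycles(5, 50, 20) gives 270 and estimate_cycles(4, 50, 20) gives 220, which shows why removing one execute packet from a 50-iteration loop saves 50 cycles.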
Try This: The stand-alone simulator (load6x) is another tool that you can use to find out what the cycle count for each function is. To get cycle count information for each function with the stand-alone simulator, embed the clock( ) function in your C code. Example 2-5 shows how to rewrite demo1.c to include the clock( ) function.
Note: When using this method, remember to calculate the overhead and include the appropriate header files.
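The note above can be made concrete with a short sketch. The code below is not the tutorial's count.c; it is a minimal illustration, assuming only the standard clock( ) and printf( ) functions. The names sample_work, clock_overhead, and time_call are hypothetical. The overhead of calling clock( ) itself is measured first and subtracted from each measurement.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical workload standing in for mac1( ), vec_mpy1( ), or iir1( ). */
static volatile int sink;

static void sample_work(void)
{
    int i;
    for (i = 0; i < 1000; i++)
        sink += i;
}

/* Cost of two back-to-back clock( ) calls with nothing in between. */
static clock_t clock_overhead(void)
{
    clock_t start = clock();
    clock_t stop  = clock();
    return stop - start;
}

/* Time one call of fn( ), subtract the clock( ) overhead, and print it. */
static clock_t time_call(const char *name, void (*fn)(void))
{
    clock_t overhead = clock_overhead();
    clock_t start, stop, elapsed;

    start = clock();
    fn();
    stop = clock();

    elapsed = stop - start;
    elapsed = (elapsed > overhead) ? elapsed - overhead : 0;
    printf("%s: %ld\n", name, (long)elapsed);
    return elapsed;
}
```

A call such as time_call("sample_work", sample_work) prints one adjusted count. On a host computer the value is in clock( ) ticks rather than CPU cycles, so the numbers are only comparable to each other, not to the profiler's cycle counts.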
Now, compile, assemble, and link count.c. If you did not set up your C_OPTIONS environment variable as described on page 2-6, enter the following on a command line:

cl6x -g -o -k -mg count.c mac1.c vec_mpy1.c iir1.c -z lnk.cmd -l rts6201.lib -o count.out

OR

If you set up your C_OPTIONS environment variable as described on page 2-6, enter the following on a command line:

cl6x -z -o count.out

Although the -z option is already specified in the C_OPTIONS environment variable, you need to specify it on the command line to indicate that this occurrence of -o is a linker option. Use load6x to see the output of the printf statements that were embedded in the C code. On a command line, enter:
load6x count.out
Notice that these cycle counts are higher than the cycle counts that you saw with the profiler. For example, mac1 is listed here as having 175 cycles; however, it was listed in the Profile window as having 167 cycles. You will see some extra cycles when you use load6x because you still have overhead for each function call. When you use the profiler, the cycles needed for calling the functions are not included in the profile display. The Using the Stand-Alone Simulator chapter in the TMS320C6x Optimizing C Compiler User's Guide discusses load6x in more detail.
The @ characters specify the iteration of the loop that an instruction is on in the software pipeline; these symbols are automatically created by the code generation tools. The first iteration does not have an @ character; one @ character represents the second iteration; two @ characters represent the third iteration, and so on. Because the mac1( ) function does not need to be improved, it does not need to go beyond phase 1 of the code development flow.
Look at Example 2-7, which shows the assembly output of the innermost loop for the vec_mpy1( ) function. Recall from page 2-11 that the vec_mpy1( ) function took 316 cycles to execute. This code is not as parallel as the mac1( ) function. The assembly output for the vec_mpy1( ) function shows two execute packets. Each execute packet has four parallel instructions. This loop can be improved.
Example 2-8 shows the assembly output of the innermost loop for the iir1( ) function. Recall from page 2-11 that the iir1( ) function took 270 cycles to execute. As you can see, some execute packets have five parallel instructions, while others have as few as four parallel instructions, which indicates that the code can probably be improved.
To improve the vec_mpy( ) and iir( ) functions, start by seeing how you can refine and improve your C code. This is what is referred to as phase 2 of the code development flow, and this is what the next lesson is about.
Example 2-9 uses short data types. Because short data types are 16 bits, they translate into halfword instructions, such as LDH and STH (see Example 2-10). The loop in Example 2-10 uses two LDH instructions and an STH instruction to load x[i] and y[i] and store back to y[i]. Because only two memory operations can occur per cycle, the fastest that this loop can execute is one y[i] result every two cycles. The performance of this loop is limited by the number of D units.
Because x is an array, x[i] and x[i + 1] are next to each other in memory. This means that instead of using halfword instructions (LDH and STH) to load and store each element in the array, you can use word instructions (LDW and STW) to load and store two elements at a time, as long as the data is aligned on a word boundary. In other words, all word accesses should have the 2 LSBs of the address set to 0. Two elements at a time, x[i] and x[i + 1], fit into one 32-bit register.
Code Development Flow Tutorial
To achieve this in C, declare x[ ] as an integer instead of as a short data type. Also, you need to use some intrinsics. Now that you have determined that you can load x[i] and x[i + 1] into the same register, you need to figure out how to do it. You can do this by using the _mpy and _mpylh intrinsics. Intrinsics are like built-in C functions that correspond to C6x assembly language instructions. The _mpy intrinsic multiplies the 16 LSBs of one operand by the 16 LSBs of another and returns the result. The _mpylh intrinsic multiplies the 16 LSBs of the first operand by the 16 MSBs of the second and returns the result. You can then use the _add2 intrinsic to add the 16 MSBs of the first operand to the 16 MSBs of the second operand. At the same time, the _add2 intrinsic also adds the 16 LSBs of the first operand to the 16 LSBs of the second operand. The result of both additions is stored in a 32-bit operand.
MSBs + MSBs = MSBs
LSBs + LSBs = LSBs
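To make the behavior of these three intrinsics easier to check on a host computer, the sketch below emulates them in portable C. The names emu_mpy, emu_mpylh, emu_add2, and sext16 are hypothetical stand-ins written for this illustration; they are not the compiler's intrinsics. The sketch assumes signed 16-bit multiplies, as described above, and models packed operands as 32-bit unsigned values.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of v to a 32-bit signed value. */
static int32_t sext16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Stand-in for _mpy: 16 LSBs of a times 16 LSBs of b. */
static int32_t emu_mpy(uint32_t a, uint32_t b)
{
    return sext16(a) * sext16(b);
}

/* Stand-in for _mpylh: 16 LSBs of a times 16 MSBs of b. */
static int32_t emu_mpylh(uint32_t a, uint32_t b)
{
    return sext16(a) * sext16(b >> 16);
}

/* Stand-in for _add2: add the upper halves and the lower halves
 * separately; a carry out of the lower half does not reach the upper. */
static uint32_t emu_add2(uint32_t a, uint32_t b)
{
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;
    uint32_t lo = (a + b) & 0xFFFFu;
    return (hi << 16) | lo;
}
```

For example, with a = 0x00030002 and b = 0x00050004, emu_mpy(a, b) is 2 * 4 = 8 and emu_mpylh(a, b) is 2 * 5 = 10. Keeping the two halves independent in emu_add2 is what allows one 32-bit register to carry two 16-bit results.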
Example 2-11 shows how to rewrite the vec_mpy( ) function to include the _mpy and _mpylh intrinsics:
Now, look at the iir1( ) function. Example 2-12 shows the same code that you saw in Example 2-4.
You can improve the iir( ) function by using the same methods that you used to improve the vec_mpy( ) function. Example 2-13 shows how to rewrite the iir( ) function:
Using demo2.c, shown in Example 2-14, run the revised functions through the compiler, assembler, and linker.
If you did not set up your C_OPTIONS environment variable as described on page 2-6, enter the following on a command line:

cl6x -g -o -k -mg demo2.c mac1.c vec_mpy2.c iir2.c -z lnk.cmd -l rts6201.lib -o demo2.out

OR

If you set up your C_OPTIONS environment variable as described on page 2-6, enter the following on a command line:

cl6x -z -o demo2.out

Although the -z option is already specified in the C_OPTIONS environment variable, you need to specify it on the command line to indicate that this occurrence of -o is a linker option.
The inner loop of the vec_mpy2( ) function translates into the assembly output shown in Example 2-15.
As you can see, the code for the vec_mpy2( ) function is improved over the original vec_mpy( ) code. Two LDW instructions are loading four elements (x[i], x[i+1], y[i], and y[i+1]), and one STW instruction is storing two elements: y[i] and y[i+1]. With the revised code, two y[i] results are stored every two cycles. Recall that only one y[i] result was stored every two cycles in Example 2-10. Table 2-3 shows how the vec_mpy( ) function has improved as it moves from phase 1 to phase 2.
Now, look at the inner loop of the third function, iir( ). Example 2-16 shows the assembly output of the innermost loop for the revised iir( ) function:
Table 2-4 shows how the iir( ) function has improved. Now, the code has only four execute packets; however, each packet has only five or six parallel instructions, which could probably be improved.

Phase 1: 5 × 50 + 20 = 270
Phase 2: 4 × 50 + 20 = 220
Use the profiler to view the cycle counts of the revised functions. Your profile window should look like this:
Notice that the cycle count for the second function, the vector multiply, is down from 316 to 172. The IIR filter has improved also: it is down from 270 to 220. However, the cycle count for the IIR filter is still too high. Naturally, the cycle count for main( ) has decreased also. It is down from 831 to 637.
The cycle count for the mac1( ) function has not changed.
You have done everything you can to refine the C code in the iir( ) function. To improve your code at this point, you need to use the assembly optimizer. This leads you to phase 3 of the code development flow.
- Parallel instructions
- Pipeline latency
- Register usage
- Which functional unit is being used

If you choose not to specify these things, the assembly optimizer determines the information that you do not include, based on the information that it has about your code. As with other code generation tools, you might need to modify your linear assembly code until you are satisfied with its performance. When you do this, you will probably want to add more detail to your linear assembly. For example, you might want to specify which functional unit should be used.

Before you use the assembly optimizer, you need to know the following things about how it works:

- A linear assembly file must be specified with a .sa extension.
- Linear assembly code should include the .cproc and .endproc directives. The .cproc and .endproc directives delimit a section of your code that you want the assembly optimizer to optimize. Use .cproc at the beginning of the section and .endproc at the end of the section. In this way, you can set off sections of your assembly code that you want to be optimized, like procedures or functions.
- Linear assembly code may include a .reg directive. The .reg directive allows you to use descriptive names for values that will be stored in registers. When you use .reg, the assembly optimizer chooses a register whose use agrees with the functional units chosen for the instructions that operate on the value.
- Linear assembly code may include a .trip directive. The .trip directive specifies the value of the trip count. The trip count indicates how many times a loop will iterate.
Now that you have some information about the fundamentals of linear assembly code, look at the revised C code for the biquad filter again. Example 2-17 shows the same code that you saw in Example 2-13 on page 2-20.
Example 2-18 shows how to rewrite the iir( ) function in linear assembly.
        .reg    cptr1, s01, s10, s23, c10, c32, s10_s, s10_t
        .reg    p0, p1, p2, p3, s23_s, s1, t, x, mask, sptr1, s10p, ctr

        MV      .2      cptr0,cptr1
        MV      .1      sptr0,sptr1
        MVK             50,ctr          ; setup loop counter

LOOP:   .trip 50
        LDW     .D1T1   *cptr0++[2],c32 ; CoefAddr[3] & CoefAddr[2]
        LDW     .D2T2   *cptr1++[2],c10 ; CoefAddr[1] & CoefAddr[0]
        LDW     .D1T2   *sptr0,s10      ; StateAddr[1] & StateAddr[0]
        MV      .2      s10,s10p        ; save StateAddr[1] & StateAddr[0]
        MPY     .M1     c32,s10,p2      ; CoefAddr[2] * StateAddr[0]
        MPYH    .M1     c32,s10,p3      ; CoefAddr[3] * StateAddr[1]
        ADD     .1      p2,p3,s23       ; CA[2] * SA[0] + CA[3] * SA[1]
        SHR     .1      s23,15,s23_s    ; (CA[2] * SA[0] + CA[3] * SA[1]) >> 15
        ADD     .2      s23_s,x,t       ; t = x+((CA[2]*SA[0]+CA[3]*SA[1])>>15)
        AND     .2      t,mask,t        ; clear upper 16 bits
        MPY     .M2     c10,s10,p0      ; CoefAddr[0] * StateAddr[0]
        MPYH    .M2     c10,s10,p1      ; CoefAddr[1] * StateAddr[1]
        ADD     .2      p0,p1,s10_t     ; CA[0] * SA[0] + CA[1] * SA[1]
        SHR     .2      s10_t,15,s10_s  ; (CA[0] * SA[0] + CA[1] * SA[1]) >> 15
        ADD     .2      s10_s,t,x       ; x = t+((CA[0]*SA[0]+CA[1]*SA[1])>>15)
        SHL     .2      s10p,16,s1      ; StateAddr[1] = StateAddr[0]
        OR      .2      t,s1,s01        ; StateAddr[0] = t
        STW     .D1     s01,*sptr1++    ; store StateAddr[1] & StateAddr[0]
 [ctr]  ADD     .S1     -1,ctr,ctr      ; dec outer lp cntr
 [ctr]  B       .S1     LOOP            ; Branch outer loop

        .endproc
Using demo3.c, shown in Example 2-19, run the revised functions through the code generation tools.
Use the shell program (cl6x) to compile, assemble, and link. Be sure you use the -mg option. The -mg option ensures that the optimizations that are used are compatible with profiling. On a command line, enter:

cl6x -g -o -k -mg demo3.c mac1.c vec_mpy2.c iir3.sa -z lnk.cmd -l rts6201.lib -o demo3.out

Notice that you used the shell program to compile a linear assembly file and a C file at the same time. Also notice that (except for the -mg option) you used the same options that you used in the first part of this tutorial. The assembly optimizer has a small set of unique options, but many of the options that you will use are shell options that apply to either linear assembly files or C files.
Table 2-6 shows how the iir( ) function has improved as it has moved through the three phases of code development.
Use the profiler to view the cycle counts of the revised functions. Your profile window should look like this:
Notice that the cycle count for the IIR filter has improved: it is down from 220 to 177. Naturally, the cycle count for main( ) has decreased also. It is down from 637 to 594.
The cycle counts for the mac1( ) function and the vec_mpy( ) function have not changed.
The Using the Assembly Optimizer chapter in the TMS320C6x Optimizing C Compiler User's Guide discusses the assembly optimizer in more detail.
2.8 Summary
Congratulations! In this tutorial, you learned the following things:
- What the three phases of code development are, how to determine which phases are appropriate for improving different parts of your code, and how to write your code for each phase.
- What a linear assembly file is and some fundamental information on how to write one.
- How to use the code generation tools to compile, assemble, and link your C and linear assembly files.
- How to use the profiler to analyze your results and determine whether or not you need to continue refining your code.
Chapter 3
Table 3-1 describes the steps recommended for developing code to achieve the highest performance on loops.

Step 1
    - Validates original C code
    - Determines which loops are most important in terms of MIPS requirements
Step 2: Add const declarations and loop count information
    - Reduces potential pointer aliasing problems
    - Allows loops with indeterminate iteration counts to execute epilogs
Step 3: Optimize C code using intrinsics and other methods
    - Facilitates use of certain C6x instructions not easily represented in C
    - Optimizes data flow bandwidth
Step 4a: Write linear assembly
    - Allows control in determining exact C6x instructions to be used
    - Provides flexibility of hand-coded assembly without worry of pipelining, parallelism, or register allocation
    - Can pass memory bank information to the tools
Step 4b: Add partitioning information to the linear assembly
    - Can improve partitioning of loops when necessary
    - Can avoid bottlenecks of certain hardware resources
When you achieve the desired performance in your code, there is no need to move to the next step. Each step in the development process involves passing more information to the C6x tools. Even at the final step, development time is greatly reduced from that of hand-coding, and the performance approaches the best that can be achieved by hand. Internal benchmarking efforts at Texas Instruments have shown that most loops achieve maximal throughput after steps 1 and 2. For loops that do not, the C compiler offers a rich set of optimizations that let you fine-tune the code entirely from the high-level C language. For the few loops that need even further optimizations, the assembly optimizer gives the programmer more flexibility than C can offer, works within the framework of C, and is much like programming in higher-level C. For more information on the assembly optimizer, see the TMS320C6x Optimizing C Compiler User's Guide and Chapter 6, Optimizing Assembly Code via Linear Assembly, in this book. For example linear assembly files, look in the demo directory included with the C6x tools.
To aid the development process, a feedback option (-mw) is included in the code generation tools. Example 3-1 shows output from the compiler and/or assembly optimizer of a particular loop. See the TMS320C6x Optimizing C Compiler User's Guide for more information about the -mw option.
This feedback is important in determining which optimizations might be useful for further improved performance. The following checklist is provided as a quick reference to techniques that can be used to optimize loops and refers to specific sections within this book for more detail.
Solution: Add const declarations to all pointers passed to a function that are read only. Use the -mt option to assume no memory pointer aliasing.
Solution (linear assembly): Make sure instructions accessing memory at the beginning of the loop do not use the same pointer variables as instructions accessing memory at the end of the loop.
See: Memory Dependencies (4-5, 4-6); Loop Carry Paths (6-73)

Feedback: Partitioned resource bound is higher than unpartitioned resource bound
Solution: Write code in linear assembly with partitioning/functional unit information.
See: page 6-19

Feedback: Too many instructions, or inefficient instructions were generated by the compiler
Solution: Use intrinsics in C code to select more efficient C6x instructions. Write code in linear assembly to pick the exact C6x instruction to be executed.
See: page 4-9

Feedback: Failed to software pipeline due to register live-too-long
Solution: Write linear assembly and insert MV instructions to split register lifetimes that are live-too-long.

Feedback: Failed to software pipeline due to register allocation
Solution: Try splitting the loop into two separate loops. If multiple conditionals are used in the loop, allocation of the condition registers could be the reason for the failure. Try writing linear assembly and partition all instructions writing to condition registers evenly between the A and B sides of the machine. If there is an uneven number, put more on the B side, since there are 3 condition registers on the B side and only 2 on the A side.

Solution: Use word access for short arrays; declare int * and use mpy intrinsics to multiply upper and lower halves of registers. Try to employ the redundant load elimination technique if possible.
Solution (linear assembly): Use LDW/STW instructions for accesses to memory.
See: Redundant Load Elimination; Using Word Access for Short Data in Part III (4-14, 6-106, 6-15)

Feedback: There are memory bank conflicts (specified in the memory analysis window of simulator)
Solution: Write linear assembly and use the .mptr directive.

Feedback: Larger outer loop overhead in nested loop
Solution: Unroll the inner loop. Make one loop with the outer loop instructions conditional on an inner loop counter.
See: Loop Unrolling (6-90)

Feedback: Uneven resources (for example, 3 multiplies per loop iteration)
Solution: Unroll the loop to make an even number of resources.
See: Loop Unrolling in Part III (4-23, 6-90)

Feedback: Two loops are generated, one not software pipelined
Solution: Use the _nassert statement to specify loop count information.
Solution (linear assembly): Use the .trip directive to specify loop count information.
See: Communicating Trip-Count Information to the Compiler (4-22)

Feedback: Loop will not software pipeline for other reasons
Solution: Make sure there are no function calls, branches to other code, or conditional break statements in the loop. See What Disqualifies a Loop from Being Software-Pipelined (4-26).
Solution: Try making the loop counter downcounting and declare it an int in C. See Tips on Data Types and Trip Count Issues (4-2, 4-21).
Solution: Remove any modifications to the loop counter inside the loop. See What Disqualifies a Loop from Being Software-Pipelined (4-26).
Part I: Introduction
Part II: C Code
Part III: Assembly Code
Part IV: Appendix
Chapter 4
Optimizing C Code
You can maximize C performance by using compiler options, intrinsics, and code transformations. This chapter discusses the following topics:
- The compiler and its options
- Intrinsics
- Software pipelining
- Loop unrolling

Topic

4.1 Writing C Code
4.2 Compiling C Code
4.3 Refining C Code
Writing C Code
4.1.1
Based on the size of each data type, follow these guidelines when writing C code:

- Avoid code that assumes that int and long types are the same size, because the C6x compiler uses long values for 40-bit operations.
- Use the short data type for fixed-point multiplication inputs whenever possible because this data type provides the most efficient use of the 16-bit multiplier in the C6x.
- Use int or unsigned int types for loop counters, rather than short or unsigned short data types, to avoid unnecessary sign-extension instructions.
- When using floating-point instructions on a floating-point device such as the C67x, use the -mv6700 compiler switch so the generated code uses the device's floating-point hardware instead of performing the task with fixed-point hardware.
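The guidelines on multiplication inputs and loop counters can be illustrated with a small sketch. The function name dot_product is hypothetical and the code is not taken from this guide; it simply applies the guidelines above by using short for the multiplier inputs and int for the loop counter and the accumulator.

```c
/* Hypothetical example: short inputs feed the 16-bit multiplier, while the
 * loop counter and the accumulator are int, so no extra sign-extension of
 * the counter is needed. */
int dot_product(const short *a, const short *b, int n)
{
    int i;
    int sum = 0;

    for (i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

For inputs {1, 2, 3} and {4, 5, 6}, the function returns 32. The sketch leaves out the 40-bit long type because its size on the C6x differs from most host compilers.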
4.1.2
- One of the preliminary measures of code is the time it takes the code to run. Use the clock( ) and printf( ) functions in C to time and display the performance of specific code regions. You can use the stand-alone simulator (load6x) to run the code for this purpose.
- Use the profile mode in the debugger, as explained in the TMS320C6x C Source Debugger User's Guide, to collect execution statistics about specific areas in your code.
- Use breakpoints, the clk register, and the RUNB command in the debugger, as described in the TMS320C6x C Source Debugger User's Guide, to track the number of CPU clock cycles consumed by a particular section of code.
- The critical performance areas in your code are most often loops. The easiest way to optimize a loop is by extracting it into a separate file that can be rewritten, recompiled, and run stand-alone.
As you use the techniques described in this chapter to optimize your C code, you can then evaluate the performance results by running the code and looking at the instructions generated by the compiler.
Compiling C Code
4.2.1 Compiler Options
Options control the operation of the compiler. Table 4-1 defines the options discussed in this chapter.
Although -o3 is preferable, at a minimum use the -o option. Use the -pm option for as much of your program as possible.
4.2.2 Memory Dependencies
To maximize the efficiency of your code, the C6x compiler schedules as many instructions as possible in parallel. To schedule instructions in parallel, the compiler must determine the relationships, or dependencies, between instructions. Dependency means that one instruction must occur before another. Because only independent instructions can execute in parallel, dependencies inhibit parallelism.
If the compiler cannot determine that two instructions are independent (for example, b does not depend on a), it assumes a dependency and schedules the two instructions sequentially. If the compiler can determine that two instructions are independent of one another, it can schedule them in parallel.
Often it is difficult for the compiler to determine if instructions that access memory are independent. The following techniques help the compiler determine which instructions are independent:
- Use the const keyword to indicate which objects are not changed by a function.
- Use the -pm (program-level optimization) option, which gives the compiler global access to the whole program or module and allows it to be more aggressive in ruling out dependencies.
- Use the -mt option, which allows the compiler to use assumptions that allow it to eliminate dependencies.

To illustrate the concept of memory dependencies, it is helpful to look at the algorithm code in a dependency graph. Example 4-1 shows the C code for a basic vector sum. Figure 4-1 shows the dependency graph for this basic vector sum. (For more information, see section 6.2.4, Drawing a Dependency Graph, on page 6-6.)
The paths from sum[i] back to in1[i] and in2[i] indicate that writing to sum may have an effect on the memory pointed to by either in1 or in2. A read from in1 or in2 cannot begin until the write to sum finishes, which creates an aliasing problem. Aliasing occurs when two pointers can point to the same memory location. For example, if vecsum( ) is called in a program with the following statements, in1 and sum alias each other because they both point to the same memory location:
short a[10], b[10];
vecsum(a, a, b, 10);
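The following sketch shows why the compiler has to be conservative when pointers may alias. It assumes a basic vector sum with the argument order used in the call above (sum, in1, in2, N); the function body is written for this illustration and is not the listing from Example 4-1. The overlapping call described after the code is a hypothetical case chosen so that the effect of aliasing is visible in the results.

```c
/* Assumed form of a basic vector sum: no const qualifiers, so the compiler
 * cannot rule out that a write to sum changes in1 or in2. */
void vecsum(short *sum, short *in1, short *in2, unsigned int N)
{
    unsigned int i;

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}
```

If a holds {1, 1, 1, 1} and b holds {1, 1, 1}, the call vecsum(&a[1], a, b, 3) makes each iteration read the element written by the previous one, so a becomes {1, 2, 3, 4}. If the loads of in1 were moved ahead of the earlier stores to sum, the result would be {1, 2, 2, 2} instead, which is why the store-to-load dependency must be honored unless the compiler is told that the pointers do not alias.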
4.2.2.1
Example 4-2 shows the vecsum( ) example rewritten with the const keyword to indicate that the write to sum never changes the memory referenced by in1 and in2. Figure 4-2 shows the revised dependency graph for the code in the inner loop.
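A minimal sketch of such a rewrite is shown below. It is an assumption about the general shape of the change, not the listing from the example: the name vecsum_const is hypothetical, and the only difference from an unqualified vector sum is the const qualifier on the two input pointers.

```c
/* Hypothetical const-qualified vector sum. Per the text above, the const
 * qualifiers tell the C6x compiler that writes through sum do not change
 * the data read through in1 and in2. */
void vecsum_const(short *sum, const short *in1, const short *in2,
                  unsigned int N)
{
    unsigned int i;

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}
```

The caller is still responsible for passing arrays that do not overlap; the qualifier describes an intent that the compiler relies on and does not check at run time.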
Example 4-3 shows the output of the compiler for the vector sum in Example 4-2. The compiler finds better schedules when dependency paths are eliminated between instructions. For this loop, the compiler found a software pipeline with a 2-cycle kernel, compared with seven cycles for the previous loop. (The kernel is the body of a pipelined loop where all instructions execute in parallel.)
For basic information on assembly code, see Chapter 4, Structure of Assembly Code.
4.2.2.2
- If a particular argument in a function always has the same value, the compiler replaces the argument with the value and passes the value instead of the argument.
- If a return value of a function is never used, the compiler deletes the return code in the function.
- If a function is not called, directly or indirectly, the compiler removes the function.

Another way to eliminate memory dependencies is to use the -mt option, which allows the compiler to use assumptions that can eliminate memory dependency paths. For example, if you use the -mt option when compiling the code in Example 4-1, the compiler assumes that in1 and in2 do not alias memory pointed to by sum and, therefore, eliminates memory dependencies among the instructions that access those variables.
Refining C Code
- Using intrinsics to replace complicated C code
- Using word access to operate on 16-bit data stored in the high and low parts of a 32-bit register
- Software pipelining the instructions manually
- Using double access to operate on 32-bit data stored in a 64-bit register pair (C67x only)
The C6x compiler provides intrinsics, special functions that map directly to inlined C62x/C67x instructions, to optimize your C code quickly. All instructions that are not easily expressed in C code are supported as intrinsics. Intrinsics are specified with a leading underscore ( _ ) and are accessed by calling them as you call a function. For example, saturated addition can be expressed in C code only by writing a multicycle function, such as the one in Example 4-4.
This complicated code can be replaced by the _sadd( ) intrinsic, which results in a single C6x instruction (see Example 4-5).
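To show the kind of multicycle C code that a saturating add requires, here is a portable sketch. It is not the listing from the example referenced above; the name sadd_c is hypothetical, and the code assumes that the result should clamp to the range of int, which is 32 bits on the C6x.

```c
#include <limits.h>

/* Hypothetical portable saturated addition: clamp to INT_MAX or INT_MIN
 * instead of overflowing. The range checks are done before the addition so
 * that the addition itself never overflows. */
int sadd_c(int a, int b)
{
    if (b > 0 && a > INT_MAX - b)
        return INT_MAX;
    if (b < 0 && a < INT_MIN - b)
        return INT_MIN;
    return a + b;
}
```

Each call costs two comparisons and branches in addition to the add, which is the overhead that a single-instruction intrinsic avoids.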
Table 42 lists the C6x intrinsics. For more information on using intrinsics, see the TMS320C6x Optimizing C Compiler Users Guide.
C compiler intrinsic                              Assembly instruction
                                                  CLR
int _dpint(double);                               DPINT
                                                  EXT
                                                  EXTU
uint _ftoi(float);
uint _hi(double);
float _itof(uint);
uint _lo(double);
int _mpy(int src1, int src2);
int _mpyus(uint src1, int src2);
int _mpysu(int src1, uint src2);
uint _mpyu(uint src1, uint src2);
int _mpyh(int src1, int src2);
int _mpyhus(uint src1, int src2);
int _mpyhsu(int src1, uint src2);
uint _mpyhu(uint src1, uint src2);
    Multiplies the 16 MSBs of src1 by the 16 MSBs of src2 and returns the result. Values can be signed or unsigned.
uint _norm(int src2);
uint _lnorm(long src2);
double _rcpdp(double);
float _rcpsp(float);
double _rsqrdp(double src);
float _rsqrsp(float src);
int _sadd(int src1, int src2);
long _lsadd(int src1, long src2);
int _sat(long src2);                              SAT
uint _set(uint src2, uint csta, uint cstb);       SET
int _smpy(int src1, int src2);
int _smpyh(int src1, int src2);
int _smpyhl(int src1, int src2);
int _smpylh(int src1, int src2);
int _spint(float);
                                                  SSHL
int _ssub(int src1, int src2);                    SSUB
long _lssub(int src1, long src2);
uint _subc(uint src1, uint src2);                 SUBC
int _sub2(int src1, int src2);                    SUB2
4.3.2
Example 4-6. Vector Sum With const Keywords, _nassert, Word Reads

void vecsum4(short *sum, const short *in1, const short *in2, unsigned int N)
{
    int i;
    const int *i_in1 = (const int *)in1;
    const int *i_in2 = (const int *)in2;
    int *i_sum = (int *)sum;

    _nassert(N >= 20);

    for (i = 0; i < (N/2); i++)
        i_sum[i] = _add2(i_in1[i], i_in2[i]);
}
Note: The _nassert intrinsic tells the optimizer that the code that follows meets the condition specified. This transformation assumes that the pointers sum, in1, and in2 can be cast to int *, which means that they must point to word-aligned data. By default, the compiler aligns all short arrays on word boundaries; however, a call like the following creates an illegal memory access:
    short a[51], b[50], c[50];
    vecsum4(&a[1], b, c, 50);
Another consideration is that the loop must now run for an even number of iterations. You can ensure that this happens by padding the short arrays so that the loop always operates on an even number of elements.
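As an illustration of what the word-wide addition in Example 4-6 computes, the sketch below models a packed add of two 16-bit halves in portable C. The name add2_ref is hypothetical, and the model assumes that a carry out of the low halfword is discarded rather than propagated into the high halfword.

```c
#include <stdint.h>

/* Hypothetical model of a packed 16-bit add: the low halves and the
 * high halves of the two words are added independently. */
uint32_t add2_ref(uint32_t x, uint32_t y)
{
    /* Low halfwords: keep 16 bits, drop any carry. */
    uint32_t lo = (x + y) & 0x0000FFFFu;

    /* High halfwords: add separately, then move back into place.
     * Any carry out of bit 31 is discarded by the unsigned shift. */
    uint32_t hi = ((x >> 16) + (y >> 16)) << 16;

    return hi | lo;
}
```

Because each halfword is summed on its own, one word operation produces two short results, which is why the loop in Example 4-6 runs for N/2 iterations.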
If a vecsum( ) function is needed to handle short-aligned data and odd-numbered loop counters, then you must add code within the function to check for these cases. Knowing what type of data is passed to a function can improve performance considerably. It may be useful to write different functions that can handle different types of data. If your short-data operations always operate on even-numbered word-aligned arrays, then the performance of your application can be improved. However, Example 4-7 provides a generic vecsum( ) function that handles all types of data.
Example 4-7. Vector Sum With const Keywords, _nassert, Word Reads (Generic Version)

void vecsum5(short *sum, const short *in1, const short *in2, unsigned int N)
{
    int i;

    _nassert(N >= 20);

    if (((int)sum | (int)in2 | (int)in1) & 0x2)
    {
        /* At least one pointer is not word-aligned: use halfword accesses. */
        for (i = 0; i < N; i++)
            sum[i] = in1[i] + in2[i];
    }
    else
    {
        const int *i_in1 = (const int *)in1;
        const int *i_in2 = (const int *)in2;
        int *i_sum = (int *)sum;

        for (i = 0; i < (N/2); i++)
            i_sum[i] = _add2(i_in1[i], i_in2[i]);

        /* An odd N leaves the last halfword, at index N - 1, unprocessed. */
        if (N & 0x1)
            sum[N - 1] = in1[N - 1] + in2[N - 1];
    }
}
4.3.2.1
4.3.2.2
Example 4-10 shows an optimized version of Example 4-9. The optimized version passes an int array instead of casting the short arrays to int arrays and, therefore, helps ensure that data passed to the function is word-aligned. Assuming that a prototype is used, each invocation of the function ensures that the input arrays are word-aligned by forcing you to insert a cast or by using int arrays that contain short data.
4.3.2.3
In Example 4-12, the dot product example is rewritten to use doubleword loads, and intrinsics are used to extract the high and low 32-bit values contained in the 64-bit double. The _hi() and _lo() intrinsics return integer values; the _itof() intrinsic subverts the C typing system by interpreting an integer value as a float value. In this version of the loop, 2 float results are computed every 4 cycles.
In Example 4-13, the dot product example is unrolled to maximize performance. The preprocessor is used to define convenient macros FHI() and FLO() for accessing the high and low 32-bit values in a double word. In this version of the loop, 8 float values are computed every 4 cycles.
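The bit-level reinterpretation described above can be modeled on a host machine as follows. This is only a sketch: hi_ref, lo_ref, itof_ref, and pack_floats_ref are hypothetical names, not the compiler's intrinsics, and the code assumes a 32-bit IEEE float and a 64-bit IEEE double.

```c
#include <stdint.h>
#include <string.h>

/* Low 32 bits of the 64-bit pattern held in d. */
uint32_t lo_ref(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (uint32_t)(bits & 0xFFFFFFFFu);
}

/* High 32 bits of the 64-bit pattern held in d. */
uint32_t hi_ref(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (uint32_t)(bits >> 32);
}

/* Reinterpret a 32-bit integer pattern as a float (no conversion). */
float itof_ref(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Helper for experiments: place two floats in one 64-bit container,
 * as a doubleword load of two adjacent floats would. */
double pack_floats_ref(float lo, float hi)
{
    uint32_t ulo, uhi;
    uint64_t bits;
    double d;
    memcpy(&ulo, &lo, sizeof ulo);
    memcpy(&uhi, &hi, sizeof uhi);
    bits = ((uint64_t)uhi << 32) | ulo;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

The double is used purely as a 64-bit container here; its value as a double-precision number is never interpreted.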
4.3.3 Software Pipelining
Software pipelining is a technique used to schedule instructions from a loop so that multiple iterations of the loop execute in parallel. When you use the -o2 and -o3 compiler options, the compiler attempts to software pipeline your code with information that it gathers from your program. Figure 4-3 illustrates a software-pipelined loop. The stages of the loop are represented by A, B, C, D, and E. In this figure, a maximum of five iterations of the loop can execute at one time. The shaded area represents the loop kernel. In the loop kernel, all five stages execute in parallel. The area immediately before the kernel is known as the pipelined-loop prolog, and the area immediately following the kernel is known as the pipelined-loop epilog.
Because loops present critical performance areas in your code, consider the following areas to improve the performance of your C code:
4.3.3.1
The minimum trip count for a software pipelined loop is determined by the minimum number of times the loop will execute. If the compiler knows the trip count, it can generate faster and more compact code. If the compiler cannot determine that a loop always executes for the minimum trip count, it generates a redundant unpipelined loop. The redundant unpipelined loop is executed only when the runtime trip count is less than the minimum trip count; otherwise, the software-pipelined version of the loop is executed.
4.3.3.2
An unpipelined version that executes if N is less than the minimum trip count
A software-pipelined version that executes if N is equal to or greater than the minimum trip count
To indicate to the compiler that you do not want two versions of the loop, you can use the -ms option so that the compiler generates only the software-pipelined code and never generates a redundant loop; however, loops with an unknown trip count are not software pipelined.
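The two-version scheme can be pictured in C as a run-time test on the trip count. The sketch below is only a conceptual model of what the generated code does: MIN_TRIP and vecsum_dispatch are hypothetical, both loop bodies are written identically in C, and the return value exists solely to show which path was taken.

```c
#define MIN_TRIP 8u   /* hypothetical minimum trip count */

/* Conceptual model of a loop compiled in two versions.
 * Returns 0 if the redundant (unpipelined) path ran and
 * 1 if the path standing in for the pipelined loop ran. */
int vecsum_dispatch(short *sum, const short *in1, const short *in2,
                    unsigned int n)
{
    unsigned int i;

    if (n < MIN_TRIP) {
        /* Redundant unpipelined loop: safe for any trip count. */
        for (i = 0; i < n; i++)
            sum[i] = (short)(in1[i] + in2[i]);
        return 0;
    }

    /* Stands in for the software-pipelined loop, which is valid
     * only when n is at least the minimum trip count. */
    for (i = 0; i < n; i++)
        sum[i] = (short)(in1[i] + in2[i]);
    return 1;
}
```

When the compiler can prove that n is never below the minimum trip count, the test and the first loop are unnecessary, which is the code-size saving discussed in this section.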
4.3.3.3
Use the -o3 and -pm compiler options to allow the optimizer to access the whole program or large parts of it and to characterize the behavior of loop trip counts. Use the _nassert intrinsic to help reduce code size by preventing the generation of a redundant loop or by allowing the compiler (with or without the -ms option) to software pipeline innermost loops. Example 4-15 shows the vector sum code with an _nassert intrinsic that asserts that N is always at least 10.
See the TMS320C6x Optimizing C Compiler User's Guide for a complete discussion of the -ms, -o3, and -pm options and the _nassert intrinsic.
4.3.3.4 Loop Unrolling
Another technique that improves performance is unrolling the loop; that is, expanding small loops so that each iteration of the loop appears in your code. This optimization increases the number of instructions available to execute in parallel. You can use loop unrolling when the operations in a single iteration do not use all of the resources of the C6x architecture. In Example 4-16, the loop produces a new sum[i] every two cycles. Three memory operations are performed: a load for both in1[i] and in2[i] and a store for sum[i]. Because only two memory operations can execute per cycle, two cycles are necessary to perform three memory operations.
The performance of a software pipeline is limited by the number of resources that can execute in parallel. In its word-aligned form (Example 4-17), the vector sum loop delivers two results every two cycles because the two loads and the store are all operating on two 16-bit values at a time.
If you unroll the loop once, the loop then performs six memory operations per iteration, which means the unrolled vector sum loop can deliver four results every three cycles (that is, 1.33 results per cycle). Example 4-18 shows four results for each iteration of the loop: sum[i] and sum[i + sz] each store an int value that represents two 16-bit values. Example 4-18 is not simple loop unrolling where the loop body is simply replicated. The additional instructions use memory pointers that are offset to point midway into the input arrays, and the code assumes that the arrays are a multiple of four shorts in size.
Example 4-18. Vector Sum Using const Keywords, _nassert, Word Reads, and Loop Unrolling

void vecsum6(int *sum, const int *in1, const int *in2, unsigned int N)
{
    int i;
    int sz = N >> 2;

    _nassert(N >= 20);

    for (i = 0; i < sz; i++)
    {
        sum[i] = _add2(in1[i], in2[i]);
        sum[i+sz] = _add2(in1[i+sz], in2[i+sz]);
    }
}
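The cycle counts quoted for the vector sum loops follow from the limit of two memory operations per cycle. The sketch below restates that arithmetic; the function names are hypothetical, and the model considers only the memory-operation bound, ignoring every other resource.

```c
/* Minimum cycles per iteration when at most two memory
 * operations can issue per cycle (rounded up). */
unsigned int mem_bound_cycles(unsigned int mem_ops)
{
    return (mem_ops + 1u) / 2u;
}

/* Results delivered per cycle for a loop that produces `results`
 * values per iteration and needs `mem_ops` memory operations. */
double results_per_cycle(unsigned int results, unsigned int mem_ops)
{
    return (double)results / (double)mem_bound_cycles(mem_ops);
}
```

With these helpers, three memory operations give a two-cycle loop (one short result, or two with word accesses), and six memory operations give a three-cycle loop that delivers four results, or about 1.33 results per cycle.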
Software pipelining is performed by the compiler only on inner loops; therefore, you can increase performance by creating larger inner loops. One method for creating large inner loops is to completely unroll inner loops that execute for a small number of cycles. In Example 4-19, the compiler pipelines the inner loop with a kernel size of one cycle; therefore, the inner loop completes a result every cycle. However, the overhead of filling and draining the software pipeline can be significant, and other outer-loop code is not software pipelined.
For loops with a simple loop structure, the compiler uses a heuristic to determine if it should unroll the loop. Because unrolling can increase code size, in some cases the compiler does not unroll the loop. If you have identified this loop as being critical to your application, then unroll the inner loop in C code, as in Example 4-20.
Now the outer loop is software-pipelined, and the overhead of draining and filling the software pipeline occurs only once per invocation of the function rather than for each iteration of the outer loop.
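As a small illustration of completely unrolling an inner loop by hand, consider the following filter sketch. It is not the guide's Example 4-19 or 4-20: the function names, the four-tap length NTAPS, and the int output type are all assumptions chosen to keep the example short.

```c
#define NTAPS 4   /* hypothetical filter length, kept small on purpose */

/* Nested form: only the short inner loop is a pipelining candidate. */
void fir_nested(const short *in, const short *coefs, int *out, int nout)
{
    int i, j;
    for (i = 0; i < nout; i++) {
        int sum = 0;
        for (j = 0; j < NTAPS; j++)
            sum += coefs[j] * in[i + NTAPS - 1 - j];
        out[i] = sum;
    }
}

/* Inner loop written out in full: the former outer loop is now the
 * innermost loop and becomes the pipelining candidate. */
void fir_unrolled(const short *in, const short *coefs, int *out, int nout)
{
    int i;
    for (i = 0; i < nout; i++) {
        int sum = 0;
        sum += coefs[0] * in[i + 3];
        sum += coefs[1] * in[i + 2];
        sum += coefs[2] * in[i + 1];
        sum += coefs[3] * in[i + 0];
        out[i] = sum;
    }
}
```

Both functions compute the same outputs; the difference lies only in which loop the compiler sees as innermost.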
4.3.3.5
4.3.3.6
Although a software-pipelined loop can contain intrinsics, it cannot contain function calls.
You must not have a conditional break (early exit) in the loop.
The loop cannot have an incrementing loop counter. One reason that you run the optimizer with the -o2 or -o3 option is to convert as many loops as possible into downcounting loops.
If the trip counter is modified within the body of the loop, it typically cannot be converted into a downcounting loop. For example, the following code is not software-pipelined:

    for (i = 0; i < n; i++)
    {
        ...
        i += x;
    }

A conditionally incremented loop control variable is not software-pipelined. For example, the following code is not software-pipelined:

    for (i = 0; i < x; i++)
    {
        ...
        if (b > a)
            i += 2;
    }

If the code size is too large and requires more than the 32 registers in the C6x, it is not software-pipelined.
If a register value is live too long, the code is not software-pipelined. See section 6.5.6.2, Live Too Long, on page 6-63 and section 6.9, Live-Too-Long Issues, on page 6-97 for examples of code that is live too long.
If the loop has complex condition code within the body that requires more than the five C6x condition registers, the loop is not software pipelined.
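The conversion to a downcounting loop mentioned above can be sketched in C. Both functions below are hypothetical illustrations: sum_up is the form you would normally write, and sum_down shows the equivalent shape with a counter that runs down to zero, which is possible only because the counter is not modified inside the loop body.

```c
/* Upcounting form, as typically written in C. */
int sum_up(const short *x, int n)
{
    int i;
    int s = 0;
    for (i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Equivalent downcounting form: the trip counter is decremented
 * toward zero and the array is walked with a pointer instead. */
int sum_down(const short *x, int n)
{
    int cnt;
    int s = 0;
    for (cnt = n; cnt > 0; cnt--)
        s += *x++;
    return s;
}
```

If the body of sum_up changed i, the number of iterations could no longer be computed on entry, and no such rewrite would exist.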
Chapter 5
Topic
5.1 Labels
5.2 Parallel Bars
5.3 Conditions
5.4 Instructions
5.5 Functional Units
5.6 Operands
5.7 Comments
5.1 Labels
A label identifies a line of code or a variable and represents a memory address that contains either an instruction or data. Figure 5-1 shows the position of the label in a line of assembly code. The colon following the label is optional.
The first character of a label must be a letter or an underscore ( _ ) followed by a letter. The first character of the label must be in the first column of the text file. Labels can include up to 32 alphanumeric characters.
5.2 Parallel Bars
An instruction that executes in parallel with the previous instruction signifies this with parallel bars (||). This field is left blank for an instruction that does not execute in parallel with the previous instruction.
5.3 Conditions
Five registers in the C6x are available for conditions: A1, A2, B0, B1, and B2. Figure 5-3 shows the position of a condition in a line of assembly code.
If no condition is specified, the instruction is always performed. If a condition is specified and that condition is true, the instruction executes. For example:
With this condition ...     The instruction executes if ...
[A1]                        A1 != 0
[!A1]                       A1 = 0
If a condition is specified and that condition is false, the instruction does not execute.
With this condition ...     The instruction does not execute if ...
[A1]                        A1 = 0
[!A1]                       A1 != 0
5.4 Instructions
Assembly code instructions are either directives or mnemonics:
Assembler directives are commands for the assembler (asm6x) that control the assembly process or define the data structures (constants and variables) in the assembly language program. All assembler directives begin with a period, as shown in the partial list in Table 5-1.
Processor mnemonics are the actual microprocessor instructions that execute at runtime and perform the operations in the program. Table 5-2 summarizes the C6x mnemonics. Processor mnemonics must begin in column 2 or greater.
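Directive                               Description
.double value                           Reserve 64 bits in memory and fill with the double-precision IEEE floating-point representation of specified value
.float value                            Reserve 32 bits in memory and fill with the single-precision IEEE floating-point representation of specified value
.int value, .long value, .word value    Reserve 32 bits in memory and fill with specified value
.short value, .half value               Reserve 16 bits in memory and fill with specified value
.byte value                             Reserve 8 bits in memory and fill with specified value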
See the TMS320C6x Assembly Language Tools User's Guide for a complete list of directives.
See the TMS320C62x/C67x CPU and Instruction Set Reference Guide for a complete list of instructions.
5.5 Functional Units
Single-precision (32-bit) floating-point IEEE multiplies
Double-precision (64-bit) floating-point IEEE multiplies
.D unit (.D1, .D2): 32-bit add, subtract, linear and circular address calculation
Specifying the functional unit in the assembly code is optional. The functional unit can be used to document which resource(s) each instruction uses.
5.6 Operands
The C6x architecture requires that memory reads and writes move data between memory and a register. Figure 5-7 shows the position of the operands in a line of assembly code.
Instructions have the following requirements for operands in the assembly code:
All instructions require a destination operand.
Most instructions require one or two source operands.
The destination operand must be in the same register file as one source operand.
One source operand from each register file per execute packet can come from the register file opposite that of the other source operand.
When an operand comes from the other register file, the unit includes an X, as shown in Figure 5-8, indicating that the instruction is using one of the cross paths. (See the TMS320C6x CPU and Instruction Set Reference Guide for more information on cross paths.)
Register operands indicate a register that contains the data.
Constant operands specify the data within the assembly code.
Pointer operands contain addresses of data values. Only the load and store instructions require and use pointer operands to move data values between memory and a register.
5.7 Comments
As with all programming languages, comments provide code documentation. Figure 5-9 shows the position of the comment in a line of assembly code.
A comment may begin in any column when preceded by a semicolon (;). A comment must begin in the first column when preceded by an asterisk (*). Comments are not required but are recommended.
Chapter 6
Topic
6.1 Assembly Code
6.2 Writing Parallel Code
6.3 Using Word Access for Short Data and Doubleword Access for Floating-Point Data
6.4 Software Pipelining
6.5 Modulo Scheduling of Multicycle Loops
6.6 Loop Carry Paths
6.7 If-Then-Else Statements in a Loop
6.8 Loop Unrolling
6.9 Live-Too-Long Issues
6.10 Redundant Load Elimination
6.11 Memory Banks
6.12 Software Pipelining the Outer Loop
6.13 Outer Loop Conditionally Executed With Inner Loop
6.1 Assembly Code
Finding instructions that can be executed in parallel
Handling pipeline latencies during software pipelining
Assigning register usage
Defining which unit to use
Although you have the option with the C6x to specify the functional unit or register used, this may restrict the compiler's ability to fully optimize your code. See the TMS320C6x Optimizing C Compiler User's Guide for more information. This chapter takes you through the optimization process manually to show you how the assembly optimizer works and to help you understand when you might want to perform some of the optimizations manually. Each section introduces optimization techniques in increasing complexity:
Section 6.2 and section 6.3 begin with a dot product algorithm to show you how to translate the C code to assembly code and then how to optimize the linear assembly code with several simple techniques.
Section 6.4 and section 6.5 introduce techniques for the more complex algorithms associated with software pipelining, such as modulo iteration interval scheduling for both single-cycle loops and multicycle loops.
Section 6.6 uses an IIR filter algorithm to discuss the problems with loop carry paths.
Section 6.7 and section 6.8 discuss the problems encountered with if-then-else statements in a loop and how loop unrolling can be used to resolve them.
Section 6.9 introduces live-too-long issues in your code.
Section 6.10 uses a simple FIR filter algorithm to discuss redundant load elimination.
Section 6.11 discusses the same FIR filter in terms of the interleaved memory bank scheme used by C6x devices.
Section 6.12 and section 6.13 show you how to execute the outer loop of the FIR filter conditionally and in parallel with the inner loop.
Algorithm in C code
Translation of the C code to linear assembly
Dependency graph to describe the flow of data in the algorithm
Allocation of resources (functional units, registers, and cross paths) in linear assembly
Note: There are three types of code for the C6x: C code (which is input for the C compiler), linear assembly code (which is input for the assembly optimizer), and assembly code (which is input for the assembler). In the next three sections, we use the dot product to demonstrate how to use various programming techniques to optimize both performance and code size. Most of the examples provided in this book use fixed-point arithmetic; however, the next three sections give both fixed-point and floating-point examples of the dot product to show that the same optimization techniques apply to both fixed- and floating-point programs.
6.2.1
6.2.2
6.2.2.1
The load halfword (LDH) instructions increment through the a and b arrays. Each LDH does a postincrement on the pointer. Each iteration of these instructions sets the pointer to the next halfword (16 bits) in the array. The ADD instruction accumulates the total of the results from the multiply (MPY) instruction. The subtract (SUB) instruction decrements the loop counter. An additional instruction is included to execute the branch back to the top of the loop. The branch (B) instruction is conditional on the loop counter, A1, and executes only until A1 is 0.
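The instruction roles described above can be mirrored in C. The following sketch is a hypothetical host-side reference, not the guide's own listing: dotp_ref and its parameter n are assumptions, and a while loop is used so that a zero count is handled safely.

```c
/* Hypothetical C reference for the fixed-point dot product loop.
 * Each statement is annotated with the instruction it stands for. */
int dotp_ref(const short *a, const short *b, int n)
{
    int sum = 0;
    int cntr = n;               /* loop counter (the role of A1)   */

    while (cntr != 0) {         /* B: branch while counter nonzero */
        short ai = *a++;        /* LDH with pointer postincrement  */
        short bi = *b++;        /* LDH with pointer postincrement  */
        int pi = ai * bi;       /* MPY                             */
        sum += pi;              /* ADD accumulates the products    */
        cntr -= 1;              /* SUB decrements the counter      */
    }
    return sum;
}
```

For the floating-point variant, the same structure applies with float data and the corresponding word loads, multiply, and add.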
6.2.2.2
The load word (LDW) instructions increment through the a and b arrays. Each LDW does a postincrement on the pointer. Each iteration of these instructions sets the pointer to the next word (32 bits) in the array. The ADDSP instruction accumulates the total of the results from the multiply (MPYSP) instruction. The subtract (SUB) instruction decrements the loop counter. An additional instruction is included to execute the branch back to the top of the loop. The branch (B) instruction is conditional on the loop counter, A1, and executes only until A1 is 0.
6.2.3
Load (LDH and LDW) instructions must use a .D unit.
Multiply (MPY and MPYSP) instructions must use a .M unit.
Add (ADD and ADDSP) instructions use a .L unit.
Subtract (SUB) instructions use a .S unit.
Branch (B) instructions must use a .S unit.
Note: The ADD and SUB can be on the .S, .L, or .D units; however, for Example 6-3 and Example 6-4, they are assigned as listed above. The ADDSP instruction in Example 6-4 must use a .L unit.
6.2.4
A node is a point on a dependency graph with one or more data paths flowing in and/or out.
The path shows the flow of data between nodes.
The numbers beside each path represent the number of cycles required to complete the instruction.
An instruction that writes to a variable is referred to as a parent instruction and defines a parent node.
An instruction that reads a variable written by a parent instruction is referred to as its child and defines a child node.
Use the following steps to draw a dependency graph:
1) Define the nodes based on the variables accessed by the instructions.
2) Define the data paths that show the flow of data between nodes.
3) Add the instructions and the latencies.
4) Add the functional units.
6.2.4.1
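[Figure: dependency graph of the fixed-point dot product, annotated with functional units and register allocation. MPY (.M1) writes pi (A6) with a latency of 2; ADD (.L1) accumulates sum (A7); B LOOP uses .S1.]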
The two LDH instructions, which write the values of ai and bi, are parents of the MPY instruction. It takes five cycles for the parent (LDH) instruction to complete. Therefore, if LDH is scheduled on cycle i, then its child (MPY) cannot be scheduled until cycle i + 5. The MPY instruction, which writes the product pi, is the parent of the ADD instruction. The MPY instruction takes two cycles to complete. The ADD instruction adds pi (the result of the MPY) to sum. The output of the ADD instruction feeds back to become an input on the next iteration and, thus, creates a loop carry path. (See section 6.6 on page 6-74 for more information on loop carry paths.)
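The scheduling rule stated above, that a child cannot be scheduled earlier than its parent's cycle plus the parent's latency, can be written as a one-line calculation. The sketch below uses the latencies given in the text; the function and constant names are hypothetical.

```c
/* Latencies quoted in the text for the fixed-point dot product. */
enum { LDH_LATENCY = 5, MPY_LATENCY = 2 };

/* Earliest cycle on which a child can be scheduled, given the
 * cycle of its parent and the parent's latency. */
int earliest_child_cycle(int parent_cycle, int parent_latency)
{
    return parent_cycle + parent_latency;
}
```

For instance, with LDH on cycle 0, MPY can start no earlier than cycle 5, and the ADD that consumes pi no earlier than cycle 7.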
The dependency graph for this dot product algorithm has two separate parts because the decrement of the loop counter and the branch do not read or write any variables from the other part.
6.2.4.2
The SUB instruction writes to the loop counter, cntr. The output of the SUB instruction feeds back and creates a loop carry path. The branch (B) instruction is a child of the loop counter.
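[Figure: dependency graph of the floating-point dot product, annotated with functional units and register allocation. MPYSP (.M1) writes pi (A6) with a latency of 4; ADDSP (.L1) accumulates sum (A7); B LOOP uses .S1.]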
The two LDW instructions, which write the values of ai and bi, are parents of the MPYSP instruction. It takes five cycles for the parent (LDW) instruction to complete. Therefore, if LDW is scheduled on cycle i, then its child (MPYSP) cannot be scheduled until cycle i + 5. The MPYSP instruction, which writes the product pi, is the parent of the ADDSP instruction. The MPYSP instruction takes four cycles to complete. The ADDSP instruction adds pi (the result of the MPYSP) to sum. The output of the ADDSP instruction feeds back to become an input on the next iteration and, thus, creates a loop carry path. (See section 6.6 on page 6-74 for more information on loop carry paths.)
The dependency graph for this dot product algorithm has two separate parts because the decrement of the loop counter and the branch do not read or write any variables from the other part.
The SUB instruction writes to the loop counter, cntr. The output of the SUB instruction feeds back and creates a loop carry path. The branch (B) instruction is a child of the loop counter.
6.2.5
6.2.5.1
Assigning the same functional unit to both LDH instructions slows performance of this loop. Therefore, reassign the functional units to execute the code in parallel, as shown in the dependency graph in Figure 6-3. The parallel assembly code is shown in Example 6-6.
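Figure 6-3. Dependency Graph of Fixed-Point Dot Product with Parallel Assembly
[Figure: LDH ai (.D1) and LDH bi (.D2) each feed MPY (.M1X) with a latency of 5; MPY writes pi and feeds ADD (.L1) with a latency of 2; ADD feeds sum back to itself with a latency of 1; SUB (.S1) feeds B LOOP (.S1) with a latency of 1.]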
Because the loads of ai and bi do not depend on one another, both LDH instructions can execute in parallel as long as they do not share the same resources. To schedule the load instructions in parallel, allocate the functional units as follows:
ai and the pointer to ai to a functional unit on the A side, .D1
bi and the pointer to bi to a functional unit on the B side, .D2
Because the MPY instruction now has one source operand from A and one from B, MPY uses the 1X cross path.
Rearranging the order of the instructions also improves the performance of the code. The SUB instruction can take the place of one of the NOP delay slots for the LDH instructions. Moving the B instruction after the SUB removes the need for the NOP 5 used at the end of the code in Example 6-5. The branch now occurs immediately after the ADD instruction so that the MPY and ADD execute in parallel with the five delay slots required by the branch instruction.
6.2.5.2
Assigning the same functional unit to both LDW instructions slows performance of this loop. Therefore, reassign the functional units to execute the code in parallel, as shown in the dependency graph in Figure 6-4. The parallel assembly code is shown in Example 6-8.
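Figure 6-4. Dependency Graph of Floating-Point Dot Product with Parallel Assembly
[Figure: LDW ai (.D1) and LDW bi (.D2) each feed MPYSP (.M1X) with a latency of 5; MPYSP writes pi and feeds ADDSP (.L1) with a latency of 4; ADDSP feeds sum back to itself with a latency of 4; SUB (.S1) feeds B LOOP (.S1) with a latency of 1.]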
Because the loads of ai and bi do not depend on one another, both LDW instructions can execute in parallel as long as they do not share the same resources. To schedule the load instructions in parallel, allocate the functional units as follows:
ai and the pointer to ai to a functional unit on the A side, .D1
bi and the pointer to bi to a functional unit on the B side, .D2
Because the MPYSP instruction now has one source operand from A and one from B, MPYSP uses the 1X cross path.
Rearranging the order of the instructions also improves the performance of the code. The SUB instruction replaces one of the NOP delay slots for the LDW instructions. Moving the B instruction after the SUB removes the need for the NOP 5 used at the end of the code in Example 6-7 on page 6-12. The branch now occurs immediately after the ADDSP instruction so that the MPYSP and ADDSP execute in parallel with the five delay slots required by the branch instruction. Since the ADDSP finishes execution before the result is needed, the NOP 3 for delay slots is removed, further reducing cycle count.
6.2.6 Comparing Performance
Executing the fixed-point dot product code in Example 6-6 requires eight cycles for each iteration plus one cycle to set up the loop counter and initialize the accumulator; 100 iterations require 801 cycles. Table 6-1 compares the performance of the nonparallel code with the parallel code for the fixed-point example.
Table 6-1. Comparison of Nonparallel and Parallel Assembly Code for Fixed-Point Dot Product

Code Example                                                    100 Iterations
Example 6-5  Fixed-point dot product nonparallel assembly       2 + 100 × 16
Example 6-6  Fixed-point dot product parallel assembly          1 + 100 × 8
Executing the floating-point dot product code in Example 6-8 requires ten cycles for each iteration plus one cycle to set up the loop counter and initialize the accumulator; 100 iterations require 1001 cycles. Table 6-2 compares the performance of the nonparallel code with the parallel code for the floating-point example.
Table 6-2. Comparison of Nonparallel and Parallel Assembly Code for Floating-Point Dot Product

Code Example                                                    100 Iterations
Example 6-7  Floating-point dot product nonparallel assembly    2 + 100 × 21
Example 6-8  Floating-point dot product parallel assembly       1 + 100 × 10
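The cycle counts in this comparison all have the form setup cycles plus iterations times cycles per iteration. A small helper, with the hypothetical name loop_cycles, makes the arithmetic easy to repeat for other trip counts.

```c
/* Total cycles for a loop: one-time setup plus the per-iteration
 * cost multiplied by the number of iterations. */
unsigned int loop_cycles(unsigned int setup,
                         unsigned int iterations,
                         unsigned int cycles_per_iteration)
{
    return setup + iterations * cycles_per_iteration;
}
```

Using the figures from the text, the parallel fixed-point loop costs loop_cycles(1, 100, 8) = 801 cycles and the parallel floating-point loop costs loop_cycles(1, 100, 10) = 1001 cycles; the nonparallel versions evaluate to 1602 and 2102 cycles.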
6.3 Using Word Access for Short Data and Doubleword Access for Floating-Point Data
The parallel code for the fixed-point example in section 6.2 uses an LDH instruction to read a[i]. Because a[i] and a[i + 1] are next to each other in memory, you can optimize the code further by using the load word (LDW) instruction to read a[i] and a[i + 1] at the same time and load both into a single 32-bit register. (The data must be word-aligned in memory.)
In the floating-point example, the parallel code uses an LDW instruction to read a[i]. Because a[i] and a[i + 1] are next to each other in memory, you can optimize the code further by using the load doubleword (LDDW) instruction to read a[i] and a[i + 1] at the same time and load both into a register pair. (The data must be doubleword-aligned in memory.) See the TMS320C62x/C67x CPU and Instruction Set Users Guide for more specific information on the LDDW instruction.
Note: The load doubleword (LDDW) instruction is only available on the C67x (floating-point) device.
6.3.1
6.3.2
6.3.2.1
Example 6-11. Linear Assembly for Fixed-Point Dot Product Inner Loop with LDW

        LDW     *a++,ai_i1          ; load ai & a1 from memory
        LDW     *b++,bi_i1          ; load bi & b1 from memory
        MPY     ai_i1,bi_i1,pi      ; ai * bi
        MPYH    ai_i1,bi_i1,pi1     ; ai+1 * bi+1
        ADD     pi,sum0,sum0        ; sum0 += (ai * bi)
        ADD     pi1,sum1,sum1       ; sum1 += (ai+1 * bi+1)
[cntr]  SUB     cntr,1,cntr         ; decrement loop counter
[cntr]  B       LOOP                ; branch to loop
The two load word (LDW) instructions load a[i], a[i+1], b[i], and b[i+1] on each iteration.
Two MPY instructions are now necessary to multiply the second set of array elements:
The first MPY instruction multiplies the 16 least significant bits (LSBs) in each source register: a[i] * b[i].
The MPYH instruction multiplies the 16 most significant bits (MSBs) of each source register: a[i+1] * b[i+1].
The two ADD instructions accumulate the sums of the even and odd elements: sum0 and sum1.
Note: This is true only when the C6x is in little-endian mode. In big-endian mode, MPY operates on a[i+1] and b[i+1] and MPYH operates on a[i] and b[i]. See the TMS320C62x/C67x Peripherals Reference Guide for more information.
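The pairing of MPY with the low halfwords and MPYH with the high halfwords can be modeled in portable C. In this sketch every name is hypothetical; it assumes the little-endian arrangement described in the note, so the element with the lower index sits in the low 16 bits of each word, and it treats both halves as signed values.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of v without relying on
 * implementation-defined narrowing conversions. */
static int32_t sext16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Pack two shorts into one word: lo in bits 15..0, hi in bits 31..16. */
uint32_t pack16_ref(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Model of MPY: product of the low halfwords. */
int32_t mpy_ref(uint32_t x, uint32_t y)
{
    return sext16(x) * sext16(y);
}

/* Model of MPYH: product of the high halfwords. */
int32_t mpyh_ref(uint32_t x, uint32_t y)
{
    return sext16(x >> 16) * sext16(y >> 16);
}

/* Dot product over packed words with separate even and odd sums. */
int32_t dotp_packed_ref(const uint32_t *a, const uint32_t *b, int nwords)
{
    int32_t sum0 = 0;   /* even elements, from the low halves */
    int32_t sum1 = 0;   /* odd elements, from the high halves */
    int i;
    for (i = 0; i < nwords; i++) {
        sum0 += mpy_ref(a[i], b[i]);
        sum1 += mpyh_ref(a[i], b[i]);
    }
    return sum0 + sum1;
}
```

Keeping sum0 and sum1 separate until the end removes the dependency between the two ADD instructions inside the loop.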
6.3.2.2
Example 6-12. Linear Assembly for Floating-Point Dot Product Inner Loop with LDDW

        LDDW    *a++,ai1:ai0        ; load a[i+0] & a[i+1] from memory
        LDDW    *b++,bi1:bi0        ; load b[i+0] & b[i+1] from memory
        MPYSP   ai0,bi0,pi0         ; a[i+0] * b[i+0]
        MPYSP   ai1,bi1,pi1         ; a[i+1] * b[i+1]
        ADDSP   pi0,sum0,sum0       ; sum0 += (a[i+0] * b[i+0])
        ADDSP   pi1,sum1,sum1       ; sum1 += (a[i+1] * b[i+1])
[cntr]  SUB     cntr,1,cntr         ; decrement loop counter
[cntr]  B       LOOP                ; branch to loop
The two load doubleword (LDDW) instructions load a[i], a[i+1], b[i], and b[i+1] on each iteration. Two MPYSP instructions are now necessary to multiply the second set of array elements. The two ADDSP instructions accumulate the sums of the even and odd elements: sum0 and sum1.
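In C terms, the loop above corresponds to a dot product with two independent accumulators. The sketch below is a hypothetical host-side equivalent; dotp_sp_two_acc is not a name from the guide, and the code assumes that n is even.

```c
/* Floating-point dot product with separate even and odd sums.
 * Assumes n is even, matching two elements per doubleword load. */
float dotp_sp_two_acc(const float *a, const float *b, int n)
{
    float sum0 = 0.0f;   /* even elements */
    float sum1 = 0.0f;   /* odd elements  */
    int i;

    for (i = 0; i + 1 < n; i += 2) {
        sum0 += a[i] * b[i];           /* MPYSP + ADDSP */
        sum1 += a[i + 1] * b[i + 1];   /* MPYSP + ADDSP */
    }
    return sum0 + sum1;
}
```

Because floating-point addition is not associative, splitting the sum this way can change the rounding of the final result slightly compared with a single running sum.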
Optimizing Assembly Code via Linear Assembly
6.3.3
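[Figure: dependency graph of the fixed-point dot product with LDW. Recoverable labels: MPY (pi) and MPYH (pi+1) nodes, each reached by an edge labeled 5; two ADD nodes (sum0, sum1), each reached by an edge labeled 2 and each with a self-edge labeled 1.]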
Similarly, the dependency graph in Figure 6-6 for the floating-point dot product shows that the LDDW instructions are parents of the MPYSP instructions and the MPYSP instructions are parents of the ADDSP instructions. To split the graph between the A and B register files, place an equal number of LDDWs, MPYSPs, and ADDSPs on each side. To keep both sides even, place the remaining two instructions, B and SUB, on opposite sides.
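[Figure 6-6: dependency graph of the floating-point dot product with LDDW. Recoverable labels: two MPYSP nodes (pi, pi+1), each reached by an edge labeled 5; two ADDSP nodes (sum0, sum1), each reached by an edge labeled 4 and each with a self-edge labeled 4.]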
6.3.4
Figure 6-7. Dependency Graph of Fixed-Point Dot Product With LDW (Showing Functional Units)
[Figure: A side: LDW on .D1 (ai & ai+1), MPY on .M1X (pi). B side: LDW on .D2 (bi & bi+1), MPYH on .M2X (pi+1), B LOOP on .S2. The load-to-multiply edges are labeled 5.]
Example 6-13. Linear Assembly for Fixed-Point Dot Product Inner Loop With LDW (With Allocated Resources)

        LDW     .D1     *A4++,A2        ; load ai and ai+1 from memory
        LDW     .D2     *B4++,B2        ; load bi and bi+1 from memory
        MPY     .M1X    A2,B2,A6        ; ai * bi
        MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
        ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
        SUB     .S1     A1,1,A1         ; decrement loop counter
  [A1]  B       .S2     LOOP            ; branch to loop
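Figure 6-8. Dependency Graph of Floating-Point Dot Product With LDDW (Showing Functional Units)
[Figure: A side: LDDW on .D1 (ai & ai+1), MPYSP on .M1X (pi), ADDSP on .L1 (sum0). B side: LDDW on .D2 (bi & bi+1), MPYSP on .M2X (pi+1), ADDSP on .L2 (sum1), B LOOP on .S2. The load-to-multiply edges are labeled 5; the edges into and around the ADDSPs are labeled 4.]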
Example 6-14. Linear Assembly for Floating-Point Dot Product Inner Loop With LDDW (With Allocated Resources)

        LDDW    .D1     *A4++,A3:A2     ; load ai and ai+1 from memory
        LDDW    .D2     *B4++,B3:B2     ; load bi and bi+1 from memory
        MPYSP   .M1X    A2,B2,A6        ; ai * bi
        MPYSP   .M2X    A3,B3,B6        ; ai+1 * bi+1
        ADDSP   .L1     A6,A7,A7        ; sum0 += (ai * bi)
        ADDSP   .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
        SUB     .S1     A1,1,A1         ; decrement loop counter
  [A1]  B       .S2     LOOP            ; branch to loop
6.3.5
Final Assembly
Example 6-15 shows the final assembly code for the unrolled loop of the fixed-point dot product, and Example 6-16 shows the final assembly code for the unrolled loop of the floating-point dot product.
6.3.5.1
Example 6-15. Assembly Code for Fixed-Point Dot Product With LDW (Before Software Pipelining)

        MVK     .S1     50,A1           ; set up loop counter
||      ZERO    .L1     A7              ; zero out sum0 accumulator
||      ZERO    .L2     B7              ; zero out sum1 accumulator

LOOP:
        LDW     .D1     *A4++,A2        ; load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ; load bi & bi+1 from memory

        SUB     .S1     A1,1,A1         ; decrement loop counter

  [A1]  B       .S1     LOOP            ; branch to loop

        NOP             2

        MPY     .M1X    A2,B2,A6        ; ai * bi
||      MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1

        NOP

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
                                        ; Branch occurs here

        ADD     .L1X    A7,B7,A4
The setup code for the loop is included to initialize the array pointers and the loop counter and to clear the accumulators. The setup code assumes that A4 and B4 have been initialized to point to arrays a and b, respectively. The MVK instruction initializes the loop counter. The two ZERO instructions, which execute in parallel, initialize the even and odd accumulators (sum0 and sum1) to 0. The third ADD instruction adds the even and odd accumulators.
6.3.5.2
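Example 6-16. Assembly Code for Floating-Point Dot Product With LDDW (Before Software Pipelining)

        MVK     .S1     50,A1           ; set up loop counter
||      ZERO    .L1     A7              ; zero out sum0 accumulator
||      ZERO    .L2     B7              ; zero out sum1 accumulator

LOOP:
        LDDW    .D1     *A4++,A3:A2     ; load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B3:B2     ; load bi & bi+1 from memory

        NOP             2

        SUB     .S1     A1,1,A1         ; decrement loop counter

  [A1]  B       .S2     LOOP            ; branch to loop

        MPYSP   .M1X    A2,B2,A6        ; ai * bi
||      MPYSP   .M2X    A3,B3,B6        ; ai+1 * bi+1

        NOP             3

        ADDSP   .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADDSP   .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
                                        ; Branch occurs here

        NOP             3
        ADDSP   .L1X    A7,B7,A4
        NOP             3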
The setup code for the loop is included to initialize the array pointers and the loop counter and to clear the accumulators. The setup code assumes that A4 and B4 have been initialized to point to arrays a and b, respectively. The MVK instruction initializes the loop counter. The two ZERO instructions, which execute in parallel, initialize the even and odd accumulators (sum0 and sum1) to 0. The third ADDSP instruction adds the even and odd accumulators.
6.3.6
Comparing Performance
Executing the fixed-point dot product with the optimizations in Example 6-15 requires only 50 iterations, because you operate in parallel on both the even and odd array elements. With the setup code and the final ADD instruction, the 100-element dot product (50 passes through this loop) requires a total of 402 cycles (1 + 8 * 50 + 1).
Table 6-3 compares the performance of the different versions of the fixed-point dot product code discussed so far.

Table 6-3. Comparison of Fixed-Point Dot Product Code With Use of LDW

Code Example                                                   100 Iterations
Example 6-5   Fixed-point dot product nonparallel assembly
Example 6-6   Fixed-point dot product parallel assembly
Executing the floating-point dot product with the optimizations in Example 6-16 requires only 50 iterations, because you operate in parallel on both the even and odd array elements. With the setup code and the final ADDSP instruction, the 100-element dot product (50 passes through this loop) requires a total of 508 cycles (1 + 10 * 50 + 7).
Table 6-4 compares the performance of the different versions of the floating-point dot product code discussed so far.

Table 6-4. Comparison of Floating-Point Dot Product Code With Use of LDDW

Code Example                                                      100 Iterations
Example 6-7   Floating-point dot product nonparallel assembly
Example 6-8   Floating-point dot product parallel assembly
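The cycle totals quoted in this section follow one pattern: setup cycles, plus cycles per pass times the number of passes, plus the cycles after the loop. A minimal sketch of that arithmetic, with the hypothetical helper name loop_cycles:

```c
#include <assert.h>

/* Total cycles for a loop that is not software pipelined:
 * setup + cycles_per_pass * passes + cycles after the loop. */
int loop_cycles(int setup, int cycles_per_pass, int passes, int tail)
{
    assert(setup >= 0 && cycles_per_pass > 0 && passes >= 0 && tail >= 0);
    return setup + cycles_per_pass * passes + tail;
}
```

With the figures from the text, loop_cycles(1, 8, 50, 1) gives the 402 cycles of the fixed-point version and loop_cycles(1, 10, 50, 7) gives the 508 cycles of the floating-point version.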
Software Pipelining
Figure 6-9. Dependency Graph of Fixed-Point Dot Product With LDW (Showing Functional Units)
[Figure: A side: LDW on .D1 (ai & ai+1), MPY on .M1X (pi). B side: LDW on .D2 (bi & bi+1), MPYH on .M2X (pi+1), B LOOP on .S2. The load-to-multiply edges are labeled 5.]
Example 6-17. Linear Assembly for Fixed-Point Dot Product Inner Loop (With Conditional SUB Instruction)

        LDW     .D1     *A4++,A2        ; load ai and ai+1 from memory
        LDW     .D2     *B4++,B2        ; load bi and bi+1 from memory
        MPY     .M1X    A2,B2,A6        ; ai * bi
        MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
        ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
  [A1]  SUB     .S1     A1,1,A1         ; decrement loop counter
  [A1]  B       .S2     LOOP            ; branch to top of loop
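Figure 6-10. Dependency Graph of Floating-Point Dot Product With LDDW (Showing Functional Units)
[Figure: A side: LDDW on .D1 (ai & ai+1), MPYSP on .M1X (pi), ADDSP on .L1 (sum0). B side: LDDW on .D2 (bi & bi+1), MPYSP on .M2X (pi+1), ADDSP on .L2 (sum1), B LOOP on .S2. The load-to-multiply edges are labeled 5; the edges into and around the ADDSPs are labeled 4.]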
Example 6-18. Linear Assembly for Floating-Point Dot Product Inner Loop (With Conditional SUB Instruction)

        LDDW    .D1     *A4++,A3:A2     ; load ai and ai+1 from memory
        LDDW    .D2     *B4++,B3:B2     ; load bi and bi+1 from memory
        MPYSP   .M1X    A2,B2,A6        ; ai * bi
        MPYSP   .M2X    A3,B3,B6        ; ai+1 * bi+1
        ADDSP   .L1     A6,A7,A7        ; sum0 += (ai * bi)
        ADDSP   .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
  [A1]  SUB     .S1     A1,1,A1         ; decrement loop counter
  [A1]  B       .S2     LOOP            ; branch to top of loop
6.4.1
6.4.1.1
Fixed-Point Example
The fixed-point code in Example 6-15 needs eight cycles for each iteration of the loop, so the iteration interval is eight. Table 6-5 shows a modulo iteration interval scheduling table for the fixed-point dot product loop before software pipelining (Example 6-15). Each row represents a functional unit. There is a column for each cycle in the loop showing the instruction that is executing on a particular cycle:

    LDWs on the .D units are issued on cycles 0, 8, 16, 24, etc.
    MPY and MPYH on the .M units are issued on cycles 5, 13, 21, 29, etc.
    ADDs on the .L units are issued on cycles 7, 15, 23, 31, etc.
    SUB on the .S1 unit is issued on cycles 1, 9, 17, 25, etc.
    B on the .S2 unit is issued on cycles 2, 10, 18, 26, etc.
Table 6-5. Modulo Iteration Interval Scheduling Table for Fixed-Point Dot Product (Before Software Pipelining)

Unit / Cycle  0, 8, ...   1, 9, ...   2, 10, ...  3, 11, ...  4, 12, ...  5, 13, ...  6, 14, ...  7, 15, ...
.D1           LDW
.D2           LDW
.M1                                                                       MPY
.M2                                                                       MPYH
.L1                                                                                               ADD
.L2                                                                                               ADD
.S1                       SUB
.S2                                   B
In this example, each unit is used only once every eight cycles.
6.4.1.2
Floating-Point Example
The floating-point code in Example 6-16 needs ten cycles for each iteration of the loop, so the iteration interval is ten. Table 6-6 shows a modulo iteration interval scheduling table for the floating-point dot product loop before software pipelining (Example 6-16). Each row represents a functional unit. There is a column for each cycle in the loop showing the instruction that is executing on a particular cycle:

    LDDWs on the .D units are issued on cycles 0, 10, 20, 30, etc.
    MPYSPs on the .M units are issued on cycles 5, 15, 25, 35, etc.
    ADDSPs on the .L units are issued on cycles 9, 19, 29, 39, etc.
    SUB on the .S1 unit is issued on cycles 3, 13, 23, 33, etc.
    B on the .S2 unit is issued on cycles 4, 14, 24, 34, etc.
Table 6-6. Modulo Iteration Interval Scheduling Table for Floating-Point Dot Product (Before Software Pipelining)

Unit / Cycle  0, 10, ...  1, 11, ...  2, 12, ...  3, 13, ...  4, 14, ...  5, 15, ...  6, 16, ...  7, 17, ...  8, 18, ...  9, 19, ...
.D1           LDDW
.D2           LDDW
.M1                                                                       MPYSP
.M2                                                                       MPYSP
.L1                                                                                                                       ADDSP
.L2                                                                                                                       ADDSP
.S1                                               SUB
.S2                                                           B
In this example, each unit is used only once every ten cycles.
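The issue cycles listed for both loops follow directly from the iteration interval: an instruction first issued on cycle c is issued again on c + II, c + 2 * II, and so on, and always occupies column c of the modulo table. A small sketch, with hypothetical helper names:

```c
#include <assert.h>

/* Cycle on which an instruction issues in pass k (k = 0, 1, 2, ...),
 * given its first issue cycle and the iteration interval. */
int issue_cycle(int first_cycle, int iteration_interval, int k)
{
    assert(first_cycle >= 0 && iteration_interval > 0 && k >= 0);
    return first_cycle + k * iteration_interval;
}

/* Column of the modulo scheduling table that a given cycle falls in. */
int modulo_column(int cycle, int iteration_interval)
{
    assert(cycle >= 0 && iteration_interval > 0);
    return cycle % iteration_interval;
}
```

For example, the fixed-point branch first issues on cycle 2 with an iteration interval of 8, so its fourth issue is on cycle issue_cycle(2, 8, 3) = 26, which maps back to column 2.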
6.4.1.3
6.4.1.4
Fixed-Point Example
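Table 6-7 shows a fully pipelined schedule for the fixed-point dot product example.

Table 6-7. Modulo Iteration Interval Table for Fixed-Point Dot Product (After Software Pipelining)

              Loop Prolog                                                   Loop
Unit / Cycle  0    1      2       3        4         5          6           7, 8, 9...
.D1           LDW  * LDW  ** LDW  *** LDW  **** LDW  ***** LDW  ****** LDW  ******* LDW
.D2           LDW  * LDW  ** LDW  *** LDW  **** LDW  ***** LDW  ****** LDW  ******* LDW
.M1                                                  MPY        * MPY       ** MPY
.M2                                                  MPYH       * MPYH      ** MPYH
.L1                                                                         ADD
.L2                                                                         ADD
.S1                SUB    * SUB   ** SUB   *** SUB   **** SUB   ***** SUB   ****** SUB
.S2                       B       * B      ** B      *** B      **** B      ***** B

Note: The asterisks indicate the iteration of the loop; the rightmost column (shaded in the original) is the single-cycle loop.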
The rightmost column in Table 6-7 is a single-cycle loop that contains the entire loop. Cycles 0-6 are loop setup code, or loop prolog.

Asterisks define which iteration of the loop the instruction is executing each cycle. For example, the rightmost column shows that on any given cycle inside the loop:

    The ADD instructions are adding data for iteration n.
    The MPY instructions are multiplying data for iteration n + 2 (**).
    The LDW instructions are loading data for iteration n + 7 (*******).
    The SUB instruction is executing for iteration n + 6 (******).
    The B instruction is executing for iteration n + 5 (*****).
In this case, multiple iterations of the loop execute in parallel in a software pipeline that is eight iterations deep, with iterations n through n + 7 executing in parallel. Fixed-point software pipelines are rarely deeper than the one created by this single-cycle loop. As loop sizes grow, the number of iterations that can execute in parallel tends to become fewer.
Floating-Point Example
Table 6-8 shows a fully pipelined schedule for the floating-point dot product example.

Table 6-8. Modulo Iteration Interval Table for Floating-Point Dot Product (After Software Pipelining)

              Loop Prolog                                                                                       Loop
Unit / Cycle  0     1       2        3         4          5           6            7             8              9, 10, 11...
.D1           LDDW  * LDDW  ** LDDW  *** LDDW  **** LDDW  ***** LDDW  ****** LDDW  ******* LDDW  ******** LDDW  ********* LDDW
.D2           LDDW  * LDDW  ** LDDW  *** LDDW  **** LDDW  ***** LDDW  ****** LDDW  ******* LDDW  ******** LDDW  ********* LDDW
.M1                                                       MPYSP       * MPYSP      ** MPYSP      *** MPYSP      **** MPYSP
.M2                                                       MPYSP       * MPYSP      ** MPYSP      *** MPYSP      **** MPYSP
.L1                                                                                                             ADDSP
.L2                                                                                                             ADDSP
.S1                                  SUB       * SUB      ** SUB      *** SUB      **** SUB      ***** SUB      ****** SUB
.S2                                            B          * B         ** B         *** B         **** B         ***** B

Note: The asterisks indicate the iteration of the loop; the rightmost column (shaded in the original) is the single-cycle loop.
The rightmost column in Table 6-8 is a single-cycle loop that contains the entire loop. Cycles 0-8 are loop setup code, or loop prolog.

Asterisks define which iteration of the loop the instruction is executing each cycle. For example, the rightmost column shows that on any given cycle inside the loop:

    The ADDSP instructions are adding data for iteration n.
    The MPYSP instructions are multiplying data for iteration n + 4 (****).
    The LDDW instructions are loading data for iteration n + 9 (*********).
    The SUB instruction is executing for iteration n + 6 (******).
    The B instruction is executing for iteration n + 5 (*****).

Note: Since the ADDSP instruction has three delay slots associated with it, the results of adding are staggered by four. That is, the first result from the ADDSP is added to the fifth result, which is then added to the ninth, and so on. The second result is added to the sixth, which is then added to the 10th. This is shown in Table 6-9.

In this case, multiple iterations of the loop execute in parallel in a software pipeline that is ten iterations deep, with iterations n through n + 9 executing in parallel. Floating-point software pipelines are rarely deeper than the one created by this single-cycle loop. As loop sizes grow, the number of iterations that can execute in parallel tends to decrease.
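The iteration offsets quoted for both single-cycle loops can be derived from the issue cycles of the loops before pipelining. The sketch below assumes a single-cycle kernel whose first instruction issues on cycle 0; the helper names are hypothetical.

```c
#include <assert.h>

/* In a single-cycle kernel, an instruction that issues on cycle c of the
 * original schedule runs (last - c) iterations ahead of the instruction
 * that issues last (the ADD or ADDSP here). */
int iteration_lead(int issue_cycle, int last_issue_cycle)
{
    assert(issue_cycle >= 0 && issue_cycle <= last_issue_cycle);
    return last_issue_cycle - issue_cycle;
}

/* Number of iterations in flight at once in that kernel. */
int pipeline_depth(int last_issue_cycle)
{
    assert(last_issue_cycle >= 0);
    return last_issue_cycle + 1;
}
```

For the fixed-point loop (ADD on cycle 7), the LDW on cycle 0 leads by 7 iterations and the MPY on cycle 5 leads by 2; for the floating-point loop (ADDSP on cycle 9), the LDDW leads by 9 and the MPYSP by 4, giving depths of eight and ten.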
6.4.1.5
Table 6-9 shows the results of the loop kernel for a single-cycle accumulator using a multicycle add instruction; in this case, the ADDSP, which has three delay slots (a 4-cycle instruction).

Table 6-9. Software Pipeline Accumulation Staggered Results Due to Three-Cycle Delay

Cycle #  Pseudoinstruction        Current value of pseudoregister sum    Written expected result
0        ADDSP x(0), sum, sum     0                                      ; cycle 4 sum = x(0)
1        ADDSP x(1), sum, sum     0                                      ; cycle 5 sum = x(1)
2        ADDSP x(2), sum, sum     0                                      ; cycle 6 sum = x(2)
3        ADDSP x(3), sum, sum     0                                      ; cycle 7 sum = x(3)
4        ADDSP x(4), sum, sum     x(0)                                   ; cycle 8 sum = x(0) + x(4)
5        ADDSP x(5), sum, sum     x(1)                                   ; cycle 9 sum = x(1) + x(5)
6        ADDSP x(6), sum, sum     x(2)                                   ; cycle 10 sum = x(2) + x(6)
7        ADDSP x(7), sum, sum     x(3)                                   ; cycle 11 sum = x(3) + x(7)
8        ADDSP x(8), sum, sum     x(0) + x(4)                            ; cycle 12 sum = x(0) + x(4) + x(8)
...
i + j    ADDSP x(i+j), sum, sum   x(j) + x(j+4) + x(j+8) ... x(i-4+j)    ; cycle i + j + 4 sum = x(j) + x(j+4) + x(j+8) ... x(i-4+j) + x(i+j)

where i is a multiple of 4
The first value of the array x, x(0) is added to the accumulator (sum) on cycle 0, but the result is not ready until cycle 4. This means that on cycle 1 when x(1) is added to the accumulator (sum), sum has no value in it from x(0). Thus, when this result is ready on cycle 5, sum will have the value x(1) in it, instead of the value x(0) + x(1). When you reach cycle 4, sum will have the value x(0) in it and the value x(4) will be added to that, causing sum = x(0) + x(4) on cycle 8. This is continuously repeated, resulting in four separate accumulations (using the register sum).
The current value in the accumulator sum depends on which iteration is being done. After the completion of the loop, the last four sums should be written into separate registers and then added together to give the final result. This is shown in Example 6-23.
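The staggered accumulation can be modeled in C. The sketch below is an assumption-laden illustration, not TI code: staggered_sum is a hypothetical name, and the four-entry array stands in for the four results that are in flight because each ADDSP result becomes visible four cycles after issue.

```c
#include <assert.h>

#define ADDSP_CYCLES 4   /* three delay slots: result visible 4 cycles later */

/* Issue one "ADDSP x(t), sum, sum" per cycle for t = 0..n-1.
 * The value of sum read on cycle t is the result of the ADDSP issued on
 * cycle t - 4 (or the initial 0 for t < 4), so four independent partial
 * sums form. They are returned in partial[] and combined pairwise. */
float staggered_sum(const float *x, int n, float partial[ADDSP_CYCLES])
{
    assert(n >= 0);
    for (int j = 0; j < ADDSP_CYCLES; j++)
        partial[j] = 0.0f;                   /* accumulator starts at 0 */

    for (int t = 0; t < n; t++) {
        float sum_seen = partial[t % ADDSP_CYCLES];   /* written by cycle t-4 */
        partial[t % ADDSP_CYCLES] = x[t] + sum_seen;  /* lands on cycle t+4  */
    }

    /* After the loop, the four partial sums are added together. */
    return (partial[0] + partial[1]) + (partial[2] + partial[3]);
}
```

With x(0)..x(8) = 1..9, the partial sums are x(0)+x(4)+x(8) = 15, x(1)+x(5) = 8, x(2)+x(6) = 10, and x(3)+x(7) = 12, matching the pattern in Table 6-9.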
6.4.2
        .trip 50
        LDW     *a++,ai_i1          ; load ai & ai+1 from memory
        LDW     *b++,bi_i1          ; load bi & bi+1 from memory
        MPY     ai_i1,bi_i1,pi      ; ai * bi
        MPYH    ai_i1,bi_i1,pi1     ; ai+1 * bi+1
        ADD     pi,sum0,sum0        ; sum0 += (ai * bi)
        ADD     pi1,sum1,sum1       ; sum1 += (ai+1 * bi+1)
[cntr]  SUB     cntr,1,cntr         ; decrement loop counter
[cntr]  B       LOOP                ; branch to loop

        ADD     sum0,sum1,sum
Resources such as functional units and 1X and 2X cross paths do not have to be specified because these can be allocated automatically by the assembly optimizer.
        .trip 50
        LDDW    *a++,ai1:ai0        ; load ai & ai+1 from memory
        LDDW    *b++,bi1:bi0        ; load bi & bi+1 from memory
        MPYSP   ai0,bi0,pi          ; ai * bi
        MPYSP   ai1,bi1,pi1         ; ai+1 * bi+1
        ADDSP   pi,sum0,sum0        ; sum0 += (ai * bi)
        ADDSP   pi1,sum1,sum1       ; sum1 += (ai+1 * bi+1)
[cntr]  SUB     cntr,1,cntr         ; decrement loop counter
[cntr]  B       LOOP                ; branch to loop

        ADDSP   sum0,sum1,sum
6.4.3
Final Assembly
Example 6-22 shows the assembly code for the fixed-point software-pipelined dot product in Table 6-7. Example 6-23 shows the assembly code for the floating-point software-pipelined dot product in Table 6-8. The accumulators are initialized to 0 and the loop counter is set up in the first execute packet in parallel with the first load instructions. The asterisks in the comments correspond with those in Table 6-7 and Table 6-8, respectively.

Note: All instructions executing in parallel constitute an execute packet. An execute packet can contain up to eight instructions. See the TMS320C62x/C67x CPU and Instruction Set Reference Guide for more information about pipeline operation.
6.4.3.1
Fixed-Point Example
Multiple branch instructions are in the pipe. The first branch in the fixed-point dot product is issued on cycle 2 but does not actually branch until the end of cycle 7 (after five delay slots). The branch target is the execute packet defined by the label LOOP. On cycle 7, the first branch returns to the same execute packet, resulting in a single-cycle loop. On every cycle after cycle 7, a branch executes back to LOOP until the loop counter finally decrements to 0. Once the loop counter is 0, five more branches execute because they are already in the pipe.

Executing the dot product code with the software pipelining as shown in Example 6-22 requires a total of 58 cycles (7 + 50 + 1), which is a significant improvement over the 402 cycles required by the code in Example 6-15.

Note: The code created by the assembly optimizer will not completely match the final assembly code shown in this and future sections because different versions of the tool will produce slightly different code. However, the inner loop performance (number of cycles per iteration) should be similar.
Example 6-22. Assembly Code for Fixed-Point Dot Product (Software Pipelined)

        LDW     .D1     *A4++,A2        ; load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ; load bi & bi+1 from memory
||      MVK     .S1     50,A1           ; set up loop counter
||      ZERO    .L1     A7              ; zero out sum0 accumulator
||      ZERO    .L2     B7              ; zero out sum1 accumulator

  [A1]  SUB     .S1     A1,1,A1         ; decrement loop counter
||      LDW     .D1     *A4++,A2        ;* load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;* load bi & bi+1 from memory

  [A1]  SUB     .S1     A1,1,A1         ;* decrement loop counter
||[A1]  B       .S2     LOOP            ; branch to loop
||      LDW     .D1     *A4++,A2        ;** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;** load bi & bi+1 from memory

  [A1]  SUB     .S1     A1,1,A1         ;** decrement loop counter
||[A1]  B       .S2     LOOP            ;* branch to loop
||      LDW     .D1     *A4++,A2        ;*** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;*** load bi & bi+1 from memory

  [A1]  SUB     .S1     A1,1,A1         ;*** decrement loop counter
||[A1]  B       .S2     LOOP            ;** branch to loop
||      LDW     .D1     *A4++,A2        ;**** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;**** load bi & bi+1 from memory

        MPY     .M1X    A2,B2,A6        ; ai * bi
||      MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1
||[A1]  SUB     .S1     A1,1,A1         ;**** decrement loop counter
||[A1]  B       .S2     LOOP            ;*** branch to loop
||      LDW     .D1     *A4++,A2        ;***** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;***** load bi & bi+1 from memory

        MPY     .M1X    A2,B2,A6        ;* ai * bi
||      MPYH    .M2X    A2,B2,B6        ;* ai+1 * bi+1
||[A1]  SUB     .S1     A1,1,A1         ;***** decrement loop counter
||[A1]  B       .S2     LOOP            ;**** branch to loop
||      LDW     .D1     *A4++,A2        ;****** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;****** load bi & bi+1 from memory

LOOP:
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1
||[A1]  SUB     .S1     A1,1,A1         ;****** decrement loop counter
||[A1]  B       .S2     LOOP            ;***** branch to loop
||      LDW     .D1     *A4++,A2        ;******* load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;******* load bi & bi+1 from memory
                                        ; Branch occurs here

        ADD     .L1X    A7,B7,A4
6.4.3.2
Floating-Point Example
The first branch in the floating-point dot product is issued on cycle 4 but does not actually branch until the end of cycle 9 (after five delay slots). The branch target is the execute packet defined by the label LOOP. On cycle 9, the first branch returns to the same execute packet, resulting in a single-cycle loop. On every cycle after cycle 9, a branch executes back to LOOP until the loop counter finally decrements to 0. Once the loop counter is 0, five more branches execute because they are already in the pipe.

Executing the floating-point dot product code with the software pipelining as shown in Example 6-23 requires a total of 74 cycles (9 + 50 + 15), which is a significant improvement over the 508 cycles required by the code in Example 6-16.
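The interaction between the conditional SUB, the conditional B, and the five branch delay slots can be checked with a small cycle-level model. The sketch below is a simplified illustration under stated assumptions: the function name and parameters are hypothetical, only the loop counter and branches are modeled, and one execute packet is assumed per cycle.

```c
#include <assert.h>

#define BRANCH_DELAY_SLOTS 5
#define MAX_CYCLES 1024

/* trip         : value loaded into the loop counter on cycle 0
 * first_sub    : first cycle that contains the [A1] SUB
 * first_b      : first cycle that contains the [A1] B
 * first_kernel : cycle on which the LOOP packet first executes
 * tail_cycles  : cycles spent after the loop exits
 * Returns the total cycle count; *kernel_runs receives the number of
 * times the LOOP packet executed. */
int simulate_loop(int trip, int first_sub, int first_b,
                  int first_kernel, int tail_cycles, int *kernel_runs)
{
    int taken[MAX_CYCLES] = {0};   /* taken[c] = 1 if cycle c issued a taken B */
    int a1 = trip;
    int runs = 0;
    int c;

    assert(first_b == first_kernel - BRANCH_DELAY_SLOTS);
    assert(first_sub >= 1 && first_sub <= first_b);
    assert(trip >= 0 && trip + first_kernel + 8 < MAX_CYCLES);

    for (c = 1; ; c++) {
        /* After the first pass, the kernel repeats only if the branch
         * issued six cycles earlier was taken; otherwise fall through. */
        if (c > first_kernel && !taken[c - BRANCH_DELAY_SLOTS - 1])
            break;
        if (c >= first_kernel)
            runs++;
        if (c >= first_b && a1 != 0)
            taken[c] = 1;          /* [A1] B   reads A1 before the SUB writes */
        if (c >= first_sub && a1 != 0)
            a1--;                  /* [A1] SUB stops once A1 reaches 0 */
    }
    *kernel_runs = runs;
    return c + tail_cycles;        /* cycles 0..c-1 plus the code after the loop */
}
```

With the fixed-point parameters (SUB from cycle 1, B from cycle 2, kernel from cycle 7, one trailing ADD) the model reports 50 kernel passes and 58 cycles; with the floating-point parameters (3, 4, 9, and a 15-cycle tail) it reports 50 passes and 74 cycles.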
Example 6-23. Assembly Code for Floating-Point Dot Product (Software Pipelined)

        MVK     .S1     50,A1           ; set up loop counter
||      ZERO    .L1     A8              ; sum0 = 0
||      ZERO    .L2     B8              ; sum1 = 0
||      LDDW    .D1     *A4++,A7:A6     ; load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ; load bi & bi+1 from memory

        LDDW    .D1     *A4++,A7:A6     ;* load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;* load bi & bi+1 from memory

        LDDW    .D1     *A4++,A7:A6     ;** load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;** load bi & bi+1 from memory

        LDDW    .D1     *A4++,A7:A6     ;*** load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;*** load bi & bi+1 from memory
||[A1]  SUB     .S1     A1,1,A1         ; decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;**** load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;**** load bi & bi+1 from memory
||[A1]  B       .S2     LOOP            ; branch to loop
||[A1]  SUB     .S1     A1,1,A1         ;* decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;***** load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;***** load bi & bi+1 from memory
||      MPYSP   .M1X    A6,B6,A5        ; pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 * b1
||[A1]  B       .S2     LOOP            ;* branch to loop
||[A1]  SUB     .S1     A1,1,A1         ;** decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;****** load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;****** load bi & bi+1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;* pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;* pi1 = a1 * b1
||[A1]  B       .S2     LOOP            ;** branch to loop
||[A1]  SUB     .S1     A1,1,A1         ;*** decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;******* load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;******* load bi & bi+1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;** pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;** pi1 = a1 * b1
||[A1]  B       .S2     LOOP            ;*** branch to loop
||[A1]  SUB     .S1     A1,1,A1         ;**** decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;******** load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;******** load bi & bi+1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;*** pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;*** pi1 = a1 * b1
||[A1]  B       .S2     LOOP            ;**** branch to loop
||[A1]  SUB     .S1     A1,1,A1         ;***** decrement loop counter

LOOP:
        LDDW    .D1     *A4++,A7:A6     ;********* load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;********* load bi & bi+1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;**** pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;**** pi1 = a1 * b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)
||[A1]  B       .S2     LOOP            ;***** branch to loop
||[A1]  SUB     .S1     A1,1,A1         ;****** decrement loop counter
                                        ; Branch occurs here

        ADDSP   .L1X    A8,B8,A0        ; sum(0) = sum0(0) + sum1(0)
        ADDSP   .L2X    A8,B8,B0        ; sum(1) = sum0(1) + sum1(1)
        ADDSP   .L1X    A8,B8,A0        ; sum(2) = sum0(2) + sum1(2)
        ADDSP   .L2X    A8,B8,B0        ; sum(3) = sum0(3) + sum1(3)
        NOP                             ; wait for B0
        ADDSP   .L1X    A0,B0,A5        ; sum(01) = sum(0) + sum(1)
        NOP                             ; wait for next B0
        ADDSP   .L2X    A0,B0,B5        ; sum(23) = sum(2) + sum(3)
        NOP             3               ;
        ADDSP   .L1X    A5,B5,A4        ; sum = sum(01) + sum(23)
        NOP             3               ;
6.4.3.3
The following operations occur in parallel on the last cycle of the loop in Example 6-22:

    Iteration 50 of the ADD instructions
    Iteration 52 of the MPY and MPYH instructions
    Iteration 57 of the LDW instructions

The following operations occur in parallel on the last cycle of the loop in Example 6-23:

    Iteration 50 of the ADDSP instructions
    Iteration 54 of the MPYSP instructions
    Iteration 59 of the LDDW instructions
In most cases, extra iterations are not a problem; however, when extraneous LDWs and LDDWs access unmapped memory, you can get unpredictable results. If the extraneous instructions present a potential problem, remove the extraneous load and multiply instructions by adding an epilog like that included in the second part of Example 6-24 and Example 6-25.
Fixed-Point Example
To eliminate LDWs in the fixed-point dot product from iterations 51 through 57, run the loop seven fewer times. This brings the loop counter to 43 (50 - 7), which means you still must execute seven more cycles of ADD instructions and five more cycles of MPY instructions. Five pairs of MPYs and seven pairs of ADDs are now outside the loop. The LDWs, MPYs, and ADDs all execute exactly 50 times. (In Example 6-24, the changes are the new loop counter value and the epilog after the loop.) Executing the dot product code in Example 6-24 with no extraneous LDWs still requires a total of 58 cycles (7 + 43 + 7 + 1), but the code size is now larger.
Floating-Point Example
To eliminate LDDWs in the floating-point dot product from iterations 51 through 59, run the loop nine fewer times. This brings the loop counter to 41 (50 - 9), which means you still must execute nine more cycles of ADDSP instructions and five more cycles of MPYSP instructions. Five pairs of MPYSPs and nine pairs of ADDSPs are now outside the loop. The LDDWs, MPYSPs, and ADDSPs all execute exactly 50 times. (In Example 6-25, the changes are the new loop counter value and the epilog after the loop.) Executing the dot product code in Example 6-25 with no extraneous LDDWs still requires a total of 74 cycles (9 + 41 + 9 + 15), but the code size is now larger.
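The counter adjustment and the unchanged cycle total can be summarized as a short calculation. This is a sketch only; epilog_plan and plan_epilog are hypothetical names, and the epilog is assumed to be as long as the prolog, as in both examples.

```c
#include <assert.h>

typedef struct {
    int kernel_trips;   /* value to load into the loop counter */
    int total_cycles;   /* prolog + kernel + epilog + tail      */
} epilog_plan;

/* Move the last prolog_cycles passes out of the kernel into an epilog
 * so that no load is issued beyond the trip count. */
epilog_plan plan_epilog(int trip_count, int prolog_cycles, int tail_cycles)
{
    epilog_plan p;
    assert(prolog_cycles >= 0 && tail_cycles >= 0);
    assert(trip_count >= prolog_cycles);
    p.kernel_trips = trip_count - prolog_cycles;
    p.total_cycles = prolog_cycles + p.kernel_trips
                   + prolog_cycles            /* epilog length */
                   + tail_cycles;
    return p;
}
```

For the fixed-point code, plan_epilog(50, 7, 1) gives a counter of 43 and 58 cycles; for the floating-point code, plan_epilog(50, 9, 15) gives 41 and 74 cycles. The cycle count is unchanged because the epilog replaces the kernel passes it removes.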
Example 6-24. Assembly Code for Fixed-Point Dot Product (Software Pipelined With No Extraneous Loads)

        LDW     .D1     *A4++,A2        ; load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ; load bi & bi+1 from memory
||      MVK     .S1     43,A1           ; set up loop counter
||      ZERO    .L1     A7              ; zero out sum0 accumulator
||      ZERO    .L2     B7              ; zero out sum1 accumulator

  [A1]  SUB     .S1     A1,1,A1         ; decrement loop counter
||      LDW     .D1     *A4++,A2        ;* load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;* load bi & bi+1 from memory

  [A1]  SUB     .S1     A1,1,A1         ;* decrement loop counter
||[A1]  B       .S2     LOOP            ; branch to loop
||      LDW     .D1     *A4++,A2        ;** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;** load bi & bi+1 from memory

  [A1]  SUB     .S1     A1,1,A1         ;** decrement loop counter
||[A1]  B       .S2     LOOP            ;* branch to loop
||      LDW     .D1     *A4++,A2        ;*** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;*** load bi & bi+1 from memory

  [A1]  SUB     .S1     A1,1,A1         ;*** decrement loop counter
||[A1]  B       .S2     LOOP            ;** branch to loop
||      LDW     .D1     *A4++,A2        ;**** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;**** load bi & bi+1 from memory

        MPY     .M1X    A2,B2,A6        ; ai * bi
||      MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1
||[A1]  SUB     .S1     A1,1,A1         ;**** decrement loop counter
||[A1]  B       .S2     LOOP            ;*** branch to loop
||      LDW     .D1     *A4++,A2        ;***** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;***** load bi & bi+1 from memory

        MPY     .M1X    A2,B2,A6        ;* ai * bi
||      MPYH    .M2X    A2,B2,B6        ;* ai+1 * bi+1
||[A1]  SUB     .S1     A1,1,A1         ;***** decrement loop counter
||[A1]  B       .S2     LOOP            ;**** branch to loop
||      LDW     .D1     *A4++,A2        ;****** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;****** load bi & bi+1 from memory

LOOP:
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1
||[A1]  SUB     .S1     A1,1,A1         ;****** decrement loop counter
||[A1]  B       .S2     LOOP            ;***** branch to loop
||      LDW     .D1     *A4++,A2        ;******* load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;******* load bi & bi+1 from memory
                                        ; Branch occurs here

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)

        ADD     .L1X    A7,B7,A4        ; sum = sum0 + sum1
Example 6-25. Assembly Code for Floating-Point Dot Product (Software Pipelined With No Extraneous Loads)

        MVK     .S1     41,A1           ; set up loop counter
||      ZERO    .L1     A8              ; sum0 = 0
||      ZERO    .L2     B8              ; sum1 = 0
||      LDDW    .D1     *A4++,A7:A6     ; load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ; load bi & bi+1 from memory

        LDDW    .D1     *A4++,A7:A6     ;* load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;* load bi & bi+1 from memory

        LDDW    .D1     *A4++,A7:A6     ;** load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;** load bi & bi+1 from memory

        LDDW    .D1     *A4++,A7:A6     ;*** load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;*** load bi & bi+1 from memory
||[A1]  SUB     .S1     A1,1,A1         ; decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;**** load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;**** load bi & bi+1 from memory
||[A1]  B       .S2     LOOP            ; branch to loop
||[A1]  SUB     .S1     A1,1,A1         ;* decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;***** load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;***** load bi & bi+1 from memory
||      MPYSP   .M1X    A6,B6,A5        ; pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 * b1
||[A1]  B       .S2     LOOP            ;* branch to loop
||[A1]  SUB     .S1     A1,1,A1         ;** decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;****** load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;****** load bi & bi+1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;* pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;* pi1 = a1 * b1
||[A1]  B       .S2     LOOP            ;** branch to loop
||[A1]  SUB     .S1     A1,1,A1         ;*** decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;******* load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;******* load bi & bi+1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;** pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;** pi1 = a1 * b1
||[A1]  B       .S2     LOOP            ;*** branch to loop
||[A1]  SUB     .S1     A1,1,A1         ;**** decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;******** load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;******** load bi & bi+1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;*** pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;*** pi1 = a1 * b1
||[A1]  B       .S2     LOOP            ;**** branch to loop
||[A1]  SUB     .S1     A1,1,A1         ;***** decrement loop counter

LOOP:
        LDDW    .D1     *A4++,A7:A6     ;********* load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;********* load bi & bi+1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;**** pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;**** pi1 = a1 * b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)
||[A1]  B       .S2     LOOP            ;***** branch to loop
||[A1]  SUB     .S1     A1,1,A1         ;****** decrement loop counter
                                        ; Branch occurs here

        MPYSP   .M1X    A6,B6,A5        ; pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 * b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        MPYSP   .M1X    A6,B6,A5        ; pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 * b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        MPYSP   .M1X    A6,B6,A5        ; pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 * b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        MPYSP   .M1X    A6,B6,A5        ; pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 * b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        MPYSP   .M1X    A6,B6,A5        ; pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 * b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)

        ADDSP   .L1X    A8,B8,A0        ; sum(0) = sum0(0) + sum1(0)
        ADDSP   .L2X    A8,B8,B0        ; sum(1) = sum0(1) + sum1(1)
        ADDSP   .L1X    A8,B8,A0        ; sum(2) = sum0(2) + sum1(2)
        ADDSP   .L2X    A8,B8,B0        ; sum(3) = sum0(3) + sum1(3)
        NOP                             ; wait for B0
        ADDSP   .L1X    A0,B0,A5        ; sum(01) = sum(0) + sum(1)
        NOP                             ; wait for next B0
        ADDSP   .L2X    A0,B0,B5        ; sum(23) = sum(2) + sum(3)
        NOP             3               ;
        ADDSP   .L1X    A5,B5,A4        ; sum = sum(01) + sum(23)
        NOP             3               ;
6.4.3.4
Fixed-Point Example
To eliminate the prolog of the fixed-point dot product and, therefore, the extra LDW and MPY instructions, begin execution at the loop body (at the LOOP label). Eliminating the prolog means that:
- Two LDWs, two MPYs, and two ADDs occur in the first execution cycle of the loop.
- Because the first LDWs require five cycles to write results into a register, the MPYs do not multiply valid data until after the loop executes five times. The ADDs have no valid data until after seven cycles (five cycles for the first LDWs and two more cycles for the first valid MPYs).
Example 6-26 shows the loop without the prolog but with four new instructions that zero the inputs to the MPY and ADD instructions. Making the MPYs and ADDs use 0s before valid data is available ensures that the final accumulator values are unaffected. (The loop counter is initialized to 57 to accommodate the seven extra cycles needed to prime the loop.) Because the first LDWs are not issued until after seven cycles, the code in Example 6-26 requires a total of 65 cycles (7 + 57 + 1). You therefore reduce the code size at the cost of a slight loss in performance.
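The effect of zeroing the MPY and ADD inputs can be checked with a small C model of the prolog-free loop. This is only a sketch under stated assumptions: the function names, the queue-based latency model, and the sample data are hypothetical and are not part of the guide. It assumes a load result is readable five cycles after issue, a multiply result two cycles after issue, and it substitutes zeros for the extraneous loads issued after the data runs out.

```c
#include <assert.h>

#define LOAD_LATENCY 5   /* LDW result is readable 5 cycles after issue */
#define MPY_LATENCY  2   /* MPY result is readable 2 cycles after issue */
#define N_ELEMS      100 /* 16-bit elements per array, as in the example */

/* Straightforward reference dot product. */
static int dot_reference(const short *a, const short *b, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Model of the prolog-free loop. `init` is the value the MPY and ADD input
 * registers hold on entry: 0 models the ZERO instructions, any other value
 * models stale register contents (an assumption made for illustration). */
static int dot_no_prolog(const short *a, const short *b, int n, short init)
{
    assert(n % 2 == 0);
    int pairs = n / 2;                               /* one LDW loads two elements */
    int iters = pairs + LOAD_LATENCY + MPY_LATENCY;  /* 50 + 7 = 57 */
    short ld[LOAD_LATENCY][4] = {{0}};   /* loads in flight: a lo, a hi, b lo, b hi */
    int   ld_valid[LOAD_LATENCY] = {0};
    int   mp[MPY_LATENCY][2] = {{0}};    /* products in flight */
    int   mp_valid[MPY_LATENCY] = {0};
    short in[4] = {init, init, init, init};  /* MPY inputs (halves of A2 and B2) */
    int   prod[2] = {init, init};            /* ADD inputs (A6 and B6) */
    int   sum0 = 0, sum1 = 0;                /* accumulators (A7 and B7) */

    for (int k = 0; k < iters; k++) {
        /* Results whose latency has elapsed land in their registers. */
        if (ld_valid[LOAD_LATENCY - 1])
            for (int j = 0; j < 4; j++)
                in[j] = ld[LOAD_LATENCY - 1][j];
        if (mp_valid[MPY_LATENCY - 1]) {
            prod[0] = mp[MPY_LATENCY - 1][0];
            prod[1] = mp[MPY_LATENCY - 1][1];
        }
        /* Every instruction in the execute packet reads the current registers. */
        sum0 += prod[0];                     /* ADD  */
        sum1 += prod[1];                     /* ADD  */
        int p0 = in[0] * in[2];              /* MPY  */
        int p1 = in[1] * in[3];              /* MPYH */
        /* Advance the in-flight queues, then issue this cycle's MPYs and LDWs. */
        for (int s = MPY_LATENCY - 1; s > 0; s--) {
            mp[s][0] = mp[s - 1][0];
            mp[s][1] = mp[s - 1][1];
            mp_valid[s] = mp_valid[s - 1];
        }
        mp[0][0] = p0; mp[0][1] = p1; mp_valid[0] = 1;
        for (int s = LOAD_LATENCY - 1; s > 0; s--) {
            for (int j = 0; j < 4; j++)
                ld[s][j] = ld[s - 1][j];
            ld_valid[s] = ld_valid[s - 1];
        }
        /* Loads issued after the data runs out never reach the accumulators,
         * so the sketch substitutes zeros rather than reading past the arrays. */
        int in_range = k < pairs;
        ld[0][0] = in_range ? a[2 * k] : 0;
        ld[0][1] = in_range ? a[2 * k + 1] : 0;
        ld[0][2] = in_range ? b[2 * k] : 0;
        ld[0][3] = in_range ? b[2 * k + 1] : 0;
        ld_valid[0] = 1;
    }
    return sum0 + sum1;
}

/* Difference between the modelled loop and the reference on sample data. */
int demo_difference(short init)
{
    short a[N_ELEMS], b[N_ELEMS];
    for (int i = 0; i < N_ELEMS; i++) {
        a[i] = (short)(i - 40);
        b[i] = (short)(3 * i + 1);
    }
    return dot_no_prolog(a, b, N_ELEMS, init) - dot_reference(a, b, N_ELEMS);
}
```

With zeroed inputs, demo_difference(0) is 0: the 57 iterations accumulate exactly the 50 valid product pairs. With a stale value such as 3 in the input registers, the result is off by 102, because each accumulator picks up two stale ADD inputs and five stale products before valid data arrives.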
Example 6-26. Assembly Code for Fixed-Point Dot Product (Software Pipelined With Removal of Prolog and Epilog)

        MVK    .S1    57,A1           ; set up loop counter

 [A1]   SUB    .S1    A1,1,A1         ; decrement loop counter
||      ZERO   .L1    A7              ; zero out sum0 accumulator
||      ZERO   .L2    B7              ; zero out sum1 accumulator

 [A1]   SUB    .S1    A1,1,A1         ;* decrement loop counter
||[A1]  B      .S2    LOOP            ; branch to loop
||      ZERO   .L1    A6              ; zero out add input
||      ZERO   .L2    B6              ; zero out add input

 [A1]   SUB    .S1    A1,1,A1         ;** decrement loop counter
||[A1]  B      .S2    LOOP            ;* branch to loop
||      ZERO   .L1    A2              ; zero out mpy input
||      ZERO   .L2    B2              ; zero out mpy input

 [A1]   SUB    .S1    A1,1,A1         ;*** decrement loop counter
||[A1]  B      .S2    LOOP            ;** branch to loop

 [A1]   SUB    .S1    A1,1,A1         ;**** decrement loop counter
||[A1]  B      .S2    LOOP            ;*** branch to loop

 [A1]   SUB    .S1    A1,1,A1         ;***** decrement loop counter
||[A1]  B      .S2    LOOP            ;**** branch to loop

LOOP:
        ADD    .L1    A6,A7,A7        ; sum0 += (ai * bi)
||      ADD    .L2    B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY    .M1X   A2,B2,A6        ;** ai * bi
||      MPYH   .M2X   A2,B2,B6        ;** ai+1 * bi+1
||[A1]  SUB    .S1    A1,1,A1         ;****** decrement loop counter
||[A1]  B      .S2    LOOP            ;***** branch to loop
||      LDW    .D1    *A4++,A2        ;******* ld ai & ai+1 fm memory
||      LDW    .D2    *B4++,B2        ;******* ld bi & bi+1 fm memory
        ; Branch occurs here

        ADD    .L1X   A7,B7,A4
Floating-Point Example
To eliminate the prolog of the floating-point dot product and, therefore, the extra LDDW and MPYSP instructions, begin execution at the loop body (at the LOOP label). Eliminating the prolog means that:
- Two LDDWs, two MPYSPs, and two ADDSPs occur in the first execution cycle of the loop.
- Because the first LDDWs require five cycles to write results into a register, the MPYSPs do not multiply valid data until after the loop executes five times. The ADDSPs have no valid data until after nine cycles (five cycles for the first LDDWs and four more cycles for the first valid MPYSPs).
Example 6-27 shows the loop without the prolog but with four new instructions that zero the inputs to the MPYSP and ADDSP instructions. Making the MPYSPs and ADDSPs use 0s before valid data is available ensures that the final accumulator values are unaffected. (The loop counter is initialized to 59 to accommodate the nine extra cycles needed to prime the loop.) Because the first LDDWs are not issued until after nine cycles, the code in Example 6-27 requires a total of 81 cycles (7 + 59 + 15). You therefore reduce the code size at the cost of a slight loss in performance.
Example 6-27. Assembly Code for Floating-Point Dot Product (Software Pipelined With Removal of Prolog and Epilog)

        MVK    .S1    59,A1           ; set up loop counter
||      ZERO   .L1    A7              ; zero out mpysp input
||      ZERO   .L2    B7              ; zero out mpysp input

 [A1]   SUB    .S1    A1,1,A1         ; decrement loop counter

 [A1]   B      .S2    LOOP            ; branch to loop
||[A1]  SUB    .S1    A1,1,A1         ;* decrement loop counter
||      ZERO   .L1    A8              ; zero out sum0 accumulator
||      ZERO   .L2    B8              ; zero out sum1 accumulator

 [A1]   B      .S2    LOOP            ;* branch to loop
||[A1]  SUB    .S1    A1,1,A1         ;** decrement loop counter
||      ZERO   .L1    A5              ; zero out addsp input
||      ZERO   .L2    B5              ; zero out addsp input

 [A1]   B      .S2    LOOP            ;** branch to loop
||[A1]  SUB    .S1    A1,1,A1         ;*** decrement loop counter
||      ZERO   .L1    A6              ; zero out mpysp input
||      ZERO   .L2    B6              ; zero out mpysp input
Example 6-27. Assembly Code for Floating-Point Dot Product (Software Pipelined With Removal of Prolog and Epilog) (Continued)

 [A1]   B      .S2    LOOP            ;*** branch to loop
||[A1]  SUB    .S1    A1,1,A1         ;**** decrement loop counter

 [A1]   B      .S2    LOOP            ;**** branch to loop
||[A1]  SUB    .S1    A1,1,A1         ;***** decrement loop counter

LOOP:
        LDDW   .D1    *A4++,A7:A6     ;********* load ai & ai + 1 from memory
||      LDDW   .D2    *B4++,B7:B6     ;********* load bi & bi + 1 from memory
||      MPYSP  .M1X   A6,B6,A5        ;**** pi = a0 * b0
||      MPYSP  .M2X   A7,B7,B5        ;**** pi1 = a1 * b1
||      ADDSP  .L1    A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP  .L2    B5,B8,B8        ; sum1 += (ai+1 * bi+1)
||[A1]  B      .S2    LOOP            ;***** branch to loop
||[A1]  SUB    .S1    A1,1,A1         ;****** decrement loop counter
        ; Branch occurs here

        ADDSP  .L1X   A8,B8,A0        ; sum(0) = sum0(0) + sum1(0)
        ADDSP  .L2X   A8,B8,B0        ; sum(1) = sum0(1) + sum1(1)
        ADDSP  .L1X   A8,B8,A0        ; sum(2) = sum0(2) + sum1(2)
        ADDSP  .L2X   A8,B8,B0        ; sum(3) = sum0(3) + sum1(3)
        NOP                           ; wait for B0
        ADDSP  .L1X   A0,B0,A5        ; sum(01) = sum(0) + sum(1)
        NOP                           ; wait for next B0
        ADDSP  .L2X   A0,B0,B5        ; sum(23) = sum(2) + sum(3)
        NOP    3                      ;
        ADDSP  .L1X   A5,B5,A4        ; sum = sum(01) + sum(23)
        NOP    3                      ;
6.4.3.5
Example 6-28. Assembly Code for Fixed-Point Dot Product (Software Pipelined With Smallest Code Size)

        B      .S2    LOOP            ; branch to loop
||      MVK    .S1    51,A1           ; set up loop counter

        B      .S2    LOOP            ;* branch to loop

        B      .S2    LOOP            ;** branch to loop
||      ZERO   .L1    A7              ; zero out sum0 accumulator
||      ZERO   .L2    B7              ; zero out sum1 accumulator

        B      .S2    LOOP            ;*** branch to loop
||      ZERO   .L1    A6              ; zero out add input
||      ZERO   .L2    B6              ; zero out add input

        B      .S2    LOOP            ;**** branch to loop
||      ZERO   .L1    A2              ; zero out mpy input
||      ZERO   .L2    B2              ; zero out mpy input

LOOP:
        ADD    .L1    A6,A7,A7        ; sum0 += (ai * bi)
||      ADD    .L2    B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY    .M1X   A2,B2,A6        ;** ai * bi
||      MPYH   .M2X   A2,B2,B6        ;** ai+1 * bi+1
||[A1]  SUB    .S1    A1,1,A1         ;****** decrement loop counter
||[A1]  B      .S2    LOOP            ;***** branch to loop
||      LDW    .D1    *A4++,A2        ;******* ld ai & ai+1 fm memory
||      LDW    .D2    *B4++,B2        ;******* ld bi & bi+1 fm memory
        ; Branch occurs here

        ADD    .L1X   A7,B7,A4
Example 6-29. Assembly Code for Floating-Point Dot Product (Software Pipelined With Smallest Code Size)

        B      .S2    LOOP            ; branch to loop
||      MVK    .S1    53,A1           ; set up loop counter

        B      .S2    LOOP            ;* branch to loop
||      ZERO   .L1    A7              ; zero out mpysp input
||      ZERO   .L2    B7              ; zero out mpysp input

        B      .S2    LOOP            ;** branch to loop
||      ZERO   .L1    A8              ; zero out sum0 accumulator
||      ZERO   .L2    B8              ; zero out sum1 accumulator

        B      .S2    LOOP            ;*** branch to loop
||      ZERO   .L1    A5              ; zero out addsp input
||      ZERO   .L2    B5              ; zero out addsp input

        B      .S2    LOOP            ;**** branch to loop
||      ZERO   .L1    A6              ; zero out mpysp input
||      ZERO   .L2    B6              ; zero out mpysp input

LOOP:
        LDDW   .D1    *A4++,A7:A6     ;********* load ai & ai + 1 from memory
||      LDDW   .D2    *B4++,B7:B6     ;********* load bi & bi + 1 from memory
||      MPYSP  .M1X   A6,B6,A5        ;**** pi = a0 * b0
||      MPYSP  .M2X   A7,B7,B5        ;**** pi1 = a1 * b1
||      ADDSP  .L1    A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP  .L2    B5,B8,B8        ; sum1 += (ai+1 * bi+1)
||[A1]  B      .S2    LOOP            ;***** branch to loop
||[A1]  SUB    .S1    A1,1,A1         ;****** decrement loop counter
        ; Branch occurs here

        ADDSP  .L1X   A8,B8,A0        ; sum(0) = sum0(0) + sum1(0)
        ADDSP  .L2X   A8,B8,B0        ; sum(1) = sum0(1) + sum1(1)
        ADDSP  .L1X   A8,B8,A0        ; sum(2) = sum0(2) + sum1(2)
        ADDSP  .L2X   A8,B8,B0        ; sum(3) = sum0(3) + sum1(3)
        NOP                           ; wait for B0
        ADDSP  .L1X   A0,B0,A5        ; sum(01) = sum(0) + sum(1)
        NOP                           ; wait for next B0
        ADDSP  .L2X   A0,B0,B5        ; sum(23) = sum(2) + sum(3)
        NOP    3                      ;
        ADDSP  .L1X   A5,B5,A4        ; sum = sum(01) + sum(23)
        NOP    3                      ;
6.4.4
Comparing Performance
Table 6-10 compares the performance of all versions of the fixed-point dot product code. Table 6-11 compares the performance of all versions of the floating-point dot product code.

Code Example                                                                          100 Iterations
(label not recovered)                                                                 2 + 100 × 16
(label not recovered)                                                                 1 + 100 × 8
Example 6-15  Fixed-point dot product parallel assembly with LDW                      1 + (50 × 8) + 1
Example 6-22  Fixed-point software-pipelined dot product                              7 + 50 + 1
Example 6-24  Fixed-point software-pipelined dot product with no extraneous loads     7 + 43 + 7 + 1
Example 6-26  Fixed-point software-pipelined dot product with no prolog or epilog     7 + 57 + 1
Example 6-28  Fixed-point software-pipelined dot product with smallest code size      5 + 57 + 1
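The cycle expressions for the fixed-point versions can be evaluated mechanically. The following C sketch is illustrative only: the struct layout and function name are hypothetical, and it assumes the expressions pair with Examples 6-15, 6-22, 6-24, 6-26, and 6-28 in the order the examples are listed. Each expression is split into cycles before the loop, loop iterations, cycles per iteration, and cycles after the loop.

```c
#include <assert.h>
#include <stddef.h>

/* One row per code version; `example` is the n in "Example 6-n". */
struct version {
    int example;
    int before;      /* cycles before the loop starts */
    int iterations;  /* times the loop body executes */
    int per_iter;    /* cycles per loop iteration */
    int after;       /* cycles after the loop, epilog included */
};

static const struct version versions[] = {
    {15, 1, 50, 8, 1},      /* 1 + (50 x 8) + 1 */
    {22, 7, 50, 1, 1},      /* 7 + 50 + 1 */
    {24, 7, 43, 1, 7 + 1},  /* 7 + 43 + 7 + 1 */
    {26, 7, 57, 1, 1},      /* 7 + 57 + 1 */
    {28, 5, 57, 1, 1},      /* 5 + 57 + 1 */
};

/* Total cycle count for the given example, or -1 if it is not in the list. */
int cycles_for_example(int example)
{
    for (size_t i = 0; i < sizeof versions / sizeof versions[0]; i++) {
        const struct version *v = &versions[i];
        if (v->example == example)
            return v->before + v->iterations * v->per_iter + v->after;
    }
    return -1;
}
```

For instance, cycles_for_example(26) gives the 65 cycles quoted for the version without prolog or epilog, and cycles_for_example(28) gives 63 for the smallest-code-size version.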
Code Example                                                                             100 Iterations
Example 6-16  Floating-point dot product parallel assembly with LDDW
Example 6-23  Floating-point software-pipelined dot product
Example 6-25  Floating-point software-pipelined dot product with no extraneous loads
Example 6-27  Floating-point software-pipelined dot product with no prolog or epilog
Example 6-29  Floating-point software-pipelined dot product with smallest code size
6.5.1
6.5.2
Example 6-31. Linear Assembly for Weighted Vector Sum Inner Loop

        LDH    *aptr++,ai             ; ai
        LDH    *bptr++,bi             ; bi
        MPY    m,ai,pi                ; m * ai
        SHR    pi,15,pi_scaled        ; (m * ai) >> 15
        ADD    pi_scaled,bi,ci        ; ci = (m * ai) >> 15 + bi
        STH    ci,*cptr++             ; store ci
 [cntr] SUB    cntr,1,cntr            ; decrement loop counter
 [cntr] B      LOOP                   ; branch to loop
6.5.3
6.5.3.1
6.5.3.2
The two store pointers (*ciptr and *ci+1ptr) are separated so that one (*ciptr) increments by 2 through the odd elements of the array and the other (*ci+1ptr) increments through the even elements. AND and SHR separate bi and bi+1 into two separate registers. This code assumes that mask is preloaded with 0x0000FFFF to clear the upper 16 bits. The shift right of 16 places bi+1 into the 16 LSBs.

Example 6-33. Linear Assembly for Weighted Vector Sum Using LDW

        LDW    *aptr++,ai_i+1         ; ai & ai+1
        LDW    *bptr++,bi_i+1         ; bi & bi+1
        MPY    m,ai_i+1,pi            ; m * ai
        MPYHL  m,ai_i+1,pi+1          ; m * ai+1
        SHR    pi,15,pi_scaled        ; (m * ai) >> 15
        SHR    pi+1,15,pi+1_scaled    ; (m * ai+1) >> 15
        AND    bi_i+1,mask,bi         ; bi
        SHR    bi_i+1,16,bi+1         ; bi+1
        ADD    pi_scaled,bi,ci        ; ci = (m * ai) >> 15 + bi
        ADD    pi+1_scaled,bi+1,ci+1  ; ci+1 = (m * ai+1) >> 15 + bi+1
        STH    ci,*ciptr++[2]         ; store ci
        STH    ci+1,*ci+1ptr++[2]     ; store ci+1
 [cntr] SUB    cntr,1,cntr            ; decrement loop counter
 [cntr] B      LOOP                   ; branch to loop
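The AND/SHR unpacking can be modelled in C to confirm that processing two packed halfwords per word gives the same stored results as the one-element-at-a-time loop. This is a sketch under assumptions: the function names and sample data are hypothetical, element i is assumed to sit in the 16 LSBs of the loaded word, and the helpers asr, lo16, and hi16 exist only to keep the C portable.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic shift right that avoids implementation-defined behavior. */
static int32_t asr(int32_t v, int s) { return v < 0 ? ~(~v >> s) : v >> s; }

/* Low or high 16 bits of a word, interpreted as a signed halfword. */
static int32_t lo16(uint32_t w)
{
    int32_t u = (int32_t)(w & 0xFFFFu);
    return u >= 0x8000 ? u - 0x10000 : u;
}
static int32_t hi16(uint32_t w) { return lo16(w >> 16); }

/* Reference: c[i] = ((m * a[i]) >> 15) + b[i], keeping the 16 LSBs. */
static void wvs_reference(const int16_t *a, const int16_t *b, uint16_t *c,
                          int16_t m, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = (uint16_t)(asr((int32_t)m * a[i], 15) + b[i]);
}

/* Two elements per iteration, mirroring the LDW-based instruction sequence. */
static void wvs_packed(const int16_t *a, const int16_t *b, uint16_t *c,
                       int16_t m, int n)
{
    const uint32_t mask = 0x0000FFFFu;
    assert(n % 2 == 0);
    for (int i = 0; i < n; i += 2) {
        /* LDW: element i in the 16 LSBs, element i+1 in the 16 MSBs. */
        uint32_t a_word = (uint32_t)(uint16_t)a[i] | ((uint32_t)(uint16_t)a[i + 1] << 16);
        uint32_t b_word = (uint32_t)(uint16_t)b[i] | ((uint32_t)(uint16_t)b[i + 1] << 16);
        int32_t pi  = m * lo16(a_word);           /* MPY: m * ai */
        int32_t pi1 = m * hi16(a_word);           /* MPYHL: m * ai+1 */
        int32_t bi  = (int32_t)(b_word & mask);   /* AND: upper bits cleared, no sign extension */
        int32_t bi1 = hi16(b_word);               /* SHR by 16: sign-extended bi+1 */
        c[i]     = (uint16_t)(asr(pi, 15) + bi);  /* STH keeps only the 16 LSBs */
        c[i + 1] = (uint16_t)(asr(pi1, 15) + bi1);
    }
}

/* Number of elements where the two versions disagree on sample data. */
int wvs_mismatches(void)
{
    enum { N = 8 };
    const int16_t a[N] = {100, -200, 32767, -32768, 5, -5, 1234, -4321};
    const int16_t b[N] = {-1, 2, -300, 400, -32768, 32767, 0, -7};
    uint16_t ref[N], got[N];
    int bad = 0;
    wvs_reference(a, b, ref, 16384, N);
    wvs_packed(a, b, got, 16384, N);
    for (int i = 0; i < N; i++)
        bad += ref[i] != got[i];
    return bad;
}
```

The masked bi is not sign-extended, but because the halfword store keeps only the 16 LSBs of the sum, the stored value is the same as in the reference loop.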
6.5.3.3
Four memory operations (two LDWs and two STHs) must each use a .D unit. With two .D units available, this loop still requires only two cycles. Four instructions must use the .S units (three SHRs and one branch). With two .S units available, the minimum iteration interval is still 2. The two MPYs do not increase the minimum iteration interval. Because the remaining four instructions (two ADDs, AND, and SUB) can all use a .L unit, the minimum iteration interval for this loop is the same as in Example 6-31.
By using LDWs instead of LDHs, the program can do twice as much work in the same number of cycles.
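The resource counting above reduces to a ceiling division per unit type. The C sketch below is a simplified illustration, not the guide's procedure: the function names are hypothetical, and it assumes every instruction is tied to one unit type (.D, .S, .M, or .L) with two units of each type available.

```c
#include <assert.h>

#define UNITS_PER_TYPE 2  /* one unit of each type on the A side, one on the B side */

/* Cycles needed to issue `count` instructions on the two units of one type. */
static int cycles_for(int count)
{
    assert(count >= 0);
    return (count + UNITS_PER_TYPE - 1) / UNITS_PER_TYPE;
}

/* Resource-imposed lower bound on the iteration interval. */
int resource_min_ii(int d_ops, int s_ops, int m_ops, int l_ops)
{
    int counts[4] = {d_ops, s_ops, m_ops, l_ops};
    int bound = 1;
    for (int i = 0; i < 4; i++) {
        int c = cycles_for(counts[i]);
        if (c > bound)
            bound = c;
    }
    return bound;
}
```

For the LDW version, resource_min_ii(4, 4, 2, 4) evaluates to 2. Counting the LDH version as three .D operations, two .S operations, one multiply, and two .L operations (an assumed unit assignment) also gives 2, even though it produces only one result per iteration.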
6.5.4
1 ADD .L1X ci
1 .L1
SUB cntr 1 B
.S1
LOOP
6.5.5
The pointers are initialized outside the loop. m resides in B6, which causes both .M units to use a cross path. The mask in the AND instruction resides in B10.

Example 6-34. Linear Assembly for Weighted Vector Sum With Resources Allocated

        LDW    .D1    *A4++,A2        ; ai & ai+1
        LDW    .D2    *B4++,B2        ; bi & bi+1
        MPY    .M1X   A2,B6,A5        ; pi = m * ai
        MPYHL  .M2X   A2,B6,B5        ; pi+1 = m * ai+1
        SHR    .S1    A5,15,A7        ; pi_scaled = (m * ai) >> 15
        SHR    .S2    B5,15,B7        ; pi+1_scaled = (m * ai+1) >> 15
        AND    .L2    B2,B10,B8       ; bi
        SHR    .S2    B2,16,B1        ; bi+1
        ADD    .L1X   A7,B8,A9        ; ci = (m * ai) >> 15 + bi
        ADD    .L2    B7,B1,B9        ; ci+1 = (m * ai+1) >> 15 + bi+1
        STH    .D1    A9,*A6++[2]     ; store ci
        STH    .D2    B9,*B0++[2]     ; store ci+1
 [A1]   SUB    .L1    A1,1,A1         ; decrement loop counter
 [A1]   B      .S1    LOOP            ; branch to loop
6.5.6
- Instructions that execute on cycle k also execute on cycle k + 2, k + 4, etc. Instructions scheduled on these even cycles cannot use the same resources.
- Instructions that execute on cycle k + 1 also execute on cycle k + 3, k + 5, etc. Instructions scheduled on these odd cycles cannot use the same resources.
- Because two instructions (MPY and ADD) use the 1X path but do not use the same functional unit, Table 6-12 includes two rows (1X and 2X) that help you keep track of the cross path resources.

- The two LDWs use the .D units on the even cycles.
- The MPY and MPYHL are scheduled on cycle 5 because the LDW has four delay slots. The MPY instructions appear in two rows because they use the .M and cross path resources on cycles 5, 7, 9, etc.
- The two SHR instructions are scheduled two cycles after the MPY to allow for the MPY's single delay slot.
- The AND is scheduled on cycle 5, four delay slots after the LDW.
Table 6-12. Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop)
Unit/Cycle .D1 0 2 * LDW ai_i+1 * LDW bi_i+1 4 ** LDW ai_i+1 ** LDW bi_i+1 6 *** LDW ai_i+1 *** LDW bi_i+1 8 **** LDW ai_i+1 **** LDW bi_i+1 10 ***** LDW ai_i+1 ***** LDW bi_i+1
LDW ai_i+1
.D2 .M1 .M2 .L1 .L2 .S1 .S2 1X 2X Unit/Cycle .D1 .D2 .M1 .M2 .L1 .L2 .S1 .S2 1X 2X
Note:
LDW bi_i+1
11
SHR pi_s SHR pi+1_s MPY pi MPYHL pi+1 * MPY pi * MPYHL pi+1
The asterisks indicate the iteration of the loop; shaded cells indicate cycle 0.
6.5.6.1
Resource Conflicts
Resources from one instruction cannot conflict with resources from any other instruction scheduled a multiple of the iteration interval away. In other words, for a 2-cycle loop, instructions scheduled on cycle n cannot use the same resources as instructions scheduled on cycles n + 2, n + 4, n + 6, etc. Table 6-13 shows the addition of the SHR bi+1 instruction. This must avoid a conflict of resources in cycles 5 and 7, which are one iteration interval away from each other. Even though LDW bi_i+1 (.D2, cycle 0) finishes on cycle 5, its child, SHR bi+1, cannot be scheduled on .S2 until cycle 6 because of a resource conflict with SHR pi+1_scaled, which is on .S2 in cycle 7.
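The conflict rule amounts to a reservation table indexed by unit and by cycle modulo the iteration interval. The following C sketch shows one way to express it; the type and function names are hypothetical, and the table tracks only functional units and cross paths, one instruction per slot.

```c
#include <assert.h>
#include <string.h>

enum unit { D1, D2, M1, M2, L1, L2, S1, S2, X1, X2, UNIT_COUNT };
#define MAX_II 16

/* busy[u][s] is nonzero when unit u is taken on cycles congruent to s mod ii. */
struct modulo_table {
    int ii;
    unsigned char busy[UNIT_COUNT][MAX_II];
};

void table_init(struct modulo_table *t, int ii)
{
    assert(ii > 0 && ii <= MAX_II);
    memset(t, 0, sizeof *t);
    t->ii = ii;
}

/* Reserve unit u on `cycle`; returns 1 on success, 0 on a resource conflict. */
int table_reserve(struct modulo_table *t, enum unit u, int cycle)
{
    int slot = cycle % t->ii;
    if (t->busy[u][slot])
        return 0;
    t->busy[u][slot] = 1;
    return 1;
}

/* Earliest cycle at or after `ready` on which unit u is free, or -1 if none. */
int table_schedule(struct modulo_table *t, enum unit u, int ready)
{
    for (int c = ready; c < ready + t->ii; c++)
        if (table_reserve(t, u, c))
            return c;
    return -1;
}

/* SHR pi+1_scaled holds .S2 on cycle 7; SHR bi+1 is ready on cycle 5. */
int demo_shr_bi1_cycle(void)
{
    struct modulo_table t;
    table_init(&t, 2);
    table_reserve(&t, S2, 7);
    return table_schedule(&t, S2, 5);
}
```

With a 2-cycle loop, cycles 5 and 7 map to the same slot, so demo_shr_bi1_cycle() places SHR bi+1 on cycle 6, as in the text.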
Figure 6-12. Dependency Graph of Weighted Vector Sum (Showing Resource Conflict)
[Figure: dependency graph with A-side and B-side columns. Surviving labels: LDW, AND, pi_scaled, pi+1_scaled, bi, "Scheduled on cycle 5", "Scheduled on cycle 7".]
Table 6-13. Modulo Iteration Interval Table for Weighted Vector Sum With SHR Instructions
Unit / Cycle .D1 .D2 .M1 .M2 .L1 .L2 .S1 .S2 1X 2X Unit / Cycle .D1 .D2 .M1 .M2 .L1 .L2 .S1 .S2 1X
Note:
10, 12, 14, ... ***** LDW ai_i+1 ***** LDW bi_i+1
SHR bi+1
* SHR bi+1
** SHR bi+1
The asterisks indicate the iteration of the loop; shading indicates changes in scheduling from Table 6-12.
Unit / Cycle 2X
Note:
5 MPYHL pi+1
7 * MPYHL pi+1
9 ** MPYHL pi+1
The asterisks indicate the iteration of the loop; shading indicates changes in scheduling from Table 6-12.
6.5.6.2
6.5.6.3
Figure 6-13. Dependency Graph of Weighted Vector Sum (With Resource Conflict Resolved)
[Figure: dependency graph with A-side and B-side columns. Surviving node labels: LDW ai_i+1, LDW bi_i+1, MPY pi, MPYHL pi+1, SHR pi_scaled, SHR pi+1_scaled, AND bi, SHR bi+1, ADD ci, ADD ci+1, STH mem.]
Note: Shaded numbers indicate the cycle in which the instruction is first scheduled.
Table 6-14. Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop)
Unit/Cycle .D1 .D2 .M1 .M2 .L1 .L2 .S1 .S2 1X 2X Unit/Cycle .D1 .D2 .M1 .M2 .L1 .L2 .S1 .S2 1X 2X
Note:
* ADD ci ** AND bi
SHR bi+1
* SHR bi+1
** SHR bi+1
11
SHR pi_s SHR pi+1_s MPY pi MPYHL pi+1 * MPY pi * MPYHL pi+1
The asterisks indicate the iteration of the loop; shading indicates changes in scheduling from Table 6-13.
6.5.6.4
MPYHL pi+1 2 SHR .D2 AND 5 .L2 bi 7 .S2 bi_i+1 5 SHR bi+1 8 6 LDW 2
pi_scaled
pi+1_scaled
1 ADD .L1X ci 8
1 1 .L2
ADD ci+1 1 10
SUB 5
Note:
Shaded numbers indicate the cycle in which the instruction is first scheduled.
- B LOOP (.S1, cycle 6)
- SUB cntr (.L1, cycle 5)
- ADD ci+1 (.L2, cycle 10)
- STH ci (cycle 9)
- STH ci+1 (cycle 11)

To avoid resource conflicts and live-too-long problems, Table 6-15 also includes the following additional changes:
- LDW bi_i+1 (.D2) moved from cycle 0 to cycle 2.
- AND bi (.L2) moved from cycle 6 to cycle 7.
- SHR pi+1_scaled (.S2) moved from cycle 7 to cycle 9.
- MPYHL pi+1 moved from cycle 5 to cycle 6.
- SHR bi+1 moved from cycle 6 to cycle 8.
From the table, you can see that this loop is pipelined six iterations deep, because iterations n and n + 5 execute in parallel.
Table 6-15. Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop)
Unit/Cycle .D1 .D2 .M1 .M2 .L1 .L2 .S1 .S2 1X 2X Unit/Cycle .D1 .D2 .M1 .M2 .L1 .L2 .S1 .S2 1X 2X
Note: The asterisks indicate the iteration of the loop; shading indicates changes in scheduling from Table 6-14.
0 LDW ai_i+1
10, 12, 14, ... ***** LDW ai_i+1 **** LDW bi_i+1 ** MPYHL pi+1 * ADD ci ADD ci+1
MPYHL pi+1
B LOOP
** B LOOP * SHR bi+1 * ADD ci ** MPYHL pi+1 11, 13, 15, ... * STH ci STH ci+1
MPYHL pi+1 1 3 5 7
MPY pi
*** MPY pi *** SUB cntr ** AND bi ** SHR pi_s * SHR pi+1_s *** MPY pi
SUB cntr
MPY pi
* MPY pi
** MPY pi
6.5.7
        .trip 50
        LDW    .D1    *a++,ai_i1      ; ai & ai+1
        LDW    .D2    *b++,bi_i1      ; bi & bi+1
        MPY    .M1X   ai_i1,m,pi      ; m * ai
        MPYHL  .M2X   ai_i1,m,pi1     ; m * ai+1
        SHR    .S1    pi,15,pi_s      ; (m * ai) >> 15
        SHR    .S2    pi1,15,pi1_s    ; (m * ai+1) >> 15
        AND    .L2    bi_i1,mask,bi   ; bi
        SHR    .S2    bi_i1,16,bi1    ; bi+1
        ADD    .L1X   pi_s,bi,ci      ; ci = (m * ai) >> 15 + bi
        ADD    .L2    pi1_s,bi1,ci1   ; ci+1 = (m * ai+1) >> 15 + bi+1
        STH    .D1    ci,*c++[2]      ; store ci
        STH    .D2    ci1,*c1++[2]    ; store ci+1
 [cntr] SUB    .L1    cntr,1,cntr     ; decrement loop counter
 [cntr] B      .S1    LOOP            ; branch to loop
        .endproc
6.5.8
Final Assembly
Example 6-36 shows the final assembly code for the weighted vector sum. The following optimizations are included:
- While iteration n of instruction STH ci+1 is executing, iteration n + 1 of STH ci is executing. To prevent the STH ci instruction from executing iteration 51 while STH ci+1 executes iteration 50, execute the loop only 49 times and schedule the final executions of ADD ci+1 and STH ci+1 after exiting the loop.
- The mask for the AND instruction is created with MVK and MVKH in parallel with the loop prolog.
- The pointer to the odd elements in array c is also set up in parallel with the loop prolog.
||
|| || ||
MPY ||[A1] SUB MPYHL ||[A1] B || LDW || LDW SHR || AND || MPY ||[A1] SUB SHR ADD MPYHL B LDW LDW SHR STH SHR AND SUB MPY
|| || ||[A1] || ||
; ci+1 = (m * ai+1) >> 15 + bi+1 ;* bi+1 ;* ci = (m * ai) >> 15 + bi ;** m * ai+1 ;** branch to loop ;**** bi & bi+1 ;***** ai & ai+1
6.6.1
6.6.2
xptr is not postincremented after loading xi+1, because xi of the next iteration is actually xi+1 of the current iteration. Thus, the pointer points to the same address when loading both xi+1 for one iteration and xi for the next iteration. yptr is also not postincremented after storing yi+1, because yi of the next iteration is yi+1 for the current iteration.
6.6.3
MPY
MPY
MPY
ADD s0
2 2 ADD 1 s1 1 1
SUB 1 1 cntr
SHR
yi+1 1
STH
LOOP
mem
Note:
6.6.4
(b) B side
Unit(s)               Instructions   Total/Unit
.M2                   MPY            1
.S2                   SHR            1
.D2                   STH            1
.L2, .S2, or .D2      ADD            1
Total non-.M units                   3
However, the IIR has a data dependency constraint defined by its loop carry path. Figure 6-15 shows that if you schedule LDH yi on cycle 0:
- The earliest you can schedule MPY p2 is on cycle 5.
- The earliest you can schedule ADD s1 is on cycle 7.
- SHR yi+1 must be on cycle 8 and STH on cycle 9.
- Because the LDH must wait for the STH to be issued, the earliest the second iteration can begin is cycle 10.

To determine the minimum loop carry path, add all of the numbers along the loop paths in the dependency graph. This means that this loop carry path is 10 (5 + 2 + 1 + 1 + 1).
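The loop carry computation is a sum of edge latencies around the cycle in the dependency graph, and the minimum iteration interval is the larger of that sum and the resource bound. A minimal C sketch follows; the function names are hypothetical and the resource bound of 2 used in the usage note is an assumed value for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of the latencies along one loop carry path. */
int path_length(const int *latency, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        assert(latency[i] > 0);
        total += latency[i];
    }
    return total;
}

/* The iteration interval can be no smaller than either bound. */
int min_iteration_interval(int resource_bound, int loop_carry)
{
    return resource_bound > loop_carry ? resource_bound : loop_carry;
}

/* LDH yi -> MPY p2 -> ADD s1 -> SHR yi+1 -> STH -> LDH of the next iteration. */
int demo_iir_loop_carry(void)
{
    static const int path[] = {5, 2, 1, 1, 1};
    return path_length(path, sizeof path / sizeof path[0]);
}
```

demo_iir_loop_carry() returns 10, so min_iteration_interval(2, demo_iir_loop_carry()) is 10: the loop carry path, not the functional units, limits this version of the loop.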
Optimizing Assembly Code via Linear Assembly
Although the minimum iteration interval is the greater of the resource limits and data dependency constraints, an interval of 10 seems slow. Figure 6-16 shows how to improve the performance.
6.6.4.1
Figure 6-16. Dependency Graph of IIR Filter (With Smaller Loop Carry)
A side LDH xi 5 p0 2 LDH xi+1 5 p1 B side
MPY
MPY
MPY p2
ADD s0
2 1 ADD
1 s1 1
SUB 1 1 cntr
SHR
yi+1 1
STH
Loop
mem
Note:
6.6.4.2
Example 6-39. Linear Assembly for IIR Inner Loop With Reduced Loop Carry Path

        LDH    *xptr++,xi             ; xi
        MPY    c1,xi,p0               ; c1 * xi
        LDH    *xptr,xi+1             ; xi+1
        MPY    c2,xi+1,p1             ; c2 * xi+1
        ADD    p0,p1,s0               ; c1 * xi + c2 * xi+1
        MPY    c3,y,p2                ; c3 * yi
        ADD    s0,p2,s1               ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR    s1,15,y                ; yi+1
        STH    y,*yptr++              ; store yi+1
 [cntr] SUB    cntr,1,cntr            ; decrement loop counter
 [cntr] B      LOOP                   ; branch to loop
6.6.5
Example 6-40. Linear Assembly for IIR Inner Loop (With Allocated Resources)

        LDH    .D1    *A4++,A2        ; xi
        MPY    .M1    A6,A2,A5        ; c1 * xi
        LDH    .D1    *A4,A3          ; xi+1
        MPY    .M1X   B6,A3,A7        ; c2 * xi+1
        ADD    .L1    A5,A7,A9        ; c1 * xi + c2 * xi+1
        MPY    .M2X   A8,B2,B3        ; c3 * yi
        ADD    .L2X   B3,A9,B5        ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR    .S2    B5,15,B2        ; yi+1
        STH    .D2    B2,*B4++        ; store yi+1
 [A1]   SUB    .L1    A1,1,A1         ; decrement loop counter
 [A1]   B      .S1    LOOP            ; branch to loop
Table 6-17. Modulo Iteration Interval Table for IIR (4-Cycle Loop)
6.6.6
Note:
Unit/Cycle
Unit/Cycle
.M2
.M1
.M2
.M1
.D2
.D1
.D2
.D1
.S2
.S1
.S2
.S1
.L2
.L1
.L2
.L1
2X 1X 2X 1X
LDH xi
Table 6-17 shows the modulo iteration interval table for the IIR filter. The SHR instruction on cycle 10 finishes in time for the MPY p2 instruction from the next iteration to read its result on cycle 11.
B LOOP
SHR yi+1
* B LOOP
Unit/Cycle
Unit/Cycle
.M2
.M1
.M2
.M1
.D2
.D1
.D2
.D1
.S2
.S1
.S2
.S1
.L2
.L1
.L2
.L1
2X 1X 2X 1X
LDH xi+1
* LDH xi+1
SUB cntr
** LDH ci+1
STH yi+1
* SUB cntr
* MPY p2
6.6.7
        .trip 100
        LDH    .D1    *x++,xi         ; xi
        MPY    .M1    c1,xi,p0        ; c1 * xi
        LDH    .D1    *x,xi1          ; xi+1
        MPY    .M1X   c2,xi1,p1       ; c2 * xi+1
        ADD    .L1    p0,p1,s0        ; c1 * xi + c2 * xi+1
        MPY    .M2X   c3,yi1,p2       ; c3 * yi
        ADD    .L2X   s0,p2,s1        ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR    .S2    s1,15,yi1       ; yi+1
        STH    .D2    yi1,*y++        ; store yi+1
 [cntr] SUB    .L1    cntr,1,cntr     ; decrement loop counter
 [cntr] B      .S1    LOOP            ; branch to loop
        .endproc
6.6.8
Final Assembly
Example 6-42 shows the final assembly for the IIR filter. With one load of y[0] outside the loop, no other loads from the y array are needed. Example 6-42 requires 408 cycles: (4 × 100) + 8.
||
6.7.1
If-Then-Else C Code
Example 6-43 contains a loop with an if-then-else statement. You either add a[i] to sum or subtract a[i] from sum.
Branching is one way to execute the if-then-else statement: branch to the ADD when the if statement is true and branch to the SUB when the if statement is false. However, because each branch has five delay slots, this method requires additional cycles. Furthermore, branching within the loop makes software pipelining almost impossible. Using conditional instructions, on the other hand, eliminates the need to branch to the appropriate piece of code after checking whether the condition is true or false. Simply program both the ADD and SUB as usual, but make them conditional on the zero and nonzero values of a condition register. This method also allows you to software pipeline the loop and achieve much better performance than you would with branching.
6.7.2
CMPEQ is used to create IF. The ADD is conditional when IF is nonzero (corresponds to then); the SUB is conditional when IF is 0 (corresponds to else). A conditional MVK performs the !(!(cond)) C statement. If the result of the bitwise AND is nonzero, a 1 is written into cond; if the result of the AND is 0, cond remains at 0.
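The mapping from the branching form to conditional instructions can be illustrated in C. The sketch below is a reconstruction based on the operations named in this section (cond = codeword & mask, !(!(cond)), the comparison with theta, and mask << 1); the function names, the use of uint32_t, and the sample data are assumptions rather than the guide's own listing.

```c
#include <assert.h>
#include <stdint.h>

#define LOOP_COUNT 32

/* Branching form: an if-then-else inside the loop. */
static int if_then_branch(const short *a, uint32_t codeword, uint32_t mask, int theta)
{
    int sum = 0;
    for (int i = 0; i < LOOP_COUNT; i++) {
        uint32_t cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i];
        else
            sum -= a[i];
        mask <<= 1;
    }
    return sum;
}

/* Conditional form: each statement mirrors one instruction of the loop. */
static int if_then_conditional(const short *a, uint32_t codeword, uint32_t mask, int theta)
{
    int sum = 0;
    for (int i = 0; i < LOOP_COUNT; i++) {
        uint32_t cond = codeword & mask;       /* AND */
        if (cond) cond = 1;                    /* [cond] MVK 1,cond */
        int if_reg = (theta == (int)cond);     /* CMPEQ creates IF */
        if (if_reg)  sum += a[i];              /* [IF]  ADD */
        if (!if_reg) sum -= a[i];              /* [!IF] SUB */
        mask <<= 1;                            /* SHL */
    }
    return sum;
}

/* Runs one of the two forms on a[i] = i + 1 with mask starting at bit 0. */
int demo_sum(uint32_t codeword, int theta, int conditional)
{
    short a[LOOP_COUNT];
    for (int i = 0; i < LOOP_COUNT; i++)
        a[i] = (short)(i + 1);
    return conditional ? if_then_conditional(a, codeword, 1u, theta)
                       : if_then_branch(a, codeword, 1u, theta);
}
```

Both forms give the same sum for any codeword. In the conditional form the ADD and SUB are both present in every iteration and exactly one of them takes effect, which is what removes the branch from the loop body.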
6.7.3
Two nodes on the graph contain sum: one for the ADD and one for the SUB. Because some iterations are performing an ADD and others are performing a SUB, each of these nodes is a possible input to the next iteration of either node. The LDH ai instruction is a parent of both ADD sum and SUB sum, because both instructions read ai. CMPEQ if is also a parent to ADD sum and SUB sum, because both read IF for the conditional execution. The result of SHL mask is read on the next iteration by the AND cond instruction.
6.7.4
- LDH must be on a .D unit.
- SHL, B, and MVK must be on a .S unit.
- The ADDs and SUB can be on the .S, .L, or .D units.
- The AND can be on a .S or .L unit.

From Table 6-18, you can see that no one resource is used more than two times, so the minimum iteration interval is still 2.
(b) B side
Unit(s)               Instructions   Total/Unit
.M2                                  0
.S2                   MVK            1
.L2                   CMPEQ          1
.L2 or .S2            AND            1
.L2, .S2, or .D2      ADD            1
Total non-.M units                   4
The minimum iteration interval is also affected by the total number of instructions. Because three units can perform nonmultiply operations on a given side, a total of six instructions can be performed with a minimum iteration interval of 2. Because only four instructions are on the B side, the minimum iteration interval is still 2.
6.7.5
.trip 32 AND .S2X MVK .S2 CMPEQ .L2 LDH .D1 ADD .L1 SUB .D1 SHL .S1 ADD .L2 B .S1 .return .endproc sum
cond = codeword & mask !(!(cond)) (theta == !(!(cond))) a[i] sum += a[i] sum = a[i] mask = mask << 1; decrement counter for LOOP
6.7.6
Final Assembly
Example 6-46 shows the final assembly code after software pipelining. The performance of this loop is 70 cycles (2 × 32 + 6).
; decrement counter ;* !(!(cond)) ;** for LOOP ;** a[i] ; sum += a[i] ; sum = a[i] ;* (theta == !(!(cond))) ;** mask = mask << 1; ;** cond = codeword & mask
 [B1]   ADD    .L1    A7,A5,A7
||[!B1] SUB    .D1    A7,A5,A7
||      CMPEQ  .L2    B6,B2,B1
||      SHL    .S1    A6,1,A6
||      AND    .S2X   B4,A6,B2
        ; Branch occurs here
6.7.7
Comparing Performance
You can improve the performance of the code in Example 6-46 if you know that the loop count is at least 3. In that case, remove the decrement counter instructions outside the loop and put the MVK (for setting up the loop counter) in parallel with the first branch. These two changes save two cycles at the beginning of the loop prolog. The first two branches are now unconditional, because the loop count is at least 3 and you know that the first two branches must execute. To account for the removal of the three decrement-loop-counter instructions, set the loop counter to 3 fewer than the actual number of times you want the loop to execute: in this case, 29 (32 - 3).
Example 6-47. Assembly Code for If-Then-Else With Loop Count Greater Than 3
B LDH MVK SHL AND .S1 .D1 .S2 .S1 .S2X .S2 .S1 .D1 .L2 .S1 .S2X .L1 LOOP *A4++,A5 29,B0 A6,1,A6 B4,A6,B2 1,B2 LOOP *A4++,A5 B6,B2,B1 A6,1,A6 B4,A6,B2 A7 ; for LOOP ; a[i] ; set up loop counter ; mask = mask << 1; ; cond = codeword & mask ; !(!(cond)) ;* for LOOP ;* a[i] ; (theta == !(!(cond))) ;* mask = mask << 1; ;* cond = codeword & mask ; zero out accumulator
|| ||
||
; decrement counter ;* !(!(cond)) ;** for LOOP ;** a[i] ; sum += a[i] ; sum = a[i] ;* (theta == !(!(cond))) ;** mask = mask << 1; ;** cond = codeword & mask
 [B1]   ADD    .L1    A7,A5,A7
||[!B1] SUB    .D1    A7,A5,A7
||      CMPEQ  .L2    B6,B2,B1
||      SHL    .S1    A6,1,A6
||      AND    .S2X   B4,A6,B2
        ; Branch occurs here
Example 6-47 shows the improved loop with a cycle count of 68 (2 × 32 + 4). Table 6-19 compares the performance of Example 6-46 and Example 6-47.
Loop Unrolling
6.8.1
6.8.2
6.8.3
You cannot have more than nine non-.M instructions on either side. Only three non-.M instructions can execute per cycle.
Figure 6-18 shows the dependency graph for the unrolled if-then-else code. Nine instructions are on the A side, and seven instructions are on the B side.
1 SHL maski
sumi 1 1
sumi
6.8.4
LDH must be on a .D unit. SHL, B, and MVK must be on a .S unit. The ADDs and SUB can be on a .S, .L, or .D unit. The AND can be on a .S or .L unit.
From Table 6-20, you can see that no one resource is used more than three times, so the minimum iteration interval is still 3. Checking the total number of non-.M instructions on each side shows that a total of nine instructions can be performed with the minimum iteration interval of 3. Because only seven non-.M instructions are on the B side, the minimum iteration interval is still 3.
(b) B side
Unit(s)               Instructions      Total/Unit
.M2                                     0
.S2                   MVK and B         2
.L2                   CMPEQ             1
.L2 or .S2            AND               1
.L2, .S2, or .D2      SUB and 2 ADDs    3
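The per-side check on non-.M instructions can be written as a capacity calculation: each side has three non-.M units (.L, .S, and .D), so an iteration interval of n cycles offers 3n issue slots per side. The C sketch below is illustrative; the function names are hypothetical, and the per-unit bound is passed in rather than derived.

```c
#include <assert.h>

#define NON_M_UNITS_PER_SIDE 3  /* .L, .S, and .D */

/* Non-.M instructions one side can issue within `ii` cycles. */
int side_capacity(int ii)
{
    assert(ii > 0);
    return NON_M_UNITS_PER_SIDE * ii;
}

/* Smallest iteration interval that fits `count` non-.M instructions on one side. */
int min_ii_for_side(int count)
{
    assert(count >= 0);
    int ii = (count + NON_M_UNITS_PER_SIDE - 1) / NON_M_UNITS_PER_SIDE;
    return ii > 0 ? ii : 1;
}

/* Combine the per-unit bound with the totals for both sides. */
int min_ii(int a_side_count, int b_side_count, int per_unit_bound)
{
    int ii = per_unit_bound;
    if (min_ii_for_side(a_side_count) > ii) ii = min_ii_for_side(a_side_count);
    if (min_ii_for_side(b_side_count) > ii) ii = min_ii_for_side(b_side_count);
    return ii;
}
```

With nine non-.M instructions on the A side and seven on the B side, min_ii(9, 7, 3) stays at 3. A tenth instruction on either side would push the interval to 4, which is why no side may carry more than nine.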
6.8.5
cword, mask, theta, ifi, ifi1, a, ai, ai1, cntr cdi, cdi1, sumi, sumi1, sum A4,a B4,cword A6,mask B6,theta 16,cntr sumi sumi1 32 .L1X .S1 .L1X .D1 .L1 .D1 .S1 .L2X .S2 .L2 .D1 .L2 .D2 .S1 ; ; ; ; ; ; ; C callable register C callable register C callable register C callable register cntr = 32/2 sumi = 0 sumi+1 = 0 for for for for 1st 2nd 3rd 4th operand operand operand operand
[ifi] [!ifi]
; ; ; ; ; ; ;
cdi = codeword & maski !(!(cdi)) (theta == !(!(cdi))) a[i] sum += a[i] sum = a[i] maski+1 = maski << 1; cdi+1 = codeword & maski+1 !(!(cdi+1)) (theta == !(!(cdi+1))) a[i+1] sum += a[i+1] sum = a[i+1] maski = maski+1 << 1;
[cdi1]
[ifi1] [!ifi1]
[cntr] [cntr]
; decrement counter ; for LOOP ; Add sumi and sumi+1 for ret value
6.8.6
Final Assembly
Example 6-51 shows the final assembly code after software pipelining. The cycle count of this loop is now 53: (3 × 16) + 5.
||[B0] ||[B0] || ||
[A2] MVK || AND || ZERO [B2] || || || || LOOP: ||[B0] || ||[B0] || || CMPEQ ADD LDH B SHL AND MVK CMPEQ SHL LDH ZERO
; !(!(condi)) ; condi+1 = codeword & maski+1 ; zero accumulator ; !(!(condi+1)) ; (theta == !(!(condi))) ; maski = maski+1 << 1; ;* a[i] ; zero accumulator
.L2 .D2 .D1 .S2 .S1 .L1X .L1 .D1 .S1 .L2X
B6,B2,B1 1,B0,B0 *A4++,B5 LOOP A6,1,A6 B4,A6,A2 A7,A5,A7 A7,A5,A7 1,A2 B4,A6,B2
; (theta == !(!(condi+1))) ; decrement counter ;* a[i+1] ;* for LOOP ;* maski+1 = maski << 1; ;* condi = codeword & maski ; sum += a[i] ; sum = a[i] ;* !(!(condi)) ;* condi+1 = codeword & maski+1 ; sum += a[i+1] ; sum = a[i+1] ;* !(!(condi+1)) ;* (theta == !(!(condi))) ;* maski = maski+1 << 1; ;** a[i]
 [B1]   ADD    .L2    B7,B5,B7
||[!B1] SUB    .D2    B7,B5,B7
||[B2]  MVK    .S2    1,B2
||      CMPEQ  .L1X   B6,A2,A1
||      SHL    .S1    A6,1,A6
||      LDH    .D1    *A4++,A5
        ; Branch occurs here

        ADD    .L1X   A7,B7,A4
6.8.7
Comparing Performance
Table 6-21 compares the performance of all versions of the if-then-else code examples.
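The cycle counts quoted for the if-then-else versions all follow the pattern (iteration interval × loop trips) + overhead. A small C sketch of that arithmetic is given below; the function names are hypothetical, and the per-element figure is scaled by 100 to stay in integer arithmetic.

```c
#include <assert.h>

#define ELEMENTS 32  /* a[] entries processed by every version */

/* Total cycles for a software-pipelined loop. */
int loop_cycles(int iteration_interval, int trips, int overhead)
{
    assert(iteration_interval > 0 && trips > 0 && overhead >= 0);
    return iteration_interval * trips + overhead;
}

/* Loop-body cycles per array element, times 100. */
int body_cycles_per_element_x100(int iteration_interval, int trips)
{
    return 100 * iteration_interval * trips / ELEMENTS;
}
```

loop_cycles(2, 32, 6) and loop_cycles(2, 32, 4) reproduce the 70 and 68 cycles of the two software-pipelined versions, and loop_cycles(3, 16, 5) gives the 53 cycles of the unrolled version. The unrolled loop has a longer iteration interval but handles two elements per trip, so its body costs 1.5 cycles per element instead of 2.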
Live-Too-Long Issues
6.9.1
6.9.2
6.9.3
MPY
a0
Split-join path
2 SHR b1 1
2 b2 2
MPY
ADD b3 ADD 1
sum0
sum1
6.9.4
No specific resource is used more than twice, implying a minimum iteration interval of 2. A total of five non-.M units on each side also implies a minimum iteration interval of 2, because three non-.M units can be used on a side during each cycle.
(b) B side
Unit(s)               Instructions      Total/Unit
.M2                   MPY               1
.S2                   SHR               1
.D2                   LDH               1
.L2, .S2, or .D2      2 ADDs and SUB    3
Total non-.M units                      5
However, the minimum iteration interval is determined by both resources and data dependency. A loop carry path determined the minimum iteration interval of the IIR filter in section 6.6, Loop Carry Paths, on page 6-74. In this example, a live-too-long problem determines the minimum iteration interval.
6.9.4.1
Split-Join-Path Problems
In Figure 6-19, the two split-join paths from a0 to a3 and from b0 to b3 create the live-too-long problem. Because the ADD a3 instruction cannot be scheduled until the SHR a1 and MPY a2 instructions finish, a0 must be live for at least four cycles. For example:
- If MPY a0 is scheduled on cycle 5, then the earliest SHR a1 can be scheduled is cycle 7.
- The earliest MPY a2 can be scheduled is cycle 8.
- The earliest ADD a3 can be scheduled is cycle 10.
Because a0 is written at the end of cycle 6, it must be live from cycle 7 to cycle 10, or four cycles. No value can be live longer than the minimum iteration interval, because the next iteration of the loop will overwrite that value before the current iteration can read the value. Therefore, if the value has to be live for four cycles, the minimum iteration interval must be at least 4. A minimum iteration interval of 4 means that the loop executes at half the performance that it could based on available resources.
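The lifetime argument can be expressed as a short calculation. In the C sketch below, the function names are hypothetical, and the resource bound of 2 in the usage note is the value stated earlier in this section for this loop.

```c
#include <assert.h>

/* Cycles a value must stay in its register: from the cycle after it is
 * written through the cycle of its last read. */
int live_cycles(int issue_cycle, int delay_slots, int last_read_cycle)
{
    int written_at_end_of = issue_cycle + delay_slots;
    assert(last_read_cycle > written_at_end_of);
    return last_read_cycle - written_at_end_of;
}

/* No value may be live longer than the iteration interval. */
int min_ii_with_lifetime(int resource_bound, int longest_lifetime)
{
    return longest_lifetime > resource_bound ? longest_lifetime : resource_bound;
}
```

For a0, live_cycles(5, 1, 10) is 4: MPY a0 issues on cycle 5, its single delay slot means a0 is written at the end of cycle 6, and ADD a3 reads it on cycle 10. min_ii_with_lifetime(2, 4) therefore gives 4, twice the interval that the resources alone would allow.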
6.9.4.2
6.9.4.3
Inserting Moves
Another solution to the live-too-long problem is to break up the lifetime of a0 and b0 by inserting move (MV) instructions. The MV instruction breaks up the left path of the split-join path into two smaller pieces.
6.9.4.4
6.9.5
        .reg    ai, bi, sum0, sum1, sum
        .reg    a0p, a_0, a_1, a_2, a_3, b_0, b0p, b_1, b_2, b_3, cntr

        MVK     100,cntr               ; cntr = 100
        ZERO    sum0                   ; sum0 = 0
        ZERO    sum1                   ; sum1 = 0

LOOP:   .trip 100
        LDH   .D1    *a++,ai           ; load ai from memory
        LDH   .D2    *b++,bi           ; load bi from memory
        MPY   .M1    ai,c,a_0          ; a0 = ai * c
        SHR   .S1    a_0,15,a_1        ; a1 = a0 >> 15
        MPY   .M1X   a_1,d,a_2         ; a2 = a1 * d
        MV    .D1    a_0,a0p           ; save a0 across iterations
        ADD   .L1    a_2,a0p,a_3       ; a3 = a2 + a0
        ADD   .L1    sum0,a_3,sum0     ; sum0 += a3
        MPY   .M2X   bi,c,b_0          ; b0 = bi * c
        SHR   .S2    b_0,15,b_1        ; b1 = b0 >> 15
        MPY   .M2X   b_1,e,b_2         ; b2 = b1 * e
        MV    .D2    b_0,b0p           ; save b0 across iterations
        ADD   .L2    b_2,b0p,b_3       ; b3 = b2 + b0
        ADD   .L2    sum1,b_3,sum1     ; sum1 += b3
 [cntr] SUB   .S2    cntr,1,cntr       ; decrement loop counter
 [cntr] B     .S1    LOOP              ; branch to loop

        ADD          sum0,sum1,sum     ; Add sumi and sumi+1 for ret value
6.9.6
   [B2] SUB          ; decrement loop counter
|| [B2] B            ; branch to loop

        SHR          ; a1 = a0 >> 15
||      SHR          ; b1 = b0 >> 15
||      MPY          ;* a0 = ai * c
||      MPY          ;* b0 = bi * c
||      LDH          ;**** load ai from memory
||      LDH          ;**** load bi from memory

        MPY          ; a2 = a1 * d
||      MV           ; save a0 across iterations
||      MPY          ; b2 = b1 * e
||      MV           ; save b0 across iterations
|| [B2] SUB          ;* decrement loop counter
|| [B2] B            ;* branch to loop

        SHR          ;* a1 = a0 >> 15
||      SHR          ;* b1 = b0 >> 15
||      MPY          ;** a0 = ai * c
||      MPY          ;** b0 = bi * c
||      LDH          ;***** load ai from memory
||      LDH          ;***** load bi from memory
Example 6-55. Assembly Code for Live-Too-Long With Move Instructions (Continued)

LOOP:
        ADD   .L1    A7,A2,A9          ;* a3 = a2 + a0
||      ADD   .L2    B7,B8,B9          ;* b3 = b2 + b0
||      MPY   .M1X   A5,B6,A7          ;* a2 = a1 * d
||      MV    .D1    A3,A2             ;* save a0 across iterations
||      MPY   .M2X   B5,A8,B7          ;* b2 = b1 * e
||      MV    .D2    B10,B8            ;* save b0 across iterations
|| [B2] SUB   .S2    B2,1,B2           ;** decrement loop counter
|| [B2] B     .S1    LOOP              ;** branch to loop

        ADD   .L1    A1,A9,A1          ; sum0 += a3
||      ADD   .L2    B1,B9,B1          ; sum1 += b3
||      SHR   .S1    A3,15,A5          ;** a1 = a0 >> 15
||      SHR   .S2    B10,15,B5         ;** b1 = b0 >> 15
||      MPY   .M1    A0,A6,A3          ;*** a0 = ai * c
||      MPY   .M2X   B0,A6,B10         ;*** b0 = bi * c
||      LDH   .D1    *A4++,A0          ;****** load ai from memory
||      LDH   .D2    *B4++,B0          ;****** load bi from memory
        ; Branch occurs here

        ADD   .L1X   A1,B1,A4
One way to optimize this situation is to perform LDWs instead of LDHs to read two data values at a time. Although using LDW works for the h array, the x array presents a different problem because the C6x does not allow you to load values across a word boundary. For example, on the first outer loop (j = 0), you can read the x-array elements (0 and 1, 2 and 3, etc.) as long as elements 0 and 1 are aligned on a 4-byte word boundary. However, the second outer loop (j = 1) requires reading x-array elements 1 through 32. The LDW operation must load elements that are not word-aligned (1 and 2, 3 and 4, etc.).
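The alignment problem can be seen by computing the byte address of each pair of 16-bit elements that one LDW would have to fetch. The sketch below is illustrative only; the base address and function name are hypothetical, and the 32-tap length is taken from the "elements 1 through 32" case in the text.

```python
def ldw_pairs_aligned(base_addr, j, n_taps=32):
    """True if every pair (x[j+i], x[j+i+1]), i = 0, 2, 4, ... starts on a word boundary."""
    # Each element is 2 bytes wide; an LDW address must be a multiple of 4.
    return all((base_addr + 2 * (j + i)) % 4 == 0 for i in range(0, n_taps, 2))

base = 0x1000  # hypothetical word-aligned start of the x array
print(ldw_pairs_aligned(base, j=0))  # pairs (0,1), (2,3), ...
print(ldw_pairs_aligned(base, j=1))  # pairs (1,2), (3,4), ...
```

Every odd value of j shifts all pairs by one halfword, so none of them can be fetched with a single aligned word load.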
Figure 6-21. Dependency Graph of FIR Filter (With Redundant Load Elimination)
[Figure not reproduced. Nodes: LDH h0 (.D1), LDH x0 (.D1), MPY p00 (.M1), MPY p10 (.M1X), ADD sum0 (.L1 and .L1X), LDH h1 (.D2), LDH x1 (.D2), MPY p01 (.M2), MPY p11 (.M2X), ADD sum1 (.L2 and .L2X).]
(b) B side

Unit(s)               Instructions       Total/Unit
.M2                   2 MPYs             2
.S2                   B                  1
.D2                   2 LDHs             2
.L2, .S2, .D2         2 ADDs and SUB     3
Total non-.M units                       6
2X paths                                 2
[octr]
; x0 = x[j]
 [ctr]  SUB   .S2    ctr,1,ctr         ; decrement loop counter
 [ctr]  B     .S2    LOOP              ; branch to loop

        SHR          sum0,15,sum0      ; sum0 >> 15
        SHR          sum1,15,sum1      ; sum1 >> 15
        STH          sum0,*y++         ; y[j] = sum0 >> 15
        STH          sum1,*y++         ; y[j+1] = sum1 >> 15
        SUB          x,rstx,x          ; reset x pointer to x[j]
        SUB          h_1,rsth,h_1      ; reset h pointer to h[0]
 [octr] B            OUTLOOP           ; branch to outer loop
2 + 9 + 6) + 2.
Nine cycles execute the inner loop prolog. Six cycles execute the branch to the outer loop.
See section 6.12, Software Pipelining the Outer Loop, on page 6-128 for information on how to reduce this overhead.
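The structure of the cycle count can be written out as a small formula. The sketch below is an illustration under assumptions: the prolog (9 cycles), outer-loop branch (6 cycles), and the final 2 cycles come from the text above, while the 50 outer iterations and the 16 iterations of the 2-cycle inner loop are assumed values, not confirmed by the fragment of the expression that survives here.

```python
def fir_cycles(outer_iters, inner_iters, ii, prolog, outer_branch, extra):
    """Total cycles for a software-pipelined inner loop called from an outer loop."""
    per_outer = inner_iters * ii + prolog + outer_branch
    return outer_iters * per_outer + extra

# Assumed: 50 outer iterations, 16 inner iterations of a 2-cycle loop.
total = fir_cycles(outer_iters=50, inner_iters=16, ii=2,
                   prolog=9, outer_branch=6, extra=2)
print(total)
```

Under these assumptions the 15 cycles of prolog and branch overhead account for roughly a third of each outer-loop pass, which is the cost that pipelining the outer loop is meant to reduce.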
Example 6-60. Final Assembly Code for FIR Filter With Redundant Load Elimination
; set up outer loop counter ; used to rst x ptr outer loop ; used to rst h ptr outer loop
|| OUTLOOP: || || || || ||[A2]
LDH ADD ADD ADD MVK SUB LDH LDH ZERO ZERO LDH LDH LDH LDH SUB LDH LDH B LDH LDH MPY SUB LDH LDH MPY MPY B LDH LDH MPY MPY SUB LDH LDH
.D1 .L2X .D2 .L1X .S2 .S1 .D1 .D2 .L1 .L2 .D2 .D1 .D1 .D2 .S2 .D2 .D1 .S2 .D1 .D2 .M1 .S2 .D2 .D1 .M2 .M1X .S2 .D1 .D2 .M2X .M1 .S2 .D2 .D1
*A4++[2],A0 A4,2,B5 B4,2,B4 B4,0,A5 16,B2 A2,1,A2 *A5++[2],A1 *B5++[2],B1 A9 B9 *B4++[2],B0 *A4++[2],A0 *A5++[2],A1 *B5++[2],B1 B2,1,B2 *B4++[2],B0 *A4++[2],A0 LOOP *A5++[2],A1 *B5++[2],B1 A0,A1,A7 B2,1,B2 *B4++[2],B0 *A4++[2],A0 B1,B0,B7 B1,A1,A8 LOOP *A5++[2],A1 *B5++[2],B1 A0,B0,B8 A0,A1,A7 B2,1,B2 *B4++[2],B0 *A4++[2],A0
; ; ; ; ; ; ; ; ; ;
x0 = x[j] set up pointer to x[j+1] set up pointer to h[1] set up pointer to h[0] set up inner loop counter decrement outer loop counter h0 = x1 = zero zero h[i] x[j+i+1] out sum0 out sum1
|| || ||
||
; h1 = h[i+1] ; x0 = x[j+i+2] ;* h0 = h[i] ;* x1 = x[j+i+1] ; decrement inner loop counter ;* h1 = h[i+1] ;* x0 = x[j+i+2] ; branch to inner loop ;** h0 = h[i] ;** x1 = x[j+i+1]
|| [B2] || || [B2] || ||
||[B2] || ||
|| ||[B2] || ||
|| ||[B2] || ||
LOOP: || || || ||[B2] || || ADD ADD MPY MPY B LDH LDH .L2X .L1 .M2 .M1X .S2 .D1 .D2 A8,B9,B9 A7,A9,A9 B1,B0,B7 B1,A1,A8 LOOP *A5++[2],A1 *B5++[2],B1 B7,A9,A9 B8,B9,B9 A0,B0,B8 A0,A1,A7 B2,1,B2 *B4++[2],B0 *A4++[2],A0 branch occurs OUTLOOP A4,A3,A4 B4,B6,B4 A9,15,A9 B9,15,B9 A9,*A6++ B9,*A6++ ; sum1 += x1 * h0 ; sum0 += x0 * h0 ;* x1 * h1 ;* x1 * h0 ;** branch to inner loop ;**** h0 = h[i] ;**** x1 = x[j+i+1] ; sum0 += x1 * h1 ; sum1 += x0 * h1 ;* x0 * h1 ;** x0 * h0 ;*** decrement inner loop cntr ;**** h1 = h[i+1] ;**** x0 = x[j+i+2] here ; branch to outer loop ; reset x pointer to x[j] ; reset h pointer to h[0] ; sum0 >> 15 ; sum1 >> 15 ; y[j] = sum0 >> 15 ; y[j+1] = sum1 >> 15
1
|| || || ||[B2] || ||
ADD .L1X ADD .L2 MPY .M2X MPY .M1 SUB .S2 LDH .D2 LDH .D1 ; inner loop .S1 .L1 .L2 .S1 .S2 .D1 .D1
||
3 4 5 6
Memory Banks
Bank      Byte addresses
Bank 0    8N,     8N + 1
Bank 1    8N + 2, 8N + 3
Bank 2    8N + 4, 8N + 5
Bank 3    8N + 6, 8N + 7
For devices that have more than one memory block (see Figure 6-23), an access to bank 0 in one block does not interfere with an access to bank 0 in another memory block, and no pipeline stall occurs.
Bank      First memory block      Second memory block
Bank 0    8N,     8N + 1          8M,     8M + 1
Bank 1    8N + 2, 8N + 3          8M + 2, 8M + 3
Bank 2    8N + 4, 8N + 5          8M + 4, 8M + 5
Bank 3    8N + 6, 8N + 7          8M + 6, 8M + 7
If each array in a loop resides in a separate memory block, the 2-cycle loop in Example 6-57 on page 6-108 is sufficient. This section describes a solution when two arrays must reside in the same memory block.
Example 6-61. Final Assembly Code for Inner Loop of FIR Filter

LOOP:
        ADD   .L2X   A8,B9,B9          ; sum1 += x1 * h0
||      ADD   .L1    A7,A9,A9          ; sum0 += x0 * h0
||      MPY   .M2    B1,B0,B7          ;* x1 * h1
||      MPY   .M1X   B1,A1,A8          ;* x1 * h0
|| [B2] B     .S2    LOOP              ;** branch to inner loop
||      LDH   .D1    *A5++[2],A1       ;**** h0 = h[i]
||      LDH   .D2    *B5++[2],B1       ;**** x1 = x[j+i+1]

        ADD   .L1X   B7,A9,A9          ; sum0 += x1 * h1
||      ADD   .L2    B8,B9,B9          ; sum1 += x0 * h1
||      MPY   .M2X   A0,B0,B8          ;* x0 * h1
||      MPY   .M1    A0,A1,A7          ;** x0 * h0
|| [B2] SUB   .S2    B2,1,B2           ;*** decrement inner loop cntr
||      LDH   .D2    *B4++[2],B0       ;**** h1 = h[i+1]
||      LDH   .D1    *A4++[2],A0       ;**** x0 = x[j+i+2]
It is not always possible to fully control how arrays are aligned, especially if one of the arrays is passed into a function as a pointer and that pointer has different alignments each time the function is called. One solution to this problem is to write an FIR filter that avoids memory hits, regardless of the x and h array alignments. If accesses to the even and odd elements of an array (h or x) are scheduled on the same cycle, the accesses are always on adjacent memory banks. Thus, to write an FIR filter that never has memory hits, even and odd elements of the same array must be scheduled on the same loop cycle.
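The claim about adjacent banks follows from the interleaving shown earlier, where each bank holds two consecutive bytes and four banks repeat every 8 bytes. The sketch below assumes that layout and halfword-aligned arrays; the function names and addresses are hypothetical.

```python
BANKS = 4
BANK_WIDTH_BYTES = 2

def bank_of(byte_addr):
    """Bank that holds the 16-bit element starting at byte_addr."""
    return (byte_addr // BANK_WIDTH_BYTES) % BANKS

def even_odd_pair_never_conflicts(max_base=64, n_elements=32):
    """Check every halfword alignment: element k and element k+1 never share a bank."""
    for base in range(0, max_base, 2):
        for k in range(n_elements):
            if bank_of(base + 2 * k) == bank_of(base + 2 * (k + 1)):
                return False
    return True

print(even_odd_pair_never_conflicts())
# Elements of two different arrays in the same block can still collide:
print(bank_of(0x100) == bank_of(0x208))
```

The first check holds for any alignment, which is why pairing accesses within one array is safe, while pairing an x access with an h access depends on where the caller placed the arrays.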
In the case of the FIR filter, scheduling the even and odd elements of the same array on the same loop cycle cannot be done in a 2-cycle loop, as shown in Figure 6-24. In this example, a valid 2-cycle software-pipelined loop that avoids memory hits must satisfy the following constraints:
    LDH h0 and LDH h1 are on the same loop cycle.
    LDH x0 and LDH x1 are on the same loop cycle.
    MPY p00 must be scheduled three or four cycles after LDH x0, because it must read x0 from the previous iteration of LDH x0.
    All MPYs must be five or six cycles after their LDH parents.
    No MPYs on the same side (A or B) can be on the same loop cycle.
Figure 6-24. Dependency Graph of FIR Filter (With Even and Odd Elements of Each Array on Same Loop Cycle)
[Figure not reproduced. In the schedule shown, MPY p00, MPY p10, and MPY p01 fall on cycle 6 and MPY p11 falls on cycle 7.]
Note: Numbers in bold represent the cycle the instruction is scheduled on.
The scenario in Figure 6-24 almost works. All nodes satisfy the above constraints except MPY p10. Because one parent is on cycle 1 (LDH h0) and another on cycle 0 (LDH x1), the only cycle for MPY p10 is cycle 6. However, another MPY on the A side is also scheduled on cycle 6 (MPY p00). Other combinations of cycles for this graph produce similar results.
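The statement that other combinations fail as well can be checked by exhaustive search. The sketch below encodes the listed constraints directly; it assumes that MPY p00 and MPY p10 share the A side, that MPY p01 and MPY p11 share the B side, and that p01 reads x1 and h1 while p11 reads x0 and h1. The function name and the search horizon are hypothetical.

```python
from itertools import product

II = 2  # iteration interval of the loop being attempted

def feasible_schedules(same_loop_cycle=True, horizon=4):
    """Enumerate load/multiply cycle assignments that meet the constraints."""
    found = []
    for h0, h1, x0, x1 in product(range(horizon), repeat=4):
        if same_loop_cycle and (h0 % II != h1 % II or x0 % II != x1 % II):
            continue
        # Cycles allowed for each MPY: 5 or 6 after its LDH parents,
        # except p00, which reads x0 from the previous iteration (3 or 4 after).
        p00_ok = {h0 + 5, h0 + 6} & {x0 + 3, x0 + 4}
        p10_ok = {h0 + 5, h0 + 6} & {x1 + 5, x1 + 6}
        p01_ok = {h1 + 5, h1 + 6} & {x1 + 5, x1 + 6}
        p11_ok = {h1 + 5, h1 + 6} & {x0 + 5, x0 + 6}
        for p00, p10, p01, p11 in product(p00_ok, p10_ok, p01_ok, p11_ok):
            # MPYs on the same side must land on different loop cycles.
            if p00 % II != p10 % II and p01 % II != p11 % II:
                found.append((h0, h1, x0, x1, p00, p10, p01, p11))
    return found

print(len(feasible_schedules(same_loop_cycle=True)))
print(len(feasible_schedules(same_loop_cycle=False)) > 0)
```

Under these assumptions the search finds no schedule when both load pairs must share a loop cycle, while dropping that requirement leaves valid 2-cycle schedules. This is consistent with unrolling the loop being the way forward.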
6.11.5 Linear Assembly for Unrolled FIR Inner Loop With .mptr Directive
Example 6-64 shows the unrolled FIR inner loop with the .mptr directive. The .mptr directive allows the assembly optimizer to automatically determine if two memory operations have a bank conflict by associating memory access information with a specific pointer register. If the assembly optimizer determines that two memory operations have a bank conflict, then it will not schedule them in parallel. The .mptr directive tells the assembly optimizer that when the specified register is used as a memory pointer in a load or store instruction, it is initialized to point at a base location + <offset>, and is incremented a number of times each time through the loop. Without the .mptr directives, the loads of x1 and h0 are scheduled in parallel, and the loads of x2 and h1 are scheduled in parallel. This results in a 50% chance of a memory conflict on every cycle.
[octr]
Example 6-64. Linear Assembly for Full Unrolled FIR Filter (Continued)

LOOP:   .trip 8
        LDH   .D1    *x_1++[2],x1      ; x1 = x[j+i+1]
        LDH   .D1    *h++[2],h0        ; h0 = h[i]
        MPY   .M1X   x0,h0,p00         ; x0 * h0
        MPY   .M1    x1,h0,p10         ; x1 * h0
        ADD   .L1    p00,sum0,sum0     ; sum0 += x0 * h0
        ADD   .L2X   p10,sum1,sum1     ; sum1 += x1 * h0

        LDH   .D2    *x++[2],x2        ; x2 = x[j+i+2]
        LDH   .D2    *h_1++[2],h1      ; h1 = h[i+1]
        MPY   .M2X   x1,h1,p01         ; x1 * h1
        MPY   .M2    x2,h1,p11         ; x2 * h1
        ADD   .L1X   p01,sum0,sum0     ; sum0 += x1 * h1
        ADD   .L2    p11,sum1,sum1     ; sum1 += x2 * h1

        LDH   .D1    *x_1++[2],x3      ; x3 = x[j+i+3]
        LDH   .D1    *h++[2],h2        ; h2 = h[i+2]
        MPY   .M1X   x2,h2,p02         ; x2 * h2
        MPY   .M1    x3,h2,p12         ; x3 * h2
        ADD   .L1    p02,sum0,sum0     ; sum0 += x2 * h2
        ADD   .L2X   p12,sum1,sum1     ; sum1 += x3 * h2

        LDH   .D2    *x++[2],x0        ; x0 = x[j+i+4]
        LDH   .D2    *h_1++[2],h3      ; h3 = h[i+3]
        MPY   .M2X   x3,h3,p03         ; x3 * h3
        MPY   .M2    x0,h3,p13         ; x0 * h3
        ADD   .L1X   p03,sum0,sum0     ; sum0 += x3 * h3
        ADD   .L2    p13,sum1,sum1     ; sum1 += x0 * h3

 [ctr]  SUB   .S2    ctr,1,ctr         ; decrement loop counter
 [ctr]  B     .S2    LOOP              ; branch to loop

                                       ; sum0 >> 15
                                       ; sum1 >> 15
                                       ; y[j] = sum0 >> 15
                                       ; y[j+1] = sum1 >> 15
                                       ; reset x pointer to x[j]
                                       ; reset h pointer to h[0]
                                       ; branch to outer loop
[octr]
If a value is written at the end of cycle 0 and read on cycle 2 of the loop, it is live for two cycles (cycles 1 and 2 of the loop). If another value is written at the end of cycle 2 and read on cycle 0 (the next iteration) of the loop, it is also live for two cycles (cycles 3 and 0 of the loop).
Because these two values are never live on the same cycles, they can occupy the same register. Only after scheduling these instructions and their children do you know that they can occupy the same register. Register allocation is not complicated but can be tedious when done by hand. Each value has to be analyzed for its lifetime and then appropriately combined with other values not live on the same cycles in the loop. The assembly optimizer handles this automatically after it software pipelines the loop. See the TMS320C6x Optimizing C Compiler User's Guide for more information.
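The lifetime comparison described above can be expressed as a small check on loop cycles taken modulo the iteration interval. This is a sketch only; it assumes the 4-cycle loop implied by the cycle numbers in the text, and the function names are hypothetical.

```python
def live_loop_cycles(written_end_of, read_on, ii):
    """Loop cycles (modulo ii) during which a value occupies its register."""
    span = (read_on - written_end_of) % ii or ii
    return {(written_end_of + k) % ii for k in range(1, span + 1)}

def can_share_register(value_a, value_b, ii):
    """Two values can share a register if they are never live on the same loop cycle."""
    return not (live_loop_cycles(*value_a, ii) & live_loop_cycles(*value_b, ii))

first = (0, 2)    # written at the end of cycle 0, read on cycle 2
second = (2, 0)   # written at the end of cycle 2, read on cycle 0 of the next iteration
print(sorted(live_loop_cycles(*first, 4)), sorted(live_loop_cycles(*second, 4)))
print(can_share_register(first, second, 4))
```

Doing this by hand means repeating the comparison for every pair of values in the loop, which is the tedious part the assembly optimizer takes over.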
(b) B side

Unit(s)               Instructions       Total/Unit
.M2                   4 MPYs             4
.S2                   B                  1
.D2                   4 LDHs             4
.L2, .S2, or .D2      4 ADDs and SUB     5
Total non-.M units                       10
2X paths                                 4
Example 6-65. Final Assembly Code for FIR Filter With Redundant Load Elimination and No Memory Hits

        MVK   .S1    50,A2             ; set up outer loop counter
        MVK   .S1    62,A3             ; used to rst x pointer outloop
        MVK   .S2    64,B10            ; used to rst h pointer outloop
|| OUTLOOP: || || || ||[A2] || || ||
LDH ADD ADD MVK SUB LDH LDH ZERO ZERO LDH LDH LDH LDH LDH LDH SUB LDH LDH LDH LDH MPY MPY LDH LDH B MPY MPY LDH LDH SUB ADD MPY MPY LDH LDH
.D1 .L2X .L1X .S2 .S1 .D2 .D1 .L1 .L2 .D1 .D2 .D1 .D2 .D2 .D1 .S2 .D2 .D1 .D1 .D2 .M1X .M2X .D1 .D2 .S1 .M2 .M1 .D2 .D1 .S2 .L1 .M2X .M1X .D2 .D1
*A4++,B5 ; x0 = x[j] A4,4,B1 ; set up pointer to x[j+2] B4,2,A8 ; set up pointer to h[1] 8,B2 ; set up inner loop counter A2,1,A2 ; decrement outer loop counter *B1++[2],B0 *A4++[2],A0 A9 B9 *A8++[2],B6 *B4++[2],A1 *A4++[2],A5 *B1++[2],B5 *B4++[2],A7 *A8++[2],B8 B2,1,B2 *B1++[2],B0 *A4++[2],A0 *A8++[2],B6 *B4++[2],A1 B5,A1,A0 A0,B6,B6 *A4++[2],A5 *B1++[2],B5 LOOP B0,B6,B7 A0,A1,A1 *B4++[2],A7 *A8++[2],B8 B2,1,B2 A0,A9,A9 A5,B8,B8 B0,A7,A5 *B1++[2],B0 *A4++[2],A0 ; ; ; ; x2 = x1 = zero zero x[j+i+2] x[j+i+1] out sum0 out sum1
||
; h1 = h[i+1] ; h0 = h[i] ; x3 = x[j+i+3] ; x0 = x[j+i+4] ; h2 = h[i+2] ; h3 = h[i+3] ; decrement loop counter ;* x2 = x[j+i+2] ;* x1 = x[j+i+1] ;* h1 = h[i+1] ;* h0 = h[i] ; x0 * h0 ; x1 * h1 ;* x3 = x[j+i+3] ;* x0 = x[j+i+4] ; branch to loop ; x2 * h1 ; x1 * h0 ;* h2 = h[i+2] ;* h3 = h[i+3] ;* decrement loop counter ; sum0 ; x3 * ; x2 * ;** x2 ;** x1 += x0 * h0 h3 h2 = x[j+i+2] = x[j+i+1]
||
|| ||[B2]
||
||
|| || || [B2] || || || || ||[B2]
|| || || ||
LOOP: || || || ||[B2] ||[B2] ADD ADD MPY MPY LDH LDH ADD ADD MPY MPY LDH LDH ADD ADD B MPY MPY LDH LDH SUB .L2X .L1X .M2 .M1 .D1 .D2 .L2 .L1 .M1X .M2X .D1 .D2 .L2X .L1X .S1 .M2 .M1 .D2 .D1 .S2 A1,B9,B9 B6,A9,A9 B5,B8,B7 A5,A7,A7 *A8++[2],B6 *B4++[2],A1 B7,B9,B9 A5,A9,A9 B5,A1,A0 A0,B6,B6 *A4++[2],A5 *B1++[2],B5 A7,B9,B9 B8,A9,A9 LOOP B0,B6,B7 A0,A1,A1 *B4++[2],A7 *A8++[2],B8 B2,1,B2 ; sum1 ; sum0 ; x0 * ; x3 * ;** h1 ;** h0 += x1 * h0 += x1 * h1 h3 h2 = h[i+1] = h[i]
|| || || ||[B2] ||[B2]
; sum1 += x2 * h1 ; sum0 += x2 * h2 ;* x0 * h0 ;* x1 * h1 ;** x3 = x[j+i+3] ;** x0 = x[j+i+4] ; sum1 += x3 * h2 ; sum0 += x3 * h3 ;* branch to loop ;* x2 * h1 ;* x1 * h0 ;** h2 = h[i+2] ;** h3 = h[i+3] ;** decrement loop counter ; sum1 += x0 * h3 ;* sum0 += x0 * h0 ;* x3 * h3 ;* x2 * h2 ;*** x2 = x[j+i+2] ;*** x1 = x[j+i+1]
|| || || ||[B2] ||[B2]
ADD .L2 B7,B9,B9 ADD .L1 A0,A9,A9 MPY .M2X A5,B8,B8 MPY .M1X B0,A7,A5 LDH .D2 *B1++[2],B0 LDH .D1 *A4++[2],A0 ; inner loop branch occurs here B SUB SUB SUB SHR SHR STH STH .S2 .L1 .L2 .S1 .S1 .S2 .D1 .D1 OUTLOOP A4,A3,A4 B4,B10,B4 A9,A0,A9 A9,15,A9 B9,15,B9 A9,*A6++ B9,*A6++
[A2] || || ||
; ; ; ;
branch to outer loop reset x pointer to x[j] reset h pointer to h[0] sum0 = x0*h0 (eliminate add)
||
; sum0 >> 15 ; sum1 >> 15 ; y[j] = sum0 >> 15 ; y[j+1] = sum1 >> 15 ; branch delay slots
6.12.2 Making the Outer Loop Parallel With the Inner Loop Epilog and Prolog
The final assembly code for the FIR filter with redundant load elimination and no memory hits (shown in Example 6-65 on page 6-126) contained 16 cycles of overhead to call the inner loop every time: ten cycles for the loop prolog and six cycles for the outer loop instructions and branching to the outer loop. Most of this overhead can be reduced as follows:
    Put the outer loop and branch instructions in parallel with the prolog.
    Create an epilog to the inner loop.
    Put some outer loop instructions in parallel with the inner-loop epilog.
Example 6-67 shows the final assembly for the FIR filter with a software-pipelined outer loop. Below the inner loop (starting on page 6-131), each instruction is marked in the comments with an e, p, or o for instructions relating to epilog, prolog, or outer loop, respectively. The inner loop is now run only seven times, because the eighth iteration is done in the epilog in parallel with the prolog of the next inner loop and the outer loop instructions.
Example 6-67. Final Assembly Code for FIR Filter With Redundant Load Elimination and No Memory Hits With Outer Loop Software-Pipelined
MVK STW MVK MVK ADD LDH ADD ADD MVK SUB LDH LDH ZERO ZERO LDH LDH LDH LDH .S1 .D2 .S1 .S2 .L2X .D1 .L2X .L1X .S2 .S1 .D2 .D1 .L1 .L2 .D1 .D2 .D1 .D2 50,A2 B11,*B15 74,A3 72,B10 A6,2,B11 *A4++,B8 A4,4,B1 B4,2,A8 8,B2 A2,1,A2 *B1++[2],B0 *A4++[2],A0 A9 B9 *A8++[2],B6 *B4++[2],A1 *A4++[2],A5 *B1++[2],B5 ; set up outer loop counter ; ; ; ; ; ; ; ; ; ; ; ; ; push register used to rst x ptr outer loop used to rst h ptr outer loop set up pointer to y[1] x0 = x[j] set up pointer to x[j+2] set up pointer to h[1] set up inner loop counter decrement outer loop counter x2 = x1 = zero zero x[j+i+2] x[j+i+1] out sum0 out sum1
|| || ||
|| || || ||[A2]
|| || ||
||
|| OUTLOOP: || ||[B2]
LDH LDH SUB LDH LDH LDH LDH MPY MPY LDH LDH B MPY MPY LDH LDH SUB
.D2 .D1 .S2 .D2 .D1 .D1 .D2 .M1X .M2X .D1 .D2 .S1 .M2 .M1 .D2 .D1 .S2
*B4++[2],A7 *A8++[2],B8 B2,2,B2 *B1++[2],B0 *A4++[2],A0 *A8++[2],B6 *B4++[2],A1 B8,A1,A0 A0,B6,B6 *A4++[2],A5 *B1++[2],B5 LOOP B0,B6,B7 A0,A1,A1 *B4++[2],A7 *A8++[2],B8 B2,1,B2
; h2 = h[i+2] ; h3 = h[i+3] ; decrement loop counter ;* x2 = x[j+i+2] ;* x1 = x[j+i+1] ;* h1 = h[i+1] ;* h0 = h[i] ; x0 * h0 ; x1 * h1 ;* x3 = x[j+i+3] ;* x0 = x[j+i+4] ; branch to loop ; x2 * h1 ; x1 * h0 ;* h2 = h[i+2] ;* h3 = h[i+3] ;* decrement loop counter
||
||
|| || || [B2] || || || || ||[B2]
|| || || || LOOP: || || || || || ADD ADD MPY MPY LDH LDH ADD ADD MPY MPY LDH LDH ADD ADD B MPY MPY LDH LDH SUB .L2X .L1X .M2 .M1 .D1 .D2 .L2 .L1 .M1X .M2X .D1 .D2 .L2X .L1X .S1 .M2 .M1 .D2 .D1 .S2 A1,B9,B9 B6,A9,A9 B5,B8,B7 A5,A7,A7 *A8++[2],B6 *B4++[2],A1 B7,B9,B9 A5,A9,A9 B5,A1,A0 A0,B6,B6 *A4++[2],A5 *B1++[2],B5 A7,B9,B9 B8,A9,A9 LOOP B0,B6,B7 A0,A1,A1 *B4++[2],A7 *A8++[2],B8 B2,1,B2 ; sum1 ; sum0 ; x0 * ; x3 * ;** h1 ;** h0 += x1 * h0 += x1 * h1 h3 h2 = h[i+1] = h[i] ADD MPY MPY LDH LDH .L1 .M2X .M1X .D2 .D1 A0,A9,A9 A5,B8,B8 B0,A7,A5 *B1++[2],B0 *A4++[2],A0 ; sum0 ; x3 * ; x2 * ;** x2 ;** x1 += x0 * h0 h3 h2 = x[j+i+2] = x[j+i+1]
|| || || || ||
; sum1 += x2 * h1 ; sum0 += x2 * h2 ;* x0 * h0 ;* x1 * h1 ;** x3 = x[j+i+3] ;** x0 = x[j+i+4] ; sum1 += x3 * h2 ; sum0 += x3 * h3 ;* branch to loop ;* x2 * h1 ;* x1 * h0 ;** h2 = h[i+2] ;** h3 = h[i+3] ;** decrement loop counter ; sum1 += x0 * h3 ;* sum0 += x0 * h0 ;* x3 * h3 ;* x2 * h2 ;*** x2 = x[j+i+2] ;*** x1 = x[j+i+1]
|| ||[B2] || || || || ||[B2]
|| || || || ||
ADD .L2 B7,B9,B9 ADD .L1 A0,A9,A9 MPY .M2X A5,B8,B8 MPY .M1X B0,A7,A5 LDH .D2 *B1++[2],B0 LDH .D1 *A4++[2],A0 ; inner loop branch occurs here ADD ADD MPY MPY SUB SUB B .L2X .L1X .M2 .M1 .D1 .D2 .S1 A1,B9,B9 B6,A9,A9 B5,B8,B7 A5,A7,A7 A4,A3,A4 B4,B10,B4 OUTLOOP
|| || || || || ||[A2]
;e ;e ;e ;e ;o ;o ;o
sum1 += x1 * h0 sum0 += x1 * h1 x0 * h3 x3 * h2 reset x pointer to x[j] reset h pointer to h[0] branch to outer loop
|| || || || || ADD ADD LDH ADD ADD MVK ADD ADD LDH LDH SUB ADD SHR LDH LDH SHR LDH LDH .D2 .L1 .D1 .L2X .S1X .S2 .L2X .L1X .D2 .D1 .S1 .L2 .S1 .D1 .D2 .S2 .D1 .D2 B7,B9,B9 A5,A9,A9 *A4++,B8 A4,4,B1 B4,2,A8 8,B2 A7,B9,B9 B8,A9,A9 *B1++[2],B0 *A4++[2],A0 A2,1,A2 B7,B9,B9 A9,15,A9 *A8++[2],B6 *B4++[2],A1 B9,15,B9 *A4++[2],A5 *B1++[2],B5 ;e ;e ;p ;o ;o ;o ;e ;e ;p ;p ;o ;e ;e ;p ;p sum1 += x2 * h1 sum0 += x2 * h2 x0 = x[j] set up pointer to x[j+2] set up pointer to h[1] set up inner loop counter sum1 += x3 * h2 sum0 += x3 * h3 x2 = x[j+i+2] x1 = x[j+i+1] decrement outer loop counter sum1 sum0 h1 = h0 = += x0 * h3 >> 15 h[i+1] h[i]
|| || || ||[A2]
|| || ||
|| ||
;e sum1 >> 15 ;p x3 = x[j+i+3] ;p x0 = x[j+i+4] ;e ;e ;o ;o y[j] = sum0 >> 15 y[j+1] = sum1 >> 15 zero out sum0 zero out sum1
|| || ||
STH .D1 A9,*A6++[2] STH .D2 B9,*B11++[2] ZERO .S1 A9 ZERO .S2 B9 ; outer loop branch occurs here
 [sctr]   SUB
 [!sctr]  SHR
 [!sctr]  SHR
 [!sctr]  STH
 [!sctr]  STH
 [!sctr]  MVK
 [pctr]   SUB
 [!pctr]  SUB
 [!pctr]  SUB
 [!pctr]  SUB
 [!pctr]  SUB
 [!pctr]  MVK
The resetting of the x and h pointers is conditional on the pointer reset counter, pctr. The shifting and storing of the even and odd y elements are conditional on the store counter, sctr.
When these counters are 0, all of the instructions that are conditional on that value execute. The MVK instruction resets the counters to 8 because after every eight iterations of the loop, a new inner loop is completed (8 × 4 elements are processed).
The pointer reset counter becomes 0 first to reset the load pointers, then the store counter becomes 0 to shift and store the result.
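The interplay of the two counters can be traced with a short simulation of the conditional instructions, executed in linear-assembly order. This is a sketch, not the scheduled code: the initial values (4 for the pointer reset counter, 5 for the store counter, 201 for the loop counter, reload value 4) are assumptions taken from the setup code of the linear assembly with functional units later in this section, and the function name is hypothetical.

```python
def simulate_counters(octr=201, pctr=4, sctr=5, reload=4):
    """Return the loop iterations on which pointers are reset and results stored."""
    resets, stores = [], []
    iteration = 0
    while True:
        iteration += 1
        if sctr:                 # [sctr]  SUB sctr,1,sctr
            sctr -= 1
        if not sctr:             # [!sctr] SHR/STH, then MVK reloads sctr
            stores.append(iteration)
            sctr = reload
        if pctr:                 # [pctr]  SUB pctr,1,pctr
            pctr -= 1
        if not pctr:             # [!pctr] pointer resets, then MVK reloads pctr
            resets.append(iteration)
            pctr = reload
        if octr:                 # [octr]  SUB octr,1,octr
            octr -= 1
        if not octr:             # [octr]  B LOOP is not taken once octr is 0
            break
    return resets, stores

resets, stores = simulate_counters()
print(resets[:3], stores[:3], len(resets), len(stores))
```

With these values each pointer reset lands one iteration before the matching store, which mirrors the ordering described above; every store event writes one even and one odd y element.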
LDWs are used instead of LDHs to reduce the number of loads in the loop. The reset pointer instructions immediately follow the LDW instructions. The first ADD instructions for sum0 and sum1 are conditional on the same value as the store counter, because when sctr is 0, the end of one inner loop has been reached and the first ADD, which adds the previous sum07 to p00, must not be executed. The first ADD for sum0 writes to the same register as the first MPY p00. The second ADD reads p00 and p01. At the beginning of each inner loop, the first ADD is not performed, so the second ADD correctly reads the results of the first two MPYs (p01 and p00) and adds them together. For other iterations of the inner loop, the first ADD executes, and the second ADD sums the second MPY result (p01) with the running accumulator. The same is true for the first and second ADDs of sum1.
Example 6-72. Linear Assembly for FIR With Outer Loop Conditionally Executed With Inner Loop

         LDW     *h++[2],h01          ; h[i+0] & h[i+1]
         LDW     *h_1++[2],h23        ; h[i+2] & h[i+3]
         LDW     *h++[2],h45          ; h[i+4] & h[i+5]
         LDW     *h_1++[2],h67        ; h[i+6] & h[i+7]
         LDW     *x++[2],x01          ; x[j+i+0] & x[j+i+1]
         LDW     *x_1++[2],x23        ; x[j+i+2] & x[j+i+3]
         LDW     *x++[2],x45          ; x[j+i+4] & x[j+i+5]
         LDW     *x_1++[2],x67        ; x[j+i+6] & x[j+i+7]
         LDH     *x,x8                ; x[j+i+8]

 [sctr]  SUB     sctr,1,sctr          ; dec store lp cntr
 [!sctr] SHR     sum07,15,y0          ; (sum0 >> 15)
 [!sctr] SHR     sum17,15,y1          ; (sum1 >> 15)
 [!sctr] STH     y0,*y++[2]           ; y[j] = (sum0 >> 15)
 [!sctr] STH     y1,*y_1++[2]         ; y[j+1] = (sum1 >> 15)

         MV      x01,x01b             ; move to other reg file
         MPYLH   h01,x01b,p10         ; p10 = h[i+0]*x[j+i+1]
 [sctr]  ADD     p10,sum17,p10        ; sum1(p10) = p10 + sum1
         MPYHL   h01,x23,p11          ; p11 = h[i+1]*x[j+i+2]
         ADD     p11,p10,sum11        ; sum1 += p11
         MPYLH   h23,x23,p12          ; p12 = h[i+2]*x[j+i+3]
         ADD     p12,sum11,sum12      ; sum1 += p12
         MPYHL   h23,x45,p13          ; p13 = h[i+3]*x[j+i+4]
         ADD     p13,sum12,sum13      ; sum1 += p13
         MPYLH   h45,x45,p14          ; p14 = h[i+4]*x[j+i+5]
         ADD     p14,sum13,sum14      ; sum1 += p14
         MPYHL   h45,x67,p15          ; p15 = h[i+5]*x[j+i+6]
         ADD     p15,sum14,sum15      ; sum1 += p15
         MPYLH   h67,x67,p16          ; p16 = h[i+6]*x[j+i+7]
         ADD     p16,sum15,sum16      ; sum1 += p16
         MPYHL   h67,x8,p17           ; p17 = h[i+7]*x[j+i+8]
         ADD     p17,sum16,sum17      ; sum1 += p17

         MPY     h01,x01,p00          ; p00 = h[i+0]*x[j+i+0]
 [sctr]  ADD     p00,sum07,p00        ; sum0(p00) = p00 + sum0
         MPYH    h01,x01,p01          ; p01 = h[i+1]*x[j+i+1]
         ADD     p01,p00,sum01        ; sum0 += p01
         MPY     h23,x23,p02          ; p02 = h[i+2]*x[j+i+2]
         ADD     p02,sum01,sum02      ; sum0 += p02
         MPYH    h23,x23,p03          ; p03 = h[i+3]*x[j+i+3]
         ADD     p03,sum02,sum03      ; sum0 += p03
         MPY     h45,x45,p04          ; p04 = h[i+4]*x[j+i+4]
         ADD     p04,sum03,sum04      ; sum0 += p04
         MPYH    h45,x45,p05          ; p05 = h[i+5]*x[j+i+5]
         ADD     p05,sum04,sum05      ; sum0 += p05
         MPY     h67,x67,p06          ; p06 = h[i+6]*x[j+i+6]
         ADD     p06,sum05,sum06      ; sum0 += p06
         MPYH    h67,x67,p07          ; p07 = h[i+7]*x[j+i+7]
         ADD     p07,sum06,sum07      ; sum0 += p07

 [!sctr] MVK     4,sctr               ; reset store lp cntr
 [pctr]  SUB     pctr,1,pctr          ; dec pointer reset lp cntr
 [!pctr] SUB     x,rstx2,x            ; reset x ptr
 [!pctr] SUB     x_1,rstx1,x_1        ; reset x_1 ptr
 [!pctr] SUB     h,rsth1,h            ; reset h ptr
 [!pctr] SUB     h_1,rsth2,h_1        ; reset h_1 ptr
 [!pctr] MVK     4,pctr               ; reset pointer reset lp cntr
 [octr]  SUB     octr,1,octr
 [octr]  B       LOOP
6.13.6 Translating C Code to Linear Assembly (Inner Loop and Outer Loop)
Example 6-73 shows the linear assembly with functional units assigned. (As in Example 6-64 on page 6-122, symbolic names now have an A or B in front of them to signify the register file where they reside.) Although this allocation is one of many possibilities, one goal is to keep the 1X and 2X paths to a minimum. Even with this goal, you have five 2X paths and seven 1X paths. One requirement that was assumed when the functional units were chosen was that all the sum0 values reside on the same side (A in this case) and all the sum1 values reside on the other side (B). Because you are scheduling eight accumulates for both sum0 and sum1 in an 8-cycle loop, each ADD must be scheduled immediately following the previous ADD. Therefore, it is undesirable for any sum0 ADDs to use the same functional units as sum1 ADDs. One MV instruction was added to get x01 on the B side for the MPYLH p10 instruction.
Optimizing Assembly Code via Linear Assembly
Example 6-73. Linear Assembly for FIR With Outer Loop Conditionally Executed With Inner Loop (With Functional Units)
.global _fir _fir: .cproc .reg .reg .reg .reg .reg .reg .reg ADD ADD ADD MVK MVK MVK MVK MVK MVK MVK ZERO ZERO .mptr .mptr .mptr .mptr LOOP: .trip 8 x, h, y x_1, h_1, y_1, octr, pctr, sctr sum01, sum02, sum03, sum04, sum05, sum11, sum12, sum13, sum14, sum15, p00, p01, p02, p03, p04, p05, p06, p10, p11, p12, p13, p14, p15, p16, x01b, x01, x23, x45, x67, x8, h01, y0, y1, rstx1, rstx2, rsth1, rsth2 x,4,x_1 h,4,h_1 y,2,y_1 60,rstx1 60,rstx2 64,rsth1 64,rsth2 201,octr 4,pctr 5,sctr sum07 sum17 x, x_1, h, h_1, x+0 x+4 h+0 h+4 ; ; ; ; ; ; ; ; ; ; ; ;
point to x[2] point to h[2] point to y[1] used to rst x pointer each outer loop used to rst x pointer each outer loop used to rst h pointer each outer loop used to rst h pointer each outer loop loop ctr = 201 = (100/2) * (32/8) + 1 pointer reset lp cntr = 32/8 reset store lp cntr = 32/8 + 1 sum07 = 0 sum17 = 0
LDW LDW LDW LDW LDW LDW LDW LDW LDH [sctr] [!sctr] [!sctr] [!sctr] [!sctr] SUB SHR SHR STH STH
.D1T1 .D2T2 .D1T1 .D2T2 .D2T1 .D1T2 .D2T1 .D1T2 .D2T1 .S1 .S1 .S2 .D1 .D2
*h++[2],h01 ; *h_1++[2],h23 ; *h++[2],h45 ; *h_1++[2],h67 ; *x++[2],x01 ; *x_1++[2],x23 ; *x++[2],x45 ; *x_1++[2],x67 ; *x,x8 ; sctr,1,sctr sum07,15,y0 sum17,15,y1 y0,*y++[2] y1,*y_1++[2] ; ; ; ; ;
h[i+1] h[i+3] h[i+5] h[i+7] & & & & x[j+i+1] x[j+i+3] x[j+i+5] x[j+i+7]
dec store lp cntr (sum0 >> 15) (sum1 >> 15) y[j] = (sum0 >> 15) y[j+1] = (sum1 >> 15)
MV MPYLH [sctr] ADD MPYHL ADD MPYLH ADD MPYHL ADD MPYLH ADD MPYHL ADD MPYLH ADD MPYHL ADD MPY ADD MPYH ADD MPY ADD MPYH ADD MPY ADD MPYH ADD .L2X .M2X .L2 .M1X .L2X .M2 .L2 .M1X .L2X .M1 .L2X .M2X .S2 .M2 .L2 .M1X .L2X .M1 .L1 .M1 .L1 .M2 .L1X .M2 .L1X .M1 .L1 .M1 .L1 x01,x01b h01,x01b,p10 p10,sum17,p10 h01,x23,p11 p11,p10,sum11 h23,x23,p12 p12,sum11,sum12 h23,x45,p13 p13,sum12,sum13 h45,x45,p14 p14,sum13,sum14 h45,x67,p15 p15,sum14,sum15 h67,x67,p16 p16,sum15,sum16 h67,x8,p17 p17,sum16,sum17 h01,x01,p00 p00,sum07,p00 h01,x01,p01 p01,p00,sum01 h23,x23,p02 p02,sum01,sum02 h23,x23,p03 p03,sum02,sum03 h45,x45,p04 p04,sum03,sum04 h45,x45,p05 p05,sum04,sum05 ; move to other reg file ; p10 = h[i+0]*x[j+i+1] ; sum1(p10) = p10 + sum1 ; p11 = h[i+1]*x[j+i+2] ; sum1 += p11 ; p12 = h[i+2]*x[j+i+3] ; sum1 += p12 ; p13 = h[i+3]*x[j+i+4] ; sum1 += p13 ; p14 = h[i+4]*x[j+i+5] ; sum1 += p14 ; p15 = h[i+5]*x[j+i+6] ; sum1 += p15 ; p16 = h[i+6]*x[j+i+7] ; sum1 += p16 ; p17 = h[i+7]*x[j+i+8] ; sum1 += p17 ; p00 = h[i+0]*x[j+i+0] ; sum0(p00) = p00 + sum0 ; p01 = h[i+1]*x[j+i+1] ; sum0 += p01 ; p02 = h[i+2]*x[j+i+2] ; sum0 += p02 ; p03 = h[i+3]*x[j+i+3] ; sum0 += p03 ; p04 = h[i+4]*x[j+i+4] ; sum0 += p04 ; p05 = h[i+5]*x[j+i+5] ; sum0 += p05
[sctr]
MPY ADD MPYH ADD [!sctr] [pctr] [!pctr] [!pctr] [!pctr] [!pctr] [!pctr] [octr] [octr] .endproc MVK SUB SUB SUB SUB SUB MVK SUB B .M2 .L1X .M2 .L1X .S1 .S1 .S2 .S1 .S1 .S2 .S1 .S2 .S2 h67,x67,p06 p06,sum05,sum06 h67,x67,p07 p07,sum06,sum07 4,sctr pctr,1,pctr x,rstx2,x x_1,rstx1,x_1 h,rsth1,h h_1,rsth2,h_1 4,pctr octr,1,octr LOOP ; p06 = h[i+6]*x[j+i+6] ; sum0 += p06 ; p07 = h[i+7]*x[j+i+7] ; sum0 += p07 ; reset store lp cntr ; ; ; ; ; ; dec pointer reset lp cntr reset x ptr reset x_1 ptr reset h ptr reset h_1 ptr reset pointer reset lp cntr
(b) B side

Unit(s)               Total/Unit
.M2                   8
.S2                   6
.D2                   6
.L2                   8
Total non-.M units    20
2X paths              5
|| || || ||
|| ||
; x[j+i+2] & x[j+i+3] ; x[j+i+0] & x[j+i+1] ; set pointer reset lp cntr ; ; ; ; ; ; ; ; ; ; h[i+2] & h[i+3] h[i+0] & h[i+1] used to reset x ptr (16*44) used to reset x ptr (16*44) x[j+i+4] & x[j+i+5] x[j+i+6] & x[j+i+7] dec pointer reset lp cntr used to reset h ptr (16*4) used to reset h ptr (16*4) point to y[j+1]
|| || ||
|| ||[A1] || || ||
; h[i+4] & h[i+5] ; h[i+6] & h[i+7] ; reset x ptr ; reset x ptr ; reset h ptr ; x[j+i+8] ; move to other reg file ; set store lp cntr ; p10 = h[i+0]*x[j+i+1] ; reset h ptr ; p11 = h[i+1]*x[j+i+2] ; ; ; ; p00 = h[i+0]*x[j+i+0] p12 = h[i+2]*x[j+i+3] dec store lp cntr zero out initial accumulator
||
||[!A1] ||
; (Bsum1 >> 15) ; p02 = h[i+2]*x[j+i+2] ; p01 = h[i+1]*x[j+i+1] ; sum1(p10) = p10 + sum1 ;* x[j+i+2] & x[j+i+3] ;* x[j+i+0] & x[j+i+1] ; zero out initial accumulator
SHR SUB MPYH ADD MPYHL ADD LDW LDW ADD MPYHL MPYLH ADD LDW LDW SUB B MPY ADD MPYLH ADD LDW LDW SUB MPY MPYH ADD ADD SUB SUB LDH MVK MPYH ADD MPYHL ADD STH STH ADD ADD ADD MPYLH MVK SUB MPYHL
.S1 .S2 .M2 .L1 .M1X .L2X .D2 .D1 .L1 .M2X .M1 .L2 .D2 .D1 .S1 .S2 .M1 .L1X .M2 .L2X .D1 .D2 .S1 .M2 .M1 .L1X .L2X .S2 .S1 .D2 .S1 .M2 .L1 .M1X .S2 .D2 .D1 .L2X .L1 .L2 .M2X .S1 .S2 .M1X
A10,15,A12 B0,1,B0 B7,B9,B13 A7,A10,A7 B7,A11,A10 A14,B4,B7 *B2++[2],B7 *A0++[2],A8 A10,A7,A13 A9,B10,B12 A9,A11,A10 B13,B7,B7 *B1++[2],A11 *A4++[2],B10 A1,1,A1 LOOP A9,A11,A11 B9,A13,A13 B8,B10,B13 A10,B7,B7 *A0++[2],A9 *B2++[2],B8 A4,A3,A4 B8,B10,B11 A9,A11,A11 B13,A13,A9 A10,B7,B7 B1,B14,B1 A0,A5,A0 *B1,A8 4,A2 B8,B10,B13 A11,A9,A9 B8,A8,A9 B12,B7,B10 B11,*B6++[2] A12,*A6++[2] A10,0,B8 A11,A9,A12 B13,B10,B8 A8,B8,B4 4,A1 B2,B5,B2 A8,B9,A14
; (Asum0 >> 15) ; dec outer lp cntr ; p03 = h[i+3]*x[j+i+3] ; sum0(p00) = p00 + sum0 ; p13 = h[i+3]*x[j+i+4] ; sum1 += p11 ;* h[i+2] & h[i+3] ;* h[i+0] & h[i+1] ; sum0 += p01 ; p15 = h[i+5]*x[j+i+6] ; p14 = h[i+4]*x[j+i+5] ; sum1 += p12 ;* x[j+i+4] & x[j+i+5] ;* x[j+i+6] & x[j+i+7] ;* dec pointer reset lp cntr ; Branch outer loop ; p04 = h[i+4]*x[j+i+4] ; sum0 += p02 ; p16 = h[i+6]*x[j+i+7] ; sum1 += p13 ;* h[i+4] & h[i+5] ;* h[i+6] & h[i+7] ;* reset x ptr ; p06 = h[i+6]*x[j+i+6] ; p05 = h[i+5]*x[j+i+5] ; sum0 += p03 ; sum1 += p14 ;* reset x ptr ;* reset h ptr ;* x[j+i+8] ; reset store lp cntr ; p07 = h[i+7]*x[j+i+7] ; sum0 += p04 ; p17 = h[i+7]*x[j+i+8] ; sum1 += p15 ; y[j+1] = (Bsum1 >> 15) ; y[j] = (Asum0 >> 15) ;* move to other reg file ; sum0 += p05 ; sum1 += p16 ;* p10 = h[i+0]*x[j+i+1] ;* reset pointer reset lp cntr ;* reset h ptr ;* p11 = h[i+1]*x[j+i+2]
|| || ||[!A1] ||[!A1] ||
||[!A2] || || ||[A2] || ||
ADD .L1X B13,A12,A10 SHR .S2 B11,15,B11 MPY .M2 B7,B9,B9 MPYH .M1 A8,A10,A10 ADD .L2 B4,B11,B4 LDW .D1 *A4++[2],B9 LDW .D2 *B1++[2],A10 ;Branch occurs here SHR STH STH .S1 .D2 .D1 A10,15,A12 B11,*B6++[2] A12,*A6++[2]
; sum0 += p07 ;* (Bsum1 >> 15) ;* p02 = h[i+2]*x[j+i+2] ;* p01 = h[i+1]*x[j+i+1] ;* sum1(p10) = p10 + sum1 ;** x[j+i+2] & x[j+i+3] ;** x[j+i+0] & x[j+i+1]
; (Asum0 >> 15) ; y[j+1] = (Bsum1 >> 15) ; y[j] = (Asum0 >> 15)
FIR with redundant load elimination and no memory hits with outer loop conditionally executed with inner loop    50 (8
Chapter 7
Interrupts
This chapter describes interrupts from a software-programming point of view. A description of single and multiple register assignment is included, followed by code generation of interruptible code and finally, descriptions of interrupt subroutines.
Topic
7.1 Overview of Interrupts
7.2 Single Assignment vs. Multiple Assignment
7.3 Interruptible Loops
7.4 Interruptible Code Generation
7.5 Interrupt Subroutines
Overview of Interrupts
Single and multiple assignment of registers
Loop interruptibility
How to use the C6x code generation tools to satisfy different requirements
Interrupt subroutines
Example 7-2 shows the same code with a new register allocation that produces single assignment code. The LDW now assigns its value to register A6 instead of A1, so A1 keeps the value written by the SUB instruction whether or not an interrupt is taken. Because there are no in-flight registers that are read before an in-flight instruction completes, this code is interruptible.
56 NOP 7 MPY
Both examples involve exactly the same schedule of instructions. The only difference is the register allocation. The single assignment register allocation, as shown in Example 7-2, can result in higher register pressure (Example 7-2 uses one more register than Example 7-1). The next section describes how to generate interruptible and non-interruptible code with the C6x code generation tools.
Interruptible Loops
Software pipelined loops that have high register pressure can fail to register allocate at a given iteration interval when single assignment is required, but might allocate successfully if multiple assignment were allowed. This can result in a larger iteration interval for single assignment software pipelined loops and thus lower performance. To determine whether this is the problem for looped code, use the -mw feedback option.
Because loops with minimum iteration intervals less than 6 are not interruptible, higher iteration intervals might be used, which results in lower performance. Unrolling the loop, however, prevents this reduction in performance (see section 7.2).
Higher register pressure in single assignment can cause data spilling to memory in both looped and non-looped code when there are not enough registers to store all temporary values. This reduces performance but occurs rarely and only in extreme cases.
The tools provide three levels of control to the user. These levels are described in the following sections. For a full discussion of interruptible code generation, see the TMS320C6x Optimizing C Compiler User's Guide.
7.4.1
7.4.2
7.4.3
Interrupt Subroutines
7.5.1
Alternatively, you can use the interrupt pragma to define a function to be an ISR:
#pragma INTERRUPT(func);
In either case, the C compiler automatically creates a function that obeys all the requirements for an ISR. These requirements differ from the calling convention of a normal C function in the following ways:
All general purpose registers used by the subroutine must be saved to the stack. If another function is called from the ISR, then all the registers (A0-A15, B0-B15) are saved to the stack.
A B IRP instruction is used to return from the interrupt subroutine instead of the B B3 instruction used for standard C functions.
A function cannot return a value and thus must be declared void.
See the section on Register Conventions in the TMS320C6x Optimizing C Compiler User's Guide for more information on standard function calling conventions.
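As a minimal sketch of the pragma form, an ISR written in C might look like the following. The handler name timer_isr and the counter tick_count are hypothetical and do not appear in this guide. The pragma is guarded by the _TMS320C6X macro, which this sketch assumes the C6x compiler predefines, so that the same file also compiles as ordinary C on a host.

```c
/* Hypothetical event counter shared between the ISR and the main program. */
volatile unsigned int tick_count = 0;

#ifdef _TMS320C6X   /* assumption: predefined by the C6x compiler */
/* Ask the compiler to generate ISR entry and exit code for timer_isr. */
#pragma INTERRUPT(timer_isr);
#endif

/* An ISR takes no arguments and cannot return a value, so it is void. */
void timer_isr(void)
{
    tick_count++;   /* record the event and return */
}
```

Because the handler calls no other function, only the registers it actually uses would need to be saved, as described in the list above.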
7.5.2
All registers used must be saved to the stack before modification. For this reason, it is preferable to maintain one general purpose register to be used as a stack pointer in your application. (The C compiler uses B15.)
If another C routine is called from the ISR (with an assembly branch instruction to the _c_func_name label), then all registers must be saved to the stack on entry.
A B IRP instruction must be used to return from the routine. If this is the NMI ISR, a B NRP must be used instead.
An NOP 4 is required after the last LDW in this case to ensure that B0 is restored before returning from the interrupt.
7.5.3
Nested Interrupts
Sometimes it is desirable to allow higher priority interrupts to interrupt lower priority ISRs. To allow nested interrupts to occur, you must first save the IRP, IER, and CSR to a register that is not being used or to some other memory location (usually the stack). Once these have been saved, you can reenable
the appropriate interrupts. This involves setting the GIE bit again and then making any necessary modifications to the IER, if only certain interrupts are to be allowed to interrupt the particular ISR. On return from the ISR, the original values of the IRP, IER, and CSR must be restored.
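The save, reenable, and restore sequence can be sketched in C as follows. This is only an illustration: the variables sim_csr, sim_ier, and sim_irp are stand-ins for the CPU control registers so that the logic can run on a host, and the type and function names are hypothetical. The sketch narrows the IER before setting GIE so that the mask is already in place when interrupts become enabled; that ordering is a choice of this sketch.

```c
#include <stdint.h>

/* Stand-ins for the CPU control registers (assumption of this sketch:
   on a real device these are not ordinary memory variables). */
uint32_t sim_csr, sim_ier, sim_irp;

#define CSR_GIE 0x1u   /* global interrupt enable, bit 0 of CSR */

/* Hypothetical container for the state that a nested interrupt overwrites. */
typedef struct {
    uint32_t irp, ier, csr;
} isr_saved_state;

/* Call near the start of an ISR that may itself be interrupted. */
void isr_allow_nesting(isr_saved_state *saved, uint32_t allowed_mask)
{
    saved->irp = sim_irp;        /* return address of this ISR */
    saved->ier = sim_ier;        /* original interrupt enables */
    saved->csr = sim_csr;        /* original GIE state */
    sim_ier &= allowed_mask;     /* keep only the interrupts allowed to nest */
    sim_csr |= CSR_GIE;          /* reenable interrupts globally */
}

/* Call just before returning from the ISR. */
void isr_end_nesting(const isr_saved_state *saved)
{
    sim_csr &= ~CSR_GIE;         /* no interrupts while state is restored */
    sim_ier = saved->ier;
    sim_irp = saved->irp;
    sim_csr = saved->csr;
}
```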
Part IV
Appendix
Appendix A
Applications Programming
This appendix provides extensive code examples from the Global System for Mobile Communications (GSM) enhanced full-rate (EFR) vocoder. The assembly code examples in this appendix represent hand-optimized code; the code produced by the assembly optimizer will vary, depending on the version used.
Topic
A.1 Summary of Major Programming Methods
A.2 Implementation of GSM EFR Vocoder
Rearranging the C code
If you are implementing a system based on existing C code, rearranging the tasks in the C code is a useful way to gain better performance.
Avoiding memory bank hits
Memory bank hits, especially those in the inner loop of a nested loop, hurt performance dramatically and must be avoided. Most memory bank hits, however, can be eliminated by allocating the relevant arrays properly. Some situations, such as accessing a word and a halfword in the same cycle, can also cause a memory bank hit and should be avoided as well.
If the system implementation is quite complicated, the program-memory size becomes an issue. To achieve a good balance between program-memory size and speed, you can implement the less critical portions with highly-compact assembly code that sacrifices performance.
Multiply-accumulate loop
Windowing and scaling part of autocorr.c
cor_h
rrv computation in search_10i40
Index search in search_10i40
FIR filter (residu.c)
Lag search in the lag_max ( ) routine
Note: European Telecommunications Standards Institute (ETSI) has the copyright to all the C code used in this section.
The following global constants/symbols are defined in the EFR vocoder:
#define Word16 short
#define Word32 int
#define MAX_32 0x7fffffffL
#define MIN_32 0x80000000L
#define MAX_16 0x7fff
#define MIN_16 0x8000
Example A-2 shows a list of symbolic instructions for each iteration of the loop.
In Example A-2, xptr is the pointer for the x array and yptr is the pointer for the y array. Because there are eight functional units, these instructions can easily fit into one execute packet. In general, unrolling the loop once as in the code in Example A-3 does not give the same result as the code shown in Example A-1, because of the ordering dependence of the saturated addition.
However, both approaches lead to the same result if x[i] = y[i] for every i, because _smpy(x[i], x[i]) is always greater than or equal to 0. This special MAC loop is used to compute the energy of a particular signal segment. In this case, take the approach shown in Example A-3, because it doubles the performance of the code shown in Example A-2. Example A-4 shows the C code for this special MAC loop. Example A-5 lists the symbolic instructions for this loop.
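The ordering dependence can be checked on a host with a small sketch. The functions sadd32 and smpy16 below are portable stand-ins written for this illustration, assumed to behave like the _sadd and _smpy intrinsics (saturating 32-bit add, and 16 x 16 multiply with a left shift of one and saturation); mac_in_order and mac_even_odd are hypothetical names for the single-accumulator loop and the once-unrolled loop.

```c
#include <stdint.h>

/* Stand-in for _sadd: 32-bit add that saturates instead of wrapping. */
int32_t sadd32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Stand-in for _smpy: (a * b) << 1, saturated to 32 bits. */
int32_t smpy16(int16_t a, int16_t b)
{
    int64_t p = 2 * (int64_t)a * (int64_t)b;
    return (p > INT32_MAX) ? INT32_MAX : (int32_t)p;
}

/* General MAC loop: one accumulator, products added in index order. */
int32_t mac_in_order(const int16_t *x, const int16_t *y, int n)
{
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum = sadd32(sum, smpy16(x[i], y[i]));
    return sum;
}

/* Unrolled once: even and odd products use separate accumulators.
   n must be even. */
int32_t mac_even_odd(const int16_t *x, const int16_t *y, int n)
{
    int32_t sum_e = 0, sum_o = 0;
    for (int i = 0; i < n; i += 2) {
        sum_e = sadd32(sum_e, smpy16(x[i], y[i]));
        sum_o = sadd32(sum_o, smpy16(x[i + 1], y[i + 1]));
    }
    return sadd32(sum_e, sum_o);
}
```

With x = {32767, 32767, -32768, 0} and y = {32767, 32767, 32767, 0}, the in-order loop saturates after the second product and ends at 65535, while the even/odd version never saturates and ends at 2147287044. When y is replaced by x, every product is non-negative and both versions return 2147483647.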
; sum=sadd(sum_o+sum_e)
In Example A-5, xptre and xptro are the pointers for the x array and, at the beginning, point to x[0] and x[1], respectively. The eight instructions in the loop fit perfectly into one execute packet. This approach computes two MACs in one cycle. It doubles the performance of the code shown in Example A-2 for the general MAC loop. The final assembly code is shown in Example A-6.
Example A-6. Assembly Code for the Energy Computation MAC Loop
*******************************************************************************
**  Texas Instruments, Inc                                                   **
**                                                                           **
**  MAC Loop Energy Computation                                              **
**                                                                           **
**  Compute two samples a time                                               **
**                                                                           **
**  Total cycles = (N/2+2)                                                   **
**                                                                           **
**  Register Usage:    A    B                                                **
**                     4    5                                                **
**                                                                           **
**  Notice that x[0] and x[1] will not be available till LOOP                **
**  is executed once. Therefore, sum_e and sum_o should be 0s                **
**  for the first three iterations. This is why A5, B5, A6,                  **
**  and B6 should be set to 0s in the prolog.                                **
*******************************************************************************
; A4    &x[0]
; B4    N
; A6    sum

        ADD   .L2X  A4,2,B4         ; &x[1]
||      SUB   .D2   B4,6,B1         ; loop counter
||      B     .S2   LOOP            ; branch to the loop
||      MVK   .S1   0,A6            ; initialize sum_e

        LDH   .D1   *A4++[2],A5     ; load x[0]
||      LDH   .D2   *B4++[2],B5     ; load x[1]
||      B     .S2   LOOP            ; branch to the loop
||      MV    .L2X  A6,B6           ; initialize sum_o

        LDH   .D1   *A4++[2],A5     ; load x[2]
||      LDH   .D2   *B4++[2],B5     ; load x[3]
||      B     .S1   LOOP            ; branch to the loop
||      MV    .L1   A6,A5           ; take care the initial three iterations
||      MV    .L2   B6,B5           ; take care the initial three iterations

        LDH   .D1   *A4++[2],A5
||      LDH   .D2   *B4++[2],B5
||      B     .S1   LOOP

        LDH   .D1   *A4++[2],A5
||      LDH   .D2   *B4++[2],B5

LOOP:   SMPY  .M1   A5,A5,A7        ; smpy(x[i],x[i])
||      SMPY  .M2   B5,B5,B7        ; smpy(x[i+1],x[i+1])
||      SADD  .L1   A7,A6,A6        ; sum_e=sadd(sum_e,smpy(x[i],x[i]))
||      SADD  .L2   B7,B6,B6        ; sum_o=sadd(sum_o,smpy(x[i+1],x[i+1]))
||      LDH   .D1   *A4++[2],A5     ; load x[i]
||      LDH   .D2   *B4++[2],B5     ; load x[i+1]
||[B1]  B     .S1   LOOP            ; branch to the loop
||[B1]  SUB   .S2   B1,2,B1         ; decrement loop counter

        SADD  .L1X  A6,B6,A6        ; final result, sum = sum_e + sum_o
Example A-7. C Code for the Windowing and Scaling Part of autocorr.c
#define L_WINDOW 240

input:
Word16 x[L_WINDOW], wind[L_WINDOW];

local variables/arrays:
Word16 i;
Word16 y[L_WINDOW];
Word32 sum;
Word16 overfl, overfl_shft;

Original C code:
/* Windowing of signal */
for (i = 0; i < L_WINDOW; i++)
{
    y[i] = mult_r (x[i], wind[i]);
}
/* Compute r[0] and test for overflow */
overfl_shft = 0;
do
{
    overfl = 0;
    sum = 0L;
    for (i = 0; i < L_WINDOW; i++)
    {
        sum = L_mac (sum, y[i], y[i]);
    }
    /* If overflow divide y[] by 4 */
    if (L_sub (sum, MAX_32) == 0L)
    {
        overfl_shft = add (overfl_shft, 4);
        overfl = 1; /* Set the overflow flag */
        for (i = 0; i < L_WINDOW; i++)
        {
            y[i] = shr (y[i], 2);
        }
    }
} while (overfl != 0);

Where
mult_r(a,b)  = _sadd(_smpy(a,b),0x8000L)>>16
L_mac(a,b,c) = _sadd(a,_smpy(b,c))
L_sub(a,b)   = _ssub(a,b)
add(a,b)     = ((_sadd((a)<<16,((b)<<16)))>>16)
shr(a,b)     = ((b)<0 ? (_sshl((a),(b+16))>>16):((a)>>(b)))
Figure A-1. Flow Diagram for the Windowing and Scaling Part of autocorr.c
[Flow diagram not reproduced. Recoverable labels: Start; Loop 1; Loop 2: for(i = 0; i < L_WINDOW; i++) sum = L_mac (sum, y[i], y[i]); Exit.]
In Example A-8, windptr, xptr, and yptr are the pointers for wind, x, and y. The .D unit is used most often (three times). With properly partitioned resources, this is a 2-cycle loop. If you unroll the loop once and load both x and wind in words (in GSM EFR, both x and wind can be loaded in words if they are aligned on a word boundary), you can compute two y values in two cycles. The following is the new list of the instructions in one loop iteration.
[cntr] [cntr]
In Example A-9, yptre and yptro are the pointers for the y array and, at the beginning, point to y[0] and y[1], respectively. Note: Loop 2 is a special MAC loop, as described in section A.2.1. It can be implemented either as shown in Example A-10 without loop unrolling or as in Example A-11 with loop unrolling for one iteration.
Example A-10. Linear Assembly for Loop 2 of autocorr.c (No Loop Unrolling)
LOOP2:  LDH   .D  *yptr++,yi      ;load y[i]
        SMPY  .M  yi,yi,yyi       ;smpy(y[i],y[i])
        SADD  .L  sum,yyi,sum     ;sadd(sum,smpy(y[i],y[i]))
 [cntr] SUB   .S  cntr,1,cntr     ;decrement loop counter
 [cntr] B     .S  LOOP2           ;branch to loop
Example A-11. Linear Assembly for Loop 2 of autocorr.c (With Loop Unrolling)
LOOP2:  LDH   .D  *yptre++,yi         ;load y[i]
        LDH   .D  *yptro++,yi+1       ;load y[i+1]
        SMPY  .M  yi,yi,yyi           ;smpy(y[i],y[i])
        SMPY  .M  yi+1,yi+1,yyi+1     ;smpy(y[i+1],y[i+1])
        SADD  .L  sum_e,yyi,sum_e     ;sadd(sum_e,smpy(y[i],y[i]))
        SADD  .L  sum_o,yyi+1,sum_o   ;sadd(sum_o,smpy(y[i+1],y[i+1]))
 [cntr] SUB   .S  cntr,2,cntr         ;decrement loop counter
 [cntr] B     .S  LOOP2               ;branch to loop
        SADD  .L  sum_e,sum_o,sum     ;sum=sum_o+sum_e
Later, you will see that both approaches are used in this application. Loop 3 is a single-cycle loop, and you cannot speed it up by simply unrolling the loop. The instructions for each iteration are shown in Example A-12.
In Example A-12, yptrl is the pointer for loading the y array and yptrs is the pointer for storing the y array. The new flow diagram is shown in Figure A-2.
[Figure A-2: new flow diagram, graphic not reproduced. Text recovered from the diagram boxes:]
Loop 1:
for(i = 0; i < L_WINDOW; i+=2) { y[i] = mult_r (x[i], wind[i]) y[i+1] = mult_r (x[i+1], wind[i+1]) }
Loop 2:
for(i = 0; i < L_WINDOW; i++) { sum_o = L_mac (sum_o, y[i], y[i]) sum_e = L_mac (sum_e, y[i+1], y[i+1]) } sum = sum_o + sum_e
Loop I:
for(i = 0; i < L_WINDOW; i+=2) { y[i] = mult_r (x[i], wind[i]) y[i+1] = mult_r (x[i+1], wind[i+1]) sum_o = L_mac (sum_o, y[i], y[i]) sum_e = L_mac (sum_e, y[i+1], y[i+1]) } sum = sum_o + sum_e
Decision: sum == MAX_32 ? No: Exit
Remaining box: y[i] = shr(y[i], 2) and sum = L_mac (sum, y[i], y[i])
You can implement loop I in either of the two ways shown in Example A-13.
or as
LOOPI:  LDW    .D  *windptr++,windi_windi+1          ;load wind[i] and wind[i+1]
        LDW    .D  *xptr++,xi_xi+1                   ;load x[i] and x[i+1]
        SMPY   .M  windi_windi+1,xi_xi+1,windxi0     ;smpy(x[i],wind[i])
        SMPYH  .M  windi_windi+1,xi_xi+1,windxi0+1   ;smpy(x[i+1],wind[i+1])
        SADD   .L  windxi0,0x8000L,windxi1           ;sadd(smpy(x[i],wind[i]),0x8000L)
        SADD   .L  windxi0+1,0x8000L,windxi1+1       ;sadd(smpy(x[i+1],wind[i+1]),0x8000L)
        SHR    .S  windxi1,16,yi                     ;sadd(smpy(x[i],wind[i]),0x8000L)>>16
        SHR    .S  windxi1+1,16,yi+1                 ;sadd(smpy(x[i+1],wind[i+1]),0x8000L)>>16
        SMPYH  .M  windxi1,windxi1,yyi               ;smpy(y[i],y[i])
        SMPYH  .M  windxi1+1,windxi1+1,yyi+1         ;smpy(y[i+1],y[i+1])
        SADD   .L  sum_e,yyi,sum_e                   ;sum_e=sadd(sum_e,smpy(y[i],y[i]))
        SADD   .L  sum_o,yyi+1,sum_o                 ;sum_o=sadd(sum_o,smpy(y[i+1],y[i+1]))
        STH    .D  yi,*yptre++[2]                    ;store y[i]
        STH    .D  yi+1,*yptro++[2]                  ;store y[i+1]
 [cntr] SUB    .S  cntr,2,cntr                       ;decrement loop counter
 [cntr] B      .S  LOOPI                             ;branch to loop
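The only difference between these two implementations is how they compute yyi and yyi+1. Using yyi as an example, the former approach computes yyi following the order of the original C code:
yyi = _smpy(_sadd(_smpy(a,b),0x8000L)>>16, _sadd(_smpy(a,b),0x8000L)>>16),
whereas the latter approach, as the SMPYH instructions in the second listing show, multiplies the upper halfwords of the rounded sum directly, without waiting for the shift:
yyi = _smpyh(_sadd(_smpy(a,b),0x8000L), _sadd(_smpy(a,b),0x8000L))
This provides the flexibility to better pack the instructions and reduces cycle count. Loop I is a two-cycle loop. Loop II is still a single-cycle loop. Its instructions are shown in Example A-14.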
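That the two ways of forming yyi agree can be checked on a host with a short sketch. The functions sadd32, smpy16, and smpyh32 are portable stand-ins written for this illustration, assumed to behave like the _sadd, _smpy, and _smpyh intrinsics; yy_via_shift and yy_via_smpyh are hypothetical names for the two orderings. The sketch also assumes that >> on a negative int32_t is an arithmetic shift, which holds on common compilers but is implementation-defined in C.

```c
#include <stdint.h>

/* Stand-in for _sadd: saturating 32-bit add. */
int32_t sadd32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Stand-in for _smpy: (a * b) << 1, saturated to 32 bits. */
int32_t smpy16(int16_t a, int16_t b)
{
    int64_t p = 2 * (int64_t)a * (int64_t)b;
    return (p > INT32_MAX) ? INT32_MAX : (int32_t)p;
}

/* Upper halfword of v as a signed 16-bit value (arithmetic shift assumed). */
int16_t hi16(int32_t v)
{
    return (int16_t)(v >> 16);
}

/* Stand-in for _smpyh: saturating multiply of the two upper halfwords. */
int32_t smpyh32(int32_t a, int32_t b)
{
    return smpy16(hi16(a), hi16(b));
}

/* Order of the original C code: round, shift down to y, then square y. */
int32_t yy_via_shift(int16_t x, int16_t w)
{
    int32_t t = sadd32(smpy16(x, w), 0x8000L);
    int16_t y = (int16_t)(t >> 16);
    return smpy16(y, y);
}

/* Alternative order: square the upper halfword of the rounded sum. */
int32_t yy_via_smpyh(int16_t x, int16_t w)
{
    int32_t t = sadd32(smpy16(x, w), 0x8000L);
    return smpyh32(t, t);
}
```

Both functions return the same value for any inputs, because the shift by 16 and the upper-halfword multiply select the same 16 bits; the second form simply does not depend on the SHR result.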
.D .S .M .L .D .L .S
;load y[i] ;shr(y[i],2) ;smpy(shr(y[i],2),shr(y[i],2)) ;sum=sadd(sum,smpy(shr(y[i],2),shr(y[i],2))) ;store y[i]=shr(y[i],2) ;decrement loop counter ;branch to loop
The packed words x[i] + (x[i+1] << 16) and wind[i] + (wind[i+1] << 16) must be loaded in the same cycle. y[i] and y[i+1] must be stored in the other cycle.
To avoid a memory bank hit:
Allocate x and wind in different memory spaces, if possible. For instance, allocate wind in data ROM and x in data RAM.
If no data ROM is available, allocate x and wind so they are offset from each other by one word.
There is no memory bank problem when storing y[i] and y[i+1]. No memory bank hits occur in loop II, because the distance between the load and store is always six halfwords. The modified C code of this part of autocorr.c is shown in Example A-15.
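The allocation rules above can be sanity-checked with a small host-side model. This is a sketch under stated assumptions, not a description of any particular device: it assumes four interleaved banks that are each one halfword wide (NUM_BANKS and BANK_BYTES are assumptions of this sketch), that all accesses are naturally aligned, and that both accesses go to the same memory space. The function names are hypothetical.

```c
#include <stdint.h>

#define NUM_BANKS   4u   /* assumption: four interleaved banks */
#define BANK_BYTES  2u   /* assumption: each bank is one halfword wide */

/* Bit mask of the banks touched by an aligned access of `size` bytes. */
unsigned bank_mask(uint32_t byte_addr, unsigned size)
{
    unsigned mask = 0;
    for (uint32_t a = byte_addr; a < byte_addr + size; a += BANK_BYTES)
        mask |= 1u << ((a / BANK_BYTES) % NUM_BANKS);
    return mask;
}

/* Nonzero if two accesses issued in the same cycle share a bank. */
int bank_hit(uint32_t addr_a, unsigned size_a,
             uint32_t addr_b, unsigned size_b)
{
    return (bank_mask(addr_a, size_a) & bank_mask(addr_b, size_b)) != 0;
}
```

Under these assumptions, two word loads whose addresses differ by one word use disjoint bank pairs, and a halfword load and a halfword store six halfwords apart fall in different banks, which matches the statements above.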
/* Windowing of signal */
sum_e=sum_o=0L;
for (i = 0; i < L_WINDOW; i+=2)
{
    y[i] = mult_r (x[i], wind[i]);
    y[i+1] = mult_r (x[i+1], wind[i+1]);
    sum_e = L_mac (sum_e, y[i], y[i]);
    sum_o = L_mac (sum_o, y[i+1], y[i+1]);
}
sum=sum_e+sum_o;
/* Compute r[0] and test for overflow */
overfl_shft = 0;
do
{
    overfl = 0;
    /* If overflow divide y[] by 4 */
    if (sum == MAX_32)
    {
        overfl_shft = add (overfl_shft, 4);
        overfl = 1; /* Set the overflow flag */
        sum=0L;
        for (i = 0; i < L_WINDOW; i++)
        {
            y[i] = shr (y[i], 2);
            sum = L_mac (sum, y[i], y[i]);
        }
    }
} while (overfl != 0);
Example A-16. Assembly Code for Windowing and Scaling Part of autocorr.c
*********************************************************************************** ** Texas Instruments, Inc ** ** ** ** Implementation of The Windowing and Scaling Part of autocorr.c ** ** In EFR ** ** ** ** Compute two samples a time ** ** ** ** Total cycles = 257 (No Scaling) ** ** = 519 (One Scaling) ** ** ** ** Register Usage: A B ** ** 11 9 ** ** ** *********************************************************************************** ; B4 &x[0] ; A4 &window[0] ; A6 &y[0] ; B8 L_WINDOW ; A0 sum and sum_e ; B0 sum_o ; B15 stack pointer ; notice that we use the latter approach in Example A13
|| || ||
LDW LDW MVK SUB SUB MVK LDW LDW SHL MVK ADD MV
LDW LDW MVKLH MV
.D2 .D1 .S1 .S2 .L1X .S2 .D2 .D1 .S2 .S1 .L2X .L1
.D2 .D1 .S1 .L1X
*B4++,B5 *A4++,A5 480,A6 B8,6,B1 B15,A6,A6 1,B7 *B4++,B5 *A4++,A5 B7,15,B7 1,A10 A6,2,B6 A6,A3
*B4++,B5 *A4++,A5 32767,A10 B7,A7
; ; ; ;
load x[0] & x[1] load wind[0] & wind[1] reserve space for y[i] LOOP I counter
; &y[0]
||
||
; load x[2] & x[3] ; load wind[2] & wind[3] ; 32768 or 0x8000L for rounding ; &y[1] ; &y[0]
; ; ; ; load x[4] & x[5] load wind[4] & wind[5] 7fffffff = MAX_32 32768
|| || ||
|| || ||
SMPYH SMPY B LDW LDW MVK MVK SMPYH SMPY SADD SADD B LDW LDW SHR SHR SMPYH SMPYH .M2X .M1X .S2 .D2 .D1 .S1 .S2 .M2X .M1X .L1 .L2 .S1 .D2 .D1 .S1 .S2 .M1 .M2 B5,A5,B2 B5,A5,A2 LOOPI *B4++,B5 *A4++,A5 0,A0 0,B0 B5,A5,B2 B5,A5,A2 A2,A7,A2 B2,B7,B2 LOOPI *B4++,B5 *A4++,A5 A2,16,A9 B2,16,B9 A2,A2,A11 B2,B2,B11 ; smpy(x[1],wind[1]) ; smpy(x[0],wind[0])
|| ||
|| || ||
; ; ; ; ; ; ; ;
load x[6] & x[7] load wind[6] & wind[7] sum_o = 0 sum_e = 0 smpy(x[3],wind[3]) smpy(x[2],wind[2]) sadd(smpy(x[1],wind[1]),0x8000L) sadd(smpy(x[0],wind[0]),0x8000L)
|| || || ||
|| || || || || LOOPI:
; ; ; ; ; ;
load x[8] & x[9] load wind[8] & wind[9] y[1]=sadd(smpy(x[1],wind[1]),0x8000L)>>16 y[0]=sadd(smpy(x[0],wind[0])+0x8000L)>>16 smpy(y[0],y[0]) smpy(y[1],y[1])
|| || || || || ||[B1] ||[B1]
STH STH SADD SADD SMPYH SMPY SUB B SADD SADD SMPYH SMPYH SHR SHR LDW LDW SADD MPY
.D1 .D2 .L1 .L2 .M2X .M1X .S2 .S1 .L1 .L2 .M1 .M2 .S1 .S2 .D2 .D1 .L1X .M2
A9,*A6++[2] B9,*B6++[2] A2,A7,A2 B2,B7,B2 B5,A5,B2 B5,A5,A2 B1,2,B1 LOOPI A0,A11,A0 B0,B11,B0 A2,A2,A11 B2,B2,B11 A2,16,A9 B2,16,B9 *B4++,B5 *A4++,A5 A0,B0,A0 B0,0,B0
; ; ; ; ; ; ;
store y[1] store y[0] sadd(smpy(x[3],wind[3]),0x8000L) sadd(smpy(x[2],wind[2]),0x8000L) smpy(x[5],wind[5]) smpy(x[4],wind[4]) decrement the loop counter
|| || || || || || ||
; ; ; ; ; ; ; ;
sum_e += smpy(y[0],y[0]) sum_o += smpy(y[1],y[1]) smpy(y[2],y[2]) smpy(y[3],y[3]) y[3]=sadd(smpy(x[3],wind[3]),0x8000L)>>16 y[2]=sadd(smpy(x[2],wind[2]),0x8000L)>>16 load x[10] & x[11] load wind[10] & wind[11]
||
LTEST: CMPEQ [!A1] ||[A1] ||[A1] ||[A1] [A1] ||[A1] ||[A1] ||[A1] [A1] ||[A1] ||[A1] B LDH ADD ADD LDH SUB B MV LDH MVK B .L1 .S1 .D1 .L2X .D2 .D2 .S2 .S1 .L1 .D2 .S1 .S2 .D2 .S1 .L1 .D2 .S1 .D2 .S1X .S2 A0,A10,A1 FINISH *A3,B5 A3,2,B9 B0,4,B0 *B9++,B5 B8,7,B1 LOOP II A3,A9 *B9++,B5 0,A0 LOOPII *B9++,B5 LOOPII A0,A2 *B9++,B5 LOOPII *B9++,B5 B5,2,A5 LOOPII ; if (sum == MAX_32) ; ; ; ; No, exit load y[0] &y[1] add (overfl_shift,4)
[A1] LDH ||[A1] B ||[A1] MV [A1] LDH ||[A1] B [A1] LDH ||[A1] SHR ||[A1] B LOOPII: LDH SHR B STH ADD SMPY SADD STH SMPY SADD B SADD SADD NOP FINISH:
|| ||[B1] || ||[B1] || ||
.D2 .S1X .S2 .D1 .L2 .M1 .L1 .D1 .M1 .L1 .S2 .L1 .L1 3
*B9++,B5 B5,2,A5 LOOPII A5,*A9++ B1,1,B1 A5,A5,A2 A2,A0,A0 A5,*A9++ A5,A5,A2 A2,A0,A0 LTEST A2,A0,A0 A2,A0,A0
; ; ; ; ; ; ; ; ; ; ;
load y[6] y[1] = shr(y[1],2) branch store y[0] decrement LOOPII counter smpy(y[0],y[0]) sum +=smpy(y[i],y[i]) store y[n1] smpy(y[n1],y[n1]) sum +=smpy(y[n3],y[n3]) branch back to LTEST
|| || ||
If code size is not an issue, you can eliminate the last three NOPs by expanding the epilog of loop II. This saves three cycles every time loop II executes; however, code size increases by two fetch packets (2 × 32 = 64 bytes).
local variables/arrays:
Word16 h2[L_CODE]; /* function of h, the impulse response of weighted synthesis filter */
Word16 dec, j, i, k;
Word32 s;

Original C code
for (dec=1; dec<L_CODE; dec++)
{
    s = 0;
    j = L_CODE-1;
    i = sub(j, dec);
    for (k=0; k<(L_CODE-dec); k++, i--, j--)
    {
        s = L_mac(s, h2[k], h2[k+dec]);
        rr[j][i] = mult(round(s), mult(sign[i],sign[j]));
        rr[i][j] = rr[j][i];
    }
}

where
sub(a,b)     = _ssub(a<<16, b<<16)>>16
L_mac(a,b,c) = _sadd(a,_smpy(b,c))
mult(a,b)    = _smpy(a,b)>>16
and
round(a)     = _sadd(a,0x8000L)>>16
The instructions to execute one iteration of the inner loop are listed in Example A-18.
Example A-18. Linear Assembly for cor_h (One Inner Loop Iteration)
INNERLOOP:
         LDH     .D  *h2ptr++,h2k          ;load h2[k]
         LDH     .D  *h2decptr++,h2deck    ;load h2[k+dec]
         SMPY    .M  h2k,h2deck,h2kk       ;smpy(h2[k],h2[k+dec])
         SADD    .L  s,h2kk,s              ;sadd(s,smpy(h2[k],h2[k+dec]))
         SADD    .L  s,0x8000L,sround      ;round(s)<<16
         LDH     .D  *signiptr,signi       ;load sign[i]
         LDH     .D  *signjptr,signj       ;load sign[j]
         SMPY    .M  signi,signj,signij    ;smpy(sign[i],sign[j])=mult(sign[i],sign[j])<<16
         SMPYH   .M  signij,sround,rrji0   ;L_mult(round(s),mult(sign[i],sign[j]))
         SHR     .S  rrji0,16,rrji         ;rr[j][i]
         STH     .D  rrji,*rrjiptr[41]     ;store rr[j][i]
         STH     .D  rrji,*rrijptr[41]     ;store rr[i][j]
 [icntr] SUB.ALU     icntr,1,icntr         ;decrement inner loop counter
 [icntr] B       .S  INNERLOOP             ;branch to inner loop
In Example A-18, h2ptr and h2decptr are the pointers for h2, pointing to h2[k] and h2[k+dec]. The pointers for sign, signiptr and signjptr, point to sign[i] and sign[j]. The pointers for rr, rrjiptr and rrijptr, point to rr[j][i] and rr[i][j], respectively. Notice that each element rr[j][i] is implemented as:
rr[j][i] = (_smpyh(_sadd(s, 0x8000L), _smpy(sign[i], sign[j]))) >> 16
The .D unit is used most often (six times in the inner loop). Ideally, these instructions can be arranged in three cycles. However, memory bank hits occur with any combination of the load and/or store instructions.
Next, consider unrolling the inner loop once. The C code is shown in Example A-19.
Eight values must be loaded and four values must be stored in every iteration; however, h2[k] and h2[k+1] can be loaded in a word. The same is true for sign[j] and sign[j-1]. A total of six loads are required. The inner loop instructions are shown in Example A-20.
Example A-20. Linear Assembly for cor_h (With Inner Loop Unrolling)
INNERLOOP:
         LDW     .D  *h2ptr++,h2k_h2k+1           ;load h2[k] and h2[k+1]
         LDH     .D  *h2decptr++,h2deck           ;load h2[k+dec]
         SMPY    .M  h2k_h2k+1,h2deck,h2kk0       ;smpy(h2[k],h2[k+dec])
         SADD    .L  s,h2kk0,s                    ;sadd(s,smpy(h2[k],h2[k+dec]))
         SADD    .L  s,0x8000L,sround             ;round(s)<<16
         LDH     .D  *signiptr,signi              ;load sign[i]
         LDW     .D  *signjptr,signj_signj1       ;load sign[j] and sign[j-1]
         SMPYLH  .M  signi,signj_signj1,signij0   ;smpy(sign[i],sign[j])
         SMPYH   .M  signij0,sround,rrji0         ;L_mult(round(s),mult(sign[i],sign[j]))
         SHR     .S  rrji0,16,rrji                ;rr[j][i]
         STH     .D  rrji,*rrjiptr[82]            ;store rr[j][i]
         STH     .D  rrji,*rrijptr[82]            ;store rr[i][j]
         LDH     .D  *h2decptr++,h2deck+1         ;load h2[k+1+dec]
         SMPYHL  .M  h2k_h2k+1,h2deck+1,h2kk1     ;smpy(h2[k+1],h2[k+1+dec])
         SADD    .L  s,h2kk1,s                    ;sadd(s,smpy(h2[k+1],h2[k+1+dec]))
         SADD    .L  s,0x8000L,sround             ;round(s)<<16
         LDH     .D  *signiptr,signi1             ;load sign[i-1]
         SMPY    .M  signi1,signj_signj1,signij1  ;smpy(sign[i-1],sign[j-1])
         SMPYH   .M  signij1,sround,rrji1         ;L_mult(round(s),mult(sign[i-1],sign[j-1]))
         SHR     .S  rrji1,16,rrj1i1              ;rr[j-1][i-1]
         STH     .D  rrj1i1,*rrj1i1ptr[82]        ;store rr[j-1][i-1]
         STH     .D  rrj1i1,*rri1j1ptr[82]        ;store rr[i-1][j-1]
 [icntr] SUB.ALU     icntr,2,icntr                ;decrement inner loop counter
 [icntr] B       .S  INNERLOOP                    ;branch to loop
Load words (h2[k], h2[k+1]) and (sign[i-1], sign[i]) together and allocate h2 and sign so that they are aligned with each other.
Store rr[j][i] and rr[j-1][i-1] together and rr[i][j] and rr[i-1][j-1] together.
There are five load/store pairs, so each iteration requires only five cycles. You gain speed by eliminating the memory bank hits, as well as by reducing the cycles required to complete each rr. The final assembly code with reduced code size is shown in Example A-21. Here, the primitive technique introduced in section 6.4.3.4, Priming the Loop, is used to reduce the code size for both the prolog and epilog of the inner loop.
Example A-21. Assembly Code for cor_h With Reduced Code Size
********************************************************************************** ** Texas Instruments, Inc ** ** ** ** Implementation of cor_h in EFR ** ** ** ** Compute four rrs at a time ** ** ** ** Total cycles = 2533 ** ** ** ** Register Usage: A B ** ** ** ** 16 15 ** ** ** ********************************************************************************** ; ; ; ; A4 B4 A6 B6 L_CODE &h2[0] &sign[0] &rr[0][0]
|| || ||
; ; ; ; ; ; ; ;
used to obtain &rr[i][j] and &rr[i1][j1] &sign[L_CODE2] &rr[L_CODE1][L_CODE2]+[82]=&rr[j][i]+[82] outer loop counter not doing the initial store &h2[k+dec] used to increase/decrease the pointers for h2 and sign
|| ||
OUTERLOOP: LDW LDW ADD SUB ADD MPY MPY LDH LDH ADD MV SUB ADDK .D1 .D2 .L1X .S1 .L2X .M1 .M2 .D2 .D1 .L2X .S2 .L1 .S1 *A6,A10 *B4,B12 B13,2,A3 A6,A11,A4 A2,2,B0 A13,A11,A3 B11,0,B11 *B13++[2],A7 *A3,B7 A4,2,B9 B6,B14 A6,4,A8 164,A14 ; ; ; ; ; load sign[j1] & sign[j] load h2[k] & h2[k+1] &h2[k+dec+1] &sign[i1] define the inner loop counter
|| || || ||[A2] || ||
; initialize s ; ; ; ; ; ; load h2[k+dec] load h2[k+dec+1] &sign[i] &rr[j][i]+[82] &sign[j3] from &rr[dec][0]+[82] to &rr[dec][0]
|| || || || ||[B2]
LDH LDH ADD ADD ADDK MVK .D2 .D1 .L2 .L1 .S2 .S1 *B9,A0 *A4[2],B5 B4,4,B8 A11,2,A11 82,B14 3,A1 ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; load sign[i] load sign[i1] &h2[k+2] update A11 &rr[j][i] determine when the stores in the inner loop actually starts store rr[dec][0] from &rr[0][dec]+[82] tp &rr[0][dec] &rr[j][i]+[82] inner loop counter &rr[i1][j1] &rr[j][i1], for the next outer loop iteration decrement outer loop counter decide if the last store is needed store rr[0][dec] &rr[i][j]+[82] counter for branching to outer loop
|| || || || ||
.D1 .S1 .L1X .S2 .L2X .D2 .S2 .S1 .L2X .D1 .L1 .D2
A12,*A14 164,A9 B6,A14 B0,1,B0 B14,A3,B3 B6,2,B6 INNERLOOP A2,1,A2 A2,1,B2 A12,*A9 A14,A3,A9 B0,B1
INNERLOOP: SHR SMPYH SADD LDW LDW B ADD LDH LDH SMPYH SMPY SUB SUB ADD .S2 .M1 .L2X .D1 .D2 .S1 .L1X .D1 .D2 .M2 .M1X .S2 .L1 .L2X B9,16,B10 A3,A0,A3 B11,A15,B9 *A8,A10 *B8++,B12 OUTERLOOP B13,2,A3 *A3,B7 *B13++[2],A7 B9,B5,B9 A7,B12,A7 B1,1,B1 A1,1,A1 A4,2,B9 ; ## obtain rr[j1][i1] ; # smpyh(sadd(s,0x8000L),smpy(sign[i],sign[j])) ; # sadd(s,0x8000L) ; ; ; ; ; ; ; ; ; ; ; *load sign[j] & sign[j1] *load h2[k] & h2[k+1] outer LOOP &h2[k+dec+1] *h2[k+dec+1] *h2[k+dec] # smpyh(sadd(s,0x8000L),smpy(sign[i1],sign[j1])) smpy(h2[k],h2[k+dec] decrement the counter for branching to the outer loop decrement the inner loop &sign[i]
|| || || || ||[!B1] ||
|| || || || ||[A1] ||
|| || || ||[!A1] ||[!A1]
; ; ; ; ; ;
*load sign[i] *load sign[i1] smpy(sign[j],sign[i]) smpy(h2[k+1],h2[k+1+dec]) ## from &rr[j][i]+[82] to &rr[j][i] ## from &rr[j1][i1]+[82] to &rr[j1][i1]
[!A1] ||[!A1] || || ||[B0] ||[!A1] ||[!A1] [!A1] ||[!A1] || || || ||[B0] STH STH SADD SMPY SUB ADDK ADDK STH STH SHR SADD SADD B ADD B .D1 .D2 .L1X .M2X .L2 .S1 .S2 .D1 .D2 .S1 .L1 .L2X .S2 .L2X .S2 A12,*A14 B10,*B14 B11,A7,A5 A10,B5,B5 B0,1,B0 164,A9 164,B3 A12,*A9 B10,*B3 A3,16,A12 A5,A15,A3 A5,B7,B11 INNERLOOP B4,A11,B13 FINISH ; ; ; ; ; ; ; ; ; ; ; ; ; ## store rr[j][i] ## store rr[j1][i1] s = sadd(s,smpy(h2[k],h2[k+dec]) smpy(sign[i1],sign[j1] decrement inner loop counter ## from &rr[i][j]+[82] to rr[i][j] ## from &rr[i1][j1]+[82] to &rr[i1][j1] ## store rr[i][j] ## store rr[j1][i1] # obtain rr[j1][i1] sadd(s,0x8000L) s = sadd(s,smpy(h2[k+1],h2[k+dec+1] end of INNERLOOP
||[!A2] FINISH:
; &h2[k+dec] ; exit
NOP
The value of s is represented by both B11 and A5 to avoid two .L1 or two .L2 units occurring in the same execute packet. Due to the dependence on s, as well as the removal of memory bank hits, it takes 20 cycles for each iteration of the modified C code. The pound sign (#) in the comments indicates that, each time the outer loop enters the inner loop, this instruction is not executed (or that the result of this instruction is not useful) until the number of iterations denoted by # has occurred. The code size is 11 fetch packets (352 bytes). Without applying the primitive technique, the code size will be at least four fetch packets more than the code shown in Example A-21. You can squeeze the instruction
ADD .L2X B4,A11,B13 ; &h2[k+dec]
into the inner loop to save about 1.5% of the cycle counts, with an increase in program memory of one fetch packet.
(The values of i0, i1, i2, i3, i4, i5, i6, and i7 were obtained before entering this loop.)

Original C code
for (i9 = ipos[9]; i9 < L_CODE; i9 += STEP)
{
    s = L_mult (rr[i9][i9], _1_16);
    s = L_mac (s, rr[i0][i9], _1_8);
    s = L_mac (s, rr[i1][i9], _1_8);
    s = L_mac (s, rr[i2][i9], _1_8);
    s = L_mac (s, rr[i3][i9], _1_8);
    s = L_mac (s, rr[i4][i9], _1_8);
    s = L_mac (s, rr[i5][i9], _1_8);
    s = L_mac (s, rr[i6][i9], _1_8);
    s = L_mac (s, rr[i7][i9], _1_8);
    rrv[i9] = round (s);
}

where
L_mult(a,b)  = _smpy(a,b)
L_mac(a,b,c) = _sadd(a,_smpy(b,c))
and
round(a)     = _sadd(a,0x8000L)>>16
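To try the arithmetic of one rrv element on a host, the loop body above can be sketched with portable stand-ins. The functions sadd32 and smpy16 are written for this illustration and assumed to behave like _sadd and _smpy. The constants ONE_16TH and ONE_8TH are assumed Q15 values for _1_16 and _1_8 (32768/16 and 32768/8); their definitions are not given in this section. The names round32 and rrv_element are hypothetical.

```c
#include <stdint.h>

/* Stand-in for _sadd: saturating 32-bit add. */
int32_t sadd32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Stand-in for _smpy: (a * b) << 1, saturated to 32 bits. */
int32_t smpy16(int16_t a, int16_t b)
{
    int64_t p = 2 * (int64_t)a * (int64_t)b;
    return (p > INT32_MAX) ? INT32_MAX : (int32_t)p;
}

#define ONE_16TH ((int16_t)(32768L / 16))  /* assumed value of _1_16 */
#define ONE_8TH  ((int16_t)(32768L / 8))   /* assumed value of _1_8  */

/* round(a) = _sadd(a,0x8000L)>>16 (arithmetic shift assumed). */
int16_t round32(int32_t a)
{
    return (int16_t)(sadd32(a, 0x8000L) >> 16);
}

/* One rrv value: rr_diag is rr[i9][i9]; rr_col[m] is rr[i_m][i9], m = 0..7. */
int16_t rrv_element(int16_t rr_diag, const int16_t rr_col[8])
{
    int32_t s = smpy16(rr_diag, ONE_16TH);          /* L_mult */
    for (int m = 0; m < 8; m++)
        s = sadd32(s, smpy16(rr_col[m], ONE_8TH));  /* L_mac */
    return round32(s);
}
```

For example, with rr_diag = 16384 and all eight rr_col entries equal to 8192, each of the nine terms contributes 67108864 and the function returns 9216.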
The instructions for one loop iteration are shown in Example A-23.
Example A-23. Linear Assembly for the rrv Computation in search_10i40 (One Loop Iteration)
LOOP:    LDH   .D    *rr9ptr++[205],rr99   ;load rr[i9][i9]
         SMPY  .M    rr99,_1_16,s          ;s=L_mult(rr[i9][i9],_1_16)
         LDH   .D    *rr0ptr++[5],rr09     ;load rr[i0][i9]
         SMPY  .M    rr09,_1_8,s0          ;L_mult(rr[i0][i9],_1_8)
         SADD  .L    s,s0,s                ;s=L_mac(s,rr[i0][i9],_1_8)
         LDH   .D    *rr1ptr++[5],rr19     ;load rr[i1][i9]
         SMPY  .M    rr19,_1_8,s1          ;L_mult(rr[i1][i9],_1_8)
         SADD  .L    s,s1,s                ;s=L_mac(s,rr[i1][i9],_1_8)
         LDH   .D    *rr2ptr++[5],rr29     ;load rr[i2][i9]
         SMPY  .M    rr29,_1_8,s2          ;L_mult(rr[i2][i9],_1_8)
         SADD  .L    s,s2,s                ;s=L_mac(s,rr[i2][i9],_1_8)
         LDH   .D    *rr3ptr++[5],rr39     ;load rr[i3][i9]
         SMPY  .M    rr39,_1_8,s3          ;L_mult(rr[i3][i9],_1_8)
         SADD  .L    s,s3,s                ;s=L_mac(s,rr[i3][i9],_1_8)
         LDH   .D    *rr4ptr++[5],rr49     ;load rr[i4][i9]
         SMPY  .M    rr49,_1_8,s4          ;L_mult(rr[i4][i9],_1_8)
         SADD  .L    s,s4,s                ;s=L_mac(s,rr[i4][i9],_1_8)
         LDH   .D    *rr5ptr++[5],rr59     ;load rr[i5][i9]
         SMPY  .M    rr59,_1_8,s5          ;L_mult(rr[i5][i9],_1_8)
         SADD  .L    s,s5,s                ;s=L_mac(s,rr[i5][i9],_1_8)
         LDH   .D    *rr6ptr++[5],rr69     ;load rr[i6][i9]
         SMPY  .M    rr69,_1_8,s6          ;L_mult(rr[i6][i9],_1_8)
         SADD  .L    s,s6,s                ;s=L_mac(s,rr[i6][i9],_1_8)
         LDH   .D    *rr7ptr++[5],rr79     ;load rr[i7][i9]
         SMPY  .M    rr79,_1_8,s7          ;L_mult(rr[i7][i9],_1_8)
         SADD  .L    s,s7,s                ;s=L_mac(s,rr[i7][i9],_1_8)
         SADD  .L    s,0x8000L,sround      ;round(s)
         SHR   .S    sround,16,rrv9        ;rrv[i9]
         STH   .D    rrv9,*rrv9ptr++[5]    ;store rrv[i9]
 [icntr] SUB   .ALU  icntr,1,icntr         ;decrement inner loop counter
 [icntr] B     .S    LOOP                  ;branch to loop
The following table shows the pointers in Example A23 and the arrays they point to.

Pointer     For array
rr9ptr      rr[i9][i9]
rr0ptr      rr[i0][i9]
rr1ptr      rr[i1][i9]
rr2ptr      rr[i2][i9]
rr3ptr      rr[i3][i9]
rr4ptr      rr[i4][i9]
rr5ptr      rr[i5][i9]
rr6ptr      rr[i6][i9]
rr7ptr      rr[i7][i9]
rrv9ptr     rrv[i9]
The .D unit is used the most (ten times per iteration). Although these instructions can be arranged in five cycles, any combination of the loads hits the same memory bank, because any two values loaded are a multiple of 40 halfwords apart. It therefore still takes ten cycles to compute one rrv value.
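The bank argument can be checked with a few lines of C. This is a minimal sketch under a stated assumption: data memory is taken to be interleaved across four 16-bit banks, so the bank is the halfword index modulo 4. NUM_BANKS, bank_of, and rr_bank are hypothetical names, and rr[0][0] is assumed to sit at halfword index 0.

```c
#include <stdint.h>

#define L_CODE    40   /* row length of rr, in halfwords (from the source) */
#define NUM_BANKS 4    /* assumption: four interleaved 16-bit banks */

/* Bank selected by a halfword index. */
static unsigned bank_of(unsigned halfword_index)
{
    return halfword_index % NUM_BANKS;
}

/* Bank that holds rr[row][col], with rr[0][0] at halfword index 0. */
static unsigned rr_bank(unsigned row, unsigned col)
{
    return bank_of(row * L_CODE + col);
}

/* Returns 1 if every pair of rows lands in the same bank for column col. */
static int column_always_conflicts(unsigned col)
{
    for (unsigned r = 1; r < L_CODE; r++)
        if (rr_bank(r, col) != rr_bank(0, col))
            return 0;
    return 1;
}
```

Because 40 is a multiple of 4, moving from one row to another at a fixed column never changes the bank under this model, which is why two such loads cannot be issued in the same cycle without a stall.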
Next, consider unrolling the inner loop once. The C code is shown in Example A24.
Example A24. C Code for the rrv Computation in search_10i40 (Unrolled Loop)
for (i9 = ipos[9]; i9 < L_CODE; i9 += 2*STEP)
{
    s = L_mult (rr[i9][i9], _1_16);
    S = L_mult (rr[i9+5][i9+5], _1_16);
    s = L_mac (s, rr[i0][i9], _1_8);
    S = L_mac (S, rr[i0][i9+5], _1_8);
    s = L_mac (s, rr[i1][i9], _1_8);
    S = L_mac (S, rr[i1][i9+5], _1_8);
    s = L_mac (s, rr[i2][i9], _1_8);
    S = L_mac (S, rr[i2][i9+5], _1_8);
    s = L_mac (s, rr[i3][i9], _1_8);
    S = L_mac (S, rr[i3][i9+5], _1_8);
    s = L_mac (s, rr[i4][i9], _1_8);
    S = L_mac (S, rr[i4][i9+5], _1_8);
    s = L_mac (s, rr[i5][i9], _1_8);
    S = L_mac (S, rr[i5][i9+5], _1_8);
    s = L_mac (s, rr[i6][i9], _1_8);
    S = L_mac (S, rr[i6][i9+5], _1_8);
    s = L_mac (s, rr[i7][i9], _1_8);
    S = L_mac (S, rr[i7][i9+5], _1_8);
    rrv[i9] = round (s);
    rrv[i9+5] = round (S);
}
Example A25. Linear Assembly for rrv Computation in search_10i40 (One Loop Iteration)
LOOP:
        LDH     .D      *rr9ptr++[410],rr99     ;load rr[i9][i9]
        SMPY    .M      rr99,_1_16,s            ;s=L_mult(rr[i9][i9],_1_16)
        LDH     .D      *rr95ptr++[410],rr995   ;load rr[i9+5][i9+5]
        SMPY    .M      rr995,_1_16,S           ;S=L_mult(rr[i9+5][i9+5],_1_16)
        LDH     .D      *rr0ptr++[10],rr09      ;load rr[i0][i9]
        SMPY    .M      rr09,_1_8,s0            ;L_mult(rr[i0][i9],_1_8)
        SADD    .L      s,s0,s                  ;s=L_mac(s,rr[i0][i9],_1_8)
        LDH     .D      *rr05ptr++[10],rr095    ;load rr[i0][i9+5]
        SMPY    .M      rr095,_1_8,S0           ;L_mult(rr[i0][i9+5],_1_8)
        SADD    .L      S,S0,S                  ;S=L_mac(S,rr[i0][i9+5],_1_8)
        LDH     .D      *rr1ptr++[10],rr19      ;load rr[i1][i9]
        SMPY    .M      rr19,_1_8,s1            ;L_mult(rr[i1][i9],_1_8)
        SADD    .L      s,s1,s                  ;s=L_mac(s,rr[i1][i9],_1_8)
        LDH     .D      *rr15ptr++[10],rr195    ;load rr[i1][i9+5]
        SMPY    .M      rr195,_1_8,S1           ;L_mult(rr[i1][i9+5],_1_8)
        SADD    .L      S,S1,S                  ;S=L_mac(S,rr[i1][i9+5],_1_8)
        LDH     .D      *rr2ptr++[10],rr29      ;load rr[i2][i9]
        SMPY    .M      rr29,_1_8,s2            ;L_mult(rr[i2][i9],_1_8)
        SADD    .L      s,s2,s                  ;s=L_mac(s,rr[i2][i9],_1_8)
        LDH     .D      *rr25ptr++[10],rr295    ;load rr[i2][i9+5]
        SMPY    .M      rr295,_1_8,S2           ;L_mult(rr[i2][i9+5],_1_8)
        SADD    .L      S,S2,S                  ;S=L_mac(S,rr[i2][i9+5],_1_8)
        LDH     .D      *rr3ptr++[10],rr39      ;load rr[i3][i9]
        SMPY    .M      rr39,_1_8,s3            ;L_mult(rr[i3][i9],_1_8)
        SADD    .L      s,s3,s                  ;s=L_mac(s,rr[i3][i9],_1_8)
        LDH     .D      *rr35ptr++[10],rr395    ;load rr[i3][i9+5]
        SMPY    .M      rr395,_1_8,S3           ;L_mult(rr[i3][i9+5],_1_8)
        SADD    .L      S,S3,S                  ;S=L_mac(S,rr[i3][i9+5],_1_8)
        LDH     .D      *rr4ptr++[10],rr49      ;load rr[i4][i9]
        SMPY    .M      rr49,_1_8,s4            ;L_mult(rr[i4][i9],_1_8)
        SADD    .L      s,s4,s                  ;s=L_mac(s,rr[i4][i9],_1_8)
        LDH     .D      *rr45ptr++[10],rr495    ;load rr[i4][i9+5]
        SMPY    .M      rr495,_1_8,S4           ;L_mult(rr[i4][i9+5],_1_8)
        SADD    .L      S,S4,S                  ;S=L_mac(S,rr[i4][i9+5],_1_8)
        LDH     .D      *rr5ptr++[10],rr59      ;load rr[i5][i9]
        SMPY    .M      rr59,_1_8,s5            ;L_mult(rr[i5][i9],_1_8)
        SADD    .L      s,s5,s                  ;s=L_mac(s,rr[i5][i9],_1_8)
        LDH     .D      *rr55ptr++[10],rr595    ;load rr[i5][i9+5]
        SMPY    .M      rr595,_1_8,S5           ;L_mult(rr[i5][i9+5],_1_8)
        SADD    .L      S,S5,S                  ;S=L_mac(S,rr[i5][i9+5],_1_8)
        LDH     .D      *rr6ptr++[10],rr69      ;load rr[i6][i9]
        SMPY    .M      rr69,_1_8,s6            ;L_mult(rr[i6][i9],_1_8)
        SADD    .L      s,s6,s                  ;s=L_mac(s,rr[i6][i9],_1_8)
        LDH     .D      *rr65ptr++[10],rr695    ;load rr[i6][i9+5]
        SMPY    .M      rr695,_1_8,S6           ;L_mult(rr[i6][i9+5],_1_8)
        SADD    .L      S,S6,S                  ;S=L_mac(S,rr[i6][i9+5],_1_8)
        LDH     .D      *rr7ptr++[10],rr79      ;load rr[i7][i9]
        SMPY    .M      rr79,_1_8,s7            ;L_mult(rr[i7][i9],_1_8)
        SADD    .L      s,s7,s                  ;s=L_mac(s,rr[i7][i9],_1_8)
        LDH     .D      *rr75ptr++[10],rr795    ;load rr[i7][i9+5]
        SMPY    .M      rr795,_1_8,S7           ;L_mult(rr[i7][i9+5],_1_8)
        SADD    .L      S,S7,S                  ;S=L_mac(S,rr[i7][i9+5],_1_8)
        SADD    .L      s,0x8000L,sround        ;round(s)
        SHR     .S      sround,16,rrv9          ;rrv[i9]
        STH     .D      rrv9,*rrv9ptr++[10]     ;store rrv[i9]
        SADD    .L      S,0x8000L,Sround        ;round(S)
        SHR     .S      Sround,16,rrv95         ;rrv[i9+5]
        STH     .D      rrv95,*rrv95ptr++[10]   ;store rrv[i9+5]
[icntr] SUB     .ALU    icntr,2,icntr           ;decrement inner loop counter
[icntr] B       .S      LOOP                    ;branch to loop
The following table shows the pointers in Example A25 and the arrays they point to.

Pointer                 For array
rr9ptr and rr95ptr      rr[i9][i9] and rr[i9+5][i9+5]
rrxptr and rrx5ptr      rr[ix][i9] and rr[ix][i9+5] (where x = 0, 1, ..., 7)
rrv9ptr and rrv95ptr    rrv[i9] and rrv[i9+5]
Again, the .D unit is used the most (twenty times per iteration). None of the pairs rr[ix][i9], rr[iy][i9+5] hit the same memory bank (where ix, iy = i0, i1, ..., i7). The same is true for the pair rrv[i9], rrv[i9+5], as well as for rr[i9][i9] and rr[i9+5][i9+5]. For ease of understanding:
In this way, each iteration takes ten cycles without any memory bank hits. You double the speed by unrolling the loop once. The final assembly code is shown in Example A26.
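As a host-side sanity check, the rolled loop and the loop unrolled once can be compared directly. This is a minimal sketch rather than the guide's code: sat32, sat_mpy, sat_add, round16, and the rrv_* functions are hypothetical names, the index set and test data are invented for illustration, and the constants 2048 and 4096 follow the MVK 2048 and shift-left-by-one used for _1_16 and _1_8 in Example A26.

```c
#include <stdint.h>
#include <string.h>

#define L_CODE 40
#define STEP   5
#define ONE_16 2048   /* _1_16 */
#define ONE_8  4096   /* _1_8  */

static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}
static int32_t sat_mpy(int16_t a, int16_t b) { return sat32(2 * (int64_t)a * b); }
static int32_t sat_add(int32_t a, int32_t b) { return sat32((int64_t)a + b); }
static int16_t round16(int32_t a)
{
    int32_t r = sat_add(a, 0x8000);
    return (int16_t)((r - (r & 0xFFFF)) / 65536);   /* floor(r / 2^16) */
}

/* One rrv value, following the original C code. */
static int16_t rrv_at(int16_t rr[L_CODE][L_CODE], const int idx[8], int i9)
{
    int32_t s = sat_mpy(rr[i9][i9], ONE_16);
    for (int k = 0; k < 8; k++)
        s = sat_add(s, sat_mpy(rr[idx[k]][i9], ONE_8));
    return round16(s);
}

static void rrv_rolled(int16_t rr[L_CODE][L_CODE], const int idx[8],
                       int start, int16_t rrv[L_CODE])
{
    for (int i9 = start; i9 < L_CODE; i9 += STEP)
        rrv[i9] = rrv_at(rr, idx, i9);
}

/* Unrolled once: two results per pass, columns i9 and i9+STEP. */
static void rrv_unrolled(int16_t rr[L_CODE][L_CODE], const int idx[8],
                         int start, int16_t rrv[L_CODE])
{
    for (int i9 = start; i9 < L_CODE; i9 += 2 * STEP) {
        rrv[i9]        = rrv_at(rr, idx, i9);
        rrv[i9 + STEP] = rrv_at(rr, idx, i9 + STEP);
    }
}

/* Returns 1 if both versions produce identical rrv arrays (start in 0..4). */
static int rrv_versions_agree(int start)
{
    static int16_t rr[L_CODE][L_CODE];
    static const int idx[8] = { 0, 6, 12, 18, 24, 30, 36, 3 };  /* invented */
    int16_t a[L_CODE] = { 0 }, b[L_CODE] = { 0 };

    for (int i = 0; i < L_CODE; i++)
        for (int j = 0; j < L_CODE; j++)
            rr[i][j] = (int16_t)((i * 131 + j * 17) % 2001 - 1000);

    rrv_rolled(rr, idx, start, a);
    rrv_unrolled(rr, idx, start, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

The unrolling is safe here because each starting position visits eight columns, an even count, so the second column of every pair stays inside the array.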
Example A26. Assembly Code for the rrv Computation in search_10i40 (Continued)
        MPYU    .M2     B6,B0,B6          ; [i2][0]
||      MPYU    .M1     A11,A1,A11        ; [i7][0]
||      ADD     .L1X    B4,A15,A4         ; &rr[i0][i9]
||      ADD     .L2     B4,B15,B4         ; &rr[i0][i9+5]
||      LDH     .D1     *A3++[A2],A13     ; load rr[i9][i9]
||      LDH     .D2     *B3++[B2],B13     ; load rr[i9+5][i9+5]
||      ADD     .S1     A0,A13,A0         ; &rrv[i9]

        MPYU    .M2     B7,B0,B7          ; [i3][0]
||      MPYU    .M1     A8,A1,A8          ; [i4][0]
||      ADD     .L1X    B5,A15,A5         ; &rr[i1][i9]
||      ADD     .L2     B5,B15,B5         ; &rr[i1][i9+5]
||      ADD     .S1     A10,A15,A10       ; &rr[i6][i9]
||      ADD     .S2X    A10,B15,B10       ; &rr[i6][i9+5]
||      LDH     .D1     *A4++[10],A13     ; load rr[i0][i9]
||      LDH     .D2     *B4++[10],B13     ; load rr[i0][i9+5]

        MPYU    .M2     B9,B0,B9          ; [i5][0]
||      ADD     .L1X    B6,A15,A6         ; &rr[i2][i9]
||      ADD     .L2     B6,B15,B6         ; &rr[i2][i9+5]
||      ADD     .S1     A11,A15,A11       ; &rr[i7][i9]
||      ADD     .S2X    A11,B15,B11       ; &rr[i7][i9+5]
||      LDH     .D1     *A5++[10],A13     ; load rr[i1][i9]
||      LDH     .D2     *B5++[10],B13     ; load rr[i1][i9+5]

        ADD     .L1X    B7,A15,A7         ; &rr[i3][i9]
||      ADD     .L2     B7,B15,B7         ; &rr[i3][i9+5]
||      ADD     .S1     A8,A15,A8         ; &rr[i4][i9]
||      ADD     .S2X    A8,B15,B8         ; &rr[i4][i9+5]
||      LDH     .D1     *A6++[10],A13     ; load rr[i2][i9]
||      LDH     .D2     *B6++[10],B13     ; load rr[i2][i9+5]

        ADD     .L1X    B9,A15,A9         ; &rr[i5][i9]
||      ADD     .L2     B9,B15,B9         ; &rr[i5][i9+5]
||      LDH     .D1     *A7++[10],A13     ; load rr[i3][i9]
||      LDH     .D2     *B7,B13           ; load rr[i3][i9+5]
||      MVK     .S2     2048,B7           ; _1_16

        LDH     .D1     *A8++[10],A13     ; load rr[i4][i9]
||      LDH     .D2     *B8++[10],B13     ; load rr[i4][i9+5]
||      SMPY    .M1X    A13,B7,A12        ; s=smpy(rr[i9][i9],_1_16)
||      SMPY    .M2     B13,B7,B12        ; S=smpy(rr[i9+5][i9+5],_1_16)
||      SHL     .S2     B7,1,B7           ; _1_8
||      ADD     .L2X    A0,10,B0          ; &rrv[i9+5]
        LDH     .D1     *A9++[10],A13     ; load rr[i5][i9]
||      LDH     .D2     *B9++[10],B13     ; load rr[i5][i9+5]
||      SMPY    .M1X    A13,B7,A15        ; s0=smpy(rr[i0][i9],_1_8)
||      SMPY    .M2     B13,B7,B15        ; S0=smpy(rr[i0][i9+5],_1_8)

        LDH     .D1     *A10++[10],A13    ; load rr[i6][i9]
||      LDH     .D2     *B10++[10],B13    ; load rr[i6][i9+5]
||      SMPY    .M1X    A13,B7,A15        ; s1=smpy(rr[i1][i9],_1_8)
||      SMPY    .M2     B13,B7,B15        ; S1=smpy(rr[i1][i9+5],_1_8)

        LDH     .D1     *A11++[10],A13    ; load rr[i7][i9]
||      LDH     .D2     *B11++[10],B13    ; load rr[i7][i9+5]
||      SMPY    .M1X    A13,B7,A15        ; s2=smpy(rr[i2][i9],_1_8)
||      SMPY    .M2     B13,B7,B15        ; S2=smpy(rr[i2][i9+5],_1_8)
||      SADD    .L1     A12,A15,A12       ; s=sadd(s,s0)
||      SADD    .L2     B12,B15,B12       ; S=sadd(S,S0)
||      MVK     .S1     3,A1              ; loop counter

        SMPY    .M1X    A13,B7,A15        ; s3=smpy(rr[i3][i9],_1_8)
||      SMPY    .M2     B13,B7,B15        ; S3=smpy(rr[i3][i9+5],_1_8)
||      SADD    .L1     A12,A15,A12       ; s=sadd(s,s1)
||      SADD    .L2     B12,B15,B12       ; S=sadd(S,S1)
||      MVK     .S1     32767,A14

LOOP:
        SADD    .L1     A12,A15,A12       ; s=sadd(s,s2)
||      SADD    .L2     B12,B15,B12       ; S=sadd(S,S2)
||      SMPY    .M1X    A13,B7,A15        ; s4=smpy(rr[i4][i9],_1_8)
||      SMPY    .M2     B13,B7,B15        ; S4=smpy(rr[i4][i9+5],_1_8)
||      LDH     .D1     *A3++[A2],A13     ;* load rr[i9][i9]
||      LDH     .D2     *B3++[B2],B13     ;* load rr[i9+5][i9+5]
||      ADD     .S1     A14,1,A14         ; 32768 for rounding

        SMPY    .M1X    A13,B7,A15        ; s5=smpy(rr[i5][i9],_1_8)
||      SMPY    .M2     B13,B7,B15        ; S5=smpy(rr[i5][i9+5],_1_8)
||      SADD    .L1     A12,A15,A12       ; s=sadd(s,s3)
||      SADD    .L2     B12,B15,B12       ; S=sadd(S,S3)
||      LDH     .D1     *A4++[10],A13     ;* load rr[i0][i9]
||      LDH     .D2     *B4,B13           ;* load rr[i0][i9+5]

        SMPY    .M1X    A13,B7,A15        ; s6=smpy(rr[i6][i9],_1_8)
||      SMPY    .M2     B13,B7,B15        ; S6=smpy(rr[i6][i9+5],_1_8)
||      SADD    .L1     A12,A15,A12       ; s=sadd(s,s4)
||      SADD    .L2     B12,B15,B12       ; S=sadd(S,S4)
||      LDH     .D1     *A5++[10],A13     ;* load rr[i1][i9]
||      LDH     .D2     *B5++[10],B13     ;* load rr[i1][i9+5]
        SMPY    .M1X    A13,B7,A15        ; s7=smpy(rr[i7][i9],_1_8)
||      SMPY    .M2     B13,B7,B15        ; S7=smpy(rr[i7][i9+5],_1_8)
||      SADD    .L1     A12,A15,A12       ; s=sadd(s,s5)
||      SADD    .L2     B12,B15,B12       ; S=sadd(S,S5)
||      LDH     .D1     *A6++[10],A13     ;* load rr[i2][i9]
||      LDH     .D2     *B6++[10],B13     ;* load rr[i2][i9+5]
||      ADD     .S2X    A7,10,B7          ; &rr[i3][i9+5]

        SADD    .L1     A12,A15,A12       ; s=sadd(s,s6)
||      SADD    .L2     B12,B15,B12       ; S=sadd(S,S6)
||      LDH     .D1     *A7++[10],A13     ;* load rr[i3][i9]
||      LDH     .D2     *B7,B13           ;* load rr[i3][i9+5]
||      MVK     .S2     2048,B7           ; _1_16
||[A1]  B       .S1     LOOP              ; branch to the loop

        SADD    .L1     A12,A15,A12       ; s=sadd(s,s7)
||      SADD    .L2     B12,B15,B12       ; S=sadd(S,S7)
||      LDH     .D1     *A8++[10],A13     ;* load rr[i4][i9]
||      LDH     .D2     *B8++[10],B13     ;* load rr[i4][i9+5]
||      SMPY    .M1X    A13,B7,A12        ;* s=smpy(rr[i9][i9],_1_16)
||      SMPY    .M2     B13,B7,B12        ;* S=smpy(rr[i9+5][i9+5],_1_16)
||      SHL     .S2     B7,1,B7           ; _1_8
||[A1]  SUB     .S1     A1,1,A1           ; decrement loop counter

        SADD    .L1     A12,A14,A14       ; round(s)
||      SADD    .L2X    B12,A14,B4        ; round(S)
||      LDH     .D1     *A9++[10],A13     ;* load rr[i5][i9]
||      LDH     .D2     *B9++[10],B13     ;* load rr[i5][i9+5]
||      SMPY    .M1X    A13,B7,A15        ;* s0=smpy(rr[i0][i9],_1_8)
||      SMPY    .M2     B13,B7,B15        ;* S0=smpy(rr[i0][i9+5],_1_8)

        SHR     .S1     A14,16,A14        ; rrv[i9]
||      SHR     .S2     B4,16,B4          ; rrv[i9+5]
||      SMPY    .M1X    A13,B7,A15        ;* s1=smpy(rr[i1][i9],_1_8)
||      SMPY    .M2     B13,B7,B15        ;* S1=smpy(rr[i1][i9+5],_1_8)
||      LDH     .D1     *A10++[10],A13    ;* load rr[i6][i9]
||      LDH     .D2     *B10++[10],B13    ;* load rr[i6][i9+5]

        SMPY    .M1X    A13,B7,A15        ;* s2=smpy(rr[i2][i9],_1_8)
||      SMPY    .M2     B13,B7,B15        ;* S2=smpy(rr[i2][i9+5],_1_8)
||      SADD    .L1     A12,A15,A12       ;* s=sadd(s,s0)
||      SADD    .L2     B12,B15,B12       ;* S=sadd(S,S0)
||      LDH     .D1     *A11++[10],A13    ;* load rr[i7][i9]
||      LDH     .D2     *B11++[10],B13    ;* load rr[i7][i9+5]
        STH     .D1     A14,*A0++[10]     ; store rrv[i9]
||      STH     .D2     B4,*B0++[10]      ; store rrv[i9+5]
||      SMPY    .M1X    A13,B7,A15        ;* s3=smpy(rr[i3][i9],_1_8)
||      SMPY    .M2     B13,B7,B15        ;* S3=smpy(rr[i3][i9+5],_1_8)
||      SADD    .L1     A12,A15,A12       ;* s=sadd(s,s1)
||      SADD    .L2     B12,B15,B12       ;* S=sadd(S,S1)
||      ADD     .S2X    A4,10,B4          ;* &rr[i0][i9+5]
||      MVK     .S1     32767,A14         ; end of LOOP
B7 serves as _1_16, as _1_8, and as the pointer to rr[i3][i9+5]. B4 holds both the value of rrv[i9+5] and the pointer to rr[i0][i9+5]. A14 holds 0x8000L as well as rrv[i9].
The last iteration of the loop can be expanded as the epilog of the loop to overlap with the prolog of the code that follows this part of the code.
Word16 rr[L_CODE][L_CODE], ipos[L_CODE], dn[L_CODE];

local variables/arrays:
Word16 rrv[L_CODE];
Word16 i0,i1,i2,i3,i4,i5,i6,i7,i8,i9;   /* defined on [0,L_CODE-1] */
Word16 ia,ib;
Word16 ps,ps0,ps1,ps2,sq,sq2;
Word16 alp,alp_16;
Word32 s,alp0,alp1,alp2;

(The values of i0, i1, i2, i3, i4, i5, i6, i7, ps0, and alp0 have been obtained before entering this loop.)

Original C code

sq = 1;
alp = 1;
ps = 0;
ia = ipos[8];
ib = ipos[9];
/* initialize 10 indices for i8 loop (see i2i3 loop) */
for (i8 = ipos[8]; i8 < L_CODE; i8 += STEP)
{
    ps1 = add (ps0, dn[i8]);
    alp1 = L_mac (alp0, rr[i8][i8], _1_128);
    alp1 = L_mac (alp1, rr[i0][i8], _1_64);
    alp1 = L_mac (alp1, rr[i1][i8], _1_64);
    alp1 = L_mac (alp1, rr[i2][i8], _1_64);
    alp1 = L_mac (alp1, rr[i3][i8], _1_64);
    alp1 = L_mac (alp1, rr[i4][i8], _1_64);
    alp1 = L_mac (alp1, rr[i5][i8], _1_64);
    alp1 = L_mac (alp1, rr[i6][i8], _1_64);
    alp1 = L_mac (alp1, rr[i7][i8], _1_64);
Example A27. C Code for the Index Search for search_10i40 (Continued)
    /* initialize 3 indices for i9 inner loop (see i2i3 loop) */
    for (i9 = ipos[9]; i9 < L_CODE; i9 += STEP)
    {
        ps2 = add (ps1, dn[i9]);
        alp2 = L_mac (alp1, rrv[i9], _1_8);
        alp2 = L_mac (alp2, rr[i8][i9], _1_64);
        sq2 = mult (ps2, ps2);
        alp_16 = round (alp2);
        s = L_msu (L_mult (alp, sq2), sq, alp_16);
        if (s > 0)
        {
            sq = sq2;
            ps = ps2;
            alp = alp_16;
            ia = i8;
            ib = i9;
        }
    }
}

where
    add(a,b)     = _sadd(a<<16,b<<16)>>16
    L_mac(a,b,c) = _sadd(a,_smpy(b,c))
    mult(a,b)    = _smpy(a<<16,b<<16)>>16
    L_mult(a,b)  = _smpy(a,b)
    round(a)     = _sadd(a,0x8000L)>>16
and
    L_msu(a,b,c) = _ssub(a,_smpy(b,c))
This is a typical example of the performance being limited by data dependency constraints. In this case, the dependency is between the values of alp and sq.
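The dependency is easier to see when the selection step is isolated. The following is a simplified sketch, not the guide's code: it ignores saturation and fixed-point scaling, assumes positive alp values, and uses the hypothetical names search_state, consider, and best_index. Each candidate is compared with the running best by cross-multiplication, so the test for candidate k cannot start until sq and alp from candidate k-1 are final.

```c
#include <stdint.h>

/* Running best candidate: the loop-carried state. */
typedef struct {
    int16_t sq;     /* squared correlation of the best candidate so far */
    int16_t alp;    /* energy of the best candidate so far */
    int     best;   /* index of the best candidate, or -1 if none chosen */
} search_state;

/* Accept the candidate when sq2/alp_16 > sq/alp, i.e. alp*sq2 > sq*alp_16. */
static void consider(search_state *st, int16_t sq2, int16_t alp_16, int index)
{
    int64_t lhs = (int64_t)st->alp * sq2;
    int64_t rhs = (int64_t)st->sq * alp_16;
    if (lhs > rhs) {
        st->sq   = sq2;      /* these writes feed the next comparison */
        st->alp  = alp_16;
        st->best = index;
    }
}

/* Scan n candidates, starting from sq = 1, alp = 1 as printed in the listing. */
static int best_index(const int16_t sq2[], const int16_t alp_16[], int n)
{
    search_state st = { 1, 1, -1 };
    for (int k = 0; k < n; k++)
        consider(&st, sq2[k], alp_16[k], k);
    return st.best;
}
```

Because the comparison reads values written by the previous iteration, the iterations cannot be overlapped more tightly than the latency of that compare-and-update chain.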
Example A28. Linear Assembly for the Index Search for search_10i40 (Inner Loop)
INNERLOOP:
        LDH     *dn9ptr++[5],dn9        ; load dn[i9]
        SHL     dn9,16,dn9h             ; dn[i9] << 16
        SADD    ps1,dn9h,ps2            ; ps2 = sadd(ps1, dn[i9] << 16)
        SMPYH   ps2,ps2,sq2             ; sq2 = smpyh(ps2,ps2)
        LDH     *rrvptr++[5],rrv        ; load rrv[i9]
        SMPY    rrv,_1_8,tmp1           ; smpy(rrv[i9], _1_8)
        SADD    alp1,tmp1,alp2          ; alp2=sadd(alp1,smpy(rrv[i9],_1_8))
        LDH     *rr89ptr++[5],rr89      ; load rr[i8][i9]
        SMPY    rr89,_1_64,tmp2         ; smpy(rr[i8][i9],_1_64)
        SADD    alp2,tmp2,alp2          ; alp2=sadd(alp2,smpy(rr[i8][i9],_1_64))
        SADD    alp2,0x8000L,alp_16     ; alp_16=sadd(alp2,0x8000L)
        SMPYH   alp,sq2,tmp3            ; smpyh(alp,sq2)
        SMPYH   sq,alp_16,tmp4          ; smpyh(sq,alp_16)
        CMPGT   tmp3,tmp4,cndr          ; if(smpyh(alp,sq2) > smpyh(sq,alp_16))
[cndr]  MV      sq2,sq
[cndr]  MV      ps2,ps
[cndr]  MV      alp_16,alp
[cndr]  MV      i8,ia
[cndr]  MV      i9,ib
[icntr] SUB     icntr,1,icntr
[icntr] B       INNERLOOP
Because both sq and alp are carried over and required from one iteration to the next, their values should be kept in registers to allow speedy retrieval. At least four cycles are required to compute new sq and alp values, and the requirement on the functional units does not exceed four execution packets. Therefore, the inner loop can be executed in four cycles per iteration. For the outer loop, any pair rr[ix][i8], rr[iy][i8] (where ix, iy = i0, i1, ..., i7) hits the same memory bank if the two values are read together. Therefore, they should be loaded in one cycle each.
For the inner loop, store the results of ps, ia, and ib, whose values are not used in this code. For the outer loop, store the pointers of arrays starting at rr[i5][i8], rr[i6][i8], and rr[i7][i8], whose values are needed last in the outer loop.
Assume that before entering this code, the following values are known: &dn[0], &ipos[0], &rr[0][0], &rrv[0], i0, i1, i2, i3, i4, i5, i6, i7, ps0, and alp0. Assume that the short (Word16) integers are stored in the stack in the order i0, i1, i2, i3, i4, i5, i6, i7, ia, and ib, and that a pointer, &local_16[0], pointing to i0, is also known. The int integers and the pointers of the rr arrays are stored in the stack in the following order: ps0, ps, alp0, alp1, &rr[i5][i8], &rr[i6][i8], and &rr[i7][i8]. The pointer, &local_32[0], pointing to ps0, is known as well. The C code is shown in Example A29.
/* initialize 3 indices for i9 inner loop (see i2i3 loop) */
for (i9 = ipos[9]; i9 < L_CODE; i9 += STEP)
{
    ps2 = _sadd(ps1, dn[i9]<<16);
    alp2 = _sadd(local_32[3], _smpy(rrv[i9], _1_8));
    alp2 = _sadd(alp2, _smpy(rr[i8][i9], _1_64));
    sq2 = _smpyh(ps2, ps2);
    alp_16 = _sadd(alp2,0x8000L);
The following instructions are used in the inner loop in different memory locations such as the outer loop:
   [B2] STW     .D1     B11,*+A6[1]       ; store ps
||      LDH     .D2     *B10++[5],A5      ; load rr[i5][i8]
In the former case, memory bank hits can be completely eliminated by allocating the corresponding arrays in memory properly. Memory bank hits occur in every other iteration in the latter case, however. Although, in general, you should avoid writing such code, in this case, the performance of the prolog of the outer loop after the first iteration is limited by the .D unit. You still save some cycle counts in this example. To improve the performance, the last two iterations of the inner loop overlap part of the prolog of the outer loop.
; A13   &ipos[0] and alp
; B6    &local_16[0]
; A6    stack pointer, point to &local_32[0]
; B8    &rr[0][0]
; A4    &rrv[0]
; B14   &dn[0]
; B1    reserved for the counter of the outmost loop in search_10i40
|| ||
Example A30. Assembly Code for the search_10i40 Index Search (Continued)
LDH LDH MVK .D2 .D1 .S1 *+B6[2],B9 *+A5[1],A14 0,A8 ; ; ; ; ; ; ; load i2 load i1 could insert two .D units here for the store of rrv[i9+30] and rrv[i9+35] in the code which this piece immediately follows
||
|| || ||
LDH LDH MVK MVK STW LDH SHL MPYU STH STH ADD MPYU LDW LDH ADD ADD ADD MPYU MPYU LDW LDH ADD ADD LDH LDH ADD ADD LDH ADD MPYU LDH ADD MPYU MPYU
.D1 .D2 .S1 .S2 .D1 .D2 .S2X .M1 .D1 .D2 .L2 .M2X .D1 .D2 .S1X .S2 .L2X .M1 .M2 .D1 .D2 .S2 .L2 .D1 .D2 .S2 .L1X .D2 .L2 .M1X .D1 .L1X .M1 .M2
*+A5[4],A15 *+B6[3],B10 80,A0 80,B0 A8,*+A6[1] *+B6[5],B11 A7,1,B10 A7,A0,A12 A7,*+A5[8] B13,*+B6[9] B8,B10,B2 A13,B0,B3 *A6,B15 *+B6[6],A1 A12,B2,A12 B14,B10,B7 B8,A12,B8 A14,A0,A14 B9,B0,B9 *+A6[2],A11 *+B6[7],B5 B13,B13,B12 B3,B2,B3 *A12,A5 *B7++[5],B12 B14,B12,B14 A14,B2,A14 *B3++[5],A5 B9,B2,B9 B10,A0,A9 *A14++[5],A5 A4,B12,A4 A15,A0,A15 B11,B0,B11
; load i4 ; load i3
|| || ||
; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ;
ps=0 load i5 [0][i8] [i8][0] store ia=i8 store ib=i9 &rr[0][i8] [i0][0] load ps0 load i6 &rr[i8][i8] &dn[i8] &rr[i8][0] [i1][0] [i2][0] load alp0 load i7 [0][i9] &rr[i0][i8] load rr[i8][i8] load dn[i8] &dn[i9] &rr[i1][i8]
|| || ||
|| || || || || ||
|| || ||
|| || ||
|| ||
|| || ||
Example A30. Assembly Code for the search_10i40 Index Search (Continued)
|| || ||
LDH MVK ADD MPYU LDH ADD MVK MVK ADD ADD MPYU LDH SHL SHL ADD SMPY LDH MVK SMPY LDH ADD SHL SADD SADD SMPY LDH SADD SMPY
.D2 .S1 L1X .M1 .D1 .D2 .S1 .S .L1X .L2 .M2 .D1 .S1 .S2 .L1X .M1 .D2 .S1 .M1 .D1 .D2 .S1 .L1 .L2 .M1 .D2 .L1 .M1
*B9++[5],A5 256,A0 A9,B2,A9 A1,A0,A1 *A9++[5],B12 B11,B2,B10 7,A2 512,B0 A15,B2,A15 B8,B12,B4 B5,B0,B5 *A15++[5],A5 A0,1,A0 B12,16,B11 A1,B2,A1 A5,A0,A8 *B10++[5],A5 1,A3 A5,A0,A8 *A1++[5],B12 B5,B2,B11 A0,7,A13 A11,A8,A11 B15,B11,B15 A5,A0,A8 *B11++[5],A5 A11,A8,A11 A5,A0,A8
; ; ; ; ; ; ; ; ; ; ;
load rr[i2][i8] A0=_1_128 &rr[i3][i8] [i6][0] load rr[i3][i8] &rr[i5][i8] outer loop counter B0=_1_64 &rr[i4][i8] &rr[i8][i9] [i7][0]
|| || || || || ||
|| || || ||
; load rr[i4][i8] ; _1_64 ; dn[i8] << 16 ; &rr[i6][i8] ; smpy(rr[i8][i8],_1_128) ; load rr[i5][i8] ; sq=1 ; smpy(rr[i0][i8],_1_64) ; ; ; ; ; ; load rr[i6][i8] &rr[i7][i8] alp=0x10000 alp1=sadd(alp0,smpy(rr[i8][i8],_1_128)) ps1 smpy(rr[i1][i8],_1_64)
|| ||
|| || || || ||
|| ||
OUTERLOOP: LDH LDH SADD SUB SMPY LDH SADD SMPY .D1 .D2 .L1 .L2 .M1X .D2 .L1 .M1 *A4++[5],A5 *B4++[5],B12 A11,A8,A11 B13,5,B13 B12,A0,A8 *B14++[5],B12 A11,A8,A11 A5,A0,A8 ; load rrv[i9] ; load rr[i8][i9] ; alp1=sadd(alp1,smpy(rr[i1][i8],_1_64)) ; smpy(rr[i3][i8],_1_64) ; load dn[i9] ; alp1=sadd(alp1,smpy(rr[i2][i8],_1_64)) ; smpy(rr[i4][i8],_1_64)
|| || || ||
|| ||
Example A30. Assembly Code for the search_10i40 Index Search (Continued)
|| ||
STW SADD SMPY STW SHL SADD SMPY LDH LDH SHL SADD SMPY
.D1 .L1 .M1 .D1 .S1 .L1 .M1X .D1 .D2 .S1 .L1 .M1
B10,*+A6[4] A11,A8,A11 A5,A0,A8 A1,*+A6[5] A0,6,A10 A11,A8,A11 B12,A0,A8 *A4++[5],A5 *B4++[5],B12 A0,3,A0 A11,A8,A11 A5,A0,A8
|| || ||
|| || || ||
|| || ||
|| ||
.D .S2 .L1
|| ||
|| || || ||
;** load rrv[i9] ;** load rr[i8][i9] ; branch to the innerloop ; alp2=sadd(alp2,smpy(rr[i8][i9],_1_64)) ; sq2=smpyh(ps2,ps2)
|| || || || ||
;** load dn[i9] ; innerloop counter ; alp_16 = sacc(alp2, 0x8000L) ;* smpy(rrv[i9],_1_8) ;* smpy(rr[i8][i9],_1_64)
Example A30. Assembly Code for the search_10i40 Index Search (Continued)
INNERLOOP: LDW STH SHL ADD SMPYH SMPYH .D1 .D2 .S2 .L2 .M1 .M2X *+A6[3],A11 B13,*+B6[9] B12,16,B10 B13,5,B13 A8,A3,A11 B8,A13,B10 ; load alp1 ; store ib=i9 ;* dn[i9]<<16 ; i9=i9+STEP ; smpyh(alp_16,sq) ; smpyh(alp,sq2)
||[B2] || || || ||
[B2] ||[B2] || || ||
|| ||[A1] ||[A1] || || ||
;*** load rrv[i9+10] ;*** load rr[i8][i9+10] ; decrement innerloop counter ; branch to INNERLOOP ;*alp2=sadd(alp2,smpy(rr[i8][i9],_1_64)) ; if smpyh(alp,sq2) > smpyh(alp_16,sq) ;* sq2=smpyh(ps2,ps2)
[B2] || ||[B2] || || ||
; alp=alp_16 ;*** load dn[i9+10] ; sq=sq2 ;* alp_16=sadd(alp2, 0x8000L) ;*** A0 = _1_8 ;*** B0 = _1_64 ; end of innerloop
[B2] ||[B2] || || || ||
; ; ; ; ; ;
||[B2] || || ||
; ; ; ; ;
Example A30. Assembly Code for the search_10i40 Index Search (Continued)
LDW LDW MVK SADD CMPGT SMPYH .D1 .D2 .S1 .L1X .L2X .M2 *+A6[5],A1 *B2,B15 205,A0 A5,B12,A11 B10,A11,B2 B5,B5,B8 ; &rr[i6][i8] ; load ps0 ; alp2=sadd(alp2,smpy(rr[i8][i9],_1_64)) ; if smpyh(alp,sq2) > smpyh(alp_16,sq) ; sq2=smpyh(ps2,ps2)
|| || || || ||
|| ||[B2] || ||[B2]
; ; ; ; ;
|| || || ||
; ; ; ;
||[B2] || ||
; ; ; ;
|| || ||[A2]
; ; ; ;
||[B2] || ||
; ; ; ;
||[B0] || || ||[B0] ||
; ; ; ; ; ;
Example A30. Assembly Code for the search_10i40 Index Search (Continued)
[B2] || ||[B0] || ||[A2] || STW LDH MV SHL SUB SMPY .D1B11,*+A6[1] .D2 *B10++[5],A5 .S1X B8,A3 .S2 B12,16,B11 .L1 A2,1,A2 .M1 A5,A0,A8 ; ; ; ; ; ; store ps load rr[i5][i8] sq=sq2 dn[i8] << 16 decrement OUTERLOOP counter smpy(rr[i0][i8],_1_64)
||[B0] || || || ||
; ; ; ; ; ;
[B0] || || || || ||
; ; ; ; ; ;
    for (i = 0; i < lg; i++)
    {
        s = L_mult (x[i], a[0]);
        for (j = 1; j <= m; j++)
            s = L_mac (s, a[j], x[i-j]);
        s = L_shl (s, 3);
        y[i] = round (s);
    }
    return;
}

where
    L_mult(a,b)  = _smpy(a,b)
    L_mac(a,b,c) = _sadd(a,_smpy(b,c))
    L_shl(a,b)   = (b>0) ? _sshl(a,b) : a >> (-b)
    round(a)     = _sadd(a,0x8000L)>>16
and lg = 40.
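The filter loop above can be modeled on a host as follows. This is a minimal sketch, not the guide's code: residu_ref and the helper names are hypothetical, the saturating left shift by 3 is modeled as a clamped multiply by 8, and x is assumed to have m valid history samples in front of x[0].

```c
#include <stdint.h>

static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}
static int32_t sat_mpy(int16_t a, int16_t b) { return sat32(2 * (int64_t)a * b); }
static int32_t sat_add(int32_t a, int32_t b) { return sat32((int64_t)a + b); }
static int32_t sat_shl3(int32_t a)           { return sat32((int64_t)a * 8); }
static int16_t round16(int32_t a)
{
    int32_t r = sat_add(a, 0x8000);
    return (int16_t)((r - (r & 0xFFFF)) / 65536);   /* floor(r / 2^16) */
}

/* y[i] = round(L_shl(sum_{j=0..m} a[j]*x[i-j], 3)); x[-m..-1] must be valid. */
static void residu_ref(const int16_t a[], const int16_t *x, int16_t y[],
                       int m, int lg)
{
    for (int i = 0; i < lg; i++) {
        int32_t s = sat_mpy(x[i], a[0]);
        for (int j = 1; j <= m; j++)
            s = sat_add(s, sat_mpy(a[j], x[i - j]));
        y[i] = round16(sat_shl3(s));
    }
}

/* Check: with a[0] = 4096 and all other taps zero, the output equals the input. */
static int residu_identity_check(void)
{
    enum { M = 10, LG = 40 };
    int16_t a[M + 1] = { 4096 };
    int16_t buf[M + LG];
    int16_t y[LG];

    for (int k = 0; k < M + LG; k++)
        buf[k] = (int16_t)(k * 37 - 700);
    residu_ref(a, buf + M, y, M, LG);
    for (int i = 0; i < LG; i++)
        if (y[i] != buf[M + i])
            return 0;
    return 1;
}
```

With a[0] = 4096 the doubled product is x[i] times 8192, and the shift by 3 brings it to x[i] times 65536, so rounding returns x[i] unchanged.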
You can change the order of rounding and left shift to save one register. (Otherwise, you need another register for 0x8000L.) The C code, after complete inner loop unrolling, is shown in Example A32.
To satisfy the ordering property of _sadd To maximize speed: eleven cycles are required to compute two y values, while six cycles are needed for one y
||
||
||
; ; ; ; ; ; ; ; ;
smpy(x[0],a[0]) smpy(x[1],a[0]) load x[6] & x[5] s0 = sadd(s0, smpy(x[9],a[9])) s1 = sadd(s1, smpy(x[8],a[9])) smpy(x[1],a[1]) smpy(x[0],a[1]) s0 = sadd(s0, smpy(x[10],a[10])) s1 = sadd(s1, smpy(x[9],a[10]))
|| || || || ||[!A2] ||[!A2]
|| ||
; smpy(x[3],a[3]) ; smpy(x[2],a[3]) ; s0 = sadd(s0, smpy(x[1],a[1])) ; s1 = sadd(s1, smpy(x[0],a[1])) ; s0 = L_shl(s0,3) ; s1 = L_shl(s1,3) smpy(x[4],a[4]) smpy(x[3],a[4]) s0 = sadd(s0, smpy(x[2],a[2])) s1 = sadd(s1, smpy(x[1],a[2])) load x[10] & x[9] and update the pointer ; y[0] = shr(s0, 16) ; y[1] = shr(s1, 16) ; to the new &x[0] ; ; ; ; ; ; ; smpy(x[5],a[5]) smpy(x[4],a[5]) s0 = sadd(s0, smpy(x[3],a[3])) s1 = sadd(s1, smpy(x[2],a[3])) store y[0] decrement loop counter branch to the loop ; ; ; ; ;
|| SADD ||[!A2] SSHL ||[!A2] SSHL SMPYLH SMPYH SADD SADD LDW
|| || || ||
SMPYHL SMPY SADD SADD STH SUB B SMPYLH SMPYH SADD SADD LDW SMPYHL SMPY SADD SADD LDW
.M1X .M2X .L1 .L2 .D1 .S2 S1 .M1X .M2X .L1 .L2 .D1 .M1X .M2X .L1 .L2 .D1
A1,B5,A8 A3,B5,B8 A8,A9,A9 B8,B9,B A7,*A6++ B2,2,B LOOP A1,B5,A8 A1,B5,B8 A8,A9,A9 B8,B9,B9 *A4,A3 A3,B6,A8 A1,B6,B8 A8,A9,A9 B8,B9,B9 *A4,A1
|| || || ||
; smpy(x[6],a[6]) ; smpy(x[5],a[6]) ; s0 = sadd(s0, smpy(x[4],a[4])) ; s1 = sadd(s1, smpy(x[3],a[4])) ;* load x[0] & x[1] for the next iteration ; smpy(x[7],a[7]) ; smpy(x[6],a[7]) ; s0 = sadd(s0, smpy(x[5],a[5])) ; s1 = sadd(s1, smpy(x[4],a[5])) ;* load x[1] & x[2]
|| || || ||
|| || || ||[!A2]
|| || || ||[A2] ||
|| || ||
; ; ; ;
There is no memory bank hit within the loop. To avoid a memory bank hit within the prolog of the loop, arrays a and x must be allocated so that a[1] and x[0] are offset from each other by one word. Some of the instructions in the loop cannot be executed in the first iteration. Register A2 indicates which instructions these are.
input:
Word16 scal_sig[PIT_MAX+L_FRAME];   (pointed at scal_sig[PIT_MAX] when passed)
Word16 scal_fac;                    (not used in this part of the code)
Word16 L_frame, lag_min, lag_max;

    for (j = 0; j < L_frame; j++, p++, p1++)
    {
        t0 = L_mac (t0, *p, *p1);
    }
    if (L_sub (t0, max) >= 0)
    {
        max = t0;
        p_max = i;
    }
}

where
    L_mac(a,b,c) = _sadd(a,_smpy(b,c))
    L_sub(a,b)   = _ssub(a,b)
    L_frame      = L_FRAME/2 = 80
and the search range (lag_min, lag_max) is (18,35), (36,71), or (72,143).
Example A36. C Code for the Lag Search in lag_max ( ) (Comparison Order Changed)
max = MIN_32;
p_max = lag_min;
for (i = lag_min; i < lag_max; i++)
{
    p = scal_sig;
    p1 = &scal_sig[-i];
    t0 = 0;
    for (j=0; j<L_frame; j++, p++, p1++)
    {
        t0 = L_mac(t0, *p, *p1);
    }
    if (t0 > max)
    {
        max = t0;
        p_max = i;
    }
}
Next, look at the inner loop, a general MAC loop. Because *p does not always equal *p1, it does not fall into the special case described in section A.2.1, Implementation of the Multiply-Accumulate Loop, beginning on page A-4. Therefore, the performance cannot be improved by simply unrolling the inner loop. Now consider unrolling the outer loop once. The C code with outer loop unrolling is shown in Example A37. Because the number of lags that needs to be searched within each search range is always even, such unrolling does not create an additional case to handle.
Example A37. C Code for the Lag Search in lag_max() With Outer Loop Unrolling
Word32 t1;

max = MIN_32;
p_max = lag_min;
for (i = lag_min; i < lag_max; i+=2)
{
    p = scal_sig;
    p1 = &scal_sig[-i];
    t0 = 0;
    t1 = 0;
    for (j=0; j<L_frame; j++, p++, p1++)
    {
        t1 = _sadd(t1, _smpy(*p, *(p1-1)));   /* or t1=_sadd(t1,_smpy(scal_sig[j],scal_sig[-i-1+j])) */
        t0 = _sadd(t0, _smpy(*p, *p1));       /* or t0=_sadd(t0,_smpy(scal_sig[j],scal_sig[-i+j])) */
    }
    if (t0 > max)
    {
        max = t0;
        p_max = i;
    }
    if (t1 > max)
    {
        max = t1;
        p_max = i+1;
    }
}
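The effect of unrolling the outer loop can be checked on a host by comparing the two search orders. This is a minimal sketch under stated assumptions: the lag range is treated as inclusive so that the number of lags is even, sig points at the sample with index 0 and has at least lag_max + 1 valid samples before it, and all function names and the test signal are hypothetical.

```c
#include <stdint.h>

static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}
static int32_t sat_mpy(int16_t a, int16_t b) { return sat32(2 * (int64_t)a * b); }
static int32_t sat_add(int32_t a, int32_t b) { return sat32((int64_t)a + b); }

/* One lag per outer iteration. */
static int lag_rolled(const int16_t *sig, int n, int lag_min, int lag_max)
{
    int32_t max = INT32_MIN;
    int p_max = lag_min;
    for (int i = lag_min; i <= lag_max; i++) {
        int32_t t0 = 0;
        for (int j = 0; j < n; j++)
            t0 = sat_add(t0, sat_mpy(sig[j], sig[j - i]));
        if (t0 > max) { max = t0; p_max = i; }
    }
    return p_max;
}

/* Two lags per outer iteration; sig[j] is loaded once and used twice. */
static int lag_unrolled(const int16_t *sig, int n, int lag_min, int lag_max)
{
    int32_t max = INT32_MIN;
    int p_max = lag_min;
    for (int i = lag_min; i <= lag_max; i += 2) {
        int32_t t0 = 0, t1 = 0;
        for (int j = 0; j < n; j++) {
            int16_t s = sig[j];
            t1 = sat_add(t1, sat_mpy(s, sig[j - i - 1]));   /* lag i+1 */
            t0 = sat_add(t0, sat_mpy(s, sig[j - i]));       /* lag i   */
        }
        if (t0 > max) { max = t0; p_max = i; }       /* smaller lag first */
        if (t1 > max) { max = t1; p_max = i + 1; }
    }
    return p_max;
}

/* Returns 1 if both orders pick the same lag on an invented test signal. */
static int lag_versions_agree(void)
{
    enum { HIST = 40, N = 80 };
    static int16_t buf[HIST + N];
    for (int k = 0; k < HIST + N; k++)
        buf[k] = (int16_t)((k % 25) * 100 - 1200);
    const int16_t *sig = buf + HIST;
    return lag_rolled(sig, N, 18, 35) == lag_unrolled(sig, N, 18, 35);
}
```

Both versions test the lags in ascending order with a strict comparison, so ties resolve the same way and the selected lag is identical.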
The smaller lag is always compared first in the order of the comparisons. The instructions required for one iteration of the inner loop are shown in Example A38.
Example A38. Linear Assembly for the Lag Search in lag_max() Inner Loop
INNERLOOP:
        LDH     .D      *p++,sigj               ; load scal_sig[j]
        LDH     .D      *p1,scalij1             ; load scal_sig[-i-1+j]
        SMPY    .M      sigj,scalij1,tmp1       ; smpy(scal_sig[j],scal_sig[-i-1+j])
        SADD    .L      t1,tmp1,t1              ; t1=sadd(t1,smpy(scal_sig[j],scal_sig[-i-1+j]))
        LDH     .D      *p1++,scalij            ; load scal_sig[-i+j]
        SMPY    .M      sigj,scalij,tmp0        ; smpy(scal_sig[j],scal_sig[-i+j])
        SADD    .L      t0,tmp0,t0              ; t0=sadd(t0,smpy(scal_sig[j],scal_sig[-i+j]))
[icntr] SUB     .S      icntr,1,icntr           ; decrement inner loop counter
[icntr] B       .S      INNERLOOP               ; branch to inner loop
The .D unit is used the most (three times). Therefore, the inner loop takes two cycles. Now unroll the inner loop once. The first iteration of t1 and the last iteration of t0 are performed outside the inner loop. This avoids memory bank hits. The C code with the inner and outer loops unrolled is shown in Example A39.
Example A39. C Code for the Lag Search in lag_max() With Inner and Outer Loops Unrolled
Word32 t1;

max = MIN_32;
p_max = lag_min;
for (i = lag_min; i < lag_max; i+=2)
{
    p = scal_sig;
    p1 = &scal_sig[-i];
    t0 = 0;
    t1 = _smpy(*p, *(p1-1));                  /* or t1=_smpy(scal_sig[0],scal_sig[-i-1]) */
    for (j=0; j<(L_frame-1); j+=2, p+=2, p1+=2)
    {
        t0 = _sadd(t0, _smpy(p[0], p1[0]));   /* or t0=_sadd(t0,_smpy(scal_sig[j],scal_sig[-i+j])) */
        t1 = _sadd(t1, _smpy(p[1], p1[0]));   /* or t1=_sadd(t1,_smpy(scal_sig[j+1],scal_sig[-i+j])) */
        t0 = _sadd(t0, _smpy(p[1], p1[1]));   /* or t0=_sadd(t0,_smpy(scal_sig[j+1],scal_sig[-i+j+1])) */
        t1 = _sadd(t1, _smpy(p[2], p1[1]));   /* or t1=_sadd(t1,_smpy(scal_sig[j+2],scal_sig[-i+j+1])) */
    }
    t0 = _sadd(t0, _smpy(scal_sig[L_frame-1], scal_sig[-i+L_frame-1]));
    if (t0 > max)
    {
        max = t0;
        p_max = i;
    }
    if (t1 > max)
    {
        max = t1;
        p_max = i+1;
    }
}
Although five values of scal_sig (scal_sig[j], scal_sig[j+1], scal_sig[j+2], scal_sig[-i+j], and scal_sig[-i+j+1]) are required for each inner loop iteration, scal_sig[j] does not need to be loaded, because it was loaded in the previous iteration. This means only four loads are required per iteration. Example A40 gives the instructions for the modified inner loop.
Example A40. Linear Assembly for the Lag Search in lag_max() Inner Loop
        LDH     .D      *p++,sigj               ; load scal_sig[j]
        LDH     .D      *p1,scalij1             ; load scal_sig[-i-1+j]
        SMPY    .M      sigj,scalij1,t1         ; t1=smpy(scal_sig[j],scal_sig[-i-1+j])
INNERLOOP:
        LDH     .D      *p1++,scalij            ; load scal_sig[-i+j]
        SMPY    .M      sigj,scalij,tmp0        ; smpy(scal_sig[j],scal_sig[-i+j])
        SADD    .L      t0,tmp0,t0              ; t0=sadd(t0,smpy(scal_sig[j],scal_sig[-i+j]))
        LDH     .D      *p++,sigj+1             ; load scal_sig[j+1]
        SMPY    .M      sigj+1,scalij,tmp1      ; smpy(scal_sig[j+1],scal_sig[-i+j])
        SADD    .L      t1,tmp1,t1              ; t1=sadd(t1,smpy(scal_sig[j+1],scal_sig[-i+j]))
        LDH     .D      *p1++,scalij+1          ; load scal_sig[-i+j+1]
        SMPY    .M      sigj+1,scalij+1,tmp0    ; smpy(scal_sig[j+1],scal_sig[-i+j+1])
        SADD    .L      t0,tmp0,t0              ; t0=sadd(t0,smpy(scal_sig[j+1],scal_sig[-i+j+1]))
        LDH     .D      *p++,sigj+2             ; load scal_sig[j+2], the scal_sig[j] for the next iteration
        SMPY    .M      sigj+2,scalij+1,tmp1    ; smpy(scal_sig[j+2],scal_sig[-i+j+1])
        SADD    .L      t1,tmp1,t1              ; t1=sadd(t1,smpy(scal_sig[j+2],scal_sig[-i+j+1]))
[icntr] SUB     .S      icntr,2,icntr           ; decrement inner loop counter
[icntr] B       .S      INNERLOOP               ; branch to inner loop
The inner loop uses two cycles. You double the performance, therefore, by unrolling both the outer loop and inner loop if no memory bank hits occur.
|| || || || || ||
|| || ||
OUTERLOOP: LDH LDH SADD MV MPY MPY ADD SUB LDH LDH B CMPGT LDH LDH MV MPY .D1 .D2 .L2 .S2 .M1 .M2 .S1 .L1 .D1 .D2 .S .L2 .D1 .D2 .L2 .M1X *A7,A5 *+B7[1],B6 B10,B8,B10 37,B1 A3,0,A3 B8,0,B8 A7,2,A9 A7,4,A7 *A9++,A5 *+B7[2],B5 INNERLOOP B10,B2,B0 *A9++,A5 *+B7[3],B6 B10,B2 B1,1,A2 ; scal_sig[LAG_MIN] ; scal_sig[1] ; inner loop counter
|| ||[A2] ||[A1] || || || ||
; &scal_sig[LAG_MIN+1] ; update p1 = &scal_sig[LAG_MIN2] ; ; ; ; ; ; ; ; scal_sig[LAG_MIN+1] scal_sig[2] branch to the inner loop if(t0>max) scal_sig[LAG_MIN+2] scal_sig[3] max = t0 counter to branch to the outerloop
|| ||[B1] ||[A2]
|| ||[B0] ||
Example A41. Assembly Code for the Lag Search in lag_max() (Continued)
LDH LDH B CMPGT SUB ADD MPY MPY LDH LDH SMPY MV SUB SUB ADD .D1 .D2 .S2 .L2X .L1 .S1 .M1 .M2 .D1 .D2 .M1X .L2X .L1 .S1 .S2 *A9++,A5 *+B7[4],B5 INNERLOOP A0,B2,B0 A6,2,A4 A6,2,A6 A0,0,A0 B10,0,B10 *A9++,A5 *+B7[5],B6 A5,B5,A3 A0,B2 A6,3,A4 A1,2,A1 B7,12,B9 ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; scal_sig[LAG_MIN+3] scal_sig[4] branch to the inner loop if(t1>max) p_max = i update i initialize t1=0 initialize t0=0 scal_sig[LAG_MIN+4] scal_sig[5] _smpy(scal_sig[LAG_MIN1], scal_sig[0]) max = t1 p_max = i+1 update inner loop counter &scal_sig[1]
INNERLOOP: LDH LDH SMPY SMPY SADD SADD B SUB .D1 .D2 .M1X .M2X .L1 .L2 .S1 .S2 .D1 .D2 .M1X .M2X .L1 .L2 .S1 .S2 .D1 .D2 .L1 .L2 .S1 *A9++,A5 *B9++,B5 A5,B6,A3 A5,B5,B8 A0,A3,A0 B10,B8,B10 INNERLOOP B1,1,B1 *A9++,A5 *B9++,B6 A5,B5,A3 A5,B6,B8 A0,A3,A0 B10,B8,B10 A2,1,A2 OUTERLOOP *A7[1],A5 *B7,B5 A0,A3,A0 B10,B8,B10 FINISH ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; scal_sig[LAG_MIN+5] scal_sig[6] _smpy(scal_sig[LAG_MIN], scal_sig[1]) _smpy(scal_sig[LAG_MIN], scal_sig[0]) update t1 update t0 branch to inner loop decrement inner loop counter scal_sig[LAG_MIN+6] scal_sig[7] _smpy(scal_sig[LAG_MIN+1], scal_sig[2]) _smpy(scal_sig[LAG_MIN+1], scal_sig[1]) update t1 update t0 decrement the counter to branch to the outer loop branch to the outer loop scal_sig[LAG_MIN3] scal_sig[0] update t1 update t0 lag search is complete
|| || || || || ||[B1] ||[B1]
LDH || LDH || SMPY || SMPY || SADD || SADD || SUB ||[!A2]B LDH || LDH || SADD || SADD ||[!A1]B FINISH: NOP
All the epilogs and prologs of the outer and inner loops are compressed to minimize the code size. A2 is both the indicator for avoiding comparisons during the initial iteration of the outer loop and the counter for branching to the outer loop during inner loop executions.
Index
[ ] in assembly code 5-3 @ symbol in assembly output 2-14 || (parallel bars) in assembly code 5-2 _ (underscore) in intrinsics 4-9 assembly code (continued) final if-then-else 6-87, 6-88, 6-95 IIR filter 6-81 index search in search_10i40 A-43, A-44 to A-50 live-too-long, with move instructions 6-104 MAC loop for energy computation A-6 residu.c A-54 to A-57 rrv computation A-33 to A-37 weighted vector sum 6-71 functional units in 5-6 instructions in 5-4 labels in 5-2 linear autocorr.c, one iteration of loop A-9 dot product, fixed-point 6-5, 6-16, 6-20, 6-26, 6-35 dot product, floating-point 6-17, 6-21, 6-27, 6-36 FIR filter 6-108, 6-110, 6-119, 6-121 FIR filter, outer loop 6-134 FIR filter, outer loop conditionally executed with inner loop 6-137, 6-139 FIR filter, unrolled 6-133 if-then-else 6-83, 6-86, 6-91, 6-94 IIR filter 6-74, 6-78 index search in search_10i40 A-41 lag search in lag_max( ) A-59 live-too-long 6-98, 6-103 MAC loop A-4 rrv computation in search_10i40 A-28, A-31 special MAC loop A-5 weighted vector sum 6-54, 6-56, 6-58 mnemonics in 5-4 operands in 5-8 optimizing (phase 3 of flow), description 6-2 parallel bars in 5-2 structure of 5-1 to 5-11 writing parallel code 6-4
A
_add2 intrinsic 4-14 tutorial 2-18 aliasing 4-6 allocating resources conflicts 6-61 dot product 6-19 if-then-else 6-86, 6-93 IIR filter 6-78 in writing parallel code 6-6 live-too-long resolution 6-102 weighted vector sum 6-58 AND instruction, mask for 6-70 arrays, controlling alignment 6-116 assembler directives 5-4 assembly code comments in 5-9 conditions in 5-3 directives in 5-4 dot product, fixed-point nonparallel 6-10 parallel 6-11 final autocorr.c, windowing and scaling part A-17 to A-20 dot product, fixed-point 6-22, 6-42, 6-48, 6-51 dot product, floating-point 6-44, 6-49, 6-52 FIR filter 6-116, 6-125, 6-129 to 6-132, 6-143 to 6-146 FIR filter with redundant load elimination 6-112
assembly optimizer for dot product 6-37 tutorial 2-25, 2-28 using to create optimized loops 6-35 autocorr.c, windowing and scaling part A-7
B
big-endian mode and MPY operation 6-17 runtime support (rts6201e.lib) 2-6 biquad filter inner loop kernel assembly from C with intrinsics 2-23 linear assembly 2-29 original assembly code 2-16 linear assembly 2-27 original C code 2-4 with word instructions and intrinsics 2-20 branch target, for software-pipelined dot product 6-37, 6-39 branching to create if-then-else 6-82 breakpoints 4-3
C
C code analyzing performance of 4-2 autocorr.c A-8, A-16 basic vector sum 4-5 copyright for A-3 cor_h A-20 dot product 4-16 fixed-point 6-4, 6-15 floating-point 6-16 FIR filter 4-16, 4-25, 6-106, 6-118 inner loop completely unrolled 4-26 optimized form 4-17 unrolled 6-127, 6-132, 6-135 with redundant load elimination 6-107 if-then-else 6-82, 6-90 IIR filter 6-73 index search in search_10i40 A-38, A-40, A-42 lag search in lag_max( ) A-57, A-58, A-59, A-60 live-too-long 6-97 MAC loop A-4, A-5 rearranging A-2, A-12, A-51 refining (phase 2 of flow), in flow diagram 1-3 residu.c A-51, A-52, A-53
C code (continued) rrv computation in search_10i40 A-27, A-30 saturated add 4-9 trip counters 4-21 vector sum with const keywords 4-7 with const keywords, _nassert 4-22 with const keywords, _nassert, word reads 4-14, 4-15 with const keywords, _nassert, word reads, unrolled 4-24 with three memory operations 4-23 word-aligned 4-23 weighted vector sum 6-54 unrolled version 6-55 writing 4-2 C_OPTIONS environment variable 2-6 C6x mnemonics 5-5 char data type 4-2 child node 6-6 cl6x command 2-5, 4-4 clk register 4-3 clock ( ) function 2-12, 4-2 code development flow diagram 1-3 phase 1: develop C code 1-3, 2-14 to 2-16 phase 2: refine C code 1-3, 2-17 to 2-24 phase 3: write linear assembly 1-3, 2-25 to 2-30 code development steps 3-2 code documentation 5-9 comments in assembly code 5-9 compiler options ms 4-22 o2 4-27 o3 4-22, 4-27 pm 4-22 conditional break 4-27 conditional execution of outer loop with inner loop 6-134 conditional instructions to execute if-then-else 6-83 conditional SUB instruction 6-25 conditions in assembly code 5-3 const keyword 4-5, 4-6 in vector sum 4-14 constant operands 5-8 cor_h, implementing A-20 .cproc directive 2-25 CPU elements 1-2
cycle count for biquad filter 2-29 for functions in demo1.c 2-11 for multiply accumulate 2-11 for vector multiply 2-22 formula for calculating 2-11
D
data types 4-2
demo1.c example code 2-3
demo2.c example code 2-21
demo3.c example code 2-28
dependency graph dot product, fixed-point 6-7 parallel execution 6-11 with LDW 6-18, 6-20, 6-26 dot product, floating-point, with LDW 6-19, 6-21, 6-27 drawing 6-6 steps in 6-7 FIR filter with arrays aligned on same loop cycle 6-117 with no memory hits 6-120 with redundant load elimination 6-109 if-then-else 6-84, 6-92 IIR filter 6-75, 6-77 live-too-long code 6-99, 6-102 showing resource conflict 6-61 resolved 6-64 vector sum 4-6 weighted 6-57, 6-61, 6-64, 6-66 with const keywords 4-7 weighted vector sum 6-64 destination operand 5-8 dot product C code 6-4 fixed-point 6-4 translated to linear assembly, fixed-point 6-5 with intrinsics 4-16 dependency graph of basic 6-7 fixed-point assembly code with LDW before software pipelining 6-22 assembly code with no extraneous loads 6-42 assembly code with no prolog or epilog 6-48
dot product (continued) fixed-point assembly code with smallest code size 6-51 assembly code, fully pipelined 6-38 assembly code, nonparallel 6-10 C code with loop unrolling 6-15 dependency graph of parallel assembly code 6-11 dependency graph with LDW 6-20 fully pipelined 6-37 linear assembly for full code 6-35 linear assembly for inner loop with LDW 6-16 linear assembly for inner loop with LDW and allocated resources 6-20 linear assembly for inner loop with conditional SUB instruction 6-26 nonparallel assembly code 6-10 parallel assembly code 6-11 floating-point assembly code with LDW before software pipelining 6-23 assembly code with no extraneous loads 6-44 assembly code with no prolog or epilog 6-49 assembly code with smallest code size 6-52 assembly code, fully pipelined 6-39 C code with loop unrolling 6-16 linear assembly for inner loop with LDW 6-17 linear assembly for inner loop with LDW and allocated resources 6-21 linear assembly for inner loop with conditional SUB instruction 6-27 fully pipelined 6-39 linear assembly for full code 6-36 word accesses in 4-15 double data type 4-2
E
.endproc directive 2-25 energy computation in MAC loop A-6 to A-8 enhanced full rate (EFR) A-3 epilog 4-20 execute packet 2-11, 2-15, 6-36 execution cycles, reducing number of 6-4 extraneous instructions, removing 6-41 SUB instruction 6-51
F
feedback, from compiler or assembly optimizer 3-3 File menu (debugger) 2-8 FIR filter C code 4-16, 6-106 optimized form 4-17 unrolled 6-132, 6-135 with inner loop unrolled 6-127 with redundant load elimination 6-107 final assembly 6-143 for inner loop 6-116 with redundant load elimination 6-112 with redundant load elimination, no memory hits 6-125 with redundant load elimination, no memory hits, outer loop software-pipelined 6-129 linear assembly for inner loop 6-108 for outer loop 6-134 for unrolled inner loop 6-119 for unrolled inner loop with .mptr directive 6-121 with inner loop unrolled 6-133 with outer loop conditionally executed with inner loop 6-137, 6-139 software pipelining the outer loop 6-127 using word access in 4-16 with inner loop unrolled 6-118 fixed-point, dot product linear assembly for inner loop with LDW 6-16 linear assembly for inner loop with LDW and allocated resources 6-20 float data type 4-2 floating-point, dot product dependency graph with LDW 6-21 linear assembly for inner loop with LDDW 6-17 linear assembly for inner loop with LDDW with allocated resources 6-21 flow diagram autocorr.c A-9, A-12, A-13 code development 1-3 functional units description 5-7 in assembly code 5-7 reassigning for parallel execution 6-10, 6-12 functions clock ( ) 4-2 printf ( ) 4-2
G
-g option 2-5
global constants/symbols defined in EFR A-3
global systems for mobile communications (GSM) A-3
I
if-then-else branching versus conditional instructions 6-82 C code 6-82, 6-90 final assembly 6-87, 6-88, 6-95 linear assembly 6-83, 6-86, 6-91, 6-94 IIR filter, C code 6-73 iir1.asm, inner loop kernel 2-16 iir1.c example code 2-4 in-flight value 7-3 index search in search_10i40 A-38 information elements in tutorial 2-2 inserting moves 6-101 instructions, placement in assembly code 5-4 int data type 4-2 interrupt subroutines 7-8 to 7-10 hand-coded assembly allowing nested interrupts 7-10 nested interrupts 7-9 with hand-coded assembly 7-9 with the C compiler 7-8 interruptible code generation 7-6 to 7-7 loops 7-5 interrupts overview 7-2 single assignment versus multiple assignment 7-3 to 7-4 intrinsics _add2 ( ) 4-14 _mpy ( ) 4-15 _mpyh ( ) 4-15 _mpyhl ( ) 4-14 _mpylh ( ) 4-14 _nassert 4-22 described 2-18, 4-9 in residu.c A-51 to A-53 in saturated add 4-9 summary table 4-10 to 4-12 iteration interval, defined 6-28
K
-k compiler option 2-5, 4-4
kernel
   loop 2-14, 4-7, 4-20
   of iir1.asm code 2-16
   of iir2.asm code 2-23
   of iir3.asm code 2-29
   of mac1.asm code 2-14
   of vec_mpy1.asm code 2-15
   of vec_mpy2.asm code 2-22
L
l linker option 2-6 labels in assembly code 5-2 lag search in lag_max ( ) A-56 linear, optimizing (phase 3 of flow), in flow diagram 1-3 linear assembly 2-25 code autocorr.c, one iteration of loop A-9 dot product, fixed-point 6-5 dot product, fixed-point 6-10, 6-16, 6-20, 6-26, 6-35 dot product, floating-point 6-17, 6-21, 6-27, 6-36 FIR filter 6-108, 6-110, 6-119, 6-121 FIR filter with outer loop conditionally executed with inner loop 6-137, 6-139 FIR filter, outer loop 6-134 FIR filter, unrolled 6-133 if-then-else 6-86, 6-94 index search in search_10i40 A-41 lag search in lag_max( ) A-59 live-too-long 6-103 MAC loop A-4 rrv computation in search_10i40 A-28, A-31 special MAC loop A-5 weighted vector sum 6-58 resource allocation conflicts 6-61 dot product 6-19 if-then-else 6-86, 6-93 IIR filter 6-78 in writing parallel code 6-6 live-too-long resolution 6-102 weighted vector sum 6-58
linker command file 2-6 little-endian mode and MPY operation 6-17 runtime support (rts6201.lib) 2-6 live-too-long code 6-63 C code 6-97 inserting move (MV) instructions 6-101 unrolling the loop 6-101 issues 6-97 and software pipelining 4-27 created by split-join paths 6-100 load doubleword (LDDW) instruction 6-15 word (LDW) instruction 6-15 Load Program File dialog box (debugger) 2-8 load6x 2-12, 2-13 long data type 4-2 loop carry path, described 6-73 control variable, conditionally incremented 4-27 counter, handling odd-numbered 4-15 interruptible 7-5 iterations 4-21 kernel 2-14 unrolling as major programming method A-2 dot product 6-15 for simple loop structure 4-25 for windowing and scaling in autocorr.c A-9 if-then-else code 6-90 in cor_h A-22 in FIR filter 6-118, 6-121, 6-127, 6-132, 6-134 in lag_max A-58 in live-too-long solution 6-101 in vector sum 4-23
M
mac1.asm kernel, inner loop 2-14 mac1.c example code 2-3 memory bank hits avoiding A-2 cor_h A-23 in windowing and scaling in autocorr.c A-15 memory bank scheme, interleaved 6-114 to 6-116 mg compiler option 2-5
minimum iteration interval, determining 6-30 for FIR code 6-110, 6-124, 6-142 for if-then-else code 6-85, 6-93 for IIR code 6-76 for live-too-long code 6-100 for weighted vector sum 6-55, 6-56 mnemonic (instruction) 5-4 modulo iteration interval table dot product, fixed-point after software pipelining 6-31 before software pipelining 6-28 dot product, floating-point after software pipelining 6-32 before software pipelining 6-29 IIR filter, 4-cycle loop 6-79 weighted vector sum 2-cycle loop 6-60, 6-65, 6-68 with SHR instructions 6-62 modulo-scheduling technique, multicycle loops 6-54 move (MV) instruction 6-101 _mpy intrinsic 4-15 tutorial 2-18 _mpyh ( ) intrinsic 4-15 _mpyhl intrinsic 4-14 _mpylh intrinsic 4-14 tutorial 2-18 multicycle instruction, staggered accumulation 6-33 multiple assignment, code example 7-3 multiply accumulate function inner loop kernel of original assembly code 2-14 original C code 2-3 multiply-accumulate loop (MAC), implementation in vocoder application A-4 mw compiler option 3-3
operands placement in assembly code types of 5-8 optimization checklist optional tasks in tutorial 3-1 to 3-5
5-8
6-2
outer loop conditionally executed with inner loop 6-132 OUTLOOP 6-111, 6-124
P
parallel bars, in assembly code parent instruction parent node 6-6 6-6 6-6 5-2
performance analysis index search in search_10i40 A-40 of C code 4-2 of dot product examples 6-14, 6-24, 6-53 of FIR filter code 6-124, 6-131, 6-145 of if-then-else code 6-89, 6-96 residu.c A-52 pipeline in C6x pointer operands 1-2 4-4, 4-5, 4-8, 4-22 2-1 2-2 6-47 5-8 pm compiler option preparation for tutorial primary tasks in tutorial priming the pipeline printf ( ) function
processor mnemonics
N
_nassert intrinsic 4-12, 4-14, 4-22
node 6-6
Profile Marking dialog box 2-9 menu (debugger) 2-8 Run dialog box 2-10 profiling 2-8 to 2-13 program-level optimization 4-5 A-2 programming methods, summary of prolog 4-20, 6-47, 6-49 pseudo-code, for single-cycle accumulator with ADDSP 6-33
O
-o compiler option 2-5, 4-4, 4-20, 4-22, 4-27
-o linker option 2-6
R
redundant load elimination 6-106 loops 4-22 .reg directive 2-25, 6-16, 6-17 register allocation 6-123 operands 5-8 partitioning A-41 residu.c (FIR filter in EFR) A-51 resource conflicts described 6-61 live-too-long issues 6-63, 6-97 table FIR filter code 6-110, 6-124, 6-142 if-then-else code 6-85, 6-93 IIR filter code 6-76 live-too-long code 6-100 routines autocorr.c A-7 cor_h A-20 lag_max ( ) A-56 rrv computation in search_10i40 A-27 rts6201.lib file 2-6 rts6201e.lib file 2-6 RUNB debugger command 4-3
stand-alone simulator (load6x) 2-12, 4-2 SunOS shell initialization 2-7 symbolic names, for data and pointers 6-16, 6-17
T
techniques for priming the loop 6-47 for refining C code 4-9 for removing extra instructions 6-41, 6-51 using intrinsics 4-9 word access for short data 4-14 TMS320C6x pipeline 1-2 translating C code to C6x instructions dot product fixed-point, unrolled 6-16 floating-point, unrolled 6-17 IIR filter 6-74 with reduced loop carry path 6-78 weighted vector sum 6-54 unrolled inner loop 6-56 translating C code to linear assembly, dot product, fixed-point 6-5 trip count 2-25, 4-21 communicating information to the compiler 4-22 determining the minimum 4-21 trip counter converting to a downcounting loop 4-27 defined 4-21 .trip directive 2-25
S
.sa extension 2-25 _sadd intrinsic 4-9, 4-12 scheduling table. See modulo iteration interval table shell program (cl6x) 2-5, 4-4 short arrays 4-14 data type 4-2, 4-14 single assignment, code example 7-4 software pipeline 4-20, 4-24 accumulation, staggered results due to 3-cycle delay 6-34 described 6-25 when not used 4-26 software-pipelined schedule, creating 6-30 source operands 5-8 split-join path 6-97, 6-98, 6-100
V
vec_mpy1.asm kernel, inner loop 2-15
vec_mpy1.c example code 2-4
vector multiply function
   C with word instructions and intrinsics 2-18
   inner loop kernel
      of assembly from C with intrinsics 2-22
      of original assembly code 2-15
   original C code 2-4
   tutorial C code example (vec_mpy1.c) 2-4
vector sum function
   See also weighted vector sum
   C code 4-5
      with const keyword 4-7
      with const keywords and _nassert 4-22
      with const keywords, _nassert, word reads 4-14
      with const keywords, _nassert, word reads, and loop unrolling 4-24
      with const keywords, _nassert, and word reads (generic) 4-15
      with three memory operations 4-23
      word-aligned 4-23
   compiler output (original assembly code) 4-8
   dependency graph 4-6, 4-7
   handling odd-numbered loop counter with 4-15
   handling short-aligned data with 4-15
   rewriting to use word accesses 4-14
VelociTI 1-2
very long instruction word (VLIW) 1-2
vocoder application A-1
   implementing A-3
W
weighted vector sum C code 6-54 unrolled version 6-55 final assembly 6-71 linear assembly 6-69 for inner loop 6-54 with resources allocated 6-58 translating C code to assembly instructions 6-56 windowing and scaling, autocorr.c A-7 word access in dot product 4-15 to 4-16 in FIR filter 4-16 using for short data 4-14 to 4-19
Z
-z compiler option 2-6