AN1999 APPLICATION NOTE
Tuning C code with the STCC compiler
INTRODUCTION
The ST100 processor family has been designed to provide a DSP solution that implements the best tradeoff between the following factors: compiler friendliness, reduced code size, high parallelism, and high core frequency.
The main goal is to allow programmers to write as much of their application as possible in C, and thus to limit the need to write assembly language. This approach has several advantages: higher productivity, which reduces time to market; easier maintenance of sources; and higher reusability across different processor targets.
This document describes the process and techniques that can be used to write efficient C code that compiles into ST100 optimal assembly with the stcc compiler.
The first part of this document details the unique optimizing features implemented in stcc, such as Inlining, Inter Procedural Analysis (IPA), the Vectorizer, and Pragma Directed Optimizations. The benefit of each optimization is explained using C code examples and the generated assembly code. The second part is dedicated to general C coding style rules that let the compiler generate very efficient assembly code. The last part introduces techniques for writing C code that fully exploits the parallelism of the ST100 processors.
AN1999/0604
Rev. 1
TABLE OF CONTENTS
1 DEVELOPMENT FLOW
2 OPTIMIZING THE FEATURES OF THE STCC COMPILER
2.1 INLINING
2.2 INTER PROCEDURAL ANALYSIS (IPA)
2.2.1 Constant Propagation and Argument Removal
2.2.2 Data placement
2.2.3 Pointer Target disambiguation
2.2.4 Global Reference and Pure Functions Detection through IPA
2.3 VECTORIZER
2.4 PRAGMA USAGE
2.5 SOFTWARE PIPELINING
3 CODING RULES
3.1 POST-INCREMENTATION USAGE
3.2 PATTERN RECOGNITION
3.2.1 Min/Max
3.2.2 Abs
3.2.3 Testing Value of a Bit in a Variable
3.2.4 Multiply and Accumulate
3.2.5 Using 16 MSB of a 32-bit word as source operand of an instruction
3.3 CONDITIONAL POINTER MODIFICATION
4 WRITING C TO EXPLOIT PROCESSOR PARALLELISM
4.1 SAFE ASSOCIATIVE TRANSFORMATIONS
4.2 MULTI SAMPLE TECHNIQUES
1 - DEVELOPMENT FLOW
When developing an application for the ST100 processors, the first thing to take care of is to map variables onto the correct types. The most common problem is the use of type long for 32-bit variables. STCC for the ST100 processors defines the long type as 40-bit, so declaring 32-bit variables as long might produce slower and bigger code. See the STCC Compiler User's Guide [1] for the definitions of all types for the ST100 processor.
The ST100 instruction set is well suited to manipulating 32-bit variables. Scalar variables that are local to a function and used as temporaries or as loop indexes should be declared as int. Because these variables are local to the function, they have a better chance of being assigned to a register, in which case they take no more memory space than a short declaration would. Using int rather than short also allows the compiler to perform more optimizations and to avoid a cast when the variable is used as an array index.
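The type guidance above can be sketched in portable C. This is only an illustration: the names acc_t, acc_bits and sum_samples are hypothetical and do not come from the STCC documentation, and the width reported by acc_bits depends on the compiler actually used.

```c
#include <limits.h>

/* Hypothetical alias: int (not long) is used for 32-bit data, because the
   note states that stcc defines long as 40 bits on ST100. */
typedef int acc_t;

/* Storage width of acc_t, in bits, on the compiler in use. */
int acc_bits(void)
{
    return (int)(sizeof(acc_t) * CHAR_BIT);
}

/* Local temporary and loop index are both declared int, as recommended. */
acc_t sum_samples(const short *x, int n)
{
    acc_t s = 0;
    int i;                      /* int index: x[i] needs no cast */

    for (i = 0; i < n; i++) {
        s += x[i];
    }
    return s;
}
```

Checking acc_bits() once on the target is a cheap way to confirm that the chosen type really has the intended width.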
Once the code has been developed using the appropriate types, it should be compiled using the fast O4 options for code optimized for speed, or the small option for code optimized for size. If your application uses many small functions, it is recommended to inline them. Inlining allows the compiler to perform better optimizations on the code of the called function, and for "small" functions the resulting code might be both faster and smaller, because the code for argument preparation and variable saving is removed when a function is inlined. Performing IPA (Inter Procedural Analysis) on the complete application then helps the compiler to better optimize all functions.
The code is then ready for profiling. Profiling determines which functions are critical, so that fine tuning can start. Fine tuning consists of taking the most cycle-consuming functions and analyzing the generated assembly. If the generated assembly is not optimal, the first thing to do is to determine which optimization the compiler could perform to get closer to the optimal assembly, in other words which compiler option or pragma can be applied to speed up the code. If no compiler option helps the compiler to produce better code, then the code has to be modified. Once a function is optimized, the process is repeated on the next most cycle-consuming function, and so on.
The general development flow to optimize an application with the stcc compiler for ST122 is summarized in Figure 1.
Figure 1: General Development Flow
[Flowchart: Mapping C types to ST100 types -> Compilation with fast options using inlining and IPA -> Profiling -> Assembly code analysis -> Is code optimal? If no: Better optimization option selection -> Is code optimal? If no: C level code modification -> Is code optimal? If no: Coding in assembly. Each "Is code optimal?" test also has a "Yes" exit.]
2 - OPTIMIZING THE FEATURES OF THE STCC COMPILER
This chapter describes how to make efficient use of the optimizing features provided by stcc.
2.1 - Inlining
The stcc compiler implements three types of inlining mechanism:
- Local inlining (-Minline). Functions defined as static are inlined automatically if they meet the inlining criteria.
- Global inlining using inline libraries. Global inlining allows cross-file inlining. It requires first building an inline library (-Mextract) from all files that contain good candidates for inlining, then using this library to build the complete application (-Minline).
- Automatic inlining using IPA information (-Mipa=inline). IPA performs automatic inlining according to a heuristic based on function size and function call occurrences. The main difference from global inlining is that the user does not have to specify inlining criteria with -Minline, which offers no way to inline functions based on a call occurrence criterion.
See the STCC Compiler User's Guide [1] for complete details on the -Minline/-Mextract/-Mipa=inline sub-options. To benefit fully from inlining, it is recommended to use either automatic inlining driven by IPA or global inlining based on a library built from all C files of the application.
Advantages of using IPA driven inlining are:
- Good performance results.
- No intervention from the programmer to direct the inlining phase.
- Function call occurrences are taken into account for inlining.
Disadvantages of using IPA driven inlining are:
- Code size increase might be significant.
- No possibility to tune the inlining phase.
Advantages of using global inlining are:
- Good performance results.
- Wide range of possibilities to tune inlining criteria. Users can specify the size and names of functions to inline, and can prevent some functions from being inlined.
Disadvantages of using global inlining are:
- Function call occurrences are not taken into account for inlining.
- An extract library needs to be built before starting compilation.
- The user must choose inlining criteria carefully. The recommended criteria are to specify a size of 7 for functions to be inlined, and to specify by name each other function to inline whose size is less than 15. Note that function size can be determined at the extraction phase using the compilation switch Minfo.
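As an illustration of a local inlining candidate, the sketch below shows a small static helper called from a loop in the same file. The function names clip16 and scale_clip are hypothetical. Whether stcc actually inlines the helper depends on the inlining criteria in effect, which this sketch does not try to model.

```c
/* Small static helper: the kind of function local inlining targets.
   Clamps a 32-bit value to the 16-bit signed range. */
static int clip16(int v)
{
    if (v > 32767) {
        return 32767;
    }
    if (v < -32768) {
        return -32768;
    }
    return v;
}

/* Caller in the same file. Once clip16 is inlined, the argument
   preparation and call/return overhead disappear from the loop body. */
void scale_clip(short *dst, const short *src, int gain, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        dst[i] = (short)clip16(src[i] * gain);
    }
}
```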
2.2 - Inter Procedural Analysis (IPA)
The Inter Procedural Analysis implemented in stcc gives the compiler the ability to perform global optimizations across function calls and across C files. A traditional compiler works on one C file at a time and optimizes each function in the file without taking into account how this function is called. For instance, a C module (file) is frequently written to be reusable; as a consequence, the functions defined in a given module may have parameters that are useless for a given application. In this case, the stcc IPA is able to propagate and eliminate those redundant or useless function parameters. The generated code is more efficient because argument preparation is removed from the caller routine, and in some cases this allows better optimization of the called routine. The command line option -Mipa directs the usage of IPA by the compiler. As described in the STCC Compiler User's Guide [1], IPA is a three-phase process (Collection, Propagation, Inheritance) that requires a set of C files to be compiled and linked twice in order to benefit from IPA directed optimizations.
2.2.1 - Constant Propagation and Argument Removal
Let us consider the following example to understand how the IPA works and what the benefits are. In this example, an executable is built from the compilation of files a.c, b.c and main.c.
/* file a.c */
int a[10];
void funcb( int, int*, int);

void funca(int x)
{
    funcb( x+2, a, 5 );
    funcb( x+1, a, 5 );
}

/* file b.c */
void funcb( int y, int* p, int z )
{
    p[z] = y;
}

/* file main.c */
int main()
{
    funca(10);
}
The main routine calls funca with parameter x=10, and funca performs two calls to funcb with parameters p=a and z=5. If this project is compiled without IPA, the following ST122 assembly code is generated for a.c, b.c and main.c.
funca:
        push    gr-lk
        .set    .rso1 8
        subs2a  sp, sp, +1             // stack reservation
        sdw     @(sp + 0), r0          // argument homing to survive function call
..EN1:
// lineno: 8
        makea   p0, %abs16to31(a)
        make    r1, 5                  // R1 = 5
        morea   p0, %abs0to15(a)       // P0 = a
        ldw     r15, @(sp + 0)
        add     r0, r15, 2             // R0 = x+2
        call    funcb
        makea   p0, %abs16to31(a)
        make    r1, 5                  // R1 = 5
        morea   p0, %abs0to15(a)       // P0 = a
        ldw     r0, @(sp + 0)
        add     r0, r0, 1              // R0 = x+1
        call    funcb
// lineno: 10
        adds2a  sp, sp, +1             // stack free
        poprts  gr-lk
In the previous code, for each call to funcb the first integer parameter is computed in R0, the second integer parameter in R1, and the first pointer parameter in P0, as defined in the ST122 EABI.
funcb:
        .set    .rso1 0
..EN1:
// lineno: 4
        copya   p14, r1                // R1 moved to P14 to compute p[z]
        adds2a  p14, p0, p14
        sdw     @(p14 + 0), r0
// lineno: 5
        rts
main:
        push    gr-lk
        .set    .rso1 8
..EN1:
// lineno: 3
        make    r0, 10                 // R0 = 10
        call    funca
// lineno: 4
        poprts  gr-lk
If the same project is compiled using -Mipa=const, constant values are propagated. Inter Procedural Analysis detects that funca is always called with a constant value of 10 for its parameter x, and that funcb is always called with a constant value of 5 for its parameter z. The code generated for main is unchanged, since there is no argument removal with the -Mipa=const option.
funca:
        push    gr-lk
        .set    .rso1 8
..EN1:
// lineno: 8
        makea   p0, %abs16to31(a)
        make    r1, 5
        morea   p0, %abs0to15(a)
        make    r0, 12                 // R0 = x+2, since x=10, R0 = 12
        call    funcb
        makea   p0, %abs16to31(a)
        make    r1, 5
        morea   p0, %abs0to15(a)
        make    r0, 11                 // R0 = x+1, since x=10, R0 = 11
        call    funcb
// lineno: 10
        poprts  gr-lk
The funca code takes advantage of IPA constant propagation: parameter "homing" is no longer generated, and the first parameter of funcb is directly built as a constant.
funcb:
        .set    .rso1 0
..EN1:
// lineno: 4
        sdw     @(p0 + 20), r0         // p[z] = y, since z = 5, p[5] = y is generated
// lineno: 5
        rts
The funcb code generation has been optimized by IPA: p[z] can be replaced by p[5] because funcb is always called with z equal to 5. Note, however, that although the constant has been propagated from the caller routine to the callee routine, parameter preparation is still generated in the caller routine. To remove this useless code, the option -Mipa=const,arg should be used. In this case, the funcb code remains unchanged, whereas the funca and main code is generated as follows:
main:
        push    gr-lk
        .set    .rso1 8
..EN1:
// lineno: 3
        call    __ipai_funca
// lineno: 4
        poprts  gr-lk
No code is generated in the main routine to prepare the argument in R0 before the call to funca. The routine has been renamed __ipai_funca by IPA because its prototype no longer matches its C definition.
__ipai_funca:
        push    gr-lk
        .set    .rso1 8
..EN1:
// lineno: 8
        makea   p0, %abs16to31(a)
        make    r0, 12
        morea   p0, %abs0to15(a)
        call    __ipai_funcb
        makea   p0, %abs16to31(a)
        make    r0, 11
        morea   p0, %abs0to15(a)
        call    __ipai_funcb
// lineno: 10
        poprts  gr-lk
Similarly, no code is generated in the funca routine to prepare the argument in R1 before the call to funcb, which has been renamed __ipai_funcb. __ipai_funcb is always called with its parameter p equal to the base address of array a. This redundant construction of the parameter can be removed using the option -Mipa=const,ptr,arg. The generated code for funca and funcb becomes:
__ipai_funca:
        push    gr-lk
        .set    .rso1 8
..EN1:
// lineno: 8
        make    r0, 12                 // R0 = 12
        call    __ipaip_funcb
        make    r0, 11                 // R0 = 11
        call    __ipaip_funcb
// lineno: 10
        poprts  gr-lk
Only code for parameter preparation of y is generated before call to funcb.
__ipaip_funcb:
        .set    .rso1 0
..EN1:
// lineno: 4
        makea   p14, %abs16to31(a+20)
        morea   p14, %abs0to15(a+20)
        sdw     @(p14 + 0), r0
// lineno: 5
        rts
The funcb routine now includes the code that builds the address of array a. This code might not be faster, since the address of array a is now computed inside funcb on each call instead of being computed in funca before each call, but the global code size is smaller with -Mipa=const,arg,ptr than with -Mipa=const,arg.
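At source level, the effect of constant propagation, argument removal and pointer propagation can be pictured as follows. This is only a conceptual sketch: the names funcb_specialized and funca_specialized are hypothetical (the compiler generates its own __ipai_/__ipaip_ names), and IPA works on the compiler's internal representation, not by rewriting C source.

```c
int a[10];

/* Original source-level form, as in a.c and b.c. */
void funcb(int y, int *p, int z)
{
    p[z] = y;
}

void funca(int x)
{
    funcb(x + 2, a, 5);
    funcb(x + 1, a, 5);
}

/* Hypothetical picture of the specialized callee:
   z = 5 and p = a have been folded in, only y remains a parameter. */
void funcb_specialized(int y)
{
    a[5] = y;
}

/* Hypothetical picture of the specialized caller:
   x = 10 has been folded in, so x+2 and x+1 become 12 and 11. */
void funca_specialized(void)
{
    funcb_specialized(12);
    funcb_specialized(11);
}
```

Both versions leave the same value in a[5]; the specialized one simply has nothing left to prepare except y.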
2.2.2 - Data placement
The ST12x processors are able to fetch up to two 32-bit words from data memory per cycle. In practice, maximizing data memory bandwidth usage depends on the type of data memory sub-system implemented in hardware. By default, stcc assumes that two memory accesses will be executed in the same cycle. In current ST122 megacell hardware implementations, memory access conflicts are resolved through X/Y banking of memory. In other words, a data access to memory bank X and a data access to memory bank Y never conflict and are thus performed in the same cycle. To take advantage of such a memory sub-system, IPA performs automatic data placement in memory banks X and Y. Since the stcc IPA does not process the information contained in the link file, the following assumptions are made for efficient data placement over X and Y:
- .xspace and .yspace sections are assumed to be defined in the link file.
- .xspace and .yspace sections are assumed to be mapped in X and Y memory respectively.
- No assumption is made about the mapping of the .stack section in memory.
Using these assumptions, when IPA detects that two memory accesses should be performed in parallel, it tries to place one array in X memory and the other in Y memory. If one of them is already placed, it tries to place the other in the other memory bank. If both of them are already placed in the same memory bank, IPA marks them as a conflict and determines whether assigning one of them to the other bank is possible. In some cases memory conflicts cannot be avoided by IPA; an algorithm based on static occurrence determination is then used to find the data placement that minimizes the impact of conflicting memory accesses. Not only global data are placed: local data located in the stack are also placed in X/Y memory when a conflict occurs. This means that these local data become global, which prevents the routine using them from being reentrant.
To illustrate how the IPA data placement works, let us consider following example:
/* file main.c */
#include <stdio.h>

int dprod(short a[], short b[], short N);   /* defined in dprod.c */

short ArrayA[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} ;
short ArrayC[10] ;

void main()
{
    short ArrayB[10] ;
    int i, s1, s2;

    for (i=0; i<10; i++) {
        ArrayB[i] = i*127&5 ;
        ArrayC[i] = i*23&5 ;
    }
    s1 = dprod(ArrayA, ArrayB, 10) ;
    s2 = dprod(ArrayC, ArrayB, 10) ;
    printf ("s1 = %d, s2 = %d\n", s1, s2);
}
In generated executable, ArrayA will be located in .data section because it is defined as an initialized global array; ArrayB will be located in .stack section because it is declared as a local array to main routine; ArrayC will be located in .bss section because it is declared as a non-initialized global array.
/* file dprod.c */ int dprod(short a[], short b[], short N) { int i, s ; s=0; for (i=0 ; i
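The dprod listing is cut off in this copy of the note. As a reading aid only, the sketch below shows a plain 16-bit dot product that is consistent with the visible signature and with the multiply-accumulate loop discussed next. The loop body is an assumption, not the original source, and the function is named dprod_sketch to make that clear.

```c
/* Hypothetical completion of the truncated listing: a 16-bit dot product.
   Same parameter list as dprod; the body is assumed. */
int dprod_sketch(short a[], short b[], short N)
{
    int i, s;

    s = 0;
    for (i = 0; i < N; i++) {
        s += a[i] * b[i];       /* 16x16 multiply, 32-bit accumulate */
    }
    return s;
}
```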
The following code is generated for dprod loop with stcc using fast Mvect=packed options:
.LB142:                                //.B0006
// * * * SLIW Loop Code. Cycles: 1 Lineno:11
.LB162:
        ldp     r15, @(p14 ?+ 4)       // loop starts here
        ldp     r14, @(p2 ?+ 4)
        massll  r1, r1, r14, r15
        masshh  r12, r12, r14, r15     // loop ends here
        nop
        nop
        nop
        gp32md
// lineno: 7
        add     r1, r12, r1
Using the ability of the stcc vectorizer to split associative computations, this loop has been generated as one SLIW bundle that executes in one cycle if the arrays pointed to by P14 and P2 are in different memory banks. When the -Mipa=bank option is used to build this executable, ArrayA is located in section .xspace, ArrayB in section .yspace, and ArrayC in section .xspace. Conflicts on memory accesses have been optimally resolved, allowing the dprod loop to execute in one cycle.
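The split summation visible in the listing (massll accumulating into r1, masshh into r12, followed by a final add) corresponds roughly to the source-level form below. This is a hand-written sketch with a hypothetical function name, dprod_split; the vectorizer performs the transformation itself and the programmer does not need to write it this way.

```c
/* Dot product with the sum split into two independent accumulators,
   one for even-indexed and one for odd-indexed products. */
int dprod_split(const short *a, const short *b, int n)
{
    int i;
    int s_lo = 0;               /* plays the role of the massll accumulator */
    int s_hi = 0;               /* plays the role of the masshh accumulator */

    for (i = 0; i + 1 < n; i += 2) {
        s_lo += a[i] * b[i];
        s_hi += a[i + 1] * b[i + 1];
    }
    if (i < n) {                /* leftover element when n is odd */
        s_lo += a[i] * b[i];
    }
    return s_lo + s_hi;         /* final add, as after the loop in the listing */
}
```

Splitting is legal here because integer addition is associative, so the order in which products are summed does not change the result.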
2.2.3 - Pointer Target disambiguation
To allow efficient scheduling of instructions, the stcc IPA implements a powerful pointer disambiguation mechanism. The -Mipa=ptr option enables pointer disambiguation across function calls. It is used to detect that a pointer does not alias another pointer, in other words that two pointers do not refer to memory locations that may overlap.
For instance, vector addition routine might be improved significantly using pointer disambiguation:
void vadd (short *dst, short *src1, short *src2)
{
    int i;
    for (i=0; i<10; i++) {
        dst[i] = src1[i] + src2[i] ;
    }
}

short a[10], b[10], c[10];

void foo ()
{
    int i;
    for (i=0; i<10; i++) {
        a[i] = i ;
        b[i] = 2*i ;
    }
    vadd (c, a, b) ;
}
When compiled using fast Munroll=n:2, inner loop for vadd routine is generated as follows:
        .align  16
.LB97:
        ldh     r15, @(p5 - 2)         // loop iteration n
        ldh     r14, @(p14 - 2)
        addu    r13, r14, r15
        sdh     @(p7 - 2), r13
        ldh     r15, @(p5 !+ 4)        // loop iteration n+1
        ldh     r14, @(p14 !+ 4)
        addu    r13, r14, r15
        sdh     @(p7 !+ 4), r13
        .align  8
.LB190:
The instructions of loop iteration n+1 are not scheduled together with the instructions of loop iteration n, because the compiler assumes that the memory pointed to by dst may overlap with the memory locations pointed to by src1 and src2. This would be the case if, for instance, the vadd routine were called with:
short x[12] ; ... vadd (x+2, x+1, x) ;
Then there is a recurrence relation between loop iterations:
i=0 ; x[2] = x[1] + x[0] ;
i=1 ; x[3] = x[2] + x[1] ;
i=2 ; x[4] = x[3] + x[2] ;
...
Memory loads for an iteration cannot be scheduled before the memory stores of the previous iteration. In our example, IPA pointer disambiguation detects that dst, src1 and src2 point to distinct memory locations. Therefore, adding -Mipa=ptr on the command line produces the following assembly code for the vadd loop:
        .align  16
.LB97:
        ldh     r15, @(p5 - 2)
        ldh     r14, @(p14 - 2)
        ldh     r13, @(p5 !+ 4)
        ldh     r12, @(p14 !+ 4)
        addu    r2, r14, r15
        addu    r3, r12, r13
        sdh     @(p7 - 2), r2
        sdh     @(p7 !+ 4), r3
        .align  8
.LB190:
Although this code is not faster than the code generated without pointer disambiguation, it is now a good candidate for packing operations. Using -Mipa=ptr in combination with -Mvect=packed generates the following code for the inner loop:
        .align  16
.LB156:
        ldp     r15, @(p5 !+ 4)
        ldp     r14, @(p14 !+ 4)
        addup   r13, r15, r14
        sdp     @(p7 !+ 4), r13
        .align  8
.LB178:
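The reason the compiler must be conservative without pointer disambiguation can be checked with a small host-side experiment based on the overlapping call vadd (x+2, x+1, x) discussed above. The helper overlap_demo is hypothetical. With overlapping arguments each store feeds a later load, so the result is a Fibonacci-like sequence, which would not be preserved if loads were moved ahead of earlier stores.

```c
/* Same routine as in the text. */
void vadd(short *dst, short *src1, short *src2)
{
    int i;

    for (i = 0; i < 10; i++) {
        dst[i] = src1[i] + src2[i];
    }
}

/* Hypothetical helper: runs the overlapping call on a 12-element array.
   Each iteration computes x[i+2] = x[i+1] + x[i]. */
void overlap_demo(short x[12])
{
    int i;

    x[0] = 1;
    x[1] = 1;
    for (i = 2; i < 12; i++) {
        x[i] = 0;
    }
    vadd(x + 2, x + 1, x);
}
```

After the call, x holds 1, 1, 2, 3, 5, 8 and so on: every element depends on the two stores before it, which is exactly the recurrence that prevents iterations from being interleaved.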
2.2.4 - Global Reference and Pure Functions Detection through IPA
Additional information can be detected by the stcc IPA and propagated to the compiler optimization phases. For instance, without IPA, a call to a function is assumed to modify any global variable, so any use of a global variable after a function call requires code to reload this variable from memory. -Mipa=pure allows the detection of functions that are pure in the mathematical sense, i.e. whose result depends only on the function arguments.
2.3 - Vectorizer
The stcc compiler can automatically exploit ST122 parallelism using the -Mvect option. It is recommended to use this option with the packed and fuse sub-options (-Mvect=packed,fuse), because this enables automatic SIMD usage, automatic split summation, and loop fusion for candidate loops. See the previous vector addition example for automatic SIMD usage. Loop fusion consists of merging two loops with the same number of iterations into a single loop, thus allowing instructions of the two loops to be executed in parallel.
Advantages of -Mvect=packed,fuse are:
- Automatic packing of SIMD operations.
- Automatic packing of load/store instructions.
- Automatic split summation for associative operations.
- Automatic fusion of loops with the same number of iterations.
Disadvantages of -Mvect=packed,fuse are:
- Some useless code might be generated even if a loop is not "vectorized".
- Last value computation for a loop that is NOT countable (i.e. whose number of iterations is not a compile time known constant) leads to extra code generation that can degrade performance when the number of iterations is small.
Taking these pros and cons into account, it is recommended to apply -Mvect=packed,fuse to a limited set of files that include only loops that are good candidates for "vectorization". Good candidates are loops with:
- A constant number of iterations, i.e. known at compile time.
- No dependency between loop iterations.
- Arithmetic computations with 16-bit operands.
Note that sometimes the programmer expects a loop to be "vectorized" by the compiler and no transformation happens. There are many possible reasons for the "non-vectorization" of a loop; the Mneginfo option helps to understand why the compiler is not performing the expected optimization.
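Loop fusion can be pictured at source level as follows. The sketch is illustrative only: the names FUSE_N, two_loops and fused_loop are hypothetical, and the transformation is only valid when, as here, the two loops have the same iteration count and neither loop reads what the other writes.

```c
#define FUSE_N 8    /* hypothetical compile-time iteration count */

/* Two separate loops with the same number of iterations. */
void two_loops(short *d1, short *d2, const short *s)
{
    int i;

    for (i = 0; i < FUSE_N; i++) {
        d1[i] = (short)(s[i] + 1);
    }
    for (i = 0; i < FUSE_N; i++) {
        d2[i] = (short)(s[i] * 2);
    }
}

/* Fused form: both bodies share one loop, so their instructions
   can be scheduled side by side. */
void fused_loop(short *d1, short *d2, const short *s)
{
    int i;

    for (i = 0; i < FUSE_N; i++) {
        d1[i] = (short)(s[i] + 1);
        d2[i] = (short)(s[i] * 2);
    }
}
```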
2.4 - Pragma usage
Pragma directives should be used to apply specific optimizations to a function, a file, or a loop without impacting the complete application. For instance, it can be useful to restrict unrolling to some loops rather than unrolling every loop of the complete application. Some unsafe command line options can also be applied through pragma directives at the places where the user knows it is safe to use them. Pragma directives help to optimize the following function:
#include
void foo (short exc_subfr[], short code[], short gain_pit, short gain_code)
{
    short i;
    short *p_src, *p_dst ;
    int L_temp ;

    p_dst = p_src = exc_subfr ;
    for (i = 0; i < 40; i++)
    {
        L_temp = (p_src[i]*gain_pit)<<1;
        L_temp = L_temp + ((code[i]*gain_code)<<1);
        L_temp = L_temp << 3 ;
        p_dst[i] = __roundwh (L_temp);
    }
}
When this routine is compiled using stcc fast, the inner loop is generated in GP32 mode (SLIW is not used because of the low instruction density per bundle). The inner loop code executes in 7 cycles per iteration:
        .align  16
// lineno: 11
.LB538:                                //.B0004
        ldh     r15, @(p14 !+ 2)
        ldh     r14, @(p2 !+ 2)
        mpfll   r15, r1, r15
        mafll   r15, r15, r0, r14
        shlu    r15, r15, 3
        rnd2c   r13, r15
        sdhh    @(p4 !+ 2), r13
        .align  8
.LB590:
Adding following pragma directives before function declaration instructs the compiler to apply an unrolling factor of 4 to each loop of the routine and to assume that local and parameter pointer's targets do not refer to overlapping memory regions.
#pragma routine unroll=n:4
#pragma routine safeptr=local,arg
The code generated using stcc fast is then a loop of 10 SLIW bundles that executes in 10 cycles for 4 iterations. The cycle count per iteration is thus reduced from 7 to 2.5 using these pragma directives.
Note that, despite its name, the safeptr pragma directive is UNSAFE. It tells the compiler that the memory regions pointed to by code, p_src and p_dst do not overlap, so using this pragma implies that the programmer knows this assertion is always true. It is therefore strongly recommended to use the safeptr and nozerotrip pragma directives only where the programmer knows they are safe, rather than compiling a complete application with the option Msafeptr or Mnozerotrip.
2.5 - Software Pipelining
Software pipelining is a scheduling technique that reduces the cycle count of loops with dependencies between iterations. Pipelining a loop consists of hoisting part of the loop body out of the loop so that the scheduling of the loop body is optimal. The stcc software pipeliner is enabled when the O4 switch is used on the command line.
The following loop, extracted from the Norm_Corr routine in ETSI EFR, has an intra-loop Read After Write (RAW) dependency on variable s. The L_mult and L_shl ETSI operators are mapped onto the ST100 MPFCxx and SHLCW instructions, which are both dual-cycle instructions. (MPFCxx is a single-cycle instruction when its result is used as the accumulator of another multiply-accumulate instruction.) Therefore the following loop takes 5 cycles to execute:
for (j = L_subfr - 1; j > 0; j--)
{
    s = L_mult (exc[k], h[j]);     /* L_mult is a 2 cycles operator */
    s = L_shl (s, h_fac);          /* L_shl is a 2 cycles operator */
    s_excf[j] = add (extract_h (s), s_excf[j - 1]);
}
s_excf[0] = shr (exc[k], scaling);
Using O4 on this loop automatically generates code as if the C source had been written as follows:
s1 = L_mult (exc[k], h[L_subfr - 1]);           /* loop prolog */
s = L_shl (s1, h_fac);

for (j = L_subfr-1; j > 1; j--)                 /* loop iteration minus one */
{
    s1 = L_mult (exc[k], h[j-1]);
    s_excf[j] = add (extract_h (s), s_excf[j - 1]);
    s = L_shl (s1, h_fac);
}
s_excf[1] = add (extract_h (s), s_excf[0]);     /* loop epilog */
s_excf[0] = shr (exc[k], scaling);
Operator latency has been hidden, and the loop body then takes only 3 cycles to execute.
.LB102:                                //.B0000
        ldh     r15, @(p14 !- 2)
        ldfh    r4, @(p5 - 2)
        mpfcll  r13, r12, r15
        nop
        nop
        addcp   r5, r6, r4
        nop
        sdhh    @(p5 !- 2), r5
.LB188:
        nop
        shlcw   r6, r13, r1
        nop
        nop
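That the pipelined form computes the same values as the original loop can be checked on a host with the sketch below. The operators are simplified stand-ins without saturation (L_mult_s, L_shl_s, extract_h_s, add_s are hypothetical names, not the ETSI basic operators), the function names are invented for the test, and the final s_excf[0] = shr (...) statement is left out because it is identical in both forms. The stand-ins are only meant for small non-negative test values, and L_subfr must be at least 2.

```c
/* Simplified, non-saturating stand-ins for the ETSI operators. */
static int   L_mult_s(short a, short b) { return 2 * (int)a * (int)b; }
static int   L_shl_s(int x, short n)    { return x * (1 << n); }
static short extract_h_s(int x)         { return (short)(x >> 16); }
static short add_s(short a, short b)    { return (short)(a + b); }

/* Original loop: multiply, shift and add all in the same iteration. */
void excf_plain(short *s_excf, const short *h, short exc_k,
                short h_fac, int L_subfr)
{
    int j, s;

    for (j = L_subfr - 1; j > 0; j--) {
        s = L_mult_s(exc_k, h[j]);
        s = L_shl_s(s, h_fac);
        s_excf[j] = add_s(extract_h_s(s), s_excf[j - 1]);
    }
}

/* Pipelined form: the multiply for the next iteration is started
   while the add for the current one completes. */
void excf_pipelined(short *s_excf, const short *h, short exc_k,
                    short h_fac, int L_subfr)
{
    int j, s, s1;

    s1 = L_mult_s(exc_k, h[L_subfr - 1]);               /* prolog */
    s = L_shl_s(s1, h_fac);
    for (j = L_subfr - 1; j > 1; j--) {
        s1 = L_mult_s(exc_k, h[j - 1]);
        s_excf[j] = add_s(extract_h_s(s), s_excf[j - 1]);
        s = L_shl_s(s1, h_fac);
    }
    s_excf[1] = add_s(extract_h_s(s), s_excf[0]);       /* epilog */
}
```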
Generated loop can be further improved by using Munroll=n:2 in combination with O4, loop body becomes:
        ldh     r15, @(p14 + 2)
        ldfh    r8, @(p5 + 0)
        mpfcll  r6, r12, r15
        shlcw   r10, r4, r1
        ldh     r13, @(p14 !- 4)
        mpfcll  r4, r12, r13
        addcp   r9, r5, r8
        sdhh    @(p5 + 2), r9
This loop executes in 3 cycles for two iterations. It means that using O4 Munroll=n:2 has reduced cycle count for this loop from 5*N cycles to 3*N/2 cycles.
3 - CODING RULES
However good the compiler is, there is always code that it cannot optimize efficiently. In many cases, the way the code is written helps the compiler to generate optimal assembly.
3.1 - Post-Incrementation Usage
By default, writing code that accesses arrays through post-incremented pointers is NOT recommended. stcc implements a very sophisticated induction variable analysis, so it is recommended to use the loop counter as the array index. For instance, the same code is generated for the two following loops:
void vmult0 (short *a, short*b, short *c)
{
    int i;
    for (i=0; i<10; i++) {
        c[i] = a[i] * b[i] ;
    }
}

void vmult1 (short *a, short*b, short *c)
{
    int i;
    for (i=0; i<10; i++) {
        *c = (*a) * (*b) ;
        c++ ;
        a++ ;
        b++ ;
    }
}
Note that when pointer post-incrementation is used it is better to write: a = *p ; p++; rather than a=*p++ ;
In some cases, using post-incrementation helps to reduce pointer register pressure, in particular when the last iterations of a loop are hoisted out of the loop. Let us consider the following loop:
void filter0 (short *a, short*b, short *c, int N0, int N1) { int i, j, s;
for (i=0; i>16 ; } }
Performing last iteration of inner loop outside the loop can be written:
void filter1 (short *a, short*b, short *c, int N0, int N1) { int i, j, s;
for (i=0; i>16 ; } }
// (1) last iteration
Last iteration (1) would require the usage of two extra pointer registers to hold &a[N-1] and &b[N-1+i] values. Written as follows, generated code does not use any extra pointer register:
void filter2 (short *a, short*b, short *c, int N0, int N1) { int i, j, s; short *p_a, *p_b ; for (i=0; i>16 ; b++ ; } }
In previous examples, hoisting last iteration out of the loop does not help to get better performance, but when N1 is a constant, reinitialization of base pointer can be performed using post-incrementation as follows and thus improves generated code:
#define N1 10 void filter3 (short *a, short*b, short *c, int N0) { int i, j, s; for (i=0; i>16 ; } }
In this case only two pointer registers will be used to access arrays a and b.
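The filter listings above are damaged in this copy, so the sketch below is a reconstruction under assumptions rather than the original code: it assumes a correlation of the form s += a[j] * b[i + j] followed by c[i] = s >> 16, which is consistent with the &a[N-1] and &b[N-1+i] addresses mentioned earlier. The names TAPS, filter_index and filter_ptr are hypothetical. It shows the idea of walking two pointers with post-increment and resetting them by a constant at the end of each outer iteration.

```c
#define TAPS 4      /* hypothetical stand-in for the constant N1 */

/* Index form (assumed shape of the loop nest). */
void filter_index(const short *a, const short *b, short *c, int n0)
{
    int i, j, s;

    for (i = 0; i < n0; i++) {
        s = 0;
        for (j = 0; j < TAPS; j++) {
            s += a[j] * b[i + j];
        }
        c[i] = (short)(s >> 16);
    }
}

/* Pointer form: only two pointers walk the input arrays. Because TAPS is
   a constant, the base pointers are restored with constant decrements. */
void filter_ptr(const short *a, const short *b, short *c, int n0)
{
    int i, j, s;

    for (i = 0; i < n0; i++) {
        s = 0;
        for (j = 0; j < TAPS; j++) {
            s += (*a) * (*b);
            a++;
            b++;
        }
        c[i] = (short)(s >> 16);
        a -= TAPS;          /* back to a[0] */
        b -= TAPS - 1;      /* back to b[i + 1] for the next outer iteration */
    }
}
```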
3.2 - Pattern Recognition
stcc is able to recognize some patterns in order to minimize the generated code sequence. Here are some examples of how to write C code that helps the compiler to generate optimal code:
3.2.1 - Min/Max
To assign to variable c the maximum of variables a and b, the code should be written:
int a, b, c;
...
if (a > b) c = a;
else c = b;
In this example the generated code will be:
maxw Rc, Ra, Rb
Similarly, code written as
if (a>b) b=a;
will be generated as
maxw Rb, Rb, Ra
3.2.2 - Abs
To assign to variable b the absolute value of variable a, the code should be written:
int a, b ; if (a<0) b = -a; else b = a;
In this example code generated will be:
absw Rb, Ra
Similarly, code written as
if (a<0) a = -a ;
will be generated as
absw Ra, Ra
3.2.3 - Testing Value of a Bit in a Variable
To test bit n of a variable a and perform a conditional action, the code should be written:
if ((a>>n)&1)                  /* (1) */
{
    /* action to perform (2) */
    ...
}
Statement (1) will be generated in a single instruction
ttbw Gx, Ra, Rn
Predicate Gx will be used for conditional execution of block of instruction (2).
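For reference, the three source forms described in sections 3.2.1 to 3.2.3 are gathered below as small portable functions. The function names are hypothetical, and the sketch only shows the C shapes; which instruction is actually selected depends on the compiler.

```c
/* Maximum written in the if/else form described in 3.2.1. */
int max_pattern(int a, int b)
{
    int c;

    if (a > b) c = a;
    else c = b;
    return c;
}

/* Absolute value written in the if/else form described in 3.2.2. */
int abs_pattern(int a)
{
    int b;

    if (a < 0) b = -a;
    else b = a;
    return b;
}

/* Bit test written in the shift-and-mask form described in 3.2.3.
   Returns 1 when bit n of a is set, 0 otherwise. */
int bit_is_set(int a, int n)
{
    if ((a >> n) & 1) {
        return 1;
    }
    return 0;
}
```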
3.2.4 - Multiply and Accumulate For code like:
int s ; short a, b; ... s=s+a*b;
Where s is declared as int, a and b are declared as short, stcc compiler generates an integer multiply and accumulate instruction.
massll Rs, Rs, Ra, Rb
If code is written:
int s ; short a, b; ... s = s +(a*b<<1) ;
Then fractional multiply accumulate is generated:
mafll Rs, Rs, Ra, Rb
Note that fractional saturating arithmetic cannot be expressed in ANSI C. To make saturating arithmetic available, a wide range of intrinsic operators has been defined (see the Intrinsic Function User's Guide for the complete list of intrinsic operators supported by stcc). Code using a saturating fractional multiply accumulate should be written:
s = __mafcw (s, a, b) ;
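The difference between the integer and the fractional multiply accumulate forms can be seen numerically with the host-side sketch below. The function names mac_int and mac_frac are hypothetical, and neither function saturates, so they are only meaningful for operands whose result fits in 32 bits (the test values are non-negative). In Q15 fractional notation 16384 represents 0.5, and the fractional product 0.5 x 0.5 appears as 0.25 in Q31.

```c
/* Integer multiply accumulate: s + a*b. */
int mac_int(int s, short a, short b)
{
    return s + a * b;
}

/* Fractional (Q15 x Q15 -> Q31) multiply accumulate: s + ((a*b) << 1).
   No saturation: this is the plain C pattern, not the __mafcw intrinsic. */
int mac_frac(int s, short a, short b)
{
    return s + ((a * b) << 1);
}
```

For example, mac_frac(0, 16384, 16384) gives 536870912, which is 0.25 scaled by 2^31.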
3.2.5 - Using 16 MSB of a 32-bit word as source operand of an instruction
The stcc compiler performs some pattern optimizations to generate a single instruction when only the 16 MSB of a 32-bit variable are used as a source operand of an instruction. For instance, a 32-bit integer multiplication algorithm using 16-bit multiplications is given by:
    int mult32(int a, int b)
    {
        short ah, bh;
        unsigned short al, bl ;
        unsigned int low, mid;
        int high ;
        ah = (short) (a >> 16);
        al = (unsigned short) a ;
        bh = (short) (b >> 16);
        bl = (unsigned short) b ;
        low = bl*al ;
        mid = ah*bl + bh*al ;
        low = low + (mid<<16) ;
        return low ;
    }
When compiled with stcc -fast, the following code is generated:
    mult32:
        mpuslh  r15, r1, r0
        mauslh  r15, r15, r0, r1
        shlu    r15, r15, 16
        mauull  r0, r15, r0, r1
        rts
The compiler has optimized the code sequences where the source operand of a 16-bit multiplication is a 32-bit variable right-shifted by 16 into a single 16-bit multiplication by the high part of the variable. Memory accesses involving the 16 MSB of a 32-bit variable are also optimized by the compiler. For instance, a 16-bit load followed by a left shift by 16 is generated as a single ldfh instruction; similarly, a 32-bit variable right-shifted by 16 and stored into a 16-bit memory location is generated as a single sdhh instruction, as shown in the following example.
    int var32;
    short *p_short;
    ...
    var32 = (*p_short) << 16 ;      /* => ldfh R0, @(P0 + 0) */
    ...
    (*p_short) = var32 >> 16 ;      /* => sdhh @(P0 + 0), R0 */
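The decomposition used by mult32 can be checked on a host machine with a portable model. The sketch below is an assumption-laden rewrite, not the code from this note: the name mult32_model is hypothetical, and all halves are kept unsigned so that no step overflows in ISO C. Treating the high halves as unsigned instead of signed changes only bits above bit 31 of the cross terms once they are shifted left by 16, so the low 32 bits of the result are the same.

```c
#include <stdint.h>

/* Portable model of the 16-bit decomposition used by mult32.
 * Returns the low 32 bits of a * b as an unsigned value. */
uint32_t mult32_model(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t ah = ua >> 16;        /* high 16 bits of a */
    uint32_t al = ua & 0xFFFFu;    /* low 16 bits of a */
    uint32_t bh = ub >> 16;        /* high 16 bits of b */
    uint32_t bl = ub & 0xFFFFu;    /* low 16 bits of b */
    uint32_t low = bl * al;              /* low x low partial product */
    uint32_t mid = ah * bl + bh * al;    /* cross terms */

    /* The high x high term only affects bits 32 and above, so it is dropped. */
    low = low + (mid << 16);
    return low;
}
```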
This feature is important for optimizing some ETSI vocoders, because keeping results in the high part of a register sometimes improves both code size and performance. The ETSI intrinsic mult(a, b) is equivalent to (short) (L_mult(a, b)>>16), and ETSI vocoders commonly contain code sequences like:
    short tmp ;
    short tab[40];
    ...
    tmp = mult(a, b) ;
    tab[i] = add (tab[i], tmp) ;
This code sequence will be generated as follows:
    mpfcll  R0, R2, R3          // (1)
    shrw    R1, R0, 16          // (2)
    ldh     R2, @(P0 + 0)
    addcp   R4, R1, R2
    sdh     @(P0 + 0), R4
There will be a stall of 2 cycles between (1) and (2). Rewriting the code as follows does not prevent the stall, but it shortens the code sequence and improves performance:
    int L_tmp ;
    short tab[40];
    ...
    L_tmp = L_mult(a, b) ;
    tab[i] = __addp (tab[i]<<16, L_tmp) >> 16 ;
This code sequence will be generated as follows:
    mpfcll  R0, R2, R3
    ldfh    R2, @(P0 + 0)
    addcp   R4, R0, R2
    sdhh    @(P0 + 0), R4
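That the two C forms give the same 16-bit result can be checked with portable models of the operators involved. The sketch below rests on several assumptions: model_L_mult, model_mult and model_add follow the usual ETSI definitions, and model_addp treats __addp as a 32-bit saturating addition, which is a guess of this sketch (the real definition is in the Intrinsic Function User's Guide). All function names are hypothetical.

```c
#include <stdint.h>

/* Clamp a 64-bit value to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Clamp a 32-bit value to the signed 16-bit range. */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Portable arithmetic shift right by 16 (floor division by 65536). */
static int32_t high16(int32_t v)
{
    return (v >= 0) ? (v >> 16) : -(int32_t)((65535 - (int64_t)v) >> 16);
}

/* Models of the ETSI operators named in the text. */
int32_t model_L_mult(int16_t a, int16_t b) { return sat32(2 * (int64_t)a * b); }
int16_t model_mult(int16_t a, int16_t b)   { return (int16_t)high16(model_L_mult(a, b)); }
int16_t model_add(int16_t a, int16_t b)    { return sat16((int32_t)a + b); }

/* ASSUMPTION: __addp is modelled here as a 32-bit saturating add. */
int32_t model_addp(int32_t a, int32_t b)   { return sat32((int64_t)a + b); }

/* First form: tab[i] = add(tab[i], mult(a, b)). */
int16_t update_low(int16_t t, int16_t a, int16_t b)
{
    return model_add(t, model_mult(a, b));
}

/* Second form: tab[i] = __addp(tab[i] << 16, L_mult(a, b)) >> 16. */
int16_t update_high(int16_t t, int16_t a, int16_t b)
{
    int32_t L_tmp = model_L_mult(a, b);

    return (int16_t)high16(model_addp((int32_t)t * 65536, L_tmp));
}

/* Returns 1 when both forms agree on every combination of the test values. */
int forms_agree(void)
{
    static const int16_t v[] = { INT16_MIN, -32767, -12345, -1, 0, 1, 77, 12345, INT16_MAX };
    const int n = (int)(sizeof v / sizeof v[0]);
    int i, j, k;

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                if (update_low(v[i], v[j], v[k]) != update_high(v[i], v[j], v[k]))
                    return 0;
    return 1;
}
```

Under these assumptions the two forms agree because the low 16 bits of L_mult(a, b) never carry into the high half when added to tab[i]<<16, whose low half is zero.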
3.3 - Conditional Pointer Modification
Code that leads to the generation of conditional pointer modification should be avoided if possible. The main problem is that when an if-else block includes a pointer modification, the compiler might generate predicated code for this block, and a guard transfer between execution units might then happen, significantly degrading performance. In general, many conditional pointer modifications can be avoided by rewriting the code. Consider the following function:
    void q_p (
        short *ind,    /* Pulse position */
        short n        /* Pulse number */
    )
    {
        static const short gray[8] = {0, 1, 3, 2, 6, 4, 5, 7};
        short tmp;
        tmp = *ind;
        if (n < 5)
        {
            tmp = (tmp & 0x8) | gray[tmp & 0x7];
        }
        else
        {
            tmp = gray[tmp & 0x7];
        }
        *ind = tmp;
    }
The previous code is generated as follows using stcc -fast:
    q_p:
            ldh     r12, @(p0 + 0)              // tmp = *ind
            cmpltw  g1, r0, 5                   // G1/G9 pairs is computed in DU exu
    g1 ?    and     r15, r12, 7                 // R15 = tmp & 7
            makea   p12, %abs16to31(.st95)
            morea   p12, %abs0to15(.st95)
    g1 ?    copya   p14, r15                    // Guard and Data Reg transfer DU => AU
    g1 ?    and     r14, r12, 8
    g1 ?    ldh     r15, @(p12 + p14)           // Guard transfer DU => AU
    g1 ?    or      r15, r15, r14
    g1 ?    exth    r12, r15
    g9 ?    and     r15, r12, 7                 // R15 = tmp & 7
    g9 ?    copya   p14, r15                    // Guard and Data Reg transfer DU => AU
    g9 ?    ldh     r12, @(p12 + p14)           // Guard transfer DU => AU
            sdh     @(p0 + 0), r12
            // lineno: 23
            rts
There are four guard transfers from DU to AU, which induce a loss of decoupling between AU and DU. A more efficient way to write this code would be:
    void q_p (
        short *ind,    /* Pulse position */
        short n        /* Pulse number */
    )
    {
        static const short gray[8] = {0, 1, 3, 2, 6, 4, 5, 7};
        short tmp, tmp_gray;
        tmp = *ind;
        tmp_gray = gray[tmp & 0x7] ;
        if (n < 5)
        {
            tmp = (tmp & 0x8) | tmp_gray;
        }
        else
        {
            tmp = tmp_gray ;
        }
        *ind = tmp;
    }
The code is then generated as follows:
    q_p:
            ldh     r1, @(p0 + 0)
            makea   p12, %abs16to31(.st95)
            and     r15, r1, 7
            morea   p12, %abs0to15(.st95)
            cmpltw  g1, r0, 5
            copya   p14, r15                    // transfer DU => AU
    g1 ?    and     r15, r1, 8
            ldh     r12, @(p12 + p14)
    g1 ?    or      r15, r12, r15
    g1 ?    exth    r1, r15
    g9 ?    addu    r1, r12, 0
            sdh     @(p0 + 0), r1
            rts
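Since the rewrite only hoists the table lookup out of the if/else block, both versions of q_p must return the same value for every input. A minimal host-side check is sketched below; the names q_p_v1, q_p_v2 and q_p_versions_agree are hypothetical, and the explicit casts to short are additions of this sketch.

```c
static const short gray[8] = {0, 1, 3, 2, 6, 4, 5, 7};

/* First version: the table lookup appears in both branches. */
void q_p_v1(short *ind, short n)
{
    short tmp = *ind;

    if (n < 5)
        tmp = (short)((tmp & 0x8) | gray[tmp & 0x7]);
    else
        tmp = gray[tmp & 0x7];
    *ind = tmp;
}

/* Second version: the lookup is hoisted out of the if/else block. */
void q_p_v2(short *ind, short n)
{
    short tmp = *ind;
    short tmp_gray = gray[tmp & 0x7];

    if (n < 5)
        tmp = (short)((tmp & 0x8) | tmp_gray);
    else
        tmp = tmp_gray;
    *ind = tmp;
}

/* Returns 1 when both versions agree for all 4-bit indices and n in 0..9. */
int q_p_versions_agree(void)
{
    short n, v;

    for (n = 0; n < 10; n++) {
        for (v = 0; v < 16; v++) {
            short a = v, b = v;

            q_p_v1(&a, n);
            q_p_v2(&b, n);
            if (a != b)
                return 0;
        }
    }
    return 1;
}
```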
Four transfers have been replaced by one; the code is faster and smaller.
Similarly, another example, taken from a Viterbi trellis decoding algorithm, shows how code can be improved using this method:
    int *p_trans ;
    int *p_metric ;
    int metric1, metric2 ;
    ...
    *p_trans >>= 1;                 /* (1) */
    if (metric1 >= metric2)
    {
        *p_metric++ = metric1 ;     /* (3) */
    }
    else
    {
        *p_metric++ = metric2 ;     /* (4) */
        *p_trans |= 0x8000 ;        /* (2) */
    }
For this code, when stcc generates the predicated code for the if-else block, two types of problems occur. The first problem is related to the variable pointed to by p_trans; the second concerns the conditional post-incrementation of p_metric.
The variable pointed to by p_trans is read from memory in (1) (load), then written to memory in (1) (store), then read from memory in (2) (load), then written to memory in (2) (store). Since this code is if-converted by stcc, the load in (2) is always performed, whichever branch is taken. In this case there is
a potential SAQ-hit between the store in (1) and the load in (2). A SAQ-hit occurs when a memory location is read before being effectively written by the processor. The processor performs all pending stores in the SAQ and the conflicting store before performing the load. When this occurs, decoupling between AU and DU is lost and the processor is stalled until all operations are completed. To avoid this loss of decoupling, and also to implement a more power-efficient application, it is recommended to use an intermediate variable to keep track of the evolution of *p_trans. For the previous block, a better way to write it for the ST122 is:
    int *p_trans ;
    int *p_metric ;
    int metric1, metric2 ;
    int trans ;
    ...
    trans = *p_trans ;
    ...
    trans >>= 1;    /* trans is now a scalar variable, no more memory write */
    if (metric1 >= metric2)
    {
        *p_metric++ = metric1 ;     /* (3) */
    }
    else
    {
        *p_metric++ = metric2 ;     /* (4) */
        trans |= 0x8000 ;
    }
    ...
    *p_trans = trans ;
Using trans as an intermediate variable prevents SAQ-hit generation, because stcc is able to keep trans in a data register. The second problem to address is then to avoid the conditional post-incrementation in (3) and (4). The code should be rewritten as follows:
    int *p_trans ;
    int *p_metric ;
    int metric1, metric2 ;
    int trans ;
    ...
    trans = *p_trans ;
    ...
    trans >>= 1;
    if (metric1 >= metric2)
    {
        *p_metric = metric1 ;
    }
    else
    {
        *p_metric = metric2 ;
        trans |= 0x8000 ;
    }
    p_metric++ ;    /* p_metric is updated outside if/else block */
    ...
    *p_trans = trans ;
stcc generates the following assembly:
            shrw    r12, r12, 1
            cmpgew  g1, r15, r1
    g1 ?    sdw     @(p14 + 0), r15
    g9 ?    or      r12, r2, r12
    g9 ?    sdw     @(p14 + 0), r1
            adda    p14, p14, 4
Note that to get optimal code for this sequence it should be written as follows:
    trans >>= 1;
    if (metric1 < metric2)
    {
        trans |= 0x8000;
    }
    if (metric1 >= metric2)
    {
        metric2 = metric1;
    }
    *p_metric = metric2 ;
    p_metric++ ;
Then stcc recognizes max idiom and generates:
            shrw    r12, r12, 1
            maxw    r14, r15, r1
            cmpltw  g1, r1, r15
            sdw     @(p2 !+ 4), r14
    g1 ?    or      r12, r3, r12
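The successive rewrites must leave the results unchanged. The sketch below compares the first and the last form of the trellis step on a few metric pairs. The function names step_v1, step_v2 and steps_agree and the test values are hypothetical; unlike the fragments above, each call loads and stores trans so that the two forms can be compared through the same interface.

```c
/* One trellis step as first written: *p_trans is updated in memory and
 * p_metric is post-incremented inside both branches. */
int *step_v1(int *p_trans, int *p_metric, int metric1, int metric2)
{
    *p_trans >>= 1;
    if (metric1 >= metric2) {
        *p_metric++ = metric1;
    } else {
        *p_metric++ = metric2;
        *p_trans |= 0x8000;
    }
    return p_metric;
}

/* Same step in the final form: scalar trans, max pattern, and an
 * unconditional pointer update. */
int *step_v2(int *p_trans, int *p_metric, int metric1, int metric2)
{
    int trans = *p_trans;

    trans >>= 1;
    if (metric1 < metric2) {
        trans |= 0x8000;
    }
    if (metric1 >= metric2) {
        metric2 = metric1;
    }
    *p_metric = metric2;
    p_metric++;
    *p_trans = trans;
    return p_metric;
}

/* Returns 1 when both forms produce the same metrics and transition word. */
int steps_agree(void)
{
    static const int m1[6] = { 5, -3, 7, 7,  0, 100 };
    static const int m2[6] = { 2,  4, 7, 9, -1, 100 };
    int out1[6], out2[6];
    int trans1 = 0, trans2 = 0;
    int *p1 = out1, *p2 = out2;
    int k;

    for (k = 0; k < 6; k++) {
        p1 = step_v1(&trans1, p1, m1[k], m2[k]);
        p2 = step_v2(&trans2, p2, m1[k], m2[k]);
    }
    if (trans1 != trans2 || trans1 != 0x2800)
        return 0;
    for (k = 0; k < 6; k++)
        if (out1[k] != out2[k])
            return 0;
    return 1;
}
```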
4 - WRITING C TO EXPLOIT PROCESSOR PARALLELISM
The stcc compiler offers a wide range of optimization features that help programmers take advantage of processor parallelism without modifying their code. However, rewriting the code is sometimes necessary; this chapter describes two techniques to apply when the compiler fails to fully exploit processor parallelism.
4.1 - Safe Associative Transformations
In general, safe associative transformations are handled by the compiler using the -Mvect=packed option. Some cases cannot be transformed because they use non-associative operations such as saturating arithmetic. Saturating arithmetic is not associative:
    (A +sat B) +sat C ≠ A +sat (B +sat C)
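A small numeric case makes the inequality concrete. In the sketch below, sat_add32 is a portable model of a 32-bit saturating addition (a hypothetical helper, not an stcc intrinsic). With A equal to the largest 32-bit value, B = 1 and C = -1, the left grouping saturates first and then loses one, while the right grouping adds zero.

```c
#include <stdint.h>

/* 32-bit saturating addition (portable model, not an stcc intrinsic). */
int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;

    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* (A +sat B) +sat C */
int32_t left_grouping(int32_t a, int32_t b, int32_t c)
{
    return sat_add32(sat_add32(a, b), c);
}

/* A +sat (B +sat C) */
int32_t right_grouping(int32_t a, int32_t b, int32_t c)
{
    return sat_add32(a, sat_add32(b, c));
}
```

For these operands left_grouping returns INT32_MAX - 1 and right_grouping returns INT32_MAX, so regrouping a saturating sum can change the result.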
However, the saturating computation of a signal's energy, which is a sum of squares, can be performed using the split summation technique: if saturation occurs at one computation step, adding a further positive (square) value leaves the result saturated. The energy computation loop is given by:
int energy (short x[], int N) { int i, s ; s = 0; For (i=0; i
Assuming N is a multiple of two, this routine can be rewritten:
int energy (short x[], int N) { int i, s0, s1 ; s0 = 0; s1 = 0; For (i=0; i
}
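A minimal sketch of the split summation technique in portable C follows. It is an illustration under stated assumptions, not the routine from this note: mac_model is a hypothetical stand-in for a saturating fractional multiply accumulate (modelled on ETSI L_mac), and combining the two partial sums with a saturating addition is a choice made for this sketch.

```c
#include <stdint.h>

/* Clamp a 64-bit value to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* ASSUMPTION: saturating fractional MAC modelled as s + 2*a*b. */
static int32_t mac_model(int32_t s, int16_t a, int16_t b)
{
    return sat32((int64_t)s + sat32(2 * (int64_t)a * b));
}

/* Sequential summation: one accumulator. */
int32_t energy_seq(const int16_t x[], int n)
{
    int32_t s = 0;
    int i;

    for (i = 0; i < n; i++)
        s = mac_model(s, x[i], x[i]);
    return s;
}

/* Split summation: two independent accumulators, n must be even. */
int32_t energy_split(const int16_t x[], int n)
{
    int32_t s0 = 0, s1 = 0;
    int i;

    for (i = 0; i < n; i += 2) {
        s0 = mac_model(s0, x[i], x[i]);
        s1 = mac_model(s1, x[i + 1], x[i + 1]);
    }
    return sat32((int64_t)s0 + s1);    /* saturating add of the partial sums */
}

/* Returns 1 when both versions agree, with and without saturation. */
int energy_versions_agree(void)
{
    static const int16_t quiet[4] = { 100, -200, 300, -400 };
    static const int16_t loud[4]  = { 32767, -32768, 32767, 12345 };

    return energy_seq(quiet, 4) == 600000
        && energy_split(quiet, 4) == 600000
        && energy_seq(loud, 4) == INT32_MAX
        && energy_split(loud, 4) == INT32_MAX;
}
```

Because every term is non-negative, each accumulator can only grow or stay saturated, which is why the split form gives the same result as the sequential one.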
This routine is now twice as fast; the generated assembly for the loop is a single SLIW bundle, which takes one cycle to execute:
    ldh     r15, @(p1 - 2)
    ldh     r14, @(p1 !+ 4)
    mafcll  r2, r2, r15, r15
    mafcll  r1, r1, r14, r14
4.2 - Multi Sample Techniques
The general principle of signal processing is to take an input signal (a vector of samples) and to apply a mathematical transformation to each sample to produce an output signal. For instance, filtering an input signal x using a finite impulse response (FIR) filter with N coefficients is defined by the following formula:
$$ y_i = \sum_{j=0}^{N-1} x_{i-j} \cdot c_j $$
yi is the i-th sample of the filtered output signal and cj is the j-th coefficient. An equivalent C algorithm to filter a 16-bit sample signal of M elements can be:
    void filter (short c[], short x[], short y[], int N, int M)
    {
        int i, j;
        int s;
        for (i = 0; i < M; i++)
        {
            s = 0;
            for (j = 0; j < N; j++)
            {
                s = __mafcw (s, c[j], x[i - j]);
            }
            y[i] = s>>16;
        }
    }
This algorithm uses the 16-bit saturating arithmetic of the ST122. When compiled with stcc -fast, the following assembly is generated for the nested loops:
    .LB592: //.B0012
    // * * * SLIW Loop Code. Cycles: 1 Lineno:16
        nop
        make    r2, 0
        adda    p4, p0, 0
        adda    p14, p6, 0
    // lineno: 11
    .LB598: //.B0014
    // * * * SLIW Loop Code. Cycles: 1 Lineno:16
    .LB669:
        ldh     r15, @(p14 !- 2)
        ldh     r14, @(p4 !+ 2)
        mafcll  r2, r2, r14, r15
        nop
    // lineno: 0
    // * * * SLIW Loop Code. Cycles: 1 Lineno:16
    .LB678:
        nop
        add     r3, r3, 1
        adda    p6, p6, 2
        sdhh    @(p5 !+ 2), r2
Assuming c and x are located in different memory banks, this loop structure takes (2+N)*M cycles to execute. Since saturating arithmetic is not associative, split summation of the inner loop is not possible. To take advantage of the ST122 dual ALU issue capability, the algorithm should be rewritten to compute two samples yi, yi+1 of the output signal at a time. The following scheme shows how to group the computations so that they can be performed in parallel:
              Iteration 1      Iteration 2             Iteration N-1         Iteration N
    yi    =   xi+0 * c0    +   xi-1 * c1    + ... +    xi-N+2 * cN-2    +    xi-N+1 * cN-1
    yi+1  =   xi+1 * c0    +   xi+0 * c1    + ... +    xi-N+3 * cN-2    +    xi-N+2 * cN-1
Unrolling the outer loop by a factor of two (M should be a multiple of two) gives the following C code to compute yi, yi+1:
    for (i = 0; i < M; i+=2)
    {
        s0 = 0 ;
        for (j = 0; j < N; j++)
        {
            s0 = __mafcw (s0, c[j], x[i-j]);
        }
        y[i] = s0>>16;
        s1 = 0 ;
        for (j = 0; j < N; j++)
        {
            s1 = __mafcw (s1, c[j], x[i-j+1]);
        }
        y[i+1] = s1>>16;
    }
Then inner loops can be fused to produce a parallel version of inner loop:
    void filter2 (short c[], short x[], short y[], int N, int M)
    {
        int i, j;
        int s0, s1;
        for (i = 0; i < M; i+=2)
        {
            s0 = 0 ;
            s1 = 0 ;
            for (j = 0; j < N; j++)
            {
                s0 = __mafcw (s0, c[j], x[i-j]);
                s1 = __mafcw (s1, c[j], x[i-j+1]);
            }
            y[i] = s0>>16;
            y[i+1] = s1>>16;
        }
    }
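The unrolled and fused version must produce exactly the same output as the original filter, including when saturation occurs. A host-side check is sketched below under stated assumptions: mac_model is a hypothetical stand-in for __mafcw (modelled on ETSI L_mac), high16 replaces s>>16 with a portable equivalent, and the test data are arbitrary. Since x[i-j] reaches before x[0], the x pointer is placed N-1 samples into a larger buffer.

```c
#include <stdint.h>

/* Clamp a 64-bit value to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* ASSUMPTION: stand-in for __mafcw, modelled as s + 2*a*b with saturation. */
static int32_t mac_model(int32_t s, int16_t a, int16_t b)
{
    return sat32((int64_t)s + sat32(2 * (int64_t)a * b));
}

/* Portable equivalent of s >> 16 (floor division by 65536). */
static int16_t high16(int32_t s)
{
    return (int16_t)((s >= 0) ? (s >> 16) : -((65535 - (int64_t)s) >> 16));
}

/* Reference filter: one output sample per outer iteration. */
void filter_ref(const int16_t c[], const int16_t x[], int16_t y[], int N, int M)
{
    int i, j;

    for (i = 0; i < M; i++) {
        int32_t s = 0;

        for (j = 0; j < N; j++)
            s = mac_model(s, c[j], x[i - j]);
        y[i] = high16(s);
    }
}

/* Two output samples per outer iteration, inner loops fused (M even). */
void filter2_model(const int16_t c[], const int16_t x[], int16_t y[], int N, int M)
{
    int i, j;

    for (i = 0; i < M; i += 2) {
        int32_t s0 = 0, s1 = 0;

        for (j = 0; j < N; j++) {
            s0 = mac_model(s0, c[j], x[i - j]);
            s1 = mac_model(s1, c[j], x[i - j + 1]);
        }
        y[i] = high16(s0);
        y[i + 1] = high16(s1);
    }
}

/* Returns 1 when both filters produce identical output on the test data. */
int filter2_matches(void)
{
    /* 3 history samples (for N = 4) followed by 6 input samples. */
    static const int16_t buf[9] = { 1000, -2000, 32767, -32768, 32767,
                                    32767, -5, 12000, -32768 };
    static const int16_t c[4] = { 32767, -32768, 16384, -8192 };
    int16_t y1[6], y2[6];
    int i;

    filter_ref(c, buf + 3, y1, 4, 6);
    filter2_model(c, buf + 3, y2, 4, 6);
    for (i = 0; i < 6; i++)
        if (y1[i] != y2[i])
            return 0;
    return 1;
}
```

The results stay bit-exact because each accumulator still receives its products in the same order as in the reference loop; only independent accumulations are interleaved.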
When compiled with stcc -fast, the generated code for the inner loop is a two-SLIW-bundle loop, because three 16-bit memory accesses are performed to load cj, xi-j and xi-j+1. The generated assembly is not faster than the previous one because the inner loop takes two cycles to execute; the algorithm cycle count is then equal to (2*N+4)*M/2 = (N+2)*M.
    .LB499: //.B0004
    // * * * SLIW Loop Code. Cycles: 2 Lineno:35
        nop
        make    r3, 0
        adda    p4, p6, 0
        adda    p14, p0, 0
        nop
        addu    r2, r3, 0
        nop
        nop
    // lineno: 28
    .LB503: //.B0006
    // * * * SLIW Loop Code. Cycles: 2 Lineno:35
        ldh     r15, @(p4 - 2)
        ldh     r14, @(p14 !+ 2)
        mafcll  r3, r3, r14, r15
        nop
    .LB757:
        ldh     r13, @(p4 !- 2)
        mafcll  r2, r2, r14, r13
        nop
        nop
    // lineno: 32
    // * * * SLIW Loop Code. Cycles: 2 Lineno:35
        nop
        add     r12, r12, 2
        adda    p6, p6, 4
        sdhh    @(p5 - 2), r3
    .LB770:
        nop
        nop
        nop
        sdhh    @(p5 !+ 4), r2
Since the current ST122 memory sub-system supports 32-bit word accesses on 16-bit word alignment, the previous code can be optimized using stcc -fast -Mvect=packed. Using -Mvect=packed allows the compiler to perform a single 32-bit memory access to load variables xi-j and xi-j+1. The resulting assembly then takes (N+4)*M/2 = (N/2+2)*M cycles to execute.
    .LB742: //.B0017
    // * * * SLIW Loop Code. Cycles: 2 Lineno:35
        nop
        make    r3, 0
        adda    p4, p0, 0
        adda    p14, p6, 0
        nop
        addu    r2, r3, 0
        nop
        nop
    // lineno: 28
    .LB747: //.B0018
    // * * * SLIW Loop Code. Cycles: 1 Lineno:35
    .LB807:
        ldp     r15, @(p14 !- 2)
        ldh     r14, @(p4 !+ 2)
        mafcll  r3, r3, r14, r15
        mafclh  r2, r2, r14, r15
    // lineno: 0
    // * * * SLIW Loop Code. Cycles: 2 Lineno:35
        nop
        add     r12, r12, 1
        adda    p6, p6, 4
        sdhh    @(p5 - 2), r3
    .LB820:
        nop
        nop
        nop
        sdhh    @(p5 !+ 4), r2
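As a worked illustration of the cycle counts quoted so far, the sketch below evaluates the three formulas. The function names are hypothetical, and the sizes used in the usage note (N = 32 coefficients, M = 160 samples) are assumptions chosen only for illustration.

```c
/* Cycle-count formulas quoted in section 4.2 for an N-coefficient filter
 * applied to M samples. Function names are illustrative only. */

/* Original filter: (2+N)*M cycles. */
long cycles_filter(long N, long M)
{
    return (2 + N) * M;
}

/* Two samples per outer iteration, 16-bit loads only: (2*N+4)*M/2 cycles. */
long cycles_filter2(long N, long M)
{
    return (2 * N + 4) * M / 2;
}

/* Two samples per outer iteration with packed loads: (N+4)*M/2 cycles. */
long cycles_filter2_packed(long N, long M)
{
    return (N + 4) * M / 2;
}
```

For N = 32 and M = 160, the first two functions both return 5440 cycles and the packed version returns 2880 cycles, roughly a factor of 1.9 fewer.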
Note that "packing" memory accesses is only possible with a data memory sub-system that supports 32-bit word accesses aligned on any 16-bit word boundary. The current implementation of the ST122 supports this feature; however, the same principle can be applied to generate code for a memory sub-system that does not allow "safe" packing of memory accesses. The basic principle is to rewrite the algorithm to perform only 16-bit memory accesses; computing four output samples reduces the number of loads by reusing the variables loaded for yi, yi+1 in the computation of yi+2, yi+3. Using the following scheme, only four 16-bit loads are required per inner loop iteration, for variables xj, xj+1, xj+2 and cj+1. cj+1 is kept in a variable for the next loop iteration.
              Loop Prolog    Iteration 1     Iteration 2            Iteration N-1         Loop Epilog
    yi    =                  xi+0 * c0   +   xi-1 * c1   + ... +    xi-N+2 * cN-2    +    xi-N+1 * cN-1
    yi+1  =                  xi+1 * c0   +   xi+0 * c1   + ... +    xi-N+3 * cN-2    +    xi-N+2 * cN-1
    yi+2  =   xi+2 * c0  +   xi+1 * c1   +   xi+0 * c2   + ... +    xi-N+3 * cN-1
    yi+3  =   xi+3 * c0  +   xi+2 * c1   +   xi+1 * c2   + ... +    xi-N+4 * cN-1
This algorithm can be written in C as follows:
    void filter4 (short c[], short x[], short y[], int N, int M)
    {
        int i, j;
        int s0, s1, s2, s3;
        short c0, c1, x0, x1, x2 ;
        for (i = 0; i < M; i+=4)
        {
            s0 = 0 ;
            s1 = 0 ;
            c0 = c[0] ;
            s2 = __mpfcw(x[i+2], c0) ;
            s3 = __mpfcw(x[i+3], c0) ;
            for (j = 0; j < N-1; j++)
            {
                x0 = x[i+0 - j] ;
                x1 = x[i+1 - j] ;
                s0 = __mafcw (s0, x0, c0);
                s1 = __mafcw (s1, x1, c0);
                x2 = x[i+2 - j] ;
                c0 = c1 = c[j+1] ;
                s2 = __mafcw (s2, x1, c1);
                s3 = __mafcw (s3, x2, c1);
            }
            x1 = x0 ;
            x0 = x[i-N+1] ;
            s0 = __mafcw (s0, x0, c0);
            s1 = __mafcw (s1, x1, c0);
            y[i+0] = s0>>16;
            y[i+1] = s1>>16;
            y[i+2] = s2>>16;
            y[i+3] = s3>>16;
        }
    }
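The four-sample scheme can be checked against the straightforward filter on a host machine. The sketch below is a portable rendering under stated assumptions: mac_model is a hypothetical stand-in for __mafcw (and, with a zero accumulator, for __mpfcw), modelled on ETSI L_mac; high16 is a portable replacement for s>>16; and x0 is given an initial value only so that it is always defined. N must be at least 2 and M a multiple of four.

```c
#include <stdint.h>

/* Clamp a 64-bit value to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* ASSUMPTION: stand-in for __mafcw, modelled as s + 2*a*b with saturation. */
static int32_t mac_model(int32_t s, int16_t a, int16_t b)
{
    return sat32((int64_t)s + sat32(2 * (int64_t)a * b));
}

/* Portable equivalent of s >> 16 (floor division by 65536). */
static int16_t high16(int32_t s)
{
    return (int16_t)((s >= 0) ? (s >> 16) : -((65535 - (int64_t)s) >> 16));
}

/* Reference filter: one output sample per outer iteration. */
void filter_ref(const int16_t c[], const int16_t x[], int16_t y[], int N, int M)
{
    int i, j;

    for (i = 0; i < M; i++) {
        int32_t s = 0;

        for (j = 0; j < N; j++)
            s = mac_model(s, c[j], x[i - j]);
        y[i] = high16(s);
    }
}

/* Four output samples per outer iteration, 16-bit loads only. */
void filter4_model(const int16_t c[], const int16_t x[], int16_t y[], int N, int M)
{
    int i, j;

    for (i = 0; i < M; i += 4) {
        int32_t s0 = 0, s1 = 0, s2, s3;
        int16_t c0 = c[0], c1, x0 = 0, x1, x2;

        /* Loop prolog: first products of y[i+2] and y[i+3]. */
        s2 = mac_model(0, x[i + 2], c0);
        s3 = mac_model(0, x[i + 3], c0);
        for (j = 0; j < N - 1; j++) {
            x0 = x[i - j];
            x1 = x[i + 1 - j];
            s0 = mac_model(s0, x0, c0);
            s1 = mac_model(s1, x1, c0);
            x2 = x[i + 2 - j];
            c0 = c1 = c[j + 1];    /* coefficient reused by the next iteration */
            s2 = mac_model(s2, x1, c1);
            s3 = mac_model(s3, x2, c1);
        }
        /* Loop epilog: last products of y[i] and y[i+1]. */
        x1 = x0;
        x0 = x[i - N + 1];
        s0 = mac_model(s0, x0, c0);
        s1 = mac_model(s1, x1, c0);
        y[i] = high16(s0);
        y[i + 1] = high16(s1);
        y[i + 2] = high16(s2);
        y[i + 3] = high16(s3);
    }
}

/* Returns 1 when both filters produce identical output on the test data. */
int filter4_matches(void)
{
    /* 3 history samples (for N = 4) followed by 8 input samples. */
    static const int16_t buf[11] = { 1000, -2000, 32767, -32768, 32767, 32767,
                                     -5, 12000, -32768, 77, -1 };
    static const int16_t c[4] = { 32767, -32768, 16384, -8192 };
    int16_t y1[8], y4[8];
    int i;

    filter_ref(c, buf + 3, y1, 4, 8);
    filter4_model(c, buf + 3, y4, 4, 8);
    for (i = 0; i < 8; i++)
        if (y1[i] != y4[i])
            return 0;
    return 1;
}
```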
When compiled with stcc -fast, the code for the loops is translated into:
    .LB523: //.B0008
    // * * * SLIW Loop Code. Cycles: 3 Lineno:76
        ldh     r2, @(p0 + 0)
        make    r14, 0
        nop
        adda    p14, p9, 0
        ldh     r15, @(p7 - 2)
        ldh     r1, @(p7 + 0)
        addu    r13, r14, 0
        mpfcll  r12, r15, r2
        nop
        mpfcll  r3, r1, r2
        nop
        adda    p4, p8, 0
    // lineno: 53
    .LB528: //.B0010
    // * * * SLIW Loop Code. Cycles: 2 Lineno:76
        ldh     r15, @(p4 - 2)
        ldh     r4, @(p4 - 4)
        mafcll  r13, r13, r15, r2
        mafcll  r14, r14, r4, r2
    .LB1008:
        ldh     r1, @(p4 !- 2)
        ldh     r2, @(p14 !+ 2)
        mafcll  r12, r12, r15, r2
        mafcll  r3, r3, r1, r2
    // lineno: 64
    // * * * SLIW Loop Code. Cycles: 5 Lineno:76
        ldh     r15, @(p5 !+ 8)
        mafcll  r1, r13, r4, r2
        add     r5, r5, 4
        adda    p7, p7, 8
        nop
        mafcll  r15, r14, r15, r2
        sdhh    @(p6 - 6), r15
        adda    p8, p8, 8
        nop
        nop
        nop
        sdhh    @(p6 - 4), r1
        nop
        nop
        nop
        sdhh    @(p6 - 2), r12
    .LB1049:
        nop
        nop
        nop
        sdhh    @(p6 !+ 8), r3
This code executes in ((N-1)*2+8)*M/4 = (N+3)*M/2 cycles, with the restriction that M is a multiple of four. It can be further improved by using pointer arithmetic to access the x and c arrays, so that pointer reinitialization in the outer loop is also performed via pointer arithmetic.
Table 1. Revision History
    Date         Revision    Description of Changes
    June 2004    1           First Issue
The present note, which is for guidance only, aims at providing customers with information regarding their products in order for them to save time. As a result, STMicroelectronics shall not be held liable for any direct, indirect or consequential damages with respect to any claims arising from the content of such a note and/or the use made by customers of the information contained herein in connection with their products.
Information furnished is believed to be accurate and reliable. However, STMicroelectronics assumes no responsibility for the consequences of use of such information nor for any infringement of patents or other rights of third parties which may result from its use. No license is granted by implication or otherwise under any patent or patent rights of STMicroelectronics. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. STMicroelectronics products are not authorized for use as critical components in life support devices or systems without express written approval of STMicroelectronics. The ST logo and the ST100 are registered trademarks of STMicroelectronics STCC is a registered trademark of STMicroelectronics in the U.S.A and other countries All other names are the property of their respective owners 2004 STMicroelectronics - All rights reserved STMicroelectronics GROUP OF COMPANIES Australia - Belgium - Brazil - Canada - China - Czech Republic - Finland - France - Germany - Hong Kong - India - Israel - Italy - Japan - Malaysia - Malta - Morocco - Singapore - Spain - Sweden - Switzerland - United Kingdom - United States www.st.com