# ARM VFP Vector Programming, Part 2: Examples

In Part 1, I explained the design philosophy of vector calculations using the ARM Vector Floating Point system. This article will demonstrate the actual practice of using vectors in the ARM VFP.

# Enabling and disabling vector banks

The Floating-Point Status and Control Register (FPSCR) is shared between coprocessors 10 and 11, so any changes to it apply to both single-precision and double-precision calculations using the VFP. It contains the following fields:

- Bits 31-28: Comparison result flags
- Bit 25: Not-a-Number default mode
- Bit 24: Flush-to-Zero mode
- Bits 23-22: Rounding mode
- Bits 21-20: Vector **Stride**
- Bits 18-16: Vector **Length**
- Bits 15, 12-8: Exception enabling
- Bits 7, 4-0: Exception “sticky” flags

All other bits (27, 26, 19, 14, 13, 6, 5) should not be modified by any software.

Normally, in “scalar mode,” the Length and Stride fields in the FPSCR are all 0, meaning Length is 1 and Stride is ignored. To enable the SIMD mode of the VFP, the Length must be something greater than 1. The Stride can be either “consecutive registers” (field set to binary 00) or “every other register” (field set to binary 11). To stay within the bounds of defined behavior, the effective Length times the effective Stride should not exceed the effective bank size (4 double-precision, or 8 single-precision values).
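To make the encoding concrete, here is a minimal C sketch that decodes the two fields from a raw FPSCR value and checks the bank-size rule described above. It works on a plain integer, not the real register, and the helper names (`fpscr_stride`, `fpscr_length`, `vfp_vector_fits`) are invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper names; field positions follow the FPSCR list above. */

/* Effective stride: field 00 means 1, field 11 means 2.
   Fields 01 and 10 are undefined and are reported here as 0. */
static unsigned fpscr_stride(uint32_t fpscr)
{
    unsigned field = (fpscr >> 20) & 0x3u;   /* bits 21-20 */
    if (field == 0x0u) return 1u;
    if (field == 0x3u) return 2u;
    return 0u;
}

/* Effective length: the 3-bit field stores length - 1. */
static unsigned fpscr_length(uint32_t fpscr)
{
    return ((fpscr >> 16) & 0x7u) + 1u;      /* bits 18-16 */
}

/* True when length * stride stays inside one bank:
   8 registers for single precision, 4 for double precision. */
static bool vfp_vector_fits(unsigned length, unsigned stride, bool is_double)
{
    unsigned bank = is_double ? 4u : 8u;
    if (stride != 1u && stride != 2u)
        return false;
    if (length < 1u || length > bank)
        return false;
    return length * stride <= bank;
}
```

For instance, length 4 with stride 2 fits a single-precision bank, while length 8 with stride 2 does not.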

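A formula in C to determine the bits set for Length and Stride, in their proper positions, is:

    bitfield_set = ((stride ^ 1) << 20) | ((length - 1) << 16);

In assembly, the same calculation can be wrapped in a routine that changes only those two fields of the FPSCR:

    // inputs:   r0 = stride (1 or 2), r1 = length (1 to 8)
    // outputs:  (none)
    // modified: Floating-Point Status and Control Register (FPSCR)
    reconfig:
        push {r0-r2}
        // adjust stride: 1 (01->00) or 2 (10->11)
        // stride field as 01 or 10 is undefined
        eor r0, r0, #1
        // truncate r0 to the 2-bit stride field
        and r0, r0, #3
        // adjust/truncate r1
        sub r1, r1, #1
        and r1, r1, #7
        // shift/accumulate them in r0
        mov r0, r0, lsl #20
        orr r0, r0, r1, lsl #16
        // get SysCtlReg
        fmrx r2, fpscr
        // clear stride and length fields
        bic r2, r2, #55*65536
        // set them with the adjusted values
        orr r2, r2, r0
        // this is the new SysCtlReg
        fmxr fpscr, r2
        // restore user registers
        pop {r0-r2}
        bx lr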

The “reconfig” routine does no validation of its inputs; it merely guarantees that only the Stride and Length fields of the FPSCR are modified. It is still possible to specify *combinations* of Length and Stride that overflow a register bank. It is also possible to pass “illegal” values through the r0 and r1, which result in unexpected or undesired behavior. It is up to the programmer to set the Length and Stride correctly.
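Because “reconfig” trusts its caller, one option is to validate the request on the C side first. The sketch below models the same bit manipulation on a plain integer (it never touches the real FPSCR) and rejects undefined or overflowing requests. The name `fpscr_with_vector` and its signature are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FPSCR_VEC_MASK 0x00370000u   /* stride bits 21-20 plus length bits 18-16 */

/* Computes the FPSCR value with new stride/length fields, leaving every
   other bit of old_fpscr untouched. Returns false, and writes nothing,
   if the request is undefined or would overflow a register bank. */
static bool fpscr_with_vector(uint32_t old_fpscr, unsigned stride,
                              unsigned length, bool is_double,
                              uint32_t *new_fpscr)
{
    unsigned bank = is_double ? 4u : 8u;

    if (stride != 1u && stride != 2u)
        return false;
    if (length < 1u || length > bank || length * stride > bank)
        return false;

    /* Same formula as in the text: stride 1 -> 00, stride 2 -> 11. */
    uint32_t bits = ((uint32_t)(stride ^ 1u) << 20)
                  | ((uint32_t)(length - 1u) << 16);

    *new_fpscr = (old_fpscr & ~FPSCR_VEC_MASK) | bits;
    return true;
}
```

A caller could run this check first and only invoke the assembly routine when it returns true.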

To revert to “scalar mode,” simply reset the Length and Stride each to 1:

    // inputs:   (none)
    // outputs:  (none)
    // modified: Floating-Point Status and Control Register (FPSCR)
    // clears the stride and length fields of the FPSCR,
    // returning it to "scalar mode" (no vectors). This is
    // what the GNU EABI expects.
    deconfig:
        push {r0,r1,lr}
        // set stride and length to 1
        mov r0, #1
        mov r1, #1
        // and call "reconfig"
        bl reconfig
        // then return
        pop {r0,r1,pc}

Of course, if the “reconfig” routine works correctly, then “deconfig” is guaranteed to work correctly.

The GNU EABI does not expect anything but scalar mode for ARM VFP instructions, and any program enabling a vector mode for the VFP must take this into account. The easiest approach is to disable vector mode before either calling or returning to GCC-generated code.

# A simple, stupid, totally brain-dead example

For a basic introduction, let’s add two parallel arrays of single-precision values together, and store the results into a third array. Here’s the design in C:

    void doit(float *sum, float *a1, float *a2, int length)
    {
        int i;
        for (i = 0; i < length; i++)
            *(sum+i) = *(a1+i) + *(a2+i);
    }

With ‘-O3 -march=native -ftree-vectorize -funroll-all-loops -mfpu=vfp’ options, GCC does a decent job of compiling this function, creating an interesting specimen of Duff’s device. But we can do somewhat better.

**Step 0:** Restrict the array length to be a multiple of the maximum VFP vector length. For single-precision vectors, that is 8 registers. This is optional, but it simplifies this example greatly.

**Step 1:** Define the parallel operation. In this case, it’s addition. The assembly code will look something like “FADDS dest, source1, source2”. The instruction operands specify the base register for each vector.

**Step 2:** Set up the loop for operand fetch, vector operation, and result store. The ARM EABI passes C function arguments in r0, r1, r2, r3, and the stack, in that order. This means that r0 is the pointer to the array of results, r1 and r2 are the pointers to the source arrays, and r3 contains the length of the arrays. In step 0, this was restricted to a multiple of 8 for single-precision arrays.

The net result of these steps becomes something like this:

    doit:
        push {r0,r1,r4,lr}
        // s16-s31 are callee-saved under the AAPCS, so preserve them
        fstmdbs sp!, {s16-s31}
        // first thing to do is change the
        // Stride and Length fields in the FPSCR.
        // Linux keeps these as "pure scalar,"
        // so that all operations specify
        // single registers. We want to enable
        // vector mode, to add 8 pairs of floats
        // at a time.
        push {r0,r1}
        // stride 1
        mov r0, #1
        // vector length 8
        mov r1, #8
        bl reconfig
        pop {r0,r1}
        // divide count by 8, since we're doing 8 at a time
        asr r3, r3, #3
    .loop:
        // fetch 8 floats from source array 1
        fldmias r1!, {s8-s15}
        // and 8 floats from source array 2
        fldmias r2!, {s16-s23}
        // here's where the magic happens:
        // s[24..31] = s[8..15] + s[16..23]
        fadds s24, s8, s16
        // and store the result in the destination array
        fstmias r0!, {s24-s31}
        // one less block
        subs r3, r3, #1
        // loop if we're not done
        bne .loop
        // restore the FPSCR
        bl deconfig
        // restore the callee-saved VFP registers
        fldmias sp!, {s16-s31}
        // and we're done
        pop {r0,r1,r4,pc}

Did you see that? Loading 8 floats, loading 8 more floats, then adding them together in parallel, then storing 8 floats, repeating to the ends of the arrays. The double-precision version is nearly the same, except that the vector length is 4, and therefore the block count is shifted right by 2 (4 = 2²) rather than 3 (8 = 2³).
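For readers more comfortable in C, the block structure of the assembly loop can be modeled as follows. This is only a sketch of the blocking (eight elements per pass), not a claim about what GCC emits. `doit_blocked` is a hypothetical name, and each inner loop stands in for the single VFP instruction named in its comment.

```c
#include <assert.h>

#define VFP_SP_BANK 8   /* single-precision values per VFP bank */

/* length must be a multiple of 8, as in Step 0.
   Unlike the assembly version, this model also tolerates length 0. */
static void doit_blocked(float *sum, const float *a1, const float *a2, int length)
{
    assert(length >= 0 && length % VFP_SP_BANK == 0);

    for (int blocks = length / VFP_SP_BANK; blocks > 0; blocks--) {
        float bank1[VFP_SP_BANK], bank2[VFP_SP_BANK], bank3[VFP_SP_BANK];

        for (int i = 0; i < VFP_SP_BANK; i++)   /* fldmias r1!, {s8-s15}  */
            bank1[i] = *a1++;
        for (int i = 0; i < VFP_SP_BANK; i++)   /* fldmias r2!, {s16-s23} */
            bank2[i] = *a2++;
        for (int i = 0; i < VFP_SP_BANK; i++)   /* fadds s24, s8, s16     */
            bank3[i] = bank1[i] + bank2[i];
        for (int i = 0; i < VFP_SP_BANK; i++)   /* fstmias r0!, {s24-s31} */
            *sum++ = bank3[i];
    }
}
```

The double-precision variant would use `double` arrays and a bank size of 4.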

This is what vector-based programming is all about: carrying out as many operations as possible in a single machine instruction. Even if the internal architecture carries out the operations serially, it’s still fewer micro-ops per operation than discrete machine code instructions in scalar mode.

Using my earlier routine to read the 1MHz timer, the speed-up shows as slight but measurable. With three arrays of 6,291,456 single-precision members (two source arrays and one destination array), the GCC 4.8.1 version loaded, added, and stored the entire set in roughly 0.418 seconds, while the VFP-based vectorized assembly code took roughly 0.399 seconds. The biggest limitation is memory bandwidth; the VFP can add floating-point vectors as fast as RAM can supply them.

# A real-world example

The stride can be either “every register” (stride 1) or “every other register” (stride 2). One common real-world application is signal processing, which can involve operations with complex numbers (**a**+*bi*, where i is the square root of -1). Multiplying two complex numbers also yields a complex number. Algebraically,

$$(\mathbf{a_1} + b_1 i)(\mathbf{a_2} + b_2 i) = \mathbf{a_1 a_2 - b_1 b_2} + (a_1 b_2 + a_2 b_1)\,i$$

with the real parts in bold, and the imaginary parts in italics. For example, (**2**+*i*)(**2**+*3i*) becomes **2×2**–**3×1**+*(2×3+2×1)i*, or **1**+*8i*.
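The formula can be checked in a few lines of C. This is a plain scalar sketch with a hypothetical `cplx` struct, written out by hand rather than with `<complex.h>` so that the four multiplications stay visible.

```c
#include <assert.h>

typedef struct { float re, im; } cplx;   /* hypothetical type for this sketch */

/* (a1 + b1 i)(a2 + b2 i) = (a1 a2 - b1 b2) + (a1 b2 + a2 b1) i */
static cplx cplx_mul(cplx x, cplx y)
{
    cplx p;
    p.re = x.re * y.re - x.im * y.im;   /* real part: difference of products */
    p.im = x.re * y.im + y.re * x.im;   /* imaginary part: sum of cross-products */
    return p;
}
```

With `x = {2, 1}` and `y = {2, 3}`, `cplx_mul(x, y)` yields `{1, 8}`, matching the (2+i)(2+3i) example.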

Algorithmically, the real part is a multiplication (FMUL) followed by a multiplication/subtraction (FNMAC), and the imaginary part is a multiplication (FMUL) followed by a multiplication/addition (FMAC). The flexibility of the VFP vector mode means there’s no need for special extra instructions (unlike the x86 SSE!). So, if the coefficients $a_1$ and $b_1$ are in S8 and S9 (bank 1), and the coefficients $a_2$ and $b_2$ are in S16 and S17 (bank 2), their complex product can go into S24 and S25 (bank 3) with the following sequence:

    // real portion
    fmuls  s24, s8, s16
    fnmacs s24, s9, s17
    // imaginary portion
    fmuls  s25, s8, s17
    fmacs  s25, s9, s16

Four instructions to generate a complex product. Typical, by ARM standards. But here’s the beauty of vector mode in the VFP: since each single-precision bank can hold 8 values, it can also hold the coefficient pairs of 4 complex values. With stride set to 2 and length set to 4, that becomes four instructions to generate *four complex products!* This would not be possible with a stride of 1, every register, because the real and imaginary coefficients of the complex factors interact differently. The real part of the complex product is the *difference* of the products of the real and imaginary parts; the imaginary part of the complex product is the *sum* of the cross-products of the complex factors’ coefficients. It’s almost like the VFP’s “every other register” stride was made for complex value calculations.
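To see why stride 2 and length 4 turn those four instructions into four complex products, here is a small C model of the register walk. It is a simplified sketch: it assumes all operands sit in banks 1 to 3 (so the scalar-bank special cases are not modeled), and the names `sreg`, `vreg`, `vfp_vec`, and `complex_mul4` are invented for this illustration.

```c
#include <assert.h>

static float sreg[32];   /* models S0-S31; banks are S8-S15, S16-S23, S24-S31 */

/* Register reached after `step` strides from `base`, wrapping around
   inside the 8-register bank that contains `base`. */
static int vreg(int base, int step, int stride)
{
    int bank = base & ~7;
    return bank | ((base + step * stride) & 7);
}

typedef enum { OP_FMUL, OP_FMAC, OP_FNMAC } vfp_op;

/* Applies one arithmetic instruction across a short vector. */
static void vfp_vec(vfp_op op, int d, int n, int m, int length, int stride)
{
    assert(d >= 8 && d < 32 && n >= 8 && n < 32 && m >= 8 && m < 32);

    for (int i = 0; i < length; i++) {
        int rd = vreg(d, i, stride);
        int rn = vreg(n, i, stride);
        int rm = vreg(m, i, stride);
        float prod = sreg[rn] * sreg[rm];

        switch (op) {
        case OP_FMUL:  sreg[rd] = prod;             break;
        case OP_FMAC:  sreg[rd] = sreg[rd] + prod;  break;
        case OP_FNMAC: sreg[rd] = sreg[rd] - prod;  break;
        }
    }
}

/* Four complex products in four "instructions": stride 2, length 4. */
static void complex_mul4(void)
{
    vfp_vec(OP_FMUL,  24, 8, 16, 4, 2);   /* fmuls  s24, s8, s16 */
    vfp_vec(OP_FNMAC, 24, 9, 17, 4, 2);   /* fnmacs s24, s9, s17 */
    vfp_vec(OP_FMUL,  25, 8, 17, 4, 2);   /* fmuls  s25, s8, s17 */
    vfp_vec(OP_FMAC,  25, 9, 16, 4, 2);   /* fmacs  s25, s9, s16 */
}
```

Each call touches every other register, so the real parts land in S24, S26, S28, S30 and the imaginary parts in S25, S27, S29, S31.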

With single-precision coefficients, the assembly-language version takes roughly 24% less time than a highly optimized GCC-compiled version. With complex arrays of 3,145,728 elements, parallel multiplication of two arrays typically takes 0.476 seconds in a C version compiled by GCC 4.8.1, while the assembly-language version takes 0.361 seconds, a difference of 0.115 seconds.

With double-precision coefficients, the difference is much smaller. The assembly-language version shows about a 0.02 second improvement over arrays of 1,572,864 elements. Since the double-precision vector banks are only 4 members long, we can multiply only two complex values (2×2 coefficients) at a time. That is not to say the smaller benefit is never worth the development effort. It’s up to the developer to understand the needs of the end-user, and to create the program accordingly. In the right circumstances, a Raspberry Pi cluster could see a few minutes saved by using effective VFP vector-based code.

# Conclusion

Vector programming with the ARM VFP is a straightforward matter. Calculation instructions that work in scalar mode also work in vector mode, with only a few extra considerations to bear in mind. This is a considerable advantage over Intel x86 for assembly-language programmers; MMX added separate instructions, and SSE added even more instructions and an entirely new register set (XMM, later widened to YMM by AVX). Rather than adding a separate instruction set, the VFP architecture lets a programmer perform the same basic operations on multiple registers in parallel.

The code for these examples will be available shortly in Part 3: Appendix.

Thank you for this example. Here is a runnable setup of it with qemu-user and an assertion of the sum result: https://github.com/cirosantilli/arm-assembly-cheat/blob/ba19d7bf31dc35db2c6839766a94d3cdbe83847d/v7/vfp.S