In Part 1, I explained the design philosophy of vector calculations using the ARM Vector Floating Point system. This article will demonstrate the actual practice of using vectors in the ARM VFP.

# Enabling and disabling vector banks

The Floating-Point Status and Control Register (FPSCR) is shared between coprocessors 10 and 11, so any change to it applies to both single-precision and double-precision calculations using the VFP. It contains the following fields:

• Bits 31-28: Comparison result flags
• Bit 25: Not-a-Number default mode
• Bit 24: Flush-to-Zero mode
• Bits 23-22: Rounding mode
• Bits 21-20: Vector Stride
• Bits 18-16: Vector Length
• Bits 15, 12-8: Exception enabling
• Bits 7, 4-0: Exception “sticky” flags

All other bits (27, 26, 19, 14, 13, 6, 5) should not be modified by software.
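As a rough illustration, the vector-related fields can be pulled out of a raw FPSCR value in C. The macro and function names below are my own inventions for this sketch, not part of any ARM header; only the bit positions come from the list above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; only the bit positions come from the field list. */
#define FPSCR_STRIDE_SHIFT 20
#define FPSCR_LEN_SHIFT    16
#define FPSCR_RESERVED     0x0C086060u  /* bits 27, 26, 19, 14, 13, 6, 5 */

/* Effective vector length: the 3-bit field stores (length - 1). */
unsigned fpscr_vector_length(uint32_t fpscr)
{
    return ((fpscr >> FPSCR_LEN_SHIFT) & 7u) + 1u;
}

/* Effective stride: field 00 means 1, field 11 means 2.
   Fields 01 and 10 are undefined, so this sketch rejects them. */
unsigned fpscr_vector_stride(uint32_t fpscr)
{
    unsigned field = (fpscr >> FPSCR_STRIDE_SHIFT) & 3u;
    assert(field == 0u || field == 3u);
    return field == 0u ? 1u : 2u;
}
```

These helpers only decode a value that has already been read; they do not touch the hardware register.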

Normally, in “scalar mode,” the Length and Stride fields in the FPSCR are both 0, meaning Length is 1 and Stride is ignored. To enable the SIMD mode of the VFP, the Length must be greater than 1. The Stride can be either “consecutive registers” (field set to binary 00) or “every other register” (field set to binary 11). To stay within the bounds of defined behavior, the effective Length times the effective Stride should not exceed the bank size (4 double-precision or 8 single-precision values).
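The bank-size rule is easy to check before touching the FPSCR. Here is a minimal sketch in C; the function names are hypothetical, and the limits are simply the ones stated above.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when length * stride fits inside one register bank.
   is_double selects the bank size: 4 doubles or 8 singles. */
bool vfp_vector_fits(unsigned stride, unsigned length, bool is_double)
{
    unsigned bank_size = is_double ? 4u : 8u;

    if (stride != 1u && stride != 2u)
        return false;               /* only two strides are defined */
    if (length < 1u || length > 8u)
        return false;               /* the Length field is 3 bits */
    return length * stride <= bank_size;
}

/* Usage sketch: a few combinations checked against the rule. */
void vfp_vector_fits_examples(void)
{
    assert(vfp_vector_fits(1, 8, false));   /* 8 singles, consecutive */
    assert(vfp_vector_fits(2, 4, false));   /* 4 singles, every other */
    assert(!vfp_vector_fits(2, 8, false));  /* 16 > 8: overflows bank */
    assert(vfp_vector_fits(2, 2, true));    /* 2 doubles, every other */
}
```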

A formula in C to determine the bits set for Length and Stride, in their proper positions, is:

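```
bitfield_set = ((stride ^ 1) << 20) | ((length - 1) << 16);
```

In assembly, a “reconfig” routine applies the same adjustment to the live FPSCR:

```
// inputs: r0 = stride, r1 = length

reconfig:
push {r0-r2}
// adjust stride: 1 (01->00) or 2 (10->11)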
// stride field as 01 or 10 is undefined
eor r0, r0, #1
sub r1, r1, #1
and r1, r1, #7
// shift/accumulate them in r0
mov r0, r0, lsl #20
orr r0, r1, lsl #16
// get SysCtlReg
fmrx r2, fpscr
// clear stride and length fields
bic r2, #55*65536
// set them with the adjusted values
orr r2, r2, r0
// r2 now holds the new SysCtlReg value; write it back
fmxr fpscr, r2
// restore user registers
pop {r0-r2}
bx lr
```

The “reconfig” routine does no validation of its inputs; for legal inputs, it merely guarantees that only the Stride and Length fields of the FPSCR are modified. It is still possible to specify combinations of Length and Stride that overflow a register bank. It is also possible to pass “illegal” values through r0 and r1, which result in unexpected or undesired behavior. It is up to the programmer to set the Length and Stride correctly.
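For readers who think better in C, the read-modify-write that the routine is meant to perform can be modeled on a plain integer. This is only a sketch: the function name is hypothetical, nothing here touches the real register, and unlike the assembly it asserts that the inputs are legal.

```c
#include <assert.h>
#include <stdint.h>

/* Same mask as the bic instruction: bits 21-20 and 18-16. */
#define VEC_FIELDS_MASK (55u * 65536u)

/* Returns the FPSCR value with new Stride and Length fields,
   leaving every other bit as it was. */
uint32_t fpscr_with_vector(uint32_t fpscr, unsigned stride, unsigned length)
{
    /* Checks the assembly routine does not make. */
    assert(stride == 1u || stride == 2u);
    assert(length >= 1u && length <= 8u);

    uint32_t bits = ((uint32_t)(stride ^ 1u) << 20)
                  | ((uint32_t)(length - 1u) << 16);
    return (fpscr & ~VEC_FIELDS_MASK) | bits;
}
```

Passing stride 1 and length 1 clears both fields, which is the scalar-mode setting.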

To revert to “scalar mode,” simply reset the Length and Stride each to 1:

```
// inputs: (none)
// outputs: (none)
// modified: Floating-Point Status and Control Register (FPSCR)
// clears the stride and length fields of the FPSCR,
// returning it to "scalar mode" (no vectors). This is
// what the GNU EABI expects.

deconfig:
push {r0,r1,lr}
// set stride and length to 1
mov r0, #1
mov r1, #1
// and call "reconfig"
bl reconfig
// then return
pop {r0,r1,pc}
```

Of course, if the “reconfig” routine works correctly, then “deconfig” is guaranteed to work correctly.

The GNU EABI does not expect anything but scalar mode for ARM VFP instructions, and any program enabling a vector mode for the VFP must take this into account. The easiest approach is to disable vector mode before either calling or returning to GCC-generated code.

# A simple, stupid, totally brain-dead example

For a basic introduction, let’s add two parallel arrays of single-precision values together, and store the results into a third array. Here’s the design in C:

```
void doit(float *sum, float *a1, float *a2, int length) {
int i;

for (i=0; i < length; i++)
*(sum+i) = *(a1+i) + *(a2+i);
}
```

With ‘-O3 -march=native -ftree-vectorize -funroll-all-loops -mfpu=vfp’ options, GCC does a decent job of compiling this function, creating an interesting specimen of Duff’s device. But we can do somewhat better.

Step 0: Restrict the array length to be a multiple of the maximum VFP vector length. For single-precision vectors, that is 8 registers. This is optional, but it simplifies this example greatly.

Step 1: Define the parallel operation. In this case, it’s addition. The assembly code will look something like “FADDS dest, source1, source2”. The instruction operands specify the base register for each vector.

Step 2: Set up the loop for operand fetch, vector operation, and result store. The ARM EABI passes C function arguments in r0, r1, r2, r3, and the stack, in that order. This means that r0 is the pointer to the array of results, r1 and r2 are the pointers to the source arrays, and r3 contains the length of the arrays. In step 0, this was restricted to a multiple of 8 for single-precision arrays.

The net result of these steps becomes something like this:

```
doit:
push {r0,r1,r4,lr}
// first thing to do is change the
// Stride and Length fields in the
// Floating-Point Status and Control Register (FPSCR).
// Linux keeps these as "pure scalar,"
// so that all operations specify
// single registers. We want to enable
// vector mode, to add 8 pairs of floats
// at a time.
push {r0,r1}
// stride 1
mov r0, #1
// vector length 8
mov r1, #8
bl reconfig
pop {r0,r1}
// divide count by 8, since we're doing 8 at a time
asr r3, #3

.loop:
// fetch 8 floats from source array 1
fldmias r1!,{s8-s15}
// and 8 floats from source array 2
fldmias r2!,{s16-s23}
// here's where the magic happens:
// s[24..31] = s[8..15] + s[16..23]
fadds s24, s8, s16
// and store the result in the destination array
fstmias r0!,{s24-s31}
// one less block
subs r3, r3, #1
// loop if we're not done
bne .loop

// restore the FPSCR
bl deconfig
// and we're done
pop {r0,r1,r4,pc}
```

Did you see that? Loading 8 floats, loading 8 more floats, then adding them together in parallel, then storing 8 floats, repeating to the ends of the arrays. The double-precision version is nearly the same, except that the vector length is 4, and therefore the block count is shifted right by 2 (4 = 2²) rather than 3 (8 = 2³).
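A plain C model of the same blocking may help when checking the assembly's results. This is a sketch, not a translation: the function name is hypothetical, and the trailing scalar loop is my own addition for lengths that are not a multiple of 8, which the assembly above does not handle (a length of 0 would also send its subs/bne loop around far too many times).

```c
#include <assert.h>
#include <stddef.h>

enum { VFP_BLOCK = 8 };  /* single-precision vector length used above */

void add_arrays_blocked(float *sum, const float *a1, const float *a2,
                        size_t length)
{
    assert(length == 0 || (sum && a1 && a2));

    size_t blocks = length / VFP_BLOCK;  /* the asr #3 in the assembly */
    size_t i = 0;

    /* Each outer pass corresponds to one load/load/fadds/store group. */
    for (size_t b = 0; b < blocks; b++)
        for (size_t k = 0; k < VFP_BLOCK; k++, i++)
            sum[i] = a1[i] + a2[i];

    /* Scalar tail: an assumption of this sketch, not in the assembly. */
    for (; i < length; i++)
        sum[i] = a1[i] + a2[i];
}
```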

This is what vector-based programming is all about: carrying out as many operations as possible in a single machine instruction. Even if the internal architecture carries out the operations serially, it’s still fewer micro-ops per operation than discrete machine code instructions in scalar mode.

Using my earlier routine to read the 1 MHz timer, the speed-up is slight but measurable, about 4.5%. With three arrays of 6,291,456 single-precision members (two source arrays and one destination array), the GCC 4.8.1 version loaded, added, and stored the entire set in roughly 0.418 seconds, while the VFP-based vectorized assembly code took roughly 0.399 seconds. The biggest limitation is memory bandwidth; the VFP can add floating-point vectors as fast as RAM can supply them.
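To put the memory-bandwidth point in numbers, here is a back-of-the-envelope helper. It assumes every element crosses the memory bus exactly once (two arrays read, one written) and ignores caching entirely; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes moved per second for "sum = a1 + a2" over three arrays. */
double effective_bandwidth(size_t elements, size_t element_size,
                           double seconds)
{
    assert(seconds > 0.0);
    return 3.0 * (double)elements * (double)element_size / seconds;
}
```

With the figures above, `effective_bandwidth(6291456, sizeof(float), 0.399)` comes to roughly 189 million bytes per second, against roughly 181 million for the 0.418-second GCC run.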

# A real-world example

The stride can be either “every register” (stride 1) or “every other register” (stride 2). One common real-world application is signal processing, which can involve operations with complex numbers (a+bi, where i is the square root of -1). Multiplying two complex numbers also yields a complex number. Algebraically,

(a1 + b1i)(a2 + b2i) = (a1a2 − b1b2) + (a1b2 + a2b1)i

where a1a2 − b1b2 is the real part and a1b2 + a2b1 is the imaginary part. For example, (2+i)(2+3i) becomes 2×2 − 3×1 + (2×3 + 2×1)i, or 1+8i.

Algorithmically, the real part is a multiplication (FMUL) followed by a multiplication/subtraction (FNMAC), and the imaginary part is a multiplication (FMUL) followed by a multiplication/addition (FMAC). The flexibility of the VFP vector mode means there’s no need for special extra instructions (unlike the x86 SSE!). So, if the coefficients a1 and b1 are in S8 and S9 (bank 1), and the coefficients a2 and b2 are in S16 and S17 (bank 2), their complex product can go into S24 and S25 (bank 3) with the following sequence:

```
// real portion
fmuls s24, s8, s16
fnmacs s24, s9, s17
// imaginary portion
fmuls s25, s8, s17
fmacs s25, s9, s16
```
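The same four steps, written in C with one statement per instruction, make a handy reference when checking results. The `cfloat` type and function names are hypothetical, chosen for this sketch only.

```c
#include <assert.h>

typedef struct { float re, im; } cfloat;  /* hypothetical helper type */

/* Complex product, one statement per VFP instruction above. */
cfloat cmul(cfloat x, cfloat y)
{
    cfloat p;
    p.re  = x.re * y.re;   /* fmuls  s24, s8, s16 */
    p.re -= x.im * y.im;   /* fnmacs s24, s9, s17 */
    p.im  = x.re * y.im;   /* fmuls  s25, s8, s17 */
    p.im += x.im * y.re;   /* fmacs  s25, s9, s16 */
    return p;
}

/* Usage sketch: the worked example, (2+i)(2+3i) = 1+8i. */
void cmul_example(void)
{
    cfloat x = { 2.0f, 1.0f };
    cfloat y = { 2.0f, 3.0f };
    cfloat p = cmul(x, y);
    assert(p.re == 1.0f && p.im == 8.0f);
}
```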

Four instructions to generate a complex product. Typical, by ARM standards. But here’s the beauty of vector mode in the VFP: since each single-precision bank can hold 8 values, it can also hold the coefficient pairs of 4 complex values. With stride set to 2 and length set to 4, that becomes four instructions to generate four complex products! This would not be possible with a stride of 1, every register, because the real and imaginary coefficients of the complex factors interact differently. The real part of the complex product is the difference of the products of the real and imaginary parts; the imaginary part of the complex product is the sum of the cross-products of the complex factors’ coefficients. It’s almost like the VFP’s “every other register” stride was made for complex value calculations.
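To see why stride 2 lines things up, it helps to model the register walk in C. This is a deliberately simplified sketch: it treats the register file as an array of 32 floats, ignores the special scalar treatment of bank 0, and uses names of my own choosing. Each operand starts at the register named in the instruction and advances by the stride, wrapping inside its 8-register bank.

```c
#include <assert.h>

enum { BANK = 8 };  /* single-precision registers per bank */

/* Register visited at step i, starting from `first`. */
static unsigned vreg(unsigned first, unsigned i, unsigned stride)
{
    unsigned base = first & ~(unsigned)(BANK - 1);
    return base + ((first - base + i * stride) % BANK);
}

typedef enum { OP_MUL, OP_MAC, OP_NMAC } vop;

/* Apply one instruction across `length` registers of the model s[32]. */
void vfp_vector_op(float s[32], vop op, unsigned d, unsigned n, unsigned m,
                   unsigned length, unsigned stride)
{
    assert(length * stride <= BANK);
    for (unsigned i = 0; i < length; i++) {
        float prod = s[vreg(n, i, stride)] * s[vreg(m, i, stride)];
        float *dst = &s[vreg(d, i, stride)];
        if (op == OP_MUL)
            *dst = prod;
        else if (op == OP_MAC)
            *dst += prod;
        else
            *dst -= prod;
    }
}

/* The four-instruction sequence with length 4 and stride 2. */
void complex_mul4(float s[32])
{
    vfp_vector_op(s, OP_MUL,  24, 8, 16, 4, 2);  /* fmuls  s24, s8, s16 */
    vfp_vector_op(s, OP_NMAC, 24, 9, 17, 4, 2);  /* fnmacs s24, s9, s17 */
    vfp_vector_op(s, OP_MUL,  25, 8, 17, 4, 2);  /* fmuls  s25, s8, s17 */
    vfp_vector_op(s, OP_MAC,  25, 9, 16, 4, 2);  /* fmacs  s25, s9, s16 */
}
```

In this model the real parts land in s24, s26, s28, and s30, and the imaginary parts in s25, s27, s29, and s31, so the result bank holds four interleaved complex products.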

With single-precision coefficients, the assembly-language version takes roughly 24% less time than a highly optimized GCC-compiled version. With complex arrays of 3,145,728 elements, parallel multiplication of two arrays typically takes 0.476 seconds in a C version compiled by GCC 4.8.1, while the assembly-language version takes 0.361 seconds, a difference of 0.115 seconds.

With double-precision coefficients, the difference is much smaller. The assembly-language version shows about a 0.02 second improvement over arrays of 1,572,864 elements. Since the double-precision vector banks hold only 4 values, each bank holds the coefficient pairs of only 2 complex values at a time. That is not to say the smaller gain is never worth the development effort. It’s up to the developer to understand the needs of the end user and to write the program accordingly. In the right circumstances, a Raspberry Pi cluster could see a few minutes saved by using effective VFP vector-based code.

# Conclusion

Vector programming with the ARM VFP is a straightforward matter. Calculation instructions that work in scalar mode also work in vector mode, with only a few extra considerations to bear in mind. This is a considerable advantage over Intel x86 for assembly-language programmers; MMX added separate instructions, and SSE added even more instructions and an entirely new register set (XMM, later widened to YMM by AVX). Rather than adding a separate instruction set, the VFP architecture lets a programmer perform the same basic operations on multiple registers in parallel.

The code for these examples will be available shortly in Part 3: Appendix.