Comparing the x87 and ARM VFP
The first generation of x86 processors, the 8086 and 8088, had support for synchronous asymmetric multiprocessing. Synchronous, in that the main processor could wait for a READY/#BUSY or ERR signal from the coprocessor; and asymmetric, because the coprocessor’s instructions were completely different from the main processor’s. The coprocessor could not address memory, but had to wait for the main processor to decode an instruction’s addressing mode and then “open a window” into RAM for the coprocessor. The coprocessor could then read from or write to the data lines as required. For this reason, coprocessor instructions are always followed by at least one byte (ModR/M, in Intel terminology) indicating the addressing mode of the instruction. Coprocessor instructions operating on internal state only, with no memory access, could be carried out in parallel with main processor instructions, and the addressing mode byte indicates no memory access.
The coprocessor most often connected to the 8086 was the 8087 floating-point math unit. Later generations saw the 80x87 unit integrated with the main processor, at first by moving the chip logic onto the same die, but still using the same x86/x87 interconnect. With the Pentium series, the x87 was fully integrated into the superscalar design. The “coprocessor instruction space” became the “x87 instruction space.”
The ARM architecture has always had support for operations on up to 16 coprocessors. However, not every operation now handled by a coprocessor has always been handled that way. For example, memory management is now handled by coprocessor 15, but the Acorn Archimedes used a separate MEMC memory controller chip. The MEMC was managed by asserting values on the address bus, that is, by reading or writing a particular location. If address bits 25-17 contained 11011x111 in binary, then the MEMC interpreted the overall address as a re-configuration command for itself. (Interesting note: The binary sequence 11011 in the high bits is also how the 80x86 denotes a coprocessor instruction.)
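As a rough illustration of that address test, here is a minimal Python sketch. It takes the bit pattern quoted above at face value; the function and constant names are made up for this example and do not come from any MEMC documentation.

```python
# Pattern for address bits 25-17 as quoted in the text: 11011x111.
# The "x" (don't-care) position is bit 3 of the 9-bit field.
MEMC_MASK = 0b111110111      # every bit except the don't-care one
MEMC_PATTERN = 0b110110111   # required values of the masked bits


def is_memc_command(address):
    """Return True if address bits 25-17 match 11011x111."""
    field = (address >> 17) & 0x1FF   # extract the 9-bit field
    return (field & MEMC_MASK) == MEMC_PATTERN


print(is_memc_command(0x36E0000))  # bits 25-17 are 110110111
print(is_memc_command(0x3600000))  # bits 25-17 are 110110000
```

The mask-and-compare form keeps the don't-care bit explicit, so the same two constants describe the whole family of matching addresses.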
Looking specifically at floating-point coprocessors, the x87 uses an extended-precision format (10 bytes) internally; handles logarithmic and trigonometric functions on-chip; and was strongly stack-based, though no longer so. The ARM vector floating-point (VFP) unit on the Raspberry Pi uses 4 or 8 bytes internally, does not support logarithms or trig functions in hardware, and supports the 3-register paradigm common to ARM (two source operands and a separate destination).
Let’s take a look at adding two floating-point values.
How it’s done (in the software)
The x87 opcode to add the top two values on the FPU stack is DE C1, and the mnemonic is FADDP. The instruction decomposes thus:
- The highest 5 bits of the first byte are 11011. That means this is a coprocessor instruction.
- The low 3 bits of the first byte are 110, so the instruction will be basic math operations.
- The high two bits of the next byte are 11, indicating internal registers only. This means the next 3 bits are also part of the instruction. They are 000, so the operation is “add the value at the top of the stack (register 0) to the value in register N, store the result in register N, then pop register 0 from the stack.” The position N is indicated in the remaining 3 bits, in this case 001, register 1.
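FADDP, with no arguments, is thus the same as FADDP ST(1),ST(0) in Intel operand order (destination first): ST(0) is added to ST(1), the result is stored in ST(1), and the stack is popped.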
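The bit-by-bit breakdown of DE C1 can be checked with a few shifts and masks. The following Python sketch only covers the register-only form described above; the function name and the dictionary keys are labels invented for this example, not Intel terminology.

```python
def decode_x87_register_form(byte1, byte2):
    """Split a two-byte x87 instruction into the fields described in the text."""
    return {
        "escape": byte1 >> 3,         # top 5 bits; 0b11011 marks a coprocessor instruction
        "op_group": byte1 & 0b111,    # low 3 bits of the first byte
        "mod": byte2 >> 6,            # top 2 bits; 0b11 means internal registers only
        "op": (byte2 >> 3) & 0b111,   # middle 3 bits, part of the opcode when mod is 0b11
        "st_index": byte2 & 0b111,    # low 3 bits: stack register N
    }


fields = decode_x87_register_form(0xDE, 0xC1)
for name, value in fields.items():
    print(f"{name:9s} = {value:b}")
```

For DE C1 this gives escape 11011, op_group 110, mod 11, op 0, and st_index 1, matching the three bullet points.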
The ARM VFP takes a similar approach, but the instruction encoding is just a generic coprocessor instruction. The opcode to add D0 and D1, storing the result in D2, is EE 30 2B 01 (written most significant byte first), and the mnemonic is FADDD D2,D0,D1 (or VADD.F64 D2,D0,D1 in the unified syntax accepted by the GNU assembler). Decomposing this instruction:
- The first E means “always execute” in the ARM instruction predication scheme.
- The second E means this is a coprocessor instruction. The high nybble of the last byte is even, meaning its bottom bit is 0. That means this is an internal operation, requiring no memory access to carry out.
- The B indicates this is an instruction for coprocessor #11, the double-precision floating-point processor.
- The bottom nybbles of the second and fourth bytes contain the two source registers, D0 and D1.
- The top nybble of the third byte contains the destination register, D2.
However, this leaves seven bits unaccounted for, the top nybble of the second byte and the top 3 bits of the fourth byte. These are the actual opcodes to be interpreted by the coprocessor. Thus, the generic coprocessor instruction received by the ARM processor is:
    CDP p11, 3, c2, c0, c1, 0
in which the parameters are coprocessor 11, primary opcode 3, destination register 2, first source register 0, second source register 1, and secondary opcode 0, respectively. When the ARM receives this instruction, it merely passes the parameters to coprocessor 11, which then acknowledges the instruction and carries it out.
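To make the field positions concrete, here is a small Python sketch that pulls the same parameters out of the word EE 30 2B 01 and prints them in generic coprocessor form. The helper names are hypothetical, and the sketch assumes the bytes are written most significant first, as they are in the text.

```python
def decode_cdp(word):
    """Extract the generic coprocessor fields from a 32-bit ARM instruction word."""
    return {
        "cond": (word >> 28) & 0xF,                # 0xE means "always execute"
        "is_coproc": ((word >> 24) & 0xF) == 0xE,  # second nybble E: coprocessor instruction
        "opc1": (word >> 20) & 0xF,                # primary opcode
        "crn": (word >> 16) & 0xF,                 # first source register
        "crd": (word >> 12) & 0xF,                 # destination register
        "coproc": (word >> 8) & 0xF,               # coprocessor number
        "opc2": (word >> 5) & 0x7,                 # secondary opcode
        "is_data_op": ((word >> 4) & 1) == 0,      # bit 4 clear: internal operation
        "crm": word & 0xF,                         # second source register
    }


def format_cdp(f):
    """Render decoded fields in CDP assembler form."""
    return "CDP p{coproc}, {opc1}, c{crd}, c{crn}, c{crm}, {opc2}".format(**f)


word = int.from_bytes(bytes.fromhex("EE302B01"), "big")
print(format_cdp(decode_cdp(word)))  # prints: CDP p11, 3, c2, c0, c1, 0
```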
ARM VFP peculiarities
Just as the MMX registers use the same physical space as the x87 floating-point registers, the VFP single-precision registers overlay the double-precision registers. However, where using both x87 and MMX code requires mode-switching instructions between the two instruction sets, the VFP can operate on single-precision and double-precision registers simultaneously. The actual physical registers are shared between coprocessors 10 and 11, the single- and double-precision FPUs. D0 contains S0 and S1, D1 contains S2 and S3, and so on, to D15, which contains S30 and S31.
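The overlay rule is simple arithmetic: Dn shares its storage with S(2n) and S(2n+1). A minimal Python sketch of that mapping, with function names made up for this example:

```python
def s_registers_in(d):
    """Return the two single-precision register numbers overlaid on D<d>."""
    if not 0 <= d <= 15:
        raise ValueError("D register number must be in 0..15")
    return (2 * d, 2 * d + 1)


def overlaps(d, s):
    """Return True if S<s> shares storage with D<d>."""
    if not 0 <= s <= 31:
        raise ValueError("S register number must be in 0..31")
    return s in s_registers_in(d)


print(s_registers_in(15))              # prints: (30, 31)
print(overlaps(2, 5), overlaps(3, 5))  # prints: True False
```

A check like `overlaps` is the sort of test a register allocator would need before mixing single- and double-precision work on the same register file.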
Register operands in the instruction occupy the same slots, so sending the CDP instruction above to coprocessor 10 instead of 11 would be the equivalent of FADDS S4,S0,S2, that is, using the lower (least-significant) halves of D2, D0, and D1. To access the upper halves of the D registers, coprocessor 10 interprets three further bits in the instruction as the operand registers’ lowest bits, allowing access to the odd-numbered Sn registers. So, to execute the instruction FADDS S5,S1,S3, the instruction bits work out thus:
- The destination register’s lowest bit comes from instruction bit 22, adding 4 to the primary opcode.
- The first source register’s lowest bit comes from instruction bit 7, adding 4 to the secondary opcode.
- The second source register’s lowest bit comes from instruction bit 5, adding 1 to the secondary opcode.
So the following two mnemonics are really the same instruction:
    FADDS S5,S1,S3
    CDP p10, 7, c2, c0, c1, 5
And, while coprocessor 10 is adding these two single-precision numbers, coprocessor 11 can be adding two more double-precision numbers from other, non-conflicting registers. D0, D1, and D2 would be unsafe to use, because S1, S3, and S5 occupy the same space. D3 through D15 would still be available.
Note that the somewhat contrived examples here do not reflect any restrictions in programming. I used all even-numbered registers, or all odd-numbered registers, to illustrate the addressing scheme in the ARM VFP instruction encoding. There is no such restriction in real VFP programming.
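To show that any mix of even and odd registers fits the same scheme, here is a hedged Python sketch of an encoder for the single-precision add. It follows only the bit positions described in this article (upper four bits of each register number in the usual operand slots, lowest bit in instruction bits 22, 7, and 5); `encode_fadds` is a name invented for this example, not part of any assembler.

```python
def encode_fadds(sd, sn, sm):
    """Encode FADDS S<sd>,S<sn>,S<sm> as a 32-bit ARM word for coprocessor 10."""
    for reg in (sd, sn, sm):
        if not 0 <= reg <= 31:
            raise ValueError("S register number must be in 0..31")
    word = 0xE << 28           # condition: always execute
    word |= 0xE << 24          # coprocessor data operation
    word |= 0x3 << 20          # primary opcode 3 (add)
    word |= (sd & 1) << 22     # destination's lowest bit
    word |= (sn >> 1) << 16    # first source, upper four bits
    word |= (sd >> 1) << 12    # destination, upper four bits
    word |= 10 << 8            # coprocessor 10: single precision
    word |= (sn & 1) << 7      # first source's lowest bit
    word |= (sm & 1) << 5      # second source's lowest bit
    word |= sm >> 1            # second source, upper four bits
    return word


print(f"{encode_fadds(4, 0, 2):08X}")   # prints: EE302A01
print(f"{encode_fadds(5, 1, 3):08X}")   # prints: EE702AA1
print(f"{encode_fadds(3, 8, 15):08X}")  # prints: EE741A27
```

In the second word, the primary opcode field reads 7 and the secondary opcode field reads 5, as the bullet points predict. The third call mixes odd and even registers freely.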
While the x86 coprocessor space is now completely given over to the FPU, quite the opposite has happened in the ARM architecture. Coprocessors handle several auxiliary tasks, or tasks that must be handled via dedicated hardware, such as single- and double-precision math (coprocessors 10 and 11), system control, including the MMU (coprocessor 15), and hardware debugging support (coprocessor 14). These are merely the architecture-defined coprocessors; implementors may choose to add others as they see fit, without impinging on the ARM instruction space.