Smart Structures, Dumb Code
“Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; it’ll be obvious.”—Fred Brooks, The Mythical Man-month, chapter 11
“Smart data structures and dumb code works a lot better than the other way around.”—Eric Raymond, The Cathedral and the Bazaar, chapter 5
My experience concurs these two identical observations, so I decided to re-work my assembly-language Collatz program to use smarter data and somewhat dumber code.
The basic Collatz sequence definition goes something like this:
- If N is 1, exit.
- If N is odd, multiply it by 3 and add 1.
- If N is even, divide it by two.
- Go back to the first step.
A number’s Collatz value is how many times an underlined step was carried out.
One optimization for this is to note that N may be even multiple consecutive times, but it will never be odd twice in a row. Take the sequence 68, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1. Any odd number, times 3, plus 1, always gives an even result. Code flow can be optimized based on this.
Still, this results in code with multiple conditional branches. Modern CPU’s support branch probability hints, to help their branch prediction circuitry, but that isn’t enough. Might it be possible to adjust the code flow, so that the only conditional branch is the exit condition? Yes, using the principle of predication.
CPU’s originally had very limited data testing ability. Any mathematical operation could alter the program status flags; if the program designer wanted to act on the results of an operation, but not immediately, the state of the flags had to be stored somewhere in memory and retrieved later. This is why the typical translation of “if A is true, then do B” is to turn it into “if A is false, skip beyond B”.
Among x86-based CPU’s, the set of testable status flags has always been Zero, Carry, Parity/Overflow, and Sign. These haven’t changed since the first IBM PC came out in 1981. The first Intel CPU to support arbitrary predication was the Itanium, with 63 predicates, plus one that always reads as true, for unconditional operations. That way, you can say something like this:
- test if N is evenly divisible by 2, and store the result in P5
- store the opposite of P5 in P6
- if P6 is set, multiply N by 3
- if P6 is set, add 1 to N
- if P5 is set, divide N by 2
The Itanium will simply skip the instructions that don’t apply for a given value of N, with no need for a branch after the initial test. How can this design be back-ported to an x86 CPU that doesn’t have arbitrary predication?
From the Pentium Pro onward, the x86 CPU’s have a set of conditional move instructions, to facilitate simple data movement based on program status flags. “If the Zero flag is set, load the multiplier into EAX” is a perfectly legal instruction, provided the multiplier value isn’t provided in the instruction (it must come from data memory or another register). In x86 assembly language, it would be
CMOV ESI, multiplier. A good design decision by Intel is to leave program status flags unaffected by CMOV instructions, so that several operands can be loaded into place in preparation for calculations.
The Collatz sequence, having such disjoint calculations, can be re-worked into a single sequence with carefully chosen operands:
next N = (current N * A + B) / 2C
where A, B, and C are
- 1, 0, and 1 if current N is even
- 3, 1, and 0 if current N is odd
This way, if N is even, A and B have no effect; if N is odd, 2C reduces to 1, and has no effect. Actually, because the division is 2-based, it’s faster to shift N to the right by C places, 0 (to divide by 1) or 1 (to divide by 2).
So why jump through all these hoops? It’s the old adage, “Because I can.” This may have been another learning experience in my programming journeys, but my testing shows it to be a slight loss. Even with very tight coding in assembly language, the CPU-based multiplication is simply more expensive than figuring 2N+N (copy, shift one left, add them). On the register-starved x86-32, the 2N+N approach would force memory writes and reads, surely killing any slight performance gained from using a predicated operand selection approach. The slightly less register-starved x86-64 can store the constants in the “high” registers R8-R15, but the multiplication still can’t keep up with 2N+N.
I think the phrase that applies here is, “too smart by half.”