Optimizing Cache Usage
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." (Donald Knuth)
In an earlier article, I praised delayed library binding as a way to relieve memory pressure. Now I will show how accessing data sooner rather than later can make a program run faster. My example is the famous Conway's Game of Life.
The traditional rules are fairly simple. The game is played on a large two-dimensional grid of “cells,” which are either “alive” or “dead.” The cells’ status for the next generation is determined using a set of rules, usually the following:
- Cells with fewer than 2 or more than 3 living neighbors will not be alive.
- Living cells with 2 living neighbors will remain alive, while dead cells with 2 living neighbors will remain dead.
- Cells with exactly 3 living neighbors will remain alive or, if dead, "come to life."
A "neighbor" is defined as any cell one unit away on either or both axes. For example, a cell with four living neighbors will die of "overcrowding" in the next generation. On a bounded grid, corner cells have three neighbors, edge cells have five, and all remaining cells have eight.
Altering either the rules or the starting conditions creates a different “environment” for this simulated life. This formed the basis of cellular automata theory.
The implementation traditionally follows a simple two-stage approach:
- Count and store each cell’s living neighbors.
- Apply each cell’s status of “living” or “dead” for the next generation.
This is sufficient for teaching programming techniques, but it doesn't take modern CPU design into account. The net result of this approach is that each cell's in-memory data is fetched into the data cache (Dcache) at least twice, possibly more: once for neighbor counting, and again for determining life or death in the next generation. Data fetched at the beginning of the first stage is written back to memory and ejected from the cache before the scan finishes, in order to make room for more data. When the second stage begins, the cell data must be fetched all over again.
Every fetch into the Dcache can slow down program execution, and Conway's Game of Life on a large grid runs amazingly slowly. On my Linux-based 2.0GHz AMD Athlon with 768MB of memory, a 1,000×1,000 grid uses no swap, yet still requires 81.54 CPU seconds to run 1,000 generations.
One way to reduce Dcache fetches is to act on the data as soon as all dependencies are satisfied. In order to expose the data dependencies, we can combine both stages into a single scan through the cell array. As we scan, we do two things:
- All living cells’ neighbors get their neighbor counts incremented by 1.
- After each cell’s last neighbor is checked, its status for the next generation is determined and its neighbor count is reset to 0.
The “last neighbor” is determined by the row and column scanning order. In a left-to-right, bottom-to-top approach, a cell’s last neighbor is above and to the right. That is, after checking a cell at (X,Y), we can decide if the cell at (X-1,Y-1) will be alive or dead on the next generation. The effect on execution speed is tremendous. For large areas of dead cells, only one neighbor must be fetched into Dcache. If the system’s Dcache will hold three rows of cell data, then writeback and ejection of each cell occurs only once per iteration. The increased efficiency becomes even more obvious when cell data must be fetched from swap space on disk.
This new approach creates very real “edge cases” and “corner cases.” In order to simplify the scanning, I added a one-cell buffer around the whole array, so that all active cells would have the full complement of neighbor cells. This means, for an N by N grid, the buffer is in rows 0 and N+1, and in columns 0 and N+1. Iteration starts at (1,1) and ends at (N,N). These buffer cells will never become alive. For row 1 and column 1, the second part of the analysis involves only resetting the neighbor count to 0. Column N can be handled at the end of each row’s analysis. A separate loop is necessary to guarantee that row N will be properly analyzed. The pseudo-code looks something like this:
For each row x, from 1 to N:
    For each column y, from 1 to N:
        If cell(x,y) is alive,
            Add 1 to all its neighbors' "living neighbor" counts.
        If x > 1 and y > 1,
            Determine if cell(x-1,y-1) should be alive or dead.
            Set "living neighbor" count for cell(x-1,y-1) to 0.
    Repeat the last 2 steps for the last cell of the row:
        Determine if cell(x-1,N) should be alive or dead.
        Set "living neighbor" count for cell(x-1,N) to 0.
Again, repeat the last 2 steps for the last row:
    For each column y, from 1 to N:
        Determine if cell(N,y-1) should be alive or dead.
        Set "living neighbor" counts for cells (N,y-1) and (N+1,y-1) to 0.
The last cell is always border, ergo empty:
    Set "living neighbor" count for cell(N+1,N+1) to 0.
This looks like more work, and for the programmer, it is. The performance benefits far outweigh that cost, however: on my system, memory fetches are reduced by 50%, and CPU time is reduced even more. The same simulation that took 81.54 seconds now finishes in 33.65 seconds!
"Premature optimization" may be evil, but serious design attention to the flow of data through a program's execution will pay great dividends later on.