Finding the Fastest Filesystem
What follows is based on my observations. My focus is the relative performance of different filesystems, not the raw benchmark numbers of my hardware. For this reason, I have not included any specific model numbers of the hardware.
Part of my “economic stimulus check” went to a 500GB SATA drive. My original intention was to buy two of them, so I could claim, “over a terabyte of disk space!”. Alas, I got a little ahead of myself; my system had only one open hard drive bay. With a slightly bruised ego, I returned the unopened second hard drive and began to ponder how to exploit my super-roomy disk space. I quickly settled on one goal: find the fastest journaling filesystem (FS) for my SLAMD64 dual-core computer, with 2G of memory. My testing focused on three main areas: filesystem, disk I/O scheduler, and CPU speed.
So that others may run similar tests on their own systems, I have provided a gzipped tarball (CAPTCHA and 30-second delay for free download) containing the scripts and my own test results.
Frankly, the final results stunned me.
My first round of tests was a home-brew hack involving Slackware’s package management suite, distributed via threads across several directories, but I found the results too difficult to interpret. I also tried bonnie++, but test after test turned up no clear winner on anything. Part of the problem, I think, is that most of the testing in bonnie++ runs in the context of a single thread, leaving each test CPU-bound.
After presenting some preliminary results to a discussion board, I learned about irqbalance, which distributes the IRQ load evenly across multiple processors while keeping a particular IRQ’s handling on a single CPU or core as much as possible. On general principles, I installed it immediately.
A little more research reminded me of dbench, the fantastic suite by Andrew Tridgell of Samba fame. His general goal with dbench is to exercise and time all of the basic file operations: create, write, seek, read, rename, lock, unlock, stat, and delete. In addition, the design of dbench specifically includes multi-threading; if a system is designed with multi-processing in mind, dbench will be able to demonstrate its advantage. The results I report here are from dbench, because each run showed a clear winner, by a substantial margin.
As an aside, another favorable point of dbench over bonnie++ is that dbench determines its run length by clock time, rather than number of operations completed. This gives a more deterministic approach to filesystem benchmarking, something I prefer. I’d rather provide a test script that runs for 5 minutes on all systems, than a test that operates on 10,000 files in 30 seconds here, and 20 minutes there.
FINDING A STARTING POINT
My first step (while still stuck in bonnie++-land) was to find a set of FS options that provided reasonably good performance. One option that stood out for every filesystem was an external journal on a different controller. By isolating main-partition I/O from journal I/O as much as possible, an SMP system can drive both at once. This helps all journaling filesystems.
With that in mind, I used /dev/sdb5 for the main partition, with /dev/sda2 for the journal.
The following are the commands and options I used:
ext3: mke2fs -O journal_dev /dev/sda2 # first step: create the external journal
mke2fs -j -J device=/dev/sda2 /dev/sdb5 # second step: create the filesystem itself
JFS: jfs_mkfs -j /dev/sda2 /dev/sdb5
XFS: mkfs.xfs -l logdev=/dev/sda2,lazy-count=1,size=64m -d agcount=8 -f /dev/sdb5
Three things to note:
1. mke2fs looks for /etc/mke2fs.conf, which may contain additional ext2/ext3 options. My system specifies sparse_super, filetype, resize_inode, dir_index, and ext_attr, with a 4K block size and a 128-byte inode size.
2. jfs_mkfs has surprisingly few documented options.
3. mkfs.xfs has many options. The two mandatory options I needed were “logdev=” for the journal (logging) device, and “size=64m” to clarify that only 64 megs (the maximum) of the 2G partition would go to the journal. The other options are “lazy-count=1”, which eliminates a serialization point in the superblock, and “agcount=8”, meaning 8 allocation groups in the main data partition.
After creating each filesystem, I needed to mount it to /tester. I always passed “-o noatime” to each mount, so that reads would not trigger access-time writes back to the disk.
For ext3, I also specified “data=writeback”, which greatly increased overall performance. This option is explained in the Linux kernel documentation as well as the mount(8) man page.
For XFS, I added “logdev=/dev/sda2,logbufs=8,logbsize=262144”. One drawback of XFS is that the external journal device must always be specified; the superblock records only whether the journal is “internal” or “external”, not its location. (I will follow up on this in a later article.) I specified 8 logging buffers, to match the number of allocation groups in the filesystem, and gave as much RAM to each log buffer as possible (262,144 bytes).
As with jfs_mkfs, JFS has a dearth of mount options.
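Putting those options together, the three mounts look like this (same devices and /tester mount point as above; run as root):

```shell
# ext3: noatime plus the faster writeback journaling mode
mount -t ext3 -o noatime,data=writeback /dev/sdb5 /tester

# JFS: noatime is about the only knob available
mount -t jfs -o noatime /dev/sdb5 /tester

# XFS: the external journal device must be named on every mount
mount -t xfs -o noatime,logdev=/dev/sda2,logbufs=8,logbsize=262144 /dev/sdb5 /tester
```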
RUNNING THE TESTS
In my initial dbench tests, I noticed that my filesystem throughput was mostly stabilized after about 15 seconds of actual test time. Since I was interested more in the short-term throughput typical of program loading and linking (and I tend to be impatient), I shortened the test time to 60 seconds with the following command line:
dbench -t 60 -D /tester $THREAD_COUNT
For a mild system load, I used 5 threads; for a heavy load, I used 20 threads. These are admittedly arbitrary figures, but they did expose well-threaded or poorly-threaded design on my dual-core system.
I created a script to do two primary tasks:
1. Warm the cache with a preliminary run of dbench.
2. Run dbench with each of the four I/O elevator algorithms (noop, deadline, anticipatory [which was deprecated later], and CFQ) as well as the slowest and fastest CPU speed available on my system. The output of these 8 iterations went to /tmp/$FILESYSTEM-$SPEED-$ELEVATOR.txt.
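The driver can be sketched roughly as follows (the real script is in the tarball; the helper function, the sdb device name, and leaving the CPU-speed step as a comment are all my assumptions, not the original):

```shell
#!/bin/sh
# Sketch of the test driver: warm the cache, then run dbench under each
# I/O elevator, at two CPU speeds, with 5 and 20 threads. Assumes the
# filesystem under test is already mounted on /tester.
FILESYSTEM=${1:-xfs}    # label used in the output filenames
DISK=${2:-sdb}          # block device backing /tester

outfile() {             # builds /tmp/$FILESYSTEM-$SPEED-$ELEVATOR.txt
    echo "/tmp/$1-$2-$3.txt"
}

run_tests() {
    dbench -t 60 -D /tester 5 > /dev/null   # warm the cache first

    for SPEED in slow fast; do
        # Set the CPU speed here (e.g. via the cpufreq sysfs interface);
        # the exact mechanism varies by system, so it is left as a comment.
        for ELEVATOR in noop deadline anticipatory cfq; do
            echo "$ELEVATOR" > "/sys/block/$DISK/queue/scheduler"
            for THREADS in 5 20; do
                dbench -t 60 -D /tester "$THREADS"
            done > "$(outfile "$FILESYSTEM" "$SPEED" "$ELEVATOR")"
        done
    done
}

# Only attempt real runs where dbench and the scheduler knob are available.
if command -v dbench >/dev/null 2>&1 && [ -w "/sys/block/$DISK/queue/scheduler" ]; then
    run_tests
fi
```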
I also hacked up a little script for preliminary analysis of these output files. It found the best and second-best performers for each dbench operation, followed by an ascending list of overall throughput. I found that the elevator can be more important than CPU speed for a filesystem’s performance. The deadline elevator generally did best for all three filesystems in my tests, although the impact of elevator choice varied.
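My script did more than this, but the ascending throughput ranking reduces to a few lines of shell, given the summary line dbench prints at the end of each run (of the form “Throughput N MB/sec M procs”) and the /tmp/$FILESYSTEM-$SPEED-$ELEVATOR.txt output files:

```shell
# Sketch of the ranking half of the analysis (the per-operation
# best/second-best report is omitted).
rank_throughput() {
    # -H prefixes each match with its filename; sort ascending by the
    # numeric throughput figure in field 2.
    grep -H '^Throughput' "$@" | sort -k2 -n
}

# Usage: rank_throughput /tmp/*-*-*.txt
```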
Since not everyone has an SMP system, I also ran the tests after booting with “nosmp”.
I was amazed by the numbers. For all but one uniform subset of tests, XFS was the clear winner. I even tried synchronous mounts and encrypted volumes as little “what if?” exercises, and XFS still came out on top. The parameter that cost XFS a total victory was the CFQ elevator on a slower system; ext3 won most of those cases.
One thing that shines about XFS is its highly-threaded design. For every permutation of elevator and CPU speed, XFS scaled upward from 5 threads to 20, although the degree of scaling with “noop” is probably within statistical noise. The CFQ elevator, however, showed a very large difference between the slow and fast CPU speeds. In fact, the combination of XFS, CFQ, and the fast CPU speed interacted so badly on my system that the 5-thread test was XFS’s worst result.
JFS was a surprising disappointment: it never scaled upward. In fact, every single test performed better with 5 threads than with 20; CPU speed and I/O elevator did not matter. All the 5-thread tests had throughput between 100 and 160 MB/s, and all the 20-thread tests came in between 40 and 45 MB/s. Once JFS reaches saturation, it performs no better than it would with a synchronous mount (-o sync).
What about ext3? Well, it showed some strange behavior on my system. With a fast CPU, it too performed (slightly) better under light load than under heavy load. With a slow CPU, however, all elevator differences disappeared into statistical noise, with throughput falling between 160 and 170 MB/s.
Grouping the results by CPU speed, thread count, and I/O elevator, I found that XFS was best on SMP in all but two permutations. For the noop, deadline, and anticipatory elevators, the ranking was always XFS first, then ext3, then JFS. With the CFQ elevator and 5 threads, ext3 won, followed by XFS, then JFS.
With a single CPU and 20 threads, the story was the same: XFS, then ext3, then JFS. However, with the lighter load of 5 threads, there was no uniform winner. ext3 topped the list with the CFQ scheduler and a faster CPU; otherwise, XFS was the winner.
APPLYING THE RESULTS
So which was fastest? On my SMP system, XFS with the deadline elevator topped the list, delivering over 400 MB/s of throughput. I switched my /home and /usr directories to XFS with external journals and set the deadline elevator on the kernel command line, and OpenOffice.org’s launch time suddenly dropped from 6.5 seconds to 3.5. I am not the only one to notice that XFS performance improves with the deadline elevator.
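For the curious, making those choices permanent looks roughly like this on a Slackware-style system (device names as in this article; double-check the boot-loader syntax against your own configuration):

```shell
# Kernel command line addition, e.g. in /etc/lilo.conf:
#   append="elevator=deadline"
#
# Matching /etc/fstab entry for an XFS /usr with its external journal:
#   /dev/sdb5  /usr  xfs  noatime,logdev=/dev/sda2,logbufs=8,logbsize=262144  0  2

# The elevator can also be switched at runtime, per block device (as root):
[ -w /sys/block/sdb/queue/scheduler ] &&
    echo deadline > /sys/block/sdb/queue/scheduler || true
```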
Another drawback of XFS’s external journal is that it is currently impossible to specify the journal location on the kernel command line. The option “rootflags=logdev=/dev/XXXX” is not handled properly, due to shortcomings in the kernel’s handling of the first root volume mount. I circumvented this with a hacked initrd, which is another article in itself.