Finding the Fastest Filesystem, 2011 Edition
Introduction
In my previous report about journaling filesystem benchmarking using dbench, I observed that a properly-tuned system using XFS, with the deadline I/O scheduler, beat both Linux’s ext3 and IBM’s JFS. A lot has changed in the three years since I posted that report, so it’s time to do a new round of tests. Many bug fixes, improved kernel lock management, and two new filesystems (btrfs and ext4) bring some new configurations to test.
Once again, I’ll provide raw numbers, but the emphasis of this report lies in the relative performance of the filesystems under various loads and configurations. To this end, I have normalized the charted data, and eliminated the raw numbers on the Y-axes. Those who wish to run similar tests on their own systems can download a tarball containing the testing scripts; I’ll provide the link to the tarball at the end of this report.
(Note: This report has been superseded by the 2012 edition.)
System configuration
The test system is my desktop at home, an AMD Athlon 64 X2 dual-core 4800+ with 4 gigs of RAM and two SATA interfaces. The drives use separate IRQs, with the journal on sda using IRQ 20, and the primary filesystem on sdb using IRQ 21. The kernel under test is Linux 2.6.38-rc2, which now has built-in IRQ balancing between CPUs/cores. The installed distribution is Slackware64-current. During the tests, the system was in runlevel 1 (single-user), and I didn’t touch the keyboard except during un-measured warm-ups.
The motherboard chipset supposedly supports Native Command Queuing, but the Linux kernel disables it due to hardware bugs. Even with this limitation, “hdparm -Tt” reports about 950 MB/s for cached reads on both drives, and buffered disk reads of 65 MB/s for sda and 76 MB/s for sdb. That raw throughput serves my usual desktop purposes well.
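For reference, those figures came from invocations like the following (run as root; the device names match my system and will differ on yours):

hdparm -Tt /dev/sda
hdparm -Tt /dev/sdb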
Filesystem options
I made a big improvement over hand-written notes by formalizing and scripting the filesystem initialization and mounting options. I also broadened the list of tested filesystems, adding ext2, ext4, ReiserFS, and btrfs. All filesystems were mounted with at least “noatime,nodiratime” in the mount options; this is becoming standard practice for many Unix and Linux sites, where system administrators question the value of writing a new access time whenever a file is merely read.
A quick perusal of Documentation/filesystems/ in the kernel source tree turned up a treasure trove of mount options, even for the experimental btrfs. One unsafe option I added where possible was to disable write barriers. Buffered writes can be the bane of journal integrity, so write barriers attempt to force the drive to write to the permanent storage sooner rather than later, at the cost of limiting the I/O elevator’s benefits. For these short tests, I opted for bandwidth, disabling barriers on btrfs and ext4.
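As a concrete illustration (the mount point /mnt/test is a placeholder, and the exact spelling of the barrier option varies by filesystem), the trade-off comes down to a single mount flag:

mount -o noatime,nodiratime ${PRIMARY} /mnt/test              # barriers at their default (safer)
mount -o noatime,nodiratime,nobarrier ${PRIMARY} /mnt/test    # barriers disabled (faster, riskier on power loss)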
btrfs
This filesystem format isn’t yet finalized, so it is completely unsuitable for storage of critical data. Still, it has been getting a lot of press coverage and online comment, with a big boost from Ted Ts’o, who called it “the future of Linux filesystems.” Strictly speaking, btrfs isn’t a filesystem with a separate journal; it’s a copy-on-write design in which the tree structure itself provides the consistency a journal normally would. Btrfs supports RAID striping of data, metadata, or both, so I opted for RAID0 striping across both drives to distribute the I/O load:
mkfs.btrfs -d raid0 -m raid0 ${LOGDEV} ${PRIMARY}
(EDIT: I previously used RAID1, mirroring, instead of RAID0 striping. The text above and the results below have been adjusted.)
The btrfs mount options added “nobarrier,space_cache” for performance.
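A sketch of how such a two-device filesystem gets mounted (the mount point is a placeholder, and older btrfs-progs releases may spell the scan step differently):

btrfs device scan      # let the kernel discover all member devices of the multi-device filesystem
mount -t btrfs -o noatime,nodiratime,nobarrier,space_cache ${PRIMARY} /mnt/test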
ext2
I added ext2, to provide a reference point based on highly stable code. It provided one of the early surprises in the tests.
mke2fs ${PRIMARY}
The default features enabled in /etc/mke2fs.conf were “sparse_super,filetype,resize_inode,dir_index,ext_attr”, with no mount options beyond “noatime,nodiratime”.
ext3
mke2fs -O journal_dev ${LOGDEV}
mke2fs -J device=${LOGDEV} ${PRIMARY}
The only addition to the base ext2 features is the journal. The mount options added for this test were “data=writeback,nobh,commit=30”.
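Because mke2fs records the external journal’s location in the superblock, the scripted mount needs only the options; a sketch with a placeholder mount point:

mount -t ext3 -o noatime,nodiratime,data=writeback,nobh,commit=30 ${PRIMARY} /mnt/test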
ext4
The other new Linux filesystem is ext4, which adds several new features over ext2/3. The most notable feature replaces block maps with extents, which require less on-disk space for tracking the same amount of file data. The ext4 journal also has stronger integrity checking than ext3 uses. (Another feature, not used in this test, is the ability to omit the journal from an ext4 filesystem. Combined with the efficiency of extents, this makes ext4 a strong candidate for flash storage, using fewer writes for the same amount of file data.)
mke2fs -O journal_dev ${LOGDEV}
mke2fs -E lazy_itable_init=0 -O extents -J device=${LOGDEV} ${PRIMARY}
The features from /etc/mke2fs.conf were “has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize”, but the “uninit_bg” feature was overridden by specifying “-E lazy_itable_init=0” to mke2fs. This reduces extra background work during the dbench run.
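To double-check which features a given filesystem actually ended up with, the superblock can be inspected; this is just an illustrative command, not part of the benchmark scripts:

dumpe2fs -h ${PRIMARY} | grep -i features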
JFS
Just as was the case three years ago, JFS still has no mkfs or mount options useful for testing throughput. WYSIAYG (What You See Is ALL You Get).
mkfs.jfs -q -j ${LOGDEV} ${PRIMARY}
ReiserFS
I caught a lot of guff three years ago for omitting ReiserFS from my testing. This time around, I decided that if btrfs is good enough to test, even though it’s still in beta, then I should be fair to the ReiserFS community and include it as well. Specifying “-f” twice skips the request for confirmation, which is useful for scripting.
mkreiserfs -f -f -j ${LOGDEV} ${PRIMARY}
Unfortunately, there is no file explaining ReiserFS options in Documentation/filesystems/, and the best advice in mount(8) uses weasel-words: “This [option] may provide performance improvements in some situations.” Without an explanation of what situations would benefit from the various options, I saw no point in testing them. Hence, the only non-default option in my ReiserFS testing is the external journal.
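Since mkreiserfs records the external journal device when the filesystem is created, the scripted mount stays simple; a sketch with a placeholder mount point:

mount -t reiserfs -o noatime,nodiratime ${PRIMARY} /mnt/test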
XFS
This was the hands-down winner in my previous testing. Designed for heavy multi-threading and aggressive memory management, XFS can sustain demanding workloads of many different operations. It has many tunable options for both mkfs.xfs and mount, so the scripted options are the most complicated:
mkfs.xfs -f -l logdev=${LOGDEV},size=256m,lazy-count=1 \
    -d agcount=16 ${PRIMARY}
One shortcoming of XFS is its lack of a pointer to an external journal device. As far as I can tell, it is the only journaled filesystem on Linux to have only a flag specifying whether the journal is internal or external. If the journal is external, then the mount command must include a valid “logdev=” option, or the mount will fail.
I also expanded the journal buffers at mount time with “logbufs=8,logbsize=262144”; on my computer, memory management is faster than disk I/O.
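Putting the pieces together, the XFS mount ends up looking something like this (the mount point is a placeholder):

mount -t xfs -o noatime,nodiratime,logdev=${LOGDEV},logbufs=8,logbsize=262144 ${PRIMARY} /mnt/test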
Testing the elevators
The original testing was intended to show the effects of disk I/O elevators and CPU speed on the various filesystems, using medium and heavy I/O load conditions. Since I ran the original tests, the “anticipatory” I/O elevator has been dropped from the Linux kernel, leaving only “noop”, “deadline”, and “cfq”. This round of testing still shows significant differences between them.
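For anyone following along at home, the elevator can be switched per-device at runtime through sysfs, no reboot required; the device name here is just my primary test disk:

cat /sys/block/sdb/queue/scheduler             # shows the choices, e.g. "noop deadline [cfq]"
echo deadline > /sys/block/sdb/queue/scheduler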
With a 5-thread dbench load, I was surprised to see that ext2 was the consistent winner. Its lack of a journal makes for less overall disk I/O per operation, at the cost of a longer time to check the filesystem after an improper shutdown. XFS came in a close second, at roughly 97% of ext2’s performance.
The rest of the filesystems aren’t nearly as competitive. Even with their best elevators, JFS, ReiserFS, and btrfs have less than half the performance of ext2 or XFS.
When the load increases to 20 threads, XFS is once again the clear winner. Ext4 benefits in the overall ranking, thanks to extent-based allocation management, a trait it shares with XFS. Ext2 falls to third place, probably due to the increased burden of managing block-based allocations. Ext3 again comes in fourth, with block-based allocations and added journal I/O. The clear loser is once again JFS, coming in at only about 40% of the leader’s throughput under heavy load. (More on this later.)
Normalizing the throughput by each filesystem’s best elevator shows which filesystems benefit from which elevators. Oddly, under a 5-process load, the only filesystem to benefit from “cfq” is JFS on a fast CPU. As seen above, that isn’t enough to make it a strong contender against XFS or any of the native Linux ext{2,3,4} filesystems.
Here is where the game has changed. The “cfq” elevator clashed badly with XFS three years ago; it is now mostly on par with “deadline” and “noop”. The XFS developers have put a lot of work into cleaning up the internals, improving the integration of XFS with the Linux frameworks for VFS and disk I/O. They still have work to do, as explained in Documentation/filesystems/xfs-delayed-logging-design.txt.
At its best, ReiserFS had only about 1/3 the throughput of the best filesystem in any tested configuration. Some mount options could probably improve the throughput, but without clear guidance, I wasn’t going to test every combination to find the best.
Bandwidth saturation
I decided to run a separate series of tests to see what process loads would saturate the various filesystems, and how they would scale after passing those saturation points. Using their best elevators, I tested the throughput of each filesystem under loads from 1 to 10 processes.
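The sweep boils down to a loop like the following; the 60-second runtime, mount point, and exact dbench flags are illustrative rather than copied verbatim from my script:

for n in $(seq 1 10); do
    dbench -D /mnt/test -t 60 ${n}     # run dbench with n client processes against the test mount
done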
The two worst performers were JFS and ext2. JFS peaked at 3 processes, then dropped off badly, ending up at 33% of its best performance at 10 processes. Ext2 didn’t suffer as badly, peaking at 5 processes, then falling only to 75% of its peak. Ext3, ext4, XFS, and ReiserFS didn’t suffer significantly under saturated load, staying mostly within a horizontal trend.
If I had to guess why JFS scales so poorly, I can only suppose that it follows IBM’s philosophy: it’s better to be correct than to be fast.
A special surprise
Btrfs was something of a mystery, hitting a performance valley at 4 processes, then climbing steadily upward nearly to the end of the test. Given that its raw number under 20 processes was better than its raw number under 10 processes, I decided to extend its test all the way to 50 processes, hoping to find its saturation point.
Btrfs managed to scale somewhat smoothly, all the way from 4 to 30 processes. Beyond that, its performance began to exhibit some noise, while still keeping an upward trend. This is a very impressive development for a dual-core system. (For the math geeks, the trend line from 4 to 50 processes is f(x) = 0.34x^0.29, with coefficient of determination R² = 0.99.)
Conclusion
The Linux filesystem landscape has changed a lot in the past three years, with two new contenders and lots of clean-up in the code. The conclusions of three years ago may not hold today.
Similarly, what’s true on my system may not be the case on yours. If you wish to examine your own system’s behavior, this tarball (CAPTCHA and 30-second wait required) contains the scripts I used for this article, as well as two PDFs with the raw and normalized results of my own testing.
Comments
You don’t give a final recommendation. Which is the best or fastest filesystem for the average user?
For the average user, the best one is the one that came with the system and gets the job done.
For someone who needs continuous, steady performance, it depends on the hardware configuration. That’s why I provide the testing scripts. On different hardware, the best filesystem might turn out to be JFS or btrfs.
The “average user” as you address in the article is an enterprise system admin. In common usage, average users are notebook and netbook owners, often with the latest SATA hardware (just released), which sysadmins do not yet use; it’s too fast and has compatibility problems.
This explains why you do not consider FAT (12, 16, 32), OS X, nor any of the many versions of NTFS.
You also ignore the fact that national, educational, and non-government agencies do not use Linux filesystems. The federal Australian government has just reinforced (forced) the use of Microsoft software onto this nation; no open software, etc. allowed, except after tedious, expensive legal procedures are attempted by any government agency.
Retired (medical) IT Consultant, Australian Capital Territory
I don’t mention “average user” in the article. However, a “slightly above-average user” can use the scripts I provide, to test another system’s behavior, especially w.r.t. disk I/O elevator. That one tunable is not nearly as difficult to adjust as a backup/re-format/restore, and it can have a significant impact on disk performance. XFS used to suffer under the CFQ elevator; switching to “deadline” was “like getting a new laptop” as one user put it. (My previous article discusses this.)
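To make such a change stick across reboots, the default elevator for all devices can also be set on the kernel command line; a sketch for a LILO-based system such as Slackware (adjust for your bootloader):

# in /etc/lilo.conf, inside the image= stanza, then re-run lilo:
append="elevator=deadline"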
My understanding of Macintosh OS X is that it uses UFS. This wouldn’t be impossible to test under Linux, but it isn’t really used as a primary filesystem in the Linux world. NTFS isn’t well-supported at all; the NTFS write support in the Linux kernel config comes with a stronger-than-usual “we hope this works right, but tough cookies if it messes up your data” warning.
dbench is not exactly a great filesystem benchmark (any serious fs comparison should use several benchmarks)
It won’t be difficult to substitute a different benchmark into the testing script. I use dbench for two reasons:
(1) dbench shows clear differences between filesystems, more than any other benchmark I tried. Writing large files on my system will saturate at around 70 MB/s for any filesystem under test.
(2) dbench reports data volume per fixed time, rather than reporting time for fixed data volume. Other benchmarks perform a fixed number of operations, and report the time required; I prefer dbench’s opposite approach.
It would have been fun to include FAT32 and NTFS, maybe HFS+ or whatever OS X uses now, too.
Just to see how they stack up. :)
Check Wikipedia; it’s changed a lot since I instigated updates. There are many versions of NTFS: NTFS-3G (MS-Win & Mac), and M-$ NTFS (several versions). With both FAT32 and all versions of NTFS, the published opinions are that perhaps a cluster size of 4 MB is faster.
Most PC users have notebooks and laptops, but this article is only about system administrators on enterprise servers. The author wrongly labelled the posting.
Thanks for sharing these tests. I will try it here, on my scenarios/hardware.
Rafael from Suporte Linux Team
You state, “Btrfs supports RAID striping of data, metadata, or both, so I opted to enable RAID1 to distribute the I/O load,” but isn’t RAID-1 mirroring? RAID-0 is striping, AFAIK.
I’d guess that this could have a significant negative effect on your Btrfs performance results?
Oh dear, you are exactly right. I’ll point out my blunder, and re-run the btrfs tests tonight.
To be fair though, I think all the other filesystems can be striped (raid-0) by using mdraid. Perhaps there is a valid difference if Btrfs does this natively?
Yes, btrfs does volume management natively, when initializing the filesystem, or later with “btrfs device add/remove”.
I’d like to see HammerFS in the comparison as well; too bad it’s not been ported to Linux yet.
This is an interesting article. The general user may not look under the hood, but the ‘geeks’ and the ‘knowledgeable’ must provide, for any operating system, a filesystem which is sturdy, takes no more space on disk than the data it holds, is able to recover when corrupted for some reason, defragments almost on the fly, and yes, scales to take advantage of whatever sizes are available, or of pooled storage. And data stored in it must not be lost.
Like ‘platform-independent’ programming languages, we should also have an ‘OS-independent FS’. Let there be competing ones for people to choose from. (Of course, someone else may decide it, but in the Linux world, the user can exercise that right at the time of installation.)
Thanks indeed, I am enriched.
Nagin Chand
Strictly speaking, all filesystems are OS independent. Any OS can be “taught” via the right drivers how to access the filesystem structures on the hard drive, and through those structures the files themselves. M$ Windows can access ext2 filesystems as easily as Linux can access NTFS filesystems.
UFS is the common name for the native filesystems of Solaris, BSD, and HP-UX, but each of those has customized the UFS format to the point that they can’t share UFS volumes between them. Accessing UFS in Linux requires that the administrator specify which flavor of UFS is on the volume. The man page for “mount” includes a complaint about this.
Thank you!
Nagin Chand
The mount option you are looking for with reiserfs is “notail”.
You used a space-efficiency-optimized filesystem, not a performance-tuned one.
Not that reiserfs will become the fastest filesystem in your test, but it will be faster than it appears here with tail packing enabled.
For sure it is even now a good choice until btrfs is production ready.
BTW, the notail mount option has to be used in order to benefit from O_DIRECT with databases…
This could (should?) be documented in the kernel source tree, as I explained in the article.
If you type “man 8 mount” and look for reiserfs, all this stuff is well documented.
I cannot think of a more standard location.
Use notail,noborder,data=writeback to tune reiserfs to be fast with databases (noborder is not good if you do not have the typical DB workload).
barrier=none is the same as nobarrier with btrfs; DO NOT USE IT if you think it is better not to lose data on a crash.
hashed_relocation and no_unhashed_relocation are almost the same; this is the XFS default behaviour.
-----
Mount options for reiserfs
Reiserfs is a journaling filesystem.

conv
    Instructs version 3.6 reiserfs software to mount a version 3.5 file system, using the 3.6 format for newly created objects. This file system will no longer be compatible with reiserfs 3.5 tools.

hash=rupasov / hash=tea / hash=r5 / hash=detect
    Choose which hash function reiserfs will use to find files within directories.
    rupasov
        A hash invented by Yury Yu. Rupasov. It is fast and preserves locality, mapping lexicographically close file names to close hash values. This option should not be used, as it causes a high probability of hash collisions.
    tea
        A Davis-Meyer function implemented by Jeremy Fitzhardinge. It uses hash permuting bits in the name. It gets high randomness and, therefore, low probability of hash collisions at some CPU cost. This may be used if EHASHCOLLISION errors are experienced with the r5 hash.
    r5
        A modified version of the rupasov hash. It is used by default and is the best choice unless the file system has huge directories and unusual file-name patterns.
    detect
        Instructs mount to detect which hash function is in use by examining the file system being mounted, and to write this information into the reiserfs superblock. This is only useful on the first mount of an old format file system.

hashed_relocation
    Tunes the block allocator. This may provide performance improvements in some situations.

no_unhashed_relocation
    Tunes the block allocator. This may provide performance improvements in some situations.

noborder
    Disable the border allocator algorithm invented by Yury Yu. Rupasov. This may provide performance improvements in some situations.

nolog
    Disable journalling. This will provide slight performance improvements in some situations at the cost of losing reiserfs’s fast recovery from crashes. Even with this option turned on, reiserfs still performs all journalling operations, save for actual writes into its journalling area. Implementation of nolog is a work in progress.

notail
    By default, reiserfs stores small files and `file tails' directly into its tree. This confuses some utilities such as LILO(8). This option is used to disable packing of files into the tree.

replayonly
    Replay the transactions which are in the journal, but do not actually mount the file system. Mainly used by reiserfsck.

resize=number
    A remount option which permits online expansion of reiserfs partitions. Instructs reiserfs to assume that the device has number blocks. This option is designed for use with devices which are under logical volume management (LVM). There is a special resizer utility which can be obtained from ftp://ftp.namesys.com/pub/reiserfsprogs.

user_xattr
    Enable Extended User Attributes. See the attr(5) manual page.

acl
    Enable POSIX Access Control Lists. See the acl(5) manual page.

barrier=none / barrier=flush
    This disables / enables the use of write barriers in the journaling code. barrier=none disables, barrier=flush enables (default). This also requires an IO stack which can support barriers, and if reiserfs gets an error on a barrier write, it will disable barriers again with a warning. Write barriers enforce proper on-disk ordering of journal commits, making volatile disk write caches safe to use, at some performance penalty. If your disks are battery-backed in one way or another, disabling barriers may safely improve performance.
“This may provide performance improvements in some situations.” In what situations? If the developers don’t care to explain why they added the options, I’m not going to spend a bunch of time trying to figure out what their intentions were.