As detailed in my last post, ZFS looks like the best fit for my needs. It’s taken some time to get my head wrapped around it, so I thought I would share my initial findings here.
So, what is ZFS anyway?
ZFS is short for Zettabyte File System. The technology dates back to 2004, when Sun created it to solve some long-standing filesystem problems and to support the X4500 “Thumper” server with its 48-drive capacity, which was huge at the time. Sun shared ZFS with the open-source community, then Oracle bought Sun and took ZFS closed-source again, then Oracle’s ZFS developers left, and it’s now developed and actively used in various open-source projects. Oracle is going in another direction with Solaris’ filesystem needs, but many seem to think this is because they can’t use recent ZFS innovations without making Solaris open source again. Oops.
ZFS is a high-level view of your disk storage subsystem. It performs software RAID, handles file systems nearly transparently, and uses its comprehensive view of your data storage to do a whole lot more as well.
So why is ZFS interesting?
I’m interested because ZFS was designed with data integrity as the core principle, but since it was a green field project its designers took advantage of the opportunity to do some unique things that offer a whole lot of performance and flexibility as well. You can spend months learning this system, but some of the big features I see are as follows:
- ARC Cache. ARC is short for adaptive replacement cache, and you can think of it as server memory used to cache “hot” data. This means that reads of frequently accessed data (like databases) are fast, and the system adjusts to the current workload to optimize performance. RAID cards can do the same sort of thing, but ARC caches are significantly larger: my test server has 72 gigs of RAM in it, so ~ 62 gigabytes of that should be set aside as an incredibly fast read cache, while RAID cards max out at ~ 1 gigabyte of cache.
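As a back-of-the-envelope sketch of that sizing (the 10 GB OS reservation is my assumption, not a ZFS rule):

```shell
#!/bin/sh
# Hypothetical ARC sizing for the 72 GB test server described above:
# reserve some RAM for the OS and let ARC have the rest.
total_gb=72
os_reserve_gb=10
arc_gb=$(( total_gb - os_reserve_gb ))
echo "ARC target: ${arc_gb} GB"
# On FreeBSD the cap is the vfs.zfs.arc_max loader tunable; on
# ZFS-on-Linux the equivalent module parameter is zfs_arc_max (in bytes).
arc_bytes=$(( arc_gb * 1024 * 1024 * 1024 ))
echo "arc_max in bytes: ${arc_bytes}"
```

In practice ZFS grows and shrinks the ARC on its own; the tunable just sets the ceiling.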
- L2ARC. You can designate fast storage (like an SSD drive) as a secondary cache for ARC. Doing this reduces the memory available to ARC though, because memory is required to track what’s on the SSD. In most cases the right time to add an SSD as L2ARC is after you’ve maxed out your storage server’s RAM.
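Attaching an L2ARC device is a one-liner; the pool name `tank` and device name below are hypothetical:

```shell
# Attach a hypothetical SSD as a read cache (L2ARC) for pool "tank".
zpool add tank cache /dev/ada2
# Verify: the device appears under a "cache" section in the pool layout.
zpool status tank
```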
- Transaction Groups. ZFS ensures that the data on disk is always consistent by queueing up disk writes and then committing them as a single transaction. This is good from a data integrity point of view, but it’s also good from a performance point of view. Writes are expensive, and collecting then reordering them is more efficient than simply writing as the requests come in, so transaction groups reduce demand on the drives and squeeze more performance out of a pool.
- ZIL stored on SLOG. The problem that pops up when using transaction groups is dealing with synchronous data. If the power goes out after a customer has paid for a product and that data only exists in memory because it hasn’t been flushed to disk yet, then that data is lost. This is very much not OK, so applications that deal with critical data use synchronous writes. Basically the application designer decides that data integrity is more important than performance, and the application is designed to tell the system “write this to permanent storage and tell me when that’s complete so I can move on to the next step.” This protects the data, but if ZFS flushed the queued transaction every time a synchronous write request came in, there would be serious performance implications, especially in a world where some applications (like VMware) play it safe by trying to make every write synchronous. So ZFS keeps an intent log (the ZIL), where synchronous writes are recorded and then never referenced again unless the server fails before the next transaction group is written. This ensures that critical data can be recovered after an unexpected server failure while the write-via-transaction model keeps running as designed. If you put the log on a separate fast SSD, you can greatly reduce the performance hit of synchronous writes, which would otherwise land on your storage pool directly. Just a note here: all ZFS pools have a ZIL, so SLOG is the term used for a ZIL that’s been allocated its own device(s).
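A sketch of adding a SLOG (pool, dataset, and device names are hypothetical); mirroring it is the usual advice, since a lost log device at the wrong moment means lost in-flight synchronous writes:

```shell
# Give the ZIL its own mirrored pair of fast SSDs (hypothetical devices).
zpool add tank log mirror /dev/ada3 /dev/ada4

# Synchronous behavior is controllable per dataset via the "sync" property:
#   standard = honor the application's sync requests (the default)
#   always   = treat every write as synchronous
#   disabled = ignore sync requests entirely (fast, but unsafe)
zfs get sync tank/vmstore
```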
- Data Protection. ZFS checksums all data as it’s written and verifies those checksums on every read, so bit rot is detected and, where redundancy exists, bad blocks are repaired from a good copy: files are self-healing. Periodic scrubs verify the checksums of all data on disk, looking for errors while they are still correctable.
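Scrubs are kicked off manually or from a scheduler; pool name here is hypothetical:

```shell
# Walk the whole pool, verifying every checksum and repairing
# anything correctable from redundant copies.
zpool scrub tank
# Check scrub progress and any checksum errors found/repaired.
zpool status tank
# Many admins run this from cron, e.g. monthly.
```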
- Flexible RAID. You decide how you want to protect your data, then implement it. Mirrors, RAID10-style striped mirrors, RAID5 and RAID6 equivalents (called RAIDZ1 and RAIDZ2), and even triple-parity RAIDZ3 are all possible. As are 3-way mirrors, or whatever.
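The layout is chosen at pool creation time; a few sketches with hypothetical pool and disk names:

```shell
# RAID10-style: a stripe across two mirrored pairs.
zpool create tank mirror da0 da1 mirror da2 da3

# Double parity (the RAID6 analogue) across six disks.
zpool create tank raidz2 da0 da1 da2 da3 da4 da5

# A 3-way mirror: any two disks can die.
zpool create tank mirror da0 da1 da2
```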
- Portability. ZFS pools can be exported from one machine and imported on another. This works across operating systems as well, so you can export from FreeBSD (or one of the storage-based BSD distributions like FreeNAS or NAS4Free) and import into Solaris, or illumos, or a Linux-based installation that supports ZFS. Mostly I like this feature because it means that when I run out of drive bays on my storage server, I can buy a new server, install an OS, move the disks across, import the ZFS pool, and keep on running. That’s a beautiful feature that reduces waste in the form of purchased but unused hardware.
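The server-migration workflow is short; pool name is hypothetical:

```shell
# On the old server: cleanly detach the pool from the OS.
zpool export tank
# Physically move the disks, then on the new server:
zpool import          # lists pools found on attached disks
zpool import tank     # import by name (or by the numeric ID it printed)
```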
- Deduplication. This is a way of recognizing that lots of data is stored in multiple places and doing smart things to make sure each unique block is only written once to save space. If your environment is made up of virtualized desktops where you have 50 Windows 7 virtual machines, there is a lot of duplicate data there (like every Windows or Office executable). Deduplication can save on storage costs, but ZFS implements this in a way that can consume serious resources. Worse: once you turn it on, it’s permanently “on” for the data that’s been dedup’d, so performance penalties will persist until you migrate the deduplicated data off the server and back onto it again.
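Dedup is a per-dataset property, so you can confine the experiment; dataset names below are hypothetical:

```shell
# Enable dedup only on the dataset holding the VDI images.
zfs set dedup=on tank/vdi
# Watch the cost/benefit: dedup table statistics for the pool...
zpool status -D tank
# ...and the pool-wide dedup ratio in the DEDUP column.
zpool list tank
```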
- Compression. The standard compression algorithm is fast and efficient, and all the recommendations I’ve read say “turn it on and don’t worry about it.” GZIP compresses harder but costs much more CPU, so if you’re running ZFS on a backup server where capacity is more important than performance, then GZIP level 9 might be perfect and offers some possibilities other solutions don’t.
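Like dedup, compression is set per dataset, so different datasets can make different trade-offs (dataset names hypothetical; `lz4` is the usual "turn it on" choice on current OpenZFS):

```shell
# Cheap, fast compression for general-purpose data.
zfs set compression=lz4 tank/data
# Maximum squeeze for a backup target where CPU is cheaper than disks.
zfs set compression=gzip-9 tank/backups
# See what it's buying you.
zfs get compressratio tank/backups
```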
- Storage Expansion. On a hardware RAID controller running parity RAID, you can often expand a volume by adding more drives to it, or change from RAID5 to RAID6 when you learn how foolish RAID5 is in the era of disk sizes in excess of 1 TB. ZFS won’t do those things, but it will allow you to expand a pool by adding more virtual devices (vdevs), with no performance penalty while the addition is happening.
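Expansion by vdev looks like this (pool and disk names hypothetical):

```shell
# Grow the pool by striping a second RAIDZ2 vdev alongside the first.
zpool add tank raidz2 da6 da7 da8 da9 da10 da11
# Capacity jumps immediately; no rebuild and no redistribution of
# existing data onto the new vdev.
zpool list tank
```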
- Snapshots. ZFS is a copy-on-write filesystem, so snapshots take up zero space on the drive until subsequent writes change files and the file system diverges from snapshot. Snapshots can be reverted, or copied from one system to another, or promoted so they are accessible directly. This kind of flexibility offers some useful tools for backup implementation and disaster recovery planning.
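The snapshot lifecycle in command form (dataset and snapshot names hypothetical):

```shell
# Take an instant, initially zero-cost snapshot.
zfs snapshot tank/home@before-upgrade
# Revert the dataset to that point in time.
zfs rollback tank/home@before-upgrade
# Or surface the snapshot as its own writable dataset via a clone.
zfs clone tank/home@before-upgrade tank/home-recovered
```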
- Replicated Snapshots. My original goal here was to have a clustered storage system, but that is not really practical in 2013 within my budget. ZFS allows for the next-best thing though: snapshots can be taken periodically (say every 5 minutes) and asynchronously sent to a secondary ZFS server. This means that failure of a primary storage server will result in downtime requiring manual intervention, but data loss in these cases (and total recovery time if implemented properly) should be greatly reduced. It also makes it fairly simple to have storage replicated from one datacenter to another, so failover after a site failure is reasonably straightforward.
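A sketch of the send/receive loop behind that 5-minute replication (host, pool, and snapshot names are hypothetical; in practice a script or tool drives this on a timer):

```shell
# One-time full copy to the standby server.
zfs snapshot tank/data@0800
zfs send tank/data@0800 | ssh backup-host zfs receive backup/data

# Every cycle afterward: send only the delta between snapshots.
zfs snapshot tank/data@0805
zfs send -i tank/data@0800 tank/data@0805 | ssh backup-host zfs receive backup/data
```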
There is plenty more to cover, but these are the main points that I’m taking into account when planning my storage server.
OK, so what are the downsides of ZFS?
There are a few points to keep in mind:
- Complexity. All those features listed above give a great amount of flexibility, but with that comes complexity. There are a lot of moving parts here, and sometimes those moving parts don’t interact the way you think they will. Planning is definitely required.
- Different enough to require retraining. If a drive fails in a hardware RAID array, the fix is to pull the bad drive, insert a good one, and everything chugs along happily. With ZFS, if you don’t tell the system to expect the drive to disappear, really bad things may happen. If you expand a hardware-based RAID6 array by adding new drives, the controller will go ahead and redistribute data across all the drives. ZFS will not allow you to expand a RAIDZ2 vdev by adding new drives to it, and when you expand your pool by adding another virtual device, it will not automatically redistribute existing data across the member devices as you might expect if you have worked with hardware RAID for a while.
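As an example of that retraining, the drive-swap procedure ZFS expects (pool and device names hypothetical):

```shell
# Tell ZFS the drive is going away before you pull it.
zpool offline tank da3
# ...physically swap the disk...
# Resilver onto the replacement device.
zpool replace tank da3 da7
# Watch resilver progress and pool health.
zpool status tank
```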
- Hardware Requirements. This was a system designed by Sun engineers for some really high-end storage servers. If you assume high-quality servers, run by competent staff, in the proper environmental conditions, then there are a whole lot of failure modes you don’t need to design around. Be sure you meet the minimums, especially with regard to things like ECC RAM. Remember: lots of memory is the key to performance, so don’t skimp.
- Data integrity, not performance, was the key design goal. RAIDZ2 is better than RAID6 from an integrity point of view, because where RAID cards trust the data they read (and don’t touch parity unless they detect a problem), RAIDZ2 always reads the full stripe and verifies the checksum to ensure the data is still correct. This is great, until you look at performance. One would expect random reads on a 20-drive RAID6 array to perform about 18 times faster than the slowest drive in the array (20 drives minus 2 parity drives = 18). A 20-drive RAIDZ2 vdev, because it insists on reading and verifying the whole stripe, performs about the same as one drive. That’s a lot of performance to give up, and if you make assumptions about how the system you are designing will perform without doing enough research into ZFS, then you may end up really disappointed.
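The arithmetic behind that comparison, as a sketch (150 random-read IOPS per 7200 RPM spindle is my assumed figure, not from any benchmark):

```shell
#!/bin/sh
# Assumed per-drive random-read IOPS for a 7200 RPM disk.
per_drive_iops=150
drives=20
parity=2
# Hardware RAID6: random reads can be spread across the data drives.
raid6_iops=$(( (drives - parity) * per_drive_iops ))
# A single wide RAIDZ2 vdev: every read touches the whole stripe,
# so random reads perform roughly like one drive.
raidz2_iops=$per_drive_iops
echo "RAID6, 20 drives:  ~${raid6_iops} random read IOPS"
echo "RAIDZ2, 20 drives: ~${raidz2_iops} random read IOPS"
```

The usual ZFS answer is to build the pool from several smaller vdevs, since random-read performance scales with vdev count rather than drive count.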
So is it worth it?
Let’s find out. The next article will take us one step closer to testing.
2018 Update: I never could get my test hardware to perform well, so I abandoned the project here. This has changed in 2018 though, and read IOPS for a non-optimized ZFS mirror with a $329 Samsung SSD as a SLOG are around 14,000. Sequential writes are significantly slower, but for my purposes this is outstanding performance.