I am a huge fan of ZFS. It’s usually quite awesome. I have been using ZFS for about 4-5 years now. In most scenarios, ZFS is highly resilient and will let you know about problems before they become a major issue. But there are times when the right set of circumstances can make things fail spectacularly.
I have been dealing with on-and-off issues with our ZFS-based storage environments at work. Most of the time things were easily recovered. In the perfect-storm scenarios, though, the words that come to mind are “Why is this so difficult?”
We had purchased two failover cluster setups from a vendor: one cluster for our LA location, one for our NY location. The units were installed and, except for a few initial hiccups, worked well. Soon we started encountering kernel panics. One of the head units was swapped out in the NY location. We encountered SAS errors. We encountered yet more kernel panics, coupled with failover issues. We tried replacing cables. We tried different kernel builds. We tried lots of things. All of this, over a multi-month period and lots of time on the phone, led to us keeping the hardware and moving to software with more robust clustering.
After the conversion, the NY location hummed along without much incident. The LA location was similar, until Dec 25 2012. In an unrelated issue, we had a network event which caused the cluster to fail over from the primary node to the secondary. The exact sequence of events has never been determined, but I got an email with a message similar to this:
  pool: pool0
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Jan 7 19:28:42 2013
        266M scanned out of 180T at 1.27M/s, 41421h59m to go
        10.4M resilvered, 0.00% done
config:

        NAME                          STATE     READ WRITE CKSUM
        pool0                         DEGRADED     0     0    30
          raidz2-0                    ONLINE       0     0     0
            c0t5000C50041BA0F63d0     ONLINE       0     0     0
            c0t5000C50041BA0E17d0     ONLINE       0     0     0
            c0t5000C50041BA0F73d0     ONLINE       0     0     0
            c0t5000C50041BA29EFd0     ONLINE       0     0     0
            c0t5000C50041BA1F0Bd0     ONLINE       0     0     0
            c0t5000C50041BA145Bd0     ONLINE       0     0     0
            c0t5000C50041BA36B7d0     ONLINE       0     0     0
            c0t5000C50041BA0F27d0     ONLINE       0     0     0
            c0t5000C50041BA252Bd0     ONLINE       0     0     0
            c0t5000C50041BA2FEFd0     ONLINE       0     0     0
            c0t5000C50041BA113Fd0     ONLINE       0     0     0
            c0t5000C50041AEDFDBd0     ONLINE       0     0     0
          raidz2-2                    ONLINE       0     0     0
            c0t5000C50041BA2423d0     ONLINE       0     0     0
            c0t5000C50041BA1423d0     ONLINE       0     0     0
            c0t5000C50041BA1043d0     ONLINE       0     0     0
            c0t5000C50041BA1653d0     ONLINE       0     0     0
            c0t5000C50041BA26CBd0     ONLINE       0     0     0
            c0t5000C50041BA2EAFd0     ONLINE       0     0     0
            c0t5000C50041BA2003d0     ONLINE       0     0     0
            c0t5000C50041BA16A3d0     ONLINE       0     0     0
            c0t5000C50041BA30F3d0     ONLINE       0     0     0
            c0t5000C50041BA29CFd0     ONLINE       0     0     0
            c0t5000C50041BA19EFd0     ONLINE       0     0     0
            c0t5000C50041BA12BBd0     ONLINE       0     0     0
          raidz2-3                    ONLINE       0     0     0
            c0t5000C50041BA2A0Bd0     ONLINE       0     0     0
            c0t5000C50041BA27ABd0     ONLINE       0     0     0
            c0t5000C50041BA1497d0     ONLINE       0     0     0
            c0t5000C50041BA3687d0     ONLINE       0     0     0
            c0t5000C50041BA22EBd0     ONLINE       0     0     0
            c0t5000C50041BA14C7d0     ONLINE       0     0     0
            c0t5000C50041BA2843d0     ONLINE       0     0     0
            c0t5000C50041BA323Fd0     ONLINE       0     0     0
            c0t5000C50041BA0F9Bd0     ONLINE       0     0     0
            c0t5000C50041BA157Fd0     ONLINE       0     0     0
            c0t5000C50041BA19C3d0     ONLINE       0     0     0
            c0t5000C50041BA12D3d0     ONLINE       0     0     0
          raidz2-4                    ONLINE       0     0     0
            c0t5000C50041BA284Bd0     ONLINE       0     0     0
            c0t5000C50041BA1197d0     ONLINE       0     0     0
            c0t5000C50041BA0DDFd0     ONLINE       0     0     0
            c0t5000C50041BA431Fd0     ONLINE       0     0     0
            c0t5000C50041BA1907d0     ONLINE       0     0     0
            c0t5000C50041BA0FAFd0     ONLINE       0     0     0
            c0t5000C50041BA1EABd0     ONLINE       0     0     0
            c0t5000C50041BA1A6Fd0     ONLINE       0     0     0
            c0t5000C50041BA1F2Fd0     ONLINE       0     0     0
            c0t5000C50041BA11C7d0     ONLINE       0     0     0
            c0t5000C50041BA1C67d0     ONLINE       0     0     0
            c0t5000C50041BA14B7d0     ONLINE       0     0     0
          raidz2-5                    DEGRADED     0     0    60
            spare-0                   DEGRADED     0     0     0
              c0t5000C50041BA2653d0   DEGRADED     0     0     0  too many errors
              c0t5000C50041BA167Bd0   ONLINE       0     0     0  (resilvering)
            spare-1                   DEGRADED     0     0     0
              c0t5000C50041BA324Bd0   DEGRADED     0     0     0  too many errors
              c0t5000C50041BA134Fd0   ONLINE       0     0     0  (resilvering)
            spare-2                   DEGRADED     0     0     0
              c0t5000C50041BA1ACBd0   DEGRADED     0     0     0  too many errors
              c0t5000C50041B7575Fd0   ONLINE       0     0     0  (resilvering)
            c0t5000C50041BA17D3d0     ONLINE       0     0     0
            c0t5000C50041BA23FBd0     ONLINE       0     0     0
            c0t5000C50041BA1ABFd0     ONLINE       0     0     0
            c0t5000C50041BA1D2Bd0     ONLINE       0     0     0
            c0t5000C50041BA174Bd0     ONLINE       0     0     0
            c0t5000C50041BA120Bd0     ONLINE       0     0     0
            c0t5000C50041BA27E3d0     DEGRADED     0     0     0  too many errors
            c0t5000C50041BA1C97d0     DEGRADED     0     0     0  too many errors
            spare-11                  DEGRADED     0     0     0
              c0t5000C50041BA233Bd0   DEGRADED     0     0     0  too many errors
              c0t5000C50041BA36AFd0   ONLINE       0     0     0  (resilvering)
          raidz2-6                    DEGRADED     0     0    60
            c0t5000C50041BA2F0Fd0     ONLINE       0     0     0
            c0t5000C50041BA125Bd0     ONLINE       0     0     0
            c0t5000C50041BA24CBd0     ONLINE       0     0     0
            c0t5000C50041BA2FEBd0     DEGRADED     0     0     0  too many errors
            c0t5000C50041BA18F7d0     DEGRADED     0     0     0  too many errors
            c0t5000C50041BA3097d0     DEGRADED     0     0     0  too many errors
            c0t5000C50041BA25E3d0     DEGRADED     0     0     0  too many errors
            c0t5000C50041BA147Fd0     DEGRADED     0     0     0  too many errors
            c0t5000C50041BA18EFd0     ONLINE       0     0     0
            c0t5000C50041BA3503d0     ONLINE       0     0     0
            c0t5000C50041BA1BEFd0     ONLINE       0     0     0
            c0t5000C50041BA2D9Fd0     ONLINE       0     0     0
          raidz2-7                    ONLINE       0     0     0
            c0t5000C50041BA13BBd0     ONLINE       0     0     0
            c0t5000C50041BA2E1Fd0     ONLINE       0     0     0
            c0t5000C50041BA308Bd0     ONLINE       0     0     0
            c0t5000C50041BA30DBd0     ONLINE       0     0     0
            c0t5000C50041BA350Fd0     ONLINE       0     0     0
            c0t5000C50041BA0F87d0     ONLINE       0     0     0
            c0t5000C50041BA18AFd0     ONLINE       0     0     0
            c0t5000C50041BA3123d0     ONLINE       0     0     0
            c0t5000C50041BA15B3d0     ONLINE       0     0     0
            c0t5000C50041BA1987d0     ONLINE       0     0     0
            c0t5000C50041BA3063d0     ONLINE       0     0     0
            c0t5000C50041BA2FCFd0     ONLINE       0     0     0
        spares
          c0t5000C50041B7575Fd0       INUSE     currently in use
          c0t5000C50041BA134Fd0       INUSE     currently in use
          c0t5000C50041BA167Bd0       INUSE     currently in use
          c0t5000C50041BA36AFd0       INUSE     currently in use
errors: No known data errors
Wow, ZFS is not happy. Scrubs up until four weeks or so prior had not shown any issues. The pool is fairly large at 180T. Goal number one was to start getting everything off of it that we could. Logs pulled from the HBA showed over 100k retransmits to the JBODs, and Solaris FMA showed 1 million checksum errors in total after the initial scrub finished. Fast-forward to March 2013, and we were finally at a point where we could re-create the pool and start moving data back onto the unit. We lost about 1.9TB of data. Thankfully for us, by that point the lost data would have been pruned due to age anyway.
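For anyone who hasn’t had the pleasure, those FMA numbers come straight from the fault manager’s own logs. Roughly something like this (the exact event classes and detail vary by Solaris build):

    fmadm faulty    # current fault diagnoses (bad disks, degraded vdevs)
    fmdump -e       # one line per error report (ereport), e.g. ZFS checksum errors
    fmdump -eV      # the same telemetry with full detail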
Takeaways
The root cause of this issue was hardware. Poorly designed hardware. It was determined that while the JBOD internals were fine, the connection between the internals and the outside world was not particularly good. They were picked because they are fairly high density, at 45 drives in a 4U footprint. The two 45-drive JBODs were replaced with four 24-drive JBODs. After this was done in both locations, the number of retransmits has stayed at zero through more than a week of moving TBs of data in LA and multiple days of data movement in NY. I can say with confidence that it was the root of the problems we were having. Hardware issues can manifest themselves in software in interesting ways.
Stick with the original. While there are good ports of ZFS to other OSes, the strongest toolset is still on Solaris. I use the ZFSonLinux port on my laptop and am happy with it, but I don’t think it’s polished enough for production use yet. Performance is good, but not up to that of Solaris.
ZFS does have ways to recover from even corrupted metadata, but that is not for the faint of heart. You can use zdb to attempt to roll back to an earlier transaction group, but the window for this is extremely small. If you can’t recover it using zdb / mdb, then the only way to fix corrupted metadata is to re-create the pool. This is by far the weakest piece of it all. There needs to be a way to fix this without having to re-create everything from scratch. When dealing with 100TB of data, starting over is not a good solution.
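For the curious, the transaction-group rollback looks roughly like this (a sketch, not the exact commands we ran; pool0 is our pool name, and the -F recovery mode needs a reasonably recent pool version):

    # Dry-run rewind: ask ZFS whether discarding the last few transaction
    # groups would yield an importable, consistent pool state.
    zpool import -Fn pool0

    # If that looks sane, do it for real; expect to lose the last few
    # seconds of writes before the failure.
    zpool import -F pool0

    # zdb can show the active uberblock (and its txg) for the pool.
    zdb -u pool0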
“iostat -en” is your friend on Solaris, in addition to SMART data, for figuring out which disks are bad. Physically remove bad disks from pools as soon as possible. If you’re connecting a disk directly to a port on an HBA, as most home users do, this is less of an issue. But for those using any sort of shared backplane, one bad disk can make the rest of the fabric’s performance drop through the floor.
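As a quick illustration (the device here is one of the suspect disks from the status output above; smartctl comes from smartmontools, which is not part of base Solaris):

    # Per-device soft/hard/transport error counters; a climbing "trn" or
    # "h/w" count usually points at the sick disk or path.
    iostat -en

    # Extended per-device detail (vendor, serial, media errors).
    iostat -En c0t5000C50041BA2653d0

    # SMART data for the same disk via smartmontools.
    smartctl -a -d scsi /dev/rdsk/c0t5000C50041BA2653d0s0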
RAID-Z doesn’t lend itself easily to No-Single-Point-of-Failure (NSPF) configurations. With a 2-JBOD configuration, the easiest way to guarantee NSPF is simple mirrors, with each side of a mirror in a different JBOD. This was my initial plan for re-creating the pools. However, I’m looking into doing something like 4-disk RAID-Z2, which allows up to 50% of the disks to fail while using the same amount of disk space as mirrors. I plan on doing some benchmarks and will add more to this post. The concern with mirrors is that a disk fails, then its partner fails, and then it’s over; I’m not certain of the statistics. This page: WHEN TO (AND NOT TO) USE RAID-Z shows some interesting performance numbers, though that is for RAID-Z, and I don’t recommend anything less than RAID-Z2. With a 3-JBOD configuration, RAID-Z3 works pretty well, and if set up right, allows you to lose 1 JBOD and still stay up and running.
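To make that concrete, here is a sketch of the 4-disk RAID-Z2 NSPF idea across two JBODs (device names are placeholders, not our actual disks): each vdev takes two disks from each JBOD, so losing an entire JBOD costs every vdev only two disks, which RAID-Z2 tolerates.

    # JBOD A provides diskA*, JBOD B provides diskB*. Space efficiency is
    # 50%, the same as mirrors, but any two disks per vdev may fail.
    zpool create pool0 \
        raidz2 diskA0 diskA1 diskB0 diskB1 \
        raidz2 diskA2 diskA3 diskB2 diskB3 \
        raidz2 diskA4 diskA5 diskB4 diskB5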
When all is working, the performance is pretty awesome: regular transaction group commits at 3GB/s over 4 SAS ports. That is hard to argue with for the total cost versus something from everyone’s favorite three-letter storage corporation.
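If you want to watch those commits yourself, zpool iostat shows the bursts (pool0 is our pool name; the trailing 1 is a one-second interval):

    # Pool-wide bandwidth and IOPS every second; transaction group commits
    # show up as periodic write bursts. -v breaks it down per vdev.
    zpool iostat -v pool0 1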
With respect to bad information - I came across a blog post talking about ZFS tweaks and things to do to make it faster. Let’s be clear here. ZFS is fast given fast disks and fast hardware, but it is not as fast as something like XFS. It’s not designed to be; priority 1 is data integrity. Thus, one of the things suggested in that post (http://icesquare.com/wordpress/how-to-improve-zfs-performance/) - turning ZFS checksumming off - is a terrible idea, and other commenters there agree. As a representative for my current storage vendor put it to me, the tweaks that really make a difference matter at the extremes - very low RAM and very large amounts of RAM. Aside from that, ZFS is tuned fairly well out of the box - with the exception of…
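If you want to check that nobody has “tuned” checksumming away on an existing pool, something like this will do (pool0 is our pool name; the dataset name is a placeholder):

    # Show the checksum property for every dataset; anything reporting
    # "off" is trading data integrity for a small speed bump.
    zfs get -r checksum pool0

    # Put a stray dataset back to the default algorithm.
    zfs set checksum=on pool0/some_dataset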
Don’t use dedup without realizing the cost. You need LOTS of RAM, or RAM plus LOTS of L2ARC, for this not to suck. As soon as your deduplication lookup table grows larger than RAM, every IO has to scan through a table that no longer fits in memory. This is VERY painful.
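You can estimate that cost before enabling dedup (a sketch; pool0 is our pool, and the ~320 bytes per in-core DDT entry is the commonly quoted figure, so treat the math as a rough estimate):

    # Simulate dedup against existing data and print a histogram of the
    # would-be dedup table, including the number of unique entries.
    zdb -S pool0

    # On a pool that already has dedup enabled, show the live DDT stats.
    zpool status -D pool0

    # Back-of-the-envelope RAM cost: entries x ~320 bytes each, e.g.
    # 500 million unique blocks x 320 B = ~160 GB of dedup table.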
Additional Performance Notes
The following are good resources: ZFS Sequential Read/Write Performance and ZFS RAID recommendations: space, performance, and MTTDL and Sample RAIDOptimizer Output.
What I have gleaned about mirrors versus 4-disk RAID-Z2 is: roughly half the IOPS with RAID-Z2, but a much higher reliability number. I think for my company’s usage pattern, I’m going to stick with my original idea of keeping it as simple as possible, which is mirroring. To improve reliability, I will ensure there are at least 2 hot spares, which increases MTTDL (Mean Time To Data Loss) by a whole lot.
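A sketch of that final layout, again with placeholder device names: each mirror is split across the two JBODs, with two hot spares (one per JBOD):

    # Either JBOD can die without taking the pool down, and the spares
    # cover ordinary single-disk failures.
    zpool create pool0 \
        mirror diskA0 diskB0 \
        mirror diskA1 diskB1 \
        mirror diskA2 diskB2 \
        spare  diskA3 diskB3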
In the end, ZFS is good. It’s far from perfect, but it’s certainly my first go-to for storage. XFS is a close second, especially since a lot of recent code cleanup made its performance even better than it already was.
Comments and thoughts appreciated.