Re: Is ZFS hard - or just a bit different?
The primary design ethos of ZFS is "Disks are crap. Deal with it" - instead of trying to deal with issues by layering lots of expensive redundancy all over the place and taking a big hit if a drive fails, it's designed around the _expectation_ that disks are cheap, will fail, and that performance shouldn't suffer when they do.
Vendors tend to package enterprise drives with ZFS but the actual design is intended to take advantage of consumer-grade drives. My test system uses a mixture of WD Reds and Seagate NAS drives, but it originally had WD Greens in it (they were crap and all failed, as did every single ST2000DM-series drive, but no data was ever lost).
1: It detects and fixes silent ECC failures on the fly (and pushes the fix back to disk) - this is important because the standard error rate for disks means roughly one bad sector will slip past ECC checking for about every 44TB read off a drive.
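For anyone who wants to sanity-check that sort of number, here's the back-of-envelope arithmetic. Drive datasheets typically quote an unrecoverable read error (URE) rate of 1 per 10^14 or 10^15 bits read; the exact figure for any given drive is an assumption, so treat this as illustrative only:

```python
# Rough expected interval between unrecoverable read errors (UREs),
# given a per-bit error rate from a drive datasheet. Illustrative only.

def tb_between_ures(ure_per_bit: float) -> float:
    """TB read, on average, before one bad sector slips past ECC."""
    bits_per_tb = 8e12          # 1 TB = 1e12 bytes = 8e12 bits
    return 1.0 / (ure_per_bit * bits_per_tb)

# Common datasheet figures:
print(tb_between_ures(1e-14))   # -> about 12.5 TB/URE (typical consumer spec)
print(tb_between_ures(1e-15))   # -> about 125 TB/URE (typical enterprise spec)
# The ~44TB figure above sits between those two specs.
```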
2: There's no regular downtime needed for FSCK - scrubbing is done on the fly (a huge advantage when you have hundreds of TB of FSes)
3: RAIDZ3 - yes, 3 parity disks. With large arrays comes the increased risk of multiple drive failures during a RAID rebuild. We've lost RAID6 arrays in the past and whilst you can restore from tape the downtime is still a pain in the arse. My test 32TB array (2TB drives) has been rebuilt on the fly a large number of times and some of the drives are over 7 years old. It has yet to lose any data.
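As a quick illustration of what that third parity disk costs you in space (a generic sketch - the disk counts below are made up for the example, not the layout of my array, and real-world usable space is lower once metadata and padding are accounted for):

```python
def raidz_usable_tb(n_disks: int, disk_tb: float, parity: int) -> float:
    """Approximate usable space in a single RAIDZ vdev: data disks x size.
    Ignores metadata/padding overhead, so real numbers come out lower."""
    if n_disks <= parity:
        raise ValueError("need more disks than parity devices")
    return (n_disks - parity) * disk_tb

print(raidz_usable_tb(11, 2, 2))  # RAIDZ2: 11x2TB -> 18 TB usable
print(raidz_usable_tb(11, 2, 3))  # RAIDZ3: 11x2TB -> 16 TB usable,
                                  # but one more drive failure tolerated
```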
4: Simpler hardware requirements. You don't need expensive RAID hardware (in fact it's a nuisance). ZFS talks directly to the drives and expects to be able to do so - this is where a number of "ZFS" vendors have badly cocked up what they're offering, crippling performance and reliability.
The money saved on HW raid controllers should be put into memory. You'd be surprised how many vendors are flogging multi-hundred TB systems with only 8 or 16GB ram. The more memory you can feed into a ZFS system the better it will perform - you need 1GB/TB up to about 50TB and that can be relaxed to about 0.5GB/TB above that, but more is better. Dedupe will massively increase both CPU and memory requirements and for most uses is not worth enabling (the exceptions are things like document stores or mail spools/IMAP stores, where it shines).
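That rule of thumb is easy to codify (the 50TB breakpoint and the ratios are the ones quoted above - it's a guideline, nothing official):

```python
def recommended_ram_gb(pool_tb: float) -> float:
    """RAM rule of thumb: 1GB/TB up to 50TB, 0.5GB/TB beyond that."""
    if pool_tb <= 50:
        return pool_tb * 1.0
    return 50.0 + (pool_tb - 50) * 0.5

print(recommended_ram_gb(32))   # -> 32 GB for a 32TB array like my test box
print(recommended_ram_gb(200))  # -> 125 GB for a 200TB system
```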
5: Read caching (ARC and L2ARC) - this is relatively intelligent. Metadata (directories) are preferentially cached. Large files and sequentially read ones are usually not - preference is given to small files read often, to minimize headseek.
L2ARC allocations use ARC memory (pointers). If there is not enough ARC allocated, then L2ARC will not be used or may not be fully used.
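To put numbers on that: every record held in L2ARC needs a header kept in ARC. The header size varies by OpenZFS release (I'm assuming roughly 70 bytes here; older releases used considerably more), so treat this as an order-of-magnitude sketch:

```python
def l2arc_header_overhead_mib(l2arc_gib: float, recordsize_kib: int = 128,
                              header_bytes: int = 70) -> float:
    """ARC memory (MiB) consumed by headers for a fully populated L2ARC.
    header_bytes is an assumption - check your OpenZFS release."""
    records = (l2arc_gib * 2**30) / (recordsize_kib * 2**10)
    return records * header_bytes / 2**20

print(l2arc_header_overhead_mib(512))      # 512GiB L2ARC, 128KiB records -> 280 MiB of ARC
print(l2arc_header_overhead_mib(512, 16))  # same device, 16KiB records -> 2240 MiB of ARC
# Small-record workloads make the L2ARC much more expensive to index.
```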
6: Write caching (ZIL and SLOG) - pending synchronous writes are committed to the ZIL first, then to the main array, then erased from the ZIL. The ZIL lives inside the ZFS array unless you have a dedicated SLOG device.
Important: SLOG and ZIL are NOT write caches. They're only there for crash recovery and are write-only under normal operation. ZFS keeps pending writes in memory and flushes them from there. The advantage of having the SLOG is that writes can be deferred until the disk is "quiet" and streamed out sequentially.
ZFS is designed to be autotuning, but some of its assumptions are wrong for dedicated storage systems (as opposed to general-purpose servers).
7: Tuning is important - autotuning the ARC will result in only half the memory being allocated. In a dedicated ZFS server like TrueNAS, you can set this to around 80-90% (only about 3-4GB is actually needed for the operating system) - and on top of that there's a tunable for how much metadata is cached (usually only about 20% of ARC). This can be wound up high on systems with lots of little files.
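A hypothetical sizing helper for that 80-90% figure, using the 3-4GB OS reservation from above (the actual tunable you'd feed the result into depends on platform - e.g. zfs_arc_max on Linux, vfs.zfs.arc_max on FreeBSD - and the reservation is a rule of thumb, not a measured number):

```python
def arc_max_bytes(ram_gib: float, os_reserve_gib: float = 4.0) -> int:
    """Suggested ARC ceiling for a dedicated filer: all RAM except a
    small OS reservation. Purely a sketch of the rule of thumb above."""
    return int((ram_gib - os_reserve_gib) * 2**30)

# e.g. on a 32GiB box, leave 4GiB for the OS:
print(arc_max_bytes(32))   # -> 30064771072 bytes (28 GiB, ~87% of RAM)
```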
There are other tweaks you can make. The IOPS reported are pretty poor compared to the rusty arrays here.
The thing to bear in mind about ZFS is that it is entirely focussed around data integrity. Performance comes second. It will never be "blisteringly fast", but you can guarantee that the bits you put in are the same bits you get out.
That said, the ZFS arrays here are at least 10 times faster than equivalent hardware using EXT4/GFS/XFS filesystems when serving NFS and the caching (effectively a massive hybrid disk) means that head thrash on repeated hits to the same files is nonexistent. If you have a network /home this is an important feature.
I'm getting upwards of 400MB/s out of the TrueNAS when backing up, whilst still running fileserving operations and with access latencies staying low enough that people don't notice. When head thrash starts happening on any disk system, latencies go sky high extremely quickly, so this is a critical parameter to keep an eye on.
Watch those graphs. If your ARC or L2ARC aren't growing to maximums then you're doing something wrong.