Production ready?
I've been using it in production on Solaris and Solaris derivatives for years now. Hella fine FS. I'm happy for all you Linux guys who finally have a reasonable FS.
The maintainers of the native Linux port of the ZFS high-reliability filesystem have announced that the most recent release, version 0.6.1, is officially ready for production use. "Over two years of use by real users has convinced us ZoL [ZFS on Linux] is ready for wide scale deployment on everything from desktops to super …
Amen. That's why I stopped using XFS
BTRFS may be lots of things, but it's not particularly robust. That's why I stopped using it.
ZFS (so far) has been bulletproof. As for Linux versions, it's available for Debian/Ubuntu and Red Hat/clones.
If you want a commercially supported version, there's Nexenta.
They all work - and ZFS is the only FS for linux which can detect and repair disk ECC failures (others can detect, but not repair)
Apples and oranges; it really helps to know what the heck you're talking about.
Comparing XFS (which I've been using myself on Linux for ages) and ZFS is so absurd it's not even remotely funny. For the record: I personally prefer XFS over ext3 and ext4 (see below).
Let's see here... ZFS allows you to set up one huge storage pool and then create virtual filesystems which all share that pool. Meaning: I hope we can all agree that using one huge filesystem in Linux / Unix is a bad idea, so at the very least you'd want separate /, /var and /home to make sure one doesn't interfere with the other. Now what happens if you notice that /var is gobbling up more space than is good for it?
With XFS you'd have no alternative but to change your setup (quicker log rotations, quicker removals, etc.), take the system down to resize (an outage, which is a big no-no in production), or perhaps set up a whole new box and move the data over (if space really is a huge problem while uptime matters too).
ZFS? Well, you simply change this on the fly. You can resize filesystems all you want, you can set up quotas, hard or soft, and you can do basically whatever you want while the system keeps running.
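To make that concrete, here's roughly what the on-the-fly juggling looks like (pool and dataset names like "tank" are made up for the example):
zfs create tank/var
zfs create tank/home
zfs set quota=20G tank/var        # cap /var so runaway logs can't starve the rest of the pool
zfs set reservation=5G tank/home  # guarantee /home a minimum slice of the pool
zfs set quota=none tank/var       # change your mind later, no downtime needed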
Then there's the issue of backups. One of the reasons I favour XFS is the xfsdump/xfsrestore pair. It doesn't only make a filesystem snapshot like dump/restore does; it also allows you to restore your stuff interactively, on a per-file basis if you need to. Last time I checked, dump/restore simply didn't work at all anymore on ext4, and it has given many issues in between (up to last year, it seems). XFS just kept working ;-) (this is one of the main reasons I prefer XFS; ext4 is a filesystem where the restore tools stopped working? For real?!).
ZFS, on the other hand... It has snapshots as well as dump/restore-style tools (though those were a bit flaky too; you couldn't easily restore parts, nor restore to a smaller filesystem).
But snapshots FTW. It basically means you make a backup in a second or so, and then continue working. This is especially true if your storage is completely redundant (RAID 5/6 or so). You can make as many snapshots as you have disk space for, and of course also remove older snapshots and such.
And needless to say, a restore can either be a complete rollback or you simply get individual files back.
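For the curious, a rough sketch of that snapshot workflow (dataset and snapshot names are invented):
zfs snapshot tank/home@before-upgrade       # near-instant; only blocks that later change cost space
ls /tank/home/.zfs/snapshot/before-upgrade  # copy individual files straight back out of the snapshot
zfs rollback tank/home@before-upgrade       # or roll the whole dataset back
zfs destroy tank/home@before-upgrade        # drop old snapshots when space gets tight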
These are only two points where ZFS differs from XFS, but I hope you realize comparing the two like that is simply absurd.
PS: I know about the online resizing capabilities of XFS, btw. But the same story goes: it's simply not comparable AT ALL.
@ShelLuser your response, while undoubtedly excellent, makes the chronic error of failing to address what I ACTUALLY wrote about, which was the PRODUCTION READY remark (emphasis added, since you missed it the first time). I would have thought my "few petabytes" remark may have been a clue that I am well aware that ZFS is a _different_ sort of beast (including a volume-manager substitute), even if the article hadn't pointed that out anyway.
As to the remarks about online FS recovery tools, I can honestly say that I've been involved with industrial-strength (well, OK, military-industrial strength) XFS filesystems for the past 15 years and have never felt the lack. Possibly this is related to the quality of the hardware being used?
Anyway, as the watchful may have noticed, I absolutely acknowledge the virtues of ZFS. Unfortunately, until/unless it becomes part of a standard distro, it's not a lot of use.
"......Unfortunately, until/unless it becomes part of a standard distro, it's not a lot of use." Agreed, ZFS has its uses and advantages. What I object to is the manner in which some people seem determined to force it on the Linux community with no regard for its technical limitations, and then berate those that dare to point out that there are other options already in use and already in the kernel. After suffering their browbeatings for a while you start to wonder why is it they are so determined to shout down any opposition.....
@Malcolm Weir,
Recent research shows that XFS is not really safe. It does not protect your data against corruption, and it also does not detect all types of errors. Here is a PhD thesis on the data protection capabilities of XFS, JFS, ext3, etc:
http://www.zdnet.com/blog/storage/how-microsoft-puts-your-data-at-risk/169
The conclusion is that none of those filesystems were designed for data protection.
When you have a small filesystem, say a few TB, there is not much risk of silently corrupted data. But when you venture into Big Data of many, many TB or even PB, there is always silently corrupted data somewhere; just read about the experience at Amazon.com:
http://perspectives.mvdirona.com/2012/02/26/ObservationsOnErrorsCorrectionsTrustOfDependentSystems.aspx
"...Another frequent question is “non-ECC mother boards are much cheaper -- do we really need ECC on memory?” The answer is always yes. At scale(!!!!!!!!), error detection and correction at lower levels fails to correct or even detect some problems. Software stacks above introduce errors. Hardware introduces more errors. Firmware introduces errors. Errors creep in everywhere and absolutely nobody and nothing can be trusted....Over the years, each time I have had an opportunity to see the impact of adding a new layer of error detection, the result has been the same. It fires fast and it fires frequently. In each of these cases, I predicted we would find issues at scale. But, even starting from that perspective, each time I was amazed at the frequency the error correction code fired..."
Most of the time, you won't even notice you have corrupted data, because the system will not know it, nor detect it. For instance, just look at the spec sheet of any high-end Fibre Channel or SAS disk, and it will always say something like "one unrecoverable error for every 10^16 bits read". There are always cases that error-repairing algorithms cannot handle: some errors are uncorrectable, and some are not even detectable. Here is more information, with lots of research papers on error detection:
http://en.wikipedia.org/wiki/ZFS#Data_integrity
@Alan Brown
"They all work - and ZFS is the only FS for linux which can detect and repair disk ECC failures (others can detect, but not repair)"
This is not true. Read the research above. Other filesystems cannot even reliably detect errors, let alone ECC failures or other types of failure such as ghost writes.
OTOH, researchers have tried to provoke ZFS by injecting artificial errors, and ZFS detected and recovered from all of them. No other filesystem, nor hardware RAID, can do that. THAT is the reason ZFS is hyped, not because it is faster or because of features such as snapshots; who cares about performance if your data is silently altered without the system even noticing?
http://research.cs.wisc.edu/wind/Publications/zfs-corruption-fast10.pdf
ZFS production ready on Linux? I doubt that. Linux has a long history of cutting corners just to win benchmarks; safety suffers as a result. See what Ted Ts'o, the ext4 maintainer, writes:
"In the case of reiserfs, Chris Mason submitted a patch 4 years ago to turn on barriers by default, but Hans Reiser vetoed it. Apparently, to Hans, winning the benchmark demolition derby was more important than his user's data. (It's a sad fact that sometimes the desire to win benchmark competition will cause developers to cheat, sometimes at the expense of their users.)...We tried to get the default changed in ext3, but it was overruled by Andrew Morton, on the grounds that it would represent a big performance loss, and he didn't think the corruption happened all that often (!!!!!) --- despite the fact that Chris Mason had developed a python program that would reliably corrupt an ext3 file system if you ran it and then pulled the power plug "
The conclusion is that Linux cannot be trusted, because of all this cheating. Linux users are prematurely declaring Linux tech safe when it is not. It is almost as if Microsoft declared ReFS and Storage Spaces production ready; that would be funny. Just google people's experiences with them.
I really doubt BTRFS will be production ready soon. ZFS is over ten years old and we still find bugs in it. There are sysadmins who do not trust ZFS because it has not been tried enough; it is too new and fancy. It takes decades before a filesystem is proven. Even when/if BTRFS gets declared production ready, it will take years.
ZFS on linux is production ready? Hmmm....
If you're going to mention Vijayan Prabhakaran's dissertation, it'd be nice to actually cite it, and not some handwaving six-year-old editorial on ZDNet.
For those who are interested, it's at http://research.cs.wisc.edu/wind/Publications/vijayan-thesis06.pdf. It's now seven years old, but it's still of interest, though I wouldn't want to take what it says about the various filesystems as gospel without looking into what may have changed in them since 2006.
"ZFS on linux is production ready? Hmmm...."
It's always a question of: how much production readiness do you want?
Is it production-ready enough? I guess so. You left out this little thing in the middle of the text, written in 2009:
"In the case of ext3, it's actually an interesting story. Both Red Hat and SuSE turn on barriers by default in their Enterprise kernels. SuSE, to its credit, did this earlier than Red Hat. We tried to get the default changed in ext3, but it was overruled by Andrew Morton...."
Is the sky falling? Evidently not. Is it getting better? Yes! And seriously, is there anyone (except the ones who like riding bikes where their unprotected balls are 5 cm from the tarmac) who still uses ReiserFS?
Also:
BARRIERS ON BY DEFAULT IN EXT4: YES PLEASE and Enabling/Disabling Write Barriers in Fedora 14
@Kebabbert It is undoubtedly true that almost all commercial file systems depend on the storage accurately, well, storing data. It is undoubtedly lovely that ZFS provides a mechanism (at some cost) to improve the quality of the storage subsystem by adding error checking features.
It is NOT true to assert that XFS (or anything else) is "unsafe" simply because they do not have those error checks. Error checking can be implemented in many different places and in many different ways, and the fact that the ZFS folks have decided there is One True Way is irrelevant to the reality that, if the underlying storage fails in various ways, you may get hurt -- which is true even with ZFS, because as a non-clustered FS, if the host croaks, you are down. CXFS (as an example) is immune from those sorts of errors.
So is CXFS "safe" and ZFS "not safe"? Of course not: XFS (underlying CXFS) is vulnerable to certain types of failure, and ZFS is vulnerable to other types. Yer pays yer money and yer takes your choices.
Meanwhile CXFS + dmapi is my friend.
Malcolm Weir,
"....It is NOT true to assert that XFS (or anything else) is "unsafe" simply because they do not have those error checks. Error checking can be implemented in many different places and in many different ways, and the fact that the ZFS folks have decided there is One True Way is irrelevant ..."
Yes, my assertion is TRUE. Let me explain. There are lots of checksums in every domain: checksums on disk, in ECC RAM, on the interface, etc. As my Amazon link above shows, there are checksums everywhere; every piece of hardware has checksums, implemented in many different places and in many different ways. Does this massive checksumming help? No. Let me explain why.
The reason all these checksums do not help is this:
Have you ever played that game as a kid? Lots of children sit in a ring, and one kid whispers a word to the next kid, who whispers it on, and so on. At the end of the ring the words are compared, and they always differ: the word got distorted somewhere along the chain.
Lesson learned: it does not help to have checksums within a single domain. You need checksums that pass through the boundaries, so you can compare the checksum from the beginning of the chain with the one at the end. Are they identical? End-to-end checksums are needed! When data passes a boundary it might get corrupted, and within the next domain the corrupted data will happily carry a good checksum. That does not help you. You must always compare the first checksum with the last one, and this is what ZFS does.
ZFS is monolithic: it is a RAID manager, filesystem, etc., all in one. Other solutions have a separate RAID layer, a separate filesystem, a separate RAID card, and so on. There are many different layers, and the checksum cannot be passed between them. ZFS has control of everything from RAM down to disk because it is monolithic, and therefore it can compare end to end. Layered solutions cannot do this.
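In practice that end-to-end verification is exercised like this (pool name assumed; a scrub is just the explicit version of the check ZFS also performs on every read):
zpool scrub tank      # re-read every block and verify it against the checksum held in its parent
zpool status -v tank  # per-device read/write/checksum error counters, plus any files found to be damaged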
For instance, ZFS can detect faulty power supplies whereas other solutions cannot. If the power supply is flaky, ZFS will notice data corruption within minutes and warn immediately. The filesystems previously running on the same computer had not noticed anything:
https://blogs.oracle.com/elowe/entry/zfs_saves_the_day_ta
And ZFS also immediately detects faulty RAM DIMMs. ZFS even detects faulty switches! Here is a Fibre Channel switch that was corrupting data; ZFS was the first to detect it, after the problem had gone unnoticed for some time:
http://jforonda.blogspot.se/2007/01/faulty-fc-port-meets-zfs.html
Please don't tell me that other filesystems or hardware RAID can detect faulty switches, because they cannot. If ZFS stores its data on a storage server via a switch, ZFS can detect any problem in the path, because it compares what is on disk with what is in RAM. End to end. No one else does that, so they cannot detect faulty switches, or faulty power supplies, or....
Sun learned that isolated checksums do not help. CERN confirms this in a study: "checksumming is not enough, you need end-to-end checksums (zfs has a point)". I can google the CERN study for you if you wish to read it. The point is that ZFS does end-to-end checksums, whereas other solutions do not. Adding checksums everywhere does not give you a safer solution; you need end to end, which is what ZFS does.
Do you understand now why ZFS is safe, and other solutions are not?
Kebbie, every one of your crusading posts just makes the whole idea sound even more silly. You can't just say to the people that have between them happily and successfully run millions of systems over the years on solutions other than ZFS that they were "all wrong and ZFS is the only right answer". They will just laugh at you.
"....There are lot of children sitting in a ring...." Hmmm, good thing I use arrays to store my data and not groups of children then!
".... ZFS can detect faulty power supplies ...." <Yawn> Most servers and arrays I know of can do this for themselves already by seperate PSU monitoring software. In fact, since many of them link into remote support solutions, they do it BETTER than ZFS in that they will get a replacement PSU out to site whilst the ZFS admin is still working through the logs looking for the ZFS warning on the PSU. Fail!
"...<Yawn> Most servers and arrays I know of can do this for themselves already by seperate PSU monitoring software..."
Well good for them. But the point is, ZFS can detect faulty PSU without additional software. The data corruption detection of ZFS is so strong it can even detect faulty PSU without additional software. People report that ZFS detected faulty SATA cables. Detected faulty fibre channel switches. Faulty ECC RAM dimms. etc. All this, without any additional software.
This is a a true testament to the extremely strong data integrity of ZFS, which surpasses every other filesystem on the market. Or do you know of any other filesystem or storage system that can do this?
As CERN says about hardware raid:
Measurements at CERN
- Wrote a simple application to write/verify 1GB file
- Write 1MB, sleep 1 second, etc. until 1GB has been written
- Read 1MB, verify, sleep 1 second, etc.
- Ran on 3000 rack servers with HW RAID card
- After 3 weeks, found 152 instances of silent data corruption
- Previously thought “everything was fine”
- HW RAID only detected “noisy” data errors
- Need end-to-end verification to catch silent data corruption
This shows that hardware RAID does not offer real data integrity and should not be trusted. I know that you trust hardware RAID, but you shouldn't. I also know that you don't think ECC RAM is necessary in servers, but it is. I have said umpteen times that you should read the research on data corruption, but you refuse. I don't really understand why you reject all the research on this matter...
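For anyone who wants to try something in the spirit of the CERN probe on a single box, a very rough sketch (file name, sizes and interval are arbitrary; the real study ran on roughly 3000 servers for weeks):
f=/data/probe.bin
dd if=/dev/urandom of=$f bs=1M count=1024 oflag=direct 2>/dev/null  # write 1GB of data straight to disk
expected=$(sha256sum $f | cut -d' ' -f1)                            # record its checksum
while sleep 3600; do
  actual=$(dd if=$f bs=1M iflag=direct 2>/dev/null | sha256sum | cut -d' ' -f1)
  [ "$actual" = "$expected" ] || echo "silent corruption detected on $f"
done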
This post has been deleted by its author
Yes, it was a joke, humour.
The classical mechanical degrees of freedom of a proton (disregarding the quarks and gluons that compose such a particle) in 3D space are, surprisingly enough, 3.
As you rightly state, and just as the Heisenberg uncertainty principle says, one cannot precisely measure both the momentum and position of a quantum object such as a proton at the same time. However one could precisely measure the position of such a particle. Of course, once the measurement has been made, the particle would have moved, if not by the action of the measuring device, by zero point fluctuation alone.
Still, it is his/her Noodliness that is doing the accounting here. Are you suggesting his/her Noodliness is anything less than Omni-everything? That is heresy and you are likely going straight to the restaurant at the end of the Universe without any Bolognese sauce ;-)
> The classical mechanical degrees of freedom of a proton (disregarding the quarks and gluons that compose such a particle) in 3D space are, surprisingly enough, 3.
They clearly would be 9: 3 for the position, 3 for the momentum, and 3 for the axis of rotation.
> However one could precisely measure the position of such a particle.
No. Because you need too much energy to do that.
> by zero point fluctuation alone.
The impossibility of determining position and momentum at the same time is better explained by the olde Fourier transform: peaks in the frequency domain mean spread-out functions in the time domain, and conversely, and it doesn't involve Marvel Comics Science Terminology.
> Just because WE can't know both momentum and position, has nothing to do with her Noodliness.
That's rubbish. Read about the Bell inequalities, whose violation has been demonstrated pretty much conclusively. And with them goes the idea that a logically consistent underlying state is possible which would define both momentum and position simultaneously (or the equivalent in polarisation bases for photons, anyway). So there's nothing for her Noodliness to actually know.
> Hard to do when there aren't enough atoms or energy to keep that much data on the planet Earth.
About the energy, I can agree. The energy needed to sufficiently reduce the entropy of whatever you use to make it meaningful storage for this kind of data amount would be hard to get.
But atoms are plentiful. If you figured out atomic-level storage somehow and kept one bit per atom, you'd only need a silicon cube with a side of 18 km. That may seem huge, but it's tiny compared to Earth. And if you had lots of energy you could easily process the entire Earth into common silicon chips to obtain this kind of data capacity and beyond.
Practical? Perhaps not. But sure we have the atoms.
Your NEEDS will GROW. ZFS can help!
“We're in the process of receiving two visitors from Earth.” Gisela was astonished. “Earth? Which polis?” “Athena. The first one has just arrived; the second will be in transit for another ninety minutes.” Gisela had never heard of Athena, but ninety minutes per person sounded ominous. Everything meaningful about an individual citizen could be packed into less than an exabyte, and sent as a gamma-ray burst a few milliseconds long. If you wanted to simulate an entire flesher body — cell by cell, redundant viscera and all — that was a harmless enough eccentricity, but lugging the microscopic details of your “very own” small intestine ninety-seven light years was just being precious.
This is excellent news for everyone, and well done to the guys/girls who've done this. Thank you :)
However I do think it's a real shame that licensing concerns prevent the inclusion of this in the linux kernel. Whatever those concerns are, they're surely relatively petty in comparison to the benefits we'd all get. Surely for something as significant as ZFS some rules could be changed specifically to accommodate it. Couldn't Linus or whoever just scrawl "except for ZFS which is ok so far as we're concerned" somewhere in the middle of GPL2?
It's still open source code. It's not as if anyone's going to be chasing anyone else for money if they use it. It seems unnecessarily obstinate, a bit like refusing a fantastic Christmas present simply because one's favourite Aunt has used gift paper that you didn't like... It didn't seem to worry anyone in the FreeBSD camp.
Hmmm, can I hear the drone of a cloud of hornets rushing towards me from a recently upset nest?
You probably wouldn't want that - apart from a limited set of exceptions, the GPL is not modifiable. So if code is labelled as being GPL then you don't have to inspect it closely to see if they snuck in a clause like "I get to sleep in Bazza's bed wearing big muddy boots", because if they did then it's not GPL at all.
Absolutely agreed that this incompatibility is a PITA - according to Wiki it comes from the CDDL side, with Sun electing to deliberately make it incompatible with GPL, for reasons apparently obscure (maybe idealist, maybe corporate screwing around). But such pains seem something of a virtue in the eyes of GNU stalwarts - better to lose some convenience than dilute essential not-as-in-beer freedom.
"with Sun electing to deliberately make it incompatible with GPL, for reasons apparently obscure (maybe idealist, maybe corporate screwing around)"
Couldn't it occur to you that maybe - just maybe - Sun, like others, do not like the GPL?
FreeBSD is also actively removing all GPL stuff from the base (I think replacing gcc with clang in version 10.X is the last thing on the list).
"Is llvm-as and llvm-ld/link usable instead, in a clang-based work-flow, or have they got rid of them completely now ?"
I'm not sure. There's some chatter here:
http://lists.freebsd.org/pipermail/freebsd-current/2011-June/025558.html
"Afaik there is no bsd licensed assembler or linker. clang isn't enough if it is still using binutils.
http://sourceforge.net/apps/trac/elftoolchain/"
True, and thanks for the link. I don't know when "ld" and "as" will be ready, but as your link points out, it's on the cards, and the intention is to have them done in time for 10.X
https://wiki.freebsd.org/BSDToolchain
Afaik there is no bsd licensed assembler or linker. clang isn't enough if it is still using binutils.
Give us a chance, we're getting there! Lots of the standard tools have recently been ported from their GPL equivalents; iconv, sort and grep are all on their way to being fully replaced, the clang introduction has been very good, and the toolchain will land by FreeBSD 12, I'd guess.
IIRC, the 'problem' with CDDL and GPL is not that the CDDL prohibits the GPL; it is that the GPL excludes CDDL code, since that code cannot be re-licensed as GPL. CDDL isn't a problem for a BSD-licensed OS, since we just want to use the code, not re-license it.
The zfsonlinux guys are very active in the ZFS community, and have fixed lots of bugs in the upstream (which is Illumos; open source ZFS has little to do with Oracle/Sun anymore). The only feature missing from ZFS, Block Pointer Rewrite¹, will probably come from zfsonlinux if it comes from anywhere.
¹ Block Pointer Rewrite is the ability to dynamically restructure a pool by adding or removing vdevs, e.g. adding a single disk to a 4-disk raidz vdev to make it a 5-disk raidz vdev.
I was sweeping the "GPL - bah!" side under "maybe idealist", though calling out separately "maybe pragmatic" would have been clearer. The shenanigans angle was what I read into http://en.wikipedia.org/wiki/Common_Development_and_Distribution_License#GPL_incompatibility - though of course on a contentious issue WP may be unusually unreliable. In fact I was rather hoping some Sun greybeard would pop up here with a juicy recollection ("ah, the meeting where Scott did his squeaky-outraged-RMS voice and Schwartzy laughed so much the rubber band fell off his ponytail").
"I was sweeping the "GPL - bah!" side under "maybe idealist", though calling out separately "maybe pragmatic" would have been clearer."
Oh, I see now. Sorry for being (slightly) sarcastic in my reply!
From the wiki link:
"....that the engineers who had written the Solaris kernel requested that the license of OpenSolaris be GPL-incompatible. "Mozilla was selected partially because it is GPL incompatible. That was part of the design when they released OpenSolaris...."
Assuming that is correct, that may again have been because they didn't like the inherent rules of the GPL rather than "It's GPL, ARRRGH".
But I see your point. There are fanbois on both sides :)
However I do think it's a real shame that licensing concerns prevent the inclusion of this in the linux kernel.
Why on earth would you want the file system driver in the kernel?
As for wanting to use ZFS: if you really need the features then it's probably worth looking at some of the scale features that Solaris/Illumos offer. Best tool for the job, etc.
As for licence madness: the GPL was set up to provoke precisely this kind of conflict and to try to force the GPL onto other projects. Reap as you sow, be done by as you did, etc.
> Sun didn't want that constraint
Well then that's on Sun. The idea of releasing the source and being specifically hostile to the GPL is a bit of a contradiction. You either are for end user freedom or you aren't. GNU was already here. Linux was already here. Sun chose to be antagonistic to it.
It's not up to the oldest libre projects to pander to the pro-corporate inclinations of the latest shiny thing.
Silent in that errors are noticed via the checksum; the checksum indicates that another copy of the data should be returned instead, and bad blocks (and more) are automatically marked as bad. All the alerting you'd want is also happening, otherwise no one would like ZFS.
As in: the disk didn't fail out, it's just returning bad data. Frankly, the number of times I've seen this lately makes me never want to touch a non-checksummed FS again.
When we discover a disk doing this we DO give warnings, but most other FSes don't check for it.
The problem is that disks do checksums, but sometimes they don't see an error. It could be one of the rare patterns that happens to match the algorithm and slips by, a similar fault in the data going to/from the HDD, etc.
Most HDDs claim something like a 10^-14 error rate, but a 4TB disk holds 3.2x10^13 bits...
Hence XFS applies additional checksums on top of the HDD's own checks to provide a much reduced chance of an error getting through. MUCH reduced.
There was a paper from CERN a few years back on this sort of thing; it covered RAM errors, HDD controller errors, disk errors, etc. The bottom line was that if you have a lot of data and/or valuable data, you need more verification than HDDs offer internally!
In-flight data corruption and phantom reads/writes (the data written/read checksums correctly but is actually wrong) are undetectable by most filesystems because they store the checksum with the data. ZFS stores the checksum of each block in its parent block pointer, so the entire pool self-validates.
See: https://blogs.oracle.com/bonwick/entry/zfs_end_to_end_data
See also Prabhakaran's dissertation, mentioned above; he discusses why disk checksums, RAID, etc, are not sufficient to produce a probability of undetected error that's low enough for many applications.
"......he discusses why disk checksums, RAID, etc, are not sufficient to produce a probability of undetected error that's low enough for many applications." <Yawn> More mythical bit-rot. Yeah, it reminds me of that guy that had a scientific and indisputable mathematical proof that the is no way bumblebees can ever fly, when reality shows the opposite. I think it is often used to show the fallacy of blindly accepting statistical analysis skewed to show a desired outcome.
When that "mythical" bit-rot blows away a good chunk of your encrypted filesystem, you will be singing from another sheet - and from another hole.
Seriously, I sure hope responsible people keep you in the cellar, away from anything critical. The way you are sounding off, you must be pulling a good spiel and bedazzling quite a few of the people with the purse strings, so probably not.
A Conversation with Jeff Bonwick and Bill Moore, September 1, 2007
BILL MOORE We had several design goals, which we’ll break down by category. The first one that we focused on quite heavily is data integrity. If you look at the trend of storage devices over the past decade, you’ll see that while disk capacities have been doubling every 12 to 18 months, one thing that’s remaining relatively constant is the bit-error rate on the disk drives, which is about one uncorrectable error every 10 to 20 terabytes. The other interesting thing to note is that at least in a server environment, the number of disk drives per deployment is increasing, so the amount of data people have is actually growing at a super-exponential rate. That means with the bit-error rate being relatively constant, you have essentially an ever-decreasing amount of time until you notice some form of uncorrectable data error. That’s not really cool because before, say, about 20 terabytes or so, you would see either a silent or a noisy data error.
JEFF BONWICK In retrospect, it isn’t surprising either because the error rates we’re observing are in fact in line with the error rates the drive manufacturers advertise. So it’s not like the drives are performing out of spec or that people have got a bad batch of hardware. This is just the nature of the beast at this point in time.
I ask you to post something technical and you post the Sunshiner brochure? That's like posting the Ford Mondeo brochure that says they make the best saloon, better than BMW, Mercedes, Jaguar, Honda, etc., etc. Seriously, if all you read is Sunshiner marketing pieces and FUD, is it any surprise you don't have a clue?
"When that "mythical" bit-rot blows away a good chunk of your encrypted filesystem....." That's just it - thirty-odd years of waiting and no mythical bit-rot, and that's working with some of the biggest database instances in Europe (ironically, most of them being Oracle!). Guess I am either extremely and unbelievably lucky and beating the odds you Sunshiners insist must be right, or you're just talking male genetalia. I'm going with the latter and your posts consistently add weight to my argument.
Now, unless you actually have something technical to add to the thread, best you leave it to the grown ups.
Chris, they are not random numbers at all, otherwise there wouldn't be dots in there. Using the scheme you describe is fine where there is a single build number, but why start with 0.x if you have no mechanism to increment that 0? The x.y.z system is well documented for programming: x is the major version, starting with 0 for pre-release and then incrementing for major changes such as new functionality; y is the minor version, incrementing for small changes such as a small feature addition; z is the build number, although more likely an even more minor version than an actual build number - this is generally for bug fixes. I don't think anyone actually minds which versioning scheme you use, although Firefox are pushing their luck, but don't bastardise one into the other to try to look professional while having a complete lack of understanding.
No idea why you've used the joke icon; version numbering used to be an easy way to tell whether something was usable, useful and stable. Now I sit here using Firefox v19 for no better reason than some developer thought it would be funny to jump major versions with no major changes. Of course, in the old days people used to formally learn how to develop code rather than picking up just enough in their bedroom and then calling themselves programmers because they've written yet another notepad app for Linux...
" ... Btrfs, a GPL-licensed filesystem that has been under development at Oracle since 2007 and which offers similar features to ZFS."
While this is what the BTRFS developers have been shouting since the start, shouting it repeatedly doesn't make it true. BTRFS's features and maturity, including on Linux, are nowhere near those of ZFS.
BTRFS RAID 5/6 support is pre-experimental, and there is no n+3 RAID option.
Scrub cannot repair errors when used in RAID 5/6 configuration.
Deduplication support has been dropped even from the planned features list.
There is no tiered caching (ZFS L2ARC) option that allows you to have a big array of slow disks with an SSD for caching the hottest data/metadata or write-caching commits to boost performance (a rough sketch of the ZFS equivalent follows below).
I could go on, but anyone telling you that BTRFS is even comparable to ZFS is clearly talking from the wrong orifice.
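For reference, this is roughly how the tiered caching and write acceleration mentioned above are bolted onto a ZFS pool (pool and device names are invented):
zpool add tank cache /dev/disk/by-id/ssd-cache0                               # L2ARC: SSD read cache for the hottest data/metadata
zpool add tank log mirror /dev/disk/by-id/ssd-log0 /dev/disk/by-id/ssd-log1   # SLOG: mirrored SSDs to absorb synchronous writes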
OK, I'll play - when is ZFS going to be able to cluster? Slightly important to real business users, which is why SLES and RHEL will be ignoring ZFS again. ZFS is suitable for desktops and that's about it. OCFS2 will remain the real choice from Oracle for commercial users and that is why it is in the Linux kernel and not ZFS. But don't let those simple facts stop you and the rest of the ZFS cheerleaders, you do provide plenty of amusement.
Knowledge Matt® now in a new edition of: "Undistributed filesystem is undistributed".
Film at 11.
Also: "ZFS is suitable for desktops and that's about it."
Words fail. These must be *big* desktops.
You can check out some of the desktops here. They are evidently using ZFS on backend nodes to provide a clustered filesystem on top. Hmm.....
".....Words fail....." Strangely it didn't seem to affect your ability to post complete cobblers.
"....These must be *big* desktops......" That is the problem - ZFS is not suitable for the big filesystems it is aimed at on the simple grounds clustering is not there. ZFS is fault-tolerant but not highly-available, it is just one great big SPOF. Just try convincing any large organisation that they should risk putting a zetabyte's worth of production data on one OS image without clustering to give them a failover copy - not going to happen! Why do you think SAN arrays for years have had replication as a core offering? Other filesystems such as OCFS2 already have it and are already production-ready and in the Linux kernel (since 2.6.something IIRC) seeing as OCFS2 is cleared under GPL.
"....has been working to integrate ZFS with Lustre on Solaris....." Oh, so production ready then, right? Not! I do like how you Sunshiners deny clustering is necessary for ZFS, but then point to a project where you have to use Lustre to give you the clustering/distributing that ZFS can't do alone! Trying to bolt together a Frankenstein network filesystem out of Luster and ZFS seems pretty pointless when Red Hat already have a superior option in GlusterFS, or hp with LeftHand.
Same question as always put up to make you Sunshiners shutup about ZFS - please supply the details of any FTSE 1000 organistaion using ZFS in production for a major system (CRM, billing, etc.). Don't worry too much, those of us that work in the industry already know the answer is none of them. I suggest you and the other ZFS fanbois could better spend your time reading up on the alternatives already out there. Try here for a start http://en.wikipedia.org/wiki/List_of_file_systems#Distributed_parallel_fault-tolerant_file_systems
Why insist on truck-lifting distributed filesystems into the discussion? Who mentioned them first? And what is a "Sunshiner"?
> Trying to bolt together a Frankenstein network filesystem out of Lustre and ZFS seems pretty pointless when Red Hat already have a superior option in GlusterFS, or HP with LeftHand.
Apparently some people who imply that they are "in the industry" actually believe that a distributed filesystem is a magic black box that drops off a vendor's conveyor belt, glittering in pixie dust, instead of being basically a bunch of machines, each with disk arrays carrying their own filesystem, plus a remotely accessible centralized lock manager and possibly a metadata server.
A place in marketing beckons. The BS levels are there. Bonus for mentioning "FTSE 1000" and some HP products.
"Why insist on truck-lifiting distributed filesystems into the discussion?....." Because you are failing to look at this from the perspective of a business rather than a techie. Businesses want high-speed, scaleable, redundant and reliable storage at a low price. Traditionally, they have had to relie on monolithic SAN arrays from the like of EMC, which give them all the features they want but cost an arm and a leg. Now, technologies such as GlusterFS, Left Hand and a number of other grid-style storage products give them all that they want plus are using commodity x64 hardware, offering lower costs.
Now, when you come along and say "We have a whizz-bang new filesystem called ZFS that can look after really large amounts of storage, but to do so you have to give up all concepts of hardware RAID and instead use a really big, monolithic server, with massive amounts of RAM, but you have no high availability", they tend to ask "So how is that better than GlusterFS?" or whatever product they have already had the salegrunt come round and do a slideshow on. And then it gets worse when you have to admit ZFS is still a project under development and not really ready for enterprise use, and you have no case studies for real enterprise use of it (real enterprises being the only people that will have the cash and inclination to buy those big monolithic servers with all those cores and RAM required to run really big ZFS instances).
And whilst you moan about whether GlusterFS or Left Hand have lock managers etc., what the business sees is a low cost storage solution that scales, is redundant, and uses commodity x64 architecture. No pixie dust required, just a bit of experience at the coal-face, thanks.
@Matt Bryant
"...you have to give up all concepts of hardware RAID and instead use a really big, monolithic server, with massive amounts of RAM..."
Who wants to use old-fashioned hardware RAID?
http://en.wikipedia.org/wiki/RAID#Problems_with_RAID
HW RAID is not even safe. Just read the research papers from NetApp, who rely heavily on HW RAID:
http://research.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.pdf
"A real life study of 1.5 million HDDs in the NetApp database found that, on average, 1 in 90 SATA drives will have silent corruption which is not caught by hardware RAID verification process; for a RAID-5 system, that works out to one undetected error for every 67 TB of data read."
@Matt Bryant:
"That is the problem - ZFS is not suitable for the big filesystems it is aimed at on the simple grounds clustering is not there."
OK, one post might have been a random brainfart, but after two you seem to have amply demonstrated you don't actually know what clustered file systems are, or are for.
1) You don't need a clustered file system to fail over - you fail over the block device and mount the non-clustered FS on the fail-over node, after fencing off the primary node that failed (a rough sketch follows at the end of this post). Concurrent access on a clustered FS may seem like a fancy feature until you find that your performance drops by a factor of 1000 when you start simultaneously working on files in the same directory from multiple nodes.
2) GFS, having been around for longer, has more traction than OCFS.
3) Every company that has used Solaris in the past 10 years or so is using ZFS. ZFS has been in Solaris since Solaris 10.
Cluster file systems like GFS, OCFS and VCFS lack a fundamentally important feature, and that is transparent data repair and recovery. Contrary to what you may have been led to believe by disk manufacturers, disks fail, and they lie about their defects in the most creative of ways to limit warranty claims. When this happens, traditional RAID will corrupt your data, one sector at a time. ZFS (and BTRFS to a lesser extent) will go and reconstruct your data from the surviving disks in the array, find a data set that checks out, and repair the error. Your expensive SAN and cluster file system cannot do that in anywhere near the number of cases that ZFS and BTRFS can.
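A rough sketch of what that active/passive failover looks like with ZFS on shared storage (node and pool names are invented; in real life a cluster manager such as Pacemaker drives this, and the dead node is fenced first):
# on the standby node, once the failed node has been fenced / powered off:
zpool import -f tank  # take over the pool; -f because the dead node never got the chance to export it
zfs mount -a          # mount its datasets and bring the service back up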
".....but after two you seem to have amply demonstrated you don't actually know what clustered file systems are, or are for....." Actually I was just trying to keep it to the level that a Sunshiner would understand and it looks like I failed to use a simple enough terminology. Maybe I should just draw pictures in crayon for you.
".....You don't need a clustered file system to fail over - you fail over the block device....." Slight problem - you can't share the block device under ZFS, it insists on having complete control right down to the block level. That is why ZFS is a bitch with hardware RAID: ".....When using ZFS on high end storage devices or any hardware RAID controller, it is important to realize that ZFS needs access to multiple devices to be able to perform the automatic self-healing functionality.[39] If hardware-level RAID is used, it is most efficient to configure it in JBOD or RAID 0 mode (i.e. turn off redundancy-functionality)...." (http://en.wikipedia.org/wiki/ZFS#Hardware_RAID_on_ZFS). ZFS also cannot present an unified filesystem namespace: ".....Some clustering technologies have certain additional capabilities beyond availability enhancement; the Sun ZFS Storage 7000 series clustering subsystem was not designed to provide these. In particular, it does not provide for load balancing among multiple heads, improve availability in the face of storage failure, offer clients a unified filesystem namespace across multiple appliances, or divide service responsibility across a wide geographic area for disaster recovery purposes....." (http://docs.oracle.com/cd/E22471_01/html/820-4167/configuration__cluster.html). That is why you need Lustre on top to actually provide clustering capability.
".....GFS, having been around for longer, has more traction than OCFS....." Which has nothing to do with ZFS.
".....Every company that has used Solaris in the past 10 years or so is using ZFS. ZFS has been in Solaris since Solaris 10....." I think you'll find most Slowaris 10 sufferers used UFS or VxFS, not ZFS, for the simple reason that Sun Cluster needed the use of one of those two and not ZFS : (http://docs.oracle.com/cd/E19787-01/820-7358/cihcjcae/index.html)
".....Cluster file systems like GFS, OCFS and VCFS lack a fundamentally important feature, and that is transparent data repair and recovery......" Ooh, nice feature-sell, only us customers decided clustering was actually a feature we really value. Try again, little troll!
/SP&L
@Matt Bryant
Actually, the wikipedia link is not true. It contains false information. This is not correct:
"...When using ZFS on high end storage devices or any hardware RAID controller, it is important to realize that ZFS needs access to multiple devices to be able to perform the automatic self-healing functionality[39]...."
If you read the link [39] it says the opposite:
"As an alternative you could use a special feature of ZFS: By setting the property copies with the zfs command you can tell ZFS to write copies of your data on the same LUN. As you control this on a "per dataset" granularity you can use it just for your most important data. But the basic problem for many people is the same: It's like RAID1 on a single disk "
Hence, it says that ZFS can guarantee data integrity on a single disk. ZFS does not need access to multiple devices, one disk will do. Read the link. Wikipedia is wrong on this.
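For illustration, the property in question is set per dataset (pool and dataset names are made up):
zfs set copies=2 tank/important  # keep two copies of every block of this dataset, even on a single disk
zfs get copies tank/important    # confirm the setting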
Another thing in the wikipedia article that is not correct is the advice that "the hardware raid should be configured as JBOD or RAID 0 mode". That is not right, because some hardware RAID cards add additional information to the disks and other stuff, and each additional layer confuses ZFS, so ZFS cannot guarantee data integrity when using HW RAID. If you are using a HW RAID card you need to configure it in JBOD mode, but the best option is to reflash the card's firmware so the RAID functionality disappears ("IT mode"), turning the HW RAID card into a simple HBA. Reflashing HW RAID cards into IT mode is actually common among ZFS users.
Here is the large investment bank Morgan Stanley talking about the benefits of migrating from Linux ext4 to ZFS (huge cost savings, and increased performance):
http://conferences.inf.ed.ac.uk/eakc2012/slides/AFS_on_Solaris_ZFS.pdf
Another thing in my link above: the investment bank Morgan Stanley, which is migrating away from Linux + ext4 to ZFS because of huge cost savings (a three-fold reduction in Linux servers) and increased performance, is using OpenAFS with ZFS:
http://conferences.inf.ed.ac.uk/eakc2012/slides/AFS_on_Solaris_ZFS.pdf
OpenAFS is distributed. It seems that these clustered/distributed filesystems (Lustre, OpenAFS, etc.) rely on a normal filesystem to do the actual data storage. That is where ZFS fits in. Lustre + ZFS rocks. OpenAFS + ZFS rocks. And so on.
LOL! So first you say that ZFS having problems with hardware RAID is a lie, then you admit that ZFS "gets confused" by hardware RAID exactly as I said. Your answer is to insist that hardware RAID is turned off, neatly destroying your own defence and underlining the point I made! Larry is not going to be giving you a second Sunshiners blogging prize if you carry on exposing the holes in ZFS like that.
@Matt Bryant
"...So first you say that ZFS having problems with hardware RAID is a lie,..."
Que? Can you quote me on this? Everybody knows that ZFS + HW RAID is a major no-no. I am quite active on forums where we discuss ZFS, and I always say that HW RAID + ZFS should be avoided.
ZFS can work correctly with hardware raid only if the hardware raid functionality is shut off, i.e. in JBOD mode or flashed away. If you insist on using HW RAID with ZFS, then ZFS can detect all errors, but it cannot repair all of them. That is your problem, not ZFS's problem.
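A minimal sketch of the recommended setup, assuming an HBA (or a card flashed to IT mode) and made-up device names:
zpool create tank raidz2 /dev/disk/by-id/ata-disk0 /dev/disk/by-id/ata-disk1 /dev/disk/by-id/ata-disk2 /dev/disk/by-id/ata-disk3
zpool status tank  # ZFS now sees the raw disks and can both detect and repair corruption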
".....Can you quote me on this?...." Sunday 21st, 21:50GMT - ".....Actually, the wikipedia link is not true. It contains false information....." The wiki in question talks about he issues of ZFS and hardware RAID, which you denied we're true and are now backtracking desperately.
"..... I am quite active on forums where we discuss ZFS....." I'm not surprised, but the problem for you is there are probably very few others here that read those forums as ZFS is of zero interest to them.
".....ZFS can work correctly with hardware raid only if the hardware raid functionality is shut off...." Which sounds like a problem to me, and is what I posted, and which you then denied and now admit. I assume your initial denial was simply that Sunshiners autonomic reflex to blindly deny any problems with ZFS. Glad we cleared that up.
##".....ZFS can work correctly with hardware raid only if the hardware raid functionality is shut off...." Which sounds like a problem to me, and is what I posted, and which you then denied and now admit. I assume your initial denial was simply that Sunshiners autonomic reflex to blindly deny any problems with ZFS.
You really don't understand why ZFS RAID is better than hardware RAID, do you. ZFS will work on hardware RAID just fine - it still has advantages over other file systems in that it will detect bit-rot, unlike everything else except BTRFS (and no, your precious cluster FSes won't detect bit rot). But if you use it with raw disks you get _extra_ advantages that hardware RAID simply cannot provide, such as self-healing, which traditional RAID5 is not capable of because there is no way to establish which set of data blocks constitutes the correct combination when they don't all agree. Having an end-to-end checksum gives you that capability. With RAID6 you could repair it, and Linux MD RAID6 will do so on a scrub, but neither hardware nor software RAID6 implementations do on-access integrity checking, unlike ZFS, so by the time your Linux MD RAID6 scrub has identified a problem, the chances are that your application has already consumed the corrupted data. And with hardware RAID, the vast majority of implementations don't even have transparent background scrub functionality. Expensive proprietary SAN or NAS boxes might have such a feature, but this goes completely against your original requirements of a high performance low cost solution. According to everything you have said, your chosen solution (a cluster file system) completely violates the requirements you originally listed (high performance, high durability, low cost): with the exception of GlusterFS, all the others require a SAN (unless you are using DRBD, but that solution is probably beyond you), and GlusterFS's durability, split-brain resistance and self-healing from the bit-rot that your hardware RAID silently exposes you to are either missing or immature compared to the alternatives.
I mean seriously, why are you so hung up on hardware RAID? Because you get to offload the XOR parity calculations onto the hardware RAID controller? A modern x86-64 processor can do that faster than any customized ARM on a RAID controller can. And then you are forgetting that traditional parity RAID (5, 6) suffers a substantial performance drop on writes that are smaller than the stripe size. ZFS does away with that because its stripe size is variable, and its copy-on-write nature allows it to perform a small write into a small stripe and commit the data across all the disks in the pool in a single operation. In comparison, traditional RAID has to read the rest of the stripe, update the parity and then write the updated data and the parity - a woefully inefficient operation.
Traditional RAID is a 20th century technology. It's time to get with the programme and take advantage of what people cleverer than you have come up with since then.
"..... ZFS will work on hardware RAID just fine....." Hmmm, seeing as even Kebbie has admitted that is not the truth it would suggest you're just piddling into a hurricane and trying to claim you're not only not getting wet but also there are advantages to being soaked in your own urine!
".....other file systems in that it will detect bit-rot,....." As expected, when all else fails it's back to the Sunshiner staple, mythical bit-rot.
"..... but this goes completely against your original requirements of a high performance low cost solution...." It's quite comic when Sunshiners try and push ZFS as low-cost, neatly ignoring that it requires massive amounts of very pricey RAM and cores in a single SMP image for the big storage solutions they claim it is ready for. ZFS above a few TBs in a desktop is more expensive. Come on, please pretend that a 32-socket SMP server is going to be cheaper than a load of two-socket servers. Don't tell me, this is the bit where you switch to the Snoreacle T-systems pitch <yawn>.
".....Traditional RAID is a 20th century technology. It's time to get with the programme and take advantage of what people cleverer than you have come up with since then." Problem for you is far, far cleverer people have come up with far better solutions than ZFS, and the harder zealots like you try to shove ZFS down our throats the more resistance you will face. As I said before, if you like it then you suffer it, just stop trying to force it on the rest of us with a clue. I'm quite happy to let you have some "amazing advantage" if you wish, but STFU about forcing it into the kernel, it's not wanted there. Putting anything Oracle and non-GPL into the kernel would be like garnishing a hotdog with cyanide - just the certain suicide mix Larry wants.
@Matt Bryant:
"Trying to bolt together a Frankenstein network filesystem out of Luster and ZFS seems pretty pointless when Red Hat already have a superior option in GlusterFS."
This just tells me that you haven't actually tried to use GlusterFS for something serious. The fact that they only twigged in the past year or so that you cannot have a split-brain-resistant distributed file system without hard fencing capabilities speaks volumes. I have used it extensively in the past (I'm the one that added the initial MDRAID+GlusterFS support into OpenSharedRoot), and I use it more or less daily at the moment (big data archives at work, because we need to glue together multiple boxes' storage space), and while it is OK for archived data that is read-mostly and tends to involve mostly linear transfers/scans of big files, you wouldn't want to run it on a system where you need a lot of small I/O on lots of files.
"This just tells me that you haven't actually tried to use GlusterFS for something serious....." I can't give you specifics, but I'll just say we had a production system that was not small, not archive, and hosted a MySQL database on GlusterFS for several thousand users, all making reads and writes of customer records. It was in production for just short of three years and has been replaced by GFS2, mainly because we were concentrating on RHEL, though Mad Larry's recent antics give us concerns over continuing with MySQL. You are right in that it was not one of our business-critical systems - they all run on UNIX and RAC - but it was definitely not trivial and performed quite happily with plenty of small read and write I/Os. As with all such comparisons, YMMV.
> please supply the details of any FTSE 1000 organistaion using ZFS in production for a major system (CRM, billing, etc.).
That's a pointless comment, since you know (or should know) that no-one can answer it. If you actually worked with those sorts of organizations in a serious position you'd know that they rarely, if ever, release that sort of information, or allow their suppliers to do so. I do work with such companies, and there are many that I would love to use as references for disaster recovery solutions, for example. Without exception they have all said "sorry, that's commercially sensitive info, we won't be a public reference".
About the only places that might own up to such detail in public are the ones funded by public money, and so have to be accountable.
So the fact that you don't know of such companies using "software xyz" is meaningless, since even if you did know they'd be unhappy if you talked about it. Why do you think that advertising for computer companies just says things like "5 of the top 5" or "9 of the top 10" companies use product xyz?
"That's a pointless comment, since you know (or should know) that no-one can answer it....." Really? Seems other vendors can give cases, such as RHEL for GFS (http://www.redhat.com/whitepapers/solutions/RH_GFS.pdf), so why the shortage of such for Solaris if ZFS is just so gosh-darn popular? Believe me, we've had the Oracle salesgrunts in several times trying to tell us how wonderful Slowaris is under Mad Larry, but their case studies are a little thin on the ground.
".....since even if you did know they'd be unhappy if you talked about it....." The problem with that idea is three-fold - first off, I interview lots of people and they tell me plenty about what technologies our competitors are using; secondly, I see what skills our competitors are advertising for and ZFS is not mentioned; thirdly, I also happen to be on good speaking terms with a lot of similarly employed peeps in FTSE 1000 companies, we all talk to each other and whilst we don't share all the details we do talk about what we're doing. If there was some major wave of ZFS love going on I would be aware of it by any or all three of the above mentioned means. There is no such wave.
/SP&L
@Matt Bryant
There are so many fallacies, errors, and just plain failures to comprehend the storage systems involved that it's difficult to address them all, but I'll try, and will do so in two posts (it's too long for a single one), in order of the points made.
##"Businesses want high-speed, scaleable, redundant and reliable storage at a low price."
Sure - but a cluster file system absolutely is not the way to achieve that. In the order of requirements you listed:
1) High speed
Clustering (you seem to be a proponent of OCFS2 and GlusterFS for some reason) has terrible performance compared to normal single-node file systems due to lock bouncing between the nodes. If the nodes are concurrently accessing the same directory subtrees, the lock bounce time is close to the ping time, around 100µs even on 1Gb or 10Gb ethernet. You might think that's pretty quick, but on a local file system this metadata is typically cached, and since there's no arbitration between nodes to be done, the access time for the cached metadata is the latency of RAM access, typically around 50ns (after all the overheads; don't be fooled by the nominal clock-cycle latency at GHz+ speeds). That's about 2,000x faster than ping time. This is the main reason why even the best cluster file systems severely suck when it comes to general purpose FS performance.
2) Scalable
Clustering has nothing to do with scalability. ZFS can scale to thousands of disks, across many vdevs (a vdev being roughly equivalent to a traditional RAID array, only with the added benefits of much improved data integrity and recoverability). You might argue that this limits you to how many HBAs you can fit into a machine, but this isn't really that serious a limitation. You can hang an external disk tray with a built-in expander off a single HBA, and daisy chain several of those together, well before you need to worry even about multiple HBAs in a server. You are looking at hundreds of disks before you have to start giving remotely serious thought to which motherboard has more PCIe slots for HBAs. You could argue that the solution is something like GlusterFS that will take multiple pools of data (GlusterFS puts data on a normal FS, BTW, so if you run it on top of ZFS instead of a different FS, you get all of the unique benefits of ZFS in terms of data integrity and transparent repairability). But you'd be wrong. GlusterFS has its uses, but the performance penalty is non-trivial, and not just because of FUSE overheads. If you really are dealing with data on that scale, you really need to look at your application and shard the data sensibly at that level, rather than trying to scale an unscalable application through additional layers of performance-impairing complexity.
3) Redundant and reliable storage at a low price
This is where ZFS isn't just the best option, it is the ONLY option. Disks are horrendously unreliable, and this typically varies quite vastly between different models (regardless of manufacturer). Once you accept that disks will randomly fail, develop bad sectors, silently send you duff data (yes, it happens, and more often than you might imagine), or do one of the many other things disks do to make a sysadmin's day difficult, you will learn to value ZFS's unique ability to not just detect end-to-end errors and protect you from data corruption (including, for example, a bit-flip in non-ECC RAM on your RAID controller), but also silently repair it, either on access or during a scrub. BTRFS can detect such errors, but last I checked, its scrubs cannot yet repair them (which makes it of questionable use at best, as even with redundancy, you have to restore your corrupted file from a backup). This and other associated features (e.g. RAIDZ3, which gives you n+3 redundancy that very, very few solutions provide, and none of them cheaply) enable you to achieve enterprise level reliability using cheap desktop grade disks (not that I think that enterprise grade disks are any more reliable, despite their 2x+ higher price tag for a near identical product). No other solution comes close to providing that level of cost effectiveness.
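To make that concrete, here is a minimal sketch of the kind of setup being described (device and pool names are placeholders, not a recommendation for any particular vdev width):

  # create a pool with a single n+3 (RAIDZ3) vdev out of cheap disks
  zpool create tank raidz3 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
  # walk every block and verify it against its checksum; bad blocks are rebuilt from redundancy
  zpool scrub tank
  # see what was found and repaired, per device
  zpool status -v tank

Run the scrub periodically (e.g. from cron) and silent corruption gets caught and fixed before you ever read the affected file.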
##"Now, when you come along and say "We have a whizz-bang new filesystem called ZFS that can look after really large amounts of storage, but to do so you have to give up all concepts of hardware RAID and instead use a really big, monolithic server, with massive amounts of RAM, but you have no high availability", they tend to ask "So how is that better than GlusterFS?" or whatever product they have already had the salegrunt come round and do a slideshow on."
First of all, RAID is downright disastrous for data integrity compared to ZFS' data protection for the same level of disk redundancy. If you have a disk that has started to bit-rot and it's feeding you duff data (as I said before, it happens more than I ever wanted to believe), in traditional RAID 1/5 you can detect a mismatch between the data and the mirror or the parity, but you have no idea which mirror is correct, or which of the n+1 data chunks in a stripe is correct and which is corrupted. So you can detect an error but you have no way of correcting it. It's time to restore from backup. ZFS saves you in this case because each block has a checksum in the redundantly stored metadata, which means that it can work out which combination of data blocks is correct, and repair the corrupted block - completely transparently (it logs it so you can keep track of which disk is starting to go bad). With traditional RAID6 you can achieve the same thing if you use Linux MD RAID, but there is no checking of this on every read; you will only pick it up on a full array scrub, and by then your application has likely already consumed garbage data. With hardware RAID in most cases you don't even get the option of performing a periodic data scrub.
##"And then it gets worse when you have to admit ZFS is still a project under development and not really ready for enterprise use, and you have no case studies for real enterprise use of it (real enterprises being the only people that will have the cash and inclination to buy those big monolithic servers with all those cores and RAM required to run really big ZFS instances)."
At this point I should probably point out that ZFS on Solaris is older and more mature than GlusterFS, GFS2 and OCFS2. It is of the same order of age and maturity as GFS (not that I'm even sure why we are bothering to compare against these cluster file systems, it is an apples and oranges comparison, but I'm using it as an example of why your anti-ZFS bias is completely baseless on the count of maturity). Spend a few months on the ext4 (the default Linux FS on most distributions) mailing list and on the ZFS-on-Linux mailing list and you will find that ext4 actually gets a lot more stories of woe and corruption than ZoL, and ext4 is supposed to be one of the most stable FSes around.
##"And whilst you moan about whether GlusterFS or Left Hand have lock managers etc., what the business sees is a low cost storage solution that scales, is redundant, and uses commodity x64 architecture."
Except that GlusterFS is completely at the mercy of the underlying file system to keep the data safe and free of corruption. GlusterFS on top of ZFS is actually a very good combination if you need GlusterFS features, because you get all of the data-protecting benefits of ZFS, along with tiered storage performance (L2ARC/ZIL on SSD), and the benefits of GlusterFS if you want to mirror/stripe your data across multiple servers. But if you use something other than ZFS underneath it, you fall into the same trap with data corruption. Worse, GlusterFS has no means of scrubbing your data to pre-emptively detect and repair corruption, so if one mirror's copy of the data ends up corrupted, GlusterFS will not notice - it'll just feed you duff data 50% of the time (assuming a GlusterFS AFR/mirroring arrangement). You have to rely on the underlying FS to do that for you; and ZFS is the only FS at the moment that can.
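For what it's worth, the layering is trivial to set up; roughly like this (hostnames, pool, dataset and volume names are all made up for illustration - treat it as a sketch, not a recipe):

  # on each storage server: a ZFS dataset becomes the Gluster brick
  zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde
  zfs create tank/brick1
  # from any one node: a 2-way replicated Gluster volume over the ZFS-backed bricks
  gluster volume create gv0 replica 2 server1:/tank/brick1 server2:/tank/brick1
  gluster volume start gv0

ZFS then handles checksumming and scrubbing within each box, and GlusterFS handles the replication between boxes.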
##"What is unamusing is the rabid reaction you get when you raise objections to ZFS being included in the Linux kernel. For a start, even if it did have the actual features wanted, it is not GPL compliant. End of discussion."
The rabid reaction, IMO, comes from GPL totalitarianism. The BSD guys had no problem including it in FreeBSD long before ZoL. Why do you care if it's in the mainline kernel or not? What difference does it make to anyone except the GPL fanatics?
Well, thanks for that amazing bit of Sunshine, please do send me the email address for your manager at Oracle so I can let him know what a great job you are doing spreading The Faith. Don't worry, don't call us, when we want a repeat we'll call you, honest. There are so many deliberate fallacies, lies and downright propaganda in that post I simply can't be bothered to deal with all of them, but the idea of hanging large JBODs of disk off a hardware adapter you have turned hardware RAID off of, and then relying on one filesystem to control all that data without redundancy (you can't share pools between ZFS instances) is simply so stupid, well then I have to suggest you are being paid to subvert your technical knowledge, as there is simply no way I can see anyone regurgitating that bilge otherwise. I'll tell you what - you can go back to Larry, tell him we're all converted and will be using ZFS and anything else non-GPL Larry wants to cripple the Linux community with, and we'll pretend we mean it. No, seriously, if you want to use ZFS in production then go ahead, your choice, just stop trying to whitewash the rest of the World, mmmkay?
One of these days I will stop feeding the trolls...
##"So first you say that ZFS having problems with hardware RAID is a lie, then you admit that ZFS "gets confused" by hardware RAID exactly as I said."
It doesn't get "confused". It just has no way of seeing what the physical shape of the underlying array is, and thus cannot do anything to repair your data automatically, over and above what your hardware RAID is doing (which isn't, and never will be, as much as ZFS can do for any given level of redundancy). If it is not obvious to you why by now, I suggest you go to the institution that issued you with your IT qualifications (if indeed you have any, which is looking increasingly doubtful) and demand your tuition fees back, because clearly they have failed to teach you anything.
##"Well, thanks for that amazing bit of Sunshine, please do send me the email address for your manager at Oracle so I can let him know what a great job you are doing spreading The Faith."
Could it be that you are even more ignorant than you had demonstrated prior to that post? Oracle has no interest in ZoL - they are actually betting on BTRFS on Linux (last I checked the only sponsored BTRFS developer was actually working for Slowracle).
And the open source ZFS implementations are actually more advanced and feature-rich than Oracle's official Solaris implementation (which has stopped being released in open source form).
##"but the idea of hanging large JBODs of disk off a hardware adapter you have turned hardware RAID off of, and then relying on one filesystem to control all that data without redundancy (you can't share pools between ZFS instances) is simply so stupid,"
Are you being deliberately or naturally obtuse? Forgive me for asking, but it is not particularly obvious any more. ZFS does things to protect your data that your hardware RAID _cannot_ and _will never be able to_ do. Delegating this function to hardware RAID means you lose most of that capability. ZFS does handle redundancy - and higher levels of redundancy than any hardware (or software) traditional RAID. For a start, ZFS supports RAIDZ3, which is n+3 redundancy, higher than any hardware RAID controller you can buy (they top out at RAID6 which is n+2). ZFS also has additional checksums that allow it to figure out which combination of blocks is correct even when there is only n+1 level of redundancy in use, something that traditional RAID cannot and never will be able to do.
And what exactly do you mean that you cannot share pools between ZFS instances? You can always import a pool on a different machine if you need to (e.g. if your server suffers a severe hardware failure). The command in question is "zpool import </path/to/device/node(s)>". Provided the version of the pool and the feature flag set are supported on your target implementation, it will import and work just fine. For example, if you create a pool with version 26, you can import it on the widest range of implementations (ZoL, FreeBSD, ZFS-FUSE, Solaris, OpenSolaris, or OpenIndiana).
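In other words, moving a pool between boxes is roughly this (pool name is just an example):

  # on the old host, if it is still alive enough to do so:
  zpool export tank
  # on the new host, scan a device directory and import the pool by name:
  zpool import -d /dev/disk/by-id tank
  # if the old server died and the pool was never cleanly exported, force it:
  zpool import -f tank

The data and all the dataset properties come across with it.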
Hardware RAID is vastly inferior in terms of durability AND performance. If you haven't read up enough to understand why, you really should do so instead of speaking from the wrong orifice about things you clearly know nothing about.
Your blanket denial of hardware RAID speaks volumes of your ZFS zealotry. Other file systems work fine with hardware RAID, meaning you can get the advantages of offloading the RAID computations to the hardware and gain whatever features from the software, but since ZFS can't, you now have to try overturning decades of industry experience and trust in hardware RAID, because admitting it adds value would detract from your ZFS belief.
I'm really glad you mentioned ZFS import as I'm just dying to see you try and wiggle out of the fact it takes ages for ZFS to import a pool! It is no substitute for having a shared filesystem. Imagine the scenario - your ZFS server has fallen over, you need to get that production data back online, but you have to tell your manager that your Uber file system needs not milliseconds, not seconds, but minutes, maybe over an hour (!) to import a pool, and that's before you have to restart all the applications and import the data into your database to restart production. Oh, and that's not mentioning the fact that ZFS imports have a nasty habit of stalling and hanging the bigger they are, especially if you have dedupe enabled. Or, if you have a clustered or shared filesystem instead of ZFS, well you just carry on.
It may be comfy now under Larry's bridge, but I suggest you try learning about alternatives for a more secure future. Enjoy!
/SP&L
What on earth are you talking about? Importing a pool takes no longer than mounting it after a reboot. It might take longer if you have duff hardware (disks that lie about barrier commits, or cached RAID controllers that aren't battery backed, or similar problems), as ZFS goes rolling back through many previous versions of uberblocks trying to find a consistent data set. But that's a duff, lying hardware issue, not a FS issue.
If you have dedupe, you need to make sure you have at least 1GB of RAM per TB of storage if your files are large (i.e. your stripe size tends toward 128KB), or proportionally more if your files are small. If your deduplication hashes don't fit into ARC, the import will indeed take ages if your pool is dirty. But bear in mind that this is another reason why caching hardware RAID is bad - if your machine blows up and you lose the battery backed cache all bets are off on the consistency of the data that will be on your disks because the caching RAID card will do write re-ordering. If you instead use a plain HBA and use a ZIL for write-caching, the write ordering is guaranteed, so your pool cannot get trashed.
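If you are considering dedupe, you can also get an idea of what it will cost you before turning it on (pool name is an example; as far as I recall zdb -S does a dry-run dedup simulation on ZoL, so treat this as a sketch):

  # simulate deduplication across the pool and print a DDT histogram,
  # including the expected dedup ratio
  zdb -S tank

Each unique block ends up costing a few hundred bytes of dedup table, which is where rules of thumb like the one above come from; if the table doesn't fit in ARC, every write (and a dirty import) has to page it in from disk, and that is where the horror stories originate.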
"What on earth are you talking about?....." More of that Sunshiners reflexive denial. Just Yahoogle for ZFS import takes long time and watch the hits pile up.
"....It might take longer if you have duff hardware...." Back to blaming the hardware for ZFS's issues? This is my surprised face, honest.
"...., the import will indeed take ages...." Yeah, nice grudging acceptance there might be an issue, but do try and stop insisting it is only with cases of dedupe with too little memory. Not that ZFS's ridiculous memory requirements aren't another issue.
".....hardware RAID is bad - if your machine blows up and you lose the battery backed cache all bets are off on the consistency of the data that will be on your disks....." I think that translates to "sh*t, can't answer that point, must make up a possible issue with hardware RAID to try and deflect attention from the ZFS issue". Fail! Let's just ignore any such battery-backed cache issue would already have generated a hardware error, shall we? Your example is even more silly especially as ZFS becomes one big SPOF, whilst even basic hardware RAID can be fortified with RAID between cards (RAID10 or RAID50) or make use of software RAID built into other filesystems.
Like I said before, you want to use it then go ahead, just don't expect anyone with experience to blindly take your word for gospel.
/SP&L
##"Just Yahoogle for ZFS import takes long time and watch the hits pile up."
And in all cases the situation is use of deduplication with insufficient RAM and hardware that lies about the commits. Use ext* under such circumstances, and see how long fsck takes on a 10TB+ file system. A lot longer than a zpool import. This is what you need to be comparing with in a like-for-like scenario.
##"Not that ZFS's ridiculous memory requirements aren't another issue."
Really? My backup server runs ZFS just fine in 3.5GB of RAM with a 6TB (RAIDZ2, 4TB usable) array. ZFS-FUSE runs just fine on my Toshiba AC100 with 510MB of usable RAM. The memory requirements are by and large a myth. It only applies in cases where you are using deduplication on large arrays.
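And if RAM really is tight, you can simply cap the ARC on ZoL rather than let it take its default share (the 1GB figure below is an arbitrary example for a small box):

  # /etc/modprobe.d/zfs.conf - limit the ARC to 1GB
  options zfs zfs_arc_max=1073741824

The box then behaves like any other Linux machine with a bounded cache; the "ZFS eats all your RAM" stories are mostly just the ARC doing what any cache does with otherwise idle memory.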
##"Your example is even more silly especially as ZFS becomes one big SPOF, whilst even basic hardware RAID can be fortified with RAID between cards (RAID10 or RAID50) or make use of software RAID built into other filesystems."
Do you realize that if you run RAID50 between two cards, you are running a stripe of RAID5 arrays? If you have RAID5 on each card and you lose one card, your RAID50 array will be trashed because you have effectively lost one logical disk from a RAID0 array. There is nothing stopping you from having a ZFS pool across multiple HBAs (a lot of my ZFS pools span 3-4) in a configuration equivalent to RAID50. You do this by having multiple RAIDZ1 vdevs in a pool. And that will still give you better protection than hardware RAID for the reasons discussed (you really need to comprehend the meaning and advantage of end-to-end checksums here).
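Concretely, the RAID50-shaped pool looks something like this (placeholder device names, three disks per HBA purely for illustration):

  # a pool striped across two RAIDZ1 vdevs, one vdev per HBA - the same shape as RAID50,
  # but with per-block checksums so the corrupt member of a stripe can be identified and repaired
  zpool create tank \
    raidz1 /dev/sdb /dev/sdc /dev/sdd \
    raidz1 /dev/sde /dev/sdf /dev/sdg

Writes are striped across the vdevs; losing one disk in each vdev is survivable, just as with RAID50.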
"And in all cases the situation is use of deduplication with insufficient RAM...." LOL, not so! This case from Oracle makes fun reading about how ZFS has problems with large numbers of LUNs (https://forums.oracle.com/forums/thread.jspa?messageID=10535375). Not appealing when you deal with UNIX systems that often have well over 300 LUNs! Don't tell me, you expect every enterprise system to work on just a few LUN devices only? It seems that ZFS getting confused over device names and stalling on import is more than just a rare occurence. But thanks for admitting you need OTT amounts of RAM for ZFS. And here I await the other standard Sunshiner excuse - "you don't have any flash in your ZFS server", because of course flash is just soooo cheap, right?
"....My backup server runs ZFS just fine in 3.5GB of RAM with a 6TB (RAIDZ2, 4TB usable) array....." Oh, so that's the toy you base your "experience" on! WTF? I have a CentOS backup server running on 1GB of RAM with much more disk than that but I'm not going to pretend that it's an enterprise-ready solution, but you obviously would think it is. So your toy server is over-priced and you want to deny a real business server would have to be equally over-priced?
".....you really need to comprehend the meaning and advantage of end-to-end checksums here...." You really need to comprehend the meaning of not interested, FOAD you annoying Sunshiner. Seriously, I could understand if you were pushing that crap as an overblown April Fools' gag but you lot started your dribbling two days early. It really is reaching the stages where TTTH is far too subtle for you.
In summary, ZFS imports will be very slow or stall unless you turn off the features the ZFS zealots told you were the reasons for having ZFS in the first place. You need to turn off dedupe (often claimed by Sunshiner trolls to be vital), avoid snapshots (the redundancy in ZFS), and keep your number of devices as low as possible and hope they don't use multipathing because ZFS stupidly counts each and every path as a device, multiplying the problem you get with large numbers of devices. And if your device names change then you are seriously screwed as the import will just hang trying to resilver the "missing" disk. And you want to somehow claim that is going to be better in a failover situation than a shared filesystem?!? And even then the whole import will stall if you don't have ridiculously large amounts of RAM. Yeah, so production ready - NOT! Seriously, it's beyond snake oil, it's verging on a scam to claim ZFS "is the answer". I'd be annoyed at your obtuseness but I'm too busy laughing at you!
/SP&L
##"..... ZFS will work on hardware RAID just fine....." Hmmm, seeing as even Kebbie has admitted that is not the truth
OK, Mr. "think you know it all when you know bugger all" Bryant - have you actually tried it? Have you? No? I have news for you - I have tested this setup extensively. It works exactly as you would expect. You get poor small-write performance as you would expect from traditional RAID5, and no transparent data correction capability, because as far as ZFS is concerned it is running on a single disk. Is this a sensible setup to use when you could have better write performance and data durability for free? No. But if you want to deliberately use a sub-optimal solution you are free to do so.
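For anyone following along, that hardware-RAID setup amounts to this (LUN path is a placeholder) - ZFS sees the RAID5 LUN as one big disk, so it has nothing of its own to rebuild from:

  # pool on a single hardware-RAID LUN: checksums will still detect corruption,
  # but there is no ZFS-level redundancy to repair it from
  zpool create tank /dev/sdb
  # a partial mitigation only: keep two copies of each data block on that same "disk"
  zfs set copies=2 tank

Which is exactly why you run ZFS against plain HBAs and let it own the redundancy instead.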
##".....other file systems in that it will detect bit-rot,....." As expected, when all else fails it's back to the Sunshiner staple, mythical bit-rot.
It's not mythical. See the links posted earlier on research done by the likes of NetApp. It is a very real problem. More so with large, bleeding edge disks. If NetApp's tested-to-death setup achieves one such error in 67TB of data, the figure for desktop grade high density disks with 4 platters is almost certainly several times worse.
##"..... but this goes completely against your original requirements of a high performance low cost solution...." It's quite comic when Sunshiners try and push ZFS as low-cost, neatly ignoring that it requires massive amounts of very pricey RAM and cores in a single SMP image for the big storage solutions they claim it is ready for. ZFS above a few TBs in a desktop is more expensive. Come on, please pretend that a 32-socket SMP server is going to be cheaper than a load of two-socket servers. Don't tell me, this is the bit where you switch to the Snoreacle T-systems pitch
Considering I just got 48GB of RAM for my new desktop rig for £250 (Registered DDR3 ECC), I don't see how you can argue that RAM is that expensive. You can build a 2x16 core Opteron system for not an awful lot. By the time you have accounted for a decent chassis, a couple of decent PSUs and some disks and caddies, the £3K you are likely to spend on a motherboard, two 16 core CPUs and a quarter of a TB of RAM isn't as dominant to the TCO as it might originally appear - especially when you consider that on a setup like that you'll be attaching hundreds of TBs of disks. If that's not good enough for you, you can add some SSDs to use as an L2ARC, which would sort you out even if you wanted to heavily use deduplication on such a setup. And ZFS isn't particularly CPU heavy, contrary to what you are implying (might be on SPARC, but only because SPARC sucks ass on performance compared to x86-64).
Seriously - work out your total cost per TB of storage in various configurations. Split it up across lots of boxes and the overheads of multiple chassis, motherboards and CPUs start to add up and become dominant in the TCO.
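A crude back-of-envelope of the shape of that sum (every figure below is a placeholder - plug in your own quotes):

  # illustrative only
  SERVER=3000      # one big box: board, CPUs, RAM, chassis, PSUs (GBP, placeholder)
  DISKS=100        # number of 4TB drives (placeholder)
  DISK_PRICE=120   # per drive (GBP, placeholder)
  RAW_TB=$((DISKS * 4))
  TOTAL=$((SERVER + DISKS * DISK_PRICE))
  echo "scale=2; $TOTAL / $RAW_TB" | bc   # GBP per raw TB; the fixed server cost is a fifth of the total here

Split the same disks across five smaller boxes and you pay the fixed per-box overhead five times over, which is the point being made.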
##Problem for you is far, far cleverer people have come up with far better solutions than ZFS
If they have, you haven't mentioned it.
##This case from Oracle makes fun reading about how ZFS has problems with large numbers of LUNs (https://forums.oracle.com/forums/thread.jspa?messageID=10535375).
I lost interest in reading that thread when I noticed it talks about Slowaris. I really have 0 interest in Slowaris. We are talking about Linux and the ZFS implementation on that. I know your attention span makes it hard to focus on a subject for any length of time before you demonstrate you cannot even BS about it, but it isn't helping the case you are arguing.
##"....My backup server runs ZFS just fine in 3.5GB of RAM with a 6TB (RAIDZ2, 4TB usable) array....." Oh, so that's the toy you base your "experience" on! WTF?
No, it's just the smallest. The top end deployments I've put together with ZFS are in the region of 100TB or so (off a single server - HP servers and disk trays, not Slowracle). What do you base your experience on? Do you use ZFS on anything? Or are you just spreading FUD you know nothing about? Thus far you have only provided ample evidence of your own ignorance.
## In summary, ZFS imports will be very slow or stall unless you turn off the features the ZFS zealots told you were the reasons for having ZFS in the first place.
Which features would that be, exactly? The only feature that will affect import time is deduplication - which is actually not recommended for typical setups. It is a niche feature for a very narrow range of use-cases - and needless to say disabled by default.
##You need to turn off dedupe (often claimed by Sunshiner trolls to be vital),
No, you just need to not turn it on unless you actually know what you are doing; but you've so far amply demonstrated you have no idea what you are talking about, so that point is no doubt wasted on you.
## avoid snapshots (the redundancy in ZFS),
Are you really, really THAT ignorant that you think snapshots have anything at all to do with redundancy? Really? Do you even know what redundancy and snapshots are? Because what you said there shows you are a clueless troll. I pity your customers, assuming you have managed to BS your way through to any.
##and keep your number of devices as low as possible and hope they don't use multipathing because ZFS stupidly counts each and every path as a device, multiplying the problem you get with large numbers of devices.
What on earth are you talking about? Using /dev/disk/by-id/wwn-* nodes is the correct way to get your device IDs and it works just fine. The number of devices in a pool is pretty open ended. The important thing to pay attention to is the number of devices in a vdev, WRT device size, expected error rate (most large disks are rated at one unrecoverable error for every 11TB of transfers) and the level of redundancy you require.
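In practice that just means building (or importing) the pool with the persistent device nodes instead of the sdX names (the WWN values below are obviously dummies):

  # persistent, path-independent names - they survive controller and port reshuffles,
  # and as far as I've seen you get one node per LUN rather than one per path
  ls -l /dev/disk/by-id/wwn-*
  zpool create tank raidz2 \
    /dev/disk/by-id/wwn-0x5000c50012340001 /dev/disk/by-id/wwn-0x5000c50012340002 \
    /dev/disk/by-id/wwn-0x5000c50012340003 /dev/disk/by-id/wwn-0x5000c50012340004
  # or tell an import explicitly where to look
  zpool import -d /dev/disk/by-id tank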
##And if your device names change then you are seriously screwed as the import will just hang trying to resilver the "missing" disk.
If your devices WWNs have changed you have far bigger problems to worry about.
##And you want to somehow claim that is going to be better in a failover situation than a shared filesystem?!?
If you aren't too stupid to use it in the way the documentation tells you to - absolutely. But that is clearly a big if in your case.
## And even then the whole import will stall if you don't have ridiculously large amounts of RAM.
More completely unfounded FUD. I've shifted many tens of TBs of pools between machines and never had issues with imports taking a long time.
##Yeah, so production ready - NOT! Seriously, it's beyond snake oil, it's verging on a scam to claim ZFS "is the answer". I'd be annoyed at your obtuseness but I'm too busy laughing at you!
That's OK - everybody else reading the thread is laughing at you. Seriously - you can use whatever FS you see fit, I couldn't care less. But you really should stop trolling by spreading FUD about things you clearly know nothing about.
".....have you actually tried it?...." Oh no, as I already pointed out that my company has zero interest in it. However, we did have Sun and then Oracle come round and try to convince us ZFS was just peachy with SPARC Slowaris (M- and T-series), and both were allowed to do PoCs to try and prove their claims, and both failed miserably. The highlight was when Oracle admitted replacing a server with UFS working fine with 16GB of RAM would need a server with 32GB of RAM just to provide the same service, that really made us laugh! As to running it on x64, either with Linux or OpenSlowaris, we have no interest whatsoever. Did you fail to understand the bit where I mentioned (repeatedly) that we're very close to Red Hat when it comes to Linux? You want me to use smaller words?
".....Considering I just got 48GB of RAM for my new desktop rig for £250 (Registered DDR3 ECC)...." Apart from the hint of manure coming from the suggestion of putting 48GB of RAM in a desktop, if you knew anything about hp servers as you claim then you would know they do not support you buying cheapie RAM and plugging it in, you need to buy the hp modules which are markedly more expensive than cheap desktop RAM. Same goes for Dell, IBM, Fujitsu, even Oracle, so I'm going to call male bovine manure on that one.
".... I have tested this setup extensively......" Hmmm, if you say so. I expect another massive and glaring piece of counter-evidence to that statement soon, but I'll just smile for now.
".....The top end deployments I've put together with ZFS are in the region 100TB or so (off a single server - HP servers and disk trays......" Ah, there it is! Please do explain what "disk trays" and server models (hint - they come with RAID cards as standard) you used, and then account for the fact that whilst hp support several Linux flavours - you can buy the licences direct from them and they front the support including the hardware, giving you one throat to choke - they do not offer ZFS and do not support it unless it's part of Slowaris x86. If you had an issue with the filesystem (such as those common import stalls you insist never happen) they would simply refer you to Oracle for support. If you had a problem with ext3 or 4, or even Oracle's OCFS2, since they are in the kernel hp will support you with them, but not ZFS. I find it completely unbelievable that any company would put a 100TB of production data on a SPOF like ZFS without proper support, and that's before we get round to the idea of any company trusting pre-production filesystem software (the so-called production release for Linux was only announced last Wednesday). Busted!
So, now we've exposed your dribblings for the unsubstantiated drivel they are, please feel free to go have sexual intercourse with yourself elsewhere. Oh, and BTW, that sound is the forum laughing at your thorough debunking.
/SP&L - who the fudge did this clown blow to get a gold badge? Was it a charity thing?
I find it completely unbelievable that any company would put a 100TB of production data on a SPOF like ZFS without proper support
zfsonlinux is what you would turn to if you wanted to implement a ZFS filer yourself, without the expense of going to a 3rd party appliance.
There are thousands of companies out there running ZFS filer solutions. Some do it themselves, using Illumos, FreeBSD or ZoL. You claim this means there are no large businesses using ZFS, that no-one would rely on that level of support - and you are sort of right.
Some companies that require that extra level of box ticking buy ZFS based filers from Nexenta. It's exactly the same product, but you have the support of a distributor. Here's a list of Nexenta case studies. Plenty of big boys using ZFS in production there.
This is not the only way to do it. Netflix use FreeBSD for all their distribution nodes now (http://lists.freebsd.org/pipermail/freebsd-stable/2012-June/068129.html), and the way they support this is that they hired one of the leading FreeBSD developers away from Yahoo.
"....ZFS based filers from Nexenta...." As I understand it, the Nexenta kit is more like an Open Slowaris (IllumOS?) instance with an extended version of ZFS and uses all the Slowaris x86 tricks and software to provide clustering between storage nodes, not just ZFS. That's fine, it's not Linux and not pretending to be Linux, and I don't recall hearing the Nexenta boys were paying for astroturfing to try and get their extended ZFS forced into the Linux kernel.
@Matt Bryant
"...Your blanket denial of hardware RAID speaks volume of your ZFS zealotry. Other file systems work fine with hardware RAID..."
We tried to explain to you that other filesystems do not work fine with hw-raid. First of all, there might be errors in the filesystem, or errors in the hw-raid system, or on the disk, or ...
There are many different domains where errors might creep in. Every domain has checksums, but that does not help, because when passing from one domain to another there are no checksums. That is the point.
Have you played the game as a kid where you whisper a word to your neighbour sitting in a ring? The word that comes out at the end always differs from the word that went in at the start. The reason is that there is no comparison between the beginning and the end. There are no end-to-end checksums. This is exactly what ZFS does: it has end-to-end checksums. It checks that the data in RAM is exactly the same as the data on the disk, regardless of all the domains in between. Those end-to-end checksums are why ZFS has superior data integrity: it compares the beginning against the end. No one else does that.
The reason ZFS can compare end-to-end is that ZFS is monolithic and has control of the whole chain: from the RAID layer down to the disk. ZFS contains both the RAID manager and the filesystem manager, and the whole point of that design is to make end-to-end checksums possible.
So when other filesystems separate the RAID manager from the filesystem, or add yet another layer - the hardware RAID - that is bad from a data integrity standpoint. You cannot get there with a checksum in each domain; you need a checksum from end to end. That is the only solution, and ZFS achieves it by being monolithic and controlling everything. Coincidentally, Linux hackers mocked the ZFS design for being monolithic, and Andrew Morton called ZFS a "rampant layering violation" because it had no layers. The point is, only if you have no layers can you do end-to-end!
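And you can see where that checksumming lives: it is just a filesystem-level property and a filesystem-level counter, with no RAID controller involved (pool and dataset names below are examples):

  # every block is checksummed on write (fletcher4 by default); sha256 can be set per dataset
  zfs get checksum tank
  zfs set checksum=sha256 tank/important
  # the CKSUM column counts blocks a device has silently returned wrong - the errors
  # a hardware RAID stack never even notices
  zpool status tank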
Finally the Linux kernel hackers seem to have understood this, and created BTRFS, which violates layers just as ZFS does, as BTRFS controls everything: raid, filesystem, etc. But you never hear any complaints from Linux hackers that BTRFS violates layers; you only hear complaints when non-Linux tech does. Apparently ZFS is a bad design according to Linux hackers, but BTRFS, which is a clone of ZFS, is not. :)
> maybe over an hour (!) to import a pool
More nonsensical FUD. Have you ever got closer to ZFS than posting about it on the Register? You've clearly never used it, so why should anyone take your rants seriously?
I've yet to see a zpool import take more than a second or two, unless the underlying devices have some serious problems, and even there ZFS will correct and import the valid pool in less time than a traditional FS will complete an fsck and present a maybe-correct volume.
"More nonsensical FUD......" Hey, go do the Yahoogle and read some of the hits.
"......Have you ever got closer to ZFS than posting abut it on the Register? ....." Why would I want to? I don't need to get an STD to realise it's not something I want to do. Or are you of the mindless persuasion that is willing to try anything if someone tells you to? If so - and going on the evidence of your support for ZFS - I have some prime Everglades real estate you might be interested in.....
".....You've clearly never used it....." <Sigh>. How many times do I have to repeat - we have NO interest in it. We have had Sun's own experts (and Oracle's) come round and PoC it and fail miserably, or are you now going to insist Sun and Oracle know nothing about ZFS?
##".....have you actually tried it?...." Oh no, as I already pointed out that my company has zero interest in it.
So you openly admit you have 0 actual experience of this. Wow. So much opinion, so little actual knowledge.
## Did you fail to understand the bit where I mentioned (repeatedly) that we're very close to Red Hat when it comes to Linux?
Oh, I'm sorry, by "close to RedHat" you mean you have a support contract that you can try to hide behind when things go wrong and it becomes obvious you don't know what you're doing? I hadn't realized that was what you meant.
##".....Considering I just got 48GB of RAM for my new desktop rig for £250 (Registered DDR3 ECC)...." Apart from the hint of manure coming from the suggestion of putting 48GB of RAM in a desktop,
Yup. EVGA SR2 based, since you asked, not that it matters.
## if you knew anything about hp servers as you claim then you would know they do not support you buying cheapie RAM and plugging it in, you need to buy the hp modules which are markedly more expensive than cheap desktop RAM.
I've never gone wrong with Crucial RAM. It has always worked in anything I threw it into, and they have compatibility lists for just about anything. And anyway, you've been saying you wanted things cheap, fast and reliable (and not just two of the above).
##".....The top end deployments I've put together with ZFS are in the region 100TB or so (off a single server - HP servers and disk trays......" Ah, there it is! Please do explain what "disk trays" and server models (hint - they come with RAID cards as standard) you used,
LSI HBAs, and MSA disk trays. The normal SATA disks were easy, but my client required large SSDs in one of the trays for hot data. None (at least of suitable size) were supported, so we got a bunch of disks on a trial from multiple vendors. The expanders choked almost instantaneously on most of them with NCQ enabled. Kingston V100+ 500GB models worked great, however. What you need to realize is that you cannot do bigger and better things while having your hands tied behind your back by what your vendor is prepared to support. If that's all you're doing, your job isn't worth a damn and is ripe for outsourcing. I regularly have to push the limits orders of magnitude past what any vendor supported system in the price range can handle.
## and then account for the fact that whilst hp support several Linux flavours - you can buy the licences direct from them and they front the support including the hardware, giving you one throat to choke - they do not offer ZFS and do not support it unless it's part of Slowaris x86.
Who said anything at all about vendor supported configurations? If you have the luxury of being able to achieve things within your requirements and budget while sticking with those, that's great - meanwhile, those of us that have to get real work done don't have that luxury.
##"More nonsensical FUD......" Hey, go do the Yahoogle and read some of the hits.
Oh, I get it! "It must be true! I read it on the internet!" Is that the best you can do? Do you have any experience of your own on any of these matters? Or are you just regurgitating any odd drivel you can find that supports your pre-conceptions?
##".....You've clearly never used it....." <Sigh>. How many times do I have to repeat - we have NO interest in it.
Then stop trolling baseless FUD about it.
".....So you openly admit you have 0 actual experience of this....." You really do have problems with basic reading and comprehension, don't you? The standard for Oracle astroturfers really is slipping! Did you miss the bit where I said we'd had both Sun and later Oracle in to try a PoC? I have plenty of first-hand experience of seeing ZFS fail, thanks, much more than your imaginary 100TB on a single hp server (somehow without the built-in RAID card) production example. What are you going to claim next, that you've done the same with an IBM mainframe? LOL, you are an embarrassment to trolls everywhere.
".....you mean you have a support contract that you can try to hide behind when things go wrong...." Actually, yes, that's exactly what I mean. I don't try and claim to know everything about the Linux, UNIX and Windows or the apps we use right down to the code level (and in the case of the proprietary apps we run that would be impossible), but then I work for a real business rather than the imaginary one you inhabit, and my business wants the assurance that when things go wrong they get fixed as quickly as possible. My CIO doesn't see our business systems as some personal development toy for his staff, they are there to run business systems to make money for the company. Get a real technical job and you might learn that some day.
".....I've never gone wrong with Crucial RAM....." With every post you are simply re-inforcing the simple fact you have not work in a real IT job. I too have no problem with Crucial RAM in my home systems, but anyone that had actually had an hp server on an hp support contract as you claim would know that putting anything but hp RAM in it would invalidate the hp support contract. This makes it certain that there is no way any company would trust you with 100TB of production data even before you got round to the stupid idea of trying to use ZFS. Back under your bridge you petty astroturfer, you have been exposed for a know-nothing, lying troll, and exactly the type that means there will always be resistance to anything CDDL going anywhere near the Linux kernel.
".....LSI HBAs, and MSA disk trays. The normal SATA disks were easy, but my client required large SSDs in one of the trays for hot data.......we got a bunch of disks on a trial from multiple vendors....." Again, very unlikely as again you would have invalidated your hp warranty using none hp cards and none hp disks. You are talking so much crap it's rediculous! You also talked about 100TB behind one adapter yet you also wanted SSDs behind the same adapter!?!? What type of technical cretin are you?!? You are a complete fake, just admit it, go back to Larry and tell him he needs some better astroturfers. Is Florian Muller available to help you out?
".....If you have the luxury of being able to achieve things within your requirements and budget while sticking with those, that's great - meanwhile, those of us that have to get real work done don't have that luxury....." Translation - Gordon has worked for some ha'penny charity outfit and built them a webserver and he thinks that makes him an expert on all IT. Gordon, even the tiniest of businesses will want support, even if via a third party, as the limits of having staff like you is that when it all goes tits up the business will sink whilst the smart-arse wannabe techies try and learn how it actually works. Your pretence that you have worked with ZFS in real industry use is complete male bovine manure, it is bery obvious the closest you have come to the industry is typing up marketting releases. Stop embarrassing yourself, you're only helping me in exposing the ZFS propaganda for exactly what it is, so much so I'm actually feeling sorry for you.
/SP&L
##imaginary 100TB on a single hp server (somehow without the built-in RAID card)
Do you really not understand that just because a server comes with hardware RAID capability you don't actually have to use it?
##".....you mean you have a support contract that you can try to hide behind when things go wrong...." Actually, yes, that's exactly what I mean.
I rest my case, fellow commentards.
##".....LSI HBAs, and MSA disk trays. The normal SATA disks were easy, but my client required large SSDs in one of the trays for hot data.......we got a bunch of disks on a trial from multiple vendors....." Again, very unlikely as again you would have invalidated your hp warranty using none hp cards and none hp disks.
1) Not everybody cares about warranty.
2) If the company is a big enough, well-thought-of brand, you'd be surprised what HP and the like are prepared to support and warranty.
Also consider that the cost of hardware replacements is negligible compared to the value generated by the said hardware, and when the price differential between DIY and $BigVendor solutions is running into hundreds of thousands of £, you'd be surprised what the upper management layers suddenly start to deem acceptable when budgets are constrained (and they always are).
##You also talked about 100TB behind one adapter yet you also wanted SSDs behind the same adapter!?!?
I said LSI host adapters (plural, if you know what that word means).
I'm not going to waste any more time on correcting your uninformed, ill understood and under-educated drivel. Arguing with somebody as clearly intellectually challenged as you, entertaining as it may have been to begin with, gets boring eventually. Your ad hominem attacks only serve to further underline your lack of technical knowledge and understanding to support your unfounded opinions.
Give it up Matt, you are just coming across as a juvenile angry neckbeard with mysterious chips on the shoulder.
Poor Matt, he's had a lifelong hatred of Sun (the origin of which I'm somewhat curious about). He is also pathologically incapable of admitting any error on his part. As a result, this thread is pretty much guaranteed to wind him to his maximum level of aggravation because:
1) It involves Sun technology.
2) He appears to be quite wrong (caveat: I'm not a ZFS user, but credible people throughout the storage industry have nothing but good things to say about it).
Now where's that popcorn?
".....Poor Matt....." Oh yeah, I forgot, independent thought is something that horrifies The Faithful Followers Of The Mighty Ponytail (Who Will Rise Again).
".....he's had a lifelong hatred of Sun ...." Don't be silly, I got all they way to the mid-Nineties before Sun really started upsetting me. Prior to that I'd been pretty much a happy Sun user. When was it you developed your inability to think for yourself?
".....1) It involves Sun technology....." Wrong again, which is about the normal state for a Sunshiner. As tech it's fine, it's the astroturfing and attempt to force it down Linux users' throats I object to. Seems objections and debate just can't be tolerated in the ranks The Faithful, especially when they can't deal with the objections. Sorry, but you Sunshiners queered your own boat long ago if you seriously think you can say "Trust us, take our word for it".
".....but credible people throughout the storage industry have nothing but good things to say about it...." I do to, when it's in the right context. Those people you mention are not trying to push development software as production-ready for real business use.
In point of fact, Matt, I have no particular allegiance to Sun, but it has always amused me when you've flung yourself into threads about Sun with a venomous rage, which you have been doing for some years now (and on multiple sites, I note via a quick Google search). So, we've pinpointed the mid-90s as the time when Sun touched you in a bad place; what happened, pray tell?
Anyway, best of luck with your jihad. I look forward to the continuing entertainment!
Seriously guys - ignore this guy's trolling. As the proverb says: "Do not teach a pig to sing - it wastes your time and it annoys the pig." His illiterate squealing is starting to grate.
For all his private parts waving with the supposed "FTS 1000" [sic] company credentials and having a RH support contract, there are almost certainly more people at RH that have heard of me than of him (ever ported RHEL to a different architecture?). Not to mention that he has been talking about cluster file systems to someone who has been rather involved in their development and support (ever deployed a setup with the rootfs on a cluster file system? Using DRBD+GFS or GlusterFS?) without realizing how much this exposed his pitiful ignorance of the subject.
The blinkered view of "it's not supported by the vendor so of course I haven't tried it", had it always prevailed, would have ensured that the fastest way to travel still involves staring at an ox's backside. Seriously - what is the value and usefulness of an engineer that is only capable of regurgitating the very limited homework that the vendor has done for them?
Let it go. Let his ignorance be his downfall as he hides his ineptitude behind support contracts. Incompetence comes home to roost eventually. Jobs of "engineers" that only know how to pick up the phone to whine at the vendor get off-shored to places that provide better value all the time.
ROFLMAO @ Gordon The Plastic Gardener!
"....there are almost certainly more people at RH that have heard of me than him...." Yeah, they probably all heard of you in the legal department when they took out the restraining order! And if they haven't then your frenzied and shrieking posts are probably getting widely passed around RH for their comedy value.
".....ever ported RHEL to a different architecture?...." Nope, been too busy being paid for actually WORKING with it in a REAL ENVIRONMENT. Don't tell me, next you'll claim you are secretly Linux Torvalds in disguise, right? Whatever, claim what you like, it only makes me laugh more, but I note you still can't back up your story by posting the tech details on your claimed "100TB production installation" of ZFS. Do you think anyone is likely to believe your bullshine when you can't back up your claims when challenged? This is my surprised face, honest.
"....what is the value and usefulness of an engineer that is only capable of regurgitating the very limited homework that the vendor has done for them?...." It's called implementing tried and tested technology. I've done plenty of cutting edge stuff with such tech, but - if you actually had some real industry experience - you would know that companies shy away from bleeding edge tech because of the risks involved. But what am I suggesting - you claim to have implemented a production environment riddled with SPOFs and using pre-production software, and all without a support contract! You are the God Of Tech, right? Yeah, right! You are a lying astroturfer and you got busted, now go back to troll school and learn how to do it better.
/SP&L - Enjoy!
## "....what is the value and usefulness of an engineer that is only capable of regurgitating the very limited homework that the vendor has done for them?...." It's called implementing tried and tested technology.
No, Matt. That's what you keep telling yourself because you don't know how to do anything that hasn't been done for you and pre-packaged for bottle feeding.
"No, Matt. That's what you keep telling yourself because you don't know how to do anything that hasn't been done for you and pre-packaged for bottle feeding." LOL! Look at the market, take a hard look (if you can from that alternate reality) at the FTSE 1000 companies I mentioned, and then tell me how many of them are using "pre-packaged for bottle feeding" commercial tech like mainframes, proprietary UNIX servers, Windows, CISCO/Procurve/Juniper networking gear, Balckberry BES - even Apple hardware! - and all with commercial, off-the-shelf software like Oracle or SAP. Yes, that's right - all 1000!
They trust their business to using "pre-packaged for bottle feeding" tech because it is easier to quantify the risks involved with implementing business systems with such gear, compared to the wet-finger-in-the-air, it-may-work-most-of-the-time tech you pretend to have implemented. And that is why I am confident in saying there is no chance you work for any company of size, not even an SMB, because no matter how smart you think you are they would laugh you out the door for suggesting they take onboard the massive risk of your toy solution.
Face it, you have been exposed as an astroturfing troll, and with every post you simply add to the evidence against you. But please do continue as the comedy value is exquisite!
/SP&L
Matt, you have exposed yourself as an opinionated ignoramus for whom even mediocrity is an aspiration too far. I would invite you to stop embarrassing yourself in public, but I am reluctant to do so, for you might heed it and withdraw such an excellent source of Monty-Python-esque comedic amusement.
Facebook runs CentOS on their servers, not supported RHEL.
Facebook was also using GlusterFS since before it was bought by RH (not that a company running CentOS would care).
Google runs their own "unsupported" version of Linux (IIRC derived from Ubuntu or Debian).
Those two between them are worth more than a non-trivial chunk of the FTSE 100 combined. Note - FTSE 100, there is no such thing as FTSE 1000 (although I'm sure you'll backpedal, google your hilarious blunder and argue you meant one of the RAFI 1000 indexes not weighted by market capitalization - which, considering that there are, for the sake of comparison, only about 2,000 companies traded on the London Stock Exchange, isn't going to be all that heady a bunch).
One of the biggest broadcast media companies in the country (not to make it too specific, but the list of possibles isn't that big, you can probably figure it out) runs various "unsupported" desktop grade MLC SSDs (Integral, Samsung, Kingston) in their caching servers (older Oracle SPARC or HP DL360 and DL380 boxes). They also run a fair amount of unsupported software on their RHEL (supported) machines, including lsyncd and lnlb.
One of the biggest companies in mobile advertising (after Google) runs their real-time reporting database systems (the biggest real-time MySQL-or-derivative-thereof deployment in the country) on MariaDB 10.0 alpha, because official MySQL wasn't workable for them due to the Tungsten replicator (commercial support available, for all the good that will do you when the product is a pile of junk) being far too unreliable.
This is just the first four that I can think of off the top of my head (having either worked there or having worked with someone who worked there) that completely obliterate your baseless, opinionated rhetoric. You really should quit while you're behind.
"Matt, you have exposed yourself as an opinionated ignoramus beyond whom it is to even aspire to mediocrity...." Gee, some people have such a hard time dealing with criticism or a contradictory viewpoint. You're just sad, I pity you for your inability to see beyond your own blinkered Sunshine.
".....Facebook runs CentOS on their servers, not supported RHEL.....Google runs their own "unsupported" version of Linux....." Please do pretend Faecesbook or Google are the same business models as the average business, just for laughs. Both are extreme examples where they employ a lot of technical people to design and manage their web services because their business IS the web service, whereas for most companies the web service they run is INCIDENTAL to their business. Sorry, do you want me to use shorter words than incidental? You also fail to show that Faecesbook or Google uses ZFS, so your chosen examples are the extraordinary examples where they design and implement a unique solution, but STILL did not choose to use ZFS! Gosh, if ZFS was so essential as you claim, surely they'd be using it? ROFLMAO @ you as you merely prove my point! You are a master of FAIL!
/SP&L
"....(and on multiple sites, I note via a quick Google search)...." Sorry, not me. Unless someone is ripping off the Matt Bryant nom de plume, I'd have to suggest the other Matt Bryants posting are simply real-life Matt Bryants p*ssed off with Sun. It's quite a common name (not the reason I chose it, but if you're not in on the joke then you're not in, TBH). Maybe the NFL kicker Matt Bryant also had a bad time with Sun, because when I Yahoogle for Matt Bryant and Sun that's all I get.
"....but it has always amused me when you've flung yourself into threads about Sun with a venomous rage...." Really? Wow, what amazingly broken ESP you have! I usual post out of boredom with the continual Sunshine posted, rarely first in any thread unless the author has been doing some marketing whitewash (like Ashley Vance and his old Sunshine cheerleader articles). No venomous rage, more a tired determination to expose the Sunshiners that won't shut up for the frauds they are - they made their own beds with their stupid statements, I just show them where they're going to end up lying.
"....Give it up Matt....." What you mean is please stop poking big holes in the Sunshiner astroturfing.
"..... you are just coming across as a juvenile angry neckbeard with mysterious chips on the shoulder." Puh-lease, that's worse than the pot calling the kettle black! Not hard to spot the limits of your technical knowledge seeing as - yet again - you have added nothing to the thread. Care to share your amazing experiences of ZFS in production instances with a FTS 1000 company? Yeah, thought not.
I think what's going on here is a level of disconnect. In your scenario, when support isn't exactly a priority, where the internal staff is given the full responsibility for the system's upkeep, then a scenario as you've described works. Of course, if something really does go boom, it's your head...
In Matt's case, they're bound by very expensive support contracts due to a demand for a chain of accountability, usually by someone above the IT department (remember, many IT departments still have to answer to the bean counters). Some folks are simply required to play by the rules--it's in their job description for better or worse, outsourceable or not. If it isn't official, it isn't allowed, period.
At this point, it seems the only thing you two can agree on is to disagree, as you appear to be in two very different IT worlds.
Ah, Gordon The Plastic Gardener just doesn't realise when he should quit his astroturfing and call it a day!
"....Do you really not understand that just because a server comes with hardware RAID capability you don't actually have to use it?.... I said LSI host adapters (plural, if you know what that word means)...." So first you said you didn't have to turn off hardware RAID with ZFS, then you said you did for performance, and then it was you stuck 100TB behind one adapter, and now it's multiple adapters but with RAID turned off. Seems your story changes every time I point out the gaping holes in it! Was it delivered by flying pigs?
"....1) Not everybody cares about warranty....." And here's another little hole - warranty for hp SATA disks is one year for a reason, they are a lot more likely to fail than SAS disks. So, not only do you want us to believe you have a company stupid enough to trust you with 100TB of production data, they let you put it on development software in a build just riddles with SPOFs, stick a shelf of SSDs (and why SSDs when you have slow SATA anyway?) on the same single controller - oh, sorry, you changed that bit of the story third time round - and then you use SATA disks which will fail and lead to lots of time-consuming ZFS rebuilds and resilvering when you have to replace those SATA disks. Want to make any more changes to your story? How about switching the expensive MSAs for the cheaper and larger MDS 600 (which is actually quite popular with ZFS enthusuiasts, it's even supported with the NexentaOS)? Maybe you want to read this post by a SATA-ZFS user as to why SATA and ZFS is not a good idea (http://serverfault.com/questions/331499/how-can-a-single-disk-in-a-hardware-sata-raid-10-array-bring-the-entire-array-to/331504#331504). Don't worry, take your time, make the next edit carefully as punching holes in the gumph you've posted so far is just too easy.
"......you'd be surprised what HP and the like are preared to support and warranty......" But you just said you didn't worry about support or warranties? And, no, hp will not offer a warranty on anything it does not manufacture or OEM, and they will draw up a special support contract for non-standard hardware and software that would make your eyes water even if you're someone as big as General Electric. And if you were that big you wouldn't be cutting corners with 100TB of production data or accepting designs riddled with SPOFs from some guy that plays with Linux.
And then, when you're dribbling about cheapness (even though you're so big you can force hp to put warranty support on non-hp kit?), you use hp servers? Riiiiiigght! If you really were the DIY type you'd be using white box servers and disk shelves. The reason people buy hp is because they DO care about warranties and support, it's why the hp kit is not cheap. But I suspect you only mentioned hp because you thought it would get me onside - major troll failure! If you want to continue with that line of manure, please detail the model of Proliant, the model of LSI cards, which slots you put them in, what model of MSA, what vendor and model of SATA disks and SSDs, where you got the MSA caddies (oh, didn't you know you need hp caddies to slot disks into an hp MSA?), what Linux variant and what drivers for the LSI cards, the number of MSAs you hung off each LSI card port and what type of cables you used. You know, some real technical detail to back up your otherwise fantasy example.
".....I'm not going to waste any more time on correcting your uninformed, ill understood and under-educated drivel...." Aw, looks like ickle Gordon has realised it's time to throw in the towel. Someone send a memo to Larry - new astroturfers needed, this one just makes stinky Swiss cheese!
/SP&L
Matt, just stop it ... mmm'okay. There is a good boy. Take a deep breath, watch some heist movie, open a bottle of fine red for tonight.
I'm getting flashbacks to the scenes of rabid retardation of clueless testosterone-filled nerds in the uni computer room attempting one-upmanships in the Commodore-vs-Atari flamewars via local chatrooms.
Once two of them found out they were actually sitting in the same room, and it came to blows...
Yeah, that does kinda sum up the desperation being shown by you and your fellow Sunshiners. Am I surprised that it happens so often to you that you have a convenient link to the cartoon handy? Not really. As I said, still waiting for you to post anything technical, if only for comedy value.....
"Why should I indulge you?...." Ah, but that's just it - you can't!
Seeing as you seem determined to suffer ZFS, I suggest you do something more useful than looking for cartoons in Google. I would suggest you read something like this (http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance). As the writer says, "One of the most frequently asked questions around ZFS is: "How can I improve ZFS performance?"." I LOL'd when I saw suggestions 1 and 2 - "Add enough RAM" and "Add more RAM"! Read it and hopefully you'll avoid too big a disappointment with ZFS.
Cheers Gordan for sticking through what has been a very informative piece of headbanging with Mr Bryant. The volume of data you guys are dealing with is well beyond my needs, but if I scale up I'm now better informed. How anyone who's been in IT for more than a few years can deny bit-rot is beyond me... obtuse beyond belief.
###".....You don't need a clustered file system to fail over - you fail over the block device....."
##"Slight problem - you can't share the block device under ZFS, it insists on having complete control right down to the block level."
That is simply not true, at least not in ZoL (not 100% sure, but I don't think it is the case on other implementations, either). You can have ZFS on a partition just fine. There are potential disadvantages from doing so (reduced performance - since different FS-es use different commit strategies, ZFS cannot be sure about barrier interference from the other FS, so it disables the write-caching on the disk), but it does work just fine.
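For what it's worth, on ZoL that is literally all it takes - the pool name "tank" and the partition /dev/sdb3 below are just placeholders:
  # create a pool backed by an existing partition rather than a whole disk
  zpool create tank /dev/sdb3
  zpool status tank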
But even if it were true, if you are sharing a block device between machines (typically an iSCSI or AoE share, or in rarer cases DRBD), you would only expose the whole virtual block device as a whole, and use it as a whole. Instead of partitioning that block device you would usually just expose multiple block devices of required sizes. On top of it all, most people doing this do so with a SAN (i.e. hugely expensive, which implies going against your low-cost requirement) rather than a DIY ZFS based solution. So this point you thought you were making doesn't actually make any sense on any level at all and only reinforces the appearance that you have never actually used most of these technologies and only know what you have read in passing from under-researched bad journalism.
##"That is why ZFS is a bitch with hardware RAID: ".....When using ZFS on high end storage devices or any hardware RAID controller, it is important to realize that ZFS needs access to multiple devices to be able to perform the automatic self-healing functionality.[39] If hardware-level RAID is used, it is most efficient to configure it in JBOD or RAID 0 mode (i.e. turn off redundancy-functionality)...."
Why is this a problem? All this is saying that if you give ZFS a single underlying RAID array, you will lose the functionality of ZFS' own redundancy and data protection capabilities. If you are really that emotionally attached to hardware RAID, go ahead and use it. But then don't moan about how it's not as fast or durable as it would be if you just gave ZFS raw disks. If you are using ZFS on top of traditional RAID, ZFS sees it as a single disk. If you insist on crippling ZFS down to the level of a traditional file system, you cannot meaningfully complain it has fewer advantages. But even in that case, you would still have the advantage of checksumming, so if one of your disks starts feeding you duff data, you'd at least know about it immediately, even if ZFS would no longer be able to protect you from it.
##"Ooh, nice feature-sell, only us customers decided clustering was actually a feature we really value. Try again, little troll!"
This seems to be something you have personally decided, while at the same time demonstrating you don't actually understand the ramifications of such a choice on durability and fault tolerance. Clustering is a very specialized application, and in the vast majority of cases there is a better, more reliable and more highly performant solution. You need to smell the coffee and recognize you don't actually know as much as you think you do.
##"I can't give you specifics but I'll just say we had a production system that was not small, not archive, and hosted a MySQL database on GlusterFS to several thousand users all making reads and writes of customer records."
Sounds pretty lightweight for a high performance MySQL deployment. Try pushing a sustained 100K transactions/second through a single 12-core MySQL server (I work with farms of these every day) and see how far you will get with GlusterFS for your file system, regardless of what FS your GlusterFS nodes are running on. Even if latency wasn't crippling (it is), the pure CPU cost would be. Sure - if you are dealing with a low-performance system and you can live with the questionable-at-best data durability you'll get out of traditional RAID+ext*+GlusterFS (assuming you get GlusterFS quorum right so you don't split-brain, otherwise you can kiss your data goodbye the first time something in your network setup has an unfortunate glitch), that setup is fine. But don't cite a fragile low-performance system as an example of a solution that is supposed to be durable, performant or scalable.
##"why the shortage of such for Solaris if ZFS is just so gosh-darn popular?"
I never said Solaris was that great. My preferred OS is RHEL. I also never said that ZFS was particularly popular - I just said it was better than any other standalone file system.
LOL, I think I'll keep that post as an example of the typical blinkered Sunshiner response you get. Masses and masses of insistence that no-one but the Sunshiners have the answer, that any point you raise is simply irrelevant and to be ignored, bla, bla, bla. Wow, I haven't heard that much claptrap since the Sun salesgrunts were round trying to convince us "Rock" was a certainty! Nothing gives away the 100% male bovine manure quotient of your post more than the ridiculous claim "I never said Solaris was that great. My preferred OS is RHEL" - it is very obvious from your rabid pushing of ZFS that that statement is an outright lie. The Red Hat user community were some of the most vocal exposers of the holes in ZFS, including the famous (oh, sorry, in Larryland you probably refer to it as infamous) Ten Reasons Not To Use ZFS (http://www.redhat.com/archives/rhl-list/2006-June/msg03623.html). One of the reasons Larry is so desperate to get ZFS into the kernel is because at the moment it stands SFA chance of replacing ext3/4 as the default choice for RHEL users, and Larry so desperately wants to make some money out of the Sun purchase that he cannot make from GPL'd OCFS2. It must really wind you Sunshiners up when us RHEL users simply turn down ZFS for an "old" solution like ext3!
I expect many more of these astroturfing exercises as Larry tries harder and harder to cram ZFS down the throats of the Linux community (http://www.bit-tech.net/news/bits/2012/08/08/oracle-google-astroturf/1). All part of Larry's embrace and extinguish policy on Linux.
/SP&L
LOL.
Just read 'Ten Reasons Not To Use ZFS'.
1. CDDL
2. No support for SELinux ACLs
3. Primarily a Solaris product so it costs money (?)
4. Designed for servers not desktops (WTF?)
5. Linux has Reiser4 (ROTFL)
6. Sun wrote it and they're evil for writing Java
7. No independent benchmarks
8. Microsoft and Sun work together
9. Sun paid SCO for the IP suit
10. No 64-bit Mozilla Java plugin
None of those seem like valid reasons to me. The whole thing seems like rabid fanboyism.
The last sentence - "IMHO any news
from Sun is unwelcome, unless that news is the wholesale GPL
re-licensing of their entire product catalogue."
LOL
"....,The whole thing seems like rabid fanboyism....." Oh it was, but it really wound up the Sunshiners back in 2006, and attitudes in the community have changed little since. Especially after Mad Larry ripped off RHEL with his OEL clone. So the talk of ZFS and RHEL is the most ridiculous bilge ever, there is no way Red Hat are ever going to recommend ZFS unless Oracle convert it to GPL.
Your ten reasons:
Reason 1 is "I love the GPL", Reason 2 is not true (this list is from 2006), Reason 4 is "ZFS is for servers", Reasons 3,6,7,8,9 and 10 are all "I mistrust and dislike Sun". Reason 5 is the best, "don't use ZFS, we've got Resier 4".
".....this list is from 2006....." Oh, thanks for reminding me that first Sun and then Oracle have been trying to fix the problems with ZFS for twelve years, and seven years since they tried to push the problem to the OSS crowd, and still can't get it production ready! Face it, ZFS is just another too-little-too-slow-too-late Sun product, it's the "Rock" of software.
@Matt Bryant:
BTRFS can't cluster either, so your point is completely off the mark. In fact, what point were you even trying to make?
Cluster file systems are very specialized, and there are only 3 real cluster file systems out there with any traction: GFS, OCFS, and VCFS. You could also sort of shoehorn GlusterFS in there if you ignore its somewhat questionable stability and error recovery capabilities, the fact it's FUSE-based, and that it requires a normal file system underneath it anyway.
"BTFS can't cluster either....." Where did I mention BTRFS? Did you actually bother to read my post before reaching for your tripewriter?
ZFS advocates just don't seem to realise it is a clever solution to a problem that simply isn't the prime worry for users, whilst at the same time missing key features that users do require. What is unamusing is the rabid reaction you get when you raise objections to ZFS being included in the Linux kernel. For a start, even if it did have the actual features wanted, it is not GPL compliant. End of discussion. You want to use it then fine, go ahead, just don't try forcing it on everyone else.
".....if you ignore it's somewhat questionable stability and error recovery capabilities....." Sounds like exactly the gripes you hear from someone that has never used it. To paraphrase what I said above, you don't want to use it then fine, go ahead with ZFS, just don't expect the rest of us to drop any other option just because you say so.
BTW, whilst you're still struggling to find a case study for ZFS, you probably don't want to see Gluster in use by NASA and Amazon (http://devopsangle.com/2012/08/11/how-amazon-web-services-helps-nasas-curiosity-rover-share-mars-with-the-world/). Enjoy!
/SP&L
"Cluster file systems are very specialized, and there are only 3 real cluster file systems out there with any traction: GFS, OCFS, and VCFS."
Having used GFS extensively, I'd say that for more than a small cluster/a couple of TB, you're better off running Gluster on top of ZFS. We've wasted several man-years of effort trying to keep GFS clusters together and come to the conclusion that it's the part of our "highly-available" setup which torpedoes the "highly available" part.
Can't speak for the other FSes, but experience on several sites with multiple vendors shows that clustered setups have a helluva difficulty scaling under serious load - things work OK when testing but tend to break when you want them to do serious work, or when the serious work gets beyond N size and X requests. There's a reason why no one's managed to recreate the reliability of TruCluster (which got killed off by HP), and it comes down to "it's bloody hard to make things all work in sync".
@Alan Brown:
You are spot on in terms of cluster FS scaling. The problem is that people use them in the wrong way for the wrong tasks. The main advantage of a cluster FS is that (provided your application does proper file locking) you can have simultaneous access to data from multiple nodes. The main disadvantage is that using the cluster FS in this way comes with a _massive_ performance penalty (about 2,000x, due to lock bouncing).
So while on the face of it a cluster FS might seem like a great idea to the inexperienced and underinformed, in reality it only works passably well where your nodes don't access the same directory subtrees the vast majority of the time. And if your nodes aren't accessing the same subtrees, why would you use a cluster FS in the first place?
In terms of performance, for any remotely realistic workload you will get vastly better FS level performance by failing over a raw block device with a traditional standalone FS on it between nodes.
Many people don't need any of these features, and btrfs could be interesting for them. Moving from ext4 to btrfs, I noticed a very nice speed improvement for some tasks like using subversion, although there were also horrible slowdowns for anything involving sync. What I could not accept, and the reason I dropped it, is that with a raid1 setup of 2 disks, errors on a single disk (not a single problem with the other disk, I checked) were enough to repeatedly crash/freeze the system. That was with linux-3.6.x, not some early snapshot.
It does indeed. A modest home NAS running ZFS demands a minimum of 8GB, and 16GB is preferable if you're running any kind of RAID-Z.
Run many yourself, have you? I ran an 8TB home ZFS server with no problems on 4GB of RAM, reserving 2GB for the OS and applications, so effectively a NAS with 2GB of RAM.
The more RAM you give to ZFS, the more it can cache, and the faster everything goes. You do not need 1 GB for 1 TB as is often mentioned.
If you don't enable deduplication then the memory requirements aren't anywhere near as heavy - and for a lot of loads deduping isn't needed (e.g. anything already compressed, images, mp3s, movie files and astrophysics data).
In any case, needing 8GB of RAM isn't a big deal with today's RAM prices (not that my setup uses anything like that much for 32TB of storage - and bearing in mind that the rule of thumb for ext4 on fileservers is 1GB per TB of storage).
Not particularly. In the sense of what you need for a big NAS that is.
Typically the suggested value is 1GB of RAM per TB of storage, but that is kind of based on the expectation of more IOPS as storage gets bigger, so more caching helps.
Running de-dupe is always memory intensive for a big file system, and not all usage patterns make it worthwhile. If you have lots of VMs then de-dupe and put in *lots* of RAM. Otherwise you can work with a couple of GB if your I/O demands are not that high.
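If you do go the VM-plus-dedup route, it's a per-dataset switch, and you can get a rough feel for the dedup table cost up front - the pool/dataset names below are invented, and zdb -S only simulates the result:
  # enable dedup only on the dataset that actually benefits from it
  zfs set dedup=on tank/vms
  # simulate dedup across the pool and print a dedup table histogram/ratio
  zdb -S tank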
No, the only place where there is a recommendation of "1GB RAM per TB of disk" is when you use deduplication on ZFS. Deduplication requires much RAM, otherwise performance grinds to a halt. Dedup on ZFS is not production ready. Avoid it!
ZFS in itself does not require much RAM. I have used it on Pentium 4 systems with 1GB of RAM. The thing is, if you have a lot of RAM, then ZFS will turn it into a huge disk cache called the ARC. This will speed things up. If you do not have much RAM, then ZFS will always reach for the disks, which is slow. Therefore you should give it as much RAM as you can - but it is not a requirement!
You can also use fast SSDs as a cache, called the L2ARC. With fast SSDs you can reach 100,000 IOPS and several GB/sec of bandwidth on ZFS servers.
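Adding them is a one-liner per device - the pool name and device paths below are made up:
  # add an SSD as a read cache (L2ARC)
  zpool add tank cache /dev/disk/by-id/ata-SSD_READCACHE
  # add a mirrored pair of SSDs as the separate intent log (ZIL/SLOG) for synchronous writes
  zpool add tank log mirror /dev/disk/by-id/ata-SSD_LOG1 /dev/disk/by-id/ata-SSD_LOG2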
Of course, IBM's latest supercomputer, Sequoia, is now using Lustre + ZFS to deploy 55PB of clustered storage at 1TB/sec of bandwidth. Just google it, a very interesting read for those of you who are geeks.
The maintainers of the native Linux port of the ZFS high-reliability filesystem have announced that the most recent release, version 0.6.1, is officially ready for production use.
Here was me thinking 0.anything is alpha/beta software.
Isn't that the accepted convention, or did I miss a memo?
You missed a memo. :-) Seriously, the ritual that your version number must align to some pre-determined value seems to be gradually falling away.
On the other side, you don't need to have significant changes to bump a major version number anymore, as Firefox is demonstrating.
>significant changes to bump a major version number anymore, as Firefox is demonstrating.
Funny here I thought it was just another thing they were doing to copy Chrome in every way 2 years after the fact. Oh well imitation is the sincerest form of flattery and Firefox is sure a lot better today than it was in the 3.5 bloaty, slow, memory leaking days.
It's good to know that Linux users now have another great choice of world-class file system in ZFS, in addition to the fledgling btrfs. I have been using FreeBSD with ZFS with very satisfactory experiences for the past year or so.
In 2012, when I pointed out in comments on a technology forum covering file systems the major functionality and feature advantages of ZFS over the older EXT* file systems, and compared it to Windows NTFS and the newer ReFS, many Microsoft supporters went ballistic and fabricated all sorts of technically inaccurate and nonsensical claims of (even) NTFS being years ahead of and superior to ZFS.
It was interesting to learn that ReFS has adopted (actually copied) much of the feature set of ZFS, with Microsoft at one point attempting, without success, to "license" ZFS for the recently released Windows Server 2012. However, ReFS comes up short on many of the more enterprise, large-cluster and "supercomputer" critical file system requirements that are long proven in ZFS.
Ok, yes it's very flexible and yes it's been stable, and I did commit to using it on a number of systems, but please take the evangelism with a pinch of salt.
For a start, if you want performance then you won't be using all the frills like checksumming and compression, because they simply don't perform in a real environment - that's not to say it could be done better, it's just that turning this stuff on can decrease your throughput by 75% or so, which is going to make many baulk.
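(To be fair, both are per-dataset properties, so you can at least confine the cost to data that doesn't need the protection - the dataset names here are invented:)
  # turn the expensive frills off for a scratch dataset only
  zfs set compression=off tank/scratch
  zfs set checksum=off tank/scratch      # loses ZFS's main safety net for this dataset
  # check what is currently in effect
  zfs get compression,checksum,compressratio tank/scratch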
Then there are still some issues, I was drawn for example to the block device functionality, which (in theory) gives you an elastic, compressed sparse block device which is ideal for sitting virtual machines on top of. Two problems however, (a) so slow it's unusable, and (b) it causes problems with a number of things, specifically, over time block device references go missing and you have to unmount / remount the partition to get your links back. Then there's the issue of the zfs-snapshot tool creating snapshots in the block device link folder. It may be stable, but it's certainly not bug-free.
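For reference, the block-device feature in question is a zvol, created along these lines - the name, size and properties are only an example:
  # sparse ("thin") 40G volume intended as a VM disk
  zfs create -s -V 40G -o compression=on tank/vmdisk1
  # on ZoL the device node appears under /dev/zvol/<pool>/<name>
  ls -l /dev/zvol/tank/vmdisk1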
What really worries me is that people are prepared to shout from the rooftops about how great all the features are, when some of the features really aren't that great (because they're not usable for whatever reason); when they also shout about reliability and integrity .. it just makes me nervous.
For my usage, I'm actually in the process of moving all the ZFS systems back to raw LVM, the features I wanted I've ended up not being able to use, and ZFS *is* slower.
@oddjobs,
"...For a start, if you want performance then you won't be using all the frills like Checksumming and compression because they simply don't perform in a real environment - that's not to say it could be done better, it's just turning this stuff on can decrease your throughput by 75% or so - which is going to make many baulk..."
If you turn off the checksumming there is no point in using ZFS. The hype around ZFS - the main point of using ZFS - is that it protects your data via checksums. If you don't do that, then use another filesystem together with hw-raid instead.
If you really need to increase performance on ZFS, there are better ways than turning off checksumming. Instead, you should first get more RAM. With enough RAM, the disks will hardly need to be touched at all. Step two is to add a fast SSD as a cache. There are two caches, a read cache and a write cache:
http://en.wikipedia.org/wiki/ZFS#ZFS_cache:_ARC_.28L1.29.2C_L2ARC.2C_ZIL
According to investment bank Morgan Stanley, ZFS performs better than Linux ext4 while at the same time getting away with a threefold reduction in the number of servers needed:
http://conferences.inf.ed.ac.uk/eakc2012/slides/AFS_on_Solaris_ZFS.pdf
"If you turn of the checksumming there is no point of using ZFS. The hype with ZFS, the main point of using ZFS, is because it protects your data via checksums. If you dont do that, then use another filesystem together with hw-raid instead."
So you think checksumming is the only reason to use ZFS? And you think hardware RAID is a good solution?
Seriously?!
Not only can I read the documentation properly, I've been using it on production Linux systems for the last couple of years, so whereas in an ideal world adding loads of RAM would be great, SSD caching doesn't (on a live system) make all that much difference - and I actually value my RAM.
One of the issues I didn't raise in my earlier comment is that ZFS on Linux does not use the system page cache but instead uses its own pre-allocated memory. Two problems here: if you use the default settings then after a while it *WILL* deadlock and crash your system - badly - this is a known issue. Second, if you try to set the allocation "too high" the same thing will happen - the problem being that there is no hard and fast definition of "too high". (And of course it's a chunk of your memory no longer available for other apps!)
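For anyone who would rather cap it than trust the defaults, the knob is the zfs_arc_max module parameter - the 4GiB value below is purely illustrative:
  # /etc/modprobe.d/zfs.conf - applied when the module loads
  options zfs zfs_arc_max=4294967296
  # or change it on a running system
  echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max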
I really don't care what an investment bank has to say on the matter. With checksumming there simply is no comparison; and without checksumming, if you also stop ZFS from using its dedicated memory for caching and compare a ZFS mirror to EXT4, you're not going to see better speed .. the better speed comes from ZFS's heavy use of RAM for structure caching.
Just to ice the cake, and the reason I'm browsing at the moment and trying to slow my heart rate, earlier I rebooted my workstation to find my entire system had reverted to a copy from ~ two weeks earlier. I have absolutely no idea what happened, but absolutely everything had reverted, and all my snapshots from the last two weeks had vanished, leaving only snapshots from prior days that should no longer exist.
After much fiddling I noticed zpool was reporting one of the two disks on the mirror as being offline .. so I rebooted the system in the hope that it would come back and magically recover things. On booting up it did magically recover things, all my files are back, the old snapshots are gone and all my recent snapshots have reappeared .. yet zpool is still showing the same disk as offline ...
So .. I've just backed everything up (again) and ordered 2 x new 2TB drives .. this is one of my last ZFS based machines and it's going back to LVM + SW RAID10 as soon as the disks arrive !!!
## One of the issues I didn't raise in my earlier comment is that ZFS on Linux does not use the system page cache but instead uses its own pre-allocated memory. Two problems here: if you use the default settings then after a while it *WILL* deadlock and crash your system - badly - this is a known issue.
That is a gross exaggeration. I have only ever found benefit from changing the ARC maximum allocation on machines with heavily constrained memory and extreme memory pressure at the same time (i.e. when your apps are eating so much memory that both your ARC and page cache are under large pressure). In the past year there have been changes to make ARC release memory under lower pressure in such cases. If your machine was deadlocking, the chances are that there was something else going on that you haven't mentioned (e.g. swapping onto a zvol, support for which has only been implemented relatively recently).
It also doesn't "pre-allocate" memory - the ARC allocation is, and always has been, fully dynamic, just like the standard page cache.
Oddjobz
"...So you think checksumming is the only reason to use ZFS? And you think hardware RAID is a good solution?
Seriously?!...."
Yes, the major reason to use ZFS is the heavy data protection. The rest of its features are just icing on the cake. Remember, ZFS does a checksum every time something is read. It is like doing an MD5 checksum, it takes time. As soon as you read something, ZFS does a checksum. Calculation of the checksum takes time and CPU. I heard of an NTFS driver for Linux which was faster than NTFS on Windows, but if you cut corners and omit all safety nets, then you can achieve whatever speed you want. For instance, XFS did a fsck of 10TB of data in something like 5 minutes. If you ever traverse that much data, it will take many hours. The conclusion is that XFS fsck cuts corners and doesn't check all the data. In fact, fsck normally only checks metadata, and the data itself might be corrupt.
And no, I don't think hardware raid is good. But I am saying that if you turn off ZFS checksumming, you have an unsafe solution. You might as well use hw-raid, which is also unsafe. You have missed the point of using ZFS if you turn off checksumming.
ZFS on Linux might be a bit unstable. But I don't think you should draw the conclusion that ZFS on Solaris is unstable.
I looked at ZFS a fair while ago, but ISTR that there was no option to add and remove disks from the pool. Is this still the case? If I have a raid setup and want to add disks, can I? If I then want to remove some older disks can I? My main FS is currently running ext4, but I would like some of the ZFS features.
Google and ye shall find:
http://docs.oracle.com/cd/E19082-01/817-2271/gazgw/
'zpool add' etc is how it is done. But play around with a couple of old HDDs and test data for a while first so you find out what goes wrong before it bites you with real stuff.
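Something along these lines, for example - pool and device names are placeholders:
  # start with a mirrored pair, then grow the pool by adding a second mirror vdev
  zpool create testpool mirror /dev/sdb /dev/sdc
  zpool add testpool mirror /dev/sdd /dev/sde
  zpool status testpool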
Oh, and never ever forget the mantra "RAID is not Backup" - checksums and snapshots are all very good, but no backup means no safe data!
Not at all. For starters, you need something to recover when one of the lusers types "rm -rf /path/to/dir/ *" - which happens depressingly often, leaving them wondering where all that source code went.
You can mitigate this by keeping rolling snapshots over the last few hours/days or using a VMS-style FS (which requires an explicit purge), but users will still find ways of losing data or not realise they needed it until the snapshots have expired.
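A minimal sketch of the rolling-snapshot idea, assuming a dataset called tank/home and an hourly cron job:
  # take a timestamped snapshot (effectively instant)
  zfs snapshot tank/home@hourly-$(date +%Y%m%d-%H)
  # users can copy files straight back out of the hidden .zfs directory
  ls /tank/home/.zfs/snapshot/
  # expire old snapshots explicitly
  zfs destroy tank/home@hourly-20130401-09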
"That kind of undermines much of the point of having the newest and shiniest FS available now doesn' t it?"
No, because it means for most of your time the data is safe and you can consistently back-up by taking a snapshot and backing that up while life goes on.
But when you get a fire or flood in the server room, or "gross administrative error" destroying the wrong zpool as someone is doing some other work, you actually have a way of recovering.
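A rough sketch of that backup pattern - pool, dataset and host names are all made up:
  # snapshot, then stream the whole dataset to another machine
  zfs snapshot tank/data@backup-1
  zfs send tank/data@backup-1 | ssh backuphost zfs receive backuppool/data
  # later runs only ship the blocks changed since the previous snapshot
  zfs snapshot tank/data@backup-2
  zfs send -i tank/data@backup-1 tank/data@backup-2 | ssh backuphost zfs receive backuppool/data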
@BristolBachelor,
Each ZFS raid consists of one or several "vdev". A vdev is a group of disks. Or files. Or partitions. Or...
Each vdev should be configured as raidz1 (similar to raid-5) or raidz2 (raid-6) or a mirror.
You can never change the number of disks in a vdev. If it is 5 disks, then 5 disks it is.
You can always add another vdev to a ZFS raid, but you can never remove a vdev from a ZFS raid.
Say you have 5 disks. You create a ZFS raid with raidz1. This means there is one vdev, configured as raidz1. You cannot expand this 5 disk vdev to 6 disks. The number of disks in a vdev is fixed at 5. But you can add another vdev. So now you can add a mirror to the ZFS raid. This means the ZFS raid consists of one raidz1 vdev of 5 disks, and a mirror vdev of two disks. And you can add as many vdevs as you like. Actually, you could add a single disk to a ZFS raid, but that would be stupid, because if that single disk crashes then you have lost all your data, even in the raidz1 vdev. Always add disks with redundancy, never add a single disk.
You can also swap the disks to a larger one. Say you have 5 disks in a ZFS raid, consisting of 1TB disks. Now you can replace one disk with another 2TB disk and repair the raid. After repair, replace another disk with 2TB and repair. Rinse and repeat, and finally you have 5 disks, consisting of 2TB disks.
If you have three 500GB disks and two 1TB disks, then if you create a 5 disk raidz1 then it will be of size: 5 x 500GB. The smallest disk decides the storage capacity.
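The disk-swap dance itself looks roughly like this - device names are placeholders, and you wait for each resilver to complete before swapping the next disk:
  # let the pool grow automatically once every disk in the vdev is bigger
  zpool set autoexpand=on tank
  # swap one 1TB disk for a 2TB one and resilver
  zpool replace tank /dev/sdb /dev/sdf
  zpool status tank    # shows resilver progress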
".....Say you have 5 disks in a ZFS raid, consisting of 1TB disks. Now you can replace one disk with another 2TB disk and repair the raid....." Yeah, that works, as long as you can stomach the ridiculously long rebuild time for each disk. Oh, and please do pretend that is also not a problem with ZFS just like hardware RAID isn't a problem!
"...Yeah, that works, as long as you can stomach the ridiculously long rebuild time for each disk. Oh, and please do pretend that is also not a problem with ZFS just like hardware RAID isn't a problem!..."
Yes, long resilver times on huge disks are a problem. I have never denied this. What, do you expect me to deny there are any problems with ZFS, that ZFS is 100% bullet proof? Have I ever said that?
To mitigate the problem of resilver times, ZFS only resilvers data. Empty bits are not resilvered. Hardware raid cards always resilver the whole disk, including empty bits and data. Because ZFS controls everything, it knows which bits are data and which bits are empty space, and only repairs data. Hw-raid does not control everything, and does not know.
##".....Say you have 5 disks in a ZFS raid, consisting of 1TB disks. Now you can replace one disk with another 2TB disk and repair the raid....." Yeah, that works, as long as you can stomach the ridiculously long rebuild time for each disk. Oh, and please do pretend that is also not a problem with ZFS just like hardware RAID isn't a problem!
Rebuild time is an issue on all RAID installations where individual disks are big. At least ZFS only has to resilver the space that is actually used, rather than the entire device like traditional RAID.
"Ah, there's the catch. Because of incompatible licensing and other intellectual property concerns, the ZFS filesystem module can't be distributed with the Linux kernel."
There's the pain in the arse part of open-source that irks me (and plenty of others, I'm sure). Yea, it's "free" but it's not my flavor of "free" so yea, you can use it without paying anything, we just have these strange requirements that mean you can't use it as seamlessly as you'd like.
See also the strange creature called Iceweasel.
"There's the pain in the arse part of open-source that irks me...." Blame Sun for the history and Larry for its continuance. Sun deliberately chose to use the CDDL over GPL to stop any bits of Open Slowaris getting into Linux. The intention of the CDDL was to make a pretence at openness but in reality keep everything proprietary, which now means under Mad Larry's control, hence the complete refusal to allow ZFS into the kernel. Larry could switch it to GPL, like BTRFS and OCFS2, but then he loses control. Gordon the "RHEL expert" will of course be well aware of Red Hat's antipathy to the "trap" of the CDDL (http://www.sourceware.org/ml/libc-alpha/2005-02/msg00000.html), but I'm sure he'll be long in a moment to insist it's not really a problem, just add more RAM, etc.
CDDL code cannot be included with GPL code because the GPL insists that code combined with it be relicensed under the GPL.
It can be included with BSD code, since it is open, free and un-encumbered, and BSD has no clause forcing re-licensing under BSD.
Explain again how that is a problem with CDDL and not GPL?
".....Explain again how that is a problem with CDDL and not GPL?" Sun had the option to release under GPL but it was an internal Sun decision to CHOOSE not to as they knew releasing under the GPL meant allowing re-use and development by the Linux community. In essence, Sun were anti-Linux, therefore they CHOSE the CDDL to deliberately make it incompatible with Linux. This was back when Sun still had stupid dreams about overwhelming Linux with Slowaris x86, and thought that by forcing customers to use their tools under the CDDL would make the customers choose Slowaris x86 over Linux.
"....what are you, 12?...." Nope, I'm old enough to remember the performance problems that got Slowaris the moniker back in the Nineties. Probably before your time though.
".....Show me on the doll where Jonathan Schwartz touched you." LOL, someone that fell for the Sunshine asking about being touched!?! Sorry to disappoint you but I never met Ponytail and didn't really consider it a loss not to have.
BTW, I notice how you're steering clear of answering the points raised, did they upset you that much or are you just covering for the fact you can't counter them?
Iceweasel and the like are the result of trademark issues. The Mozilla Foundation doesn't want anyone using a modified version of its software to carry its trademarks: in essence, "Don't use our name with your version." Now, trademarks are a separate issue from software licensing, so they compromise: if you compile Firefox straight up, you're cool beans, but if you modify it, they require you to make your own logo for your version. They even give you free use of the globe icon which is common to all Mozilla products. Just don't use THEIR logos.
Excellent news, but Behlendorf's effusive contention that "...ZoL is ready for wide scale deployment on everything from desktops to super computers" just isn't compatible with the 0.6.1 version number. Love his enthusiasm, but admins should wait for 1.0 or higher before installing it into production.
Regarding ZFS, it originally came with the absurd strapline "no known pathologies". Does that mean 100% bug free ? Unbreakable ? Perfect ? ZFS is none of those, and in practice seems no more or less stable than competitors like VxVM. Often a ZFS mirror will slip into "degraded" (unsilvered) state and the admin is none the wiser until he happens to type "zpool status". Repeated "scrubs" are then required to get the mirror back:
http://unixetc.co.uk/2012/01/22/zfs-corruption-persists-in-unlinked-files/ (my own blog)
VxVM and SDS show similar behaviour. LVM less often. ZFS is the future, especially once the dedupe features become generally available in Linux, and the performance hits of constant checksum calculation are sorted out. In some ways it's a shame that Sun marketing dept. ever pushed the "this system needs no fsck!" nonsense.
I have seen HDD kicked out of Linux md RAID for no obvious reason, no SMART errors, etc. Add it back in and after a rebuild, all is fine. Guess it is an issue of the flaky disk and/or controller and/or driver software. Were you using "enterprise" class disks and stuff?
Either way, you still have to monitor ALL systems for errors!
Also you should do a ZFS scrub regularly, same for Linux RAID or any other technology, as it helps weed out disks with sector errors or that are close to dying. What you DON'T want is to have an HDD fail, and then find others go during the rebuild.
If you use ZFS on Solaris, there are excellent monitoring tools built in to the OS. The other OS's aren't quite as integrated, although there are plenty of zfs monitoring scripts for things like munin.
The "ZFS needs no fsck" is mainly of interest to users of FS where an unexpected reboot will require a full fsck before coming back online again. A full fsck on a 3TB UFS RAID 5 can take many, many hours. ZFS never requires this, and you never need to run fsck ever - data is always consistent on disk.
You do have to periodically scrub a ZFS array. This ensures that your data is always readable, and helps discover disk flaws earlier than usual usage, and is a positive thing.
Also, zfs scrub is exactly the same purpose and design as a data scrub/patrol read on a RAID array. It's the one thing people with RAID arrays often forget about until they have a single disk failure, do a rebuild and find out they have double disk failure.
zfs or RAID, if you aren't scrubbing your data each week, you don't know it's actually there.
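In practice that's just a cron job along these lines - "tank" is a placeholder pool name:
  # kick off a scrub (e.g. weekly from cron); it runs in the background
  zpool scrub tank
  # check the result later - checksum problems show up in the CKSUM column
  zpool status -v tank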
".....do a rebuild and find out they have double disk failure....." Well, that may be the experience with Snoreacle kit, but I think you'll find other vendors have got a lot better predictive kit linked to proactive support. IIRC, IBM had it in 1992. My first time seeing this in action was the late '90s, when a disk turned up one day at the company I worked for then, we called EMC to ask why, and they said the controller was predicting a certain disk in a certain slot was going to fail around 7pm and we should take it offline and replace it. At that point no error was being shown on the array itself. Seeing as it was on a test system we decided to leave it in and see what happened, and just before 8pm it was marked bad by the controller, the RAID set rebuilt itself using a spare, and the disk automatically taken offline. Oh, and when we replaced it we didn't find any unknown dud disks, no data was lost, and no mythical bit-rot magically appeared. Sorry if Snoreacle hardware is keeping you twenty years behind the curve.
"The "ZFS needs no fsck" is mainly of interest to users of FS where an unexpected reboot will require a full fsck before coming back online again."
It's not that ZFS needs no fsck, it's that the fsck can be safely conducted while the filesystem is online.
Kinda like the old days of fsck -p on SunOS 4.1, but much, much better.
"They might trust the boring, BAU bits of their business to COTS packages - and so they should. But not the family jewels....." Especially the family jewels as that is the part of the business they will want ot risk least. Seriously, you go try convincing a business to try something new without the safety-net of at least tested and supported by a major vendor. I can remember the days when companies did all their programming in C++/CGI, and then this thing called Java popped up and even with Sun's support they still went all cautious and said they wanted to see it in action first!
".....You are hilarious." You are merely inexperienced.