* Posts by Gordan

653 publicly visible posts • joined 15 Oct 2008

Page:

AMD details new low-bucks, high-oomph graphics cards

Gordan

Re: Rebranding

That's what I was just thinking. All the GPUs released are straight relabelings of the 7xxx series GPUs - for the second generation in a row. The only new addition is the R290 - and that is as yet unreleased. Worse, the R290 only roughly matches the shader count of the Nvidia Titan - and the Titan has been out for months.

I was really looking forward to AMD finally coming up with a worthwhile improvement, but they have really dropped the ball. Again. The only real benefit from the latest round of relabeling is that prices are being pushed down slightly.

Open ZFS wielders kick off 'truly open source' dev group

Gordan

Re: CDDL and GPL not compatible

"BTW: pi's already have BTRFS."

BTRFS is the most useless pile of steaming manure that has ever disgraced Linux with its inclusion in the kernel tree. Its features were _intended_ to rival ZFS, but after years of development it has failed to even match the usability of ancient vanilla file systems like ext*.

It is also telling that EL7 will ship with XFS as the default FS rather than BTRFS.

If I didn't know better I might suspect that Oracle's continued pushing of BTRFS is nothing more than an attempt to dissuade people from using ZFS on Linux, and thus help them push Solaris with ZFS as the killer feature.

Gordan

Pi support and 32-bitness

It's not the Pi support per se that is the limiting factor on the Linux implementation, it's the support for 32-bit platforms in general. ZFS was designed for a 64-bit platform with a very robust kernel virtual memory subsystem. Linux's kernel virtual memory is somewhat crippled (its use is generally discouraged, as there are usually better ways to do things), and when you combine that with generally memory-starved 32-bit platforms you run into problems.

The FreeBSD implementation works much better if you are stuck with 32-bit hardware. Or if you really want to run Linux on a 32-bit platform with ZFS, zfs-fuse works very well.

AMD's beating HEART of Internet of Things: 64-bit ARMs head for gadgets

Gordan

ARM Micro Devices

Only a matter of time...

WD outs 'Mini Me' Red label NAS drives

Gordan

Re: Bring back 5¼"

If the average error rate (which has remained roughly constant for the past 20 years or so) is still rated at one unrecoverable error per 10^14 bits (~11TB), that means that during a RAID rebuild you are, on average, going to lose 1-2 sectors of data which you will only get back if you have backups and the file in question hasn't changed since the last backup.
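A quick back-of-the-envelope sketch of that claim (the 10^14-bit spec is the figure above; the array shapes are made up for illustration):

```python
# Expected unrecoverable read errors (UREs) during a RAID rebuild, assuming
# the commonly quoted consumer-drive spec of 1 error per 1e14 bits read.
URE_RATE_BITS = 1e14

def expected_ures(bytes_read):
    """Expected number of unrecoverable read errors for a given volume of reads."""
    return (bytes_read * 8) / URE_RATE_BITS

# Rebuilding a degraded array means re-reading every surviving disk in full.
for disks, size_tb in [(4, 4), (8, 4)]:
    bytes_read = (disks - 1) * size_tb * 1e12
    print(f"{disks}x{size_tb}TB RAID5 rebuild: "
          f"~{expected_ures(bytes_read):.1f} expected unrecoverable errors")
```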

4TB drives are enough of a liability as it is. Making everything bigger and heavier will also make it much, much slower (5.25" disks spun at 3600rpm) and less reliable (bigger platters are going to wobble more and the whole drive will be more sensitive to vibration).

Gordan

No doubt with WD's usual lying SMART

Thanks, but no thanks. I would not touch WD or Samsung drives with a bargepole. They consistently do things like log a number of pending sectors, but when you overwrite them (which should cause a remap), the pending count goes to 0 and the reallocated count always stays at 0.

That means that either the drive is just re-using sectors that have demonstrably gone bad in the past (and I doubt it checks them again after the write - WD disks don't have the Write-Read-Verify feature), or equally bad, it remaps them but doesn't say how many it has remapped. No doubt to reduce warranty claims.

Seagate (for WRV feature), Hitachi (for reliability) and Toshiba for me, thanks.

Intel ships high-powered C++ compiler for native Android apps

Gordan

Re: Intel's compilers are targeted at marketing folks

ICC has always produced code that runs significantly faster than GCC, and despite having had a decade and a half to catch up and AMD backing, GCC has failed to close the gap by much. Every x86 processor since the Pentium MMX has had vectorization capabilities, which means that for sensibly written code (tall order for most programmers, but bear with me) you will get code that runs 2-8x faster (MMX can process two 32-bit ints in parallel, SSE can process 4 floats or ints or 2 doubles in parallel) than non-vectorized code.

The 8x extreme is a bit of an oddity from back in the Pentium 4 days - it broke operations up into micro-ops in a way that didn't work too well without good compiler support, so code generated by most compilers ran slower, but code generated by ICC ran faster. The Pentium 4 flopped because it needed a decent compiler to run code faster clock-for-clock than the Pentium 3, but with a decent compiler it actually was faster clock-for-clock (by about 20%) for suitably written C or C++ code.

Last I checked ICC generated code ran no slower than GCC code in the worst case (pure pointer chasing, no processing), usually 30-50% faster (e.g. MySQL, google for the pre-Oracle paper on the subject), and up to 4-8x faster on tight number-crunching loops (e.g. custom curve fitting functions I was writing a while back).
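For a feel of the principle, here is a minimal sketch comparing scalar and vectorized execution of the same arithmetic, using numpy as a stand-in for SIMD code - it's an analogy only, and the interpreter overhead inflates the ratio far beyond the 2-8x a compiler gets:

```python
# Scalar loop vs data-parallel execution of the same arithmetic.
# numpy dispatches to vectorized (SIMD) machine code, the Python loop does not;
# the ratio illustrates the principle, not ICC-vs-GCC specifically, and Python's
# interpreter overhead makes the gap much larger than compiled code would show.
import timeit
import numpy as np

N = 1_000_000
a = np.random.rand(N)
b = np.random.rand(N)

def scalar_sum_of_products():
    total = 0.0
    for i in range(N):
        total += a[i] * b[i]
    return total

def vector_sum_of_products():
    return float(np.dot(a, b))

t_scalar = timeit.timeit(scalar_sum_of_products, number=3)
t_vector = timeit.timeit(vector_sum_of_products, number=3)
print(f"scalar: {t_scalar:.3f}s  vector: {t_vector:.3f}s  "
      f"speedup: {t_scalar / t_vector:.0f}x")
```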

I'm an open source supporter and contributor, but it's important to give credit where it's due and not let FOSS prejudice cloud judgements. GCC's vectorization support was years late to the party and is still after all this time nowhere nearly as good as ICC's.

Seagate's shingle bathers stalked by HGST's helium HAMR-head sharks

Gordan
Boffin

Re: Why helium ?

In a word - heat. Platter and actuator surfaces require some degree of cooling to stop them from overheating. Helium is more thermally conductive than air, thus a better coolant.

Vacuum has terrible thermal properties - there would be no convection heat exchange at all. Vacuuming the disks would reduce the possible density, not increase it.

Gordan

"Unless there is something that I haven;'t considered.........."

Only that disk manufacturers want your disks to fail as soon as the warranty runs out so you go and buy new ones. Data growth only goes so far toward a sustainable business.

KingSpec's 2TB Multicore PCI-E SSD whopper vs the rest

Gordan

Questionable

There are three issues I see with this:

1) In any even remotely realistic use-case, you will run out of another resource (CPU or network) long before you reach the level of sequential I/O performance this device can allegedly deliver.

2) The test graphs only cover sequential I/O, not random I/O. What is the 4KB random-write performance on this device after it has been "primed" with a full fill from /dev/urandom?

3) Being based on mSATA SSDs, this is little more than an 8-port SATA card.

So what is the big deal? You can easily achieve similar performance using a decent 8-port SAS card and 8 similar SATA SSDs (e.g. from Intel or Kingston).

Apache OpenOffice 4.0 debuts with IBM code side and centre

Gordan

Any improvements in there...

... that didn't also go into the LibreOffice code tree as well? No? Didn't think so.

ARM servers to gain boost from ARM, Oracle Java partnership

Gordan

Re: Oracle has to be there

More likely OpenJDK than Dalvik.

Gordan

Re: Jazelle

Jazelle was deprecated after ARMv5, on which it was optional. ARMv5 is so old that the latest ARM Fedora dropped support for it. IIRC the "replacement" for it is the ThumbEE instruction set, which is completely generic, not in any way Java-specific, and targeted at compiled code (applicable to Java only in a JIT context).

Analyst: Tests showing Intel smartphones beating ARM were rigged

Gordan

Re: Compiler Matters, but it's a fair comparison

You clearly haven't tried leveraging vectorization in your C and C++ code, have you? Otherwise you would know that the performance and completeness of GCC's implementation is lacking, to put it kindly. GCC is simply not yet up to the job of approaching the performance of ICC-generated code. Run some proper tests of your own on real-world code you wrote and fully understand, rather than spewing nonsense about things you know nothing about.

Also, the tests used, from what I can tell, were biased toward gaming and graphical workloads which benefit significantly from vectorization. Whether that is the sort of load the average user runs on their phone may be questionable, but that isn't exactly cheating.

Gordan

Compiler Matters, but it's a fair comparison

The main thing this proves is that GCC sucks compared to ICC, at least on x86. This is not news - we have known it for years.

See:

http://www.altechnative.net/2010/12/31/choice-of-compilers-part-1-x86/

Does this make the test suddenly unfair? Hell no! Intel have a fantastic compiler, and it is free for non-commercial and non-academic use (e.g. free for open source projects). If ARM have a compiler that produces similarly improved results over GCC on their processors, and they are prepared to make it available for free on the same terms, I'm sure a lot of projects will use it. If they don't, tough shit - they were beaten on the basis of both processors being used with the best available compiler.

I'm a big fan of ARM, but it is important to give credit where it's due.

Linux 3.11 to be known as 'Linux for Workgroups'

Gordan

Does that mean Linux 4 is going to be called Linux 95?

Chromebooks now the fastest-growing segment of PC market

Gordan

I have a Chromebook Pixel, and it's _fantastic_ - now that I've put a full-fat Linux distro on it.

Snowden: US and Israel did create Stuxnet attack code

Gordan

Re: "most trusted services in the world if they actually desire to do so."

For some reason this seems like a particularly apt quote in response to Evil Auditor @ 18:54 08/07/2013:

http://m.imdb.com/title/tt0434409/quotes?qt=qt0450688

Intel demos real-time code compression for die shrinkage, power saving

Gordan

"The reason it's better to have it done in hardware instead of by the compiler is it makes for less work for the CPU,"

How do you figure that? If both compression and decompression are done in hardware and the initial code is uncompressed, then the CPU has to burn power to compress the code in the first place, then decompress it just-in-time to execute it.

If the compression is done by the compiler, you load the compressed code directly at run-time and only have to decompress it just-in-time to execute it. By having the compiler do the compression once (and it can spend a lot more time optimizing the compression, since it doesn't have to be done in real time), you save at least half of the run-time work - probably more, since compressing is typically slower than decompressing.
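A toy sketch of that asymmetry, with zlib standing in for whatever scheme Intel actually uses (which isn't public here), purely to illustrate the compress-once, decompress-many argument:

```python
# Compress-at-build vs decompress-at-runtime, with zlib standing in for the
# (unknown) hardware scheme. The point: the compiler can pay the expensive
# compression cost once, offline, at a high effort level, while the runtime
# side only ever pays the (cheaper) decompression cost on each fetch.
import os
import time
import zlib

code = os.urandom(64 * 1024) + b"\x90" * (192 * 1024)   # fake "machine code" blob

t0 = time.perf_counter()
packed = zlib.compress(code, 9)          # done once, by the "compiler"
t_compress = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(100):                     # done every time the code is fetched
    zlib.decompress(packed)
t_decompress = (time.perf_counter() - t0) / 100

print(f"compress once: {t_compress * 1e3:.2f} ms, "
      f"decompress per fetch: {t_decompress * 1e3:.2f} ms")
print(f"compressed to {len(packed) / len(code):.0%} of original size")
```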

Gordan

Re: Begs the question

A compiler in hardware? Wasn't that one of the many too-smart-by-half ideas that Java was supposed to bring?

The fearful price of 4G data coverage: NO TELLY for 90,000 Brits

Gordan
Trollface

NO TELLY for 90,000 Brits

I sense the average IQ rising already.

Hm, disk drive maker, what's that smell lingering around you?

Gordan

Re: Spinning rust will die

It's not the transfer rate - it's the average access time, fundamentally dictated by the spindle speed, which in turn dictates the number of operations per second you can do. On a 7200rpm disk, you can only do about 120 IOPS.

If you do the maths, you can infer from this that to read all the sectors (assuming 512 byte) in random (i.e. non-sequential) order on a 1TB disk would take more than 188 days. If the sectors are 4KB, the figure goes down to "only" a little under 24 days. Quadruple for a 4TB disk. Suffice to say for anything but bulk sequential access (e.g. one's movie collection transferred to NAS because DVDs were taking up too much shelf space), spinning rust simply isn't a sensible solution purely based on performance. And that is before we get into the issues of reliability.
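The arithmetic behind those figures, as a quick sketch (the 120 IOPS figure is the one above; disk and sector sizes are just the examples discussed):

```python
# Time to touch every sector of a disk in fully random order at ~120 IOPS
# (7200 rpm: roughly one operation per average seek plus rotational latency).
IOPS = 120

def full_random_read_days(capacity_bytes, sector_bytes):
    ops = capacity_bytes / sector_bytes       # one I/O per sector
    return ops / IOPS / 86400                 # seconds -> days

for cap_tb in (1, 4):
    for sector in (512, 4096):
        days = full_random_read_days(cap_tb * 1e12, sector)
        print(f"{cap_tb}TB disk, {sector}B sectors: ~{days:.0f} days")
```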

Reliability-wise, my maths says that with consumer grade 4TB disks, anything less than 4-disk RAID6 is a liability unless the data is worthless or something like a paranoia-mandated tertiary backup the loss of which doesn't really affect anything important.

Review: Western Digital Sentinel DX4000

Gordan
WTF?

£1580 ?!?!

HP Microserver: ~£160 after cashback for the 2.2GHz one (probably about £100 for the older 1.5GHz one).

4TB disks: £160 (£640 for 4)

FreeNAS: Free

Total: £800

£780 seems a bit steep for a Windows licence.

Are biofuels Europe's sh*ttiest idea ever?

Gordan

Waste Oil, Not Fresh

The fundamental flaw in the way the biofuels directive is being implemented is that fresh vegetable oil is turned straight into biodiesel. This is, quite frankly, retarded. A far more sensible way to do this is to ONLY use waste vegetable oil. That way you are recycling and getting a useful benefit from something that has already been used up for its original purpose, and you are thus reaping an environmentally free secondary benefit.

Inuit all along: Pirate Bay flees Sweden for Greenland

Gordan
Devil

thepiratebay.ml next?

Seeing as Mali is going to be giving away free domains anyway?

Boffins say flash disk demands new RAID designs

Gordan

Re: Wear-levelling works against you?

@Nigel 11

I think you are exaggerating a worst-case scenario based on small writes with big stripes. If your writes are small, you should be using small RAID chunks.

Gordan

@Shades

You speak of disks from a somewhat more reliable era. My observation of the reliability track record of 1TB+ disks is that they, quite frankly, suck. Across my server estate (granted, only about 35 1TB+ disks, but enough to get the gist of the reliability you can expect), the failure rate has been nearly 10% per year over 3 years (the worst model I have has seen a failure rate of about 75% over 3 years - yes, some really are that dire).

Of course, some makes/models are better than others, but the trend is pretty poor, both in terms of the number of complete disk failures and in terms of the disk defects (bad sectors) that arise. And bear in mind that every time you get a bad sector you are liable to lose that data.

The primary cause of flash failure is wear - and that is predictable and trackable. Spinning rust can suffer an electro-mechanical failure at any time, and often does with little or no notice. Early warning and predictability are important if you value your data.

Gordan

"you can't assume SSDs are always as reliable as spinning rust, especially when you use them in the same way."

Does that last statement take into account just how woefully unreliable spinning rust is?

Hold on! Degrees for all doesn't mean great jobs for all, say profs

Gordan
Happy

Full marks for eloquence

"Naturally these are a long way from the sort of jobs once dished out to graduates in the days when a degree was a rare badge of distinction, rather than something which fell out of your cornflakes packet following three years of alcohol-fuelled fornication."

I can see this sentence being recycled many times.

Production-ready ZFS offers cosmic-scale storage for Linux

Gordan

Re: Gordon Gordon Tom Maddox Gordon Phil Gordon Gordon AC Destroyed All .....

Matt, you have exposed yourself as an opinionated ignoramus for whom even aspiring to mediocrity is a stretch. I would invite you to stop embarrassing yourself in public, but I am reluctant to do so, for you might heed it and withdraw such an excellent source of Monty-Python-esque comedic amusement.

Facebook runs CentOS on their servers, not supported RHEL.

Facebook was also using GlusterFS since before it was bought by RH (not that a company running CentOS would care).

Google runs their own "unsupported" version of Linux (IIRC derived from Ubuntu or Debian).

Those two between them are worth more than a non-trivial chunk of the FTSE 100 combined. Note - FTSE 100; there is no such thing as a FTSE 1000 (although I'm sure you'll backpedal, google your hilarious blunder and argue you meant one of the RAFI 1000 indexes not weighted by market capitalization - which, considering that there are, for the sake of comparison, only about 2,000 companies traded on the London Stock Exchange, isn't going to be all that heady a bunch).

One of the biggest broadcast media companies in the country (not to make it too specific, but the list of possibles isn't that big, you can probably figure it out) runs various "unsupported" desktop-grade MLC SSDs (Integral, Samsung, Kingston) in their caching servers (older Oracle SPARC or HP DL360 and DL380 boxes). They also run a fair amount of unsupported software on their (supported) RHEL machines, including lsyncd and lnlb.

One of the biggest companies in mobile advertising (after Google) runs their real-time reporting database systems (the biggest real-time MySQL-or-derivative-thereof deployment in the country) on MariaDB 10.0 alpha, because official MySQL wasn't workable for them due to the Tungsten replicator (commercial support available, for all the good that will do you when the product is a pile of junk) being far too unreliable.

These are just the first four that I can think of off the top of my head (having either worked there or worked with someone who did) that completely obliterate your baseless, opinionated rhetoric. You really should quit while you're behind.

Gordan

Re: Gordon Tom Maddox Gordon Phil Gordon Gordon AC Destroyed All Braincells.....

## "....what is the value and usefulness of an engineer that is only capable of regurgitating the very limited homework that the vendor has done for them?...." It's called implementing tried and tested technology.

No, Matt. That's what you keep telling yourself because you don't know how to do anything that hasn't been done for you and pre-packaged for bottle feeding.

Gordan

Re: Tom Maddox Gordon Phil Gordon Gordon AC Destroyed All Braincells.....

Seriously guys - ignore this guy's trolling. As the proverb says: "Do not teach a pig to sing - it wastes your time and it annoys the pig." His illiterate squealing is starting to grate.

For all his private-parts waving with the supposed "FTS 1000" [sic] company credentials and having a RH support contract, there are almost certainly more people at RH who have heard of me than of him (ever ported RHEL to a different architecture?). Not to mention that he has been talking about cluster file systems to someone who has been rather involved in their development and support (ever deployed a setup with the rootfs on a cluster file system? Using DRBD+GFS or GlusterFS?) without realizing how much this exposed his pitiful ignorance of the subject.

The blinkered view of "it's not supported by the vendor, so of course I haven't tried it", had it always prevailed, would have ensured that the fastest way to travel still involved staring at an ox's backside. Seriously - what is the value and usefulness of an engineer who is only capable of regurgitating the very limited homework that the vendor has done for them?

Let it go. Let his ignorance be his downfall as he hides his ineptitude behind support contracts. Incompetence comes home to roost eventually. Jobs of "engineers" that only know how to pick up the phone to whine at the vendor get off-shored to places that provide better value all the time.

Gordan

Re: Gordon Phil Gordon Gordon AC Destroyed All Braincells Gordon BTRFS? You must...

##imaginary 100TB on a single hp server (somehow without the built-in RAID card)

Do you really not understand that just because a server comes with hardware RAID capability you don't actually have to use it?

##".....you mean you have a support contract that you can try to hide behind when things go wrong...." Actually, yes, that's exactly what I mean.

I rest my case, fellow commentards.

##".....LSI HBAs, and MSA disk trays. The normal SATA disks were easy, but my client required large SSDs in one of the trays for hot data.......we got a bunch of disks on a trial from multiple vendors....." Again, very unlikely as again you would have invalidated your hp warranty using none hp cards and none hp disks.

1) Not everybody cares about warranty.

2) If the company is a big enough, well-thought-of brand, you'd be surprised what HP and the like are prepared to support and warranty.

Also consider that the cost of hardware replacements is negligible compared to the value generated by the said hardware, and when the price differential between DIY and $BigVendor solutions runs into hundreds of thousands of £, you'd be surprised what the upper management layers suddenly start to deem acceptable when budgets are constrained (and they always are).

##You also talked about 100TB behind one adapter yet you also wanted SSDs behind the same adapter!?!?

I said LSI host adapters (plural, if you know what that word means).

I'm not going to waste any more time on correcting your uninformed, ill-understood and under-educated drivel. Arguing with somebody as clearly intellectually challenged as you, entertaining as it may have been to begin with, gets boring eventually. Your ad hominem attacks only serve to further underline the lack of technical knowledge and understanding behind your unfounded opinions.

Gordan

Re: Kebabfart Add / Remove disks?

##".....Say you have 5 disks in a ZFS raid, consisting of 1TB disks. Now you can replace one disk with another 2TB disk and repair the raid....." Yeah, that works, as long as you can stomach the ridiculously long rebuild time for each disk. Oh, and please do pretend that is also not a problem with ZFS just like hardware RAID isn't a problem!

Rebuild time is an issue on all RAID installations where individual disks are big. At least ZFS only has to resilver the space that is actually used, rather than the entire device like traditional RAID.

Gordan

Re: But ...

## One of the issues I didn't raise in my earlier comment is that ZFS on Linux does not use the system page cache but instead uses it's own pre-allocated memory. Two problems here, if you use the default settings then after a while it *WILL* deadlock and crash your system - badly - this is a known issue.

That is a gross exaggeration. I have only ever found benefit from changing the maximum ARC allocation on machines with heavily constrained memory and extreme memory pressure at the same time (i.e. when your apps are eating so much memory that both your ARC and page cache are under heavy pressure). In the past year there have been changes to make the ARC release memory under lower pressure in such cases. If your machine was deadlocking, the chances are that there was something else going on that you haven't mentioned (e.g. swapping onto a zvol, support for which has only been implemented relatively recently).

It also doesn't "pre-allocate" memory - the ARC allocation is, and always has been, fully dynamic, just like the standard page cache.

Gordan

Re: Phil Gordon Gordon AC Destroyed All Braincells Gordon BTRFS? You must...

##".....have you actually tried it?...." Oh no, as I already pointed out that my company has zero interest in it.

So you openly admit you have 0 actual experience of this. Wow. So much opinion, so little actual knowledge.

## Did you fail to understand the bit where I mentioned (repeatedly) that we're very close to Red Hat when it comes to Linux?

Oh, I'm sorry, by "close to RedHat" you mean you have a support contract that you can try to hide behind when things go wrong and it becomes obvious you don't know what you're doing? I hadn't realized that was what you meant.

##".....Considering I just got 48GB of RAM for my new desktop rig for £250 (Registered DDR3 ECC)...." Apart from the hint of manure coming from the suggestion of putting 48GB of RAM in a desktop,

Yup. EVGA SR2 based, since you asked, not that it matters.

## if you knew anything about hp servers as you claim then you would know they do not support you buying cheapie RAM and plugging it in, you need to buy the hp modules which are markedly more expensive than cheap desktop RAM.

I've never gone wrong with Crucial RAM. It has always worked in anything I threw it into, and they have compatibility lists for just about anything. And anyway, you've been saying you wanted things cheap, fast and reliable (and not just two of the above).

##".....The top end deployments I've put together with ZFS are in the region 100TB or so (off a single server - HP servers and disk trays......" Ah, there it is! Please do explain what "disk trays" and server models (hint - they come with RAID cards as standard) you used,

LSI HBAs, and MSA disk trays. The normal SATA disks were easy, but my client required large SSDs in one of the trays for hot data. None (at least of a suitable size) were supported, so we got a bunch of disks on trial from multiple vendors. The expanders choked almost instantaneously on most of them with NCQ enabled. The Kingston V100+ 500GB models worked great, however. What you need to realize is that you cannot do bigger and better things while having your hands tied behind your back by what your vendor is prepared to support. If that's all you're doing, your job isn't worth a damn and is ripe for outsourcing. I regularly have to push the limits orders of magnitude past what any vendor-supported system in the price range can handle.

## and then account for the fact that whilst hp support several Linux flavours - you can buy the licences direct from them and they front the support including the hardware, giving you one throat to choke - they do not offer ZFS and do not support it unless it's part of Slowaris x86.

Who said anything at all about vendor supported configurations? If you have the luxury of being able to achieve things within your requirements and budget while sticking with those, that's great - meanwhile, those of us that have to get real work done don't have that luxury.

##"More nonsensical FUD......" Hey, go do the Yahoogle and read some of the hits.

Oh, I get it! "It must be true! I read it on the internet!" Is that the best you can do? Do you have any experience of your own on any of these matters? Or are you just regurgitating any odd drivel you can find that supports your pre-conceptions?

##".....You've clearly never used it....." <Sigh>. How many times do I have to repeat - we have NO interest in it.

Then stop trolling baseless FUD about it.

Gordan

Re: Gordon Gordon Gordon AC Destroyed All Braincells Gordon BTRFS? You.....

##"..... ZFS will work on hardware RAID just fine....." Hmmm, seeing as even Kebbie has admitted that is not the truth

OK, Mr. "think you know it all when you know bugger all" Bryant - have you actually tried it? Have you? No? I have news for you - I have tested this setup extensively. It works exactly as you would expect. You get poor small-write performance, as you would expect from traditional RAID5, and no transparent data correction capability, because as far as ZFS is concerned it is running on a single disk. Is this a sensible setup to use when you could have better write performance and data durability for free? No. But if you want to deliberately use a sub-optimal solution you are free to do so.

##".....other file systems in that it will detect bit-rot,....." As expected, when all else fails it's back to the Sunshiner staple, mythical bit-rot.

It's not mythical. See the links posted earlier on research done by the likes of NetApp. It is a very real problem. More so with large, bleeding edge disks. If NetApp's tested-to-death setup achieves one such error in 67TB of data, the figure for desktop grade high density disks with 4 platters is almost certainly several times worse.

##"..... but this goes completely against your original requirements of a high performance low cost solution...." It's quite comic when Sunshiners try and push ZFS as low-cost, neatly ignoring that it requires massive amounts of very pricey RAM and cores in a single SMP image for the big storage solutions they claim it is ready for. ZFS above a few TBs in a desktop is more expensive. Come on, please pretend that a 32-socket SMP server is going to be cheaper than a load of two-socket servers. Don't tell me, this is the bit where you switch to the Snoreacle T-systems pitch

Considering I just got 48GB of RAM for my new desktop rig for £250 (registered DDR3 ECC), I don't see how you can argue that RAM is that expensive. You can build a 2x16-core Opteron system for not an awful lot. By the time you have accounted for a decent chassis, a couple of decent PSUs and some disks and caddies, the £3K you are likely to spend on a motherboard, two 16-core CPUs and a quarter of a TB of RAM isn't as dominant in the TCO as it might originally appear - especially when you consider that on a setup like that you'll be attaching hundreds of TBs of disks. If that's not good enough for you, you can add some SSDs to use as an L2ARC, which would sort you out even if you wanted to use deduplication heavily on such a setup. And ZFS isn't particularly CPU-heavy, contrary to what you are implying (it might be on SPARC, but only because SPARC sucks ass on performance compared to x86-64).

Seriously - work out your total cost per TB of storage in various configurations. Split it up across lots of boxes and the overheads of multiple chassis, motherboards and CPUs start to add up and become dominant in the TCO.

##Problem for you is far, far cleverer people have come up with far better solutions than ZFS

If they have, you haven't mentioned it.

##This case from Oracle makes fun reading about how ZFS has problems with large numbers of LUNs (https://forums.oracle.com/forums/thread.jspa?messageID=10535375).

I lost interest in reading that thread when I noticed it talks about Slowaris. I really have 0 interest in Slowaris. We are talking about Linux and the ZFS implementation on that. I know your attention span makes it hard to focus on a subject for any length of time before you demonstrate you cannot even BS about it, but it isn't helping the case you are arguing.

##"....My backup server runs ZFS just fine in 3.5GB of RAM with a 6TB (RAIDZ2, 4TB usable) array....." Oh, so that's the toy you base your "experience" on! WTF?

No, it's just the smallest. The top-end deployments I've put together with ZFS are in the region of 100TB or so (off a single server - HP servers and disk trays, not Slowracle). What do you base your experience on? Do you use ZFS on anything? Or are you just spreading FUD you know nothing about? Thus far you have only provided ample evidence of your own ignorance.

## In summary, ZFS imports will be very slow or stall unless you turn off the features the ZFS zealots told you were the reasons for having ZFS in the first place.

Which features would that be, exactly? The only feature that will affect import time is deduplication - which is actually not recommended for typical setups. It is a niche feature for a very narrow range of use-cases - and needless to say disabled by default.

##You need to turn off dedupe (often claimed by Sunshiner trolls to be vital),

No, you just need to not turn it on unless you actually know what you are doing; but you've so far amply demonstrated you have no idea what you are talking about, so that point is no doubt wasted on you.

## avoid snapshots (the redundancy in ZFS),

Are you really, really THAT ignorant that you think snapshots have anything at all to do with redundancy? Really? Do you even know what redundancy and snapshots are? Because what you said there shows you are a clueless troll. I pity your customers, assuming you have managed to BS your way through to any.

##and keep your number of devices as low as possible and hope they don't use multipathing because ZFS stupidly counts each and every path as a device, multiplying the problem you get with large numbers of devices.

What on earth are you talking about? Using /dev/disk/by-id/wwn-* nodes is the correct way to get your device IDs, and it works just fine. The number of devices in a pool is pretty open-ended. The important thing to pay attention to is the number of devices in a vdev, with respect to device size, expected error rate (most large disks are rated at one unrecoverable error for every 11TB of transfers) and the level of redundancy you require.

##And if your device names change then you are seriously screwed as the import will just hang trying to resilver the "missing" disk.

If your devices' WWNs have changed you have far bigger problems to worry about.

##And you want to somehow claim that is going to be better in a failover situation than a shared filestsystem?!?

If you aren't too stupid to use it in the way the documentation tells you to - absolutely. But that is clearly a big "if" in your case.

## And even then the whole import will stall if you don't have rediculously large amounts of RAM.

More completely unfounded FUD. I've shifted many tens of TBs of pools between machines and never had issues with imports taking a long time.

##Yeah, so production ready - NOT! Seriously, it's beyond snake oil, it's verging on a scam to claim ZFS "is the answer". I'd be annoyed at your obtuseness but I'm too busy laughing at you!

That's OK - everybody else reading the thread is laughing at you. Seriously - you can use whatever FS you see fit, I couldn't care less. But you really should stop trolling by spreading FUD about things you clearly know nothing about.

Gordan

Re: Gordon Gordon Gordon AC Destroyed All Braincells Gordon BTRFS? You.....

##"Just Yahoogle for ZFS import takes long time and watch the hits pile up."

And in all cases the situation is use of deduplication with insufficient RAM and hardware that lies about the commits. Use ext* under such circumstances, and see how long fsck takes on a 10TB+ file system. A lot longer than a zpool import. This is what you need to be comparing with in a like-for-like scenario.

##"Not that ZFS's ridiculous memory requirements aren't another issue."

Really? My backup server runs ZFS just fine in 3.5GB of RAM with a 6TB (RAIDZ2, 4TB usable) array. ZFS-FUSE runs just fine on my Toshiba AC100 with 510MB of usable RAM. The memory requirements are by and large a myth - they only really apply when you are using deduplication on large arrays.

##"Your example is even more silly especially as ZFS becomes one big SPOF, whilst even basic hardware RAID can be fortified with RAID between cards (RAID10 or RAID50) or make use of software RAID built into other filesystems."

Do you realize that if you run RAID50 between two cards, you are running a stripe of RAID5s? If you have RAID5 on each card and you lose one card, your RAID50 array will be trashed, because you have effectively lost one logical disk from a RAID0 array. There is nothing stopping you from having a ZFS pool across multiple HBAs (a lot of my ZFS pools span 3-4) in a configuration equivalent to RAID50. You do this by having multiple RAIDZ1 vdevs in a pool. And that will still give you better protection than hardware RAID, for the reasons discussed (you really need to comprehend the meaning and advantage of end-to-end checksums here).

Gordan

Re: Kebabfart Gordon Destroyed All Braincells Gordon BTRFS? You must be joking...

##".....ZFS can work correctly with hardware raid only if the hardware raid functionality is shut off...." Which sounds like a problem to me, and is what I posted, and which you then denied and now admit. I assume your initial denial was simply that Sunshiners autonomic reflex to blindly deny any problems with ZFS.

You really don't understand why ZFS RAID is better than hardware RAID, do you. ZFS will work on hardware RAID just fine - it still has an advantage over other file systems in that it will detect bit-rot, unlike everything else except BTRFS (and no, your precious cluster FSes won't detect bit-rot). But if you use it with raw disks you get _extra_ advantages that hardware RAID simply cannot provide, such as self-healing, which traditional RAID5 is not capable of because there is no way to establish which set of data blocks constitutes the correct combination when they don't all agree. Having an end-to-end checksum gives you that capability.

With RAID6 you could repair it, and Linux MD RAID6 will do so on a scrub, but neither hardware nor software RAID6 implementations do on-access integrity checking, unlike ZFS, so by the time your Linux MD RAID6 scrub has identified a problem, the chances are that your application has already consumed the corrupted data. And with the vast majority of hardware RAID implementations you don't even get transparent background scrub functionality. Expensive proprietary SAN or NAS boxes might have such a feature, but that goes completely against your original requirement of a high-performance, low-cost solution.

In fact, by everything you have said, your chosen solution (a cluster file system) completely violates the requirements you originally listed (high performance, high durability, low cost), because with the exception of GlusterFS all the others require a SAN (unless you are using DRBD, but that solution is probably beyond you), and GlusterFS' durability, split-brain resistance and self-healing from the bit-rot your hardware RAID silently exposes you to are either missing or immature compared to the alternatives.

I mean, seriously, why are you so hung up on hardware RAID? Because you get to offload the XOR checksumming onto the hardware RAID controller? A modern x86-64 processor can do that faster than any customized ARM on a RAID controller can. And then you are forgetting that traditional parity RAID (5, 6) suffers a substantial performance drop on writes that are smaller than the stripe size. ZFS does away with that because its stripe size is variable, and its copy-on-write nature allows it to perform a small write into a small stripe and commit the data across all the disks in the pool in a single operation. In comparison, traditional RAID has to read the rest of the stripe, update the parity and then write the updated data and the parity - a woefully inefficient operation.
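Counting the I/O operations makes the small-write penalty concrete; a sketch under the usual read-modify-write model for parity RAID (illustrative only - real controllers and ZFS both have further optimizations):

```python
# I/O operations needed to update data that is smaller than a full stripe.
#
# Traditional RAID5 (read-modify-write): read the old data blocks and old
# parity, then write the new data blocks and new parity.
#
# ZFS-style copy-on-write with a variable stripe: the new data plus its
# parity are written as a fresh (small) stripe in one pass; nothing old
# needs to be read back.

def raid5_rmw_ops(blocks_written):
    # read old data + read old parity + write new data + write new parity
    return 2 * blocks_written + 2

def cow_variable_stripe_ops(blocks_written, parity_blocks=1):
    # just write the new blocks and their parity
    return blocks_written + parity_blocks

for blocks in (1, 2, 4):
    print(f"{blocks}-block partial-stripe write: "
          f"RAID5 read-modify-write = {raid5_rmw_ops(blocks)} ops, "
          f"CoW variable stripe = {cow_variable_stripe_ops(blocks)} ops")
```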

Traditional RAID is a 20th century technology. It's time to get with the programme and take advantage of what people cleverer than you have come up with since then.

Gordan

Re: Gordon Gordon AC Destroyed All Braincells Gordon BTRFS? You must be joking...

What on earth are you talking about? Importing a pool takes no longer than mounting it after a reboot. It might take longer if you have duff hardware (disks that lie about barrier commits, or cached RAID controllers that aren't battery backed, or similar problems), as ZFS goes rolling back through many previous versions of uberblocks trying to find a consistent data set. But that's a duff, lying hardware issue, not a FS issue.

If you have dedupe, you need to make sure you have at least 1GB of RAM per TB of storage if your files are large (i.e. your stripe size tends toward 128KB), or proportionally more if your files are small. If your deduplication hashes don't fit into ARC, the import will indeed take ages if your pool is dirty. But bear in mind that this is another reason why caching hardware RAID is bad - if your machine blows up and you lose the battery backed cache all bets are off on the consistency of the data that will be on your disks because the caching RAID card will do write re-ordering. If you instead use a plain HBA and use a ZIL for write-caching, the write ordering is guaranteed, so your pool cannot get trashed.
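As a rough sketch of how that rule of thumb scales (the ~128 bytes per dedup-table entry is simply what the 1GB-per-TB-at-128KB figure above works out to - treat it as an assumption, not a quoted on-disk format):

```python
# Rough dedup-table (DDT) RAM estimate: one entry per unique block, so the
# requirement scales with pool size divided by average record size. The
# per-entry size below is backed out of the "1GB of RAM per TB at 128KB
# records" rule of thumb quoted above; it is an assumption for illustration.
DDT_ENTRY_BYTES = 128

def ddt_ram_gb(pool_tb, avg_record_kb):
    blocks = (pool_tb * 1e12) / (avg_record_kb * 1024)
    return blocks * DDT_ENTRY_BYTES / 2**30

for record_kb in (128, 16, 4):
    print(f"10TB pool, {record_kb}KB average records: "
          f"~{ddt_ram_gb(10, record_kb):.0f} GB of RAM for the DDT")
```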

Gordan

Re: Gordon AC Destroyed All Braincells Gordon BTRFS? You must be joking...

One of these days I will stop feeding the trolls...

##"So first you say that ZFS having problems with hardware RAID is a lie, then you admit that ZFS "gets confused" by hardware RAID exactly as I said."

It doesn't get "confused". It just has no way of seeing what the physical shape of the underlying array is, and thus cannot do anything to repair your data automatically, over and above what your hardware RAID is doing (which isn't, and never will be, as much as ZFS can do for any given level of redundancy). If it is not obvious to you why by now, I suggest you go to the institution that issued you with your IT qualifications (if indeed you have any, which is looking increasingly doubtful) and demand your tuition fees back, because clearly they have failed to teach you anything.

##"Well, thanks for that amazing bit of Sunshine, please do send me the email address for your manager at Oracle so I can let him know what a great job you are doing spreading The Faith."

Could it be that you are even more ignorant than you had demonstrated prior to that post? Oracle has no interest in ZoL - they are actually betting on BTRFS on Linux (last I checked the only sponsored BTRFS developer was actually working for Slowracle).

And the open source ZFS implementations are actually more advanced and feature-rich than Oracle's official Solaris implementation (which has stopped being released in open source form).

##"but the idea of hanging large JBODs of disk off a hardware adapter you have turned hardware RAID off of, and then relying on one filesystem to control all that data without redundancy (you can't share pools between ZFS instances) is simply so stupid,"

Are you being deliberately or naturally obtuse? Forgive me for asking, but it is not particularly obvious any more. ZFS does things to protect your data that your hardware RAID _cannot_ and _will never be able to_ do. Delegating this function to hardware RAID means you lose most of that capability. ZFS does handle redundancy - and higher levels of redundancy than any hardware (or software) traditional RAID. For a start, ZFS supports RAIDZ3, which is n+3 redundancy, higher than any hardware RAID controller you can buy (they top out at RAID6 which is n+2). ZFS also has additional checksums that allow it to figure out which combination of blocks is correct even when there is only n+1 level of redundancy in use, something that traditional RAID cannot and never will be able to do.

And what exactly do you mean that you cannot share pools between ZFS instances? You can always import a pool on a different machine if you need to (e.g. if your server suffers a severe hardware failure). The command in question is "zpool import </path/to/device/node(s)>". Provided the version of the pool and the feature flag set are supported on your target implementation, it will import and work just fine. For example, if you create a pool with version 26, you can import it on the widest range of implementations (ZoL, FreeBSD, ZFS-FUSE, Solaris, OpenSolaris, or OpenIndiana).

Hardware RAID is vastly inferior in terms of durability AND performance. If you haven't read up enough to understand why, you really should do so instead of speaking from the wrong orifice about things you clearly know nothing about.

Gordan

Re: Gordon BTRFS? You must be joking...

@Alan Brown:

You are spot on in terms of cluster FS scaling. The problem is that people use them in a wrong way for the wrong tasks. The main advantage of a cluster FS is that (provided your application does proper file locking), you can have simultaneous access to data from multiple nodes. The main disadvantage is that using the cluster FS in this way comes with a _massive_ performance penalty (about 2,000x, due to lock bouncing).
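A trivial sketch of where the ~2,000x comes from, using the round-trip and RAM-latency figures quoted in the longer reply further down (both are rough, order-of-magnitude numbers):

```python
# Ratio of cluster-FS lock bouncing to locally cached metadata access,
# using ~100us for a LAN round trip and ~50ns for a RAM/cache hit
# (the figures quoted elsewhere in this thread).
lock_bounce_s = 100e-6    # ~ping time on 1GbE/10GbE
local_cached_s = 50e-9    # ~RAM access after overheads

print(f"lock bounce is ~{lock_bounce_s / local_cached_s:.0f}x slower "
      f"than locally cached metadata access")
```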

So while on the face of it a cluster FS might seem like a great idea to the inexperienced and underinformed, in reality it only works passably well where your nodes don't access the same directory subtrees the vast majority of the time. And if your nodes aren't accessing the same subtrees, why would you use a cluster FS in the first place?

In terms of performance, for any remotely realistic workload you will get vastly better FS level performance by failing over a raw block device with a traditional standalone FS on it between nodes.

Gordan

Re: AC Destroyed All Braincells Gordon BTRFS? You must be joking...

###".....You don't need a clustered file system to fail over - you fail over the block device....."

##"Slight problem - you can't share the block device under ZFS, it insists on having complete control right down to the block level."

That is simply not true, at least not in ZoL (not 100% sure, but I don't think it is the case on other implementations, either). You can have ZFS on a partition just fine. There are potential disadvantages to doing so (reduced performance - since different FSes use different commit strategies, ZFS cannot be sure about barrier interference from the other FS, so it disables write caching on the disk), but it does work.

But even if it were true, if you are sharing a block device between machines (typically an iSCSI or AoE share, or in rarer cases DRBD), you would normally expose the whole virtual block device and use it as a whole. Instead of partitioning that block device, you would usually just expose multiple block devices of the required sizes. On top of that, most people doing this do so with a SAN (i.e. hugely expensive, which goes against your low-cost requirement) rather than a DIY ZFS-based solution. So the point you thought you were making doesn't actually make sense on any level, and only reinforces the appearance that you have never actually used most of these technologies and only know what you have read in passing from under-researched, bad journalism.

##"That is why ZFS is a bitch with hardware RAID: ".....When using ZFS on high end storage devices or any hardware RAID controller, it is important to realize that ZFS needs access to multiple devices to be able to perform the automatic self-healing functionality.[39] If hardware-level RAID is used, it is most efficient to configure it in JBOD or RAID 0 mode (i.e. turn off redundancy-functionality)...."

Why is this a problem? All this says is that if you give ZFS a single underlying RAID array, you will lose the functionality of ZFS' own redundancy and data protection capabilities. If you are really that emotionally attached to hardware RAID, go ahead and use it. But then don't moan about how it's not as fast or durable as it would be if you just gave ZFS raw disks. If you are using ZFS on top of traditional RAID, ZFS sees it as a single disk. If you insist on crippling ZFS down to the level of a traditional file system, you cannot meaningfully complain that it has fewer advantages. But even in that case you would still have the advantage of checksumming, so if one of your disks starts feeding you duff data you'd at least know about it immediately, even if ZFS could no longer protect you from it.

##"Ooh, nice feature-sell, only us customers decided clustering was actually a feature we really value. Try again, little troll!"

This seems to be something you have personally decided, while at the same time demonstrating that you don't actually understand the ramifications of such a choice for durability and fault tolerance. Clustering is a very specialized application, and in the vast majority of cases there is a better, more reliable and more performant solution. You need to smell the coffee and recognize that you don't actually know as much as you think you do.

##"I can't give you specifics but I'll just say we had a production system that was not small, not archive, and hosted a MySQL database on GlusterFS to several thousand users all making reads and writes of customer records."

Sounds pretty lightweight for a high-performance MySQL deployment. Try pushing a sustained 100K transactions/second through a single 12-core MySQL server (I work with farms of these every day) and see how far you get with GlusterFS for your file system, regardless of what FS your GlusterFS nodes are running on top of. Even if the latency weren't crippling (it is), the pure CPU cost would be. Sure - if you are dealing with a low-performance system and you can live with the questionable-at-best data durability you'll get out of traditional RAID+ext*+GlusterFS (assuming you get GlusterFS quorum right so you don't split-brain, otherwise you can kiss your data goodbye the first time something in your network setup has an unfortunate glitch), that setup is fine. But don't cite a fragile, low-performance system as an example of a solution that is supposed to be durable, performant or scalable.

##"why the shortage of such for Solaris if ZFS is just so gosh-darn popular?"

I never said Solaris was that great. My preferred OS is RHEL. I also never said that ZFS was particularly popular - I just said it was better than any other standalone file system.

Gordan

Re: AC Destroyed All Braincells Gordon BTRFS? You must be joking...

@Matt Bryant

There are so many fallacies, errors, and such a pure and utter lack of comprehension of the storage systems involved that it's difficult to address them all, but I'll try, and do so in two posts for brevity (it's too long for a single post), in order of the points made.

##"Businesses want high-speed, scaleable, redundant and reliable storage at a low price."

Sure - but a cluster file system absolutely is not the way to achieve that. In the order of requirements you listed:

1) High speed

Clustering (you seem to be a proponent of OCFS2 and GlusterFS for some reason) has terrible performance compared to normal single-node file systems due to lock bouncing between the nodes. If the nodes are concurrently accessing the same directory subtrees, the lock bounce time is close to the ping time, around 100us even on 1Gb or 10Gb Ethernet. You might think that's pretty quick, but on a local file system this metadata is typically cached, and since there's no arbitration between nodes to be done, the access time for the cached metadata is the latency of RAM access, typically around 50ns (after all the overheads - don't be fooled by the nominal clock-cycle latency at GHz+ speeds). That's about 2,000x faster than the ping time. This is the main reason why even the best cluster file systems severely suck when it comes to general-purpose FS performance.

2) Scalable

Clustering has nothing to do with scalability. ZFS can scale to thousands of disks, across many vdevs (a vdev being roughly equivalent to a traditional RAID array, only with the added benefits of much improved data integrity and recoverability). You might argue that this limits you to how many HBAs you can fit into a machine, but this isn't really that serious a limitation. You can hang an external disk tray with a built-in expander off an HBA and daisy-chain several of those together well before you need to worry about multiple HBAs in a server. You are looking at hundreds of disks before you have to start giving remotely serious thought to which motherboard has more PCIe slots for HBAs. You could argue that the solution is something like GlusterFS that will take multiple pools of data (GlusterFS puts data on a normal FS, BTW, so if you run it on top of ZFS instead of a different FS, you get all of the unique benefits of ZFS in terms of data integrity and transparent repairability). But you'd be wrong. GlusterFS has its uses, but the performance penalty is non-trivial, and not just because of FUSE overheads. If you really are dealing with data on that scale, you need to look at your application and shard the data sensibly at that level, rather than trying to scale an unscalable application through additional layers of performance-impairing complexity.

3) Redundant and reliable storage at a low price

This is where ZFS isn't just the best option, it is the ONLY option. Disks are horrendously unreliable, and this typically varies quite vastly between different models (regardless of manufacturer). Once you accept that disks will randomly fail, develop bad sectors, silently send you duff data (yes, it happens, and more often than you might imagine), or do one of the many other things disks do to make a sysadmin's day difficult, you will learn to value ZFS's unique ability not just to detect end-to-end errors and protect you from data corruption (including, for example, a bit-flip in the non-ECC RAM on your RAID controller), but also to silently repair it, either on access or during a scrub. BTRFS can detect such errors, but last I checked its scrubs cannot yet repair them (which makes it of questionable use at best, as even with redundancy you have to restore the corrupted file from a backup). This and other associated features (e.g. RAIDZ3, which gives you n+3 redundancy that very, very few solutions provide, and none of them are what you might call cheap) enable you to achieve enterprise-level reliability using cheap desktop-grade disks (not that I think enterprise-grade disks are any more reliable, despite their 2x+ higher price tag for a near-identical product). No other solution comes close to providing that level of cost effectiveness.

##"Now, when you come along and say "We have a whizz-bang new filesystem called ZFS that can look after really large amounts of storage, but to do so you have to give up all concepts of hardware RAID and instead use a really big, monolithic server, with massive amounts of RAM, but you have no high availability", they tend to ask "So how is that better than GlusterFS?" or whatever product they have already had the salegrunt come round and do a slideshow on."

First of all, RAID is downright disastrous for data integrity compared to ZFS' data protection at the same level of disk redundancy. If you have a disk that has started to bit-rot and it's feeding you duff data (as I said before, it happens more often than I ever wanted to believe), in traditional RAID 1/5 you can detect a mismatch between the data and the mirror or the parity, but you have no idea which mirror copy is correct, or which of the n+1 data chunks in a stripe is correct and which is corrupted. So you can detect an error but you have no way of correcting it. It's time to restore from backup. ZFS saves you in this case because each block has a checksum in the redundantly stored metadata, which means that it can work out which combination of data blocks is correct and repair the corrupted block - completely transparently (it logs it so you can keep track of which disk is starting to go bad). With traditional RAID6 you can achieve the same thing if you use Linux MD RAID, but there is no checking of this on every read - you will only pick it up on a full array scrub, and by then your application has likely already consumed garbage data. With hardware RAID, in most cases you don't even get the option of performing a periodic data scrub.
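A toy sketch of why a block checksum stored separately in metadata lets you pick the good copy where a plain mirror cannot (this illustrates the principle, not ZFS's actual repair code path):

```python
# With a plain mirror, two disagreeing copies are just "a mismatch" - there is
# no way to tell which one rotted. With a checksum recorded separately at
# write time (as ZFS keeps in its metadata tree), the copy that hashes to the
# recorded value wins, and the other copy can be rewritten from it.
import hashlib

def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def repair_mirror(copies, recorded_checksum):
    """Return (good_block, repaired_copies); raise if no copy checks out."""
    for block in copies:
        if checksum(block) == recorded_checksum:
            # rewrite every mirror leg from the copy that verified
            return block, [block for _ in copies]
    raise IOError("no copy matches the recorded checksum - restore from backup")

original = b"customer record 42"
recorded = checksum(original)                 # stored in metadata at write time
copies = [original, b"customer recXrd 42"]    # one mirror leg has silently rotted

good, repaired = repair_mirror(copies, recorded)
print("recovered:", good, "- all copies now match:", all(c == good for c in repaired))
```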

##"And then it gets worse when you have to admit ZFS is still a project under development and not really ready for enterprise use, and you have no case studies for real enterprise use of it (real enterprises being the only people that will have the cash and inclination to buy those big monolithic servers with all those cores and RAM required to run really big ZFS instances)."

At this point I should probably point out that ZFS on Solaris is older and more mature than GlusterFS, GFS2 and OCFS2. It is of the same order of age and maturity as GFS (I'm not even sure why we are comparing it against these cluster file systems - it's an apples and oranges comparison - but I'm using it as an example of why your anti-ZFS bias is completely baseless on grounds of maturity). Spend a few months on the ext4 mailing list (ext4 being the default Linux FS on most distributions) and on the ZFS-on-Linux mailing list and you will find that ext4 actually gets a lot more stories of woe and corruption than ZoL, and ext4 is supposed to be one of the most stable FSes around.

##"And whilst you moan about whether GlusterFS or Left Hand have lock managers etc., what the business sees is a low cost storage solution that scales, is redundant, and uses commodity x64 architecture."

Except that GlusterFS is completely at the mercy of the underlying file system to keep the data safe and free of corruption. GlusterFS on top of ZFS is actually a very good combination if you need GlusterFS features, because you get all of the data-protecting benefits of ZFS, along with tiered storage performance (L2ARC/ZIL on SSD), and the benefits of GlusterFS if you want to mirror/stripe your data across multiple servers. But if you use something other than ZFS underneath it, you fall into the same trap with data corruption. Worse, GlusterFS cannot scrub your data, so if one mirror's data ends up being corrupted, GlusterFS will not do anything to check for this - it'll just feed you duff data 50% of the time (assuming a GlusterFS AFR/mirroring arrangement). With no means of pre-emptively detecting and repairing corruption at that layer, you have to rely on the underlying FS to do it for you; and ZFS is the only FS at the moment that can.

##"What is unamusing is the rabid reaction you get when you raise objections to ZFS being included in the Linux kernel. For a start, even if it did have the actual features wanted, it is not GPL compliant. End of discussion."

The rabid reaction is, IMO, down to GPL totalitarianism. The BSD guys had no problem including it in FreeBSD long before ZoL. Why do you care whether it's in the mainline kernel or not? What difference does it make to anyone except the GPL fanatics?

Gordan

Re: Destroyed All Braincells Gordon BTRFS? You must be joking...

@Matt Bryant:

"Trying to bolt together a Frankenstein network filesystem out of Luster and ZFS seems pretty pointless when Red Hat already have a superior option in GlusterFS."

This just tells me that you haven't actually tried to use GlusterFS for anything serious. The fact that they only twigged in the past year or so that you cannot have a split-brain-resistant distributed file system without hard fencing capabilities speaks volumes. I have used it extensively in the past (I'm the one who added the initial MDRAID+GlusterFS support to OpenSharedRoot), and use it more or less daily at the moment (big data archives at work, because we need to glue together multiple boxes' storage space), and while it is OK for archived data that is read-mostly and tends to involve mostly linear transfers/scans of big files, you wouldn't want to run it on a system where you need a lot of small I/O on lots of files.

Gordan

Re: Destroyed All Braincells Gordon BTRFS? You must be joking...

@Matt Bryant:

"That is the problem - ZFS is not suitable for the big filesystems it is aimed at on the simple grounds clustering is not there."

OK, one post might have been a random brainfart, but after two you seem to have amply demonstrated you don't actually know what clustered file systems are, or are for.

1) You don't need a clustered file system to fail over - you fail over the block device and mount the non-clustered FS on the fail-over node, after fencing off the failed primary node. Concurrent access on a clustered FS may seem like a fancy feature until you find that your performance drops by a factor of 1,000 when you start simultaneously working on files in the same directory from multiple nodes.

2) GFS, having been around for longer, has more traction than OCFS.

3) Every company that has used Solaris in the past 10 years or so is using ZFS. ZFS has been in Solaris since Solaris 10.

Cluster file systems like GFS, OCFS and VCFS lack a fundamentally important feature, and that is transparent data repair and recovery. Contrary to what the disk manufacturers may have led you to believe, disks fail, and they lie about their defects in the most creative of ways to limit warranty claims. When this happens, traditional RAID will corrupt your data, one sector at a time. ZFS (and BTRFS to a lesser extent) will reconstruct your data from the surviving disks in the array, find a data set that checks out, and repair the error. Your expensive SAN and cluster file system cannot do that in anywhere near the number of cases that ZFS and BTRFS can.

Patent shark‘s copyright claim could bite all Unix

Gordan

Re: April Fools!

Yeah. Given how similar it sounds to the whole sorry SCO affair, though, it makes you wonder if that was just the biggest, most expensive, longest running prank of all time. It certainly would have had more credibility as such.

Page: