Don't be shy, vendors: Let's see those gorgeous figures

One of the frustrations when dealing with vendors is actually getting real availability figures for their kit. You will mostly get generalisations, such as "it is designed to be 99.999 per cent available" or perhaps "99.9999 per cent available". But what do those figures really mean to you and how significant are they? Well, …
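For a sense of scale, here is a rough conversion from "nines" to a yearly downtime budget (a back-of-the-envelope sketch assuming a 365-day year and no planned-maintenance carve-outs):

```python
# Back-of-the-envelope downtime budget implied by an availability figure.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for availability, label in [
    (0.999,    "three nines"),
    (0.99999,  "five nines"),
    (0.999999, "six nines"),
]:
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label}: ~{downtime:.1f} minutes of downtime per year")

# three nines: ~525.6 minutes (about 8.8 hours)
# five nines:  ~5.3 minutes
# six nines:   ~0.5 minutes (about 32 seconds)
```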

COMMENTS

This topic is closed for new posts.
  1. Scott Earle
    Thumb Up

    "Don't have these figures to hand"

    This is absolutely correct. DO NOT believe a vendor that says they don't have these numbers. Of course they have the numbers.

    If they say they don't have the numbers then they are OBVIOUSLY lying somewhere - because how can you claim to provide a certain measurable uptime figure, but not have the actual numbers to hand? It's ridiculous.

    Vendors in "lying in order to win a sale" shocker ...

    1. JohnMartin

      Re: "Don't have these figures to hand"

      - Disclosure NetApp Employee -

      While the developers and support management look closely at reliability and failure metrics, in most cases the majority of field sales staff don't have access to the detailed reliability figures, and certainly don't carry them around with them, so it's accurate to say they don't have these figures immediately accessible.

      As pointed out elsewhere in these comments, a single metric for a single component doesn't really give a good indication of the overall reliability of a system, especially for disk arrays, where the major component that fails has a range of different failure characteristics, none of which are well expressed by a simple MTBF number, especially given the range of mitigating techniques arrays are specifically designed for.

      Because of the difficulty of creating a readily understood model that accurately reflects the complex interrelations of component reliability for systems with a mixture of exponential and Weibull component failure distributions, NetApp publishes independently audited reliability metrics based on a rolling six-month audit. If you're interested in array reliability, and how to measure it, check out the following two papers.

      “A Comprehensive Study of Storage Subsystem Failure Characteristics” by Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou and Arkady Kanevsky, April 2008, http://media.netapp.com/documents/dont-blame-disks-for-every-storage-subsystem-failure.pdf and “A Highly Accurate Method for Assessing Reliability of Redundant Arrays of Inexpensive Disks (RAID)” by Jon G. Elerath and Michael Pecht, IEEE Transactions on Computers, Vol. 58, No. 3, March 2009, http://media.netapp.com/documents/rp-0046.pdf
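      For a feel for why a bare MTBF number misleads, here's a rough illustration of my own (the 1.2M-hour MTBF and the shape parameters are invented, nothing to do with NetApp's audited figures): two distributions with the same mean life can put quite different odds on a drive failing in its first year.

      ```python
      import math

      # Illustrative only: chance a drive survives its first year under an
      # exponential model versus Weibull models with the SAME mean life.
      MTBF_HOURS = 1_200_000
      YEAR_HOURS = 8760

      def survival_exponential(t, mtbf):
          # Constant hazard: MTBF alone pins down the whole distribution.
          return math.exp(-t / mtbf)

      def survival_weibull(t, mtbf, shape):
          # Same mean life, but the hazard falls over time (shape < 1,
          # infant mortality) or rises (shape > 1, wear-out).
          scale = mtbf / math.gamma(1 + 1 / shape)
          return math.exp(-((t / scale) ** shape))

      print(f"exponential:      {survival_exponential(YEAR_HOURS, MTBF_HOURS):.4f}")  # ~0.9927
      print(f"weibull, k = 0.7: {survival_weibull(YEAR_HOURS, MTBF_HOURS, 0.7):.4f}")  # ~0.9629
      print(f"weibull, k = 1.5: {survival_weibull(YEAR_HOURS, MTBF_HOURS, 1.5):.4f}")  # ~0.9995
      ```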

      Regards

      John Martin

  2. jake Silver badge

    Well ... My personal "friends & family" system ...

    Has been up, and available, since Flag Day (January 1, 1983). To the best of my knowledge, the system hasn't lost a bit of data in all that time. Does that count?

    http://forums.theregister.co.uk/forum/containing/1199724

    http://forums.theregister.co.uk/forum/containing/1205349

    1. Matt Bryant Silver badge
      Happy

      Re: Well ... My personal "friends & family" system ...

      ".....To the best of my knowledge, the system hasn't lost a bit of data in all that time....." Yes, but was it doing during that time? Vendor uptime figures are meaningless if the storage item in question is doing next to nothing. As an example, hp used to make what was IMHO a truly awful array called the FC60. When we decom'd one in our DR datacentre, by some fluke we actually left the whole rack disconnected from the network but still powered up. Over two years later we noticed and we all had a good laugh about that was the longest one of our FC60s had stayed up and not lost any data!

      1. jake Silver badge

        @Matt (was: Re: Well ... My personal "friends & family" system ...)

        "Yes, but was it doing during that time?"

        Being used by friends & family. Daily. No complaints about lost data yet ...

        http://forums.theregister.co.uk/forum/containing/464412

        http://forums.theregister.co.uk/forum/containing/625870

        http://forums.theregister.co.uk/forum/containing/1475805

  3. Youvegottobe Joking
    Facepalm

    5 x 9's, let down by idiots

    Down through the years I have seen companies spend huge amounts on their storage and then bugger all on everything else. Maybe they were thinking that they could have an outage but at least they would not lose their data.

    5 x 9's switches are pretty damn expensive, and surprisingly often you will find a pair of small departmental switches fed from the same 13A power socket attached to a storage system worth 100k, or even single-attached hosts because they ran out of ports on their departmental switches.

    The worst one I saw?

    A cleaner's cupboard in the basement, approximately 4 feet by 5 feet, with switches balanced on a stack of carpet tiles and cables running up and down through a hole in the ceiling. The storage cabinet was backed up against the wall so no-one could service anything in the back of it. The worst part of this? There was no air conditioning, and the temperature in the room was in the high 30s (centigrade).

    That customer was expecting 5x9's availability.

    1. jake Silver badge

      Re: 5 x 9's, let down by idiots

      I can top that:

      http://forums.theregister.co.uk/forum/containing/250950

    2. Dazed and Confused

      Re: 5 x 9's, let down by idiots

      I was once asked to go and check an HA cluster at the site of a well-known national institution. They had the whole cluster running from a single 13A plug in the ceiling above the rack. In fact there was no end of SPOFs in their system; they only seemed to have duplicated the most expensive elements and forgotten all about multiple links to the network switches, multiple links to the storage devices... etc.

      1. jake Silver badge

        @Dazed and Confused (was: Re: 5 x 9's, let down by idiots)

        Manglement rarely listen to techs in the trenches, alas. See:

        http://forums.theregister.co.uk/forum/containing/652214

  4. Khaptain Silver badge
    Happy

    Planned maintenance

    We have Avaya telephony platforms, and I would say that they do retain these figures. Outside of human error or telephony provider/operator problems, I have never seen the systems go down. The systems do have problems, but they can be taken care of without losing uptime.

    It is important that the correct maintenance procedures are carried out and that "maintenance" downtime is not included in the 99.9%.

    You must also exclude "human error", for which the manufacturer cannot be held responsible, unless of course it was one of their staff that made the error (but that's another problem).

    What I am trying to say is that in a few cases, albeit very few, these figures are actually achievable. Avaya's larger systems, just like larger IBM AS/400s etc, can have parts replaced in real time (hot-plugging of almost anything) without having to shut the system down or reboot. No downtime...

    1. Dazed and Confused

      Re: Planned maintenance

      Hot-plug replacement of HW components is essential, but what about critical SW updates? Is it possible to replace these online? I remember years ago teaching an HA class for a major vendor and getting onto the subject of SW updates and rolling upgrades to minimise downtime (sadly that's normally minimise rather than eliminate) when one of the students started asking some lovely awkward questions. His background had been in writing SW for switches in the mobile world, and he was very surprised to find that few bits of SW can be patched really live, such that you could, for example, unplug a shared library from a running process and plug in a fixed version, or even replace the binary in a running process with a fixed one.

      But when it comes to hot-plug HW, it is important to consider what the state of the system is in the midst of a HW failure or hot-replacement event. How would the equipment react to a second failure? Typically these systems are designed to avoid SPOFs, but sometimes shit happens.

      So, since the article is discussing storage systems, let's take that as an example. You have a nice shiny new disk array. To provide acceptable write latencies it uses a cache. What happens if the cache fails? Well, a decent array would have mirrored caches, so what should happen is that all your data is now safely in the other half of your array. But at this stage, although your data is being cached, is it being safely cached? What happens if you get a second failure? Does the array automatically drop to write-through behaviour when the protection level is degraded? Can you, the customer, choose between safety and performance modes? There are many cases where, depending on the design, a single failure can leave you badly exposed to potential data loss or corruption if the storage system experiences a second failure before the first one has been adequately dealt with.
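      To make that choice concrete, here's a toy sketch of the policy decision (my own illustration; ToyArrayController and its methods are hypothetical, not any vendor's firmware):

      ```python
      from enum import Enum

      class CacheState(Enum):
          MIRRORED = "mirrored"  # writes held in two cache copies
          DEGRADED = "degraded"  # partner lost; only one copy of dirty data

      class ToyArrayController:
          """Toy policy model, not real firmware: what to do with writes
          once the cache mirror is gone."""

          def __init__(self, prefer_safety=True):
              self.state = CacheState.MIRRORED
              self.prefer_safety = prefer_safety  # the customer-visible knob

          def on_partner_cache_failure(self):
              self.state = CacheState.DEGRADED

          def write(self, block, data, backing_store):
              if self.state is CacheState.DEGRADED and self.prefer_safety:
                  # Write-through: slower, but a second failure can no
                  # longer take unflushed dirty data with it.
                  backing_store[block] = data
              else:
                  # Write-back: fast, but while DEGRADED a second failure
                  # loses whatever is dirty in the surviving cache.
                  self._cache_dirty(block, data)

          def _cache_dirty(self, block, data):
              pass  # buffered for later destage in a real array
      ```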

  5. Denarius
    Meh

    Real figures ?

    Slightly off-topic, but I remember a decade or so ago some uni student researching just how often hard drives failed versus vendor-claimed MTBF. Achieved MTBF was 10% of claimed MTBF. That was not popular with disk and storage vendors. However, I have had the same sad experiences as previous commentards: $$ on some parts of the kit and not enough on equally important parts. The main flaw in keeping stuff up for long times is maintenance: so often it is forbidden on the grounds that it may cause downtime, right up to the time a bug surfaces; or kit is not power-cycled in scheduled maintenance after firmware upgrades, meaning many dying disks cause data loss instead of routine hot-swap replacement. Ever had an entire data center power fail? Bad news for beloved customers when this happened.

  6. Ball boy Silver badge

    Okay, say I buy an array of disks that claims (and can back it up with evidence) that it's hitting the 5 9's. I do the same with the server box and switches, etc.

    It's all fine until I stick an OS on and then load an application on top of that - anyone seen any claim for reliability from an app? Yes, I can have multiple copies running in a cluster but they'll all suffer from a generic code fault (a modern Y2K bug, for example) at exactly the same time and make a mess of my systems' reliability.

    Surely any business wants to know how reliable their full system will be - the failure rate of any one sub-component is only of passing interest - but how would any vendor go about offering trustworthy data?
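    For illustration, availabilities of components in series (everything must be up at once) multiply, which is why whole-stack numbers erode so fast. A sketch with invented figures:

    ```python
    # Invented example figures - not from any vendor's datasheet.
    components = {
        "disk array":      0.99999,
        "SAN switches":    0.99999,
        "server hardware": 0.9999,
        "OS":              0.999,
        "application":     0.995,
    }

    # In series, the system is only up when every layer is up,
    # so the availabilities multiply.
    system = 1.0
    for availability in components.values():
        system *= availability

    print(f"whole-stack availability: {system:.5f}")          # ~0.99389
    print(f"downtime: {(1 - system) * 525600:.0f} min/year")  # ~3200
    ```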

  7. Anonymous Coward
    Anonymous Coward

    Unenforceable Contract

    " in fact, every presentation I have had with regards to availability are under very strict NDA and sometimes not even notes are allowed to be taken."

    Can a company actually enforce an agreement that says you can't say when they are breaking the law? And false advertising is breaking the law in a lot of countries. If they're telling us that they do 99.9999 and they're telling you that they only manage 98.12, then the law should not allow that to be kept secret, no matter what bit of paper you signed.

    1. Don Jefe
      Meh

      Re: Unenforceable Contract

      NDAs may not always be enforceable in court, but they are certainly enforceable in other ways. Violate the NDA and you find yourself no longer invited to events and shunned by others in the industry. Excommunication by the industry will end your career quickly. This type of behavior certainly isn't limited to IT either; it's just the way the world works.

  8. andy bailey

    I am going to surprise you, Martin - there is a company that publishes its availability figures every day. Now, you may think it's slightly off-topic, as it is mainly a server vendor. In fairness, though, it does sell SAN and storage technology, which is also included in its availability figures. Anyhow, I don't believe you can separate the two and, as other commentators have mentioned, you should include networking infrastructure and power, for example. But back to servers and storage - a huge estate of servers is actively monitored and the downtime is calculated daily using 6 months of data. Today's figure is 99.99957% and yes, it does change. And yes, I do work for the company.
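    Taking that figure at face value, the arithmetic (mine, not the vendor's) works out like this:

    ```python
    # 99.99957% over a rolling six-month window, at face value:
    availability = 0.9999957
    six_months_minutes = 182.5 * 24 * 60  # 262,800 minutes
    downtime = (1 - availability) * six_months_minutes
    print(f"~{downtime:.1f} minutes per six months")  # ~1.1
    print(f"~{downtime * 2:.1f} minutes per year")    # ~2.3
    ```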

  9. John Smith 19 Gold badge
    Meh

    Note that 5 9s is about *unplanned* outage, and to get it you need a design *process*

    As previous commentards have stated, having the nice SAN with the data striped across hot-swappable drives and dual PSUs (and even power cables) is useless if you're stuffed by some cheapo SPOF router which is periodically unplugged by a cleaner to run her vacuum.

    I suspect the really tricky part is updating an OS while it's running. Some players understand this quite well (IBM iSeries springs to mind) but I'd guess it's down to sequential shutdown, upgrade and re-boot on a per-processor basis.
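    That sequential shutdown/upgrade/re-boot dance is easy to sketch - the hard part in real life is old and new versions co-existing mid-rollout. A minimal sketch with hypothetical helper functions:

    ```python
    # Minimal rolling-upgrade loop over cluster nodes. The four helpers
    # are hypothetical stand-ins for whatever a real cluster manager
    # provides; the point is the one-node-at-a-time sequencing.
    def rolling_upgrade(nodes, drain, upgrade, undrain, healthy):
        for node in nodes:
            drain(node)    # shift workload onto the remaining nodes
            upgrade(node)  # patch and re-boot just this one node
            if not healthy(node):
                # Stop before degrading a second node: better a paused
                # upgrade than an outage.
                raise RuntimeError(f"{node} failed post-upgrade checks")
            undrain(node)  # restore it, then move to the next
    ```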

    As for end-to-end reliability, well, that sounds like you should benchmark your system if you really need that level of capability.

    1. Denarius
      Meh

      Re: Note that 5 9s is about *unplanned* outage, and to get it you need a design *process*

      Spot on, John. Score one for QNX, which does this well. Perhaps other micro-kernel OSes do it also, but I have not seen them. Clustering is supposed to allow nodes and storage units to be taken offline for maintenance while the rest continues on. Rarely seen it work. I remember one simple Media Player application patch trashing a bunch of front-end decrypting boxes at a customer site. The OS was fine, but post-patch the boxes mishandled incoming encrypted and compressed data streams: the patch had reset registry settings to default, so the custom app was not called when data arrived. Conclusion: even if the OS, applications and hardware are tested and fault tolerant, something else unexpected can break it. Yes, full testing is required before doing even the simplest patching of the whole system. Still run across mission-critical systems where there is no full test environment, though.

      1. John Smith 19 Gold badge
        Happy

        Re: Note that 5 9s is about *unplanned* outage, and to get it you need a design *process*

        "Score one for QNX which does this well."

        I would expect VMS (which pioneered clustering), OS400 and whatever IBM or Unisys big iron is running to be pretty good as well.

        "Rarely seen it work. I remember one simple Media Player application patch trashing a bunch of front end decrypting boxes at a customer site. OS was fine, but post patch the OS mishandled incoming encrypted and compressed data streams."

        That suggests the PMP patch was rubbish, but the OS should have shut down the decryption apps. The signature should have been the decryption app needing re-booting, not taking the OS with it.

        "Still run across mission critical systems where there is no full test environment though. "

        I've worked on core systems where the main app had at its top level a "select company" option, allowing you to set up a dummy environment, and another where that was assumed to be unnecessary.

        The latter was much more worrying, as you could not gauge beforehand its interaction with the (large) existing system.


  10. Ron Christian
    Thumb Up

    More like...

    ...nine fives.

  11. Steve Crook

    In a world a long long time ago

    I worked at a local authority, programming on an ICL 2900 (in COBOL, too). We had plenty of outages, but the guy responsible for keeping the 2900 up and running was always quoting uptimes of well over 90%. Then we found out that, as far as he and his stats were concerned, 'up' basically meant the power switch was in the 'on' position.

    Ahhh, how we laughed...

  12. J.T

    Actually, hitting the 5 9's for "reliability" and "downtime" with any decent piece of kit is pretty easy, and when it comes to hardware failures the vast majority of vendors these days far exceed that.

    But, and it's a big but, you then get into five very common situations:

    1. You were one of the first to use those shiny new software features that had a bug that corrupted everything

    2. You are dumb enough to take that system running fine for a few years, turn it off, and move it across the floor to shove into a different corner

    3. The third-party company you subbed the upgrade to didn't do something right during the hardware upgrade

    4. You didn't replace that hardware item a few weeks ago when it first failed

    5. Sooner or later you get to a situation where, due to economy of scale, the guy on the phone from support has to tell the guy on the other end to replace something important, and <human> happens

    So you really need three things:

    1. The actual hardware reliability

    2. The average amount of time after a major software release before the final HOLY CRAP bugs get worked out

    3. An idea of how much your refusal to allow remote support is going to hurt you when you realise the customer support engineer licks rocks.

This topic is closed for new posts.