EMC mess sends New Zealand University TITSUP for two days

Victoria University in Wellington, New Zealand, experienced two days of Total Inability To Support Usual Performance (TITSUP) after failures in its EMC equipment. An email sent by senior IT staff at the University reports that “the storage systems that manages all our servers have experienced multiple failures over the past 2 …

  1. This post has been deleted by its author

  2. Anonymous Coward

    WTF?

    Didn't a certain MichaEl eDell just acquire EMC? WTF is going on? I'm starting to feel sorry for him!

  3. This post has been deleted by its author

  4. wyatt

    Is it possible to build this type of redundancy/fail over when systems get this large? If so, is the time taken to switch between the two quick enough to minimise downtime to an acceptable level at a cost which is affordable?

    This isn't meant to be knocking DR plans/redundancy/fault tolerance; it's more a question of whether, as systems get bigger and storage more complicated, it is affordable to split resources over multiple vendors to prevent incidents like this. It'll be interesting to see what the RCA is for this fault - whether it was an EMC failure exacerbated by other faults or just a single failure somewhere which took out a lot more.

    1. thondwe

      Money vs Risk

      Universities (and most corporates) are not always awash with cash, so they have to accept a level of risk. If your DR/BC plan is essentially restore from backup, then a single array failure can take loads of services offline for the time it takes to restore from backups, assuming you have/can make capacity available to restore them to...

      NOT everyone wants to spend lots of dosh on dual live systems....

      If you haven't got live capacity to restore to (e.g. by killing Test/Dev) then you may be waiting for EMC to ship a replacement...

      Paul

      1. Cloud 9

        Re: Money vs Risk

        Universities = poor hence EMC failure = acceptable?

        Hmmmm - not really.

        1. Captain DaFt

          Re: Money vs Risk

          "Universities = poor hence EMC failure = acceptable?"

          Acceptable, no. A fact of life? Yes.

    2. Roo

      "Is it possible to build this type of redundancy/fail over when systems get this large?"

      Yes, this is a selling point of stuff like SANs. ;)

      The gotcha is that the extra gubbins to do the Fault Tolerance bit can add a hell of a lot of complexity and things to go wrong...
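
    For scale on the restore-from-backup point in the thread above, here is a minimal back-of-the-envelope sketch in Python. The data volume, link speed and efficiency factor are purely illustrative assumptions, not details of this outage.

      # Rough wall-clock estimate for a restore-from-backup DR plan.
      # All figures are assumptions for illustration only.
      def restore_hours(data_tb, restore_gbps, efficiency=0.6):
          """Hours to restore data_tb terabytes over a restore_gbps link,
          derated for backup-software overhead and verification."""
          data_bits = data_tb * 8e12                 # TB -> bits (decimal units)
          usable_bps = restore_gbps * 1e9 * efficiency
          return data_bits / usable_bps / 3600

      # Example: 200 TB of affected volumes over a 10 Gbit/s backup network.
      print(f"{restore_hours(200, 10):.0f} hours")   # ~74 hours, i.e. roughly three days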

  5. DavidCarter

    "At a guess, this looks like a major failure of a storage area network, likely including products beyond EMC's given that printing and internet access were also impacted"

    I guess whoever wrote this doesn't realise that print and Active Directory servers more than likely run from a SAN as well...

    1. Simon Sharwood, Reg APAC Editor (Written by Reg staff)

      Not a case of not realising - rather a case of considering that perhaps part of the network was down along with the SAN. But good point on the relevant servers being SAN-dependent.

  6. disgruntled yank

    Hmm.

    The first time EMC came around to sell us something, they were smirking about the Hitachi fiasco that had put much of the Norwegian banking system offline for a few days. Now I guess Hitachi has something to smirk about.

  7. Me19713
    FAIL

    Proverbs 16

    Pride goes before destruction, And a haughty spirit before stumbling.

  8. Anonymous Coward

    Annoying EMC Video Ads

    Even funnier: every page on The Register at the moment appears to have a video ad showing EMC's new solutions halfway through the articles. Strangely, not on this article though....

  9. FasterFaster

    Not much information, plenty of assumptions

    It is entirely possible it wasn't a hardware failure or anything to do with the SAN at all. Network misconfiguration, a fibre channel switch change, or even something at the virtualisation layer could cause issues for a wide range of systems.

    Maybe they do have a replicated storage setup and something changed which flooded the network, effectively an internal DoS. A combination of all the above can lead to that situation.

    Everyone is quick to blame the vendor or quick to suggest that the University didn't spend enough, but with big complex infrastructure you can quickly see a small configuration issue snowball into something huge.
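
    A minimal sketch of the "internal DoS" scenario described above, in Python; the link size, traffic levels and resync rate are hypothetical assumptions, used only to show how a replication change can starve everything sharing the path.

      # Headroom left on a shared link once production and resync traffic compete.
      # Figures are illustrative assumptions, not measurements from this incident.
      def link_headroom_pct(link_gbps, production_gbps, resync_gbps):
          """Percentage of link capacity left after production and resync traffic."""
          used = production_gbps + resync_gbps
          return max(0.0, (link_gbps - used) / link_gbps * 100)

      # Example: a 10 Gbit/s shared link, 4 Gbit/s of normal traffic, and a
      # replica resync trying to push 8 Gbit/s after a configuration change.
      print(f"{link_headroom_pct(10, 4, 8):.0f}% headroom")   # 0% - everything on the link suffers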

  10. Iknowsomethingaboutstorage
    IT Angle

    I would like some information

    "The storage systems"

    All storage systems?

    "Sign-on to networks was slow, Internet connections went down and even printing was problematic."

    Because of a storage failure?

    "We've also contacted the executive who sent the email below. He appears to have flicked it to the University's communications team and hasn't offered any of the detail we requested about the nature of the outage or the EMC products involved."

    Then... with what information are we working?

  11. Afrojazz

    Oops, I did it again

    http://www.theregister.co.uk/2014/07/04/dimension_data_in_cloud_outage/

    "Dimension Data's Australian cloud has been down for over 24 hours after EMC kit failed"

  12. cloudguy

    Wait for the post mortem...

    Well, the lack of facts at hand makes it anyone's guess as to exactly what happened. That said, storage networks have a lot of moving parts and a failure in the networking part could easily disable access to the storage part. If the outage was due to a planned upgrade or maintenance, then there should have been a roll-back procedure in order to recover. While you cannot rule out human error, you expect that the people involved in operating and managing the storage network are adequately trained and experienced. The vendors involved along with the university will likely issue a "post mortem" when the facts surrounding the outage are understood. Then the guilty can be charged.

  13. ms_register

    What was the failure ?

    Was it capacity? What was the university IT operations team doing before this happened? Did they have any kind of monitoring in place before services degraded? Do they have some kind of failover or spare capacity in their architecture? Or was it that some beancounter could not be convinced fast enough to add/upgrade IT infra - so let's blame the vendor for the failure?

    Or did the systems just inexplicably stop working? Just collapsed like a pile of ...

    As you say, "The more EMC tried to fix the worse it got"

    I am sure you would love to blame EMC - oh, the big corporate behemoth, who do they think they are - but let's get some more facts, please, before you stick a knife in.
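
    On the monitoring question above, a minimal sketch of a latency-threshold check in Python; the thresholds, sample values and alert levels are hypothetical assumptions, not anything known about the university's setup.

      # Toy alert check on average storage I/O latency.
      # Thresholds and samples are illustrative assumptions only.
      WARN_MS, CRIT_MS = 20, 50

      def check_latency(samples_ms):
          """Return an alert level based on average I/O latency in milliseconds."""
          avg = sum(samples_ms) / len(samples_ms)
          if avg >= CRIT_MS:
              return f"CRITICAL: avg latency {avg:.0f} ms"
          if avg >= WARN_MS:
              return f"WARNING: avg latency {avg:.0f} ms"
          return f"OK: avg latency {avg:.0f} ms"

      print(check_latency([18, 35, 60, 72]))   # WARNING: avg latency 46 ms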

  14. Jonathon Desmond
    Alert

    I don't know if it was the SAN or not the SAN....

    .... all I know is that, for some reason, I now want pancakes!
