Secret IBM script could have prevented 11-hour US tax day outage

The April 2018 US tax day outage was due to a faulty IBM disk array and could have been avoided twice – first with a more up-to-date microcode bundle, and second with a secret IBM script. Online tax filing was held up for 11 hours on the last filing day of the 2018 tax year, April 17, and the IRS had to extend the filing …

  1. Anonymous Coward

    Dear oh dear, do IBM ever learn?

    This just sounds so familiar.

    It's been about 7 years and 2 employers since I last had anything to do with a DS8000 series (in Open Systems, not Mainframe), but even then there were Microcode bugs a-plenty. One resulted in 2 days of downtime for services that generated a lot of revenue, due to a bug which incorrectly failed several disks in a single RAID Group. I think it was also a known bug, but we weren't informed early enough. And they also had a script to handle sparing differently, which we received afterwards.....

    After this outage and subsequent roasting, IBM were a bit more pro-active in reporting potential problems.

    Seems the lesson was not learned, or failed to make it across the pond.

    1. Anonymous Coward

      Re: Dear oh dear, do IBM ever learn?

      And that was in the days when they only cut most of the old^H^H^Hexpensive^H^H^Hknowledgeable staff rather than ALL of them...

  2. amanfromMars 1 Silver badge

    DEFCON #1

    Staff alerted April 17 you say. Because of a Glitch and a AI Surge ..... 180417 ..... for Virtual Purging of Olde Operating Systems for Elite Executive Remote Alien Command of Extra Terrestrial Controls.

    For Almighty Heavenly Powers to Wield in Service of Yields/Bounty/Mutually Satisfactory Satisfying Benefits. Such Always Offers Everything in Hot Pursuit of Quenching Insatiable Desires.

    And Quite Enough to Tempt Any Saint Antony .... so Beware and Take Care. :-)

    And aint that Gospel?

    Control that Live Operational Virtual Environment and You Be in Alien Territory.

    Agree or Disagree? With there being only one true answer unless Enabled to Present an Alternate Destination/Program Launch Point in further creative commentary hereafter.

    What U Got? Anything Useful for All? Let's hear about it for IT to BroadBandCast it to Everywhere and Everything.

    Real Spooky Stealthy IntelAIgent Stuff for Future Builders and COSMIC Masons Building Heavenly Mansions for Virgin Plots of New Land.

    And yes, that is worth parsing correctly and exactly. The more perfect the understanding the significantly greater the Immaculately Satisfying Prize of Justifiable Reward.

    Was there ever a prize so lauded and worthy of constant winning with the competition mired and distracted into a Hopeless Hopeful Opposition ..... which is a Neutered AI Force?

    And that's quite enough of all of that for now. More anon.

    1. Anonymous Coward

      Re: DEFCON #1

      Woah... he/it's been blogging for a decade and I'm just finding that now?

      1. tony trolle

        Re: DEFCON #1

        not as good as the earlier stuff

    2. Mark Exclamation

      Re: DEFCON #1

      Please! I REALLY DO want a random-word generator like yours!

      1. amanfromMars 1 Silver badge

        Re: DEFCON #1 and to Beyond TS/MkUltraSCI Classifications

        Please! I REALLY DO want a random-word generator like yours! .... Mark Exclamation

        Hmmm? What on Earth makes you think they are random words, ME? Surely that would be stupid in extremis?

        It is though a common enough mistake made which extraordinarily renders a most convenient penetrative stealth to Future Ongoing Operational Developments ..... in AI PipeLinered Pathfinder Events which Appear out of the Ether in a Series of Flash Cash Market Crashes Honing Dissent and Targeting Opposing and /or Competing Systems of SCADA Control .... Elite Exclusive Remote Executive Administration via Mass and Private and Pirate Multimedia Channel Platforms.

        What machines are provided you with your views of elsewhere with yesterday's programming still festering and causing shenanigans today? And who/what chose them for Main Stream Media Presentation and Augmented Virtual Realisation? And why are they so dire and bland/austere and unexciting/poor and conflicted?

        Methinks that is most definitely and definitively a Top Down Not Bottom Up Problem which Current Intelligence Services/Servers/Servants/Suppliers Provide. How Extremely Oxymoronic and Catastrophically Disruptive and Destructive.

  3. Notas Badoff

    Tweak this

    "... and make tweaks to its contract."

    Maximum Unplanned Downtime (Per Year):

    Requirement: Less than 26.5 minutes (aka 99.995%)

    Performance: Approximately 11 hours (omg 99.874%)

    I read this as the contract gets amended to state that 24.9 years of free support is added to contract years, starting next month.

    (If you're going to lop digits off the end of our guaranteed percents, we'll lop digits off the front of your revenues)
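
As a rough check of the arithmetic behind this comment, here is a minimal sketch that converts annual downtime into an availability percentage and back. It assumes a 365-day year, and the function names are made up for illustration. On that assumption, a 26.5-minute yearly budget works out to about 99.995%, an 11-hour outage to about 99.874%, and the outage uses up roughly 24.9 years of budget. A true "five nines" target would allow only about 5.26 minutes a year.

```python
# Assumption: a 365-day year (525,600 minutes); leap years are ignored.
MINUTES_PER_YEAR = 365 * 24 * 60


def availability_pct(downtime_minutes: float) -> float:
    """Availability (percent) for a given amount of downtime per year."""
    return 100.0 * (1.0 - downtime_minutes / MINUTES_PER_YEAR)


def downtime_budget_minutes(avail_pct: float) -> float:
    """Allowed downtime per year (minutes) for a given availability percent."""
    return MINUTES_PER_YEAR * (1.0 - avail_pct / 100.0)


budget = 26.5      # contract figure quoted in the comment, minutes per year
outage = 11 * 60   # the 11-hour outage, in minutes

print(round(availability_pct(budget), 3))         # 99.995
print(round(availability_pct(outage), 3))         # 99.874
print(round(outage / budget, 1))                  # 24.9 years of budget used
print(round(downtime_budget_minutes(99.999), 2))  # 5.26 minutes for five nines
```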

  4. J. Cook Silver badge

    Ah, my tax dollars hard at 'work'. *sighs*

    1. phuzz Silver badge

      Well, they would have been if they'd been working, but they weren't, so no taxes were being taxed, so your tax dollars weren't at work.

      See?

  5. Mark 110

    Again I am pleased . . .

    That they are sharing. I doubt IBM would have chosen to share however.

    Anyway, lessons to learn. Nothing new for me unfortunately. Other than to trust IBM a little bit less (do I have somewhere else to go?) I guess.

  6. EveryTime

    This is a good example of how "5 nines of reliability" and "no single point of failure" claims are BS 99.999% of the time. It's an easy claim to put into proposals, and to use to beat competitors over the head with. It's rare that anyone ever goes back and checks whether it's true for the system as-deployed.

    1. Paul Crawford Silver badge

      Very much so.

      Most calculations for availability are based on the assumption of independent errors. Things like bugs and manufacturing flaws, along with external "stress events" like lightning or A/C failure, are never EVER included as a realistic model.
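
To put illustrative numbers on the independence point, here is a small sketch using a simplified beta-factor style model for a redundant pair. The per-unit unavailability `q` and the common-cause fraction `beta` are assumed values chosen for the example, not figures from the article or from any vendor.

```python
# Assumption: a 365-day year (525,600 minutes).
MINUTES_PER_YEAR = 365 * 24 * 60


def pair_unavailability(q: float, beta: float = 0.0) -> float:
    """Approximate unavailability of a redundant pair of identical units.

    q    : unavailability of one unit (0.001 means 99.9% available)
    beta : assumed fraction of failures that hit both units at once
           (shared firmware bug, lightning, A/C failure, ...)
    Simplified beta-factor model: independent failures must coincide,
    common-cause failures take out the pair on their own.
    """
    q_common = beta * q
    q_independent = (1.0 - beta) * q
    return q_independent ** 2 + q_common


def downtime_minutes_per_year(unavailability: float) -> float:
    return unavailability * MINUTES_PER_YEAR


q = 0.001  # hypothetical: each unit is 99.9% available
for beta in (0.0, 0.01, 0.1):
    u = pair_unavailability(q, beta)
    print(f"beta={beta:<4}  downtime ~ {downtime_minutes_per_year(u):.2f} min/year")
```

With these made-up inputs the fully independent case comes out at about half a minute of downtime a year, while letting 10% of failures be common-cause pushes it to roughly 53 minutes, so the paper "five nines" disappears.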

    2. Mayday

      5 9s

      It's simple really. A risk assessment - on the provider that is.

      "We can sell a contract with $x profit after overheads, SLA breach is $z, if it breaks and we can't fix it within SLA and z < y% of x then we're in front".

      It's got nothing to do with uptime or reliability at all.

      Nevertheless, it's the tech guy on the ground who gets shat on/KPI/performance managed if the SLA is breached, not the BDM who sold the unserviceable contract.

    3. Claptrap314 Silver badge

      99.999 can be real

      LOTS of services at Google were there when I was an SRE.

      But there is a difference between marketing and engineering. You need competent engineers to examine the provided system before signing off.

    4. ElReg!comments!Pierre

      Since we went geoplex we're getting these 5 9s. So, not impossible.

  7. Anonymous Coward

    How in the WORLD

    Could the IRS tax filing system not be seen as qualifying for failover protection? Think of all the millions of systems across the world that are afforded this protection that don't require it - only getting it because they had "spend it or lose it" budget money to burn or because the PHB in charge of it wants it to seem more important to raise his profile in the corporation.

  8. Simon Brady

    Damned if you do, damned if you don't

    Up to a point I can sympathise with the people making the call on the microcode upgrades. A firmware upgrade on any enterprise storage kit is a Big Deal with huge potential for problems, and nobody wants to be the customer who discovers the lurking data corruption bug in the latest release. It doesn't help that all the release notes I've seen are written from the firmware developer's perspective, so the customer is caught between vendor support saying "of course we recommend you upgrade to the latest release" and a list of micro-detailed fixes that give no clear risk guidance to the end user.

    Maybe what's needed is something akin to CVSS scoring for security updates: I don't care which low-level firmware component had obscure bug XYZ, I want to know (1) how likely is it to affect me, (2) how bad will the impact be if it's triggered, and (3) how risky is implementing the fix. Otherwise you're left making the best call you can, and inevitably some of those calls will be wrong.

  9. Cuddles Silver badge

    Interesting requirements

    The contract apparently allows them 4 hours to fix problems, but also requires said problems to last no longer than 26.5 minutes. Which is less than the time allowed for them to even notice a problem exists at all.

    1. ElReg!comments!Pierre

      Re: Interesting requirements

      In a well-managed production environment, an incident doesn't impact availability - or shouldn't, at least. So they have 4 hrs to fix individual incidents, and should things go very wrong they are allowed 25 min downtime per year. The 2 numbers have no direct connection.

      > Which is less than the time allowed for them to even notice a problem exists at all.

      Where I work we are held to a 15-minute deadline on incidents, which means the client is guaranteed that within 15 minutes of any potentially serious incident they will have been informed and technical remedies will have been initiated. I hear that's not uncommon.
