Alaska Air phones a friend to find out what caused massive October outage

Alaska Airlines has called in consultants to advise it on what went wrong during a late October IT meltdown that grounded flights and wreaked havoc for two days. According to the group, the company has increased its investments in IT by almost 80 percent since 2019, "investing in redundant datacenters and moving many guest- …

  1. xyz Silver badge

    inadvertent tenant configuration change

    That's got to be up there with involuntary conversion in terms of spoken suit shite

    1. Yet Another Anonymous coward Silver badge

      Re: inadvertent tenant configuration change

      From the industry that brought you "controlled flight into terrain"

      1. Anonymous Coward

        Re: inadvertent tenant configuration change

        Well, the ageists of Accunture sound just the sort of people for the gig.

    2. Throatwarbler Mangrove Silver badge
      Mushroom

      Re: inadvertent tenant configuration change

      Possibly followed by a "mid-air passenger exchange" and "aluminum rain."

  2. bill 27

    There's a lesson to be learned here.

    Perhaps Alaska will learn the dangers of flying on a Cloud.

    1. Gene Cash Silver badge

      Re: There's a lesson to be learned here.

      It looks like they already know them:

      "Alaska Airlines, however, was able to stand up its backup infrastructure, allowing customers to book and check in for flights."

      That's a hell of a lot better than most people were able to do during the outage.

  3. Nate Amsden Silver badge

    Probably too much complexity

    The first SaaS company I worked for, two decades ago, had a lot of reliability issues with its app stack, so many outages... A couple of years after I left they moved to a redundant data center setup, though due to latency requirements the facilities had to be pretty close to each other. I assume they were doing real-time replication, perhaps with Oracle, as they did run the largest single OLTP database in the world (probably around 60T by that point, largely due to bad app design). I remember messaging a former co-worker around then, just asking how things were going, and his response was something like "we're 9 hours into a hard downtime on our five-nines setup". I couldn't help but laugh... I don't think I ever found out (or even asked) what the root cause of the issue was. But it really goes to show, at least in that case, that having such tight linkage between active and backup systems can itself cause problems.

    In a completely different situation, a big company (Fortune 500, one I had no direct relation to) had deployed multiple storage systems in a high-availability configuration, I think four (large) systems in total. They apparently ran into a nasty bug at one point that took all four systems down at the same time (I only knew because I had insider info from the vendor). I'm not sure how long they were down for, whether any data was lost, or what the impact was; I think the bug was related to replication. I myself have worked through three separate primary storage system failures in my career, each taking a minimum of 3-5 days to fully recover from (with varying amounts of data loss). In every case the org did not want to invest in a secondary storage system, nor did they want to invest in one after the incidents occurred, at least not right away.

    My storage systems are oblivious to each other, as is my storage network. I try to keep the balance tilted towards far more simplicity (a trade-off against features and automation) rather than fancier integrated things. Case in point for my network: the core runs on a protocol (ESRP) that has been on the market for probably 25 years, providing subsecond layer 2 loop prevention and layer 3 high availability. No MLAG, no TRILL, no SDN, no dynamic routing protocols; I don't need or want the complexity. I'm sure more complexity is suitable for certain orgs, just not the ones I have worked for in the last 25 years. If I were building a new network today I'd use the same approach I have for the last 20 years, provided the requirements didn't shift dramatically (if they did, I'd consider alternatives, assuming those requirements were realistic).
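
    For illustration only, the master/standby election idea I'm relying on boils down to something like this toy Python sketch; the Hello fields, the dead_interval timer, and the tie-break rule are simplified stand-ins of my own, not the actual ESRP or VRRP wire protocol:

    # Toy sketch of master/standby election in the spirit of ESRP/VRRP.
    # Timers, field names, and tie-break rules are simplified illustrations.
    import time
    from dataclasses import dataclass

    @dataclass
    class Hello:
        node_id: str
        priority: int      # higher wins the election
        sent_at: float

    class FailoverNode:
        def __init__(self, node_id: str, priority: int, dead_interval: float = 0.9):
            self.node_id = node_id
            self.priority = priority
            self.dead_interval = dead_interval   # subsecond takeover window
            self.is_master = False
            self.last_peer_hello = None

        def receive_hello(self, hello: Hello) -> None:
            self.last_peer_hello = hello
            self.evaluate()

        def evaluate(self, now: float | None = None) -> None:
            now = time.monotonic() if now is None else now
            peer = self.last_peer_hello
            peer_alive = peer is not None and (now - peer.sent_at) < self.dead_interval
            if not peer_alive:
                # Peer has gone quiet longer than the dead interval: take over.
                self.is_master = True
            else:
                # Both alive: higher priority wins, node_id breaks the tie.
                self.is_master = (self.priority, self.node_id) > (peer.priority, peer.node_id)

    # Core switch A (priority 200) stays master while its hellos keep arriving;
    # B takes over once they stop for longer than the dead interval.
    a = FailoverNode("switch-a", priority=200)
    b = FailoverNode("switch-b", priority=100)
    a.receive_hello(Hello("switch-b", 100, time.monotonic()))
    assert a.is_master       # higher priority, peer alive
    b.receive_hello(Hello("switch-a", 200, time.monotonic()))
    assert not b.is_master   # lower priority defers while A's hellos arrive
    b.evaluate(now=time.monotonic() + 2.0)
    assert b.is_master       # A silent past the dead interval: B takes over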

  4. Brave Coward Bronze badge

    Are you flying Air Alaska today?

    "A world-class guess experience".

  5. Claptrap314 Silver badge
    Boffin

    Assume what you will

    "Having redundant datacenters would, we assume, have made a failover a relatively straightforward thing rather than having systems down for hours while staff scrambled." Not if you know anything about resilience architecture--or if you have kept up with "Who? Me?".

    Resilience is HARD.

    First off, if your system is designed PERFECTLY, but you've not tested failover & back, you don't have any idea what the implementation has actually achieved.

    Secondly, it is not at all trivial to properly design resilience. Different operations will have different requirements, especially when you start talking about the relative importance of competing priorities. And let's not even begin to speak about what happens when one component, perhaps done early, is built to satisfy one priority configuration, while another, perhaps built later, assumes a different one.

    "But! But--that should not result in a multi-hour outage." You are correct. But to quote Snoopy, "unfortunately, we're not playing 'Should've'".

    Of course, since I was trained in SRE at Google, the entire idea of claiming resilience with only two datacenters is...laughable.

    But old-school resilience was very happy with a primary & a disaster recovery facility. As in--if the primary goes down, we restore the latest backups to the DR facility & start it. Not the way I would do things for an airline, but after my time in health care, I cannot rule it out.
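
    As a rough sketch of that old-school flow (the BACKUP_DIR path, DR_HOST name, and the restore/start/smoke-test placeholders are all hypothetical, not anyone's actual runbook):

    # Minimal sketch of "restore the latest backup to the DR site and start it".
    import logging
    from pathlib import Path

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    log = logging.getLogger("dr-failover")

    BACKUP_DIR = Path("/backups/primary")   # hypothetical backup drop location
    DR_HOST = "dr-site.example.internal"    # hypothetical DR facility host

    def latest_backup(backup_dir: Path) -> Path:
        """Pick the most recent backup archive by modification time."""
        candidates = sorted(backup_dir.glob("*.tar.gz"), key=lambda p: p.stat().st_mtime)
        if not candidates:
            raise RuntimeError(f"no backups found in {backup_dir}")
        return candidates[-1]

    def restore_to_dr(archive: Path, host: str) -> None:
        """Placeholder: ship the archive to the DR host and unpack it there."""
        log.info("restoring %s to %s", archive.name, host)

    def start_services(host: str) -> None:
        """Placeholder: bring the application stack up at the DR site."""
        log.info("starting services on %s", host)

    def smoke_test(host: str) -> bool:
        """Placeholder: check that booking and check-in endpoints answer."""
        log.info("running smoke tests against %s", host)
        return True

    def fail_over_to_dr() -> None:
        archive = latest_backup(BACKUP_DIR)
        restore_to_dr(archive, DR_HOST)
        start_services(DR_HOST)
        if not smoke_test(DR_HOST):
            raise RuntimeError("DR site failed smoke tests; do not cut traffic over")
        log.info("DR site is live; repoint traffic at it")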

    1. Anonymous Coward

      Re: Assume what you will

      I'm a slacker nowadays. I only replicate my personal data across 3 different computers 5 times a day. When I worked for a living, if the system went offline and we failed to bring up the alternate (mirrored) one within 3 minutes, all Hell broke loose. Yes, the process was tested often, just to keep everybody familiar with it.

    2. Anonymous Coward

      Re: Assume what you will

      "First, off, if you system is designed PERFECTLY, but you've not tested failover & back, you don't have any idea what the implementation has actually achieved."

      You still have no idea even if you've done that. Everyone involved in the failover procedure needs to be familiar with it and fully understand their roles and responsibilities. Which means doing the failover/switchover as part of normal operations. And doing that often. And reviewing the outcome to take account of lessons learned: for instance, what do you do when $supplier doesn't answer the phone at 3am, or sends a clueless PFY (who isn't on the access list) to the datacentre?
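
      Roughly, the drill-and-review habit looks like the sketch below; the DrillRecord fields and the no-op switchover/switchback callables are assumptions for illustration, to be swapped for whatever the real runbook does:

      # Sketch of treating switchover as a routine, recorded drill.
      import json
      import time
      from dataclasses import dataclass, field, asdict

      @dataclass
      class DrillRecord:
          started_at: float
          participants: list[str]
          succeeded: bool = False
          duration_s: float = 0.0
          lessons: list[str] = field(default_factory=list)

      def run_switchover_drill(participants, switchover, switchback) -> DrillRecord:
          """Run a planned switchover and back, timing it and capturing lessons."""
          record = DrillRecord(started_at=time.time(), participants=participants)
          t0 = time.monotonic()
          try:
              switchover()          # cut over to the secondary
              switchback()          # and come back, so both directions stay familiar
              record.succeeded = True
          except Exception as exc:  # anything that broke is itself a lesson
              record.lessons.append(f"drill aborted: {exc}")
          record.duration_s = time.monotonic() - t0
          return record

      # The callables would wrap the real runbook steps; no-ops here so the
      # record-keeping logic stays visible.
      rec = run_switchover_drill(
          participants=["ops on-call", "dba on-call", "$supplier liaison"],
          switchover=lambda: None,
          switchback=lambda: None,
      )
      rec.lessons.append("$supplier did not answer at 3am; add a secondary contact")
      print(json.dumps(asdict(rec), indent=2))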

  6. Anonymous Coward

    It would probably be cheaper to fix the root cause.

    Than to have Accenture explain it away.

    1. Anonymous Coward

      Re: It would probably be cheaper to fix the root cause.

      While that is certainly true, it misses the point. Accenture will have been hired to find scapegoats and deflect blame from those responsible for the fuckup.
