Airline 'in talks' with Kyndryl after failed network card grounds flights

Aer Lingus says it is in talks with its IT services supplier, former IBM arm Kyndryl, after the disastrous combo of a sliced fiber optic cable and a faulty network card on the backup line caused an IT systems outage that forced the airline to cancel more than 50 flights. The outage on September 10 disrupted plans for tens of …

  1. John Doe 12

    Cannot cheat fate

    Aer Lingus are incompetent idiots, and if it wasn't this then something else would soon have come along. Yes, I know people will say this was a technical issue beyond their control, but it's up to the airline to have a diverse disaster recovery plan that is checked regularly for function.

    Ryanair are actually running rings around the state carrier which either says Michael O'Leary is doing a great job OR that Aer Lingus are THAT bad!!

    During the aftermath of the pandemic Aer Lingus behaved like a bunch of thieves with monies for flights that didn't fly and spent a lot of energy fobbing people off with vouchers that half the time were used to purchase other flights that didn't fly either!!

    1. wolfetone Silver badge

      Re: Cannot cheat fate

      It's easy to hate O'Leary, but what he's done (and continues to do) with Ryanair does show that Aer Lingus really are that bad. They are managed in a way that they essentially think "We're the flag carrier, we don't have to change to suit you. You have to want to fly with us". A bit like British Airways can be sometimes.

      I just hope for Aer Lingus' IT contractor that their insurance policy is up to date and paid for.

      1. Korev Silver badge
        Big Brother

        Re: Cannot cheat fate

        It's easy to hate O'Leary, but what he's done (and continues to do) with Ryanair does show that Aer Lingus really are that bad. They are managed in a way that they essentially think "We're the flag carrier, we don't have to change to suit you. You have to want to fly with us". A bit like British Airways can be sometimes.

        Anyone would think they're the same company...

        1. wolfetone Silver badge

          Re: Cannot cheat fate

          You would think that, but the flying experience I've had on BA is nothing like Aer Lingus. BA was so much better; AL felt a bit shonky.

      2. John Doe 12

        Re: Cannot cheat fate

        I would LOVE to see the I.T. contractor force Aer Lingus to take credit vouchers rather than hard cash as compensation. The irony would be delicious ha ha

        1. Anonymous Coward
          Anonymous Coward

          Re: Cannot cheat fate

          Most contracts I've seen stipulate that any compensation will be in terms of service credits, so you're not far off the mark there.

    2. chivo243 Silver badge
      Coat

      Re: Cannot cheat fate

      I've had good and really bad experiences* with Aer Lingus, but I haven't flown with them in close to 6 years and may never again just because my travel patterns have changed...

      When it's bad, it's bad*... like public seating in Dublin airport

    3. Anonymous Coward
      Anonymous Coward

      Re: Cannot cheat fate

      Well, they had issues on Friday the 9th too.

      My AL flight from Dublin to Newark was cancelled and we all got shoved on a United flight. Lots of issues with tickets and staff complaining about the systems not working.

      Rail/tunnel/digging/Birmingham - HS2 perhaps?

  2. Korev Silver badge
    Terminator

    The fault in the primary line was a severed cable and the second failure, in the backup line leading to the other DC, was due to a failed network card.

    and

    The Aer Lingus exec pointed out that their data had been mirrored to two separate sites, a datacenter in Manchester, and the second one in Birmingham by their IT services provider, and that the lines had been replicated into both

    Why didn't the primary DC use the backup line?

    1. Anonymous Coward
      Anonymous Coward

      Probably because some manager somewhere who wanted brownie points for saving money decided "a six month check is enough, right?".

      1. Dronius

        Our latest outages triggered no alarms on the fibre circuits from our providers.

        The reason was that both paths (main and backup on each discrete provider pathway, four in all) had a bit rate present, so the automated monitoring never raised an alarm and the automatic fail-over never did 'fail over'.

        It was only when a content analyser was pointed at a suspect stream that we identified the fault.

        Even then there was doubt due to a fight between the auto fail-over system at one hub and each subsequent one in the chain.

        Each one failed over to the alternative so the upstream fault was not revealed.

        So many compounding fibre providers, each with their own system cobbling together pathways, managed to mask the source fault from the automated fail-over systems. The thing is, we commission from a provider, but until the thing goes pear-shaped the details of the sub-providers are unknown for 'commercial reasons'.
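
        For what it's worth, here's a minimal sketch (not our setup; every name, address and port below is a made-up placeholder) of the kind of active end-to-end probe that catches this: force traffic down each path by binding to a per-path source address and time a round trip to a small echo/health endpoint at the far end, rather than trusting "bit rate present".

        ```python
        #!/usr/bin/env python3
        # Minimal sketch: probe each provider path end-to-end instead of trusting
        # "bit rate present". All addresses, ports and path names are hypothetical.
        import socket
        import time

        # Each path is identified by the local source address that routes over it
        # and a far-end echo/health endpoint reachable via that path.
        PATHS = {
            "primary": {"src": "10.0.1.2", "dst": ("203.0.113.10", 7)},
            "backup":  {"src": "10.0.2.2", "dst": ("203.0.113.11", 7)},
        }

        def probe(src_addr, dst, timeout=2.0):
            """Return round-trip time in milliseconds, or None if the path is unusable."""
            try:
                with socket.create_connection(dst, timeout=timeout,
                                              source_address=(src_addr, 0)) as s:
                    start = time.monotonic()
                    s.sendall(b"ping")
                    s.settimeout(timeout)
                    if not s.recv(4):            # far end closed without echoing back
                        return None
                    return (time.monotonic() - start) * 1000.0
            except OSError:
                return None

        if __name__ == "__main__":
            for name, path in PATHS.items():
                rtt = probe(path["src"], path["dst"])
                print(f"{name}: {rtt:.1f} ms" if rtt is not None else f"{name}: FAILED")
        ```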

    2. deadlockvictim

      Korev» Why didn't the primary DC use the backup line?

      Good question.

      I will say that disaster recovery is not easy. Or rather, planning it out is a non-trivial task, and testing it is hard and very expensive. These systems need to be tested often and regularly.

      Why didn't they check that all of the moving parts were operational?

      Everything that can be made redundant should be made redundant. Why wasn't the network card redundant?

      This failure for Aer Lingus demonstrates why redundancy is so important.

  3. Anonymous Coward
    Anonymous Coward

    Apologia for the world's construction workers

    On the other hand, a big problem that BT used to face is that doing the due diligence for cable location, and then digging carefully, is expensive. Many construction companies prefer to play the odds, and just dig. If they should have the misfortune to hit something they just call it in, leave it for the (already overworked) cable supplier to fix, and then pay the bill when they get it. It can work out cheaper than being careful, in the long run.

    1. Marty McFly Silver badge

      Re: Apologia for the world's construction workers

      My neighbor digs holes for telephone pole replacement. He no longer uses a shovel. He has a giant vacuum that creates the excavation, often finding undocumented infrastructure.

      1. Anonymous Coward
        Anonymous Coward

        Re: Apologia for the world's construction workers

        Hole vacuums are a great invention.

        When they put in the fiber optic line to our house they used one to find the old but still in use copper cable. Then they knew how to have the directional drilling rig go high (or low) and miss the cable. Worked great.

        1. Anonymous Coward
          Anonymous Coward

          Re: Apologia for the world's construction workers

          Wish I'd known about those before I dug 50 fence post holes by hand because I didn't know what was underneath!

      2. Sgt_Oddball
        Holmes

        Re: Apologia for the world's construction workers

        I wondered what those things were supposed to be for. I thought they might be for reducing the amount of noise/dust produced, but sucking around infrastructure makes sense too.

        They're busy using them to make a continued mess of Leeds city centre just down the road from my office (it doesn't help that they're making large chunks of the already complex one-way city loop bus- and taxi-only during specific arcane periods, now different on each road).

    2. Steve Graham

      Re: Apologia for the world's construction workers

      When I joined BT in 1981 (actually, it was still GPO "trading as" British Telecommunications), young executives were sent for "Area Training" for two four-week periods, to climb poles, dig holes and crimp wires, to see how the Real Work was done.

      One of my happy memories is of a hot Summer's day on a former airfield in East Anglia, trying to find a buried copper cable with the assistance only of a 1920s Air Ministry map.

      1. R Soul Silver badge

        Re: Apologia for the world's construction workers

        There's probably someone from Openretch with that 1920s map still hoping to find that very same cable.

    3. heyrick Silver badge

      Re: Apologia for the world's construction workers

      Where I used to live, they wanted to turn some scrub land into council housing. Turns out that back in the sixties or so it was used for caravans or mobile homes, long since forgotten, and huge amounts of infrastructure ran through there, damn near to the other end of town, following a winding road (for the caravans) that is no longer really visible.

      Of course, absolutely none of this was on any map or plan.

      They performed a test dig and came across gas pipes with buried electrical wires running alongside. All with no markings whatsoever until you reach the things.

      The hole was filled in and the project abandoned as "too expensive to dig the entire place by hand to see what's down there".

      1. Korev Silver badge

        Re: Apologia for the world's construction workers

        I wonder if they were billed for their gas and/or electricity?

  4. Jou (Mxyzptlk) Silver badge

    Cold standby?

    Not hot standby, not in "send the data twice over both", nor in a load-balancing configuration? The faulty card would have shown up in those configurations before the cable got ripped. If monitored, of course.

    I smell beancounters...

    1. Marty McFly Silver badge

      Re: Cold standby?

      Cold standby is just fine... as long as disaster recovery exercises are performed on a regular basis.

      1. James Anderson

        Re: Cold standby?

        Not really. It’s a psychological problem. As soon as a system is labelled backup, standby, or “the B system”, human nature will downgrade it and the system will be neglected. A load-balanced cluster is best; failing that, a monthly switch between sites will keep people alert, and also ease upgrades.

        I am presuming this is an IBM mainframe site. They mastered high availability 30 years ago, so no excuses.

    2. Anonymous Coward
      Anonymous Coward

      Re: Cold standby?

      "Not Hot, not in "send data double over both", or in load-balancing configuration? The faulty card would have shown up in those configurations before the cable got ripped. If monitored, of course.

      I smell beancounters..."

      I was thinking exactly the same! Every single time I've seen or heard of such a set-up failing, it was because two "redundant" links had in fact been run through a single tray, which was damaged.

      What's described in the article has about the same probability of happening as two unrelated asteroids landing in the same square metre of your garden on the same day...

      For sure, the network card had been faulty for a long time...

  5. john.w

    Virtue (or a system) untested is not virtue (or a system) at all.

    "So it should have been more resilient than it proved to be on the day."

  6. MiguelC Silver badge
    WTF?

    24/7, seven days a week

    so... 24/49?

  7. Anonymous Coward
    Anonymous Coward

    Not sure I buy Lynne Embleton's explanation, or more likely the line she's been fed by Kyndryl (what sort of dumb crazy name is that??). If this is really true:

    ...When asked how often the backup line was tested, Embleton responded: "What is common practice is to allow traffic over both, really to avoid a situation where one has been lying dormant and then, at the point it's needed, it fails, so we flip between the two to enable us to ensure both are working, which is why... to have a main failure and then a backup failure really shouldn't have happened...

    which I guess points to some sort of load-balanced configuration, in which case standard network monitoring should have picked up problems with the network card on the secondary well before the primary failure. Or maybe no one was looking at the error logs... Surely not!!
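
    If anyone was looking, this sort of thing is trivial to automate. A rough sketch for a Linux host with systemd-journald, where the match patterns are driver-dependent examples rather than any definitive list:

    ```python
    #!/usr/bin/env python3
    # Rough sketch: scan the last day of kernel messages for signs of NIC trouble.
    # Assumes a Linux host with systemd-journald; the patterns are driver-dependent
    # examples only.
    import re
    import subprocess

    SUSPECT = re.compile(r"link is down|link down|tx.*timeout|carrier lost",
                         re.IGNORECASE)

    def nic_warnings(since="-1d"):
        """Return kernel log lines since the given relative time that look NIC-related."""
        out = subprocess.run(
            ["journalctl", "-k", "--since", since, "--no-pager"],
            capture_output=True, text=True, check=True,
        ).stdout
        return [line for line in out.splitlines() if SUSPECT.search(line)]

    if __name__ == "__main__":
        hits = nic_warnings()
        if hits:
            print(f"{len(hits)} suspicious kernel messages; most recent:")
            print(hits[-1])
        else:
            print("No obvious NIC complaints in the last day.")
    ```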

    1. Anonymous Coward
      Anonymous Coward

      Kyndryl: https://www.theregister.com/2021/11/04/kyndryl_ibm_spinoff/. It is a crazy name.

      1. Ian Johnston Silver badge

        It sounds like a cough syrup.

        1. Ken Shabby
          Alert

          It is what a dentist calls his handpiece when he cannot find it.

          where is my f..?

  8. Anonymous Coward
    Anonymous Coward

    This is why I like load balanced over standbys

    With a load-balanced setup you have, in principle, an always-on functionality test. A hot or cold standby isn't one until you have verified it actually works, and you have to do that frequently.

    1. Anonymous Coward Silver badge
      Boffin

      Re: This is why I like load balanced over standbys

      But you also need to be sure that:

      1) Each line can independently handle the full load.

      2) The systems either side correctly handle one route missing (with 2 DCs you need to ensure they can stay in sync/resync cleanly).

      1. Anonymous Coward
        Anonymous Coward

        Re: This is why I like load balanced over standbys

        with 2 DCs you need to ensure they can stay in sync/resync cleanly

        Yup. We followed IBM's recommendation and used DCs that were less than 25 km apart and had dark fiber between them. That gives you latency low enough to enable RDMA, to the point that a whole VM can be migrated from one DC to another while live, which we occasionally do just for fun (we call it 'training', of course).
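
        Back of the envelope, assuming roughly 5 µs per km for light in fibre (about right for a refractive index of ~1.47), the 25 km limit keeps propagation delay well inside what synchronous replication tolerates:

        ```python
        # Why ~25 km matters: light in fibre covers roughly 5 microseconds per km,
        # so propagation alone adds about a quarter of a millisecond per round trip.
        distance_km = 25
        one_way_us = distance_km * 5      # ~125 microseconds one way
        round_trip_us = 2 * one_way_us    # ~250 microseconds round trip
        print(round_trip_us)              # 250 -- comfortably low for synchronous replication
        ```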

  9. Flicker

    It was Moriarty!

    ... the airliner's [sic] corporate affairs officer Donal Moriarty ...

    That's the problem - they have a criminal mastermind running corporate affairs!! Maybe they should put Holmes on the case...

    1. Sgt_Oddball

      Re: It was Moriarty!

      Does he have a friend who goes by the name Grytpype-Thynne?

      It does rather sound like the sort of silliness the Goon Show used to mock.

    2. Korev Silver badge
      Joke

      Re: It was Moriarty!

      That's the problem - they have a criminal mastermind running corporate affairs!! Maybe they should put Holmes on the case...

      As it's IBM Kyndryl, wouldn't they be using Watson?

  10. Anonymous Coward
    Anonymous Coward

    Bee careful

    Local contractor was drilling holes for street light foundations. One of those augers that's a couple feet wide. Started a new hole next to an old tree stump. Auger gets embedded and hits a root, shifting and vibrating the tree stump. Killer bees did not like this happening to their hideaway. They swarm the auger operator who flees. Between the roots and inadvertent jarring of controls, auger continues drilling in a somewhat different direction.

    Comms go down for multiple neighborhoods for many hours. More hours, because the killer bees had to be rousted first, before the comms people could even begin.

    Nobody blamed the operator. He started in the right place. It was the NIMBYs that caused the outage.

    1. Korev Silver badge
      Coat

      Re: Bee careful

      Nobody blamed the operator. He started in the right place. It was the NIMBYs that caused the outage.

      You mean NIMBees?

  11. Anonymous Coward
    Anonymous Coward

    A normal weekend

    Sorry, it just has all the hallmarks of the usual 'unlikely' combination of factors.

    Weekend / weekend workers / weekend support / weekend manager / inability to correctly figure out what's going on. Managers like the power but not the responsibility, so nothing can get done.

    To make things worse, everyone's got used to doing basically nothing from the comfort of home. (That's the new one, I guess; we can't even launch a moon rocket any more.)

    At least they all had a terrible time lol.

    1. Anonymous Coward
      Anonymous Coward

      Re: A normal weekend

      An Aer Lingus moon rocket would be a dream in green that I would want to be on.

  12. DS999 Silver badge

    Cancel more than 50 flights?

    When the US airlines were struggling this summer they were cancelling thousands of flights a day for multiple days in a row. They probably wish they could have a problem small enough that it only affected 50!

  13. Norman Nescio Silver badge

    Some thoughts...

    Hmmm. I wonder if the failed network card was just misconfigured. There's a really nasty gotcha with Fast Ethernet and auto-negotiation (appears on old Cisco Catalyst switches) which can end up with one end of a connection being configured for full-duplex operation and the other end for half-duplex. For sufficiently low traffic levels, it works, but if you look at the interface errors, you'll see excess collisions reported, rising sharply with traffic volumes, rendering the connection unusable at normal traffic levels. If your load balancing is (mis)configured so the second connection is carrying no actual traffic other than routing updates, everything can look fine (barring the interface collisions counter, if you monitor it) until the primary connection fails and the second connection fails under load.

    As for the length of time repairing the fibre in a railway tunnel: gaining access to railway infrastructure is difficult. Even if there is a possession in operation for engineering works, you don't let telecomms guys onto a potentially working line willy-nilly. If access to the fibres wasn't in the original plan, then the telco will have to go through normal procedures to gain possession/access, and for very understandable safety reasons there are strict protocols to follow, which take time, and probably more so if railway engineering works were going on at the same time.

    1. Peter2 Silver badge

      Re: Some thoughts...

      Not to mention that (at least in the UK) every person going near the track has to have passed a safety course, which is going to seriously cut down on the number of engineers capable of being deployed to site to do the job.

      Also, apropos of nothing: when a prat with a digger went through the connection to my house, I spliced it myself to avoid being offline while waiting for an Openreach engineer; it's easy to do with copper, even if the correct tools for the job are at work and you're improvising with a pair of pliers and a spare patch lead.

      Fibre is just a little more difficult to splice under similar circumstances.

    2. yoganmahew

      Re: Some thoughts...

      Yeah, "working on half load, but not as sole link" is my guess too. And the Kyndryl DC is probably locked so tight it takes 10 hours of approvals to get an engineer on site, having flown him from somewhere first. Mad if it was quicker to fix the fibre.

    3. Alistair
      Windows

      Re: Some thoughts...

      We had *all* of our EOR switches from Cisco with that @#$% misconfiguration.

      Oddly, it's easy to test for in Linux during card initialization and to put the right config in the NIC setup (this was a default on all my deployments until Cisco finally fixed the negotiation protocol). I've *no* idea how the 'Doze team managed things on their end, but we never had the issue after we documented it in house (and bitched at all the vendors involved). HP-UX 11.0 didn't deal well with it on Superdomes as it would halt the boot process at NIC init; we moved those devices to a separate HP switch to mitigate it initially. Eventually 11.2 solved it.

      All that said, that is *OLD* shit. If they're still hitting that *now* I'm gonna suggest that they need a MASSIVE overhaul of infra.
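
      For the archaeology: on that old Fast Ethernet kit, "the right config in the NIC setup" amounted to pinning speed and duplex at init on both the host and the switch port. A sketch of the host side, assuming ethtool is installed, root privileges, and a hypothetical interface name; on gigabit and newer you leave auto-negotiation alone.

      ```python
      #!/usr/bin/env python3
      # Sketch of the old "pin it at init" workaround for Fast Ethernet duplex
      # mismatches: turn off auto-negotiation and force 100/full. The switch port
      # must be set to match, or you recreate the mismatch. The interface name is
      # a hypothetical example; requires ethtool and root.
      import subprocess

      IFACE = "eth0"

      def force_100_full(iface=IFACE):
          subprocess.run(
              ["ethtool", "-s", iface,
               "speed", "100", "duplex", "full", "autoneg", "off"],
              check=True,
          )

      if __name__ == "__main__":
          force_100_full()
      ```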

  14. Anonymous Coward
    Anonymous Coward

    I will lay odds it's really a failure of Aer Lingus to allow any sort of resilience testing. Many customers are very averse to any sort of testing because they either know their 'DR' is crap or they just don't want the pain a proper DR test will cause. Either that or Kyndryl got lumbered with crap infrastructure with no remit to remediate anything.

    Without knowing what monitoring is in place it's hard to say why a NIC failure wasn't picked up - it certainly should have been - but I bet there would have been push back about the downtime needed to replace it. Something about this just doesn't quite add up. A *single* NIC failure and an unrelated *single* fibre being cut brings down their system....?

    I've been in this situation so many times (companies claiming they have functioning DR, refusing to do proper DR tests, refusing downtime etc), and while some questions need to be answered, it's not always fair to just blame the IT provider. More than once I've been lumbered with crap the customer refuses to pay to fix.

    On a lighter note, I am reminded of a time when Aer Lingus leased a jet from another operator some years ago. The jet was resprayed, and someone kindly left a note in the cockpit that simply said "Fly green side up". This is a true story as far as I am aware - there was even a witch hunt for the person who left the note!

  15. heyrick Silver badge

    "the provider had confirmed the issue had never before occurred on any of the 4,000 network cards they had in service"

    Uh-huh.

  16. jollyboyspecial

    So we have a failure of the primary circuit and a failure of a network interface on the backup circuit at the same time?

    People lose primary and backup connections simultaneously more often than you might think, for a variety of reasons.

    You can't blame either the circuit vendor or the network card vendor for this. The real blame for the loss of business operations lies with the lack of a business continuity plan. Risk assessment is important. Is there a risk of losing the primary and backup connections at the same time? Of course there is, no matter how small. The blame could be laid at a couple of different doors here. Maybe the project manager or solution designer didn't tell the business there was a risk of losing both connections simultaneously and for an extended period of time, in which case that's the arse that needs a good solid kicking. Or maybe the business was told of the risk and chose to accept it, in which case the particular manager who made that decision needs the kicking.
