Airline 'in talks' with Kyndryl after failed network card grounds flights

Aer Lingus says it is in talks with its IT services supplier, former IBM arm Kyndryl, after the disastrous combo of a sliced fiber optic cable and a faulty network card on the backup line caused an IT systems outage that forced the airline to cancel more than 50 flights. The outage on September 10 disrupted plans for tens of …

  1. John Doe 12

    Cannot cheat fate

    Aer Lingus are incompetent idiots, and if it wasn't this then something else would soon have come along. Yes, I know people will say this was a technical issue beyond their control, but it's up to the airline to have a diverse disaster recovery plan that is checked regularly for function.

    Ryanair are actually running rings around the state carrier which either says Michael O'Leary is doing a great job OR that Aer Lingus are THAT bad!!

    During the aftermath of the pandemic Aer Lingus behaved like a bunch of thieves with monies for flights that didn't fly and spent a lot of energy fobbing people off with vouchers that half the time were used to purchase other flights that didn't fly either!!

    1. wolfetone Silver badge

      Re: Cannot cheat fate

      It's easy to hate O'Leary, but what he's done (and continues to do) with Ryanair does show that Aer Lingus really are that bad. They are managed in a way that they essentially think "We're the flag carrier, we don't have to change to suit you. You have to want to fly with us". A bit like British Airways can be sometimes.

      I just hope for Aer Lingus' IT contractor that their insurance policy is up to date and paid for.

      1. Korev Silver badge
        Big Brother

        Re: Cannot cheat fate

        It's easy to hate O'Leary, but what he's done (and continues to do) with Ryanair does show that Aer Lingus really are that bad. They are managed in a way that they essentially think "We're the flag carrier, we don't have to change to suit you. You have to want to fly with us". A bit like British Airways can be sometimes.

        Anyone would think they're the same company...

        1. wolfetone Silver badge

          Re: Cannot cheat fate

          You would think that, but the flying experience I've had on BA is nothing like Aer Lingus. BA was so much better; AL felt a bit shonky.

      2. John Doe 12

        Re: Cannot cheat fate

        I would LOVE to see the I.T. contractor force Aer Lingus to take credit vouchers rather than hard cash as compensation. The irony would be delicious ha ha

        1. Anonymous Coward
          Anonymous Coward

          Re: Cannot cheat fate

          Most contracts I've seen stipulate that any compensation will be in terms of service credits, so you're not far off the mark there.

    2. chivo243 Silver badge
      Coat

      Re: Cannot cheat fate

      I've had good and really bad experiences* with Aer Lingus, but I haven't flown with them in close to 6 years and may never again just because my travel patterns have changed...

      When it's bad, it's bad*... like public seating in Dublin airport

    3. Anonymous Coward
      Anonymous Coward

      Re: Cannot cheat fate

      Well, they had issues on Friday the 9th too.

      My AL flight from Dublin to Newark was cancelled and we all got shoved on a United flight. Lots of issues with tickets and staff complaining about the systems not working.

      Rail/tunnel/digging/Birmingham - HS2 perhaps?

  2. Korev Silver badge
    Terminator

    The fault in the primary line was a severed cable and the second failure, in the backup line leading to the other DC, was due to a failed network card.

    and

    The Aer Lingus exec pointed out that their data had been mirrored to two separate sites, a datacenter in Manchester, and the second one in Birmingham by their IT services provider, and that the lines had been replicated into both

    Why didn't the primary DC use the backup line?

    1. Anonymous Coward
      Anonymous Coward

      Probably because some manager somewhere who wanted brownie points for saving money decided "a six month check is enough, right?".

      1. Dronius

        Our latest outages triggered no alarms on the fibre circuits from our providers.

        The reason was that both paths (main and backup on each discrete provider pathway, four in all) had a bit rate present, so the automated monitoring never raised an alarm and the automatic fail-over never did 'fail over'.

        It was only when a content analyser was pointed at a suspect stream that we identified the fault.

        Even then there was doubt due to a fight between the auto fail-over system at one hub and each subsequent one in the chain.

        Each one failed over to the alternative so the upstream fault was not revealed.

        So many compounding fibre providers, each with their own system cobbling together pathways, managed to mask the source fault from the automated fail-over systems. The thing is, we commission from a provider, but until the thing goes pear-shaped the details of the sub-providers are unknown for 'commercial reasons'.
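
        For what it's worth, here's a minimal sketch (not our setup; every name, address and port below is a made-up placeholder) of the kind of active end-to-end probe that catches this: force traffic down each path by binding to a per-path source address and time a round trip to a small echo/health endpoint at the far end, rather than trusting "bit rate present".

        ```python
        #!/usr/bin/env python3
        # Minimal sketch: probe each provider path end-to-end instead of trusting
        # "bit rate present". All addresses, ports and path names are hypothetical.
        import socket
        import time

        # Each path is identified by the local source address that routes over it
        # and a far-end echo/health endpoint reachable via that path.
        PATHS = {
            "primary": {"src": "10.0.1.2", "dst": ("203.0.113.10", 7)},
            "backup":  {"src": "10.0.2.2", "dst": ("203.0.113.11", 7)},
        }

        def probe(src_addr, dst, timeout=2.0):
            """Return round-trip time in milliseconds, or None if the path is unusable."""
            try:
                with socket.create_connection(dst, timeout=timeout,
                                              source_address=(src_addr, 0)) as s:
                    start = time.monotonic()
                    s.sendall(b"ping")
                    s.settimeout(timeout)
                    if not s.recv(4):            # far end closed without echoing back
                        return None
                    return (time.monotonic() - start) * 1000.0
            except OSError:
                return None

        if __name__ == "__main__":
            for name, path in PATHS.items():
                rtt = probe(path["src"], path["dst"])
                print(f"{name}: {rtt:.1f} ms" if rtt is not None else f"{name}: FAILED")
        ```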

    2. deadlockvictim

      Korev» Why didn't the primary DC use the backup line?

      Good question.

      I will say that disaster recovery is not easy. Or rather, planning it out is a non-trivial task, and testing it is hard and very expensive. These systems need to be tested often and regularly.

      Why didn't they check that all of the moving parts were operational?

      Everything that can be made redundant should be made redundant. Why wasn't the network card redundant?

      This failure for Aer Lingus demonstrates why redundancy is so important.

  3. Anonymous Coward
    Anonymous Coward

    Apologia for the world's construction workers

    On the other hand, a big problem that BT used to face is that doing the due diligence for cable location, and then digging carefully, is expensive. Many construction companies prefer to play the odds, and just dig. If they should have the misfortune to hit something they just call it in, leave it for the (already overworked) cable supplier to fix, and then pay the bill when they get it. It can work out cheaper than being careful, in the long run.

    1. Marty McFly Silver badge

      Re: Apologia for the world's construction workers

      My neighbor digs holes for telephone pole replacement. He no longer uses a shovel. He has a giant vacuum that creates the excavation, often finding undocumented infrastructure.

      1. Anonymous Coward
        Anonymous Coward

        Re: Apologia for the world's construction workers

        Hole vacuums are a great invention.

        When they put in the fiber optic line to our house they used one to find the old but still in use copper cable. Then they knew how to have the directional drilling rig go high (or low) and miss the cable. Worked great.

        1. Anonymous Coward
          Anonymous Coward

          Re: Apologia for the world's construction workers

          Wish I'd known about those before I dug 50 fence post holes by hand because I didn't know what was underneath!

      2. Sgt_Oddball
        Holmes

        Re: Apologia for the world's construction workers

        I wondered what those things were supposed to be for. I thought they might be for reducing the amount of noise/dust produced, but sucking around infrastructure makes sense too.

        They're busy using them to make a continued mess of Leeds city centre just down the road from my office (it doesn't help that they're making large chunks of the already complex one-way city loop bus- and taxi-only during specific arcane periods, now different on each road).

    2. Steve Graham

      Re: Apologia for the world's construction workers

      When I joined BT in 1981 (actually, it was still GPO "trading as" British Telecommunications), young executives were sent for "Area Training" for two four-week periods, to climb poles, dig holes and crimp wires, to see how the Real Work was done.

      One of my happy memories is of a hot Summer's day on a former airfield in East Anglia, trying to find a buried copper cable with the assistance only of a 1920s Air Ministry map.

      1. R Soul Silver badge

        Re: Apologia for the world's construction workers

        There's probably someone from Openretch with that 1920s map still hoping to find that very same cable.

    3. heyrick Silver badge

      Re: Apologia for the world's construction workers

      Where I used to live, they wanted to turn some scrub land into council housing. Turns out that back in the sixties or so it was used for caravans or mobile homes, long since forgotten, and huge amounts of infrastructure ran through there, damn near to the other end of town, following a winding road (for the caravans) that is no longer really visible.

      Of course, absolutely none of this was on any map or plan.

      They performed a test dig and came across gas pipes with buried electrical wires running alongside. All with no markings whatsoever until you reach the things.

      The hole was filled in and the project abandoned as "too expensive to dig the entire place by hand to see what's down there".

      1. Korev Silver badge

        Re: Apologia for the world's construction workers

        I wonder if they were billed for their gas and/or electricity?

  4. Jou (Mxyzptlk) Silver badge

    Cold standby?

    Not hot standby, not in "send the data twice over both", nor in a load-balancing configuration? The faulty card would have shown up in those configurations before the cable got ripped. If monitored, of course.

    I smell beancounters...

    1. Marty McFly Silver badge

      Re: Cold standby?

      Cold standby is just fine... as long as disaster recovery exercises are performed on a regular basis.

      1. James Anderson

        Re: Cold standby?

        Not really. It’s a psychological problem. As soon as a system is labelled backup, standby, or “the B system”, human nature will downgrade it and the system will be neglected. A load-balanced cluster is best; failing that, a monthly switch between sites will keep people alert, and also ease upgrades.

        I am presuming this is an IBM mainframe site. They mastered high availability 30 years ago, so no excuses.

    2. Anonymous Coward
      Anonymous Coward

      Re: Cold standby?

      "Not Hot, not in "send data double over both", or in load-balancing configuration? The faulty card would have shown up in those configurations before the cable got ripped. If monitored, of course.

      I smell beancounters..."

      I was thinking exactly the same! Every single time I've seen or heard of such a set-up failing, it was because two "redundant" links had in fact been run through a single tray, which was damaged.

      What's described in the article has about the same probability of happening as two unrelated asteroids landing in the same square metre of your garden on the same day...

      For sure, the network card had been faulty for a long time...

  5. john.w

    Virtue (or a system) untested is not virtue (or a system) at all.

    "So it should have been more resilient than it proved to be on the day."

  6. MiguelC Silver badge
    WTF?

    24/7, seven days a week

    so... 24/49?

  7. Anonymous Coward
    Anonymous Coward

    Not sure I buy Lynne Embleton's explanation, or more likely the line she's been fed by Kyndryl (what sort of dumb crazy name is that??). If this is really true:

    ...When asked how often the backup line was tested, Embleton responded: "What is common practice is to allow traffic over both, really to avoid a situation where one has been lying dormant and then, at the point it's needed, it fails, so we flip between the two to enable us to ensure both are working, which is why... to have a main failure and then a backup failure really shouldn't have happened...

    which I guess points to some sort of load-balanced configuration, in which case standard network monitoring should have picked up problems with the network card on the secondary well before the primary failure. Or maybe no one was looking at the error logs... Surely not!!
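
    If anyone was looking, this sort of thing is trivial to automate. A rough sketch for a Linux host with systemd-journald, where the match patterns are driver-dependent examples rather than any definitive list:

    ```python
    #!/usr/bin/env python3
    # Rough sketch: scan the last day of kernel messages for signs of NIC trouble.
    # Assumes a Linux host with systemd-journald; the patterns are driver-dependent
    # examples only.
    import re
    import subprocess

    SUSPECT = re.compile(r"link is down|link down|tx.*timeout|carrier lost",
                         re.IGNORECASE)

    def nic_warnings(since="-1d"):
        """Return kernel log lines since the given relative time that look NIC-related."""
        out = subprocess.run(
            ["journalctl", "-k", "--since", since, "--no-pager"],
            capture_output=True, text=True, check=True,
        ).stdout
        return [line for line in out.splitlines() if SUSPECT.search(line)]

    if __name__ == "__main__":
        hits = nic_warnings()
        if hits:
            print(f"{len(hits)} suspicious kernel messages; most recent:")
            print(hits[-1])
        else:
            print("No obvious NIC complaints in the last day.")
    ```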

    1. Anonymous Coward
      Anonymous Coward

      Kyndryl: https://www.theregister.com/2021/11/04/kyndryl_ibm_spinoff/. It is a crazy name.

      1. Ian Johnston Silver badge

        It sounds like a cough syrup.

        1. Ken Shabby
          Alert

          It is what a dentist calls his handpiece when he cannot find it.

          where is my f..?

  8. Anonymous Coward
    Anonymous Coward

    This is why I like load balanced over standbys

    With a load-balanced setup you have, in principle, an always-on functionality test. A hot or cold standby isn't one until you have verified it actually works, and you have to do that frequently.

    1. Anonymous Coward Silver badge
      Boffin

      Re: This is why I like load balanced over standbys

      But you also need to be sure that:

      1) Each line can independently handle the full load.

      2) The systems either side correctly handle one route missing (with 2 DCs you need to ensure they can stay in sync/resync cleanly).

      1. Anonymous Coward
        Anonymous Coward

        Re: This is why I like load balanced over standbys

        with 2 DCs you need to ensure they can stay in sync/resync cleanly

        Yup. We followed IBM's recommendation and used DCs that were less than 25 km apart and had dark fiber between them. That gives you latency low enough to enable RDMA, to the point that a whole VM can be migrated from one DC to another while live, which we occasionally do just for fun (we call it 'training', of course).
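
        Back of the envelope, assuming roughly 5 µs per km for light in fibre (about right for a refractive index of ~1.47), the 25 km limit keeps propagation delay well inside what synchronous replication tolerates:

        ```python
        # Why ~25 km matters: light in fibre covers roughly 5 microseconds per km,
        # so propagation alone adds about a quarter of a millisecond per round trip.
        distance_km = 25
        one_way_us = distance_km * 5      # ~125 microseconds one way
        round_trip_us = 2 * one_way_us    # ~250 microseconds round trip
        print(round_trip_us)              # 250 -- comfortably low for synchronous replication
        ```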

  9. Flicker

    It was Moriarty!

    ... the airliner's [sic] corporate affairs officer Donal Moriarty ...

    That's the problem - they have a criminal mastermind running corporate affairs!! Maybe they should put Holmes on the case...

    1. Sgt_Oddball

      Re: It was Moriarty!

      Does he have a friend who goes by the name Grytpype-Thynne?

      It does rather sound like the sort of silliness the Goon Show used to mock.

    2. Korev Silver badge
      Joke

      Re: It was Moriarty!

      That's the problem - they have a criminal mastermind running corporate affairs!! Maybe they should put Holmes on the case...

      As it's IBM Kyndryl, wouldn't they be using Watson?

  10. Anonymous Coward
    Anonymous Coward

    Bee careful

    Local contractor was drilling holes for street light foundations. One of those augers that's a couple feet wide. Started a new hole next to an old tree stump. Auger gets embedded and hits a root, shifting and vibrating the tree stump. Killer bees did not like this happening to their hideaway. They swarm the auger operator who flees. Between the roots and inadvertent jarring of controls, auger continues drilling in a somewhat different direction.

    Comms go down for multiple neighborhoods for many hours. More hours, because the killer bees had to be rousted first, before the comms people could even begin.

    Nobody blamed the operator. He started in the right place. It was the NIMBYs that caused the outage.

    1. Korev Silver badge
      Coat

      Re: Bee careful

      Nobody blamed the operator. He started in the right place. It was the NIMBYs that caused the outage.

      You mean NIMBees?

  11. Anonymous Coward
    Anonymous Coward

    A normal weekend

    Sorry, it just has all the hallmarks of the usual 'unlikely' combination of factors.

    Weekend / weekend workers / weekend support / weekend manager / inability to correctly figure out what's going on. Managers like the power but not the responsibility, so nothing can get done.

    To make things worse, everyone's got used to doing basically nothing from the comfort of home. (That's the new one, I guess; we can't even launch a moon rocket any more.)

    At least they all had a terrible time lol.

    1. Anonymous Coward
      Anonymous Coward

      Re: A normal weekend

      An Aer Lingus moon rocket would be a dream in green that I would want to be on.

  12. DS999 Silver badge

    Cancel more than 50 flights?

    When the US airlines were struggling this summer they were cancelling thousands of flights a day for multiple days in a row. They probably wish they could have a problem small enough that it only affected 50!

  13. Norman Nescio Silver badge

    Some thoughts...

    Hmmm. I wonder if the failed network card was just misconfigured. There's a really nasty gotcha with Fast Ethernet and auto-negotiation (appears on old Cisco Catalyst switches) which can end up with one end of a connection being configured for full-duplex operation and the other end for half-duplex. For sufficiently low traffic levels, it works, but if you look at the interface errors, you'll see excess collisions reported, rising sharply with traffic volumes, rendering the connection unusable at normal traffic levels. If your load balancing is (mis)configured so the second connection is carrying no actual traffic other than routing updates, everything can look fine (barring the interface collisions counter, if you monitor it) until the primary connection fails and the second connection fails under load.

    As for the length of time repairing the fibre in a railway tunnel: gaining access to railway infrastructure is difficult. Even if there is a possession in operation for engineering works, you don't let telecomms guys onto a potentially working line willy-nilly. If access to the fibres wasn't in the original plan, then the telco will have to go through normal procedures to gain possession/access, and for very understandable safety reasons there are strict protocols to follow, which take time, and probably more so if railway engineering works were going on at the same time.

    1. Peter2 Silver badge

      Re: Some thoughts...

      Not to mention that (at least in the UK) every person going near the track has to have passed a safety course, which is going to seriously cut down on the number of engineers capable of being deployed to site to do the job.

      Also, apropos of nothing: when a prat with a digger went through the connection to my house, I spliced it myself to avoid being offline while waiting for an Openreach engineer; it's easy to do with copper, even if the correct tools for the job are at work and you're improvising with a pair of pliers and a spare patch lead.

      Fibre is just a little more difficult to splice under similar circumstances.

    2. yoganmahew

      Re: Some thoughts...

      Yeah, "working on half load, but not as sole link" is my guess too. And the Kyndryl DC is probably locked so tight it takes 10 hours of approvals to get an engineer on site, having flown him from somewhere first. Mad if it was quicker to fix the fibre.

    3. Alistair
      Windows

      Re: Some thoughts...

      We had *all* of our EOR switches from Cisco with that @#$% misconfiguration.

      Oddly, it's easy to test for in Linux during card initialization and to put the right config in the NIC setup (this was a default on all my deployments until Cisco finally fixed the negotiation protocol). I've *no* idea how the 'Doze team managed things on their end, but we never had the issue after we documented it in house (and bitched at all the vendors involved). HP-UX 11.0 didn't deal well with it on Superdomes as it would halt the boot process at NIC init; we moved those devices to a separate HP switch to mitigate it initially. Eventually 11.2 solved it.

      All that said, that is *OLD* shit. If they're still hitting that *now* I'm gonna suggest that they need a MASSIVE overhaul of infra.
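
      For the archaeology: on that old Fast Ethernet kit, "the right config in the NIC setup" amounted to pinning speed and duplex at init on both the host and the switch port. A sketch of the host side, assuming ethtool is installed, root privileges, and a hypothetical interface name; on gigabit and newer you leave auto-negotiation alone.

      ```python
      #!/usr/bin/env python3
      # Sketch of the old "pin it at init" workaround for Fast Ethernet duplex
      # mismatches: turn off auto-negotiation and force 100/full. The switch port
      # must be set to match, or you recreate the mismatch. The interface name is
      # a hypothetical example; requires ethtool and root.
      import subprocess

      IFACE = "eth0"

      def force_100_full(iface=IFACE):
          subprocess.run(
              ["ethtool", "-s", iface,
               "speed", "100", "duplex", "full", "autoneg", "off"],
              check=True,
          )

      if __name__ == "__main__":
          force_100_full()
      ```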

  14. Anonymous Coward
    Anonymous Coward

    I will lay odds it's really a failure of Aer Lingus to allow any sort of resilience testing. Many customers are very averse to any sort of testing because they either know their 'DR' is crap or they just don't want the pain a proper DR test will cause. Either that or Kyndryl got lumbered with crap infrastructure with no remit to remediate anything.

    Without knowing what monitoring is in place it's hard to say why a NIC failure wasn't picked up - it certainly should have been - but I bet there would have been push back about the downtime needed to replace it. Something about this just doesn't quite add up. A *single* NIC failure and an unrelated *single* fibre being cut brings down their system....?

    I've been in this situation so many times (companies claiming they have functioning DR, refusing to do proper DR tests, refusing downtime etc), and while some questions need to be answered, it's not always fair to just blame the IT provider. More than once I've been lumbered with crap the customer refuses to pay to fix.

    On a lighter note, I am reminded of a time when Aer Lingus leased a jet from another operator some years ago. The jet was resprayed, and someone kindly left a note in the cockpit that simply said "Fly green side up". This is a true story as far as I am aware - there was even a witch hunt for the person who left the note!

  15. heyrick Silver badge

    "the provider had confirmed the issue had never before occurred on any of the 4,000 network cards they had in service"

    Uh-huh.

  16. jollyboyspecial

    So we have a failure of the primary circuit and a failure of a network interface on the backup circuit at the same time?

    People lose primary and backup connections simultaneously more often than you might think, for a variety of reasons.

    You can't blame either the circuit vendor or the network card vendor for this. The real blame for the loss of business operations lies with the lack of a business continuity plan. Risk assessment is important. Is there a risk of losing the primary and backup connections at the same time? Of course there is, no matter how small. The blame could be laid at a couple of different doors here. Maybe the project manager or solution designer didn't tell the business there was a risk of losing both connections simultaneously and for an extended period of time, in which case that's the arse that needs a good solid kicking. Or maybe the business was told of the risk and chose to accept it, in which case the particular manager who made that decision needs the kicking.
