
*Delta knocks on the door of another airline*
"Hi, I was just going to check if there was a power cut, but I saw one of your planes fly overhead with it's lights on"
A computer outage has caused worldwide delays for thousands of passengers using Delta Airlines. The US carrier tweeted about the issues on Monday morning, blaming delayed and cancelled flights on a "computer outage." Delta, based in Atlanta, Georgia, subsequently blamed the crash on a massive power cut at 2.38am ET (7.38am …
Next positive leap second on 31st December (see IERS bulletin)
Will people be ready for that one?
This is why there is so much pressure to kill off leap seconds. The ITU recently kicked that can down the road for another few years, but personally, I don't see why this is still such a problem. We've had leap seconds for decades and computer time protocols have been designed to signal future leap seconds for a *long* time. It does involve the strange concept that a specific minute at the end of June or December should have 61 secs. The specs also allow for 59 secs, but that is now unlikely ever to happen.
You could write the software to deal with a step, but Google decided to just slow down the computer notion of the time in a controlled fashion for several hours so that after the leap second actually occurs the clocks are exactly back in sync. That is probably much more friendly to existing applications.
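A minimal sketch of that slow-down for a positive leap second, assuming a plain linear ramp. The 20-hour window, the function names and the parameters are illustrative assumptions, not Google's published implementation:

```python
def smear_offset(t, window_start, window_len, leap=1.0):
    """Seconds of the leap second already absorbed by the smeared clock at time t.

    t and window_start are seconds on an unsmeared reference timeline.
    The window is timed to end at the instant the leap second finishes;
    from then on the full `leap` seconds have been absorbed.
    """
    if window_len <= 0:
        raise ValueError("window_len must be positive")
    elapsed = t - window_start
    if elapsed <= 0:
        return 0.0
    if elapsed >= window_len:
        return leap
    return leap * elapsed / window_len


def smeared_time(t, window_start, window_len, leap=1.0):
    """Reading of the smeared clock, which falls behind the reference gradually."""
    return t - smear_offset(t, window_start, window_len, leap)


WINDOW = 20 * 3600  # hypothetical 20-hour smear window, in seconds
```

Halfway through the window the smeared clock is half a second behind, and at the end it is exactly one second behind, which is where UTC sits once the leap second has passed. No application ever sees a 61-second minute or a backwards step.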
And long before that we had ephemeris time (1952), and then TDT (1976), and then GPS from 1980 using continuous time with a leap-second offset rather like a time-zone.
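The "offset rather like a time-zone" idea can be sketched like this. The table is deliberately partial: GPS has been 17 seconds ahead of UTC since July 2015, and the 31 December leap second mentioned above takes it to 18. The function name and table layout are assumptions made for illustration:

```python
from datetime import datetime, timedelta

# Partial table: (UTC date the offset took effect, GPS minus UTC in seconds).
# Only the two most recent steps are listed; earlier ones are omitted here.
GPS_MINUS_UTC = [
    (datetime(2015, 7, 1), 17),
    (datetime(2017, 1, 1), 18),
]


def gps_to_utc(gps_time):
    """Convert a naive GPS-timescale datetime to UTC using the table above.

    The lookup compares the GPS reading against UTC effective dates, so
    results within a few seconds of a leap second itself are not reliable.
    """
    offset = None
    for effective, seconds in GPS_MINUS_UTC:
        if gps_time >= effective:
            offset = seconds
    if offset is None:
        raise ValueError("time is earlier than this partial table covers")
    return gps_time - timedelta(seconds=offset)
```

The GPS clock itself never steps; only the published offset changes, which is why receivers can keep counting seconds without caring about leap seconds at all.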
As I keep saying IT IS A KNOWN FEATURE and if your code can't handle it gracefully you are incompetent due to either:
1) Not using tested system libraries to handle time, delays, etc.
2) Writing or modifying said libraries without knowing what you are doing.
And most of all NOT TESTING YOUR DAMN CODE! Really, just set up a fake NTP time server and have it generate leap seconds regularly backwards and forwards and see if your code works.
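Short of a fake NTP server, even a unit test that feeds your code a 61-second minute catches a lot. A minimal sketch in Python, where parse_utc is a hypothetical helper written for this example and not part of any library. It relies on time.strptime accepting a seconds field of 60, whereas the datetime constructor rejects it:

```python
import time
from datetime import datetime, timezone


def parse_utc(stamp):
    """Parse 'YYYY-MM-DD HH:MM:SS' (UTC), tolerating a leap second (SS == 60).

    Returns (datetime, is_leap). A leap second is folded onto :59 because
    datetime cannot represent second 60.
    """
    tm = time.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    is_leap = tm.tm_sec >= 60
    second = min(tm.tm_sec, 59)
    dt = datetime(tm.tm_year, tm.tm_mon, tm.tm_mday,
                  tm.tm_hour, tm.tm_min, second, tzinfo=timezone.utc)
    return dt, is_leap


# The case that naive code trips over: the 61st second of the last minute.
dt, is_leap = parse_utc("2016-12-31 23:59:60")
```

Folding onto :59 is only one policy; the point is that the code picks a policy on purpose and a test proves it, rather than finding out in production.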
Uh, wouldn't that be because a warmer atmosphere expands, moving mass away from the centre of the Earth, so conservation of angular momentum demands that the Earth's rate of spin slows to compensate. Same thing with the seas, which although they wouldn't move as much, are considerably denser. I'm surprised if that's sufficient to affect the rotation by as much as several seconds in just a few years, but I can't be bothered to do the maths.
IIRC it's not the expansion of the atmosphere, but a reduction in viscosity that allows atmospheric tides to counterbalance lunar torque. However I'm being disingenuous: it was coming out of an ice age and at a time when the moon was a lot closer. (600Myr ago, a 21 hour day.) And while the corner in the delta-T is striking, the earth's mass distribution is changing all the time and it's far more likely that's the cause.
The length of the mean solar day *was* 86400 SI seconds around the middle of the 19th century. That rotation rate was embodied in the astronomical observations that were used to define Universal Time in the late 19th century. When UT was replaced as the best measure of time by Ephemeris Time and then by International Atomic Time in the 20th century, both ET and TAI were defined to have the same length of second as UT. And that's why we have a problem with leap seconds: the SI second reflects the rotation speed of the Earth almost 200 years ago, not today.
"Will people be ready for that one?"
Well the one that followed the aircraft-bothering incident went with practically no issues at all. Simply because folk had woken up and tested things for the inevitable occurrence of another leap-second.
In fact the Linux bug mentioned had been created by somebody modifying already-working time related code and not testing the damn thing for this situation. As others have already said, leap seconds and means to deal with them have been with us for decades already so it's not new stuff. But every new generation of code monkeys seems to be able to break things...
"In fact the Linux bug mentioned had been created by somebody modifying already-working time related code and not testing the damn thing for this situation."
Was it actually a Linux bug? I find that hard to believe, given the tens of millions of installations of Linux in backend server systems, not to mention embedded systems, around the world. I think an OS timing bug would have caused more problems than just an airline's reservation system going down. Far more likely an application bug which was conveniently blamed on the OS. Also, what application crashes just because of a 1 second difference, even if the OS was at fault?
While we're at it, it's Sabre, not "Sebre", and Virgin Australia are on the Sabre system, not the Altea system.
Oh and Delta bought the code rights to the mainframe they are on (Deltamatic). It is still managed by Travelport (for infrastructure). http://www.travelmarketreport.com/articles/Delta-Reacquires-Res-Operations-Systems-From-Travelport
It's not really clear what data centre was hit and whether it was the one that houses the mainframe. Delta reservations appeared to be okay.
Grumpy old mainframer...
Really? They have their main business-critical system at a single datacenter, without geographical redundancy, so a power cut at that datacenter can bring down the whole thing?
I would have hoped such an essential system would be spread over 2 or 3 sites, so that losing one site has no impact on operations.
Reasoning which holds up well until the catastrophe occurs and you see the bill for repairs. More often than not, you will then reevaluate your opinion of what "makes sense" as far as investments are concerned.
True story: at an important government-level organization I will not name further, there was a kerfuffle when a senior engineer warned, in writing, all the way up the hierarchy, that the then-current PC upgrade process was an open invitation to viruses and expensive downtime.
He was hauled into his managers' office for a right chewing out, which, being a senior engineer in a function from which nobody could oust him, he took with a verbal barrage of his own (likely containing many words such as "idiotic", "moronic", "abysmally stupid" etc - don't know, wasn't there, but I damn well hope so). Still, he was told that the investment "wasn't worth it" and that he should "stop making waves".
As fate would have it, the tsunami hit later that year. An outdated PC piloted by a nincompoop got infected, the infection spread to the servers, and everything was shut down for at least 3 days. That's over 500 people with no more PCs for 24 work hours. You do the math.
He did the math, and presented the cleanup bill with a scathing "I told you so" that, curiously, all the managers took quite meekly.
The PC upgrade schedule was changed after that. Unbelievable, ain't it ?
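For anyone who does want to do the math, a back-of-envelope sketch. The head count and hours are the ones quoted in the story; the hourly cost is a made-up placeholder, since the story gives no figure:

```python
def downtime_cost(people, hours_lost, hourly_cost):
    """Cost of idle staff: head count x lost work hours x loaded hourly cost."""
    return people * hours_lost * hourly_cost


PEOPLE = 500      # "over 500 people", from the story
HOURS_LOST = 24   # "24 work hours", from the story
HOURLY_COST = 50  # hypothetical loaded cost per person-hour; not in the story

person_hours = PEOPLE * HOURS_LOST
print(person_hours)                                    # 12000
print(downtime_cost(PEOPLE, HOURS_LOST, HOURLY_COST))  # 600000
```

That is 12,000 person-hours before counting the cleanup itself, which is usually enough to make an upgrade budget look cheap.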
I've been involved in several incidents over the years where I've said to the boss: "We need to spend X to replace an aging/failing system." I'd be asked: "Is it currently broken or about to fail?", and when I replied "No", was told to forget about it.
Later on, the system in question would die and management would complain about people not being able to do their jobs. A blank cheque was usually swiftly provided to replace said faulty system.
Rumsfeld:
Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones.
> UPS+backup generator=no problem?
Then your single-points-of-failure include the UPS, the generator, the power distribution units, and the emergency-power-cut-off buttons. UPSs, generators and PDUs all break. Idiots switch things off for maintenance without thinking of the consequences, or press emergency-power-cut-off buttons by mistake.
You can have multiple generators or UPSs, although you still have the risk of a design flaw taking them all out when they're needed (e.g. http://www.zdnet.com/article/365-main-details-sf-outage-problems/ ).
There have been plenty of stories about datacenter power outages on The Register, despite the standard UPS+generator.
I remember about 15 years back having my terminal screen wink out while working on a system a thousand or so miles distant at a US military data center. Not coincidentally, others nearby working on various other systems there had the same experience at the same time, and ensuing discussions with the SA revealed that all power to the main computer building had dropped because a contractor (WHO HAD BEEN TOLD) severed the cables from the outbuilding containing the substation, the redundant UPSs, and the backup generators. Power was restored around 6 hours later.
It's actually possible to RAID your UPS systems and run everything via the UPS rather than introduce a switchbreak (large systems use a flywheel and this has the effect of supplying conditioned power to the site).
It's also possible to RAID the generators that back the UPSes.
As someone else has mentioned, the problem is managers looking to get a bonus for cutting costs who end up ripping resilience out of the systems. I wonder if they'd be so keen if they were made liable for the costs of system failures if it traces back to their cost-cutting.
Worked at a privately owned company where the owner decided that the backup generator hadn't been needed in years, so he had it moved and connected to his house. I'll leave it to the reader to visualize what happened 6 months later when the power failed at the plant/office. Two months later, we were getting new dual UPSs and two diesel generators.
The problem tends to be that nobody ever has the rights and the guts to actually force a real time test. I suppose it's as well that they don't do real time nuclear rocket tests, 8 inch floppy disks or not. What I have seen more than once is companies who think their backups are fine until they find out the backups were never tested and have been total rubbish for years (typically backing up stuff from the wrong place, which was the right place long ago).
@Lars
Go and look up the Netflix Chaos Monkey and its parent, the Simian Army, which does just that: it generates different failures in various tiers of their applications within Amazon AWS to ensure that the applications and infrastructure fail over in the correct manner when things break.
See Simian Army
Worst. Airline. I. Have. Ever. Flown. With.
LHR to ATL about 20 years ago. Rude staff, shoddy aircraft, and the code-share they had with Virgin coming back was a joke, not knowing which check-in desk to use at JFK was bonkers. At least we flew back with Virgin, which was one of the best flights I've ever had (although that might be because we got bumped...)
I'll see your 'Delta' and raise you BWIA: Britain's Worst Investment Abroad. Also known as Better Wait In Airport. BOAC was Better On A Camel. I remember one flight from Kingston, Jamaica, to Miami on BWIA which took off five hours late and then got later. Something about having to take the long way around Cuba 'cause someone in Trinidad, where BWIA was based, had pissed off the Cubans. Another flight, from Kingston to London, on British Airways not long after it ceased being BOAC, had to divert to Bermuda 'cause someone didn't quite close the cap on one of the wing fuel tanks and we were leaking kerosene over the Atlantic. Yes. Really. That year a lot of British Airways staff went on strike, including engineering support crew, and several aircraft, including the one we flew in on and the one we were supposed to fly back on, had 'cracks' in the wings. Apparently big ones, which could be fixed, except that the guys who were supposed to fix 'em were on strike. So British Airways arranged that everyone on our flight was to fly back to relatively calm and peaceful Jamaica on Air Jamaica... except that Air Jamaica didn't have enough aircraft which could fly the Atlantic Kingston-London, and the one and only one they did have was busy on Air Jamaica's regular service, so the return flight was made with a stop in the truly wonderful vacation spot of Gander, Newfoundland. I haven't flown British Airways since... It'll be 40 years, come next year. I have flown on BWIA. They just were always late, they didn't actively risk your life flying them. They and Air Jamaica have combined into Caribbean Airlines. They're still always late.
>>>Well, time fudges the memory. Must have been out from Gatwick, and back to Heathrow with Virgin.
Hmmm...Back in the day Delta only flew from LHR and Virgin only from LGW.
Anyway agreed, Delta are not very good, but none of the other trans-Atlantic airlines I have tried are any different.
Aeroflot has been extremely safe for at least 15 years.
The problem is there are a plethora of Russian airlines, many having splintered off Aeroflot and still using aircraft painted up with Aeroflot's logo - and it's those ones which are still falling out of the sky with monotonous regularity.
In other words, be very careful who you fly with.
In other news, one of the single largest contributors to civil air transport safety was the decision most airlines made in the 1990s to stop recruiting ex-military pilots. Originally this was due to a shortage of them that forced more reliance on flight schools, but it soon became clear that civil-trained pilots were much safer because, unlike military pilots, they'd assess the situation and take into account the number of people behind them, rather than taking a gung-ho approach and trying to land no matter what.
The power connections for the three phase power was "WYE", and this was "Delta".
@Herby - the article appears to be inaccurate (and US-centric). Quote, "Most homes are wired with single-phase that uses one ac voltage delivered over two hot wires and one neutral wire. The voltage across the two hot wires measures 240VAC (for your oven or dryer) and across any hot to neutral measures 120VAC (for everything else)."
That describes a home fed by a 2-phase supply, not a single phase supply. Single phase has one live and one neutral and only one voltage.
My company has its regional HQ in Atlanta. We are not allowed to fly Delta if there is another choice, even if it is more expensive. Before the ban was put in place I wanted to fly to Cozumel in Mexico. There was a route via ATL but Delta would only let us book a 5 hour connection. WTF. If Lufthansa can do 35 mins inc hold baggage in Munich, why?
Mind you American and United aren't much better. Southwest and Alaskan seem to be the top of a poor pile.
Not sure why you got downvoted (I upvoted you to compensate) - I agree with your assessment. I used to fly United almost exclusively, but their service and attitude to customers has really deteriorated in the last 5 years or so (probably coinciding with the Jeff Smisek helmsmanship, now that I think of it...). Now I'm just using up my accumulated United miles & not bothering to use them for new paid travel. I had to fly United out of Manchester a couple of weeks ago - there's an unholy combination for you.
That said, I think Southwest are pretty good - they're cheap, but at least they're also cheerful. But I know opinion is sharply divided on them.
Airlines with their own system (including only old-fashioned mainframe-type ones):
Turkish
ANZ
JAL
Air India (maybe?)
Iberia, Lufthansa, AF, KLM, Finnair all distribute through Amadeus, but had back-ends at one time. Not sure how many of them still do.
No, you're not the only one Uppsy Daisy ;-)
http://www.ajc.com/news/business/delta-outage-grounds-flights-causing-massive-cance/nsB5W/#signin
Georgia Power spokesman John Kraft said a failure overnight of switch gear equipment caused the outage. He said other Georgia Power customers were not affected because it was an issue with Delta equipment, and said Georgia Power crews were on site working with Delta to repair the equipment.
I wonder how many of these datacenter power disasters would have been avoided if everyone had just standardized on -48VDC for servers, like in the telco world?
It's orders of magnitude simpler to provide backup (batteries! No DC/AC converters needed) and failover (no phase sync needed, among other things) for.
Hell, apart from enormous DC UPS systems, the only time you see those insanely huge DC/AC converters is in HVDC systems... For a reason.
Because accountants get all antsy when you tell them that a DC powered server is three times the cost of an AC powered one. It was OK in the days of monopoly telcos because the server vendors could hide behind NEBS but it's a brave engineer who'll suggest a few NEBS boxes in a high-class data center vs a boatload of dells in a bit barn.
I'll raise a pint to all the worker bees who are spending their days fixing it while a management storm of fecal matter rages around them.
If you are a CIO, do you attempt to build your own data center operations, knowing that your limited experience is probably going to miss something, or do you outsource to Amazon, knowing that you are going to be held to ransom if you ever try to leave?
Sometimes I miss the old days. A VAX cluster would have survived Delta landing a plane on it.
We have some 360s, 370s, & 390s available for demo in the cloud (our backend warehouse storage room). Just UPS (United Parcel Service) overnight your card deck and in 24 hours we will mail you back, 2 day USPS, your printout and card deck. /sarc off.
Doesn't that 390 emulate / run Linux? Is there an IBM 360 emulator for Linux available?
It will be interesting to see what the CIO does to address this power problem in the next few days.
zSeries run Linux just fine (native), and while power is available will run with substantial component failures, although with some performance degradation, and many of the failures can be corrected without a reboot.
They also will run briskly at 100% CPU utilisation as long as the paging rate is kept reasonable.
The z/TPF mainframe stayed up - it was unaffected.
All the decentralised systems (Flight Info displays, kiosks, online check-in etc.) failed.
Source: TPFers Group on Facebook.
Note:
Not scoring points and would never gloat over the failings/disasters of others - all too aware that the next operational failure could be in my own back garden.