Million to one chances ....
.... happen 9 times out of 10! - PTerry
The small island of Jersey's natural gas supply is still switched off five days after a software problem caused its main facility to failover to a safety mode, leaving engineers struggling to reinstate supplies to homes and businesses. On Saturday, the island off the coast of northern France lost its gas supply. The following …
Love the Pterry reference......
.... and I wanted to note that saying something has a "million to one" chance of occurring is meaningless without an event-related or time-related reference frame. For example, if software controlling a physical process is pinging some sensors and making some adjustments once every minute, it goes through a million cycles in less than two years, so a "million to one" chance of any one cycle screwing up really means near-certain failure (and one would expect industrial equipment to be in operation for decades rather than years). Without a reference frame, "million-to-one" seems to imply "over the operational lifetime", which is wildly optimistic.
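A quick back-of-the-envelope sketch of that arithmetic (assuming one cycle per minute and an independent one-in-a-million failure chance per cycle; both numbers are purely for illustration):

```python
# Cumulative failure probability for a once-per-minute control loop with
# a 1-in-1,000,000 chance of failing on any given cycle (assumed
# independent per cycle).

p_per_cycle = 1e-6
cycles_per_year = 60 * 24 * 365  # 525,600 one-minute cycles

for years in (2, 10, 20):
    cycles = cycles_per_year * years
    p_at_least_one = 1 - (1 - p_per_cycle) ** cycles
    print(f"{years:2d} years: {p_at_least_one:.0%} chance of at least one failure")

# Roughly: 2 years ~ 65%, 10 years ~ 99%, 20 years ~ 100%
```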
Directly related to the above (and since I've been re-reading Feynman's "What Do You Care What Other People Think?"), with reference to his investigation of the shuttle disaster, his observation was that, because of characteristics built into human psychology and business organisational structures*, every step up in management level brings an order-of-magnitude change in the believed probability of bad things happening. So if the official spokesperson from the C-level says it's a million to one, the VPs think it's 100k-to-1, the middle managers think it's 10k-to-1, the team leads think it's 1000-to-1 and the devs actually working on it *know* it's probably closer to 100-to-1.
* which means it's highly likely that, unless the organisation recognises it and takes specific steps to counter it, this issue will be present 'by default' in any company, government or organisation
Obviously incompetence has to be considered before (and despite) malice (if you left the door open and your company got burgled, then the blame lies jointly between you and the burglar)
But it is curious that this event happened so close to a second mysteriously exploding gas pipeline between Finland and Latvia this week.
I mean, who knows what "rogue code" is.
The only thing you can be sure of is that if the main system fails and the backup system fails in exactly the same way, it's not because the odds are "like winning the EuroMillions" lottery; it's the same software failing in exactly the same way on exactly the same input.
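A minimal illustration of that (a made-up parser standing in for the plant software): identical code fed identical input fails identically, so a second copy of the same build buys you no independence at all.

```python
# Hypothetical example: two "redundant" nodes running the *same* control
# code. A malformed reading crashes both, because the failure is
# deterministic, not a coin flip per node.

def parse_reading(raw: str) -> float:
    # Bug: assumes the sensor always sends "name=value".
    _, value = raw.split("=")
    return float(value)

bad_input = "pressure:4.2"  # ":" instead of "=" -- the same input hits both nodes

for node in ("primary", "standby"):
    try:
        parse_reading(bad_input)
    except ValueError as err:
        print(f"{node} failed identically: {err}")
```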
If someone told me their system had "rogue code" I'd expect it to be an Easter egg, a logic bomb or a supply-chain attack
This sounds like a bug and someone is worried about being sued
> “it's the same software failing in exactly the same way on exactly the same input.”
Agree; it is unusual for DR to be between systems from different suppliers, even in fail-safe environments. On British Railways, for example, it was only considered for Solid State Interlocking.
It is this same and immediate failure which lends weight to the “rogue” software being part of the original build and not third-party stealthware.
Exactly. No, it's not a 'million to one chance', it's actually an absolute certainty, because both systems are obviously vulnerable to the same issue.
Rogue code? Surely a misprint for 'rouge code' - see it's the Russians, Putin's fingerprints all over this!
But seriously, I've seen this sort of thing before: a virtualised server is replicated to an offline system which will fire up and cut in if there is a failure of the primary server. All good, unless a dodgy update or some malicious event happens to the primary server and this gets replicated to the secondary. So when the primary system falls flat on its face because of said dodgy update, the secondary fires up and promptly falls over as well, for the same reason.
Cue my being in a meeting with the C-suite guys, saying to them, 'You do remember that time when I told you this wasn't necessarily a good idea, that replication is not actually the same as backup... oh, you don't remember. Luckily for me I keep copies of all the emails I sent you warning of such a possibility, oh, and the read receipts which you failed to disable and which indicate that you had received and read said emails.'*
* Always cover yourself. And yes, a read receipt doesn't prove that the person has actually read and understood the implications of what is being said. However, legally (at least on the right side of the pond, and I'll assume the same on the left as we share a common legal framework) it's going to be a hell of a lot harder for them to claim they weren't aware.
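A minimal sketch of that replication-versus-backup point (hypothetical names throughout; assumes the standby receives a straight copy of the primary's state):

```python
# Why "replicate to a warm standby" is not a backup: a bad update
# applied to the primary is faithfully copied to the secondary, so both
# fall over; only the point-in-time snapshot survives.

import copy

class Server:
    def __init__(self, name: str, config: dict):
        self.name = name
        self.config = config

    def boot(self) -> None:
        if self.config.get("update") == "dodgy":
            raise RuntimeError(f"{self.name}: crashed on start")
        print(f"{self.name}: up and running")

primary = Server("primary", {"update": "good"})
snapshot = copy.deepcopy(primary.config)                         # last night's backup
primary.config["update"] = "dodgy"                               # bad update lands
secondary = Server("secondary", copy.deepcopy(primary.config))   # replication copies it

for server in (primary, secondary):
    try:
        server.boot()
    except RuntimeError as err:
        print(err)

restored = Server("restored-from-backup", snapshot)
restored.boot()  # the snapshot predates the bad update, so this comes up fine
```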
No. The code is not responsible here. It may be the direct cause, but the party really responsible is the idiot who bungled the specifications and did not do sufficient testing to ensure that the code would work in extreme situations.
Blaming the code is easy, but code is just the materialization of a list of scenarios. If you have a scenario that was not foreseen, then the code is unlikely to handle it.
Somebody didn't foresee this, or didn't put in the effort to check if the code was viable in that situation.
It's still not the code that is at fault.
> It's still not the code that is at fault.
I would agree, as it seems the failed-over system also did a fail-safe shutdown: “the plant turned itself off to protect the network”.
If the code was “rogue” I would not expect both systems to fail safe.
However, I wonder whether the first system in turning the plant off also turned off the power supply to the failover/DR system…
Teams I’ve worked in now routinely test far more for failure than success. Program a system to throw garbage input, test race conditions, 24/7. Then see what broke and fix it. As above, a simultaneous failure suggests the backups didn’t use different clean room code. This is therefore a rogue management problem, not code. Computers are just rule following idiots.
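For anyone curious, a minimal flavour of that kind of garbage-input testing (the handle_command target is made up; real teams would reach for a property-based testing library or a coverage-guided fuzzer):

```python
# Minimal garbage-input fuzz loop: hammer a handler with random junk
# and count how often it blows up instead of failing gracefully.

import random
import string

def handle_command(cmd: str) -> str:
    # Hypothetical target: expects "VALVE <id> <OPEN|CLOSE>".
    verb, valve_id, action = cmd.split()
    assert action in ("OPEN", "CLOSE"), f"unknown action {action!r}"
    return f"{verb} {valve_id} -> {action}"

random.seed(0)
failures = 0
for _ in range(10_000):
    garbage = "".join(random.choices(string.printable, k=random.randint(0, 40)))
    try:
        handle_command(garbage)
    except Exception:
        failures += 1

print(f"{failures} of 10000 random inputs broke the handler")
```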
Meaningless pseudo techno babble.
"All of them failed at exactly the same time because of the code"
I don't believe it!
“Rubis Énergie and Rubis Terminal continued to deploy their collaborative software for the preventive maintenance of facilities (computerized maintenance management system). Once the relevant information has been loaded into the database, these systems allow the planning of monitoring and preventive maintenance work.”
I doubt that would speed up recovery, it'd just make it funnier to watch
And give the techies a bit of leverage.. "You know, that pay rise of ours that you turned down while awarding yourself a 10% rise - you might want to re-think, because our low wages are very definitely causing low productivity. And good luck firing us and trying to find someone else to fix things.."
It takes a long time to repressurize all the tubes.
It's not a big truck.
All the gas appliances have a minimum safe operating pressure, they need to somehow get it back up to that pressure before each customer turns their gas supply back on.
They'll also need to get all the air out of the system that will have inevitably leaked in via all the gas appliances across the network, as otherwise... see icon.
> Re: Days or even weeks to turn it back on? .... It takes a long time to repressurize all the tubes. ....gas appliances have a minimum safe operating pressure, .... get all the air out of the system that will have inevitably leaked in via all the gas appliances across the network, as otherwise... see icon.
I dunno how they do it in Jersey, but in NEW Jersey most/all older gas burners had Pilot Lights, a small standing flame. When the gas goes to zero, the pilot goes out. When the gas comes back, all those pilot lights bleed UN-burned gas into their rooms: basement, kitchen, etc. We were always told if we smell gas to NEVER use a (wired!) telephone or an electric switch, or we go BOOM.
The expectation was that a gas system would "never" go dry. When gas came to town, all houses started with gas OFF. Then they'd go around, knock on the door, turn on the gas, then run to all the gas appliances to light the pilots. The few minutes this takes is not enough gas to explode.
What could go wrong?
My water heater is on a pilot. My new furnace has a hot-coil, no pilot. My gas fireplace goes both ways: in summer I can spark it with a battery; in winter I leave a pilot going because the heat is a benefit.
If you even have a button - lots of modern boilers operate entirely on sensors and don't have any way for you to control the gas at all.
Most of these rely on a pressure switch, though, to determine whether it's safe to attempt to light, since they no longer run a pilot flame (to save energy).
>” It takes a long time to repressurize all the tubes.
It's not a big truck.”
That is the approach, however, when dealing with water mains, as the big trucks/tankers permit the network to be filled from multiple high points before the pumps are turned on to pressurise it.
This approach also helps to minimise the residual trapped air.
How did the gas network suddenly find itself empty?
If there was no new gas going into it then as you say all the appliances would eventually turn off due to low pressure. The low pressure threshold would be some amount above atmospheric, so the pressure in the network would be low but not zero.
Eventually it might balance out at atmospheric pressure due to leakage, days or weeks later. Then fresh air might get in via percolation, but doesn't seem like this would be a massive effect.
This is a pretty catastrophic failure mode. And if, as you say, the consequence of trying to black-start it without all (or a majority) of your customers complying with instructions is a big fireball, then that's really hard to believe.
<Gallic shrug>
Useful reference: “Gas supply emergencies - consumer self isolation and restoration (SI&R) of gas supply”:
“The gas network cannot simply be switched back on since the order in which premises are restored must be coordinated to ensure that pressure in the network is maintained.”
It would seem that, as the Jersey network will have been off for more than 24 hours and impacts circa 4,000 customers, safety concerns dictate a controlled restoration of supply.
At the domestic level, if people haven’t isolated their home pipes from the mains, so they retain pressure, there is an increased risk of there being a combustible gas-air mix in the pipes, when mains supply is restored. So yes, turning gas back on does have an increased risk of explosion. When my local gas main was repaired recently, all houses on that segment had a first engineer visit to turn off their gas supply, and when the supply had been restored, a second engineer visited to check pressure and that all appliances (boiler, cooker, fire) were operational. Naturally the windows were left wide open for a few hours to permit vented gas to escape…
Aside: This sort of answer is going to challenge AI, as it requires a deductive leap to link your query to the relevant HSE document.
Japan's banks' clearing system fell over the other day with reversion to back-up systems for two days. I'm not sure it even made it to El Reg's news pages. I may have missed it.
https://japantoday.com/category/business/Japan-bank-payments-clearing-network-disruption-continues-for-2nd-day
There is a real need for both a way of rebooting stuff easily and quickly when it falls over, and a recognition that everyone needs a viable Plan B. That may be an Alt-System that can supply barebones functionality after malware locks you out of your main kit, or paper, phones and human beings. Lazily relying on Plan A needs to cost serious money in compensation payments. Only the financial pain from that will nudge some companies into having a reliable alternative.
The belief that a system will not fall over, and the inclusion of poor fail safes in the belief that they will never kick in, are amateurish.
Instead of the endless whining about privacy, we need to focus on resilience.
Corps have a really bad habit of assuming that if it's never happened, and a backup plan costs money, then it's not worth spending money on a maybe. Then, when it does happen, they won't spend the money because what are the chances it'll happen again?
Now make the C-suite personally liable...
Just to point out: on Airbus and the like, this is why, for some of the safety-critical redundant systems, if there are 2 or 3 of some sensor they'll be from 2 or 3 different vendors. It avoids the situation where, if failure is due to some design flaw, software flaw or manufacturing flaw, all of your redundant systems succumb simultaneously to the same flaw.
I don't realistically expect gas plants, power plants, etc. to get redundant systems from 2 different vendors, but anyway in some limited cases that's actually done.
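For flavour, a toy sketch of that diverse-redundancy idea (entirely hypothetical vendor functions; the point is that a systematic bug in one vendor's code is outvoted by the other two, whereas three copies of identical code would all agree on the wrong answer):

```python
# Toy 2-out-of-3 voter over readings from three *different* vendor drivers.

from statistics import median

def vendor_a_pressure() -> float:
    return 4.21   # bar

def vendor_b_pressure() -> float:
    return 4.19   # bar

def vendor_c_pressure() -> float:
    return 0.0    # hypothetical firmware bug: reads zero under load

def voted_pressure(readings: list[float], tolerance: float = 0.5) -> float:
    m = median(readings)
    agreeing = [r for r in readings if abs(r - m) <= tolerance]
    if len(agreeing) < 2:
        raise RuntimeError("no 2-of-3 agreement; fail safe")
    return sum(agreeing) / len(agreeing)

print(voted_pressure([vendor_a_pressure(), vendor_b_pressure(), vendor_c_pressure()]))
# -> about 4.2; the vendor C outlier is ignored
```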
Of course, maybe they really asked some programmer "How likely is this to happen?" "Million to one chance" (but the code runs like every 5 minutes, so it'd run a million times in just over 9 years). There have been a few kernel bugs (in the unstable versions) that would essentially have a million-to-one chance of triggering, but if it's in some driver that's pushing like 10,000 packets or screen draws or whatever a second, that means a kernel panic within a few minutes.
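The same arithmetic as the earlier sketch, just turned around: how long until a one-in-a-million-per-event bug has clocked up a million events, at the rates mentioned above (illustrative numbers only).

```python
# Time to accumulate a million events at different event rates (at
# which point a one-in-a-million-per-event bug is "expected" once).

rates_per_second = {
    "control loop, one run per 5 minutes": 1 / 300,
    "driver, 10,000 packets per second": 10_000,
}

for label, rate in rates_per_second.items():
    seconds = 1_000_000 / rate
    print(f"{label}: ~{seconds:,.0f} s (~{seconds / (365 * 86_400):.2f} years)")

# control loop: ~300,000,000 s (~9.5 years)
# driver:       ~100 s (under two minutes)
```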
what "the power had seized in the plant" means.
The plant lost its connection to the grid?
The grid itself was down?
The plant generates its own electricity from gas and the generator's bearings failed?
Anyhoo, I have been running simulations on a global network of supercomputers and there is a possibility of maintaining gas supply and pressure even when the computers fail. This radical new concept takes the form of a large gas-tight storage cylinder which can telescope vertically. Even when the myriad of Excel spreadsheets being used as databases, process control and mission-critical safety monitoring become unavailable, the weight of the gas storage unit's upper section bearing down on the volume of gas below will maintain gas pressure to the consumers until either the problem is resolved or an orderly shutdown can be initiated. Scale the volume of stored gas to the rate of consumption to give the amount of time needed to avoid loss of supply to consumers.
Although the luddites infesting this site will mock this system as unworkable as it fails to use either blockchain or AI, I think it is worth trying wild crazy ideas just in case they work.
You mean to rebuild all the gas cylinders everyone has been ripping out over the last 30 or so years.
They probably weren't seen as modern enough, as they worked on gravity. There used to be one outside where my grandparents lived, and depending on the time of year it was anything from 20 to 100 feet tall.
They were used because the system of generating Town Gas from coal had no excess capacity: supply was built up during hours of low demand, then used during hours of high demand.
The natural gas wells have higher-than-demand capacity, and are buffered by long runs of transmission pipes. Gasometers are no longer required*.
*(explain OP joke here)