Dry Run
It's just a dry run for what's going to happen when BT turns off the copper phone network and everything goes VoIP.
Britain's communications watchdog is investigating former state telco BT over a "UK-wide disruption" that prevented some calls connecting with emergency services on 25 June. The technical glitches at the telecoms operator first showed up on Sunday morning at 0830 local time, and the dominant telco was forced to move to a …
I just renewed my fibre to the cabinet with Vodafone, for two years I think (BT having recently confirmed that anything better isn't coming anywhere near here, despite a flood of offers from various firms over the last two years for fibre to the house "as soon as available, please register your interest and we will call back asap"). It has been OK, just about, at a rather expensive £24 a month. I could possibly have got it a pound or two cheaper from another ISP, but it wasn't worth the hassle since nothing would change.
No new router, but they posted me a gizmo to plug into the back of the old one instead of plugging the landline into the wall, thus enabling VoIP instead of copper.
I don't think it actually works, but didn't care as I have no interest in landlines.
Rather similar to what happened round here. The Parish had been told to collect pledges of funding (~£1k for each household) to co-fund fibre rollout. Then one day I saw a guy climbing the poles at the foot of the garden and was told that we would all be FTTP in a few weeks. No explanation, no further information, just BT planning at its best. Still it works just fine so I'm grateful for that.
My Sky went "out of contract" so went up in price as the "in contract" discounts stopped.
Their web renewal was broken, so I spoke to a very helpful (!!!**) person on the phone who not only got me everything back for less than before, but also sorted FTTP; Openreach were here yesterday fitting the external box.
** and yes she really was very helpful....still in a state of shock!
"Our rules require BT and other providers to take all the necessary measures to ensure uninterrupted access to emergency organizations as part of anti call services offered. They also require providers to take all necessary measures to ensure the fullest possible availability of calls and internet in the event of catastrophic network breakdown or in cases of force majeure."
This is the typical kind of bullshit spouted by people who have zero understanding of how anything works on a technical level. People who think because something is "regulated" that means everything will be ok.
There seems to come a point, very quickly, where systems can (will, and do) fall over. The fact that their application is something serious like emergency services or banking doesn't make them exempt from this.
Whether it comes down to incompetence, mistakes or mismanagement (and let's be honest, if it involves BT that's entirely possible) is a separate matter. But I cannot stand this naivety that critical infrastructure is somehow 100% fault tolerant. It isn't, and no amount of policies, regulations or highly paid consultants are going to change that. Ever.
It's quite scary how much masking tape and spreadsheets keep infrastructure running in the first place. Add in some human error and nothing is guaranteed.
There were a handful of major outages in the 18 months I worked on System X in the 1980s that resulted in loss of 999 calls. Some were hardware failures.
One was a configuration error related to overload management, where 999 calls were impacted by a general overload. A bug report was raised related to the "New Faces" television programme. The overload caused by telephone voting for the programme knocked out 999 calls.
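To make that failure mode concrete, here is a minimal sketch (Python, and in no way the actual System X logic; every name and threshold is made up for illustration) of what priority-aware overload control is supposed to do: shed ordinary call attempts when the attempt rate blows past a limit, but never shed emergency traffic. Lose that exemption, as the misconfiguration effectively did, and 999 attempts get rejected along with the vote calls.

```python
# Hypothetical sketch of priority-aware overload control; not System X code.
EMERGENCY_NUMBERS = {"999", "112"}  # illustrative set of protected dialled numbers

class OverloadController:
    def __init__(self, max_attempts_per_second: int):
        self.max_attempts_per_second = max_attempts_per_second
        self.attempts_this_second = 0

    def tick(self) -> None:
        """Reset the admission counter; call once per second."""
        self.attempts_this_second = 0

    def admit(self, dialled_number: str) -> bool:
        """Decide whether to admit a new call attempt."""
        if dialled_number in EMERGENCY_NUMBERS:
            return True  # emergency traffic is exempt from shedding
        if self.attempts_this_second >= self.max_attempts_per_second:
            return False  # shed ordinary traffic under overload
        self.attempts_this_second += 1
        return True

controller = OverloadController(max_attempts_per_second=2)
print([controller.admit(n) for n in ("0123 456789", "0898 000000", "0898 000001", "999")])
# -> [True, True, False, True]: the third ordinary call is shed, the 999 call is not
```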
Certainly, "forcing regulation" is a bunch of political BS. But this isn't Microsoft, so serious reliability engineering can, assuming this was a software problem, keep the chance of something like this vanishingly low. It is NOT easy, however, and without a serious look under the hood, I could not venture to guess as to what exactly they did wrong.
We had a global distributed key store at Google when I was there (2015-6) that only went down when the SREs took it down--one minute a quarter, so that people would not build apps that depended on it always being up.
I consider six 9's to be theoretical as an SLA unless you are in something like a single-site manufacturing facility, but if you engineer for it, five 9's is quite doable by a competent team.
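For anyone who wants the arithmetic behind the nines, a quick back-of-the-envelope calculation (nothing vendor-specific, just the availability maths) shows why six 9's is essentially theoretical and why one deliberate minute of downtime per quarter still lands just above five 9's:

```python
# Back-of-the-envelope downtime budgets per availability target.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines, availability in [(3, 0.999), (4, 0.9999), (5, 0.99999), (6, 0.999999)]:
    allowed = MINUTES_PER_YEAR * (1 - availability)
    print(f"{nines} nines: about {allowed:.1f} minutes of downtime per year")

planned_minutes = 4  # one deliberate minute per quarter, as in the anecdote above
print(f"1 min/quarter planned: {1 - planned_minutes / MINUTES_PER_YEAR:.6%} availability")
```

Five 9's gives you a little over five minutes a year to play with; six 9's gives you barely half a minute, which is why treating it as a contractual SLA outside something like a single-site facility is optimistic.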
Re: "This is the typical kind of bullshit spouted by people who have zero understanding of how anything works on a technical level. People who think because something is "regulated" that means everything will be ok."
Regulation enables them to blame someone. All that will happen is that at the end of the inevitable inquiry, they'll announce "Lessons must be learned" and do nothing else.
If it turns out someone has died, or been seriously injured as a result of this, the same will happen.
People make most of their calls on mobiles.
I can't imagine many 999 calls from VoIP at home. The use of AML in mobiles to provide location information in 999 calls will probably be a better option when landline numbers are not tied to a physical location.
And from my experience of switching to FTTP on JohnLewis/Plusnet it is very difficult to keep a landline number and transition to VoIP. Their support staff did not even know how it could be done.
On Zen everything went incredibly smoothly: an email a couple of weeks before saying to plug the phone into the router instead of the wall at a certain time and it would work, and it did.
Keeping my number was simple, as was adding a DECT phone, which I had lying around in the box of tech crap that I might get round to using when I can be bothered to fiddle with it.
Mind you, the modem they send is pretty decent, with mesh networking available (I rolled my own, which worked with the router, and saved a few quid a month).
Unfortunately, it doesn't have an OpenWRT image available for it, which is the only downside. But it's fast, gets updates frequently enough, and the USB 3 port works as my home NAS. So there is that.
I love the comment from the government reported by the BBC - "The government has said it took BT nearly three hours to alert ministers to the problems it was experiencing."
which seems like a typical non-technical failure to grasp the idea of priorities. If the issue started at 08:30 on Sunday I'd take a wild guess that the only relevant BT staff working (either actively or on call) would be the engineers, and they'd be busy trying to fix the problem. They're engineers, so they're sure as hell not gonna be calling Whitehall, who likely can't contribute anything practical to fixing the issue; they'll just be updating their immediate line managers. I imagine you'd have to go up a few levels of management (none of whom would be actively working on a Sunday morning, so possibly not immediately contactable) before you reach someone with the authority to speak directly to the government on behalf of BT. They don't indicate whether any status pages or similar were updated with information, but that's presumably more likely to happen in the short term than finding someone willing and senior enough to place the call, especially when initially they'd likely have zero information to pass on anyway.
Standard "I can speak to you about the issue, or I can fix the issue... I can't do both".
That's a reasonable attitude if your tech team consists of you and your manager. But this is BT, and we are talking about EMS.
For even slightly mature incident management, external communications is generally the first responsibility that gets spun off by the incident commander, usually to the nearest manager. It keeps them out of your hair and lets them be the reassuring face to the clients while the workers are trying to figure out exactly what phase the moon is right now.
For something like EMS, weekend staffing also shouldn't matter much. Given the scope of the problem, the pagers should have been hitting everyone within half an hour, including the first line manager, who, as mentioned above, is going to be responsible for communication. That includes understanding what the regulatory requirements are for communication, and calling up whomever as necessary.
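As a sketch only (the role names are hypothetical, not BT's actual escalation policy), the paging logic being described amounts to something like this: the on-call engineers always get paged, and anything nationwide or safety-of-life also pages the duty manager and whoever owns regulatory notification, so the comms track starts at the same time as the fix.

```python
# Hypothetical escalation sketch; role names are illustrative, not BT's process.
from dataclasses import dataclass

@dataclass
class Incident:
    service: str
    scope: str            # "local", "regional" or "national"
    safety_of_life: bool

def page_targets(incident: Incident) -> list[str]:
    """Who gets paged at incident open, before anyone knows the cause."""
    targets = ["oncall-engineer"]
    if incident.scope == "national" or incident.safety_of_life:
        targets += [
            "first-line-manager",       # takes external comms off the engineers
            "duty-incident-commander",
            "regulatory-liaison",       # owns regulator / government notification
        ]
    return targets

print(page_targets(Incident(service="999-call-routing", scope="national", safety_of_life=True)))
```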
"The government has said it took BT nearly three hours to alert ministers to the problems it was experiencing."
I would imagine they were all a little <cough> hungover, partying at Glasto, aside from Sir Moggie, who would have been marshalling his own flock in preparation for the first service of the day. Seriously, I would have expected Ofcom and the Cabinet Office to be informed promptly, as per their checklist, and for Ofcom/CO to alert ministers.
"the only relevant BT staff working"... would be the network operation centre(s), who would be seeing any 999 alarms at the top of the priority list. I'm guessing there were no alarms until someone/something noticed that the call traffic to the emergency services had dropped to zero (assuming that they flag an abnormal levels of failed and prematurely terminated calls as an alarm)
"the only relevant BT staff working"... would be the network operation centre(s), who would be seeing any 999 alarms at the top of the priority list. I'm guessing there were no alarms until someone/something noticed that the call traffic to the emergency services had dropped to zero (assuming that they flag an abnormal levels of failed and prematurely terminated calls as an alarm)
How many Ministers do we have now, and how many would really need to be notified? Does the Min of Ag & Fish, or "Levelling Up", need to be notified? Or just the Home Office and Cabinet Office? And why BT anyway, given they're a service provider, and the service users, i.e. Police, Fire, Ambulance and Coastguard, would hopefully have been aware of the outage pretty quickly. And then there's the question of what Ministers could actually do to help. Get the MoD to issue distress flares? Or, worse, seize the opportunity to 'modernise' the service.
Having done some work on this in the past, I know BT and every decent telco take 999 services seriously, as do the customers, because it's a safety-of-life issue. But it's been complicated a few times by politics, like the infamous FiReConTrol* plan to shrink the number of fire & rescue ECCs down to five and outsource a lot of it. That was fun because I refused to design a cheap solution based on ADSL to the fire stations. And there was a memorable quote from a senior fire officer telling me that nobody wanted this other than Prescott and the PFI bidders. And then there were the E.911 conversations where VoIP operators would chuck the problem of accurately locating callers over the fence to BT and expect them to figure it out.
I suspect this will come down to an IP issue, and the important questions are really why it failed and how long it was down for, not how long it took to notify ministers.
*I may have capitalised this incorrectly, but it was capitalised weirdly. But it was modern!
Failure to notify is pretty serious, because it proves there were no managers paying attention.
It also implies that BT themselves didn't know for rather a long time.
Disaster response for this kind of thing is supposed to be two-track - the management tell all the users and internal support staff that the system is down and they are using the backup which has known limitations x, y and z, while the technical folk get to work fixing it.
If you don't have that first track, two really nasty things happen:
1) The system users don't know it's failed, only that "it's been pretty quiet for a while", and/or they end up despatching a few ambulances to completely the wrong place.
2) When the users discover or start to suspect that it's failed, they start hammering in reports and complaints, distracting the technical folk and potentially even pulling them away from fixing it.
This is also why service status pages are necessary.
(This actually happened somewhere in the UK.)
A pensioner called the nearby police station to report a break-in. The police replied with, "We're busy at the moment. Can you please call back a few hours from now?".
A few minutes later, he rang the police again and said, "Don't worry (about the call). I have taken care of them." and hung up the phone.
Within minutes, with sirens blaring, several vehicles of police in bullet-proof vests and full riot gear came screaming round to the pensioner's property and arrested the burglars.
"You said you took care of them," spluttered the police.
"And you said you were busy," replied the pensioner.
I know you've gone all American recently, but this was a pretty big story.
As someone in the know about these things I value the commentard's opinions very much.
This kind of thing is going to be increasingly common. Wouldn't have happened back in the heyday of Nortel.
Anonymous, for reasons above.
Nice to know. Too bad they didn't offer an option to immediately put you through to someone who could fix your complaint. However that would mean the designers of these call centres got off their arse and did some proper work instead of finding new ways to annoy and frustrate their victims. Which simply cannot be allowed to happen of course.
FFS! Sacking whoever is responsible for this epic fuckup is reasonable. Though nobody here has actually suggested that. What sort of punishment would you consider appropriate for a massive outage of a nationwide life-or-death emergency service?
It doesn't matter if there hadn't been a failure in 66 years either. The service did fail. Bigly. Heads must roll. But they won't - as usual.
Kingston upon Hull is the one area in the UK that isn't BT/Openreach, yes; KCOM are the incumbent provider there. However, KCOM use BT for 999 services.
KCOM connect to BT in Hull so that BT can provide the 999 service for them. They got a large fine back in 2017 for a four-hour outage in Hull in 2015: it turned out that all the routes KCOM had into BT relied on a single BT exchange next to the river in York, which got flooded. KCOM rather than BT got the fine, for not having the appropriate resilience in place. It will be interesting to see what Ofcom does this time, when it was a much larger outage.
Details of KCOM fine and issue: https://www.bbc.co.uk/news/uk-england-humber-40860169
As others have said, I can't remember an outage like this before. South Yorkshire Police failing to answer the call when put through by the BT operator, yes, but issues on the BT side, no.