Bad news. The fog's getting thicker.
And Leon is getting laaaaarrrrrger.
It's a bit of a cliche that "everything's connected", but O2's stunning outage yesterday – chalked up by Swedish kitmaker Ericsson to an expired software certificate – is a reminder of how true that is. Payment terminals croaked, bus displays went blank. Strangers blinked at each other in the street, like Robinson Crusoe …
FFS (For F£$k Sake) expand your acronyms the first time you use them!
I've got better things to do on a Friday mid-morning than work out whether M2M means made to measure, machine-to-machine, or some defunct Norwegian pop duo!
Well, slightly better, I mean - reading the Reg...
MVP: what counts as a normal techy solution these days. Deliver the absolute minimum, promise the earth and walk away, safe in the knowledge that unless the customer is really, really big there is sod all anyone can do about it.
And even if you are really big, there is still probably sod all you can do about it.
Came up on the first Google search so it must be right.
Acronym = Letters that form words
Abbreviation = Shortened word E.G. St, Dr etc
Initialism = First letter of each word and enunciated E.G. VIP
If I'm wrong blame Google, it's not that I'm lazy... honest!
Do not blame Ericsson here.
UK telco operations have a well established and entrenched fear of certificates for anything.
Once upon a time, before I went back to writing software, I still did network architecture including security aspects. So while working in a major UK telco I proposed the idea of certificates everywhere for purposes of inventory, identification and security of provisioning. I was freshly out of a vendor where I did most of the design and implementation of an X.509 retrofit into everything, and the certs became the foundation of how the system fits together. So I was expecting some questions or a technical discussion.
I got none.
The faces around the table looked like they were a still frame from The Shining. They looked at the idea like I was serving a disemboweled body with maggots and suggesting they eat it. They were horrified at the idea despite having less than 60% accurate inventory and a long standing requirement to secure key aspects of the network management.
This fear has its roots in incidents like the one at O2. It is also the root cause of incidents like O2's.
UK telcos (and most telcos in general) fail to understand the most basic principle of using X509 for infrastructure purposes.
It is: YOU RUN YOUR OWN CA. No vendor roots. The root is yours. And so are ALL certs.
Because they do not understand it and fear it, they either use vendor certs (which expire at the most unfortunate moment) or outsource it to an external CA which defeats the purpose of the exercise as you are no longer in control of your network. Either one of these results in an incident like O2 which in turn results in more fear, more vendor use and more outsourcing.
Ad nauseam, rinse, repeat.
Oh, and by the way, no lessons will be learned from this incident - O2 will NOT start running its own CA as it should.
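(For what it's worth, standing up a private root really is not much work. Here's a minimal sketch using Python's cryptography library - the organisation names, key size and ten-year lifetime are all illustrative assumptions, not anyone's actual setup.)

```python
# Minimal sketch: generate a private root CA key and a self-signed root certificate.
# Illustrative only - names, lifetimes and key sizes are assumptions, not policy.
import datetime
from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa

# The root key should live offline, ideally in an HSM; a file is used here for brevity.
root_key = rsa.generate_private_key(public_exponent=65537, key_size=4096)

name = x509.Name([
    x509.NameAttribute(NameOID.ORGANIZATION_NAME, u"Example Telco Ltd"),
    x509.NameAttribute(NameOID.COMMON_NAME, u"Example Telco Internal Root CA"),
])

now = datetime.datetime.utcnow()
root_cert = (
    x509.CertificateBuilder()
    .subject_name(name)
    .issuer_name(name)                       # self-signed: issuer == subject
    .public_key(root_key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(now)
    .not_valid_after(now + datetime.timedelta(days=3650))   # ~10 years, illustrative
    .add_extension(x509.BasicConstraints(ca=True, path_length=None), critical=True)
    .sign(root_key, hashes.SHA256())
)

with open("root_ca.pem", "wb") as f:
    f.write(root_cert.public_bytes(serialization.Encoding.PEM))
with open("root_ca.key", "wb") as f:
    f.write(root_key.private_bytes(
        serialization.Encoding.PEM,
        serialization.PrivateFormat.TraditionalOpenSSL,
        serialization.NoEncryption(),        # encrypt the key in real life
    ))
```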
In what electronic diary? Notifying whom?
Do you know how many certificates large enterprises have to manage now? It would be a full time job for someone - but if you made it that, you'd be screwed when they went on vacation or quit and the reminder from their electronic diary went to /dev/null.
The whole system around certificates is irretrievably broken if you require humans to be in the middle of it. It has to be automated - a subscription service that automatically updates. We will never see the end of such issues so long as humans have to be "reminded", because we are fallible. If the certificate for some weird page hardly anyone visits expires, it might be weeks before the company is notified. If the certificate required for mobile data to work at a large provider expires, it could do a lot of damage in the hours required for the problem to be diagnosed and corrected.
"The whole system around certificates is irretrievably broken if you require humans to be in the middle of it. It has to be automated - a subscription service that automatically updates."
Suggest you dust down the risk assessments from the mid-1990s for Single Sign-On solutions - these worked well whilst everything worked; break something and everything fell into a rather big heap, from which it was easier to reset and start again than to try to recover...
The obvious issue with subscription services is ensuring the bank account(s) from which monies are automatically taken always have sufficient funds (or haven't been closed) and if there is a hiccup in payment processing things get escalated so that action can be taken before certificates expire...
True, payment processing can be a problem, but no more of a problem than it is for manual payment. Ideally it would be done with a yearly subscription for all your certificates in a lump sum, or paid in monthly installments, rather than dribbling out a small payment each time a certificate is renewed. The accounting department would HATE YOU if you managed 3000 certificates and each was a separate charge for yearly renewal!
Automated renewal also makes it practical to have certificates that last only a month, making the cumbersome process of revoking them if compromised less of a factor.
If your organisation relied on certificates and you were using more than a handful, I suggest you would be well advised to set up your own PKI, it isn't all that difficult. That would reduce your 3000 certificate (subscriptions) to one root certificate.
It also makes it practical to have as you suggest short lived certificates as they would be wholly managed within your own infrastructure.
BTW, if your Accounts department can't handle 3000 certificate renewals a year then there is something wrong with it - it's not that difficult in many accounts/financial systems to set up a bank account and ledger for recurring IT expenditure/subscriptions. But I expect the problem is that in many companies IT doesn't talk finance to Finance, and so things never get neatly structured.
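(On the short-lived certificate point: once you run your own root, issuing a 30-day leaf cert is the same builder pattern, just signed by the root instead of self-signed. A sketch continuing from the root CA example above; the hostname, key size and 30-day lifetime are illustrative.)

```python
# Sketch: issue a short-lived (30-day) server certificate signed by the private root.
# Continues from the root CA sketch above; all names and lifetimes are illustrative.
import datetime
from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa

def issue_leaf(root_cert, root_key, hostname: str, days: int = 30):
    leaf_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    now = datetime.datetime.utcnow()
    cert = (
        x509.CertificateBuilder()
        .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, hostname)]))
        .issuer_name(root_cert.subject)       # issued by our own root, not a vendor's
        .public_key(leaf_key.public_key())
        .serial_number(x509.random_serial_number())
        .not_valid_before(now)
        .not_valid_after(now + datetime.timedelta(days=days))
        .add_extension(x509.SubjectAlternativeName([x509.DNSName(hostname)]),
                       critical=False)
        .add_extension(x509.BasicConstraints(ca=False, path_length=None), critical=True)
        .sign(root_key, hashes.SHA256())
    )
    return leaf_key, cert
```

Because the whole thing lives inside your own infrastructure, re-running something like this from a scheduled job is what makes monthly certificates practical in the first place.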
And that still requires a manual process to ensure EVERY certificate finds its way into that electronic monitoring system. This is better than a manual process around every renewal, since you only need to do it once for a certificate and then you are good for as long as that particular certificate-requiring function remains exactly the same.
Better, but not good enough.
Cheap, almost free, open source monitoring software can keep an eye on certificates and give you prior warning that the expiry date in one is approaching. You can choose how much warning you want and it will display it on a dashboard in red, send you an email or automatically open an ITIL-compliant helpdesk ticket for you, with P1 urgency if you want.
Even the most shoddy IT shops I've dealt with have this sorted. It's really simple stuff.
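(It really is simple stuff. A bare-bones expiry check of the sort any monitoring tool wraps in a dashboard is only a few lines of Python - the host list and 30-day threshold below are placeholders.)

```python
# Bare-bones certificate expiry check: connect, read notAfter, warn if it's close.
# The hosts and the 30-day warning threshold are placeholders.
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(host: str, port: int = 443) -> float:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter looks like 'Jun  1 12:00:00 2025 GMT'
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(cert["notAfter"]),
                                     tz=timezone.utc)
    return (expires - datetime.now(timezone.utc)).total_seconds() / 86400

if __name__ == "__main__":
    for host in ["www.example.com", "api.example.com"]:   # placeholder inventory
        days = days_until_expiry(host)
        status = "OK" if days > 30 else "RENEW NOW"
        print(f"{host}: {days:.0f} days left [{status}]")
```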
werdsmith, you're missing a vital point: you're assuming O2 (the company) actually give a fook (shareholders will if the share price slides for longer than 24 hours).
Give it a week and nobody will even remember they had an outage, once they can upload fish face pictures to instatwat or pictures of their lunch to twatbook
I can see what you're getting at. The certificate system has a different purpose for this situation. It isn't about somebody such as me, downloading software from a myriad of possible suppliers, possibly via intermediaries, where the certificate is about blocking access to possible malware, now with such things as HTTPS. Secure delivery still needs attention, but once a genuine copy of the software is delivered and authorised for use, the supplier's action (or inaction) shouldn't be able to stop it working.
Yeah, I suppose contracts can set up something like software rental, and that's nothing new. But if you shut down your customer I am sure the lawyers would be interested in the procedures you followed.
"But a look in their forums shows tons of people just screaming at them, who didn't even bother reading the news."
How were they supposed to read the news when their phone data connection was down? You don't honestly think they would have something old fashioned like a landline based connection or a radio or even a TV, do you? No, of course not. The world had just ended!
"How were they supposed to read the news when their phone data connection was down?"
How were they able to post in forums if they had no data connection...
I suggest that those able to access forums weren't those truly impacted by this outage, whose smartphones would have been reduced to a games console for Snake and Tetris (aside: showing my age here).
Giff, Gaff, you mean Telefonica aka O2?
Maybe it's just me, but their adverts really get on my goat, more so than any other telco's ads (which are bad). Every ad they spout, all I can hear in my head is: Liar Liar, bums on fire, you're Telefonica in disguise, you charlatan!
Replace Giff Gaff with Tesco, Sky and Lyca... it fits!
If an IT organisation has to manage something that can expire and must be renewed, then it follows that it shall, at some point, expire without having been renewed - at the worst possible moment.
Certificates always expire at the time when a) the responsible IT bod is on annual leave, or b) there has been a change in management/HR/re-organisation such that no-one is sure who is responsible for the certificates or who can approve paying for their renewal, or c) just after a major IT upgrade, so everyone thinks that the failure is due to the new equipment. Other options are also available...
The difference is that buses failed safe - the network connections failed, but the buses still ran.
As I heard it, the Ericsson software was just used for billing usage. But because O2 couldn't track customers' usage, they denied them access completely.
I think O2 should have to credit every account with 20p, even if the customer didn't complain (or get through to complain). Costly enough to impact execs bonuses, but cheaper to implement than handling 32m complaints, so even then they get off lightly. And if I have to waste £10-worth of my time to get a 20p credit, that just adds insult to injury - I'd be asking for the £10 rather than 20p
The buses ran... but customers on O2 networks couldn't pay their bus fares (in London at least), nor did their shiny work on the Tube, and if you delay more than 100ms at the gateline in London, then you are to be crucified against the TfL roundel by your former fellow commuters that you have held up. But that's OK because all the little iPads they gave to Tube staff when they eliminated ticket offices a few years ago (the ticket offices where the machines had ludicrously outdated bits of electric string) run off O2 (except they fall back to the WiFi which was mostly Virgin, so the effect was minimal).
"The difference is that buses failed safe - the network connections failed, but the buses still ran."
There was a comms failure on our local metro system the other week. Complete shutdown of the system resulted, despite the fact that there are far fewer vehicles involved, no vehicles other than authorised ones with trained operators, and very few junctions - but no, to be safe, it all has to stop. Could you imagine the reaction if roads were closed because traffic lights failed?
Admittedly, there are stretches of single-line operation and even sections where the light rail shares track with main line trains, so I suppose those sections might be more dangerous to operate without comms or signalling.
"Admittedly, there are stretches of single-line operation and even sections where the light rail shares track with main line trains, so I suppose those sections might be more dangerous to operate without comms or signalling."
Suggest you read up on the early railways and why signalling systems were developed...
Travel on the trains and when one carrier is having issues, its tickets are valid on the others for a short duration. Maybe if one mobile carrier is having a 'Really Bad Day'[tm] the others could let their customers on theirs.
You never know, it might be, like, good for everyone... so it will never happen.
I assume this is something to do with controlling autonomous vehicles.
If it is, then it's worrying. An autonomous vehicle must be able to work without a network connection! For emergencies and for areas without 5G. All it needs is to know what is around it - it doesn't need the latest news on traffic problems 300 miles away. It should be able to rely on its own sensors, and, possibly, short-range comms to chat to nearby vehicles. That's it. Updates can wait until it's next connected, like phones.
We've been told that the low-latency modes of 5G are required for V2X (vehicle-to-everything)
V2X is going to be necessary for smooth traffic flow -- negotiating permission with oncoming traffic to make a left turn (for those of us who drive on the right)/right turn (for those who drive on the wrong) for example. And it's probably how the folks that are repairing yonder bridge are going to tell your car that that area that looks like a hole in the pavement is in fact a hole in the pavement. It's not clear that it needs a lot of bandwidth or especially high speeds. But it probably does need latencies never more than a few hundred ms. And of course it needs standards that are unambiguous and are actually adhered to.
Vehicles will need (or at least want) to communicate with one another, yes. But there's absolutely no reason they need to communicate via a cell tower. They will be in close proximity to one another and can communicate directly, there's no need to go to/from a cell tower which will often be further away than the cars that need to talk to each other.
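(Purely to illustrate the "no tower in the loop" point: real V2V uses dedicated radio standards such as DSRC or C-V2X sidelink rather than IP over Wi-Fi, but the shape of direct talk-to-whoever-is-nearby messaging is just a local broadcast, as in this toy Python sketch. The port number and message fields are made up.)

```python
# Toy sketch of tower-less, direct "talk to whoever is in range" messaging.
# Real V2V uses dedicated radio (DSRC / C-V2X sidelink), not UDP over Wi-Fi;
# this only illustrates the broadcast pattern. Port and payload are invented.
import json
import socket
import time

PORT = 47999  # arbitrary placeholder

def broadcast_status(vehicle_id: str, speed_mps: float, heading_deg: float) -> None:
    """Shout our position/speed to anything listening nearby - no tower involved."""
    msg = json.dumps({"id": vehicle_id, "speed": speed_mps,
                      "heading": heading_deg, "ts": time.time()}).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(msg, ("255.255.255.255", PORT))

def listen_for_neighbours(timeout_s: float = 1.0):
    """Yield (sender address, decoded message) pairs heard within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind(("", PORT))
        s.settimeout(timeout_s)
        try:
            while True:
                data, addr = s.recvfrom(4096)
                yield addr[0], json.loads(data)
        except socket.timeout:
            return
```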
As long as autonomous cars have to share the road with human driven vehicles they will need to be able to operate without any V2V communication though. They can't trust humans to always signal a turn etc. so they will still need to drive defensively and not fully trust the info they get from other vehicles.
The exception to that trust would be for things like drafting bumper to bumper in the left lane, obviously you'd need to trust that the cars ahead will act appropriately and the lead car will alert the rest of a hazard that will require braking or steering. So sorry, no user modifiable software allowed!
Because certificates typically expire after 2-3 years - beancounters and bosses cannot see that far ahead (except when pulling "strategies" out of various orifices).
Even the IT monkeys doing the renewals have moved to new offices at least three times, so that two-year-old calendar with the post-it notes? No one remembers what it was for, so it goes in the bin.
Only tangentially related, but it reminded me of a policy at my last place of work that I managed to change.
If it was required to run a one off job on a machine overnight (yeah... no "at" batch command) then it was recommended that you put the job in cron, scheduled to run the next day, on that day-of-month, on that month.... so that your job wouldn't be run the next day too if you didn't remove the cron entry in time.
Yes, you've got it - there were a number of times where some system would "randomly" cock up, and be traced to some date specific cron job that no-one remembers anything about, and which is presumably at least a year old.
Errr.
Because the bean counters were the people responsible for outsourcing the IT department to a provider incapable of managing (or unaware of) things such as this.
I can almost guarantee you it's the bean counters' prior actions in chasing the cheapest IT solution, in order to line the pockets of those at the top, that have led to this mess.
"If the beancounters can get something done by a certain date, why can't the IT monkeys?"
One of the things that the beancounters get done by a certain date is to outsource the IT monkeys who had their calendars sorted. And when the IT monkeys get outsourced, are they really going to tell the beancounters "by the way, you need to keep an eye on this"? At some point the beancounters get to discover that the IT people they outsourced weren't monkeys, but there's a distinct possibility the outsourcers were - or maybe they were snake-oil salesmen.
James Burke's Connections explained a failover system of the electrical grid in America. One relay tripped because a street was overloaded and it passed the current over to other circuits in a domino effect until the whole state was offline.
If the load is too much for one then it will be too much when added to the next one that's still working.
These sorts of decisions are often made by marketing-type teams, where the brand identity is worth 10x more than the damage from downtime. The decision to allow customers onto a competitor's network while theirs is broken? No way! Get ours fixed!!
20 or so years ago, when I was a tech-support rep at Orange, a frequent issue was SMS jamming. An easy fix was browsing for another network in the phone settings, attempting to join it (which would fail), then just joining the Orange network again. Within a minute or so, the "stuck" SMS would start coming in. Marketing or some similar dept caught wind of the advice being given out - and said it was to stop.
No matter that it worked 99% of the time, no matter that there was no other fix available, no matter that the customer was inconvenienced by it not working... The sheer fright that another network's name would come up on the customer's screen? Unthinkable!
The fix is not "failing to join the other network", it is more correctly "disconnecting and rejoining."
Similar connectivity faults exist even today and can often be cured by temporarily going into airplane/flight mode then back to normal mode. Or even by switching off and on again, but that's not recommended as boot times are getting ever longer because of all the crap with which we fill up our phones.
...was a concept that went out of fashion over a decade ago.
The mere idea of software actually checking the status of its connection and then retrying, rechecking, disconnecting (cleanly!) and reconnecting before trying again has been deemed ancient cruft - programmers have become too used to reliable always-on connections and never experienced firewall timeouts or line noise causing a modem to hang up.
$Deity, I feel old.
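(For anyone who never had to write it, the "ancient cruft" is roughly: try, catch the transient failure, disconnect cleanly, back off, reconnect, retry. A generic Python sketch - the connect and send callables stand in for whatever transport you actually use.)

```python
# Generic retry-with-reconnect sketch of the connection handling described above.
# `connect` and `send` are stand-ins for a real transport; nothing here is any
# particular library's API.
import time

class TransientError(Exception):
    """Raised by `send` when the link looks broken but may recover."""

def send_with_reconnect(connect, send, payload, attempts=5, base_delay=0.5):
    conn = connect()
    try:
        for attempt in range(attempts):
            try:
                return send(conn, payload)           # happy path
            except TransientError:
                # Disconnect cleanly, back off, then reconnect and retry.
                try:
                    conn.close()
                except Exception:
                    pass                              # already dead; ignore
                time.sleep(base_delay * (2 ** attempt))
                conn = connect()
        raise TransientError(f"gave up after {attempts} attempts")
    finally:
        try:
            conn.close()
        except Exception:
            pass
```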
"Or even by switching off and on again, but that's not recommended as boot times are getting ever longer because of all the crap with which we fill up our phones."
One of the things I've noticed about my current smartphone is it boots quicker than the one it replaced (both were mid-high end compact models), and probably about as fast as the feature phone I had before that. Brands omitted in case anyone thinks the data point is just shilling...
...although the 3310 was obviously quicker than any of them ;)
"The fix is not "failing to join the other network", it is more correctly "disconnecting and rejoining.""
Might be being a little over-fussy there, Joe. The post accurately described the steps given to customers... I think everyone realises what those steps achieved (the disconnect/reconnect)!
And the image it has painted in my head of Cameron flying into a red faced rage, because his magic smartphone kept failing in his artisan yoghurt eating Cotswold smugster's paradise has put a smile on my face for a few hours at least.
I've been saying this for over 30 years.
Even most users' computer infections rely on the user's lack of computer expertise (not disabling Autorun or unwanted services, adding toolbars, not disabling remote content in the email viewer, clicking on OK boxes without reading them, opening unexpected documents to see what they are, not hovering to check links, etc etc).
Most really bad IT disasters I've seen have been human error. Even HW failures where everything was lost are human error, in the sense of not having a backup, RAID or cluster depending on the importance of the system. Once there was a server moved while running. Two reasons everything was lost: 1) the HDDs only had one or two screws; 2) you don't move stuff that's not portable while it's running. It's not even a good idea to move a laptop with a regular HDD while running; dropping it is more likely to be fatal to the HDD than when it's off or asleep.
I even wrote a book about an "apocalypse" caused by human error. Faulty patches to BGP on routers and to HTTP and email on servers, all on the same late Friday.
Also, having RAID or a cluster makes no difference to the need for a backup. Most data loss is caused by user error, and RAID or a cluster is no protection against malware.
A nasty malware may have a timed later activation so that your backups are infected. Thus you can't just keep rotating the backups or just using one USB HDD etc.
You need to keep archived backups off site.
You also don't know how long it might be before user error deletion or mess of data, or patch or new program shows a problem. You may need an earlier backup than you imagine.
Most individuals, small companies and many corporates have no real "disaster recovery" plan. What if your single shop or office is burgled, burnt down, blown up or flooded? You can buy new stock, office furniture and PCs. What about your accounts, supplier data, customer data/CRM, payroll, etc? Also, do not rely on third-party "Cloud" CRM, payroll or accounts: what is their backup, security etc? What do you do if you lose your broadband? How do you migrate to a different supplier? Can you make your own backups in case of an error by one of your users, not just the failure of the provider?
Cloud services may be essential for a Commerce Web site. Or two co-located servers in two data centres is cheaper than electricity and Fast Broadband to a single office. Cloud services or outsourcing for your core business, your backend data etc is really stupid. Banks are particularly crazy to do this.
This. I am actually working on this kind of problem right now and nobody seems to understand that just because you have global SAN replication/synchronisation and additional backup copies (to the same SAN!) there is still value in having master snapshots and emergency backups per server kept completely off the grid for the sole purpose of that 'once in a career' real DR event such as a data centre fire or flood. My personal preference is the KISS approach and keep a rolling swap/set of air-gapped USB3 SATA disks on standby at your DR site, swapped out quarterly and immediately after the latest NFT patch/DR testing completes on your master data servers.
We have a very large SAN in our USA data centre; everything looked good until one day some tech was replacing the backup PSU for routine servicing and got a little confused about which one was the backup. Once they put it all back and the system rebooted, it was found the SAN had never saved its configuration, so it went back to day one.
>This is why you test your restore/recovery procedures.
I've tended to make restore/recovery part of normal day-to-day operations - probably because of my initial training on non-stop and fail-safe computing systems and focus on business continuity. However, I suspect unless you've had your fingers singed (SSO) you probably haven't considered certificate expiry to be an operational risk.
"that 'once in a career' real DR event such as a data centre fire or flood."
One of the things about having had your place of work burn down is that you realise such things can actually happen and potentially more than once in a career. Those who haven't experienced one tend to put them in the "won't ever happen" category.
"having four MNOs, the UK is more fortunate from this perspective than most nations, which have three"
It isn't really four. It depends on how you count them and that turns out to be a lot harder than you might think.
Nowadays there is a lot of sharing going on: sharing of towers, radio, core network, back office and other things. And the sharing is different for different technologies (2G, 3G, LTE). And then there are (secret) roaming agreements where effective national roaming happens in some places (often to provide rural coverage). And the operation is mostly outsourced so the same outsourcer may be operating multiple networks (or parts of them, normally split geographically).
I think the answer is that for this sort of thing there are about 2 1/2 networks in most places in the UK. If I remember correctly there are about three main core networks, but they are split up geographically. So most places end up covered by 2 or 3 of them plus, sometimes, a much smaller piece of network (for example microcells in a city). So, call it 2 1/2!
Anyone got better insight into the effective average number of networks with SGSNs covering a single point in the UK? And how many different SGSN vendors involved? And how many different operations companies?
Mobile spectrum, actually ANY spectrum is a very limited resource. Splitting it to different physical operators reduces performance by x2 to x5. Also operators will not increase mast density to improve performance (the ENTIRE concept of Cellular frequency reuse) once they have sufficient coverage. The issue of ROI. Adding more masts / performance doesn't generate more income.
Just because Network Rail is a disaster doesn't mean the idea of managing and regulating single fixed resources shouldn't be done.
The old Post Office management of Telegraphs and Phones was done wrong. The solution isn't to go to the opposite extreme and have multiple operators and a Regulator that cares more about income from Operators than coverage, performance or the Consumer.
I keep banging on about this to customers and get ignored every time. For printed newspapers, there is a retainer contract with a backup printer in case the normal presses catch fire, break down, go on strike etc. For their app editions, there's bugger all: when the tech falls over, that's it. I think the problem is that having a Plan B is extremely unfashionable at the moment, in business as in politics.
Indeed, I would love to see what happens if the non-UK owners of utilities/manufacturing were made to pack up and go home after Brexit. Mass unemployment in most manufacturing industries (Japanese, German, American mostly) and electricity blackouts, since every nuclear power plant in the UK is owned by EDF (French). And the Dartford crossing would be closed down out of spite by the (French) toll-taking company. Let's brick up the Channel Tunnel while we're at it, eh?
>At least we still make our own bricks.
According to the British Geological Survey the UK isn't self-sufficient in bricks and imported bricks account for a significant percentage of the market...
Mind you perhaps this might be a benefit of Brexit - we won't be able to build all those rabbit hutches various parties say need to be built...
You're suggesting Airwave don't care how much money they make and aren't using kit that's out of support? Right... Running emergency services over commercial networks could be more resilient than the current setup if they had roaming (there aren't enough blue-light users to cause a cascade failure). That doesn't even need network trickery - just a multi-IMSI SIM or a dual-SIM handset.
What is a poor idea is roaming everyone off a failed network, and not having a two tier service, as fixed line installations do.
It's probably forgotten more often these days that in the case of widespread telephone line disruption the average punter will be disconnected, and essential users (doctors, for instance) remain contactable.
I'd be surprised if this isn't part of the mobile networks, and if not, it needs to be.
So, in the event of a major mobile network outage, mountain rescue retain their access (they generally use 2G/pagers for alerts, although they may have radios too), while bus availability doesn't, as there is (or should be) a timetable printed on the bus shelter.
You can't work this without a two tier service, because ultimately businesses will work round unreliable networks by implementing their own multi network/SIM solutions.
We just had our software developer outfit leisurely let the "Apple Developer Certificate" (whatever that is) for their mobile app expire.
Consequence: the app mysteriously won't start on several hundred mobile devices (not even an error message - Apple QUALITY interface there). And these are used in a role which I would personally consider "high assurance", because if it is not working then lots of denarii go down the drain per minute.
Of course, no hotline, developers at home etc.
No-one was responsible because "you should have noticed that our developer certificate would expire by looking at your Mobile Device Management Platform".
Yeah, thanks? I guess.
No, I haven't seen an SLA either.
What if somebody decided that all energy supplies (consumer, industrial, etc) needed to be remotely managed, and the contractors forgot to (a) build robust connectivity into the scheme, (b) test what happened when the notworking was inevitably unreachable in remote areas, and (c) care what happened when the notworking was unusable across wide areas?
Who would/should pay the price for this level of incompetence?
As far as I can tell, the people at the top don't pay the price of failure, at least not in the UK, at least not in the same way as they reward themselves when things "go well".
Obviously there's no way that the successful rollout of a genuinely robust sensible-data-throughput national network with decent availability and uptime could be considered a prerequisite for a Smart Meter rollout. Oh no. That would never work. Not at board level anyway.
How many other countries were (not) affected by the Ericsson foulup? Why might that be?
The only things that melted down more quickly than the O2 network were the O2 customers! I saw no end of "my phone is critical to my business" whinging going on, demands for huge compensation and stories of life changing events.
To that I say, my £100 smartphone has two SIM cards in it, one O2, one EE.
I thank you, good night.
>To that I say, my £100 smartphone has two SIM cards in it, one O2, one EE.
So do you have both numbers on your business card, or do you use a virtual number and call redirection service?
Personally, given dual SIM phones aren't generally available in the high st. but unlocked phones are, I have two handsets (latest toy and previous toy), each on different networks (EE and Three) and my tertiary fallback is a quick trip to a local shop where I can pick up a Vodafone/O2 SIM or a suitable MVNO SIM.
V2X "vehicle to everything" - really? To pedestrians? cyclists? horse riders? flocks of sheep? cows going to milking? Circus parade elephants? Sleepy kangaroos? Spilled loads? Fallen trees?
Technology needs to address itself to the real world, not the "simplest case" that the spec had in mind.
Software and systems should be designed from failure backwards: every function should initially be designed to report and cope with failure, then the "non-failure" case should be added as an exception.
But this doesn't often happen because the developers are so focussed on what they want it to do.
"But this doesn't often happen because the developers are so focussed on what they want it to do."
Devs these days largely work under the thumb of a fragile project manager, the incentives for the fragile project manager are to ensure that delivery deadlines are met.
Of course the delivery date is often a fantasy date that is rarely based on the work required for completion.
In short, shit buggy software on time = bonus.
High quality, robust software, 10-20% late = no bonus = no chance.
The quality of the software doesn't matter to delivery managers and so its difficult to prioritise improvements to robustness over delivery dates, that's why the software developed using the fragile process is *fragile*.
That's okay, keep blaming the devs and not the line of oversight all the way up to the board.
You titled your reply well: it was indeed bollocks.
Managers do indeed press developers to make things happen cheap and fast. But that doesn't stop developers having to say "no, it takes longer to do it properly".
The reality is that it doesn't take much longer. Start with the "cope with error" template and it becomes second nature.
The extra dev time is compensated for by easier integration testing.
Few developers even understand the concept of a "failure first" approach, so it looks hard to them and they react with moronic comments like "Bollocks".
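(A "cope with error" template needn't be exotic. One sketch of the idea in Python: make the failure path the default shape of the function and add the success case last. The fetch_config example and its checks are made up purely for illustration.)

```python
# Failure-first sketch: the error path is designed first and is the default;
# success is the explicit, narrow case. `fetch_config` is a made-up example.
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class Result:
    ok: bool
    value: Optional[dict] = None
    error: Optional[str] = None

def fetch_config(path: str) -> Result:
    # Start from "this will fail": every early return reports *why*.
    try:
        with open(path, encoding="utf-8") as f:
            raw = f.read()
    except OSError as exc:
        return Result(ok=False, error=f"cannot read {path}: {exc}")

    if not raw.strip():
        return Result(ok=False, error=f"{path} is empty")

    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return Result(ok=False, error=f"{path} is not valid JSON: {exc}")

    # Only now the non-failure case, added "as an exception".
    return Result(ok=True, value=data)

# Callers have to handle the failure branch explicitly:
cfg = fetch_config("/etc/example/app.json")
if not cfg.ok:
    print(f"degraded mode: {cfg.error}")
```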
This post has been deleted by its author
Indeed, but it was less than a year after that date that BT-Securicor launched their service, just a few days after Vodafone. Prior to that there had been the GPO's mobile radio telephone service, aka the Carphone service. So I expect that a lot of the investment in infrastructure - masts, planning permission, operators' licences, mast site applications, power supplies etc etc - had to go in well before the network was actually launched, or made use of the existing GPO/British Telecom masts. Thus the funding for this was directly from the taxpayer, rather than indirectly from the taxpayer through their choice to subscribe, of course. And then we see with broadband the government subsidising the initial deployment of infrastructure. Did this not happen with mobile phones too?
One "subsidy" mobile and other network infrastructure companies got from HMG was an abbreviated planning process to dig holes, run cable and build towers everywhere. I can see the point of doing that but it's something that would otherwise cost them lots of cash and stretch the time to market hence their ability to bill customers for new services.
Cables and holes aren't generally the mobile company's job - the backhaul is usually provided by Openreach, Virgin, or one of the dedicated business network companies. Of course the cost of that would increase if the overheads did, but it would increase for everyone, not just mobile co's
Planning permission for towers is abbreviated only in certain cases. A remarkable percentage have to go to appeal to get built. (and the same folk who object sometimes complain about lack of a decent mobile signal!)
It is perfectly possible for one UK mobile network to back up another without creating a domino effect. There are two solutions. The first is an official "telephone preference scheme", which has been around for years on the fixed network for times of national crisis. This is where critical users can be identified, put on a register, and MNOs obliged to have a dormant contract with a second mobile operator to take over the priority users' traffic should the mobile network they subscribe to go down.
The second solution is a commercial one where automatic back-up on a second network is a premium service anyone can buy into. The price of the premium service is set so that the number of such premium customers can be handled by the second network. This commercial approach is a dream solution - critical users have peace of mind and mobile operators have a new stream of revenue that is literally money for old rope. What is there not to like about it? Ofcom could do everybody a huge favour by mandating that every mobile operator offers a premium back-up option to its customers.
"We only realise how pervasive machine-to-machine (M2M) mobile data connections are in our lives until they stop working"
It's either "we DON'T realise" or "WHEN they stop working". As supplied the sentence only works because I assume I know what you mean, not because you've conveyed that.
Under the circumstances a communication failure due to a mismatch of standards is . . . ironic? Only missing the bullseye by still just about functioning...
"The fact is building and operating a nationwide network requires huge capital expenditure..."
In our limited understanding of the network side of things it does seem to us that enough has been ploughed in already. This being so, one might think that such protocols would have been in place as part of the standard package perhaps, or at the least been factored into the acceptance and testing regime prior to overall commissioning in the field.
At the consumer level we pay top dollar for MVP kit and services that are fun and shiny but which have apparently poor resilience when put under modest stress.
At both the network and consumer level, vendors have eschewed the need for decent local/offline fallbacks which would provide much-needed continuity in such events; they have sacrificed this irritating niggle at the altar of The Cloud. A prime example was 'exposed' in this particular outage: according to the Beeb, some plumber was unable to use their satnav - presumably Google Maps or the like - to get to jobs.
Seems the smart device era ain't so smart after all.