a failure to follow standard procedures
Isn't this the current leading theory on why CrowdStrike sent out a defective update, too?
Corner-cutting due to pressure from above, maybe?
An AT&T cellular outage lasting more than 12 hours that prevented subscribers from accessing services including 911 was caused by misconfigured hardware and a failure to follow standard procedures when deploying. Or so says the US Federal Communications Commission (FCC) of the incident on February 22, which affected AT&T …
I have first-hand experience at a Fortune 500 company where merely doing a thorough peer review could get you a reprimand from management for "not being a team player" or "being a roadblock". Finding actual problems and rejecting a change request could get you fast-tracked for dismissal.
Rubber-stamp it or else... In some cases managers would simply approve highly technical and complex change requests to key infrastructure themselves, even though they had absolutely no technical expertise in that area.
Large corporations are full of standards that don't get followed.
Sometimes because they're stupid.
Sometimes because they're under-resourced.
Sometimes because employees just don't wanna.
But they love to create standards without follow-up because it lets management check the box of "We did a leadership!" without doing the hard work of ensuring that the standard is sensible, is doable, and is being followed.
And then it won't be anybody's fault, because "We did a consensus!"
See, from the director level on up, none of them actually do any work. They have endless meetings about other meetings, pretend to make policy, and leave execution to their reports, giving them no real authority to enforce those policies and providing no oversight of their work.
Combine it all, and it is Kafka writ large on an unprecedented scale, with high-speed computing creating high-speed mistakes.
Now add another factor. A quote I've recently seen. "When poor performance is not punished and good performance is not rewarded, all motivation is gone."
And with that thought, I'm off to the pub. ------------------------------->>>
Am I the only person bothered by the fact that AT&T's ENTIRE, NATIONWIDE NETWORK can be taken down by a single badly configured device? Like, at all, under any circumstances?
And the only protection against this is that someone else is supposed to also look at it?
Killing a single tower or a couple in a small region? Sure, understandable - but one idiot fucks up and it ALL goes to shit?
What kind of house of cards setup are they running? Where is the outrage over this from the FCC or Congress or anyone?
...more about the "protection mode" that was triggered. So a network device was installed that triggered some kind of watchdog system and, instead of just isolating the faulty new component, it somehow brought the whole network down.
I have no clue how mobile networks are run. I do understand that many layers of safeguards are necessary to protect the network from faulty, compromised, or wrongly configured components. But surely the protective response can't be "let's shut the whole network down". So why did it happen? Was that protection system behaving as designed? Was it built to protect against a different scenario, and did it make the whole problem worse? Or was it designed to do exactly that, disconnecting all devices to protect against some even more undesirable consequence?
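Just to make concrete what I mean by "isolating the faulty component", here's a toy sketch in Python of a protection system that quarantines only the element that tripped it rather than dropping the whole network. Every name and behaviour here is invented; I have no idea what AT&T's kit actually does.

    # Toy model: a watchdog that quarantines a misbehaving element instead of
    # shutting everything down. Entirely hypothetical; for illustration only.
    from dataclasses import dataclass, field

    @dataclass
    class NetworkElement:
        name: str
        misconfigured: bool = False
        quarantined: bool = False

    @dataclass
    class Watchdog:
        elements: list = field(default_factory=list)

        def check(self) -> None:
            for elem in self.elements:
                if elem.misconfigured and not elem.quarantined:
                    # Scoped response: take only the faulty element out of service.
                    elem.quarantined = True
                    print(f"quarantined {elem.name}; everything else stays up")

    if __name__ == "__main__":
        net = Watchdog(elements=[
            NetworkElement("core-1"),
            NetworkElement("new-element", misconfigured=True),
            NetworkElement("core-2"),
        ])
        net.check()

Whether a real mobile core can even draw that boundary cleanly is exactly the question.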
Most of the time was presumably taken by getting everything to come back online, and then by tens to hundreds of millions of devices all trying to reconnect within a very short period.
One would assume that if the outage occurred three minutes after this change was made, it would be quickly apparent to everyone that a change made during the maintenance window was responsible, and an order could be quickly issued to roll back those changes. So one would think that in less than an hour that "rogue network element" would be removed, and the rest of the 12 hours was spent getting stuff back online. Sort of like what happens if you lose power in a datacenter: you can't just flip a switch and let everything come back online at its own pace. You have to power on servers in a certain order, verify they are up and working, then move on to the next group, until the environment is fully back online.
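Something like this, purely as a sketch of that staged bring-up idea; the group names, ordering, and health check are all made up:

    # Staged bring-up: power on one group, wait until every server in it
    # reports healthy, then move to the next group. Hypothetical names only.
    import time

    def is_healthy(server: str) -> bool:
        # Placeholder; a real check would probe the actual service.
        return True

    def staged_bringup(groups, poll_seconds=1.0):
        for group in groups:
            for server in group:
                print(f"powering on {server}")
            # Don't start the next group until this one is fully verified.
            while not all(is_healthy(s) for s in group):
                time.sleep(poll_seconds)
            print(f"group {group} verified, moving on")

    if __name__ == "__main__":
        # Invented ordering: storage first, then databases, then app servers.
        staged_bringup([["san-1", "san-2"], ["db-1", "db-2"], ["app-1", "app-2"]])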
There's not much they can do about that problem, as it's inherent to any complex environment, but I assume the bum rush to reconnect could be mitigated via either the tower software or the SIM profile, staggering mass reconnections over a few minutes.
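Roughly what I have in mind, as a back-of-the-envelope sketch (the device count and window are made-up numbers): each device waits a random delay inside a window before reattaching, which flattens the spike.

    # If each device picks a random reconnect time inside a window, the load
    # spreads out instead of arriving all at once. Numbers are invented.
    import random
    from collections import Counter

    def reconnect_load(num_devices: int, window_seconds: int) -> Counter:
        # Bucket each device's randomly chosen reconnect time into whole seconds.
        return Counter(int(random.uniform(0, window_seconds)) for _ in range(num_devices))

    if __name__ == "__main__":
        load = reconnect_load(num_devices=1_000_000, window_seconds=300)
        print(f"peak reconnects in any one second: {max(load.values()):,}")
        print("versus 1,000,000 if every device reconnected immediately")

A five-minute window turns a single million-device stampede into a few thousand attach requests per second, which is the sort of thing the signalling plane can actually absorb.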