Failure to follow proper procedures caused US-wide AT&T outage, FCC says

An AT&T cellular outage lasting more than 12 hours that prevented subscribers from accessing services including 911 was caused by misconfigured hardware and a failure to follow standard procedures when deploying. Or so says the US Federal Communications Commission (FCC) of the incident on February 22, which affected AT&T …

  1. John Brown (no body) Silver badge

    a failure to follow standard procedures

    Isn't this the current leading theory on why CrowdStrike sent out a defective update too?

    Corner cutting due to pressure from above maybe?

    1. DS999 Silver badge

      Re: a failure to follow standard procedures

      In my experience it is more likely an employee who knew the rules but chose not to follow them: "this is a small change and I know what I'm doing".

      1. ecofeco Silver badge
        Holmes

        Re: a failure to follow standard procedures

        My experience is both, and across a wide range of industries, from direct experience as an employee there.

        The whole system is a catastrophe looking for a place to happen. We ain't seen nothin' yet.

        1. Drew Scriver

          Re: a failure to follow standard procedures

          I have first-hand experience at a Fortune 500 company where merely doing a thorough peer review could get you a reprimand from management for "not being a team player" or "being a roadblock". Finding actual problems and rejecting a change request could get you fast-tracked for dismissal.

          Rubber-stamp it or else... In some cases managers would simply approve highly technical and complex change requests to key infrastructure themselves, even though they had absolutely no technical expertise in that area.

  2. ecarlseen

    "But we made a standard!"

    Large corporations are full of standards that don't get followed.

    Sometimes because they're stupid.

    Sometimes because they're under-resourced.

    Sometimes because employees just don't wanna.

    But they love to create standards without follow-up because it lets management check the box of "We did a leadership!" without doing the hard work of ensuring that the standard is sensible, is doable, and is being followed.

    And then it won't be anybody's fault, because "We did a consensus!"

    1. ecofeco Silver badge
      Pint

      Re: "But we made a standard!"

      See, from the director level on up, none of them actually do any work. They have endless meetings about other meetings to pretend to make policy, and leave the execution to their reports, with neither any real authority to enforce those policies nor any oversight of their reports.

      Combine it all, and it is Kafka writ large on an unprecedented scale. With high speed computing creating high speed mistakes.

      Now add another factor. A quote I've recently seen. "When poor performance is not punished and good performance is not rewarded, all motivation is gone."

      And with that thought, I'm off to the pub. ------------------------------->>>

  3. Anonymous Coward

    No fine?

    So they're not going to even bother with a token fine?

    Regulatory capture at its finest.

    1. IGotOut Silver badge
      FAIL

      Re: No fine?

      Go back. Read Article.

  4. User McUser
    Unhappy

    Am I the only person bothered by the fact that AT&T's ENTIRE, NATIONWIDE NETWORK can be taken down by a single badly configured device? Like, at all, under any circumstances?

    And the only protection against this is that someone else is supposed to also look at it?

    Killing a single tower or a couple in a small region? Sure, understandable - but one idiot fucks up and it ALL goes to shit?

    What kind of house of cards setup are they running? Where is the outrage over this from the FCC or Congress or anyone?

    1. ecofeco Silver badge
      Pint

      Having once worked for AT&T, lo these many moons ago: if you only knew, you would do what I do -------------------------->>>>

      1. Mark Exclamation

        Do enlighten us.....

  5. Frank Bitterlich

    I'd like to understand...

    ...more about the "protection mode" that was triggered. So a network device was installed that triggered some kind of watchdog system and, instead of just isolating the faulty new component, it somehow brought the whole network down.

    I have no clue about how mobile networks are being run. I do understand that many layers of safeguards are necessary to protect the network from faulty/compromised/wrongly configured components. But surely the protective response can't be "let's shut the whole network down". So why did it happen? Was that protection system behaving as designed? Was it built to protect against a different scenario, and made the whole problem worse? Or was it designed to do exactly that to protect against some even more undesirable consequence by disconnecting all devices?

  6. DS999 Silver badge

    Sounds like most of the issue

    Was the time taken by getting everything to come back online, and then tens to hundreds of millions of devices all trying to reconnect within a very short period.

    One would assume that if the outage occurred three minutes after this change was made, it would be quickly apparent to everyone that a change made during the maintenance window was responsible, and an order could be quickly issued to roll back those changes. So one would think that in less than an hour that "rogue network element" would be removed, and the rest of the 12 hours was spent getting stuff back online. Sort of like what happens if you lose power in a datacenter: you can't just flip a switch and let everything come back online at its own pace. You have to power on servers in a certain order, verify they are up and working, then move on to the next group, until the environment is fully back online.

    Not much they can do about that problem as that's inherent to any complex environment, but I assume the bum rush to reconnect could be mitigated via either the tower software or the SIM profile to stagger mass reconnections over a few minutes.
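
To illustrate the staggering idea in the comment above, here is a minimal sketch in Python. It is not how AT&T's network, any tower software, or any SIM profile actually works; the function names, the 300-second window, and the uniform random spread are all assumptions made for illustration. Each device picks a random delay inside a fixed window before it tries to reconnect, so the attempts spread out instead of arriving at once.

```python
import random


def reconnect_delays(device_ids, window_seconds=300.0, seed=None):
    """Assign each device a random delay (seconds) inside the window.

    Hypothetical helper: real networks would use their own back-off
    mechanisms; this only demonstrates the spreading effect.
    """
    rng = random.Random(seed)
    return {device: rng.uniform(0.0, window_seconds) for device in device_ids}


def peak_attempts_per_second(delays):
    """Count reconnect attempts per one-second bucket and return the busiest."""
    buckets = {}
    for delay in delays.values():
        second = int(delay)
        buckets[second] = buckets.get(second, 0) + 1
    return max(buckets.values()) if buckets else 0


if __name__ == "__main__":
    devices = range(10_000)
    delays = reconnect_delays(devices, window_seconds=300.0, seed=1)
    # Without staggering, all 10,000 attempts would land in the same second.
    print("peak attempts in any one second:", peak_attempts_per_second(delays))
```

With no staggering the peak equals the number of devices; with a window of a few minutes the peak drops to roughly the device count divided by the window length, plus some random variation.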

  7. Anonymous Coward

    It shows how vulnerable we are. If a nation state wanted to disrupt the US at a time of war... This is the way. Just seed bad actors within. No need for a virus or a hack, humans are a far more dangerous attack vector.

    1. ecofeco Silver badge

      Or better yet, sell them their own destruction!
