impressive demonstration
It used to be difficult to take out a whole country's telephone network. Now you seem to be able to do it by accident.
Backbone provider Level3 says an outage that knocked out VoIP service for much of the US Tuesday morning was the result of improperly configured equipment. It seems the outage, which smashed call services offline for much of the country, was not the result of any fiber cuts or facility damage, but rather some classic bad …
Reason for Outage (RFO) Summary: On October 4, 2016 at 14:06 GMT, calls were not completing throughout multiple markets in the United States. Level 3 Communications' call center phone number, 1-877-4LEVEL3, was also impacted during this timeframe, preventing customers from contacting the Technical Service Center via that phone number. The issue was reported to the Voice Network Operations Center (NOC) for investigation. Tier III Support was engaged for assistance isolating the root cause. It was determined that calls were not completing due to a configuration limiting call flows across multiple Level 3 voice switches. At 15:31 GMT, a configuration adjustment was made to correct the issue, and inbound and outbound call flows immediately restored for all customers. Investigations revealed that an improper entry was made to a call routing table during provisioning work being performed on the Level 3 network. This was the configuration change that led to the outage. The entry did not specify a telephone number to limit the configuration change to, resulting in non-subscriber country code "1" calls to be released while the entry remained present. The configuration adjustments deleted this entry to resolve the outage.
Some lower being in their IT rolled out a duff config to mission-critical routers affecting - what, millions? - of customers, because they didn't bother to pre-test, check, verify or anything else on their config change, and managed to take down - what, a million? - phone lines.
Of course, none of this was caught by testing, configuration management or change management; it was only when it got to the top bod who actually knew what he was doing, and started shouting, that someone owned up to putting a stupid config on their main devices without testing.
This obviously all took hours to happen and fix rather than someone pushing a change to a set of switches they manage, testing them immediately afterwards, and then immediately rolling back when they realised they weren't working as before. Because, nah, forget all that, our customers will tell us if something doesn't work.
It doesn't matter WHAT scale of business, the same stupid junk happens all over.
I'm yet to work in an environment that actually has a test environment for the comms guys.
App developers, servers, sure. But all comms equipment I've ever touched has been "live".
Makes "testing" difficult. And of course with a config change like that you need realistic load to test with.
This also sounds like one of those standard changes that gets done multiple times a day and more or less waved through change management. And the "lower being" probably verified it by making sure the line matched his doco, then went on to the next job, and would have been the last to find out it had gone TITSUP.
It sounds like a simple configuration change that was supposed to restrict the call volume to a single phone number (e.g. someone published the wrong phone number for a business, or there was a contest call-in that went bad). Instead they entered it so that it limited the call volume of everything in +1.
To put it in Unix admin terms, they did
FILE=tmp/foo
rm -rf / ${FILE}
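Reading the RFO that way, the failure mode can be sketched as a match rule whose empty number field matches everything. Purely illustrative: the entry format, function name and phone numbers below are invented, since Level 3's actual switch configuration isn't public.

```shell
#!/bin/sh
# Hypothetical sketch: a call-blocking entry whose "number" field is empty
# matches every dialed number, so all country-code-1 calls get released.
entry_matches() {
  entry_number=$1
  dialed=$2
  # An empty pattern matches everything -- the suspected failure mode.
  [ -z "$entry_number" ] || [ "$entry_number" = "$dialed" ]
}

# Entry scoped to one number only affects calls to that number...
entry_matches "18005550100" "12125551234" && echo released || echo completed  # completed
# ...but an entry with the number left blank hits everyone.
entry_matches "" "12125551234" && echo released || echo completed             # released
```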
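The analogy hinges on the stray space. A safe dry run (echo standing in for rm) shows the shell splitting the command into two separate arguments, and how quoting would have kept it one path; the paths are just the example's.

```shell
#!/bin/sh
FILE=tmp/foo

# Dry run: the stray space after "/" means the shell passes rm two
# arguments, "/" and "tmp/foo" -- goodbye, root filesystem.
echo rm -rf / ${FILE}      # prints: rm -rf / tmp/foo

# Quoting the variable (with no stray space) keeps it a single path;
# "set -u" would additionally abort if FILE were unset entirely.
set -u
echo rm -rf "/${FILE}"     # prints: rm -rf /tmp/foo
```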
".....Level 3 Communications' call center phone number, 1-877-4LEVEL3, was also impacted during this timeframe....." Many years ago, I knew the managers for the NOC at a UK telecom, and they admitted to me that they had half their on-call phones on a competitor's network just in case theirs went down, and their competitor vice versa. I assume no-one at Level3 thought about that option.