phew
I had just installed a new router this weekend and was beginning to fear I'd configured something wrong when connectivity started showing issues.
And now I also know our ISP uses Telia....
Swedish infrastructure company Telia is to blame for a massive internet outage today after an engineer apparently misconfigured a key router and sent all of Europe's traffic to Hong Kong. The Tier 1 network provider is one of fewer than 20 companies that provides a basic foundation for much of the internet. It sent a note to …
While it's a fairly crappy thing to have happen from Telia's side, it shows various connected telcos don't have rules in place to filter out bogus routes (which they should). That's absolutely on them, not Telia.
People, please fix your route filtering so bogus external routes (including TLA-sponsored crap) don't automatically hook you.
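For what it's worth, the idea sketches easily enough. A rough Python illustration, with made-up peer names and an invented allow-list (in practice you'd build it from IRR/RPKI data rather than hard-code it): only announcements that fall inside the prefixes registered to that peer get accepted.

import ipaddress

# Hypothetical allow-list: prefixes each peer is registered to originate.
ALLOWED = {
    "peer-a": [ipaddress.ip_network("192.0.2.0/24")],
    "peer-b": [ipaddress.ip_network("198.51.100.0/24")],
}

def accept_route(peer, prefix):
    """Accept an announced prefix only if it sits inside a prefix
    registered for that peer; everything else is treated as bogus."""
    net = ipaddress.ip_network(prefix)
    return any(net.subnet_of(allowed) for allowed in ALLOWED.get(peer, []))

# A prefix that belongs to someone else, announced by peer-a, is dropped.
print(accept_route("peer-a", "192.0.2.128/25"))   # True: inside its block
print(accept_route("peer-a", "203.0.113.0/24"))   # False: bogus leak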
Cloudflare... hmmm... playing up to the media maybe? :(
Not so easy. That is the problem with BGP: it was never built for this scale and this world. BGP leaks happen every day and they are difficult to stop, especially with big telcos that take peering, transit and transport from each other. The protocol is based on trust; anything else, such as filters, is inefficient and only really manageable in small BGP implementations that aren't dynamic. There is an attempt to revise the protocol exactly because of this; the Reg has something about it.
Nevertheless, as the article suggests, if Telia is transporting the traffic for other ISPs and routes it somewhere wrong, there aren't filters to resolve it; they are simply transporting the traffic into a black hole, because the next gateway has no clue what to do with the prefixes and just drops them.
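One way a leak can pull traffic into that black hole, as a toy example (all prefixes and next-hop names below are invented): once the leaked announcement is in the table, longest-prefix match prefers it, and the traffic heads for a next hop that just drops it.

import ipaddress

# Toy routing table: (prefix, next_hop). The leaked /24 is more specific
# than the legitimate /8, so longest-prefix match prefers it.
TABLE = [
    (ipaddress.ip_network("10.0.0.0/8"),  "legitimate-origin"),
    (ipaddress.ip_network("10.1.2.0/24"), "leaking-transit"),   # the leak
]

def next_hop(dest):
    """Longest-prefix match: the most specific matching route wins."""
    addr = ipaddress.ip_address(dest)
    matches = [(net, hop) for net, hop in TABLE if addr in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1] if matches else None

# Traffic inside the leaked /24 now heads for the leaker, whose onward
# gateway has no clue what to do with it and drops it.
print(next_hop("10.1.2.55"))   # leaking-transit -> black hole
print(next_hop("10.9.9.9"))    # legitimate-origin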
Remember all that explanation about how cleverly self-organizing and self-repairing the internet is? How it can route around a damaged link? Remember all that?
Bollocks. Now it's apparently manually controlled by fat fingered blokes. Might as well be levers and handles controlling pipes. Fry was right.
Is this described under an RFC document? RFC 9995 'Fry's List of Levers and Handles to Control the Internet Pipes'
Stupid.
Yes.
This is way outside my area of competence.
So I'd always assumed that at this top level of function there was a top level of error checking. Both human and digital.
Systems for checking that what is meant to go to X actually goes to X, and systems that make sure humans don't press the wrong button. Or at least, if they do, that it's spotted PDQ.
Engineers in other system critical areas seem to have this, usually. e.g. Aircraft engineers.
The majority of these services are provisioned separately and over many years, so in reality there is little holistic view of everything and its dependencies and interdependencies... Not to mention that half of the telcos bought a number of other telcos, and there are a lot of skeletons hidden in those closets.
In general these big networks are anything but well controlled, no matter which telco it is; the proof is the succession of big outages. No matter how big the name, they have all had their own episodes in the last two years.
The reality is plain, simple and hard: all the protocols the internet is based on are old, very old, and were never meant to scale out. Over the years people have implemented a number of best practices and workarounds to try to make inefficient protocols meet new demands. IPv4, even TCP, DNS, SMTP... all completely inadequate for today's needs, yet all of them the baseline of everything we call the internet.
Even IPv6, which is barely implemented at all, dates from somewhere around 2003, close to 15 years ago. BGP, the big engine behind all of this, dates from the first half of the 90s.
To make it more interesting, very few big telcos really have any automation in place in their cores; the networks are so big, and have been built up over the years with all kinds of hidden stuff, that no one dares apply the blind law of automation and risk an outage. Automation is normally implemented at the edge, in customer services, and even there the complex nature of many of them means no automation is possible. Normally things like DSL, cable and that kind of service are where you find the automation piece.
@jp
If you absolutely need to route that way, you will. If you're sharing tables with customers, hopefully their kit will show a shorter path via their alternate service provider and avoid the long erroneous route; otherwise they are wholly reliant on their provider doing the right thing, and there lies the risk and danger to service availability.
But having an alternate service provider would not have helped. Telia didn't go down. The routes they injected told the whole world that these particular subnets were available (or withdrawn from) HERE. Any alternate providers would redirect that traffic to Telia. Any transit peers or tier 2s connected would still suffer. An alternate service provider would typically not help a business in this case.
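A toy illustration of that point (all ASNs below are made-up, private-range numbers): once the leak has propagated, every upstream is offering a path that runs through the leaking AS, so shortest-AS-path selection only picks which door into the leak you use.

# Hypothetical candidate routes for one leaked prefix, as learned from
# two different upstreams.
LEAKER = 64666

candidates = {
    "provider-a": [64500, LEAKER, 64510],
    "provider-b": [64501, 64502, LEAKER, 64510],
}

# With equal local-pref, BGP prefers the shorter AS path...
best = min(candidates, key=lambda p: len(candidates[p]))
print("best path via", best, "->", candidates[best])

# ...but both candidates still pass through the leaking AS, so switching
# providers doesn't route around the leak.
print(all(LEAKER in path for path in candidates.values()))  # True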
It's common knowledge that BGP-4 misconfiguration errors can do things like this, and not all such errors can necessarily be filtered out automatically. I'm guessing that this was a route that was supposed to be in iBGP but got announced to BGP-4 peers as well.
(Not defending Telia in the least, but there are daily BGP-4 misconfigurations, it's just that most of them have more localised impact.)
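If that guess is right, the failure mode is roughly the sketch below (toy route records with a made-up "scope" field): the intended export policy only passes externally-scoped routes to eBGP peers, and the fat-fingered version forgets the check.

# Toy route records; "scope" marks routes that should stay in iBGP.
routes = [
    {"prefix": "192.0.2.0/24",    "scope": "external"},  # fine to announce
    {"prefix": "198.51.100.0/24", "scope": "internal"},  # iBGP only
]

def export_to_ebgp_correct(routes):
    """Intended policy: only externally-scoped routes leave the AS."""
    return [r for r in routes if r["scope"] == "external"]

def export_to_ebgp_buggy(routes):
    """Fat-fingered policy: the scope check is missing, so internal
    routes leak to eBGP peers and the rest of the world learns them."""
    return list(routes)

print([r["prefix"] for r in export_to_ebgp_correct(routes)])
print([r["prefix"] for r in export_to_ebgp_buggy(routes)])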
I do not understand how you're able to write this article, since the last update we had from Telia was:
> We would like to let you know, issue experienced earlier it is now solved. Our engineers, have confirm the network performances it back to normal. You should not experience any more disturbances due to this outage.
> We will provide you with an RFO once the information its available.
Anyway, maybe the RFO will confirm your article.
And a footnote: CloudFlare were affected - no other CDN mentioned this publicly. I'm just gonna point this out: http://28.media.tumblr.com/tumblr_labmz1v2QU1qd0a5no1_500.jpg
I've certainly screwed things up over the years. But when you make a major change and part of the Internet goes dark, wouldn't you put 2+2 together and check your work? Or did he just hit enter and go on vacation?
I remember once when I inadvertently tripped a facility's fire alarm. I threw a switch and ominously the alarm went off a second later. That couldn't be me... Could it? COULD it? Aarrgh.
Where is the "oops" icon?
Uhh… I think I found him in my building. He's the guy with his shirt off standing on a desk in the middle of the cube farm while a burning network switch illuminates him from below, drinking from a gallon jug of 200 proof we keep for cleaning purposes. Hang on, he is yelling something at the CIO.
"I DID SOMETHING. I AFFECTED THE WORLD. I AM A PERSON WHO MATTERED TODAY."
"Too easy to make a typo on the commandline and sink a bunch of ships."
It's just as easy to get distracted and click the wrong button on a GUI too. Most GUIs are not well designed and those that are are usually quicker and easier to use from the keyboard anyway.
Unless you are drawing, most GUIs will usually let you do pretty much everything from the keyboard, but there does seem to be a trend towards "mouse only" GUIs, which can be a pain when you have to type some text in, then reach for the mouse and click "submit", when the previous version let you use ALT-S without having to move your hand so far to find the bloody mouse!!!
GUIs are great at keeping the learning curve shallow but can be draining on productivity once the user is familiar with the software. And beginners who need a hand-holding GUI probably should not be messing with tier 1 routing tables.
I'm guessing the "plank" in question accidentally loaded a config into a live box instead of a POC/dev/test box?
I can't begin to imagine the "white fear" (the blood draining from your face) as it suddenly dawned on them what they had done! "Hello love? Yes, it's me, I'm going to be working late... assuming I still have a job at the end of the day!"
As someone who's spent time working for one of those 20 or so companies (the lot I worked with were in the top 10 of that list), I can say this kind of configuration/testing is nigh-on impossible to do in labs. You can, and do, have multi-million pound labs for testing, but unfortunately you can't make a test lab anywhere near as big, or as complex, as a large chunk of the public internet.
In theory, most things at this level are OK because they normally follow fairly rigorous change control procedures, where changes are vetted by multiple R&S CCIEs and people who have not only the practical knowledge but also the theory (very important: if you can't figure out what will happen in your own mind, you'll make lots of mistakes).
While complex networking at an enterprise level is a simpler problem, at a global service provider level it's extremely complicated, and no vendor has yet got anywhere near the level of automation needed to prevent fat-finger mistakes.
This one will continue to happen (very occasionally) for a long while to come, I think. I would agree with others on here who have mentioned that customers/CDNs should perhaps be filtering routes and doing a few other widely available tricks to be more in charge of their own destiny; however, that doesn't seem to be the case, and most get slammed when Tier 1 issues propagate downwards.
Such changes are supposed to only be made under change control. Write down the commands that are to be implemented and why. Include tests to prove the change worked as expected. Include backout plan in case needed. Send to change panel for approval. Peers review, suggest updates, eventually approve. Follow implementation process, sorted.
Even if wrongly typed, the post-change checks and tests would pick up on it (look at advertised routes in changed range before, and after. Are they as expected?)
So yes, transit peers should filter, but Telia is a tier 1 and therefore ANY route from them (except those owned by the transit peer) can be trusted! That's the nature of tier 1: alternate routes to the same subnets are frequent and common!
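On the post-change checks mentioned above: a minimal sketch of the before/after comparison, assuming you can dump the advertised prefixes as plain text either side of the change (the prefixes and the "planned" set below are invented).

def advertised_diff(before, after):
    """Compare advertised prefix sets captured before and after a change.
    Anything unexpectedly added or withdrawn should trigger a backout."""
    before, after = set(before), set(after)
    return {"added": after - before, "withdrawn": before - after}

# Hypothetical capture: the change was only meant to add 192.0.2.0/24,
# but the post-change dump shows more appearing than planned.
pre  = ["198.51.100.0/24"]
post = ["198.51.100.0/24", "192.0.2.0/24", "203.0.113.0/24"]

diff = advertised_diff(pre, post)
planned = {"192.0.2.0/24"}
unexpected = (diff["added"] - planned) | diff["withdrawn"]
print("unexpected changes:", unexpected or "none")  # non-empty -> back it out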
There were no obvious direct effects for me, but I suspect a couple of services I use were hit by CDN issues over the weekend. CloudFlare is one of I don't know how many CDN networks.
It ought to be clearing by now, but I am not sure it is.
It would be nice to know just which CDN the service I am paying for is using, but I can't really see this problem affecting just CloudFlare.
What would the BOFH do?
Ah, opening time, never mind.
...to use the expression "Deprioritizing them until we are confident they've fixed their systemic issues" (with one or two very minor changes of word) in a domestic setting when DDO* does something to upset me. Could turn out to be a high-risk strategy, of course.
* Director, Domestic Operations, just in case you were wondering.
* Meaning that any route Cloudflare receives with Telia AS numbers in the AS_PATH will have a few more of Cloudflare's own AS numbers prepended. A longer AS path reduces its likelihood of being preferred into the routing table ahead of other routes, but Telia will still be used as a last resort.
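A rough sketch of that knob, with made-up ASNs standing in for Cloudflare and Telia: if the deprioritised AS appears anywhere in a received AS path, pad the path with your own AS a few extra times before best-path selection, so that route only wins when nothing else is left.

MY_AS = 64496          # hypothetical stand-in for Cloudflare's AS
DEPRIORITISED = 64666  # hypothetical stand-in for the Telia AS numbers
PAD = 3

def ingress_policy(as_path):
    """If the deprioritised AS appears in the path, prepend our own AS a
    few extra times so this route loses the AS-path-length comparison."""
    if DEPRIORITISED in as_path:
        return [MY_AS] * PAD + as_path
    return as_path

# Two candidate routes for the same prefix; after the policy, the one via
# the deprioritised carrier is longer and only used as a last resort.
via_deprioritised = ingress_policy([DEPRIORITISED, 64510])
via_other         = ingress_policy([64501, 64502, 64510])
print(min([via_deprioritised, via_other], key=len))  # the other path wins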
Published about an hour ago:
RFC 7908
Title: Problem Definition and Classification of BGP Route Leaks
Authors: K. Sriram, D. Montgomery, D. McPherson, E. Osterweil, B. Dickson
Status: Informational
Stream: IETF
Date: June 2016
URL: https://www.rfc-editor.org/info/rfc7908
DOI: http://dx.doi.org/10.17487/RFC7908
Do Telia not operate some kind of change management, or is their change management system inadequate (e.g. at checking the results of the change)?
Do they really let individuals make it up as they go along, rather than stick to changes that have already been discussed?