BGP
Nowhere in the article is BGP defined. A footnote might be handy for non-techies:
https://en.wikipedia.org/wiki/Border_Gateway_Protocol
Monday's prolonged Google cloud and websites outage was triggered by a botched network update by a West Africa telco, it is claimed. Main One, a biz ISP based in Lagos, Nigeria, that operates a submarine cable between Portugal and South Africa, said a misconfiguration at its end caused Google-bound traffic to be redirected to …
Referring to the previous article: “While Google was hesitant to draw any conclusions, cloud security experts have little doubt that the BGP hijacking was intentional, rather than a brief typo in a config file or a fat finger in a terminal, and that the people behind it were almost certainly up to no good by intercepting Google Cloud connections.”
Except it wasn't... or maybe it was... or....
A haze is settling over my view of the cloud and I can no longer see the woods.
Quietly slipping on tinfoil body suit.
The following is a joke... of course... to the tune of "Blame Canada"
The irony is in the last part, because as Joyce said in the article, "... the wake-up call for all of us to get serious about addressing the massive and unacceptable vulnerability..."
----
Times have changed
The Internet's getting worse
They won't obey the IETF
...and go to V6 instead
Should we blame the government?
Or blame society?
Or blame the traffic of Internet TV?
No, blame Africa, blame Africa
Their update was a surprise.
They re-routed all our files.
Blame Africa, blame Africa
We need to form a full assault
It's Africa's fault
----
Don't blame my poor old router
It saw the wrongest route
And now it's off to China and Japan.
And Russia's on the path
My files have gone "Прощай"
And buggered off to the East instead of West
Well, blame Africa, blame Africa
Something technical went wrong
When Africa came along
Blame Africa, blame Africa
They're not even between me and L.A.
My data could have been a movie, or a best selling book.
Now it's down a black hole, come and look.
Should we blame the fibre?
Should we blame the light?
Or the technicians who buggered it up last night?
Heck no
----
Blame Africa, blame Africa
'Cause of MainOne's hullabaloo
They lost all your traffic too
Blame Africa, shame on Africa
All the smut got lost, the traffic all got crossed
All this routing mess must be undone
We must blame them and cause a fuss
Before someone thinks of blaming us
After all, you have lots of entities peering with each other. Each of those peerings requires an agreement. It would be sensible to use this to also sign keys; after all, you typically know who you are peering with.
However, I don't think attribution is much of the problem here, as it is usually rather easy to find the culprit. What's really needed is route filtering.
However, I don't think attribution is much of the problem here, as it is usually rather easy to find the culprit. What's really needed is route filtering.
It's not that simple. As I discovered long ago when AS7007 had an 'oops' moment-
https://en.wikipedia.org/wiki/AS_7007_incident
And the 'major changes', I think, were mostly about trying to get accurate contact details in a form that would be accessible in the event of a repeat. But this particular event shows that fat-fingering can still happen, and it can still take time to resolve.
In theory, it could be possible to make sure every specific, routable network has a fully defined route object and an up-to-date origin AS, which is something the routing registries have been trying to get done since RIPE-181 defined RPSL. Which is possibly why ARIN's swamp remains a mystery. RPSL has RFC status now, but there are still a lot of problems getting route policies accurate, which would then enable more reliable route filtering.
The thing is, there is no easy solution to this.
In the generalised case, only one peer knows what routes their customer should be emitting - and that's the one directly providing the service. But that only applies for the "leaf nodes" - so if I get a line in from a couple of ISPs to my little hosting biz, both of those ISPs can (and should) filter my BGP announcements to only allow the small set of IPs I have. That bit is relatively simple - and as long as every end-point provider does this basic filtering at source then one avenue of cock-up is blocked. But if they don't then ...
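As a rough sketch of that edge filtering (Cisco-style syntax; the customer prefix, ASNs and neighbour address are all made up purely for illustration), it's just a prefix-list applied inbound on the customer session:
! hypothetical customer who only holds 192.0.2.0/24
ip prefix-list CUST-HOSTINGBIZ-IN seq 5 permit 192.0.2.0/24
ip prefix-list CUST-HOSTINGBIZ-IN seq 10 deny 0.0.0.0/0 le 32
!
router bgp 64500
 neighbor 198.51.100.2 remote-as 64511
 neighbor 198.51.100.2 prefix-list CUST-HOSTINGBIZ-IN in
Anything the customer announces outside 192.0.2.0/24 gets dropped at the first hop, which is the "filter at source" point above.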
Both of those ISPs will be taking my traffic to one or more exchanges and publishing my routes alongside many others. So my route advertisements now appear to be coming from two different ISPs - the problem is that all those other peers connected at the exchange(s) will not know (or have any way of knowing) whether the routes the ISPs are sending on my behalf are genuine.
And it gets worse. Those peers will pick up my routes and propagate them across their network, and at some other point they will get broadcast to other peers. These other peers (now twice removed from any relationship with me) will not have any way to know whether or not they are genuine.
And so it goes on, with peers around the world getting further and further away from knowing who I am and who should be carrying traffic towards my IPs.
But that is only the simple case, where the error is in a leaf node and it's relatively easy to know what routes should be advertised from there - the ISP asks me, when providing connectivity, what AS numbers and address blocks I own and puts those into their filter for the connection itself.
In the case here, the error happened at a transit peer that by definition must be handling lots of routes for people it knows nothing about.
In this case, what I think has happened is that internally they've set up routes to send Google traffic direct to Google via their peering arrangement. Basically that's a matter of "send this list of IP blocks via this gateway". At the same time, they should be filtering those same IP blocks out of the BGP announcements they make via other connections - specifically the sub-sea cable they operate. They made a mistake here, so the peering-specific routes leaked out.
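The usual guard against that sort of leak is an outbound filter on the transit and cable sessions that only permits the prefixes you actually originate or provide transit for, so anything learned from a peering session can never escape via another border. A purely hypothetical Cisco-style sketch (documentation prefixes and ASNs, not anyone's real config):
! only ever announce our own blocks and our customers' blocks
ip prefix-list OWN-AND-CUSTOMERS-OUT seq 5 permit 198.51.100.0/24
ip prefix-list OWN-AND-CUSTOMERS-OUT seq 10 permit 203.0.113.0/24
ip prefix-list OWN-AND-CUSTOMERS-OUT seq 20 deny 0.0.0.0/0 le 32
!
router bgp 64496
 neighbor 192.0.2.1 remote-as 64497
 neighbor 192.0.2.1 prefix-list OWN-AND-CUSTOMERS-OUT out
Routes picked up over the private peering simply aren't in the permit list, so even though they sit in the routing table they never get re-announced towards the cable.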
But as above, the other peers involved have no way of knowing that this was a mistake - it could be that the announcements they saw were the result of some new connection going in that made this a good route for the packets, something that's not easy to determine. The key thing is, these other peers really have no way of knowing whether that link genuinely is a route to those destinations. Just signing the route advertisements won't help - because all those routers will have to propagate the routes anyway, so seeing a route that's signed does not tell you anything about whether the router you received it from should actually be routing that traffic.
Bear in mind that the global routing table is heading on for 3/4 of a million entries, propagated across many thousands of routers operated by thousands of operators. It's hard to see how any web of trust could be set up that would handle that scale.
But.. RPSL!
So as an example-
route: 86.128.0.0/10
descr: BT Public Internet Service
origin: AS2856
Part of a route object for a chunk of BT's address space. That says that 86.128.0.0/10 should only originate from AS2856, which is BT's AS. Where a network and route is multi-homed, the same prefix can have multiple route objects registered, each with a different origin: AS.
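To make that concrete with a purely hypothetical second origin (AS64496 is just a placeholder from the documentation range, not a real registration), a multi-homed prefix would simply have a second route object sitting alongside the first:
route: 86.128.0.0/10
descr: BT Public Internet Service
origin: AS64496
A filter built from the registry would then accept announcements of 86.128.0.0/10 from either origin AS.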
So far, so simple. If AS7007 starts to advertise that network, the advertisement is inconsistent with the registered origin and the advertisement is ignored. The route shouldn't be accepted, and traffic should flow as BT hopes it will.
Snag is that would require route filters to be automagically configured based on RPSL and the relevant RIR entries. Although there are toolsets to support this, it would take a brave neteng to rely on it given the potential for it to automagically break everything. And then there's the challenge that not every network has a route object defined, accurate maintainers, or someone available at 0300 who knows the password to update an entry.
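For what it's worth, the toolsets in question typically work along the lines of bgpq3/bgpq4: query the IRR, spit out a vendor-format prefix-list, push it at the router on a schedule. A sketch of the sort of thing that looks like (exact flags and output vary by version, so treat this as illustrative rather than gospel):
bgpq4 -4 -l AS2856-IN AS2856
no ip prefix-list AS2856-IN
ip prefix-list AS2856-IN permit 86.128.0.0/10
...
Wire that into config management on a cron job and your filters track the registry automatically - which is exactly where the "automagically break everything" worry comes in, because a fat-fingered registry object now flows straight into the router config.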
That all becomes FUN and often becomes apparent when you lose remote login to your core router due to colossal volumes of traffic suddenly thinking you're their new, bestest transit network. Especially if you don't have out-of-band access to your routers so you can kill BGP/shut down interfaces. Otherwise congestion can mean you can't access your router, or peering sessions collapse due to lost BGP keepalives and your router(s) suddenly get very busy trying to recalculate routes as peering sessions flap merrily in the breeze. Eventually flap dampening might buy some breathing space.
And given the Mk4 Internet is running at around 750K routes, that can be a lot of work for a poor, struggling router. Or just a bigger mess to try to manage automagically.
Just another pause for thought moment.
All these people and companies who, like me, understand nothing about the global infrastructure, but who, unlike me, are prepared to stick their life and work on a remote server somewhere.
Well, how confident are they that they'll be able to access their stuff when they need to?
And if they are, why the f*** are they?
They don't look at it like that: what does downtime matter when they are saving the money?
This is prevalent all through society. That downtime will eventually lose them customers, which will lose them more money than they saved from putting it in the cloud. Same goes for zero-hours contracts: great idea, let's treat our workers like slaves to save money. Do they really think that's going to be sustainable with a demotivated workforce that really doesn't care? Say goodbye to your customers. Outsourcing is another prime example: let's get rid of the people who know how to do the job and have done it for years to save some money, only to find at a later date that the people you outsourced it to can't do the job. Bye bye customers.
People will never learn.
All these people and companies who, like me, understand nothing about the global infrastructure, but who, unlike me, are prepared to stick their life and work on a remote server somewhere.
Yes! And the fools rely on shipping companies to move goods! And telcos for communication! And utilities to provide electricity! And banks to provide a financial system! And governments to maintain civil order!
When will they learn? You can't rely on anything. Just give up now.
False analogy.
1.) If you put stuff on a ship (or plane etc) it's because that's core to what your business does. Your analogy works for an ISP. Not a data owner.
2.) Only a portion of the stock is on any one ship usually. Historically, of course, an entire company might depend on a single shipment of silk or spices. So yes, they might literally go under. Not usually these days.
3.) You can insure against a boat sinking. But intangible data loss, reputational damage and lawsuits might prove a bit too expensive, even if it's possible to get such an indemnity policy.
At the same time, the realization that something as simple as a regional ISP misconfiguring a server could trigger a global outage does not sit particularly well either.
Worth noting that this isn't a global outage of all internet traffic, or even all traffic to Google.
As the article says, it was "primarily propagated by business-grade transit providers and did not impact consumer ISP networks as much"
So it was mostly "just" business-grade providers connecting through one region that lost access to one company's services.
Disruptive? Yes, especially when that one company is as big as Google.
Not quite the end of the internet though.
And Telstra in Australia decided to route a good chunk of the domestic Internet to Melbourne and two very confused routers that sat there bouncing packets back and forth until their TTL ran out.
Halved our servers' traffic for an hour, and Telstra doesn't handle any transit or peering for us at all!