Google told BGP to forget its Euro-cloud – after first writing bad access control lists

Mike 125

ROOT CAUSE

"Google’s underlying networking control plane consists of multiple distributed components that make up the Software Defined Networking (SDN) stack. These components run on multiple machines so that failure of a machine or even multiple machines does not impact network capacity. To achieve this, the control plane elects a leader from a pool of machines to provide configuration to the various infrastructure components. The leader election process depends on a local instance of Google’s internal lock service to read various configurations and files for determining the leader. The control plane is responsible for Border Gateway Protocol (BGP) peering sessions between physical routers connecting a cloud zone to the Google backbone."

See, we have this 'system stuff' which is incredibly reliable. But it's terribly complex. It turns out we don't really understand its full dynamic failure modes ourselves, but we don't admit that ;-)
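
For anyone who hasn't met the pattern: 'leader election via a lock service' mostly means the control-plane tasks all race to grab a lock file, and whoever holds it is the boss. A toy Python sketch of the idea (every name and API below is invented; the real thing is the Chubby-descended internal lock service the report alludes to):

    import threading

    class ToyLockService:
        """Toy stand-in for the internal lock service. All names invented."""
        def __init__(self):
            self._mutex = threading.Lock()
            self._holders = {}

        def try_acquire(self, path, task):
            # First task to grab the lock file wins; re-acquiring is a no-op.
            with self._mutex:
                return self._holders.setdefault(path, task) == task

    def elect_leader(svc, tasks):
        # Whoever holds the lock file becomes leader and pushes config
        # (BGP sessions included) to the routers.
        return next((t for t in tasks if svc.try_acquire("/sdn/leader", t)), None)

    svc = ToyLockService()
    print(elect_leader(svc, ["task-a", "task-b", "task-c"]))   # -> task-a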

"Google’s internal lock service provides Access Control List (ACLs) mechanisms to control reading and writing of various files stored in the service. A change to the ACLs used by the network control plane caused the tasks responsible for leader election to no longer have access to the files required for the process."

Someone changed some 'system stuff' and for some reason, it all fucked up :-O
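
More precisely, roughly this (toy code, names invented): the lock service checks an ACL before a task may touch the election files, so removing the entry doesn't crash anything; the election just quietly stops producing a leader:

    acls = {"/sdn/leader": {"netcontrol-task"}}   # principal allowlist

    def try_acquire(path, task):
        if task not in acls.get(path, set()):
            raise PermissionError(f"{task} denied on {path}")
        return True   # (lock bookkeeping elided)

    def elect_leader(tasks):
        for t in tasks:
            try:
                if try_acquire("/sdn/leader", t):
                    return t
            except PermissionError:
                pass   # can't read the election files, can't stand for leader
        return None

    print(elect_leader(["netcontrol-task"]))   # -> netcontrol-task
    acls["/sdn/leader"].clear()                # the bad ACL change lands
    print(elect_leader(["netcontrol-task"]))   # -> None: headless control plane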

"The production environment contained ACLs not present in the staging or canary environments due to those environments being rebuilt using updated processes during previous maintenance events. This meant that some of the ACLs removed in the change were in use in europe-west2-a, and the validation of the configuration change in testing and canary environments did not surface the issue."

Our 'system stuff' is so reliable that we don't really need to validate changes properly before rollout. So we didn't. We just validated any old configuration :-~
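
The gap fits in ten lines (environment contents invented, but this is the shape of it): validate a change only against environments that no longer use the old ACLs, and the breakage stays invisible until prod:

    in_use = {
        "staging":        {"acl-new-a", "acl-new-b"},   # rebuilt, new layout
        "canary":         {"acl-new-a", "acl-new-b"},   # rebuilt, new layout
        "europe-west2-a": {"acl-new-a", "acl-old"},     # never rebuilt
    }
    removed = {"acl-old"}

    for env, deps in in_use.items():
        broken = deps & removed
        print(env, "->", "BREAKS: " + ", ".join(broken) if broken else "ok")
    # staging -> ok, canary -> ok ... and europe-west2-a -> BREAKS: acl-old,
    # which no pre-prod gate ever got to see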

"Google's resilience strategy relies on the principle of defense in depth. Specifically, despite the network control infrastructure being designed to be highly resilient, the network is designed to 'fail static' and run for a period of time without the control plane being present as an additional line of defense against failure."

Our system stuff is incredibly reliable, so reliable that it'll kind of 'appear' to run normally, even when completely knackered! Isn't that just great? :-}
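
'Fail static', sketched (timings invented; the report only says 'several minutes'): with no leader around, the routers keep announcing the last-known-good routes for a grace period, then withdraw the lot:

    GRACE_MIN = 3   # stand-in for "several minutes"

    def advertised(leader_present, last_good, minutes_leaderless):
        if leader_present:
            return last_good          # control plane driving, all normal
        if minutes_leaderless <= GRACE_MIN:
            return last_good          # fail static: *looks* perfectly healthy
        return []                     # grace expired: BGP routes withdrawn

    last_good = ["zone prefixes via backbone"]
    leaderless = 0
    for minute in range(6):
        have_leader = (minute == 0)   # leader lost after minute 0
        leaderless = 0 if have_leader else leaderless + 1
        print(minute, advertised(have_leader, last_good, leaderless))
    # minutes 1-3 look exactly like minute 0; at minute 4 the zone vanishes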

"The network ran normally for a short period - several minutes - after the control plane had been unable to elect a leader task. After this period, BGP routing between europe-west2-a and the rest of the Google backbone network was withdrawn, resulting in isolation of the zone and inaccessibility of resources in the zone."

Our completely knackered system stuff ran for several minutes. I know! Amazing! Unfortunately, during that time nobody actually managed to spot its complete knackerement because, well, why would they? They weren't even looking; our system stuff is incredibly reliable :-)
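
Which is the real lesson buried in there: during a fail-static window the data plane is lying to you, so the alert has to watch the control-plane symptom (no leader), not dropped traffic. A toy check (thresholds invented):

    GRACE_S = 300         # fail-static budget
    ALERT_AFTER_S = 60    # page well inside that budget

    def check(leaderless_for_s):
        if leaderless_for_s >= ALERT_AFTER_S:
            return (f"PAGE: no SDN leader for {leaderless_for_s}s; "
                    f"{GRACE_S - leaderless_for_s}s before routes withdraw")
        return "ok"

    print(check(30))    # ok -- nothing data-plane-side would fire either
    print(check(120))   # pages with 180s still on the clock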

Very soon, our system stuff fell over completely, causing visible errors, which we weren't expecting AT ALL.

So why did our system, taken as a whole, fail to be resilient? Well, it's 'system stuff' and it's terribly complex. So. Hmmm... we don't... really... know... :-(
