I learned SRE at G in 2015-6 supporting (primarily) Hangouts. This overlapped the rollout of GCP, but I was not involved with that effort at all.

We had a number of rules that were inviolable. Rule #1 on that list: the minimum number is three. If you were not running in three DCs (with non-overlapping maintenance schedules), and on at least three servers in each, we would not talk to you (you being an internal G team).

If you want resilience, you MUST be able to handle simultaneous scheduled & unscheduled outages, both at the DC level & at the level of the individual servers.
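
To make the arithmetic behind "three" concrete: one unit can be down for scheduled maintenance, a second can fail unexpectedly, and a third is still serving. Below is a minimal sketch of a check for that rule. It is only an illustration, not any actual G tooling; the `DC` record, the hour-based maintenance windows, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

MINIMUM = 3  # one down for maintenance, one failed, one still serving


@dataclass(frozen=True)
class DC:
    """Hypothetical description of one datacenter in a deployment."""
    name: str
    servers: int
    # Assumed representation: (start_hour, end_hour) within a week, half-open.
    maintenance: Tuple[int, int]


def windows_overlap(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """True if two half-open [start, end) windows share any time."""
    return a[0] < b[1] and b[0] < a[1]


def rule_of_three_violations(dcs: List[DC]) -> List[str]:
    """Return a list of human-readable problems; empty means the rule holds."""
    problems = []
    if len(dcs) < MINIMUM:
        problems.append(f"only {len(dcs)} DCs, need at least {MINIMUM}")
    for dc in dcs:
        if dc.servers < MINIMUM:
            problems.append(
                f"{dc.name}: only {dc.servers} servers, need at least {MINIMUM}"
            )
    # Every pair of DCs must have non-overlapping maintenance windows.
    for i, a in enumerate(dcs):
        for b in dcs[i + 1:]:
            if windows_overlap(a.maintenance, b.maintenance):
                problems.append(f"{a.name} and {b.name}: maintenance windows overlap")
    return problems


if __name__ == "__main__":
    deployment = [
        DC("dc-a", 3, (0, 4)),
        DC("dc-b", 3, (4, 8)),
        DC("dc-c", 3, (8, 12)),
    ]
    print(rule_of_three_violations(deployment))
```

A deployment that returns an empty list can lose one DC to planned work and another to an unplanned outage at the same time and still have a DC serving; the same reasoning applies to the servers inside each DC.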

This is NOT cheap. SRE can tell you how to do it without exploding your cost, however.

Set up this way, you can simulate scenarios like this one as training. (We preferred Tuesdays.)

