Reply to post: Have all of you misunderstood/read the Google report?

Google reveals version control plus not expecting zero as a value caused Gmail to take an inconvenient early holiday

SecretSonOfHG

Have all of you misunderstood/read the Google report?

This is an update to an authentication service, not an end user facing application. So there is little use for any "staging" or "validation" environment for anyone to test. The engineers are more than able to validate the change by themselves during service testing and load/fail over testing.

And in case you've not read the article, the failure was due to leaving the old quota system in place instead of switching it off, and that happened back in october. Yes, that was a time bomb, but when doing these kind of infrastructure updgrades leaving the legacy system switched on instead of turning it off is best practice, as it speeds up the rollback in case you need it. The only fault on Google's side is that they forgot to switch the old system off after more than a month of running the new one, and that caused the outage.

It is just impossible to detect that kind of ommission in any "staging" or "testing" or "pre-anything" environment. The only purpose of having the legacy system running in these environments is to test that you can rollback your changes quickly without breaking it. Unless you leave such environment running for more than month injecting a realistic worload, you'll never, ever catch the omission. Not that it does not make much sense to spend two months running a simulated production environment, as you only have to make a note of switching off the legacy system once everything is working smoothly with the new one. And remember to do it. Which wasn't the case.

POST COMMENT House rules

Not a member of The Register? Create a new account here.

  • Enter your comment

  • Add an icon

Anonymous cowards cannot choose their icon