
Rolling update causes outage
"apiserver timeouts after rolling-update of etcd cluster"
Back in the day of steam-driven computers, we were taught to never, ever update a live system blind. First test on a test rig before rolling out to the live system, or at least have a working roll-back procedure in place: one that won't fail because the rolling update borked communication with the server and it can no longer find its configuration data.
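For what it's worth, that guard-rail doesn't have to be elaborate. Below is a minimal sketch in Python of "push, verify, roll back on failure", assuming kubectl is on PATH and pointed at the target cluster; the deployment name, image and health endpoint are made up, and the Monzo incident involved an etcd cluster rather than an ordinary Deployment, so treat it as an illustration of the idea, not what they should have run verbatim.

#!/usr/bin/env python3
"""Sketch of a 'verify, then roll back' wrapper around a rolling update.
Hypothetical names throughout; assumes kubectl is installed and configured."""

import subprocess
import sys
import urllib.request

DEPLOYMENT = "deployment/linkerd"          # hypothetical workload name
NEW_IMAGE = "linkerd=example/linkerd:v2"   # hypothetical container=image pair
HEALTH_URL = "http://localhost:9990/ping"  # hypothetical health endpoint


def kubectl(*args: str) -> None:
    """Run a kubectl command, raising if it exits non-zero."""
    subprocess.run(["kubectl", *args], check=True)


def healthy() -> bool:
    """Crude post-rollout smoke test against a health endpoint."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


def main() -> int:
    # Start the rolling update.
    kubectl("set", "image", DEPLOYMENT, NEW_IMAGE)
    try:
        # Block until the rollout completes (or times out).
        kubectl("rollout", "status", DEPLOYMENT, "--timeout=120s")
        if healthy():
            print("rollout verified")
            return 0
        raise RuntimeError("health check failed after rollout")
    except (subprocess.CalledProcessError, RuntimeError) as exc:
        # The roll-back path: revert to the previous revision.
        print(f"rolling back: {exc}", file=sys.stderr)
        kubectl("rollout", "undo", DEPLOYMENT)
        return 1


if __name__ == "__main__":
    sys.exit(main())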
"Beattie posted an analysis of the incident and lay the blame on Kubernetes"
NO NO NO, the blame lies with whoever at Monzo rolled out the update without first verifying it, or at least having a working roll-back procedure in place.
"To restore service, they turned to an updated version of linkerd being tested in the company's staging environment."
Is this the same 'staging environment' you didn't bother to test the rolling update on in the first place?