Place your bets!
Which was it this time? Was it DNS? Wrong window syndrome? Regex gone wild? Fat-fingered an rm -rf? Or did someone forget to pay the onion bhaji tax to the local BOFH???
If you can't or couldn't access GitHub today, it's because the site broke itself. The Microsoft-owned code-hosting outfit says it made a change involving its database infrastructure, which sparked a global outage of its various services. The biz is now in the process of rolling back that update to recover. "We are …
Classic. When I started a chain of actions that triggered a Hangouts outage (mostly internal to Google, thankfully, but still really bad), it took me less than 10 minutes to decide to roll it back. Seriously, if your last change corresponds to the start of an outage, now is NOT the time to say "correlation does not imply causation". You aren't Aristotle. Be Beyes instead. Undoing that change is far more likely to fix things than to harm them.
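To put rough numbers on it: a back-of-the-envelope Bayes sketch with made-up priors, nothing rigorous, but it shows why the timing coincidence dominates.

```python
# Back-of-the-envelope Bayes: how likely is it that the last change caused the
# outage, given that the outage started right after it? Every number here is an
# illustrative assumption, not a measurement from any real incident.

prior_change_is_bad = 0.01            # assumed: 1 in 100 changes is outage-causing
p_outage_now_given_bad = 0.9          # assumed: a bad change usually breaks things quickly
p_outage_now_unrelated = 0.001        # assumed: chance of an unrelated outage in that window

# P(my change did it | outage started right after my change), by Bayes' rule
numerator = p_outage_now_given_bad * prior_change_is_bad
evidence = numerator + p_outage_now_unrelated * (1 - prior_change_is_bad)
posterior = numerator / evidence

print(f"P(my change did it) ~= {posterior:.2f}")   # ~0.90 with these invented numbers
```

Even starting from a fairly sceptical prior, the posterior lands around 0.9 with those (invented) numbers, which is why "roll it back now, philosophise later" is usually the right call.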
I am thinking it might have taken 30 minutes for the changes to propagate across the full extent of GitHub's infrastructure.
From bitter experience, hastily reversing an undesirable change midway through its taking full effect is often just adding another locomotive to the train wreck.
Taking time to collect the evidence and think through everything before potentially compounding the felony is one of the hallmarks of experience.
BTW, it's Bayes, not Beyes.
I’ve worked on some service windows where the plan has a point at which ‘going back now will exceed the service window length’. At that point it is VERY tempting to press on and fix any issues rather than roll back, and many times there’s a good chance that you win. I’ve worked on others where cabling changes, equipment changes and configuration changes are carefully coordinated between people, and even the thought of considering a rollback is anxiety inducing.

But I do not subscribe to the ‘don’t pull the knife out of the wound’ approach. A sensible plan has criteria at each stage that, if missed, dictate starting the documented rollback procedure (one you have tested yourself!) from that very point; see the sketch after this comment.

All very standard, or at least it is in many industries. How much effort goes into a proper plan and review process seems to depend on the consequences to the business.
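What I mean is something like the skeleton below. Purely illustrative, the stage/check/rollback hooks are invented, but it captures "every stage has a go/no-go criterion and a rollback you have already tested from exactly that point".

```python
# Illustrative skeleton of a change plan with explicit go/no-go criteria and a
# tested rollback from every stage. Stage names and hooks are invented.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Stage:
    name: str
    apply: Callable[[], None]       # makes the change for this stage
    check: Callable[[], bool]       # go/no-go criterion for this stage
    rollback: Callable[[], None]    # documented, pre-tested rollback from this exact point

def run_change(stages: List[Stage]) -> bool:
    applied: List[Stage] = []
    for stage in stages:
        stage.apply()
        applied.append(stage)
        if not stage.check():
            # Criterion missed: start the documented rollback from this point,
            # unwinding only the stages actually applied, in reverse order.
            for s in reversed(applied):
                s.rollback()
            return False
    return True
```

The point being that the rollback path from each stage exists on paper (and has been rehearsed) before the window opens, rather than being improvised once you are already over-running.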
So m$ wants to convince customers that its cloud is scalable and flexes with demand, yet it is too cheap to provision a pre-production instance for final change deployment and validation before going to production (a sketch of the missing gate is below).
Leadership is 'do as I do', not just 'do as I say'.
If you don't live best practice, don't expect to be taken seriously in enterprise IT...
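For what it's worth, the gate being skipped here is nothing exotic. A minimal sketch, assuming made-up deploy/validate hooks rather than anyone's real pipeline:

```python
# Minimal sketch of "deploy to pre-prod, validate, only then promote to prod".
# Environment names and the deploy/validate hooks are invented for illustration.

def deploy(environment: str, change_id: str) -> None:
    print(f"deploying {change_id} to {environment}")

def validate(environment: str) -> bool:
    # stand-in for real smoke tests / health checks against that environment
    return True

def release(change_id: str) -> None:
    deploy("preprod", change_id)
    if not validate("preprod"):
        # Pre-production validation failed: the change never reaches production.
        raise RuntimeError(f"{change_id} failed pre-production validation; not promoting")
    deploy("prod", change_id)
```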