The unforgivable reasons for the outage:
1. CrowdStrike's software read configuration files with a parser that would crash on configuration files it didn't like. That is absolutely unacceptable, had probably been the case for a long time, and may still be the case today. Performing the parsing at boot time obviously made things worse, but even without that it would have been an unforgivable problem.
2. A crash at boot time is a persistent problem: it is not fixed by rebooting once or twice, but requires manual intervention on every affected machine by a well-trained and trusted employee, which is why fixing it took so long and was so costly.
1. Would have been prevented by better development practices: a parser that rejects input it doesn't like instead of crashing on it.
2. Would have been prevented by making every update save the previous state, and allowing the blue screen of death to reboot with a known working configuration (as sketched below).
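Concretely, here is a minimal sketch of what fixes 1 and 2 might look like on the update path: validate a new configuration before activating it, and always keep the previously working copy for rollback. This is an illustration of the idea only; the file names and the Config structure are invented and nothing here is taken from CrowdStrike's actual code.

```go
// Hypothetical update path: reject bad configs, keep a known-good copy.
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

// Config stands in for whatever the sensor actually consumes; the fields are invented.
type Config struct {
	Version int      `json:"version"`
	Rules   []string `json:"rules"`
}

// parseAndValidate rejects malformed or implausible input instead of crashing on it.
func parseAndValidate(data []byte) (*Config, error) {
	var c Config
	if err := json.Unmarshal(data, &c); err != nil {
		return nil, fmt.Errorf("config does not parse: %w", err)
	}
	if c.Version <= 0 || len(c.Rules) == 0 {
		return nil, errors.New("config parses but fails sanity checks")
	}
	return &c, nil
}

// applyUpdate activates newFile only after it validates, and preserves the
// current config as "last known good" so boot code can fall back to it.
func applyUpdate(newFile, activeFile, knownGoodFile string) error {
	data, err := os.ReadFile(newFile)
	if err != nil {
		return err
	}
	if _, err := parseAndValidate(data); err != nil {
		// Bad update: refuse it and keep running on the current config.
		return fmt.Errorf("rejecting update: %w", err)
	}
	// Save the currently working config before switching to the new one.
	if cur, err := os.ReadFile(activeFile); err == nil {
		if err := os.WriteFile(knownGoodFile, cur, 0o600); err != nil {
			return err
		}
	}
	return os.WriteFile(activeFile, data, 0o600)
}

func main() {
	if err := applyUpdate("update.json", "active.json", "known_good.json"); err != nil {
		fmt.Println("update not applied:", err)
	}
}
```

The details don't matter; the invariant does: a bad update is rejected before it can become the active configuration, and the last configuration that actually worked is never overwritten.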
So the downtime should have been:
1. Computers being rebooted end up on a blue screen of death.
2. IT gets a few calls, then many.
3. IT figures out that some button needs to be pressed to reboot successfully.
4. IT tells the first user with a working brain and waits for a successful reboot.
5. From then on IT tells callers "press button X on the BSOD and tell all your colleagues".
6. IT walks around and does that for all users who were too afraid to press a button they didn't know.
So maybe an hour until many computers are back up, and three hours for everyone. Except people on holiday at the time, or people panicking and cutting their power cables, or department heads sending everyone home.
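The boot-time half of fix 2, which is what would make "press button X" possible at all, could look roughly like the sketch below. Same hypothetical setup as above, not a description of any real sensor or bootloader: a marker file records whether the last boot with the active configuration completed, and if it did not, the loader falls back to the known-good copy instead of crashing again.

```go
// Hypothetical boot-time fallback to the last known-good configuration.
package main

import (
	"fmt"
	"os"
)

// selectConfigAtBoot returns the path of the config to load this boot.
// bootMarker exists only while a boot with the active config is in progress;
// finding it at startup means the previous attempt never reached markBootOK.
func selectConfigAtBoot(activeFile, knownGoodFile, bootMarker string) string {
	if _, err := os.Stat(bootMarker); err == nil {
		// The previous boot with the active config crashed: use the saved copy.
		return knownGoodFile
	}
	// Record that we are about to try the active config.
	_ = os.WriteFile(bootMarker, []byte("booting"), 0o600)
	return activeFile
}

// markBootOK is called once the system is up and stable; removing the marker
// means the active config is trusted again on the next boot.
func markBootOK(bootMarker string) {
	_ = os.Remove(bootMarker)
}

func main() {
	cfg := selectConfigAtBoot("active.json", "known_good.json", "boot_in_progress")
	fmt.Println("loading config:", cfg)
	// ... load cfg, start services, then confirm the boot succeeded ...
	markBootOK("boot_in_progress")
}
```

Whether the fallback is triggered automatically or by a button on the BSOD is a detail; the point is that the machine always has a working configuration to fall back to.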