Tim Harford's book "Adapt: Why Success Always Starts with Failure" has some more great examples - a really interesting read.
Tolerating failure: From happy accidents to serious screwups … Time to look at getting it wrong, er, correctly
This correspondent has a confession to make: I’m not perfect and sometimes things don’t go as I hoped. I have made quite a few mistakes during the many years I’ve spent working with technology. What’s more, I see this as a good thing, and I am reassured by the fact that the famous late businessman, author and company …
Monday 21st June 2021 11:28 GMT Pascal Monett
"we should not be ashamed of things not going right"
Being a father, I have had the opportunity to watch my child learn to walk (many years ago now).
These days, I often reflect on that. An infant does not start walking from one day to the next. There's a lot of leg exercise to start with. That is followed by standing up, generally next to a low table or chair. It takes a few weeks before the infant is capable of taking its first few steps.
That is a lesson we all need to re-learn: it's okay to fail, because you never succeed 100% on the first try. Charlie Brown was right, failure is the best teacher.
Babies know that. Adults need to reacquaint themselves with the notion.
Monday 21st June 2021 11:52 GMT stiine
Re: "we should not be ashamed of things not going right"
You get to learn more when you've made an error.
The second thing you should learn is how to take responsibility for what you've done. Don't dither or delay. Notify your support team before your customers start calling.
The third thing you should learn is how to fix it.
The first thing you should have learned is the new way to break it.
Monday 21st June 2021 15:03 GMT Tom 38
Re: "we should not be ashamed of things not going right"
Although, I am slightly peeved at the developmental capabilities of the human infant. Ever seen a newborn foal, lamb or calf? Trotting around quite happily within minutes of entering this world. A 12 week old puppy can be trained to not shit in the house. Babies spend far too much time being irascible poop machines that require constant support and supervision.
Monday 21st June 2021 15:01 GMT Doctor Syntax
Monday 21st June 2021 23:50 GMT swm
When I was writing the Dartmouth time sharing exec there were errors (of course). Sometimes an I/O operation would complete successfully but, due to an error, the error path was taken. Rather than fix the code immediately we would check that the "error" was handled correctly. It is difficult to test very infrequent occurrences so anything that took the program down a little-used path was quite helpful.
I also learned not to trust the status of "good" after an I/O operation. Once, due to a hardware fault, the peripheral would return a good status even though all of the words had not been transferred. So I checked everything I could - last data control word, word count etc. and declared an error if anything I could lay my hands on did not match my expectations.
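As a rough illustration of that "don't trust the good status" habit, here is a minimal Python sketch. The `TransferResult` fields and the function name are hypothetical stand-ins, not anything from the Dartmouth exec; the point is only that the device's own status flag is one check among several.

```python
from dataclasses import dataclass


@dataclass
class TransferResult:
    """Hypothetical summary of what a peripheral reported after a transfer."""
    status_ok: bool          # the device's own "good" flag
    words_transferred: int   # word count the channel says it moved
    last_control_word: int   # index of the last data control word processed


def transfer_is_trustworthy(result, expected_words, expected_last_control_word):
    """Return True only if every independent check agrees with the "good" status."""
    checks = [
        result.status_ok,
        result.words_transferred == expected_words,
        result.last_control_word == expected_last_control_word,
    ]
    return all(checks)
```

A transfer that reports "good" but moved 60 of 64 expected words is rejected here, which is the hardware-fault case described above.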
I also learned that in the exec there were no "errors" - everything had to be handled. In case of a disk error I retried 3 times (logging the first attempt on the console typewriter even though the error was recovered) and, if the error was not recoverable, gave an error status back to the user (with another log entry on the console typewriter).
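The retry policy in that paragraph might look something like the following Python sketch. It assumes one initial attempt followed by up to three retries, which is one reading of "retried 3 times"; `read_once` and `log` are hypothetical callables, with `log` standing in for the console typewriter.

```python
def read_with_retry(read_once, log, max_retries=3):
    """Call read_once() and retry on failure.

    read_once: hypothetical callable returning (ok, data).
    log: callable taking a message (stands in for the console typewriter).
    Returns (ok, data); ok is False if every attempt failed.
    """
    ok, data = read_once()
    if ok:
        return True, data
    # Log the first failure even if a later retry recovers it.
    log("disk error: first attempt failed, retrying")
    for _ in range(max_retries):
        ok, data = read_once()
        if ok:
            return True, data
    # Nothing worked: log again and hand an error status back to the caller.
    log("disk error: unrecoverable after retries")
    return False, None
```

A recovered error leaves one log entry, an unrecoverable one leaves two, and the caller always gets a status back rather than an unhandled failure.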
Even so, mistakes were made.
Another policy was to tell the operators that, if they made a mistake, to talk about it and they would not get into trouble.
Tuesday 22nd June 2021 13:21 GMT Keith Langmead
Tuesday 22nd June 2021 18:11 GMT Boris the Cockroach
Re: Fixing errors
"Another policy was to tell the operators that, if they made a mistake, to talk about it and they would not get into trouble."
This is the line I use all the time on our operators: own up, confess to having borked things, then we can either give you a bit more/different training or re-write the programs to prevent you (or anyone else) from making the same borkage.
Had one machine smash a 6 inch tool straight into the job when the operator hit start... needless to say this was very loud and everyone went to brown alert.
On finally persuading said operator to confess, it turned out that he'd slammed the safety door shut faster than he should have done. The PLC picked up the door as being shut and allowed start to be pressed, which then commanded the door lock to come in and lock the door... however, the lock bolt could not engage, and so the machine threw an error: "door not locked".
Operator then shut the door properly and hit start.
Everything locked, but the machine had a read-ahead buffer of 5 lines, so when it hit the first error, it dumped the buffer; on the 2nd start it started from 5 lines in... and missed out a vital setup command... hence the crash. (This fault diagnosis took me the better part of half a day.)
So procedure and training were changed so that on a "door not locked" alarm, the operator(s) were to find a setter and get them to reset the machine, and there were no more crashes.
Plus, the operators were far more open about borkages as they saw that they would not be fired for causing a major borkage (unless they did the same thing 3 times in a row and got told to leave the building via the wood chipper...).
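For anyone who wants to see the read-ahead failure in slow motion, here is a toy Python model. It is a guess at behaviour consistent with the story above, not the real controller's logic: the `Controller` class, the program lines and the `rewind_on_alarm` flag are all made up for illustration.

```python
READ_AHEAD = 5  # buffer depth taken from the comment


class Controller:
    """Toy model of a controller with a read-ahead buffer (not a real CNC API)."""

    def __init__(self, program):
        self.program = program
        self.next_line = 0      # index of the next line to fetch from the program
        self.executed = []

    def start(self, door_locked, rewind_on_alarm=False):
        # Assumption: lines are pre-fetched before the interlock is checked.
        buffer = self.program[self.next_line:self.next_line + READ_AHEAD]
        self.next_line += len(buffer)
        if not door_locked:
            buffer.clear()              # alarm: buffered lines are thrown away
            if rewind_on_alarm:
                self.next_line = 0      # what a proper reset would do
            return "ALARM: door not locked"
        self.executed.extend(buffer)
        self.executed.extend(self.program[self.next_line:])
        self.next_line = len(self.program)
        return "OK"


program = ["N1", "N2 SET TOOL OFFSET", "N3", "N4", "N5",
           "N6 RAPID TO PART", "N7 CUT"]

machine = Controller(program)
machine.start(door_locked=False)   # door slammed, bolt cannot engage
machine.start(door_locked=True)    # operator shuts the door and hits start again
print(machine.executed)            # the setup line N2 never ran
```

In this model the second start runs only N6 and N7, so the setup command is skipped. Calling `start` with `rewind_on_alarm=True` on the failed attempt restores the full program, which is roughly what getting a setter to reset the machine achieves.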
Tuesday 22nd June 2021 13:28 GMT Keith Langmead
“Principle of Least Access”
Aside from the obvious security benefits, my favourite side effect of a properly laid out “Principle of Least Access” is that it can sometimes make tracking down the source of an issue much faster. Had a customer suffer a ransomware attack in the past, and we were able to quickly say:
"OK, content in folders A, E and F has been encrypted, but not the other folders. Which user or users have access to only that specific set of folders? Focus our investigation on their machines so we can find the culprit, get it disconnected from the network, and get the borked data recovered from backup".
Not the only way to track things down, but sometimes you get lucky and can either immediately identify the infected machine, or at least massively narrow down the scope of the search.
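That narrowing-down step can be sketched in a few lines of Python. The `acl` mapping, the user names and the `suspects` function are all hypothetical; in practice the access data would come from whatever the file server or directory service exposes. It is also only a heuristic, since an attack caught midway may not have touched every folder the account can reach.

```python
def suspects(acl, encrypted, untouched):
    """Return users who can write to every encrypted folder and to no untouched one.

    acl: dict mapping user name -> set of folders that user can write to.
    """
    encrypted, untouched = set(encrypted), set(untouched)
    return sorted(
        user for user, folders in acl.items()
        if encrypted <= folders and not (folders & untouched)
    )


# Made-up example data matching the folders in the comment.
acl = {
    "alice": {"A", "E", "F"},
    "bob": {"A", "B", "C"},
    "carol": {"A", "B", "C", "D", "E", "F"},
}
print(suspects(acl, ["A", "E", "F"], ["B", "C", "D"]))  # ['alice']
```

Carol is excluded because her access also covers folders that were left alone, which only works as a filter when permissions are as tight as least access intends.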
Friday 30th July 2021 18:39 GMT J. Cook
I will draw from the sage advice from the 70 Maxims of Maximally Effective Mercenaries, specifically the last one:
"Failure isn't an option. It's mandatory. The option is whether or not to let failure be the last thing you do."
I've noted in my career that failure is an excellent teacher, as long as something is learned from it.
Saturday 31st July 2021 09:26 GMT Anonymous Coward
Some of the time it's a case of even identifying that a mistake has been made or a fault has occurred.
Had one recently - random systems had the same type of configuration issue crop up. The issue was similar in nature. It was fixed on each occurrence and the world kept on turning. I just happened to notice the pattern no-one else did (and that was only because I was cc'd on email - I wasn't working the problem), tracked it back to someone going around a list of systems and performing some work with an unexpected side effect (rather ironically the work being performed was tested, but the issue was not identified as it matched the test system's configuration).
Mistakes will be made. Faults will occur. The important thing is to learn from them (at least in my opinion).