Tolerating failure: From happy accidents to serious screwups … Time to look at getting it wrong, er, correctly

This correspondent has a confession to make: I’m not perfect and sometimes things don’t go as I hoped. I have made quite a few mistakes during the many years I’ve spent working with technology. What’s more, I see this as a good thing, and I am reassured by the fact that the famous late businessman, author and company …

  1. Tim 11

    Tim Harford's book "Adapt: Why Success Always Starts with Failure" has some more great examples - a really interesting read.

  2. Anonymous Coward

    sorry about windows


  3. Pascal Monett Silver badge

    "we should not be ashamed of things not going right"

    Being a father, I have had the opportunity to watch my child learn to walk (many years ago now).

    These days, I often reflect on that. An infant does not start walking from one day to the next. There's a lot of leg exercise to start with. That is followed by standing up, generally next to a low table or chair. It takes a few weeks before the infant is capable of taking its first few steps.

    That is a lesson we all need to re-learn: it's okay to fail, because you never succeed 100% on the first try. Charlie Brown was right, failure is the best teacher.

    Babies know that. Adults need to reacquaint themselves with the notion.

    1. stiine Silver badge

      Re: "we should not be ashamed of things not going right"

      You get to learn more when you've made an error.

      The second thing you should learn is how to take responsibility for what you've done. Don't dither or delay. Notify your support team before your customers start calling.

      The third thing you should learn is how to fix it.

      The first thing you should have learned is the new way to break it.

      1. 89724102172714182892114I7551670349743096734346773478647892349863592355648544996312855148587659264921

        Re: "we should not be ashamed of things not going right"

        Learn lessons from a failure, but never accept culpability: Fight like hell.

    2. Antron Argaiv Silver badge

      Re: "we should not be ashamed of things not going right"

      As someone who is watching (via Facetime and videos...thanks, Mom!) his new grandson go through this exact process at the moment, I'm getting a kick out of this.

    3. Tom 38

      Re: "we should not be ashamed of things not going right"

      Although, I am slightly peeved at the developmental capabilities of the human infant. Ever seen a newborn foal, lamb or calf? Trotting around quite happily within minutes of entering this world. A 12-week-old puppy can be trained to not shit in the house. Babies spend far too much time being irascible poop machines that require constant support and supervision.

      1. doublelayer Silver badge

        Re: "we should not be ashamed of things not going right"

        That is the downside of our large brains. Unless we can better evolve the birth process to deal with the growth in head size, that's more convenient than the alternative most less intelligent mammals have taken.

  4. Doctor Syntax Silver badge

    Learning from somebody else's mistakes is far better than learning from your own.

    1. Robert Moore

      > Learning from somebody else's mistakes is far better than learning from your own.

      Unfortunately, it is also impossible for many people.

  5. swm

    Fixing errors

    When I was writing the Dartmouth time sharing exec there were errors (of course). Sometimes an I/O operation would complete successfully but, due to an error, the error path was taken. Rather than fix the code immediately we would check that the "error" was handled correctly. It is difficult to test very infrequent occurrences so anything that took the program down a little-used path was quite helpful.

    I also learned not to trust the status of "good" after an I/O operation. Once, due to a hardware fault, the peripheral would return a good status even though all of the words had not been transferred. So I checked everything I could - last data control word, word count etc. and declared an error if anything I could lay my hands on did not match my expectations.

    I also learned that in the exec there were no "errors" - everything had to be handled. In case of a disk error I retried 3 times (logging the first attempt on the console typewriter even though the error was recovered) and, if the error was not recoverable, gave an error status back to the user (with another log entry on the console typewriter).

    Even so, mistakes were made.

    Another policy was to tell the operators that, if they made a mistake, to talk about it and they would not get into trouble.
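
    The checking and retry policy described above could be sketched roughly as follows. This is a minimal illustration in Python, not the original exec code: `TransferResult`, its fields, `read_with_retry` and the log messages are all hypothetical names, and only the word count stands in for "everything I could lay my hands on".

```python
from dataclasses import dataclass
from typing import Callable

MAX_RETRIES = 3  # the comment mentions retrying a disk error 3 times


@dataclass
class TransferResult:
    # Hypothetical fields; a real device would report more than this.
    status_ok: bool          # status word reported by the peripheral
    words_transferred: int   # word count seen after the transfer


def transfer_is_good(result: TransferResult, expected_words: int) -> bool:
    # Do not trust "good" on its own: cross-check it against the word count.
    return result.status_ok and result.words_transferred == expected_words


def read_with_retry(do_transfer: Callable[[], TransferResult],
                    expected_words: int,
                    log: Callable[[str], None]) -> bool:
    # One initial attempt plus up to MAX_RETRIES retries.
    for attempt in range(1, MAX_RETRIES + 2):
        result = do_transfer()
        if transfer_is_good(result, expected_words):
            return True
        if attempt == 1:
            # Logged even if a later retry recovers the error.
            log("disk error on first attempt")
    log("disk error not recoverable")
    return False  # caller hands an error status back to the user
```

    Logging the first failure even when a retry succeeds keeps a record of recovered faults, which is what makes an intermittent hardware problem visible before it becomes a hard one.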

    1. Keith Langmead

      Re: Fixing errors

      "Another policy was to tell the operators that, if they made a mistake, to talk about it and they would not get into trouble."

      Absolutely, making an honest mistake shouldn't get you in trouble, but trying to cover one up should.

    2. Boris the Cockroach Silver badge

      Re: Fixing errors


      "Another policy was to tell the operators that, if they made a mistake, to talk about it and they would not get into trouble."

      This is the line I use all the time on our operators: own up, confess to having borked things, then we can either give you a bit more/different training or re-write the programs to prevent you (or anyone else) from making the same borkage.

      Had one machine smash a 6 inch tool straight into the job when the operator hit start... needless to say this was very loud and everyone went to brown alert.

      On finally persuading said operator to confess, it turned out that he'd slammed the safety door shut faster than he should have done. The PLC picked up the door as being shut and allowed start to be pressed, which then commanded the door lock to come in and lock the door... however the lock bolt could not engage, and so the machine threw an error: "door not locked".

      Operator then shut the door properly and hit start.

      Everything locked, but the machine had a read-ahead buffer of 5 lines, so when it hit the first error, it dumped the buffer; on the 2nd start it started from 5 lines in... and missed out a vital setup command... hence the crash. (This fault diagnosis took me the better part of half a day.)

      So procedure and training were changed so that on a "door not locked" alarm, the operator(s) were to find a setter and get them to reset the machine. And no more crashes.

      Plus the operators were far more open about borkages as they saw that they would not be fired for causing a major borkage (unless they did the same thing 3 times in a row and got told to leave the building via the wood chipper...).
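
      For anyone trying to picture the read-ahead fault, here is a toy model in Python. It is purely an assumption about how such a controller might behave, not the actual machine's firmware: the `Controller` class, the `rewind_on_error` flag and the program line names are invented for illustration.

```python
from collections import deque

READ_AHEAD = 5  # buffer depth mentioned in the comment


class Controller:
    """Toy model of a machine controller that prefetches program lines."""

    def __init__(self, program, rewind_on_error):
        self.program = program
        self.next_to_fetch = 0            # index of the next line to prefetch
        self.buffer = deque()             # read-ahead buffer
        self.rewind_on_error = rewind_on_error
        self.executed = []                # lines that actually ran

    def _fill(self):
        # Top the buffer up to READ_AHEAD lines, if the program has any left.
        while len(self.buffer) < READ_AHEAD and self.next_to_fetch < len(self.program):
            self.buffer.append(self.program[self.next_to_fetch])
            self.next_to_fetch += 1

    def start(self, door_locked):
        self._fill()
        if not door_locked:
            # Fault path: prefetched lines are thrown away.
            dropped = len(self.buffer)
            self.buffer.clear()
            if self.rewind_on_error:
                # The safe behaviour: step back so nothing is skipped.
                self.next_to_fetch -= dropped
            return "door not locked"
        while True:
            self._fill()
            if not self.buffer:
                break
            self.executed.append(self.buffer.popleft())
        return "done"


program = ["SETUP", "N2", "N3", "N4", "N5", "N6", "N7"]
buggy = Controller(program, rewind_on_error=False)
buggy.start(door_locked=False)   # first press: alarm, buffer dumped
buggy.start(door_locked=True)    # second press: resumes 5 lines in
print(buggy.executed)            # SETUP never ran
```

      In this model the fix is either to rewind the fetch position when the buffer is dumped or, as in the procedural change described above, to force a full reset before the next start.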

  6. Keith Langmead

    “Principle of Least Access”

    Aside from the obvious security benefits, my favourite side effect of a properly laid out “Principle of Least Access” is it can sometimes make tracking down the source of an issue much faster. Had a customer suffer from a ransomware attack in the past, and being able to quickly say:

    "OK, content in folders A, E and F has been encrypted, but not the other folders. Which user or users have access to only that specific set of folders? Focus our investigation on their machines so we can find the culprit, get it disconnected from the network, and get the borked data recovered from backup".

    Not the only way to track things down, but sometimes you get lucky and can either immediately identify the infected machine, or at least massively narrow down the scope of the search.
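
    As a rough sketch of that narrowing step, assuming the permissions can be exported as a simple mapping of user to writable folders (the `acl` layout, the user names and the `suspects` function are all hypothetical, not any particular product's API):

```python
from typing import Dict, List, Set


def suspects(acl: Dict[str, Set[str]], affected: Set[str]) -> List[str]:
    """Narrow down which accounts could have touched the affected folders.

    First look for users whose access is exactly the affected set (the
    lucky case). If there are none, fall back to every user whose access
    covers all of the affected folders.
    """
    exact = sorted(u for u, folders in acl.items() if folders == affected)
    if exact:
        return exact
    return sorted(u for u, folders in acl.items() if affected <= folders)


# Hypothetical export: user -> folders they can write to
acl = {
    "alice": {"A", "E", "F"},
    "bob": {"A", "B"},
    "carol": {"A", "B", "C", "D", "E", "F"},
}

print(suspects(acl, {"A", "E", "F"}))  # ['alice']
```

    The tighter the permissions, the more often the exact match returns a single name; with everyone granted access to everything, the fallback list is the whole company.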

  7. J. Cook Silver badge

    I will draw from the sage advice from the 70 Maxims of Maximally Effective Mercenaries, specifically the last one:

    "Failure isn't an option. It's mandatory. The option is whether or not to let failure be the last thing you do."

    I've noted in my career that failure is an excellent teacher, as long as something is learned from it.

  8. Anonymous Coward

    Some of the time it's a case of even identifying that a mistake has been made or a fault has occurred.

    Had one recently - random systems had the same type of configuration issue crop up. The issue was similar in nature. It was fixed on each occurrence and the world kept on turning. I just happened to notice the pattern no one else did (and that was only because I was cc'd on email - I wasn't working the problem), tracked it back to someone going around a list of systems and performing some work with an unexpected side effect (rather ironically the work being performed was tested, but the issue was not identified as it matched the test system's configuration).

    Mistakes will be made. Faults will occur. The important thing is to learn from them (at least in my opinion).

  9. Bruce Ordway


    >>> unplugging everything in the old data centre

    >>> reconnecting it wrongly in the new one.

    Ha, I've been called in to several sites to resolve cases exactly like this.
