Whatever you do, don't show initiative if you value your job

We've covered backups before in the annals of this column, but a bit of helpfulness that turned into a bonfire of the binaries? Start your Monday with a lesson in not taking the initiative. Our story comes from "Harry" (not his name) who was working for a major medical products company back in the days when Windows XP was the …

  1. Anonymous Coward
    Anonymous Coward

    "So was James truly the guilty party?"

    It seems that James was cocky and arrogant enough to wreck the system because 'he knew best', so yes, he was guilty and being shown the door wasn't unjust.

    However, the idiot who left him on his own overnight with administrative access to critical systems, when he was still wet behind the ears, should bear more guilt. If that was Harry, then he should have been shuffled off too. Assuming he was expendable, of course.

    1. b0llchit Silver badge
      FAIL

      Re: "So was James truly the guilty party?"

      Ehm, and the company with no, or very bad, change-management procedures for critical production systems. The company and its management are clearly at fault here.

      1. My-Handle

        Re: "So was James truly the guilty party?"

        Sounds like there's plenty of fault to hand around here. This wouldn't have occurred if any of the following had happened:

        - James had not had his brilliant idea / decided to check with someone else before doing it on a critical system

        - Harry had realised that letting a newbie loose on a critical system was a disaster in the making

        - The company had a decent change-management / permissions system on their critical machines

        All the holes lined up on the swiss cheese model here.

        Not sure if this was a fire-worthy event though. Yes James made a very stupid mistake, but he made it with good intentions and the experience he got from it was hard-bought. If he was in any way competent, he is unlikely to ever make a similar mistake for the rest of his career. The next guy they get on board to replace him might. If this was the biggest in a line of similar mistakes for James though, yes he should be fired. And an even heavier look should be taken at whoever let him near those machines.

        1. Doctor Syntax Silver badge

          Re: "So was James truly the guilty party?"

          From the story the company actually had a training system on which James was trained. Although he was a newbie he was, reasonably, believed to be trained and to know what he was doing and, after all and like everyone else, he had to go solo some time. It's how you get to go from being a newbie to no longer a newbie. At some point a new employee has to be trusted to do the job for which they've been trained.

          If he was installing the new system from CD-ROM he needed to have admin rights. This being XP it's entirely possible that the system wouldn't even run without the user having admin rights.

          Change management procedures aren't going to be much use if someone goes ahead and ignores them and does something that isn't in them. The procedures would be unlikely to forbid removing the duplicate files. They probably didn't say not to format C: either. After all, nobody had done either of those on the production machine before.

          1. TonyJ

            Re: "So was James truly the guilty party?"

            This line "...This being XP it's entirely possible that the system wouldn't even run without the user having admin rights." never fails to amuse me.

            With simple tools such as Regmon and Filemon from Sysinternals it was always possible to get software running as a non-administrative user.

            Document the changes in case an update decides to re-apply perms (rarely happened) and it was trivial.

            This used to be the refrain on poorly built Citrix servers all the time - oh it needs admin rights so we made everyone a domain / local admin. Lazy and needless.

            1. Captain Scarlet
              Unhappy

              Re: "So was James truly the guilty party?"

              The amount of software back then which required writing to places like the root of C:, the Windows Directory or Program Directory just showed how poorly thought out/updated many programs were.

              1. Prst. V.Jeltz Silver badge

                Re: "So was James truly the guilty party?"

                Yes, it was like the wild west at the time.

                I worked in a school at the time,

                the era of the "European Computer Driving License",

                and the amount of software that insisted on this was staggering.

                People are writing this software, and selling it to customers where it's an absolute cert the people using it won't have admin rights.

                1. Anonymous Coward
                  Anonymous Coward

                  Re: "So was James truly the guilty party?"

                  By the time WinXP came around, M$ had design guides and lists of 'best practises' for programmers...

                  Which they couldn't care less for...

                  We bought half a dozen Epson A3 Flatbed scanners, and found that the SW insisted on storing temp files in the 'Program Files' structure...

                  We just packed them down, shipped them back to the seller and demanded our money back because they weren't fit for purpose.

                  Another program we had, that was somewhat older (originally written for Windows 3.1 or about then), insisted on having the .ini file in the 'C:\Windows' folder.

                  We learned THAT when we upgraded the WinNT4.0

                  1. KA1AXY
                    Facepalm

                    Re: "So was James truly the guilty party?"

                    M$ had design guides and lists of 'best practises' for programmers...

                    It was a different time...

              2. Peter2 Silver badge
                Meh

                Re: "So was James truly the guilty party?"

                Most didn't need write access to %program files%.

                What they needed was for the user to be given write access to %program files%/ApplicationName/tempfile.tmp or C:/Temp/tempfile.tmp.

                And yes, they should have been writing to %temp%, however even today half the people out there write to C:/Windows/Temp rather than just using the %temp% variable.

                1. david 12 Silver badge

                  Re: "So was James truly the guilty party?"

                  If you use the %temp% variable, you leave yourself open to the kinds of malware problems unix/linux has with environment variables (for example, shellshock).

                  Native Windows programs use the Windows API to identify the temp folder. %temp% is for .BAT and .CMD scripts.
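
As an illustration of asking the platform for the temp location instead of hard-coding a path, here is a minimal Python sketch. It is not what any of the programs mentioned above did; the function name `write_scratch_file` and the `scan_` prefix are made up for the example. Python's `tempfile` module resolves the directory at runtime (it consults the `TMPDIR`, `TEMP` and `TMP` environment variables before falling back to platform defaults), so the caller never needs write access to the install folder.

```python
import os
import tempfile


def write_scratch_file(payload: bytes) -> str:
    """Write payload to a uniquely named file in the per-user temp directory.

    Hypothetical helper for illustration only.
    """
    # mkstemp() chooses the directory via tempfile.gettempdir(), so nothing
    # is hard-coded to C:/Windows/Temp or to the application's own folder.
    fd, path = tempfile.mkstemp(prefix="scan_", suffix=".tmp")
    with os.fdopen(fd, "wb") as handle:
        handle.write(payload)
    return path


if __name__ == "__main__":
    scratch = write_scratch_file(b"temporary scanner data")
    print("temp dir:    ", tempfile.gettempdir())
    print("scratch file:", scratch)
    os.remove(scratch)  # clean up after ourselves
```

The file is created with a unique name and opened exclusively, which also avoids two instances of the program trampling the same fixed `tempfile.tmp`.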

              3. Anonymous Coward
                Anonymous Coward

                Re: "So was James truly the guilty party?"

                Yeah, I worked in a research lab and the amount of shonky software written mainly by boffins that "required" admin creds to run was shocking.

                1. Alan Brown Silver badge

                  Re: "So was James truly the guilty party?"

                  this starts rubbing at the core of the issue

                  Most software ISN'T written by professionals, but by bumbling amateurs whose code happens to do what's needed at the time

                  Couple that with a "If it works, ship it!" mindset from the moneybags and you have a disaster waiting to happen at some point down the line

              4. david 12 Silver badge

                Re: "So was James truly the guilty party?"

                There was, and still is, an enormous amount of cross-platform software that installed in c:\, and expected to be there, or couldn't handle spaces in the path, or (on servers) ran very poorly when in a deep directory tree.

                For native Windows programs, the problem was programs that stored configuration data in the 'machine' branch of the registry, requiring either 'machine' privilege to do so (rather than 'user' privilege), or someone sophisticated enough to identify individual leaves of the 'machine' registry which could be given write permissions for 'authenticated users' as well as 'admins'.

              5. Jou (Mxyzptlk) Silver badge

                Re: "So was James truly the guilty party?"

                This is where Vista was the first Windows OS that silently redirected such writes to %userprofile%\AppData\Local\VirtualStore . My "Windows" directory there is full of .ini files, and many of them with fresh last change date. Like cool.ini, since I still use Cooledit 2000 on Win11 x64. I also have a "ProgramData" and "Program Files (x86)" there, which is not surprising and contains the expected suspects.

                In my opinion: A GENIUS move to do that since Vista, solves a lot of compatibility issues.

                BTW: The same goes for the registry, check your HKEY_CURRENT_USER\Software\Classes\VirtualStore.

            2. localzuk Silver badge

              Re: "So was James truly the guilty party?"

              There was plenty of poorly written software around at the time that did indeed require local admin rights to run. No amount of changing permissions would remedy the issue - anyone working in Education at the time can give you plenty of examples.

              1. Rob Daglish

                Re: "So was James truly the guilty party?"

                Ohhhhh yes. I worked for a County Council supporting school IT, and even now I still want to punch the developers of "Talking Write Away" and "Number Box". TWA was a word processor which had different coloured levels which exposed varying amounts of tools to the kids. Except some levels just wouldn't print... and you could forget all about being able to save to network drives, c:\ it is!

                1. Terry 6 Silver badge

                  Re: "So was James truly the guilty party?"

                  In my case, some stuff wouldn't even save to a local partition. C:\programmename\data was hardwired! In the programme I posted about below, though, you could save anywhere you wanted to. If you knew that you needed to, knew you could, and knew how to get to the controls that let you change the settings. But this was well hidden.

                  1. Jou (Mxyzptlk) Silver badge

                    Re: "So was James truly the guilty party?"

                    Would a junction work? It works in quite a lot of situations, and Windows uses it a lot for compatibility. c:\inetpub, for example, can be moved to a different drive and a junction from c:\inetpub to <wherever> works fine!

            3. Anonymous Coward
              Anonymous Coward

              Re: "So was James truly the guilty party?"

              "With simple tools such as Regmon and Filemon from Sysinternals it was always possible to get software running as a non-administrative user."

              I think the point Doctor Syntax was making was in relation to installing the software (which generally always needs admin permissions), rather than running it afterwards (which I agree never should).

              Since this was XP there's no UAC available, so you'd need to be either logged in as an admin user, or have been granted admin access to your own login. That said, since Sage Payroll* STILL doesn't properly support UAC, it's quite possible that wouldn't have helped anyway.

              * neither do some others, but come on Sage, you're a massive company and should have that sorted by now!

              1. Mark 85

                Re: "So was James truly the guilty party?"

                * neither do some others, but come on Sage, you're a massive company and should have that sorted by now!

                Possibly they are too big and those who created this are long gone and this bit has been forgotten.

                1. Rob Daglish

                  Re: "So was James truly the guilty party?"

                  Given Sage's implementation of "Cloud Drive" I wouldn't bet on it being fixed anytime soon...

            4. CommanderGalaxian
              Holmes

              Re: "So was James truly the guilty party?"

              "With simple tools such as Regmon and Filemon from Sysinternals ..."

              Because obviously a newbie will be completely familiar with that approach...

          2. Bent Metal
            Facepalm

            Re: "So was James truly the guilty party?"

            I'm still stunned that - even after training on a separate, isolated system - someone could decide to try out a new idea for the first time ever on all of the production systems simultaneously

            1. david 12 Silver badge

              Re: "So was James truly the guilty party?"

              I'm not stunned at all. I'm old enough to remember that breath-taking arrogance was a mark of computer programmers in the main-frame era, and young enough to know that it's still a mark of fresh comp-sci graduates today.

        2. breakfast Silver badge

          Re: "So was James truly the guilty party?"

          I feel like describing this in Swiss cheese terms is perhaps a little generous, seems like the configuration was basically all holes, with maybe a couple of very thin cheese strings stretched across here and there.

        3. Anonymous Coward
          Anonymous Coward

          Re: "So was James truly the guilty party?"

          Not IT, but the comment about learning the lesson reminds me of an incident many years ago on an offshore drilling rig in the Gulf (the US one, that is). For background (greatly simplified), many rigs have a device called "a rigsaver" that stops the travelling block being raised so far that it collides with the static block (at the top of the derrick) - if that happens it's very expensive (not only to repair but also the downtime involved).

          On this particular rig, the rig saver had been disconnected (for what was probably a good reason at the time) but the driller coming on tour wasn't told. Shortly into the job, he was raising the block fast (as required to keep the job moving) expecting the rigsaver to kick in if needed; eventually, the inevitable happened and he missed the point he needed to stop. An expensive ouch! for the drilling contractor. The driller was promptly fired for incompetence, the drawworks repaired, and a new driller arrives to carry on with the job. Being new, he didn't know the full history of what happened, or why, and shortly into the restart of drilling, the inevitable happened again.

          Rig management were the ones who should have been fired - firstly for firing the one driller who knew what error to avoid, and then for not learning and ensuring his replacement was fully briefed. But hiring and firing minions was the culture...

          1. The Oncoming Scorn Silver badge
            Angel

            Re: "So was James truly the guilty party?"

            The OIM is God - Icon.

          2. Mark 85

            Re: "So was James truly the guilty party?"

            But hiring and firing minions was the culture...

            Was???? I think it still is as one of manglement's jobs is to protect themselves along with their paycheck and bonuses.

    2. Plest Silver badge

      Re: "So was James truly the guilty party?"

      Exactly!

      Just 'cos you can see it's complete shite, that doesn't give you the right to rip it up and just change it without proper consensus and paperwork.

      I hate CR paperwork but it covers arses all round, targets what was changed (assuming it was written properly!) and then it's harder to simply point the finger at one person when everyone had lots of chances to stop a catastrophe occurring due to bad planning.

      1. Anonymous Coward
        Anonymous Coward

        Re: "So was James truly the guilty party?"

        The most important sections of a CR are the impact assessment and the back out plan. The impact assessment should take some time because this is where you are considering the benefits of the change vs the potential risk. It also informs the back out plan and the implementation approach. I've come across too many CRs in my time which specify that recovery is 'restore from backup' without understanding when backups are taken or the actual downtime this would cause. I ended up having a stand up fight with a very experienced tech support manager who wanted to implement a 'restore from shadow copy' approach for an operationally critical case management system written in SQL Server, when Microsoft did not support this approach and it would prevent the rolling forward of the database to the failure point. This could have resulted in a loss of several hours of updates, directly impacting on the delivery of service to citizens in crisis.

  2. Pascal Monett Silver badge

    Where were the procedures ?

    I agree that James launched himself into his own troubles, but where were the documented procedures on how to do what ?

    Of course, this was back in the day and this is exactly the kind of thing that made people understand how important written procedures are.

    1. Doctor Syntax Silver badge

      Re: Where were the procedures ?

      In that sort of environment it would be surprising if there weren't documented procedures. They're not much use if they're ignored and it sounds as if James was just the type to do that. I suppose he never found himself anywhere near the emergency stop button.

      1. Version 1.0 Silver badge
        Unhappy

        Re: Where were the procedures ?

        I saw a database designed back then for XP, the company sold the medical-grade software, demonstrating the database functions with a sample of 5 clinical subjects. The buyers all thought it was great until they started using it and the number of database entries went from 5 to 100 or more ... accessing the data dropped from 5 seconds to 50 seconds or more.

        So often software is written and shown to "work" but never tested to the limits and is not fully documented, resulting in the problems described in this article and the company saying that reporting the problem was a stupid user complaint.

        1. Plest Silver badge
          Facepalm

          Re: Where were the procedures ?

          Otherwise known as scalability and stress testing. Yes, the "t-word" that bad devs hate so much!

          Having spent 25 years as a DBA, scalability is very dear to my heart, having heard whining devs say things like, "Well it ran perfect with 5 records but when we put it in production with the expected 5 million records it started running like a dog! You need to fix it ASAP, aren't there any magic config options in the database config?". Even if there were, I wouldn't share them with you, piss off and write proper scalable code, stress test it and when you have some proper test reports we'll talk!

          1. ColinPa

            Re: Where were the procedures ?

            I visited one bank whose test system size was bigger than production regarding size of databases, and the transaction volumes it ran. I think test ran at maximum production rate +50%

            1. DS999 Silver badge

              Re: Where were the procedures ?

              That's good practice, but it is quite expensive to generate artificial traffic volumes that exceed your production data volumes for large real world systems so it is easy to see why it is so rarely seen.

              1. Anonymous Coward
                Anonymous Coward

                Re: Where were the procedures ?

                With the right tooling in place it's actually very simple to generate the required test volumes and not use real data at all. I worked on one project where they would go from an empty database to full volume in a couple of hours, generating the original customer records and a decent history in a morning. This meant every new test cycle the entire application and database was built from scratch. Unfortunately the real issue is organisations' lack of willingness to invest in automated test tools, especially performance testing.
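
To make the "empty database to full volume" idea concrete, here is a rough Python sketch using the standard-library `sqlite3` module. The schema, the function name `build_test_database` and the value ranges are all invented for illustration; the project described above will have used its own tooling. The point is only that a seeded generator gives repeatable, fully synthetic data at whatever volume you ask for.

```python
import random
import sqlite3


def build_test_database(n_customers, orders_per_customer, seed=42):
    """Create an in-memory database filled with synthetic customers and orders.

    Hypothetical schema; no real customer data is involved.
    """
    rng = random.Random(seed)  # seeded so every test cycle gets identical data
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, customer_id INTEGER, amount_pence INTEGER)"
    )
    # Generators keep memory use flat even for millions of rows.
    conn.executemany(
        "INSERT INTO customers (id, name) VALUES (?, ?)",
        ((i, f"customer-{i:07d}") for i in range(1, n_customers + 1)),
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, amount_pence) VALUES (?, ?)",
        (
            (cid, rng.randint(100, 100_000))
            for cid in range(1, n_customers + 1)
            for _ in range(orders_per_customer)
        ),
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    db = build_test_database(n_customers=10_000, orders_per_customer=20)
    print(db.execute("SELECT COUNT(*) FROM orders").fetchone()[0], "orders")
```

Scaling the two parameters up is then a one-line change, which is what makes testing at production volume plus a margin practical.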

          2. Dagg Silver badge

            Re: Where were the procedures ?

            Just once I would actually like to see the bloody BAs specify actual NFRs (non functional requirements) that would allow the devs to design and develop against.

            I would also like the bloody project managers and the rest of the pointy haired ones to provide the infrastructure that would meet the NFRs.

            I remember one project where they would not provide a UPS to develop and test with so that when it went into production it was on a wing and a prayer and first power fail it didn't detect that it was running on battery and crashed damaging the database. Doh!

          3. PM from Hell

            Re: Where were the procedures ?

            And for God's sake optimize your queries! As an ex tech support manager I was often pressured to make more machine resources available when devs put non-optimized code live. As the dev teams would not work with us pre-production, I'm afraid my default response ended up being to get a DBA / sysadmin to identify the query hogging 99% of a CPU and then just tell the devs the system was coming down until they had optimized the code. At that point they would finally talk to a DBA, who would assist them in optimizing the query and adding any additional indexes required, or even move a disk partition onto a less busy drive. Now that so many dev projects are in the cloud I'm seeing exactly the same issues, which can cause huge cost issues for the operations teams as they have to throw more and more CPUs at projects to deliver acceptable performance rather than tuning code.
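
A small sketch of the "talk to a DBA about the query" step, using SQLite through Python's `sqlite3` module purely because it is self-contained; the table, index name and helper `query_plan` are assumptions made for the example, and other databases have their own equivalents of `EXPLAIN QUERY PLAN`. It shows the same query going from a full table scan to an index lookup once a suitable index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for sql as a single readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, customer_id INTEGER, amount_pence INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, amount_pence) VALUES (?, ?)",
        ((i % 1000, i) for i in range(100_000)),
    )

    sql = "SELECT * FROM orders WHERE customer_id = ?"
    print("before:", query_plan(conn, sql, (7,)))  # full scan of the table

    # Hypothetical index name; the column choice follows the WHERE clause.
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
    print("after: ", query_plan(conn, sql, (7,)))  # lookup via the index
```

Checking the plan before go-live costs minutes; throwing more CPUs at a table scan costs money every month.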

      2. The Oncoming Scorn Silver badge
        Flame

        Re: Where were the procedures ?

        I applied for a job supporting medical (Anesthesia IIRC) equipment (While still in the field of Electronics IIRC, prior to IT).

        One of the things pressed on me very hard at the interview was that, for each product and depending on the overseas customer I was to support, each country had its own standardised colour for the oxygen, nitrogen etc. hoses delivering to the patient.

        I couldn't simply say "unplug the green hose" or "replace the valve on it" when diagnosing over the phone, as the green hose might not be part of the delivery system being investigated in that country. I had to know each variant system and its colours fully.

  3. Anonymous Coward
    Anonymous Coward

    jumping out of plane with no parachute

    "Oh my, there were lots. James had full system access and so after each update removed the duplicate files. Much better."

    Ouch, mate, there's a *heavy* assumption behind this happening with no catastrophic impact !!!

    Like jumping out of the plane with no parachute, hoping you'll eventually find one holding up, here in the air, while descending !

    How could he be this careless ?

    1. jake Silver badge

      Re: jumping out of plane with no parachute

      "How could he be this careless ?"

      Human nature. Are you new to the working life?

      1. TDog

        Re: jumping out of plane with no parachute

        Well I've jumped out of a fucking plane with two parachutes,

        And neither worked as well as I could have wanted.

        The first didn't tell me I was in a shitty place until the wind blew me across the combine harvester storage area [No, I am serious] and the second couldn't get me out of the mess when I found the first wasn't working. Still here with some small bits of titanium generously donated by the taxpayer via the NHS, and I have now a life (not so very long left) concern about however many parachutes you can have, if they

        * Come too far down the decision tree

        * Don't understand the interdependency risks

        * Don't take the external environment into account [just where are those fucking combine harvester storeyards]

        * Don't understand the risks [So what is so bad about being 30 feet above a group of parked harvesters when you are dropping at 15 - 30 feet per second (big fat bugger on a T10)]

        * And finally for me, but there are lots more: Not all risks are combine harvesters, the NHS can't fix everything.

    2. Version 1.0 Silver badge

      Re: jumping out of plane with no parachute

      It's like just jumping out of the plane with a parachute and saying that the problem is solved before you try to land.

      1. KarMann Silver badge
        Facepalm

        Re: jumping out of plane with no parachute

        Trevor Jacob would like a word with you.

    3. Prst. V.Jeltz Silver badge
      Alien

      Re: jumping out of plane with no parachute

      Like jumping out of the plane with no parachute, hoping you'll eventually find one holding up, here in the air, while descending !

      I seem to remember Ford Prefect jumping out of his office window with a similar plan at one point.

      It worked for him.

      1. The Oncoming Scorn Silver badge
        Thumb Up

        Re: jumping out of plane with no parachute

        The trick is to aim for the ground & miss.

    4. Plest Silver badge
      Happy

      Re: jumping out of plane with no parachute

      Even now, 40 years down the IT career track, when I'm about to do anything monumentally dangerous on a prod system I instinctively pull hands off keyboard and check what I think should happen, what happened during testing and what the script/commands are about to do. I compare and check about 7 times before going for it.

      I've had my share of "white panic/blood draining from face" moments over the last 40 years, my heart can't take much more stress as I see retirement on the horizon!

    5. Terry 6 Silver badge

      Re: jumping out of plane with no parachute

      Oh my, there were lots.

      To me that would be a sign that considerable caution is needed. It takes considerable arrogance to think that something that significant wasn't noticed by the experienced staff, and to think they were all incapable of considering a trim rather than having a good reason for not doing one.

  4. Howard Sway Silver badge

    Backup and restore capability for mission critical files?

    Another case of "we've not heard of it".

    If this really was such a "major" manufacturer, then the senior staff should have been the ones for the chop for skimping on the importants. Alternatively they could have adopted the true engineering approach of using the incident as a learning experience and putting procedures and preventions in place to stop it from happening again. If anything, they should have been grateful to the fresh faced youngster for helping to reveal and eliminate a major risk to their business.

    1. Doctor Syntax Silver badge

      Re: Backup and restore capability for mission critical files?

      Backups? Those were what James deleted.

      This was a newly installed version - installed by James - and, reading between the lines, it appears that the install created backups - AKA duplicate files - and at startup checked for their existence. It seems that the system was designed to do exactly what you suggest and did it so as to be idiot proof until nature produced a bigger idiot.

      1. Adrian 4

        Re: Backup and restore capability for mission critical files?

        Good reply, although providing information that the duplicates were required would have saved some downtime.

        1. Doctor Syntax Silver badge

          Re: Backup and restore capability for mission critical files?

          It was a black box. Until opened nobody realised there was a cat inside it - and it had claws.

        2. heyrick Silver badge

          Re: Backup and restore capability for mission critical files?

          Or simply doing what you're instructed and not touch anything else, no matter how well meaning.

          At home you own the machines, you can make decisions. At work, you follow orders unless you can demonstrate that said orders are gibberish and have something solid to back that up.

          Where I work, one of the process machines was freaking out and not working anything like correctly. The girl in charge of the machine tried the usual settings that she has access to, to no avail. So, being helpful, she did what she did with her home computer. She turned it off and on. Only, it didn't come back on. The techies, when they finally arrived, wanted her hung. And the "minor" problem that they didn't pay much attention to became a huge problem that caused a three day backlog. But since she took the initiative to power cycle the machine, all the blame fell upon her head.

          So, just don't. It ain't worth it.

          1. CommanderGalaxian
            Boffin

            Re: Backup and restore capability for mission critical files?

            Surely during testing of the machine they did a "black restart" - i.e. simply pulled the plug suddenly in the middle of it doing its thing and then re-applied power to make sure it recovered ok?

            1. heyrick Silver badge

              Re: Backup and restore capability for mission critical files?

              Failing to restart was not part of the design spec. It was age and a reluctance to take it out of production in order to fix it. Everything worked "so long as"...

      2. doublelayer Silver badge

        Re: Backup and restore capability for mission critical files?

        "Backups? Those were what James deleted."

        No, they weren't. It's clear that the duplicate files were on one disk, as the duplicate finder program was being used to free up disk space. Backups doesn't mean copy the file to the same disk with a different name. Even when you're making a temporary backup copy in case you damage the primary, you back up to a file the program isn't going to use, and since this one caused the program to report an error, it clearly was attempting to read it. The program needed those duplicates. Backups should have been elsewhere. It doesn't stop that being a really stupid thing to do, but backups could have at least prevented having to get a reinstallation disk sent out.
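
The duplicate finder in the story is not named, so the following is only a sketch of one common approach: group files by a hash of their contents and report the groups. The function name `find_duplicates` is invented for the example, and it deliberately never deletes anything. Two files having identical bytes says nothing about whether a program expects both of them to exist.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def find_duplicates(root):
    """Return lists of paths under root whose contents are byte-identical.

    Report-only: nothing is modified or removed.
    """
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in chunks so large files do not have to fit in memory.
                for chunk in iter(lambda: handle.read(65536), b""):
                    digest.update(chunk)
            by_digest[digest.hexdigest()].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for name, data in [("a.bin", b"same"), ("b.bin", b"same"), ("c.bin", b"other")]:
            with open(os.path.join(root, name), "wb") as handle:
                handle.write(data)
        for group in find_duplicates(root):
            print("identical contents:", group)
```

Treating the output as a list of questions to ask ("why does the vendor ship this twice?") rather than a list of files to remove is the difference between tidying up and a reinstall from CD-ROM.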

  5. OhForF' Silver badge
    FAIL

    Well meant vs. well done

    Taking the initiative to do something helpful without knowing all the details doesn't often result in a job well done.

    This "who, me" reminds me of the scientists at university who helped administer the department's machines by cleaning up the clutter of zero-length files in /dev ..

    Result was half a day of work for the admins to restore the OS (Apollo Domain OS if memory serves) from tape.

  6. MatthewSt

    Once bitten...

    Back in my early days I was working on an application hosted on Windows on a Virtuozzo cluster (which was sort of containers before containers were a thing). We hit a bug that was resolved in the latest Service Pack of Windows so out of hours I started the update process on all 6 servers that made up the environment. When the first three restarted, they didn't come back.

    Turns out Virtuozzo used some crazy single instance features (or something like that) where each server "ran on top of" and used the same code as the underlying host. By running the updates I completely hosed the environment. I was given a very stern talking to by my manager (but I gather not as stern as the talking to he got from his manager) but thankfully that was the worst of it.

    What people seem to overlook in these kinds of scenarios is that if the person means well, they'll treat it as a lesson. They're not going to want to make the same mistake twice! You just have to work out what the intent was, and whether that person is capable of learning from it

    1. My-Handle

      Re: Once bitten...

      Exactly, but it does also take quite a mature management structure to interpret it that way. I've known a few friends who have cost their respective companies tens or hundreds of thousands of pounds just because of one mistake. They survived with their jobs intact. I've also known friends who were fired for trivial mistakes with no material consequence.

      I myself have managed to trash a core system database with the old UPDATE statement with no WHERE clause. The IT director almost shrugged. He just told me to go fix it, fix it quickly, and don't do it again.

      I'm a damn sight more careful now.
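
      A minimal sketch of one guard against the WHERE-less UPDATE described above, assuming Python's built-in sqlite3 module and a made-up `orders` table. The helper name `guarded_update` and its `max_rows` limit are illustrative, not anything from the system in the story: the statement runs inside a transaction and is rolled back if it touches more rows than the caller expected.

```python
import sqlite3


def guarded_update(conn, sql, params, max_rows):
    """Run an UPDATE in a transaction; roll back if it touches too many rows."""
    cur = conn.cursor()
    try:
        cur.execute(sql, params)       # sqlite3 opens a transaction implicitly here
        if cur.rowcount > max_rows:    # more rows changed than the caller expected
            conn.rollback()
            return False
        conn.commit()
        return True
    except Exception:
        conn.rollback()
        raise


# Hypothetical demo data: three open orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (id, status) VALUES (?, ?)",
    [(1, "open"), (2, "open"), (3, "open")],
)
conn.commit()

# The classic mistake: no WHERE clause, so every row would change.
forgot_where = guarded_update(
    conn, "UPDATE orders SET status = 'cancelled'", (), max_rows=1
)

# The intended statement: exactly one row changes.
with_where = guarded_update(
    conn, "UPDATE orders SET status = 'cancelled' WHERE id = ?", (2,), max_rows=1
)

statuses = [row[0] for row in conn.execute("SELECT status FROM orders ORDER BY id")]
```

      The same idea works by hand in most SQL consoles: start a transaction, run the statement, and look at the reported row count before committing.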

      1. dak

        Re: Once bitten...

        I brought down an entire bank by doing that once.

        1. Eclectic Man Silver badge
          Happy

          Re: Once bitten...

          Send it to the 'Who Me' team, they might make you (in)famous.

        2. John Brown (no body) Silver badge

          Re: Once bitten...

          "I brought down an entire bank by doing that once"

          I was about to ask if it might be RBS. Then I stopped and thought about all the other possibilities from just the last few years!! There have been so many, you could probably go into some level of detail and still not identify which bank it was and remain "regonimized" :-)

      2. Doctor Syntax Silver badge

        Re: Once bitten...

        "He just told me to go fix it, fix it quickly, and don't do it again."

        If it was something that could be fixed quickly he could afford to be fairly laid back.

        1. My-Handle

          Re: Once bitten...

          I managed to fix it within an afternoon.

          I was furnished with some database backups (A three-day-old full backup and a set of update snapshots, if I remember correctly). That got me back to the start of the morning. I managed to rebuild all but three order records from the wreckage of that day's data (cross-referencing with other systems). Those last three records were rebuilt by me begging Customer Services to call up the customers and confirm what it was they ordered. Fortunately, I have a good working relationship with our customer services team. The three affected customers were pretty good about it too, they appreciated being kept in the loop rather than us trying to hide the mistake.

          All it ended up costing us was a few hours of lost work and a couple of grey hairs on my part.

      3. tezboyes

        Re: Once bitten...

        Similarly, testing a script to update a lookup table. But I ran it in the wrong window - Prod rather than Dev. The script didn't work.

        But I noticed it immediately, told my manager who notified users immediately that there was an issue. It only took 30 mins to fix.

        And whilst it was a clear demerit I didn't get hauled over the coals as I didn't faff about or make up stories.

        Plus as a lesson learned, we set the default telnet window borders to be red on prod systems going forward.

    2. John Brown (no body) Silver badge

      Re: Once bitten...

      "I started the update process on all 6 servers that made up the environment. When the first three restarted, they didn't come back."

      I also learned a long time ago, update one first and see if it "does what it says on the tin". Then, if feeling brave, do the rest all at once, otherwise, do them one at a time.

  7. Tubz Silver badge

    First rule of change management, follow the change to the letter, if it fooks up, then pass the blame to who raised the change, who wrote the specific instructions, the technical reviewer and then the approving manager.

    1. OhForF' Silver badge
      Devil

      Sounds like an exercise in blame management.

      How long does it take to raise/specify/prove/install an emergency hot fix for an issue actively blocking production following all procedures?

      1. localzuk Silver badge

        When the outcome is "you're fired" you can be sure there is a blame management system in place.

        1. Terry 6 Silver badge

          Absolutely. And much later too. In the 2000s there was a suite of educational software for primary schools (might still be) that every school had and no one used, because of lack of training of course. One seemed particularly useful, so I tried it out with a small group of kids (they knew it was a test and were up for possible calamity, luckily). The programmes ran locally. And worked well. All seemed fine. We saved the results using the inbuilt save. Which did not give a choice of location.

          Then we tried opening one of the files up, because one of the kids wanted to change something. And it wasn't there.

          Some investigation and it turned out that the default save location was on the C:\ drive. Which, being a school machine, wasn't accessible. But error messages had been suppressed. So it just hadn't saved anything.

          1. localzuk Silver badge

            Sounds about right. There was lots of software like that. Still is some of it floating around in schools that are unwilling to modernise.

          2. WhereAmI?

            Then you get the Head of IT who knows f.all about IT. I was contracted into a local college where one of the briefs was to stop students from installing games etc. I locked down the C: drive using Cerberus (that long ago!) and notified all IT teachers that the students either had to save to floppy or to the D: drive (I didn't like that idea - too many options for plagiarism/deleting other student work etc. but - overruled). Then came the inevitable time that I had to show the Head of IT how to unlock the C: drive to install new software. It was a simple two-stage password and response job IIRC. No; it was too difficult. He didn't understand it. I was told to go back round and unlock every C: drive and remove Cerberus.

            Cue (or queue) the inevitable virus infections. At any one time around 30% of the college PCs were down with viruses. It didn't matter how many times I tried to explain to the head of the college in extremely simple terms how we could reduce these problems, the glaze quickly came over the facial features and I was overruled.

            They then hired someone full-time to IT Support who was completely incompetent and I got out.

            1. Terry 6 Silver badge
              Facepalm

              This is one of my "I don't understand why this happens" issues. And for once seems to apply only to IT (Go on, find some other places). The organisation pays someone to do an IT job. They don't comprehend what they're being told, which is fair enough in at least some cases, but they neither try to understand, nor accept what they're told. Doing either would be rational - though understanding is surely better than tame acceptance.

              Doing neither is not rational. It's saying "There's a problem, I don't understand the problem, I don't understand the person who can resolve the problem so I'm going to ignore the problem".

      2. Anonymous Coward
        Anonymous Coward

        Change Management and Critical Incidents

        The answer to how long it takes to authorize a hotfix to a down production system should be less than 30 minutes. If a hotfix is already available then you still have the time taken by the vendor to identify it to convene an emergency CAB virtually. If a hotfix needs to be developed then you will have at least several hours. During that time get CAB Members updated on the issue and potential fix, perform the impact assessment and have a coffee.

        When the fix arrives you may need to press the vendor for more details about what it will do but then take that analysis to the virtual CAB.

        Even if the issue cannot be recreated in test then you should apply the hotfix to test to make sure 1/ it actually does apply in your environment, 2/ it doesn't break some core functionality which would make the situation even worse.

        And as always have the backout / recovery plan detailed and rehearsed.

        On modern systems it's often possible to mirror the production DB to test quickly and then apply the hotfix. In any situation like this the first question you need to be asking yourself is 'do I need a special backup of the database?' If the answer isn't 'definitely not' then start the backup anyway. You can always cancel it if it's not required but it could save your job in a few hours' time. The big learning point though is that this backup must not be part of the normal cycle or you may inadvertently restore a broken DB if there is a hardware break in the next couple of days.

  8. chivo243 Silver badge
    Facepalm

    I learned early on

    If it ain't broke, don't fix it... We had an intern who thought everything needed to be fixed. Like why can't our NT4 workstations have a USB mouse... Look out below!

  9. Adrian 4

    Touchstone

    Who failed to communicate vital information that he was aware of ?

    - Not Harry, he didn't know. Upstream developers or company admin.

    Who did something unasked and without checking ?

    - James. A learning experience, or should have been

    Who failed to warn of the consequences of doing something other than what was documented ?

    - Harry, probably

    So all to blame, but mostly manglement. Because that's where the buck stops.

  10. Prst. V.Jeltz Silver badge
    Facepalm

    So was James truly the guilty party?

    Hell to the yeah!

    The guy's a timebomb!

    I didn't see at any point in the story the promised "using your initiative"

    I just saw a guy pulling a proprietary system he knew nothing about to bits.

    I bet the next weekend he'd have tried to improve the machines on the line by pulling bits off he deemed unnecessary

  11. Anonymous Coward
    Anonymous Coward

    Either James was ready to fly solo on the update or he wasn't

    You have to take the training wheels off at some point but part of that is a judgement as to whether your apprentice is ready (and to be frank sometimes that judgement has to be absolutely never/over my dead body/etc.).

    1. Andrew Rowland

      Re: Either James was ready to fly solo on the update or he wasn't

      Yes, that's true. If it was his first time doing the upgrade, though, apart from change control, a senior administrator should have, at the very least, reviewed the procedure and expected results. James had a good idea that turned bad, simply because he didn't understand what was expected and how things worked together. A review, such as stated, would have, at the very least, made him think about what he was doing.

      1. Doctor Syntax Silver badge

        Re: Either James was ready to fly solo on the update or he wasn't

        You can have all the reviews you want. They'll make no difference when someone decides to try out a bright idea of their own.

        1. Mark 85

          Re: Either James was ready to fly solo on the update or he wasn't

          You can have all the reviews you want. They'll make no difference when someone decides to try out a bright idea of their own.

          Add to that: ".... or when management is clueless".

      2. John Brown (no body) Silver badge

        Re: Either James was ready to fly solo on the update or he wasn't

        "If it was his first time doing the upgrade, though, apart from change control, a senior administrator should have, at the very least, reviewed the procedure and expected results."

        Based on the article, that all worked and went to plan. It was James' additional work and ideas that screwed up. He did the documented job properly then went on to do some undocumented jobs of his own devising.

  12. Kubla Cant

    Backout procedure?

    It's not rocket surgery. A simple plan on one sheet of paper would suffice, even when working with Windows XP.

    1. This is how I propose to apply the change.

    2. This is how I will verify if the result is as expected.

    3. This is how I will back out the change if it didn't work as expected.

    Note that the correct answer to 3 is not "Phone the software suppliers and wait for them to send us a reinstallation disk".
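
    The three-step plan above can be sketched in a few lines, assuming the change touches a single file and that a copy taken beforehand, kept outside the working directory, is an acceptable backout. The names here (`change_with_backout`, `line.cfg`, the `speed` setting) are hypothetical and only for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def change_with_backout(target: Path, backup_dir: Path, apply, verify) -> bool:
    """Apply a change to `target`; restore the pre-change copy if verification fails."""
    backup = backup_dir / (target.name + ".pre-change")
    shutil.copy2(target, backup)   # keep a copy somewhere the application does not read
    apply(target)                  # 1. apply the change
    if verify(target):             # 2. verify the result is as expected
        return True
    shutil.copy2(backup, target)   # 3. back out the change
    return False


# Hypothetical demo: a one-line config file and two candidate changes.
work = Path(tempfile.mkdtemp())
backups = Path(tempfile.mkdtemp())
config = work / "line.cfg"
config.write_text("speed=10\n")


def bad_change(path: Path) -> None:
    path.write_text("speed=fast\n")


def good_change(path: Path) -> None:
    path.write_text("speed=12\n")


def value_is_numeric(path: Path) -> bool:
    return path.read_text().strip().split("=")[1].isdigit()


bad_kept = change_with_backout(config, backups, bad_change, value_is_numeric)
after_bad = config.read_text()     # the original content, restored from the backup

good_kept = change_with_backout(config, backups, good_change, value_is_numeric)
after_good = config.read_text()
```

    The point of passing `verify` in as a function is that the check is decided before the change is made, not improvised afterwards.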

    1. John Brown (no body) Silver badge

      Re: Backout procedure?

      And, unlike James in the article, Step includes testing the change on the test line where he was trained and if that works, testing it out on ONE production line, not all of them at the same time.

  13. Plest Silver badge
    Happy

    When I read it I recall working on a lot of systems that would use cumulative updates to keep going, you simply laid the latest patches over the previous 76 layers of code and prayed it all worked! Even more fun if the underlying patches go back years and are all in different folder paths linked by some bizarre config file that holds it together!

  14. ColinPa

    You are a very lucky company

    I did a review of a system, and one of the charts had the line

    "You are a very lucky company".

    Initially the management were very chuffed till I explained what it meant.

    You have been skating on thin ice, and not fallen through so far. You had made changes without understanding the impact. If there had been a problem, you could not have backed out the change. As I said - you are very lucky you have had no problems.

  15. Shez

    why a new install?

    If all the guy did was delete duplicate files then why did a new install need to be mailed out, surely it's just a case of copying the remaining files and renaming the copy back to what the now deleted duplicate file was called?

    1. doublelayer Silver badge

      Re: why a new install?

      From the sound of it, they didn't know what those names were and didn't have backups from which they could pull that data. I don't know what else a reinstallation required, but I would hope someone could find a log somewhere with the information.

    2. Jou (Mxyzptlk) Silver badge

      Re: why a new install?

      IF you have a log of what was deleted and where... And invest the time to actually know that was what happened, and invest the time to actually do it. Your idea sounds nice, but the reality is: You need to make 100% sure it works on every machine, and that can only be done by doing it fresh.
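
      For what such a log could look like, here is a minimal sketch, assuming duplicates are detected by SHA-256 content hash and that restoring means copying the kept file back under the removed name. Nothing is known about the tool James actually used; the function names and the demo files are hypothetical.

```python
import hashlib
import shutil
import tempfile
from collections import defaultdict
from pathlib import Path


def find_duplicates(root: Path) -> dict:
    """Group files under `root` by content hash; keep only groups with duplicates."""
    by_hash = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}


def build_manifest(duplicates: dict) -> list:
    """List (kept, removed) pairs so every deletion can be undone later."""
    manifest = []
    for paths in duplicates.values():
        kept, *extras = paths
        manifest.extend((kept, extra) for extra in extras)
    return manifest


def restore(manifest: list) -> None:
    """Recreate each removed file by copying the kept duplicate back to its name."""
    for kept, removed in manifest:
        if not removed.exists():
            shutil.copy2(kept, removed)


# Hypothetical demo: two identical files and one unrelated file.
root = Path(tempfile.mkdtemp())
(root / "a.dll").write_text("same")
(root / "b.dll").write_text("same")
(root / "c.txt").write_text("other")

manifest = build_manifest(find_duplicates(root))  # record first, delete second
for kept, removed in manifest:
    removed.unlink()
deleted_ok = not (root / "b.dll").exists()

restore(manifest)                                 # undo using the manifest
```

      In practice the manifest would be written to disk before anything is removed, since an in-memory list is no help once the session is gone.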

  16. Ken Moorhouse Silver badge

    Reminds me when...

    I once worked for a small company that used WordStar.

    The secretary - charming Liverpudlian lady - one day was reading an article from a woman's magazine "What star sign is your boss?" She was reading his traits which matched him to a "T". He was a gemini and apparently you could go days, weeks, months even when you could do whatever you wanted in the office with no fear of reprisal. Growing mustard and cress in the in-tray was the cited example which still sticks out in my mind. Then, one day he will come in and reorganise everything and be totally disruptive for a few days before reverting to form.

    One day during one of these slack periods she was quiet, uncharacteristically quiet... I popped into her office to see what she was up to...

    "Just deleting some of the crap off [the boss's] pc."

    "How do you know it's crap?" I enquired.

    "Well there's all these files that, when you open them up using the word processor, are full of smiley faces. See, look at this one, it's crap!"

    I don't think I need to spell out what happened next...

    Being a gemini, my boss thought it quite amusing.

  17. billdehaan

    Newbies can screw up but to do real damage you need management buy-in

    Several years ago, I worked in a lab that was pretty much a free for all. The software was tested in the local lab and then deployed in machines with dedicated hardware all over the world. When there was a field issue, support people would often bring in the machine (if they thought it was a hardware issue) or just the hard drive (if they thought it was only a software problem) from the field back to the lab to debug it.

    Of course, many of these machines would be absolutely riddled with viruses and malware, which would then run rampant over the lab network. Management's response was not to install antivirus on the lab machines, however. That was too expensive, and frankly, the McAfee software that they had standardized on made computers so disgustingly slow that they were unusable. That was perfectly fine for employee machines, of course. No one cares about them. But customers visited the lab, so the machines there had to be presentable.

    Management ordered IT to install a process on every lab machine that would check every USB media connection attempt and check it had a specific file in the root directory. That file would have an MD5 checksum proving it had been checked for viruses by the lab anti-virus machine. The idea was that you brought your USB thumb drive to the lab and plugged it into the virus scanner machine, which would write this time-stamped USB credential file. Then you connected the USB to a lab PC, and the antivirus process would check that the credential file was current and correct. If it wasn't, or if it was missing, the process would eject the USB.

    Naturally, neither management nor IT told the engineering staff about this. Projects ground to a halt as engineers took test builds down to the lab and spent hours struggling unsuccessfully to install them on the lab PCs, only to have their USB media ejected as soon as it was connected. Dozens of problem tickets were raised against both lab support and IT, but since only the upper castes in those groups were aware of the cause, several IT and lab support people were trying to debug the issue.

    For those engineers whose lab machines had a CD reader, they burned their builds to CDs and were able to install them that way. Of course, they couldn't get logs back, but they could at least be partially productive. Others tried dozens of different USB keys, without success. A third group discovered that if you started a batch job on the PC that continually copied a big file to the USB drive letter in a loop before the USB was connected, once the USB was connected, it would establish a file handle and the disk wouldn't be ejected. That solution was mailed around by engineering leaders to their teams, and it became the de facto resolution.

    Delivery dates were missed, customers screamed, and finally, someone in management realized that maybe they should have, oh, announced this change or something. Then they came up with the brilliant idea that instead of just silently ejecting the USB, the software on the PC could put up a message on the screen telling the user why the USB had been rejected, or something.

    Ingenious! The software was updated, and an IT person was tasked with upgrading it on all the lab machines. All the machines. Several hundred of them, in multiple labs, on several floors.

    The IT person realized that there was a better way. Rather than manually doing 1500+ installs, why not have the users do it? Since they were plugging in USB disks to the lab machines themselves, have the virus scanner install the newer version on their USB, enable the USB's autorun, and then when anyone scanned their USB (which at least some engineers were doing now, since a month after this started, there was a company-wide notice about the change) and plugged it into a new machine, it would install the upgrade! Any machine that was used would be upgraded, and any that weren't upgraded weren't being used in the first place, right? This was a great time saver.

    So, the approved solution was to install executable software on the USB and put it in the USB's autorun. And it worked. The first time such a USB was connected to a lab machine, it would update the software, and all was good.

    However, on the second connection, the updated software would look at the USB, notice that it had autorun enabled, and was trying to install software on the PC. Oh my god! That's a virus! Quick! Force a disconnect and lock the machine!

    People who didn't use the virus scanner and were running the batch file hack to lock the USB didn't have any problem. People who did use the mandated (and mandatory) virus scanner found that after doing so, as soon as they connected to a lab machine, it not only ejected the USB, it immediately locked up. And since only lab personnel had the machine passwords, work stopped until a lab person could be found.

    In other words, you were fine as long as you didn't follow company policy. There's a great incentive structure for you.

    Even better, when people then plugged their USBs back on their desktop machines, it then spread the "virus" into the corporate network at large.

    That time-saving optimization disabled the entire lab for weeks, took IS three days to clean up the corporate network, and IT spent a month scouring all of the remnants of it off of the 1500+ lab machines.

    By the way, there never was a case of a virus being transmitted via USB in the lab. Ever. All of the viruses had been transmitted by field personnel bringing hard disks in from the field and installing them in the lab, which completely bypassed the USB checking nonsense. In other words, in addition to all the chaos that it caused, the entire exercise was completely useless at its stated goal.

  18. Eclectic Man Silver badge

    System Administration Logs

    When I took over sysadmin from an experienced and intelligent person who was leaving for a job overseas, he handed me his SysAdmin Logbook. He explained that everything he did on the system which was for administration was recorded in that log book. Every command, keystroke, file uploaded, deleted, and printouts of kernel upgrades, edits etc. (before and after).

    I continued this tradition, and this had the effects that I at least knew exactly what I had done, and never did anything I didn't have to do (as writing up stuff was a pain).

    Now, if Harry had done that and insisted that James recorded everything he did as system administrator, I bet James would have thought "I'm not writing down deleting every duplicate file, I'll ask Harry in the morning if it is a good idea."

  19. Stuart Castle Silver badge

    NT 4.

    When I was a green young techie, I was (as part of my job) on the Microsoft beta programme. We were primarily a Windows NT based company, and had a large fleet of Windows NT 3.5 machines that we'd largely upgraded to 4 by the time SP2 came out.

    I'd been testing SP2 for weeks, and had no problems, so when asked by my boss whether we should deploy it, I enthusiastically answered "Yes". I now know that one computer is nowhere near a large enough test pool, as when we rolled it out to the users, it left roughly half the machines in a state where they would not boot.

    I wasn't a member of a huge team. There was me, and my boss. We both did a lot of apologising to users, and I had to work late every night. First diagnosing and resolving the problem, then applying that solution to the users' computers.

    IIRC, a couple of users took advantage of the situation to get new computers, which I had to build, but we did get the users back to a situation where they could work.

    In my defence, I'd only started the job a couple of months beforehand, and while I had good technical knowledge (far superior to my boss's, as he would admit), I had no experience. As such, I think my boss shouldn't have asked for my advice, and accepted it without question.

    Now, I test any new updates first on a couple of machines, then on a larger group (say 10). I also get other technicians and certain users to test the updates. It takes a little longer, but I've never taken out half the estate again.
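
    The staged approach described above (a couple of machines, then about ten, then the rest) can be sketched roughly as follows. This is an illustration under assumptions, not the poster's actual process: `apply_update` and `is_healthy` stand in for whatever deployment and boot check a real estate would use, and the machine names are made up.

```python
def staged_rollout(machines, apply_update, is_healthy, stage_sizes=(2, 10)):
    """Update in growing batches; stop at the first batch with an unhealthy machine.

    Returns (updated, failed, untouched).
    """
    remaining = list(machines)
    sizes = list(stage_sizes)
    updated = []
    while remaining:
        # Use the configured stage sizes first, then do everything that is left.
        size = sizes.pop(0) if sizes else len(remaining)
        batch, remaining = remaining[:size], remaining[size:]
        for machine in batch:
            apply_update(machine)
            updated.append(machine)
        failed = [m for m in batch if not is_healthy(m)]
        if failed:
            return updated, failed, remaining   # halt before touching the rest
    return updated, [], []


# Hypothetical demo: twenty machines, one of which will not boot after the update.
machines = [f"pc{n:02d}" for n in range(1, 21)]
applied = set()


def apply_update(machine):
    applied.add(machine)


def is_healthy(machine):
    return machine != "pc02"


updated, failed, untouched = staged_rollout(machines, apply_update, is_healthy)
```

    With these made-up inputs the rollout stops after the first pair, so eighteen machines are never touched.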

  20. Already?

    Taking it on yourself to delete a load of files without first having the knowledge of what they’re there for is bonkers. James decided that he knew better than the teams of developers who created the system and, worse, that deleting files was bound to be ok. Wrong wrong wrong. What a chuffing numpty.

  21. earl grey
    Pint

    i deny everything

    see, my fingers never left my hand...

  22. damiandixon

    Sometimes it is better to leave it to the experts.

    I recently watched a YouTube video of someone trying to fix a laptop that someone else had tried to clean. The laptop prior to cleaning was working perfectly fine.

    It looked like the person who had done the cleaning had tried to lift a number of chips without understanding how heavily integrated electronic components are now.

    They had also pulled ribbon cables so hard that pins were broken.

    I did learn some interesting fault finding techniques watching the video.

  23. Anonymous Coward
    Anonymous Coward

    He could have asked

    Hey, Harry,

    I've got a bit of spare time, is it ok if I....
