Under-qualified sysadmin crashed Amazon.com for 3 hours with a typo

Welcome again to "Who, Me?" – The Register's Monday column in which readers admit to making mistakes and explain how they managed to keep their careers going afterwards. This week, meet a reader we'll Regomize as "Ken" who told us that over 20 years ago he scored a job at Amazon.com as a Linux sysadmin, a role for which he …

  1. Dr Watson
    Alert

    Logs

    This is why you ALWAYS put /var/log in its own partition. That way when the logs get full it doesn't stop the rest of the system!

    1. Anonymous Coward
      Anonymous Coward

      Re: Logs

      This is why you ALWAYS put /var/log in its own partition. That way when the logs get full it doesn't stop the rest of the system!

      Looks like they were database logs, the most recent of which might be required to complete a transaction. If the RDBMS cannot write an intent entry, it doesn't start the transaction, so a dedicated file system that has filled up wouldn't necessarily get you any further ahead.

      These days I would put transient/volatile files on their own logical volume / file system rather than a physical partition.

      1. JulieM Silver badge
        Boffin

        Re: Logs

        Anything the database server might ever need to refer back to in normal operation is, by definition, not a log file, but a journal.

        1. AnonymousCward

          Re: Logs

          A journal is a transaction log, one which normally runs in a circular fashion, just like in a database which utilises simple recovery.

    2. Doctor Syntax Silver badge

      Re: Logs

      I'm guessing these are the logs that the database engine would use in the event of a database restoration. They would be used to restore any transactions after the last database backup. If the engine ignored the fact that it couldn't write new ones it wouldn't be able to restore the new transactions if the need arose. Better to stop than compromise integrity.

      1. MrBanana Silver badge

        Re: Logs

        Or use a database that can add logs in an emergency, and automatically extend the log partitions accordingly.

        1. mirachu Bronze badge

          Re: Logs

          Congratulations, that's one of the stupidest ideas I've heard in a while.

        2. JulieM Silver badge

          Re: Logs

          For your next trick, you could introduce a form of copy-prevention that works by embedding something into the original medium that an authorised device can read, but the pirates can't.

    3. Pete Sdev

      Re: Logs

      Generally good advice.

      However, some applications, when unable to write to a log, will pass the error up the stack and refuse to do anything. In some cases this is justified, e.g. database logs in a cluster.

      What's worrying about this story is:

      i) No monit sending a warning "hey this partition is 95% full you should take a look"

      ii) Inadequate testing of the script before use in production.
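
      A minimal sketch of the kind of check point (i) describes, assuming Python 3 and a 95% threshold. The function name `usage_warning` and the path checked are illustrative only and not taken from the story; the sketch just builds the warning text, and wiring it to email or a pager is left out.

```python
import shutil


def usage_warning(path, threshold_pct=95.0):
    """Return a warning string if the filesystem holding `path` is at or
    above `threshold_pct` percent used, otherwise None."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_pct = 100.0 * usage.used / usage.total
    if used_pct >= threshold_pct:
        return f"{path} is {used_pct:.1f}% full, you should take a look"
    return None


if __name__ == "__main__":
    # "/" is only an example; point it at the log or database volume instead.
    message = usage_warning("/")
    if message:
        print(message)
```

      Run from cron or a monitoring agent, a check like this warns before the volume fills rather than after the service has already stopped.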

    4. ComicalEngineer Silver badge

      Re: Logs

      Reminds me of a former colleague, who was recruited after I had joined the company. We were using some sophisticated (for the time) software which IIRC used to take a full 8 minutes to compile the code into an .exe file. The software wrote several temporary files to a specific directory on the HDD. My esteemed colleague decided to "clean up" the HDD one afternoon when I was out of the office.

      I came in the following morning to the news that the program "kept crashing".

      Now "H" came from a culture where loss of *face* is a major issue, and this coupled with his arrogance made him somewhat unpopular with several people (including me).

      What have you done?

      Nothing!

      What. Have. You. Done?

      Nothing!!

      Etc etc.

      Eventually I extracted the story about *cleaning up* the HDD.

      After half an hour with the paper manual (remember them?), checking the autoexec and config files, I realised that the oddly named temporary directory was missing. Once I recreated the directory, everything worked as it should. The program deleted everything in the temporary directory when the .exe file had compiled.

      This was not the first time I had to fix his computer cock-ups and I gave him the benefit of my opinion of people who mucked about with computers without knowing what they were doing.

      He lasted 11 months in the job.

      1. J.G.Harston Silver badge

        Re: Logs

        Gawd.... brings flashbacks of TMP=C:\DOS

    5. Brad Ackerman
      Alien

      Re: Logs

      /var/log should always be a separate partition, at least if you're not using ZFS (in which case give it a capacity reservation); but whether the system continues or not when that partition is full is a different issue. Secret squirrel agencies want AU-5(4) ("shutdown on failure", which does exactly that) implemented. That choice makes other controls (system monitoring) more important, but you should be monitoring free disk space without needing an SSP to tell you that.

      I shouldn't need to explain the icon; it obviously had to be Sectoids.

    6. DS999 Silver badge

      Re: Logs

      These were database online redo logs. Those most definitely DO NOT go anywhere in /var!

      1. Androgynous Cow Herd Silver badge

        Re: Logs

        Both statements are true.

        /var should be a separate partition,

        and

        database transaction logs do not go there.

        back in olden times, the database logs would be on one LUN (hopefully RAID 1, 10, or 0+1) because they are very write intensive

        Database itself on a separate LUN, Could be RAID 5 for the capacity benefit because that bit is much more read intensive

        And OS and all its bits on a different LUN altogether - With /var on a separate partition from /

        You cannot mix the LUNs, or put anything else on those LUNs, no matter how much capacity is wasted.

        There used to be this thing called "Spindle Contention" you see...

        Substitute RAID groups for LUN if you could fit the minimum 8 disks necessary into your physical server...

        1. Alistair
          Windows

          Re: Logs

          Not so long ago in our perspective, but back in ancient history to some, there was a third party vendor agency for a certain yacht laden database and application platform vendor that was assisting in the build out of our implementation of an application platform. Sadly the architect advisor they had demonstrated at their first presentation the level at which this vendor was operating, by insisting that each table in the underlying DB had to have a dedicated spindle in the storage array. When asked if they knew how many tables were in the planned DB they replied "I believe perhaps 50 or 60".

          The DB involved had just over 52,000 tablespaces, and was projected to add approximately 1200 a year. (Hey, I was platform NOT DBA). The consult went downhill from there. And it was a LOOONNNG way down. Anyone know about HPUX filesystem i-node issues these days?

          In any case, no, in EVERY case LVM is your friend

          1. DS999 Silver badge

            Re: Logs

            I used to do storage consulting and I had to listen to these idiot DBAs who wanted to separate all this stuff "by spindle" all the time. No matter how much explaining I'd do about how if you stripe everything across ALL the spindles it'll be faster, they had been trained in the ancient days and that was all they knew. Even young ones wanted that because I guess they'd been taught the ancient knowledge that cannot be questioned by the Old Ones that came before them.

            In one engagement I had an opportunity to prove it since we had a new array sitting around for months while a datacenter expansion was taking place to make room for all the new servers that would be connected to it. I got the DBA to set up a clone of one of the databases on a server that I connected to that array and developed a script so I could rapidly reconfigure the Symmetrix from the LUNs up as well as on the HP-UX host. I gave him his desired config with dedicated spindles for the stuff he wanted it to have and told him to figure out a benchmark to run. Had him run it a few times to ensure it provided consistent results. Then I tore down that config and rebuilt the array with striped metaluns, striped those across the entire array (well the portion that would be dedicated to this project at least) using LVM and told him to run his benchmark again. He was shocked by the difference, to the point where he didn't even try to defend his previous claims. So when we did the "real" install of the production system we did it my way.

    7. Lee D Silver badge

      Re: Logs

      Same for RADIUS logs on Windows.

      Put them on a separate drive, they manage themselves and delete themselves when full.

      Put them anywhere on C: and you will regularly find yourself with no space on the boot drive and failing services etc. because of it until NPS decides to clean them up.

      1. Jou (Mxyzptlk) Silver badge

        Re: Logs

        Add Windows deduplication on top (default fileserver settings are fine, no need to tune ANYTHING) and you will be surprised how much suddenly can fit there. Dedup can, sadly, not be activated for the boot drive. While technically possible and would work fine, no one at Microsoft dared to open that can of worms, including supporting it. So they better write "unsupported" into their documents if someone manages to circumvent the internal "not for C:" blockers.

  2. Pascal Monett Silver badge

    "which did not seem like a good omen"

    I have to admit, if I were responsible for something like that in such a company, my butthole would be so clenched I wouldn't even be able to fart until I was told that no, I wasn't fired.

    I'm glad that he escaped that episode unscathed (well, almost).

    1. JustAnotherITPerson

      Re: "which did not seem like a good omen"

      Well, on the plus side for him, the Amazon of 20 years ago is a far cry from what it is today. You take down Amazon.com today and it is possible that you will never work in IT again.

      1. Sudosu Silver badge

        Re: "which did not seem like a good omen"

        I bet you would do quite well doing a GoFundMe as the person who broke Amazon.com however.

      2. Zoopy

        Re: "which did not seem like a good omen"

        > You take down Amazon.com today and it is possible that you will never work in IT again.

        Okay but what's the downside?

      3. Anonymous Coward
        Anonymous Coward

        Re: "which did not seem like a good omen"

        Don't threaten me with a good time.

  3. Anonymous Anti-ANC South African Coward Silver badge

    At least he did not use the venerable and powerful rm -rf * (the stuff legends are made of)

    1. Anonymous Coward
      Anonymous Coward

      did not use the venerable and powerful rm -rf *

      Ironically in this case that would have worked in the log directory.

      1. Doctor Syntax Silver badge

        Re: did not use the venerable and powerful rm -rf *

        And left the database in need of an immediate backup.

        1. Jou (Mxyzptlk) Silver badge

          Re: did not use the venerable and powerful rm -rf *

          You meant to type restore, didn't you?

    2. m4r35n357 Silver badge

      The One True Command is: rm -rf /

      Get it right!

      1. tfewster Silver badge
        Facepalm

        If you type "rm -rf /", you know exactly what you're doing. Unless, of course, it's "rm -rf / garbagedir" (Note the space after the "/")

        "rm -rf *" is far more insidious, as it catches out people who haven't double-checked their current working directory.

        In either case, it's a rite of passage; After that, you develop paranoia and situational awareness - "I'm about to do something risky - Am I in the right place?"

        Disclaimer: "last reboot" or "last | grep reboot" are fine. "last | reboot" is not. To be honest, I fully expect to make typos like that occasionally, so the second rite of passage, "backups" is learned.

        1. OhForF' Silver badge
          Linux

          A lot of people also mess up their first attempt to delete all those "hidden" files starting with a dot '.' in the current directory.
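
          The usual trap is that the shell's `*` skips dotfiles, while in some shells `.*` also expands to `.` and `..`. As a hedged sketch in Python (the helper name `hidden_entries` is made up for illustration), listing the directory explicitly sidesteps both problems, because `Path.iterdir()` never yields `.` or `..`. The sketch only lists names; it deletes nothing.

```python
from pathlib import Path


def hidden_entries(directory):
    """Return the sorted names of entries in `directory` that start with a dot.

    Path.iterdir() does not yield the special entries '.' and '..',
    so the parent directory can never end up in the result.
    """
    return sorted(p.name for p in Path(directory).iterdir()
                  if p.name.startswith("."))
```

          Reviewing the returned list before removing anything is the same habit as checking the working directory before typing the command.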

    3. Antron Argaiv Silver badge
      Thumb Up

      One does not type that, even in jest, lest the cursor be in the wrong window.

      One refers, instead, to "the command that shall not be typed"

      1. Yet Another Anonymous coward Silver badge

        I thought that was 'dd if=/dev/sdb of=/dev/sda' - wait a minute, which one is the mirror ?

        1. MrBanana Silver badge

          Objects in the mirror disk are more corrupt than they appear.

      2. Just Enough

        Do not rely on the safety net

        I think someone on this very website once boasted that they always modified their systems so that "the command that shall not be typed" always got intercepted by their own script, preventing its unthinking and accidental use. So it couldn't possibly happen to them.

        Until they get so used to this safety net, they forget it's not standard on every remote console they could ever be connected to.

    4. wolfetone Silver badge

      Are you even a sysadmin if you've never done that on the root partition of a production server?

  4. Korev Silver badge
    Coat

    > We walked inside, where everyone razzed me for a long time.

    They probably thought "Not fu Ken him again"

  5. KittenHuffer Silver badge
    Coat

    Big logs

    Big logs that have not been removed from their storage area would bring me to a halt as well.

    In fact I can recall sitting for long periods of time contemplating just this issue.

    ---------> Mine's the one hung on the back of the toilet door!

    1. Anonymous Coward
      Anonymous Coward

      Re: Big logs

      Ah yes, one of those unfortunate occasions when backed up logs can cause a production stoppage...

  6. Pope Popely

    Job half done

    Unfortunately, no permanent fix for Amazon.

  7. A Non e-mouse Silver badge
    Thumb Up

    Kudos to the manager for not firing "Ken" or throwing him under the bus.

    1. tip pc Silver badge

      firing the guy who figured out the issue and resolved it is never a good idea.

      1. KittenHuffer Silver badge

        I've never known that to stop Manglement though!

        1. anothercynic Silver badge

          100% this. If some power that be wants you gone, you're gone (primarily in the US with its antiquated working practices, but also known to happen in the UK) one way or the other.

    2. Sam not the Viking Silver badge

      As his boss, I took the blame for a number of cock-ups that our new graduate trainee introduced: Poor supervision on my part although I was worried about competence.....

      I finally lost sympathetic-mode when instead of admitting defeat, he falsified a set of test results. The series of figures recorded didn't compute to the results displayed. It wouldn't have been so bad except that the data should have been recorded automatically and the answer calculated within the spreadsheet. Instead, he hand-typed numbers into the cells.

      He expected things to be done for him, so he could pass them on and take the credit. I now realise he was middle-management material. But I/we sacked him. We later found out that the recruitment agency had failed to verify his qualifications but omitted to pass this information on..... We sacked them too.

      1. Anonymous Coward
        Anonymous Coward

        I had similar issues with a support person. He was likely to fail his probation due to poor timekeeping anyway but he then made an unauthorised configuration change which took down the ERP system database. Had he admitted it he might have had a second chance but he denied it even when presented with the evidence that he was the only admin logged on at the time so he had to go.

        It later turned out that the recruitment agency had been economical with the info they'd provided on him but still insisted that we had to pay the full fee as we'd passed the cancellation date specified in the contract. Shortly after we stopped using employment agencies*!

        *When I recruited his replacement three out of four candidates put forward turned out to be unsuitable for the role, and I suspect had been on the agency books for a long time. Fortunately the fourth is still with us.

        1. FirstTangoInParis Silver badge

          Once upon a time a recruiter wanted to put me forward for a job working on power showers. I didn’t even know the first thing about them, other than mains electricity and water don’t get on well together. I didn’t bother with that recruiter again.

          1. Yet Another Anonymous coward Silver badge

            Shocking behavior

          2. JulieM Silver badge
            Boffin

            For anyone outside the UK and curious what a "power shower" is: For historical reasons, including legislation in response to catastrophic failures, until the 1980s in the UK, mains-fed storage water heaters were only authorised for industrial use. Any water heater installed in a residential property had to be either gravity-fed from a cistern and permanently vented to the atmosphere, or contain only a minimal quantity of water being heated on its way through. This meant that you were limited either by the hydrostatic head of water (100 Pa per cm., minus any pressure loss due to friction along the length of the pipes) or the heater power (it takes 70W of heat to make water flowing at a rate of 1 litre per minute 1 degree hotter).

            A power shower used a double-impeller pump controlled by a flow switch to draw hot water from the hot water cylinder and cold water from the cistern, which were then blended through a thermostatic mixer valve so as to maintain the temperature as the hot cylinder (often quite rapidly) became depleted of hot water.
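
            The two figures quoted above (100 Pa per cm of head, 70 W per litre per minute per degree) can be checked with a few lines of arithmetic. This is a back-of-envelope sketch assuming round textbook values (water density 1000 kg/m^3, specific heat about 4186 J/(kg*K), g = 9.81 m/s^2) and taking 1 litre of water as 1 kg; the function names are illustrative.

```python
WATER_DENSITY = 1000.0   # kg/m^3, assumed round value
SPECIFIC_HEAT = 4186.0   # J/(kg*K), approximate for liquid water
GRAVITY = 9.81           # m/s^2


def pressure_per_cm_head():
    """Hydrostatic pressure in pascals from 1 cm of water (rho * g * h)."""
    return WATER_DENSITY * GRAVITY * 0.01


def heater_watts(litres_per_minute, delta_t_kelvin):
    """Power in watts to warm a steady flow by delta_t_kelvin, ignoring losses."""
    kg_per_second = litres_per_minute / 60.0  # 1 litre taken as 1 kg
    return kg_per_second * SPECIFIC_HEAT * delta_t_kelvin


print(round(pressure_per_cm_head(), 1))   # about 98, i.e. roughly 100 Pa per cm
print(round(heater_watts(1, 1), 1))       # about 70 W per litre/min per degree
```

            By the same arithmetic, a hypothetical 9 kW instantaneous heater raising the water by 30 degrees manages only about 4.3 litres per minute, which is the limit a pumped shower fed from a stored hot cylinder gets around.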

            1. Yet Another Anonymous coward Silver badge

              And this is a country that doesn't allow mixer taps in case of Cholera ?

              1. Richard 12 Silver badge

                Nope, mixer taps are very common

                There's simply regulations about their design and installation, that are quite easy to comply with and do indeed prevent disease.

                1. JulieM Silver badge
                  Boffin

                  Re: Nope, mixer taps are very common

                  Yes, regulations such as requiring divided-flow mixer taps between gravity-fed hot water and potable cold water from the rising main. (Single-flow mixers are permitted where A: both sides are gravity-fed from the same cistern, B: both sides are fed from the mains or C: an approved non-return valve is fitted in each leg.) See also Quincy, M.E., S5E21 Deadly Arena.

                  If a single-flow mixer tap is installed incorrectly between supplies at different pressures (e.g. a gravity-fed water heater and the main), there is a slight risk of contaminating the public water main with germs from a customer's stored water if a loss of mains pressure occurs while the outlet from a single-flow mixer is blocked. The much greater risk, though, if the outlet from a single-flow mixer is blocked, is that of backflushing the hot water cylinder and overflowing its supply cistern.

                  1. collinsl Silver badge

                    Re: Nope, mixer taps are very common

                    And in the UK in systems with hot water tanks in the loft it was reasonably common for them to have missing or damaged lids to the tanks, such that the tank water could be contaminated with loft insulation, dead bugs, dust, dirt, spiders, etc. Occasionally you'd end up with drowned rats/mice or birds in them.

                    This is why the parents of lots of us were always paranoid about teaching us not to drink out of the hot tap, or even get bath water or shower water in our mouths etc for fear of whatever contamination was in the tanks getting into our bodies.

                2. Sam not the Viking Silver badge

                  Re: Nope, mixer taps are very common

                  Legionnaires' disease lurks in stagnant (non-flushed or stationary), warm water. It's not as if you can avoid it, but under conditions where it can grow, it is very serious, especially in an aerosol, e.g. a shower.

                  Ignore the rules at your peril, although it is older or health-compromised people who will suffer most. I declare a personal interest.

            2. MrReynolds2U

              Power shower scares

              Depletion, followed by the sound of an aircraft taking off when you take too long in the shower. Always used to scare the shit out of me.

              1. Martin an gof Silver badge

                Re: Power shower scares

                Mum and dad had a very early power shower, two industrial-type pumps in the airing cupboard next to the cylinder, lots of flexible piping, a remote pneumatic switch and a Hans Grohe handset with interchangeable heads. Six or seven of them if I remember correctly. Very futuristic.

                Depletion

                When, many years later, I decided to fit one in my own house, the plumbers decided we needed a huge cold cistern in the attic, replacing the smallish one which previously kept the hot water cylinder topped up (cold came direct from the mains). What they didn't do though, was reinforce the attic joists, meaning that over a few weeks we began to notice the ceiling directly above our bed bowing alarmingly. Got that sorted, sharpish.

                Noisy, both systems, but they worked.

                Current house is mains pressure for both hot and cold, but not via a mains pressure cylinder (or indeed a "combi" boiler); the cylinder is vented in the conventional manner and heat is taken out via a heat exchanger. This meant I could DIY the whole thing instead of having to get a qualified and certified plumber in. Said heat exchanger is theoretically capable of transferring 70kW, which is about twice as powerful as a "big" combi boiler and means that it's possible to have two showers running at the same time, and run the kitchen tap without shower users yelling at you (a problem I have often encountered with combi boilers). And as the heat exchanger is controlled (pump speed varies to maintain a set output temperature) it means I have simple mixer taps on the showers rather than thermostatic units, which I've often had problems with in the past.

                M.

      2. Anonymous Coward
        Anonymous Coward

        I remember someone telling me that whilst he wasn't formally their boss, he would occasionally mentor new recruits. One time, 2 started at the same time on a trial, and he was asked by his boss near the end of the trial who he recommended.

        He chose the one whose knowledge wasn't as good. When he wasn't sure what he was doing, he'd ask. The other guy, when he didn't know what he was doing, he'd wing it, and invariably screw up.

    3. DS999 Silver badge

      What I find interesting about his tale

      Is that he mentioned several times not having Linux experience, but the mistake he made that caused amazon.com to stop had nothing to do with Linux. It was a simple typo he could have made even if Amazon was using the Solaris he was familiar with.

      Makes me idly wonder whether the reason he didn't catch his mistake before it went live is that he spent so much time checking and re-checking every step he did where the Solaris/Linux differences would come into play, that he ignored double checking the "simple stuff" like that config file with the typo.

  8. PCScreenOnly Silver badge

    Be truthful

    I have found that is the best way.

    If you admit it it can be quicker to find a resolution

    People appreciate the truth

    If something else hits the fan in the future and you say "It wasn't me" or "I did this within the last x", people are more likely to believe you and can check what you did instead of random guessing of the cause

    1. Peter Gathercole Silver badge

      Re: Be truthful

      I did this once working on an important live system. For an OS upgrade, I failed over an HA cluster service from one server to another without checking that the storage moved across properly (well, to be accurate, I took the application storage offline as I thought it was local storage to each node as in some of the other clusters I was looking after at the time.)

      Got a quick query from our production support people, who knew I was working on the system, had spotted it, and they pulled me up a bit sharpish. Had the storage online and the applications started again before the client even noticed that there was a problem (they did notice, but by the time they did, it was fixed.)

      Immediately the work was complete, I reported what had happened to the project and service managers (but not the end client, I left that to the CRM) to let them know there could possibly be some fallout from the client.

      Everybody looked at me as if I was deranged! They thought I would have tried to cover it up, blaming something in the system. The fact that I was calm and proactive about reporting it while admitting that I'd f'd up, and was prepared to take the flak made them think that I was actually panicking inside and needed to be protected from the fallout! From my perspective, there was no point in trying to disguise the problem or make excuses. I made a mistake in planning the work and leaving a gap in the procedure to be worked out on the fly, and I would take the consequences as a result.

      I survived, and what it taught me was even if you think you know the environment inside out, plan every step ad nauseam and document it completely for review before starting the work.

  9. jake Silver badge

    Strangely enough,

    These days it is unqualified amazon drivers crashing ...

    1. A.P. Veening Silver badge

      Re: Strangely enough,

      These days it is unqualified amazon drivers crashing ...

      As long as they are driving Teslas, they can place the blame outside themselves.

      1. StudeJeff

        Re: Strangely enough,

        Actually not, they would still be responsible. This is called Supervised Full Self Driving for a reason.

        1. The Indomitable Gall

          Re: Strangely enough,

          Ah right "total complete autonomous self-driving that you need to watch over carefully, and which could drive you to New York in your sleep as long as you stay awake to watch it at all times". That's not at all a negligently worded thing that will lead drivers into adopting unsafe behaviours.

          Besides, Tesla malfunctions don't only happen in the supervised autonomous mode -- some Teslas have had themselves rear-ended due to the crash avoidance system braking suddenly and without any discernible reason.

    2. Richard Pennington 1

      Re: Strangely enough,

      Just yesterday (in the UK) I noticed something odd about an Amazon delivery van parked outside my house (but delivering to another address in the street). When the driver returned to the van, I told him that his rear number plate ["license plate" for left-pondians] was missing. When he queried that statement, I invited him to walk round the van and have a look.

      I then said that if I tell him, it's a friendly warning. If the police tell him, it's a bit different.

  10. tatatata

    Well, it is clear that it was quite a few years back. To put things in perspective, it is around the time that only half of the population had Internet, not every schoolboy with a phone. Amazon was then still a technical company, as I understood it.

    If you take down Amazon now, you won't get a second chance anymore.

    1. anothercynic Silver badge

      You won't even get close to the databases like this guy did... you'd be 'managing' it through AWS... And God help you with that.

    2. Martin an gof Silver badge

      Well, it is clear that it was quite a few years back

      Yeah, 20 years ago is two years (give or take) before Jobs dropped the iPhone on an unsuspecting world...

      It seems like such a ubiquitous device that when people complain about the touch interfaces where I work I have to remind them gently that this place was built before the iPhone existed to popularise multi-touch gestures.

      M.

  11. An_Old_Dog Silver badge

    Semi-Subtle Error

    In forgetting to create a fake database log file as part of the testing set-up -- it's not stated, but implied in TFA that's what happened -- nobody could tell from the all-good test results, that the procedure under test had failed to delete the database log file.
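
    One way to guard against this kind of false pass, sketched in Python under the assumption that the cleanup is a function that deletes files matching a pattern. The names `purge_logs`, `check_purge` and the `*.log` pattern are hypothetical and not from the article. The idea is that the test creates the file it expects to be deleted, and fails if nothing was actually removed.

```python
import tempfile
from pathlib import Path


def purge_logs(directory, pattern="*.log"):
    """Delete files in `directory` matching `pattern`; return how many were removed."""
    removed = 0
    for path in Path(directory).glob(pattern):
        if path.is_file():
            path.unlink()
            removed += 1
    return removed


def check_purge():
    """Run the purge against a throwaway directory that contains a fake log."""
    with tempfile.TemporaryDirectory() as tmp:
        fake_log = Path(tmp, "redo_0001.log")
        fake_log.write_text("dummy entry\n")   # the fixture that was missing
        keep = Path(tmp, "data.db")
        keep.write_text("not a log\n")

        removed = purge_logs(tmp)

        # Without the fixture, removed would be 0 and "no errors" would look like a pass.
        assert removed == 1, "purge deleted nothing; the test proves nothing"
        assert not fake_log.exists()
        assert keep.exists()
    return removed
```

    Asserting on the number of files removed, rather than on the absence of errors, is what turns an empty test directory into a visible failure.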

  12. Red Ted
    Alert

    Where's the original Reg story?

    I can find these:

    Outage hits Amazon sites from Nov '99.

    Amazon unavailable for holiday shopping madness from Dec '04 which seemed to drag on for sometime afterwards.

    Would Ken care to comment?!

    1. anothercynic Silver badge

      Re: Where's the original Reg story?

      I'd venture to guess the one from '99... that didn't drag on that long. :-)

  13. TeeCee Gold badge
    Facepalm

    Bright eyed, freshly trained, no experience.

    I recall a Storage Admin like that. Asked to add some disk to a server he found there were no spares in that loop. Fortunately for him, one of the servers in the same loop had a load of disk allocated that wasn't in use. He knew it wasn't in use, the course he'd just taken from the array provider themselves had drilled into the students to check for filesystems on disk to prove it was spare before doing anything to it and there were none.

    He learned the hard way that a) some DBMS's prefer their disk served raw and b) all DBMS's go down harder and faster than a Portsmouth tart on Navy payday when their storage is rudely removed without warning.

    1. Jamie Jones Silver badge

      Re: Bright eyed, freshly trained, no experience.

      Ooh, that reminds me of something similar I did, which I'd long forgotten about:

      One client we were supporting (remote support. For the youngsters, think "cloud" but on the company's site -- we connected to their servers to do work rather than the other way around), had one machine which was constantly filling up (email, documents, etc - office type stuff) and they didn't have the budget for any new disks.

      Calls would be logged, we'd again show them the user usage stats (which they actually already had access to), and their local (not technical) administrators would hound users to delete stuff (these systems had no disk/account quota facility)

      Anyway, one day I was logged into this machine, and noticed that one disk out of the 10 or so wasn't mounted. I knew this machine didn't run anything exotic that would use raw disks - it was a vanilla ICL DRS6000 that ran "officepower" that the users accessed from dumb terminals, and later, terminal emulators.

      I did talk to the local admin first, mentioned that I'd found this unused disk, and that it appeared to hold old officepower data and user files (from a quick scan of it) and it was probably neglected some time in the past during one of the numerous refactoring jobs in the past where new or bigger disks had been added, and users files moved from one partition to the other.

      She checked locally, and no-one knew of this disk and its purpose, so she agreed I could add it. I added it to the system, and moved some user accounts from other disks to spread the usage. All disks were now healthy, and she was happy...

      A week later, there was a call logged that many users couldn't log in. I picked up the call, and recognised they were users I'd moved to this new disk.

      It turned out that someone in the distant past had set up the equivalent of a `dd if=/dev/disk1 of=/dev/disk2 bs=1M` to run weekly via cron.

      I assume at the time, the machine was quite empty, so someone had thought it would be a useful way to keep a quick backup that wouldn't rely on tapes.

      This was undocumented (or lost in the mists of time when support of said machine moved from company to company).

      Anyway, when I realised what was going on, I spoke to the admin again. She had been there many years, and never knew this was happening - indeed, over the years, if any restores had been needed, they were always done from the tape backups, so it was effectively unused.

      I removed the cronjob, restored the drive from the previous night's backup, and all was fine. (Fortunately, the cron job had run overnight, but after the backup had completed, so nothing was lost: the office wasn't open overnight, and mail logs showed no new mails received in that window.)
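
      For anyone tempted to leave a similar raw copy job lying around, here is a minimal sketch of the guard that would have caught this: refuse to overwrite a target that is currently mounted. It assumes a Linux-style /proc/mounts layout (the DRS6000 in the story would have looked different), and the function names and sample device names are made up for illustration.

```python
def mounted_devices(mounts_text):
    """Return the set of device names listed in /proc/mounts-style text."""
    devices = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if fields:
            # The first whitespace-separated field is the device.
            devices.add(fields[0])
    return devices


def safe_to_overwrite(target, mounts_text):
    """A raw copy onto `target` is only sane if nothing has it mounted."""
    return target not in mounted_devices(mounts_text)


# Hypothetical sample in the Linux /proc/mounts layout.
SAMPLE = """\
/dev/disk1 /home ext4 rw 0 0
/dev/disk2 /home2 ext4 rw 0 0
"""

if __name__ == "__main__":
    for dev in ("/dev/disk2", "/dev/disk3"):
        verdict = "ok to overwrite" if safe_to_overwrite(dev, SAMPLE) else "MOUNTED, skip"
        print(dev, verdict)
```

      On a real Linux box the text would come from reading /proc/mounts just before the copy; the check does nothing for a disk that is in use raw, so it is a seatbelt rather than a substitute for documentation.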

    2. Peter Gathercole Silver badge

      Re: Bright eyed, freshly trained, no experience.

      Reminds me of a problem. On AT&T UNIX SVR2 running on an AT&T 3B15 (think a 3B2 [if you know what that is] with faster disks and in a minicomputer cabinet) that was supplied to a certain influential UK software house writing some software for the company I was working for, the swap partition was NOT included in the disk partition table, but was left as an unused (by filesystems) area of disk immediately after the / filesystem. (For these systems, the disk partitions were defined as part of the sysgen process while the swap area was defined somewhere else in the sysgen file.)

      Said expert software house, needing some more space for /, spotted this 'unused' part of the disk, and re-sysgenned the system to extend the / filesystem (they were quite clever about it, this was waaaay before anything like gparted or Partition Magic), but neglected to move the swap space.

      Things were fine until they put the system under load (SVR2 swapped, not paged, but only when short of memory), and it corrupted the / filesystem. They rebooted, fsck'd the disk, and then had exactly the same thing happen again shortly afterwards. And they then reached out to us to fix 'our' system (it was on loan to them).

      So I was dispatched to central London, worked out what they had done, restored the system from the last good backup, and left, following up with a written report of what I found, and what I had done to fix it.

      I thought that was that, until I arrived at work a few weeks later to be asked whether I could go up to London again, because they had done exactly the same thing again....
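
      The underlying mistake is easy to check for mechanically if every region of the disk is written down in one place. A rough sketch, assuming each region is recorded as a start block and a length; the region names and numbers below are invented, not taken from the real sysgen file:

```python
from itertools import combinations


def overlapping_regions(regions):
    """regions maps a name to (start_block, length); return clashing name pairs."""
    clashes = []
    for (a, (a_start, a_len)), (b, (b_start, b_len)) in combinations(
        sorted(regions.items()), 2
    ):
        # Two half-open ranges overlap when each starts before the other ends.
        if a_start < b_start + b_len and b_start < a_start + a_len:
            clashes.append((a, b))
    return clashes


# Hypothetical layout: swap sits directly after the root filesystem.
before = {"root": (0, 1000), "swap": (1000, 500)}
# Root extended into the "unused" space without moving swap.
after = {"root": (0, 1400), "swap": (1000, 500)}

if __name__ == "__main__":
    print("before:", overlapping_regions(before))
    print("after: ", overlapping_regions(after))
```

      The catch in the story, of course, is that swap was defined somewhere other than the partition table, so a check like this only helps if both definitions are fed into it.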

  14. anderlan

    15 second hard look.

    His boss was from the management school of Denholm Reynholm.

    https://y.yarn.co/e2cd8349-2af8-462a-bbed-07caf169fac6_text.gif

  15. CtrlAltDeleteCloud

    a single typo. full blackout. infra needs a last line that doesn’t blink.

  16. Manolo
    Facepalm

    The tide

    Wasn't this around the same time I saw a joke on (I think it was) userfriendly.org comparing Amazon to the tide?

    Sometimes it's up, sometimes it's down.

  17. Slow Joe Crow

    Fortunately my screwups were minor

    Not so much a typo as an oversight: in my first job doing real DBA work I left out the WHERE clause in a SQL UPDATE command and gave everyone in the database the same first name. Fortunately I could restore the table from an hourly backup, and the users never noticed.
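
    One cheap habit that catches this class of mistake is to run the update inside a transaction and look at the affected row count before committing. A minimal sketch using Python's built-in sqlite3 module; the table, the names and the guarded_update helper are all hypothetical, and other databases expose row counts in their own ways:

```python
import sqlite3


def guarded_update(conn, sql, params, max_rows):
    """Run an UPDATE, rolling back if it touched more rows than expected."""
    cur = conn.execute(sql, params)
    if cur.rowcount > max_rows:
        conn.rollback()
        return False
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, first_name TEXT)")
conn.executemany(
    "INSERT INTO people (first_name) VALUES (?)", [("Ann",), ("Bob",), ("Cy",)]
)
conn.commit()

# Missing WHERE clause: would rename everyone, so it gets rolled back.
ok_bad = guarded_update(
    conn, "UPDATE people SET first_name = ?", ("Dave",), max_rows=1
)
# Intended statement: touches exactly one row and is committed.
ok_good = guarded_update(
    conn, "UPDATE people SET first_name = ? WHERE id = ?", ("Dave", 2), max_rows=1
)
names = [row[0] for row in conn.execute("SELECT first_name FROM people ORDER BY id")]
```

    Here the first call is rejected and the second goes through, so only the row with id 2 ends up renamed.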

  18. back_to_abacus

    Before reading...

    1. Open the console tab in your browser's developer tools

    2. Paste the line of code `document.body.innerHTML = document.body.innerHTML.replaceAll("Who, Me?", "me.")` and press enter to execute it

  19. GeekyOldFart

    That manager was me

    That's how it's done. Nobody gets fired for an honest mistake, until they make the same one the third time, demonstrating that they are incapable of learning from it.

    1. Anonymous Coward
      Anonymous Coward

      Re: That manager was me

      The Codeless Code: Ten Thousand Mistakes

      https://www.thecodelesscode.com/case/100
