Logs
This is why you ALWAYS put /var/log in its own partition. That way when the logs get full it doesn't stop the rest of the system!
Welcome again to "Who, Me?" – The Register's Monday column in which readers admit to making mistakes and explain how they managed to keep their careers going afterwards. This week, meet a reader we'll Regomize as "Ken" who told us that over 20 years ago he scored a job at Amazon.com as a Linux sysadmin, a role for which he …
Looks like they were database logs, the most recent of which might be required to complete a transaction. If the RDBMS cannot write an intent entry, it doesn't start the transaction. In that case a full dedicated file system wouldn't necessarily get you any further ahead.
These days I would put transient/volatile files on their own logical volume / file system rather than a physical partition.
I'm guessing these are the logs that the database engine would use in the event of a database restoration. They would be used to restore any transactions after the last database backup. If the engine ignored the fact that it couldn't write new ones, it wouldn't be able to restore the new transactions if the need arose. Better to stop than compromise integrity.
Generally good advice.
However, some applications, when unable to write to a log, will pass the error up the stack and refuse to do anything. In some cases this is justified, e.g. database logs in a cluster.
What's worrying about this story is:
i) No monit sending a warning "hey this partition is 95% full you should take a look"
ii) Inadequate testing of the script before use in production.
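For what it's worth, the "95% full" warning in point (i) doesn't need much. Below is a minimal Python sketch of that kind of check, not how monit or any particular tool does it. The function names, the 95% default and the `/var/log` path in the usage note are illustrative assumptions, and the percentage is plain used/total, which can differ slightly from what `df` reports on filesystems with reserved blocks.

```python
import shutil
from typing import Optional


def percent_used(used_bytes: int, total_bytes: int) -> float:
    """Return the percentage of the filesystem in use (0.0 if total is unknown)."""
    if total_bytes <= 0:
        return 0.0
    return 100.0 * used_bytes / total_bytes


def disk_warning(path: str, threshold: float = 95.0) -> Optional[str]:
    """Return a warning string if the filesystem holding `path` is at or above
    `threshold` percent full, otherwise None.

    The threshold and message wording are assumptions for illustration.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.used, usage.total)
    if pct >= threshold:
        return f"WARNING: {path} is {pct:.1f}% full, you should take a look"
    return None
```

Run from cron every few minutes with something like `disk_warning("/var/log")` and mail the result if it isn't `None`; a real monitoring system adds alert routing and de-duplication on top.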
Reminds me of a former colleague, who was recruited after I had joined the company. We were using some sophisticated (for the time) software which IIRC used to take a full 8 minutes to compile the code into an .exe file. The software wrote several temporary files to a specific directory on the HDD. My esteemed colleague decided to "clean up" the HDD one afternoon when I was out of the office.
I came in the following morning to the news that the program "kept crashing".
Now "H" came from a culture where loss of *face* is a major issue, and this coupled with his arrogance made him somewhat unpopular with several people (including me).
What have you done?
Nothing!
What. Have. You. Done?
Nothing!!
Etc etc.
Eventually I extracted the story about *cleaning up* the HDD.
After half an hour with the paper manual (remember them?), checking the autoexec and config files, I realised that the oddly named temporary directory was missing. Once I recreated the directory, everything worked as it should. The program deleted everything in the temporary directory when the .exe file had compiled.
This was not the first time I had to fix his computer cock-ups and I gave him the benefit of my opinion of people who mucked about with computers without knowing what they were doing.
He lasted 11 months in the job.
/var/log should always be a separate partition, at least if you're not using ZFS (in which case give it a capacity reservation); but whether the system continues or not when that partition is full is a different issue. Secret squirrel agencies want AU-5(4) ("shutdown on failure", which does exactly that) implemented. That choice makes other controls (system monitoring) more important, but you should be monitoring free disk space without needing an SSP to tell you that.
I shouldn't need to explain the icon; it obviously had to be Sectoids.
Both statements are true.
/var should be a separate partition,
and
database transaction logs do not go there.
back in olden times, the database logs would be on one LUN (hopefully RAID 1, 10, or 0+1) because they are very write intensive
Database itself on a separate LUN; could be RAID 5 for the capacity benefit because that bit is much more read intensive
And OS and all its bits on a different LUN altogether - with /var on a separate partition from /
You cannot mix the LUNs, or put anything else on those LUNs, no matter how much capacity is wasted.
There used to be this thing called "Spindle Contention" you see...
Substitute RAID groups for LUN if you could fit the minimum 8 disks necessary into your physical server...
Not so long ago in our perspective, but back in ancient history to some, there was a third party vendor agency for a certain yacht laden database and application platform vendor that was assisting in the build out of our implementation of an application platform. Sadly the architect advisor they had demonstrated at their first presentation the level at which this vendor was operating, by insisting that each table in the underlying DB had to have a dedicated spindle in the storage array. When asked if they knew how many tables were in the planned DB they replied "I believe perhaps 50 or 60".
The DB involved had just over 52,000 tablespaces, and was projected to add approximately 1200 a year. (Hey, I was platform NOT DBA). The consult went downhill from there. And it was a LOOONNNG way down. Anyone know about HPUX filesystem i-node issues these days?
In any case, no, in EVERY case LVM is your friend
I used to do storage consulting and I had to listen to these idiot DBAs who wanted to separate all this stuff "by spindle" all the time. No matter how much explaining I'd do about how if you stripe everything across ALL the spindles it'll be faster they had been trained in the ancient days and that was all they knew. Even young ones wanted that because I guess they'd be taught the ancient knowledge that cannot be questioned by the Old Ones that came before them.
In one engagement I had an opportunity to prove it since we had a new array sitting around for months while a datacenter expansion was taking place to make room for all the new servers that would be connected to it. I got the DBA to set up a clone of one of the databases on a server that I connected to that array and developed a script so I could rapidly reconfigure the Symmetrix from the LUNs up as well as on the HP-UX host. I gave him his desired config with dedicated spindles for the stuff he wanted it to have and told him to figure out a benchmark to run. Had him run it a few times to ensure it provided consistent results. Then I tore down that config and rebuilt the array with striped metaluns, striped those across the entire array (well, the portion that would be dedicated to this project at least) using LVM and told him to run his benchmark again. He was shocked by the difference, to the point where he didn't even try to defend his previous claims. So when we did the "real" install of the production system we did it my way.
Same for RADIUS logs on Windows.
Put them on a separate drive, they manage themselves and delete themselves when full.
Put them anywhere on C: and you will regularly find yourself with no space on the boot drive and failing services etc. because of it until NPS decides to clean them up.
Add Windows deduplication on top (default fileserver settings are fine, no need to tune ANYTHING) and you will be surprised how much can suddenly fit there. Dedup cannot, sadly, be activated for the boot drive. While it is technically possible and would work fine, no one at Microsoft dared to open that can of worms, including supporting it. So they would rather write "unsupported" into their documents in case someone manages to circumvent the internal "not for C:" blockers.
I have to admit, if I were responsible for something like that in such a company, my butthole would be so clenched I wouldn't even be able to fart until I was told that no, I wasn't fired.
I'm glad that he escaped that episode unscathed (well, almost).
If you type "rm -rf /", you know exactly what you're doing. Unless, of course, it's "rm -rf / garbagedir" (note the space after the "/").
"rm -rf *" is far more insidious, as it catches out people who haven't double-checked their current working directory.
In either case, it's a rite of passage; After that, you develop paranoia and situational awareness - "I'm about to do something risky - Am I in the right place?"
Disclaimer: "last reboot" or "last | grep reboot" are fine. "last | reboot" is not. To be honest, I fully expect to make typos like that occasionally, so the second rite of passage, "backups" is learned.
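The "am I in the right place?" check can be written down rather than just felt. Here is a small Python sketch of the idea, assuming a hypothetical helper `is_safe_to_delete` and a hypothetical "allowed root" that a cleanup script is permitted to work under. It is an illustration of the habit, not a replacement for backups.

```python
from pathlib import Path


def is_safe_to_delete(target: str, allowed_root: str) -> bool:
    """Return True only if `target` resolves to something strictly inside
    `allowed_root`.

    Rejects the filesystem root, the allowed root itself, and any path that
    escapes it (for example via ".." or a stray space turning one argument
    into two). Both names are hypothetical, chosen for this example.
    """
    resolved_target = Path(target).resolve()
    resolved_root = Path(allowed_root).resolve()
    # `parents` lists every ancestor of the target, so this is true only
    # when the target sits somewhere below the allowed root.
    return resolved_root in resolved_target.parents
```

A script would call this on every argument before handing it to `shutil.rmtree`, so that the "rm -rf / garbagedir" typo fails on its first argument instead of succeeding.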
I think someone on this very website once boasted that they always modified their systems so that "the command that shall not be typed" always got intercepted by their own script, preventing its unthinking and accidental use. So it couldn't possibly happen to them.
Until they get so used to this safety net that they forget it's not standard on every remote console they could ever be connected to.
As his boss, I took the blame for a number of cock-ups that our new graduate trainee introduced: Poor supervision on my part although I was worried about competence.....
I finally lost sympathetic-mode when instead of admitting defeat, he falsified a set of test results. The series of figures recorded didn't compute to the results displayed. It wouldn't have been so bad except that the data should have been recorded automatically and the answer calculated within the spreadsheet. Instead, he hand-typed numbers into the cells.
He expected things to be done for him, so he could pass them on and take the credit. I now realise he was middle-management material. But I/we sacked him. We later found out that the recruitment agency had failed to verify his qualifications but omitted to pass this information on..... We sacked them too.
I had similar issues with a support person. He was likely to fail his probation due to poor timekeeping anyway but he then made an unauthorised configuration change which took down the ERP system database. Had he admitted it he might have had a second chance but he denied it even when presented with the evidence that he was the only admin logged on at the time so he had to go.
It later turned out that the recruitment agency had been economical with the info they'd provided on him but still insisted that we had to pay the full fee as we'd passed the cancellation date specified in the contract. Shortly after we stopped using employment agencies*!
*When I recruited his replacement three out of four candidates put forward turned out to be unsuitable for the role, and I suspect had been on the agency books for a long time. Fortunately the fourth is still with us.
For anyone outside the UK and curious what a "power shower" is: For historical reasons, including legislation in response to catastrophic failures, until the 1980s in the UK, mains-fed storage water heaters were only authorised for industrial use. Any water heater installed in a residential property had to be either gravity-fed from a cistern and permanently vented to the atmosphere, or contain only a minimal quantity of water being heated on its way through. This meant that you were limited either by the hydrostatic head of water (100 Pa per cm., minus any pressure loss due to friction along the length of the pipes) or the heater power (it takes 70W of heat to make water flowing at a rate of 1 litre per minute 1 degree hotter).
A power shower used a double-impeller pump controlled by a flow switch to draw hot water from the hot water cylinder and cold water from the cistern, which were then blended through a thermostatic mixer valve so as to maintain the temperature as the hot cylinder (often quite rapidly) became depleted of hot water.
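For anyone who wants to check the two rules of thumb quoted above (about 100 Pa per cm of head, and about 70 W per litre per minute per degree), here is a short Python sketch of the arithmetic. The constants are standard textbook values for water, the function names are made up for this example, and pipe friction losses are ignored.

```python
WATER_DENSITY_KG_PER_M3 = 1000.0      # approximate density of water
GRAVITY_M_PER_S2 = 9.81               # standard gravity
SPECIFIC_HEAT_J_PER_KG_K = 4186.0     # specific heat capacity of water


def head_pressure_pa(height_cm: float) -> float:
    """Hydrostatic pressure (Pa) from a column of water `height_cm` tall."""
    return WATER_DENSITY_KG_PER_M3 * GRAVITY_M_PER_S2 * (height_cm / 100.0)


def heating_power_w(flow_l_per_min: float, temp_rise_c: float) -> float:
    """Power (W) needed to heat water flowing at `flow_l_per_min` by
    `temp_rise_c` degrees, treating 1 litre of water as 1 kg."""
    mass_flow_kg_per_s = flow_l_per_min / 60.0
    return mass_flow_kg_per_s * SPECIFIC_HEAT_J_PER_KG_K * temp_rise_c
```

`head_pressure_pa(1)` comes out at about 98 Pa and `heating_power_w(1, 1)` at about 70 W, which matches the figures given. As an assumed example, a shower at 8 litres per minute with a 30 degree rise would need roughly 16.7 kW, which shows why an instantaneous heater struggles and a pumped supply from a stored cylinder was attractive.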
Yes, regulations such as requiring divided-flow mixer taps between gravity-fed hot water and potable cold water from the rising main. (Single-flow mixers are permitted where A: both sides are gravity-fed from the same cistern, B: both sides are fed from the mains or C: an approved non-return valve is fitted in each leg.) See also Quincy, M.E., S5E21 Deadly Arena.
If a single-flow mixer tap is installed incorrectly between supplies at different pressures (e.g. a gravity-fed water heater and the main), there is a slight risk of contaminating the public water main with germs from a customer's stored water if a loss of mains pressure occurs while the outlet from a single-flow mixer is blocked. The much greater risk, though, if the outlet from a single-flow mixer is blocked, is that of backflushing the hot water cylinder and overflowing its supply cistern.
And in the UK in systems with hot water tanks in the loft it was reasonably common for them to have missing or damaged lids to the tanks, such that the tank water could be contaminated with loft insulation, dead bugs, dust, dirt, spiders, etc. Occasionally you'd end up with drowned rats/mice or birds in them.
This is why the parents of lots of us were always paranoid about teaching us not to drink out of the hot tap, or even get bath water or shower water in our mouths etc for fear of whatever contamination was in the tanks getting into our bodies.
Legionnaires' disease lurks in stagnant (non-flushed or stationary), warm water. It's not as if you can avoid it, but under conditions where it can grow, it is very serious, especially in an aerosol, e.g. a shower.
Ignore the rules at your peril, although it is older or health-compromised people who will suffer most. I declare a personal interest.
Mum and dad had a very early power shower, two industrial-type pumps in the airing cupboard next to the cylinder, lots of flexible piping, a remote pneumatic switch and a Hans Grohe handset with interchangeable heads. Six or seven of them if I remember correctly. Very futuristic.
Depletion
When, many years later, I decided to fit one in my own house, the plumbers decided we needed a huge cold cistern in the attic, replacing the smallish one which previously kept the hot water cylinder topped up (cold came direct from the mains). What they didn't do though, was reinforce the attic joists, meaning that over a few weeks we began to notice the ceiling directly above our bed bowing alarmingly. Got that sorted, sharpish.
Noisy, both systems, but they worked.
Current house is mains pressure for both hot and cold, but not via a mains pressure cylinder (or indeed a "combi" boiler); the cylinder is vented in the conventional manner and heat is taken out via a heat exchanger. This meant I could DIY the whole thing instead of having to get a qualified and certified plumber in. Said heat exchanger is theoretically capable of transferring 70kW, which is about twice as powerful as a "big" combi boiler and means that it's possible to have two showers running at the same time, and run the kitchen tap without shower users yelling at you (a problem I have often encountered with combi boilers). And as the heat exchanger is controlled (pump speed varies to maintain a set output temperature) it means I have simple mixer taps on the showers rather than thermostatic units, which I've often had problems with in the past.
M.
I remember someone telling me that whilst he wasn't formally their boss, he would occasionally mentor new recruits. One time, 2 started at the same time on a trial, and he was asked by his boss near the end of the trial who he recommended.
He chose the one whose knowledge wasn't as good. When he wasn't sure what he was doing, he'd ask. The other guy, when he didn't know what he was doing, he'd wing it, and invariably screw up.
Is that he mentioned several times not having Linux experience, but the mistake he made that caused amazon.com to stop had nothing to do with Linux. It was a simple typo he could have made even if Amazon was using the Solaris he was familiar with.
Makes me idly wonder whether the reason he didn't catch his mistake before it went live is that he spent so much time checking and re-checking every step where the Solaris/Linux differences would come into play that he skipped double-checking the "simple stuff" like that config file with the typo.
I have found that is the best way.
If you admit it it can be quicker to find a resolution
People appreciate the truth
If something else hits the fan in the future and you say "It wasn't me" or "I did this within the last x", people are more likely to believe you and can check what you did instead of random guessing of the cause
I did this once working on an important live system. For an OS upgrade, I failed over an HA cluster service from one server to another without checking that the storage moved across properly (well, to be accurate, I took the application storage offline as I thought it was local storage to each node as in some of the other clusters I was looking after at the time.)
Got a quick query from our production support people, who knew I was working on the system, had spotted it, and they pulled me up a bit sharpish. Had the storage online and the applications started again before the client even noticed that there was a problem (they did notice, but by the time they did, it was fixed.)
Immediately the work was complete, I reported what had happened to the project and service managers (but not the end client, I left that to the CRM) to let them know there could possibly be some fallout from the client.
Everybody looked at me as if I was deranged! They thought I would have tried to cover it up, blaming something in the system. The fact that I was calm and proactive about reporting it while admitting that I'd f'd up, and was prepared to take the flak, made them think that I was actually panicking inside and needed to be protected from the fallout! From my perspective, there was no point in trying to disguise the problem or make excuses. I made a mistake in planning the work and leaving a gap in the procedure to be worked out on the fly, and I would take the consequences as a result.
I survived, and what it taught me was even if you think you know the environment inside out, plan every step ad nauseam and document it completely for review before starting the work.
Ah right "total complete autonomous self-driving that you need to watch over carefully, and which could drive you to New York in your sleep as long as you stay awake to watch it at all times". That's not at all a negligently worded thing that will lead drivers into adopting unsafe behaviours.
Besides, Tesla malfunctions don't only happen in the supervised autonomous mode -- some Teslas have got themselves rear-ended due to the crash avoidance system braking suddenly and without any discernible reason.
Just yesterday (in the UK) I noticed something odd about an Amazon delivery van parked outside my house (but delivering to another address in the street). When the driver returned to the van, I told him that his rear number plate ["license plate" for left-pondians] was missing. When he queried that statement, I invited him to walk round the van and have a look.
I then said that if I tell him, it's a friendly warning. If the police tell him, it's a bit different.
Well, it is clear that it was quite a few years back. To put things in perspective, it was around the time when only half of the population had Internet access, not every schoolboy with a phone. Amazon was then still a technical company, as I understood it.
If you take down Amazon now, you won't get a second chance anymore.
Well, it is clear that it was quite a few years back
Yeah, 20 years ago is two years (give or take) before Jobs dropped the iPhone on an unsuspecting world...
It seems like such a ubiquitous device that when people complain about the touch interfaces where I work I have to remind them gently that this place was built before the iPhone existed to popularise multi-touch gestures.
M.
I can find these:
Outage hits Amazon sites from Nov '99.
Amazon unavailable for holiday shopping madness from Dec '04, which seemed to drag on for some time afterwards.
Would Ken care to comment?!
I recall a Storage Admin like that. Asked to add some disk to a server he found there were no spares in that loop. Fortunately for him, one of the servers in the same loop had a load of disk allocated that wasn't in use. He knew it wasn't in use, the course he'd just taken from the array provider themselves had drilled into the students to check for filesystems on disk to prove it was spare before doing anything to it and there were none.
He learned the hard way that a) some DBMS's prefer their disk served raw and b) all DBMS's go down harder and faster than a Portsmouth tart on Navy payday when their storage is rudely removed without warning.
Ooh, that reminds me of something similar I did, which I'd long forgotten about:
One client we were supporting (remote support. For the youngsters, think "cloud" but on the company's site -- we connected to their servers to do work rather than the other way around) had one machine which was constantly filling up (email, documents, etc - office type stuff) and they didn't have the budget for any new disks.
Calls would be logged, we'd again show them the user usage stats (which they actually already had access to), and their local (not technical) administrators would hound users to delete stuff (these systems had no disk/account quota facility)
Anyway, one day I was logged into this machine, and noticed that one disk out of the 10 or so wasn't mounted. I knew this machine didn't run anything exotic that would use raw disks - it was a vanilla ICL DRS6000 that ran "officepower", which the users accessed from dumb terminals and, later, terminal emulators.
I did talk to the local admin first, mentioned that I'd found this unused disk, and that it appeared to hold old officepower data and user files (from a quick scan of it) and it was probably neglected some time in the past during one of the numerous refactoring jobs in the past where new or bigger disks had been added, and users files moved from one partition to the other.
She checked locally, and no-one knew of this disk and its purpose, so she agreed I could add it. I added it to the system, and moved some user accounts from other disks to spread the usage. All disks were now healthy, and she was happy...
A week later, there was a call logged that many users couldn't log in. I picked up the call, and recognised they were users I'd moved to this new disk.
It turned out that someone in the distant past had set up the equivalent of a dd if=/dev/disk1 of=/dev/disk2 bs=1M to run weekly via cron.
I assume at the time, the machine was quite empty, so someone had thought it would be a useful way to keep a quick backup that wouldn't rely on tapes.
This was undocumented (or lost in the mists of time when support of said machine moved from company to company).
Anyway, when I realised what was going on, I spoke to the admin again. She had been there many years, and never knew this was happening - indeed, over the years, if any restores had been needed, they were always done from the tape backups, so it was effectively unused.
I removed the cronjob, restored the drive from the previous night's backup, and all was fine. Fortunately, the cron job had run overnight but after the backup had completed, so nothing was lost (the office wasn't open overnight, and mail logs showed no new mails received in that window).
Reminds me of a problem. On AT&T UNIX SVR2 running on an AT&T 3B15 (think a 3B2 [if you know what that is] with faster disks and in a minicomputer cabinet) that was supplied to a certain influential UK software house writing some software for the company I was working for, the swap partition was NOT included in the disk partition table, but was left as an unused (by filesystems) area of disk immediately after the / filesystem. (For these systems, the disk partitions were defined as part of the sysgen process while the swap area was defined somewhere else in the sysgen file.)
Said expert software house, needing some more space for /, spotted this 'unused' part of the disk, and re-sysgenned the system to extend the / filesystem (they were quite clever about it, this was waaaay before anything like gparted or Partition Magic), but neglected to move the swap space.
Things were fine until they put the system under load (SVR2 swapped, not paged, but only when short of memory), and it corrupted the / filesystem. They rebooted, fsck'd the disk, and then had exactly the same thing happen again shortly afterwards. And they then reached out to us to fix 'our' system (it was on loan to them).
So I was dispatched to central London, worked out what they had done, restored the system from the last good backup, and left, following up with a written report of what I found, and what I had done to fix it.
I thought that was that, until I arrived at work a few weeks later to be asked whether I could go up to London again, because they had done exactly the same thing again....
Not so much a typo as an oversight: in my first job doing real DBA work I left out the WHERE clause in a SQL UPDATE command and gave everyone in the database the same first name. Fortunately I could restore the table from an hourly backup, and the users never noticed.
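One way to catch a missing WHERE clause before it sticks is to run the update inside a transaction and check how many rows it touched before committing. Here is a rough Python sketch using the standard `sqlite3` module. The `users` table, the sample names and the `guarded_update` helper are all invented for illustration, and other databases expose the affected-row count differently.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory table to experiment on (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT)")
    conn.executemany(
        "INSERT INTO users (id, first_name) VALUES (?, ?)",
        [(1, "Alice"), (2, "Bob"), (3, "Carol")],
    )
    conn.commit()
    return conn


def guarded_update(conn: sqlite3.Connection, sql: str, params: tuple, max_rows: int) -> int:
    """Run an UPDATE, but roll it back if it touched more rows than expected.

    sqlite3 opens a transaction implicitly before the UPDATE, so nothing is
    permanent until commit() is called.
    """
    cursor = conn.execute(sql, params)
    if cursor.rowcount > max_rows:
        conn.rollback()
        raise RuntimeError(
            f"update touched {cursor.rowcount} rows, expected at most {max_rows}"
        )
    conn.commit()
    return cursor.rowcount
```

With this in place, `guarded_update(conn, "UPDATE users SET first_name = ?", ("Ken",), 1)` raises and leaves the table alone, while the same statement with `WHERE id = ?` goes through. It doesn't replace the hourly backup, it just makes it less likely you'll need it.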