Out of the frying pan, into the fire?
Went out and bought Backup Exec? Especially from that time... We used it, and moved on.
The perils of dusty old kit, a cashing-in of brownie points and if in doubt, blame Microsoft! Start the week with another Register reader Who, Me? confession. "George" (for that is not his name) was that most common of breeds in the tech world – a self-taught IT guy responsible for a small network of users at a local firm. …
After six hours of hammering away, with the birds starting to twitter, I took a loo break. I came back to three horrified friends sitting round a blank screen; one had accidentally killed the power. After they left I reloaded Elite from cassette and saved at every bloody space station. Some lessons stay with you.
"...File systems are software too but nevermind..."
Lars - not sure who you keep aiming this at, but I was specifically talking about hardware in this case. Things like dried-out electrolytic capacitors in power supplies spring to mind: fine while running, but unable to cope with the sudden inrush when the power is removed and turned back on. Or dry (cold, for left-pondians) solder joints that were OK when warmed up but fail after a power cycle... or any of a number of other things.
Happened to me with a home built PC with mirrored HDDs. For me it was the motherboard that joined the choir eternal.
No problem? Hah!
The company no longer made that motherboard and, indeed, no longer made any SCSI motherboards. Plus the disks weren't translatable by a new MB, probably due to an odd file system the original MB used (which is, admittedly, a software issue).
Hardware breaks and hardware problems cascade.
Electrolytic capacitors do very strange things electrically when they 'pop'. The circuits they're in will be exposed to some very odd current fluctuations and voltage levels, so it's no surprise that when an old cap blows, other damage can occur.
One of my A-level projects (how old am I now... er blimey) was to hook up a very sensitive current measuring device (basically a very fast data recording multi-meter) to a computer, and gleefully overcharge every electrolytic cap I could get my hands on while recording the current and voltage through the test circuitry. :D.
IIRC we even blew a 1 farad (nearly 2 inches in diameter), which had been in the stores for years waiting for its moment of glory; everyone was cleared out of the lab and we were still deafened.
I had a client running on very ancient H/W (e.g. the processors were Pentiums and the biggest disks were 4GB, but there were a lot of them). The disks were mirrored on a belt-and-braces principle: they were mirrored at the controllers, and each mirrored pair was mirrored again in the database. There were, of course, database and OS backups (tested by being restored as part of the BC plan). There were rumours of disks failing to restart, so it was still a stressful time when the installation had to be powered down for a rewire. There was even some debate as to whether it could be kept running on an emergency supply. In the event it all went smoothly and the system came back up.
When the entire business resides on your storage system, whatever that might be, an element of paranoia about every single aspect of it isn't a failing; nor is it a luxury. It's an absolute essential.
Yep, we had an IBM AS/400 in the mid 90s. The UPS failed (the battery check said 100%; the power failure said 0%) and the AS/400 went down. It took a couple of hours to get the power back, by which time the components had cooled and the DASD, which had been running on a wing and a prayer, seized. The bearings had dried out.
"... and the DASD, that had ..."
Ah the old IBM "sow FUD with obscure terminology and names".
OEM salesman: If you buy your disks from us you can save oodles of cash.
IBM user: What are these "disks" of which you speak? Anyway, we don't have them!
OEM salesman: Our disks are much cheaper and they are plug compatible; only your wallet will notice the difference!
IBM user: Begone! We don't need your "disks"!
OEM salesman: All modern computers have disks! Where do you save all your files, otherwise?
IBM user: Our computers use the very latest and greatest Direct Access Storage Devices (DASDs), we don't need these "disks"!
OK, probably apocryphal but those were the days!
"Ageing hardware that hasn't been power-cycled for yonks tends to give up and die when it finally gets turned off and back on."
A few years ago I was doing an SBS 2003 to 2011 migration for a client and was well aware of the age of the hardware. The old server needed one reboot to apply the SBS migration settings before I could begin, and of course the RAID controller couldn't take it and the whole thing died.
Ended up with a MacGyvered old desktop with a load of extra drives in it to restore the old server to, purely so I could start the migration again.
As does aging hardware that hasn't been powered on at all for yonks.
I've been involved in a retrocomputing project to resurrect some PDP-11/23 systems I worked on back in the early 80s. Many, many tantalum capacitors needed replacing on the proprietary audio cards (my partner in crime, a hardware guy, did all that soldering, fortunately). Some boards were dead, but fortunately we've so far had enough spares (fingers crossed! -- because some of the ICs are no longer available). Other problems arose and were mostly solved, though a couple still bedevil us.
One challenge was to get the data off the old MFM hard drives -- Seagate ST-506s and ST-412s, plus compatible drives from other vendors. These things are well past their use-by date -- the Seagates have a specced design lifetime of five years.
We had a mix of results. Some drives had so many uncorrectable sectors as to be useless. At least one couldn't be read at all, not even with errors. But several (enough! *whew*) imaged perfectly -- which astonished me to be honest -- or with only an error or two.
The drives I'm proudest of are the two whose spindle bearings had seized. They refused to spin up, and gave off a bad smell when we kept trying. (Not stiction. If only.) Naturally, the drive recovery place we'd been hoping to use had recently got rid of all the gear they'd need to work on drives of such ancient vintage. That meant we were on our own. Also naturally, these were the two most important drives of the eight in our collection -- the ones containing the only copies of some really critical data. So extreme measures were called for.
We first tried pushing the platter assembly by hand, in the hope of freeing it up. It turned, but not freely; there was too much resistance for the spindle motor to take over. Then we baked a drive (at a LOW temperature, maybe 100ºF/40ºC, I don't exactly recall) in the hope that that would make the lubricant more flowy. That didn't work either. (We were lucky, though; at least we didn't give the oven a permanent burnt-electronics smell.)
All this time, of course, we were all too aware that anything we did to one of these drives might kill it completely. Never mind the torture we were subjecting them to; every power-on might be a drive's last.
In the end, in sheer desperation -- acutely aware of how crazy this was, but with little left to lose -- we drove the platter assembly externally, using a die grinder borrowed from an auto shop (basically a super-beefy Dremel), and using a laser tachometer to help us keep things within the necessary few percent of 3600 RPM, for the couple of minutes necessary to capture each drive image.
But capture them we did!
Colour me impressed. That is some dedication.
Back in a previous life we had some customers bringing back dead drives (why yes, of course, Seagate: mostly ST-125/138, but some 5.25" versions too) for warranty replacement.
In some cases the disks were readable by borrowing the logic board from a working drive. No, we were not a recovery company, nor was data recovery part of any official support, but if the customer was pleasant and asked nicely I sometimes did give it a try.
Reminds me of a couple of hard discs that came my way in the early '80s, for a BBC Micro no less.
They were full-height Maxtor 20MB MFMs in the 5.25" form factor, and they had two thumping great 5W resistors on top that had a habit of cooking their solder joints to the point where they went high-resistance and access problems started happening.
The repair was fairly simple: reflow with lots of flux and ordinary solder to get things melted again, clean up the oxidisation, then re-solder with high-melting-point solder, and things would return to an operational state, until the next time it happened.
I guess this is why they were removed from the original system and became available for salvage in the first place. But on the flip side, it did me for several years until better drives became available.
No access to the bearing without a lot more disassembly, which would have been especially problematic given that we had no clean room to do it in. And even then, there'd have been no guarantee of being able to get at it.
Would we have gone there, had the approach I've described failed? I'll never know...
Or drives that refuse to spin back up after cooling down due to bearing lubrication, without some external encouragement, that is. Had that happen with a Convergent (MiniFrame/MegaFrame). My memory is a bit hazy, but it may have been a Micropolis disk. Once encouraged to spin up, it ran happily for a long time again.
I remember having to pull an all-night repair session on a server I was replacing the tape drive on; it turned out it had never been turned off and did not want to start after I shut it down. As a bit of a kill-or-cure measure, I put one of the disks from the RAID array in a desktop and it magically booted fine, giving us back the data.
As a consultant, I'm guessing it would be easy to fib, but in the long run I believe it is better to be quite honest and own up to one's mistakes. Once you have a reputation for being a liar, it doesn't go away.
That said, I will freely admit that I have sometimes responded quite positively to a user's question about "did you take care of that?", only to go and feverishly code the solution once the user had gone away.
But I never lie. It's bad for business relations.
Indeed, it is essential to state that "a small number" of users are affected. Even if that small number is actually "all users", all finite numbers are 'small' in some sense of the word.*
*(PhD in Mathematical Logic, specifically transfinite countable ordinal representations using the Ackermann hierarchy of arithmetic functions.)
He didn't even have to lie. He could just say: "After the power outage, when I turned them on, only one came up. Aging desktop systems crammed under a desk were never an acceptable solution, which is why I campaigned so hard to get proper servers to replace them. In hindsight, I should have foreseen the possibility of this and planned the changeover to happen before the power outage."
No one would blame him, and he'd probably get praise for his foresight in obtaining the proper servers as otherwise they'd have been SOL when they came in Monday morning.
....is kinda in line with the article title.
I was an IT tech in a school, looking after one division that had its own server, an old RM CC3 server!
It regularly crashed and burned and had to be rebooted pretty much every morning, which could take anything from 5 to 20 minutes.
Rebooted one day and got "no operating system". I had a feeling it would happen one day.....
I left it like that, waiting for someone paid more than me to deal with it. Two days it stayed like that, with no teaching network available in 20-odd classrooms and 3 IT rooms.
Day three I decided to try a reboot again and noticed a USB stick I'd left in.
You can guess what happened.
Never did own up to that; I managed to 'deflect' the truth by saying the server had just decided to start working again and that it was a real wake-up call that it should be replaced ASAP. It was another year before it got replaced!
One place I worked was much like that described in the article: the "servers" were four desktop-class 386 machines. Not even towers; these were the old landscape-mode cases. At least they were on shelves. The O/S was Xenix.
Things were fine (well, fine'ish; Xenix wasn't all that stable, so users were used to a few crashes per week) until I installed a kernel update. Most of the machines were still fine, but one refused to boot. Same software; same update; different results.
Thus did I learn about the BIOS 1024-cylinder limitation.
I hadn't installed these machines, but my predecessor, who did, clearly didn't know about that limitation either. On the original install onto empty disks, everything had fit; but now, with the disks fragmented, the new kernel was scattered across parts of the drive where the BIOS couldn't see it.
The machine in question booted from a 5 1/4" floppy for months, until we got a chance to do a reinstall -- with a /boot partition this time.
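For anyone who never met it: the old PC BIOS disk services (INT 13h) address sectors by cylinder/head/sector, with only 10 bits for the cylinder number, so a boot loader that goes through the BIOS can only read sectors on cylinders 0 to 1023. Below is a minimal Python sketch of that check. It assumes a hypothetical geometry of 16 heads and 63 sectors per track (the real drives' geometry isn't given here) and that the kernel's block addresses are already known as a list of LBAs; `lba_to_chs`, `bios_limit_bytes` and `unreachable_blocks` are illustrative names, not part of any Xenix tool.

```python
# Sketch: can a BIOS INT 13h (CHS) boot loader reach every block of a file?
# The geometry used below is hypothetical; real drives report their own.

from typing import List, Tuple

BIOS_MAX_CYLINDERS = 1024  # INT 13h packs the cylinder into 10 bits: 0..1023
SECTOR_BYTES = 512


def lba_to_chs(lba: int, heads: int, sectors_per_track: int) -> Tuple[int, int, int]:
    """Convert a logical block address to (cylinder, head, sector)."""
    cylinder = lba // (heads * sectors_per_track)
    head = (lba // sectors_per_track) % heads
    sector = lba % sectors_per_track + 1  # CHS sector numbers start at 1
    return cylinder, head, sector


def bios_limit_bytes(heads: int, sectors_per_track: int) -> int:
    """Bytes addressable before hitting cylinder 1024."""
    return BIOS_MAX_CYLINDERS * heads * sectors_per_track * SECTOR_BYTES


def unreachable_blocks(file_lbas: List[int], heads: int, sectors_per_track: int) -> List[int]:
    """Return the blocks of a file that lie beyond the BIOS cylinder limit."""
    return [
        lba
        for lba in file_lbas
        if lba_to_chs(lba, heads, sectors_per_track)[0] >= BIOS_MAX_CYLINDERS
    ]


if __name__ == "__main__":
    HEADS, SPT = 16, 63  # hypothetical geometry
    print(bios_limit_bytes(HEADS, SPT))  # 528482304 bytes, i.e. 504 MiB
    # A "fragmented kernel": mostly low blocks, two past the limit.
    kernel = [100, 2000, 1_032_191, 1_032_192, 1_500_000]
    print(unreachable_blocks(kernel, HEADS, SPT))  # [1032192, 1500000]
```

With that geometry the BIOS can see the first 1024 × 16 × 63 sectors; a single kernel block past that point is enough to make the boot fail, which is why a small /boot partition at the start of the disk sidesteps the problem.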
Sounds like my experience with Synology NAS Raids...
If you have a Synology NAS box and run the "Antivirus Essential" package on it, be sure to turn on Smart Scan. Fail to do so, and it will scan every file, every time, not just new or changed files. It will thrash the drives till they fail. That really should be turned on by default.
Synology boxes typically have the minimum amount of RAM, that package loads the whole file into RAM and will also thrash the swap partition as it scans large files.
So even a box that's never used will spend its time thrashing the drives 24/7.
Antivirus Essential > Settings > Enable Smart Scan.
And upgrade the RAM if that doesn't fix it.
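The difference between a full scan and a "smart" one is essentially change detection. Antivirus Essential's internals aren't described here, so this is only a generic Python sketch of the idea, not Synology's implementation. It assumes a file counts as changed when its size or modification time differs from what was recorded after the last successful scan; `changed_files`, `scan` and the `last_seen` record are hypothetical names.

```python
# Generic incremental-scan sketch: only hand new or changed files to the scanner.
# NOT Synology's implementation; the (size, mtime) fingerprint is an assumption.

import os
from typing import Dict, Tuple

Fingerprint = Tuple[int, int]  # (size in bytes, mtime in nanoseconds)


def fingerprint(path: str) -> Fingerprint:
    st = os.stat(path)
    return st.st_size, st.st_mtime_ns


def changed_files(root: str, last_seen: Dict[str, Fingerprint]) -> Dict[str, Fingerprint]:
    """Walk root and return {path: fingerprint} for files that are new or
    whose size/mtime differ from what was recorded after the last scan."""
    changed: Dict[str, Fingerprint] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                fp = fingerprint(path)
            except OSError:
                continue  # vanished or unreadable; skip it this round
            if last_seen.get(path) != fp:
                changed[path] = fp
    return changed


def scan(path: str) -> None:
    """Placeholder for the expensive part (reading and checking the file)."""
    with open(path, "rb") as f:
        while f.read(1 << 20):  # read in 1 MiB chunks rather than all at once
            pass


if __name__ == "__main__":
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "."
    seen: Dict[str, Fingerprint] = {}  # a real tool would persist this between runs
    for run in (1, 2):
        todo = changed_files(root, seen)
        scanned = 0
        for path, fp in todo.items():
            try:
                scan(path)
            except OSError:
                continue  # leave it unrecorded so the next run retries it
            seen[path] = fp  # only record files once they've been scanned
            scanned += 1
        print(f"run {run}: scanned {scanned} file(s)")
```

On an idle share the second run finds nothing to do, which is the whole point: without change detection every run rereads every file. Reading in fixed-size chunks also avoids pulling large files into RAM in one go, the other complaint above.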
I tend to be honest with users, unless they are "Difficult" or tell me "Blah blah, I've stopped listening to you now". (Yes, some fuck once said that. I ignored her after that.)
The "Difficult" user thought I was great and would tell her team "Just ask Steve"; they informed her that tickets still needed to be raised. Anyway. When she finally left, to their delight (they hated her as she was, quite frankly, a cunt), I told them: "Well, I know you say she thought I was great, but she didn't know I lied to her most of the time and made technical bullshit up, especially if I'd forgotten to do stuff I told her I'd do." They found that very funny :)