You should have been sacked
Not him, but the person who left a newly employed trainee unsupervised without sufficient contact details.
Hello readers! You've found your way to the sickest of El Reg's columns, Who, Me?, where readers share their most embarrassing moments for the pleasure of everyone else. This week, we meet "Zac", who once managed to take out major stores' point-of-sale systems across the land while working at a finance company back in the late …
Also, isn't AS/400 able to provide some pretty fine grained permissions? Fine grained enough that a trainee left alone on the weekend couldn't enter such a "global" command that reset everything rather than just the one connection he intended?
Maybe the AS/400 admin was the problem here.
I remember the first ever night I was on call, also on a system that "never" went down... Guess what - it did. The problem with systems that never go down is that there is very little experience to draw on for WTF to do next without making things worse. In this case the system, also on an AS/400 curiously enough, was managing financials for multiple large pension funds, so no pressure.
Called my backup - no response. Waited 10 mins, called again, still no response so gave up on him. After reverse engineering a fair bit of the solution I managed to work out how to resolve the issue. It took me some time after that to learn how to sleep with a pager....
"Also, isn't AS/400 able to provide some pretty fine grained permissions? Fine grained enough that a trainee left alone on the weekend couldn't enter such a "global" command that reset everything rather than just the one connection he intended?"
At that time, the authorizations structure of the AS/400 wasn't that fine grained yet. Besides that, it was most likely the same command and he just forgot to replace the default "*ALL" in one parameter, something no authorization structure can fix. Been there and done that as an AS/400 programmer since 1993.
I think people (including me) sometimes forget how long ago the 90s were, and how many lessons have been learned the hard way in IT since (And how much easier it's become to share those lessons, such as by reading El Reg).
I mean, *I* tell my boss I read these forums as Professional Development & Skills Maintenance, and I'm not entirely bullshitting...
In some cases, maybe. But the reason my 1990s washing machine didn't spontaneously combust and shake itself to bits wasn't because of heroic efforts of software engineers, it was because it was a mostly electromechanical system that was too dumb to know or care what the date was. The only thing it knew was how long in minutes since the dial flipped round to the particular section of the washing machine program it was currently on.
Also, computer systems fail quite frequently. It is a nuisance when they do, but we mostly manage to survive it.
averted by dint of a great deal of very diligent and unacknowledged effort
And the bit of fiddling and faddling I did at the time - mostly Unix boxen and various firewall systems. I did lower myself to do the odd bit of poking at Windows Server though (and yes, I remember the patch that broke all the previous 'fixes' Microsoft had applied and led to the need for complete retesting..)
OK while it lasted, but when all the projects started finishing and all the 'expert' Y2k people started being released, contract rates went through the floor and I ended up going permie again.
 And other OSen - had one fun contract where we were replacing OS/2 desktops with a Y2K-compliant version - we had a target number of machines to do every night and got paid for 10 hours regardless of how long it took us since none of the management wanted to have to work nights. I seem to recall eating a lot of curries at midnight since we were usually finished by that point.
"it was most likely the same command and he just forgot to replace the default "*ALL" in one parameter, something no authorization structure can fix"
I agree about that, but for such an important command, you could at least warn. A simple "This will [insert verb, in this case reconnect] connections globally (for all users). Usually, this is not necessary. Are you sure you want to proceed?" would probably have alerted the user that they probably want to cancel the command and try again. However, as good as that solution would have been, a better one would be giving a very new employee more training than that before leaving them alone with full permissions and no backup or emergency assistance.
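A rough sketch of the kind of guard being suggested here, in Python rather than anything AS/400-specific. The command name `reset_connections`, its `target` parameter and the prompt wording are all hypothetical; the only thing taken from the story is a parameter that defaults to "*ALL".

```python
ALL = "*ALL"  # the dangerous default described in the thread


def reset_connections(target: str = ALL, confirm=input) -> str:
    """Hypothetical command: reset one connection, or all of them.

    `confirm` is any callable that shows a prompt and returns the reply,
    so the guard can be exercised without a real terminal.
    """
    if target.strip().upper() == ALL:
        # Global scope: ask first, and treat anything but "y" as a refusal.
        reply = confirm(
            "This will reset connections for ALL users. "
            "Usually, this is not necessary. Proceed? [y/N] "
        )
        if reply.strip().lower() != "y":
            return "cancelled"
        return "reset all connections"
    # Narrow scope: no prompt needed.
    return f"reset connection {target}"
```

Forgetting to replace the default now costs one extra keystroke instead of a national outage, though it does nothing about the missing supervision.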
but for such an important command, you could at least warn
I've never herded AS400 boxes but, seeing where they come from, I would suspect that they, like unix/linux/mainframe boxes, tend to assume that operators and root logins know what they are doing and don't throw up too much in the way of warnings.
I would agree that the people mainly at fault are the schedulers that didn't ensure that someone experienced was present as well as the n00b.
And bear in mind, we've only heard one side of the story. As told, it sounds like an accident in a system that wasn't entirely fit for purpose. But if it had been told by the person responsible for that setup, the spin would surely have been different: zac was messing about, overrode safeguards, perhaps gained unauthorised access to a privileged account, or somesuch. And maybe a different angle on what zac calls a 15-minute bathroom break.
Bah. Such blind speculation needs one of these -->
TBH That reminds me of most mainframe techs of that era... I started to learn that they were mostly like that. Their mistakes could be hidden behind what they believed you would never discover, file extension limits reached, lack of MIPS etc.
It was around this time I also discovered that mainframe software was NEVER optimised. The answer every year was, just to buy more MIPS. Lots of life lessons learned about how the mainframe budget being so much bigger also meant far fewer questions were asked about business cases or value for money that I was routinely punished with for asking for anything.
I found XEDIT on IBM to be a pretty optimized editor. Successive ‘show/hide’ that built on each other were like grep on steroids. Then you could finish it all with a c/foo/bar/all that would only change the foos still being displayed. Adventurous? It had a built-in macro mode. Only similar editor is Kedit but that’s Windows only, ASCII only and EOL.
That’s about it for my mainframe fondness though. Command mode was a bitch to get used to and my work programming languages at the time were COBOL and JCL :-(
"It was around this time I also discovered that mainframe software was NEVER optimised."
Really? Mainframes (and let's not forget minis) with 256K (yes, K) or so were capable of running complex systems decades ago, mainly because they simply HAD to be optimised, before PCs came along and added tons of bloat to everything.
Oh, I don't know. The DEC compilers were pretty good.
We had a rep show up at our data center. We had a bunch of VAX and were considering replacing them with a mainframe. They installed a test rig and the rep gave us a tape, saying we should install the FORTRAN program on the VAX and their mainframe and let them run.
We should call him in a week or so, when the mainframe was finished, the VAX would still be churning away.
When he got back to his office a couple of hours later, there was a message for him to call us back. The VAX was finished and the mainframe was still chugging away.
It turns out the mainframe had compiled the software and ran it. The VAX had optimized it and ran it. The difference? The VAX looked at the inputs (none), the processing (large, multi-million point multi-dimensional array, fill it with random numbers) and the output (none). The VAX compiler then made the "sensible" decision that no input and no output = nothing to do and optimized the whole array BS out of the equation and the program ran in under 1 second. Cue one very red-faced sales rep.
FWIW grammarist.com is the poster child for NoScript. So much crap ads including all the zergnet, taboola, one weird tricks, you won't believes, etc...
2 paragraphs of useful information. 5 pages of clickbait.
For laughs I pointed Vivaldi at it, which doesn't have NoScript (I should probably install the Chrome equivalent) and my CPU usage ramped right up.
It's pretty much a site I will avoid on general principles, even though I have NoScript.
My apologies, that site works OK for me, but I guess I assumed that everyone on here uses NoScript, uBlock Origin and Ghostery in their browser, uses a Pi-hole for DNS requests and runs pfBlockerNG on their pfSense firewall. How the hell would the WWW be bearable otherwise? ;-)
No worries. I wasn’t criticizing you, just the site. I really dislike their clickbait, that’s all and it was funny to see a site so over the top bad. I’d love to feed zergnet taboola etc folks to red Amazonian fireants. Of course, I’d have to apologize to the ants later.
The same thing happened when they benchmarked a P & E machine. There was a loop that iterated some calculation 1,000,000 times but since the results of the calculation were never used the compiler deleted the loop body, then noticed a loop with an empty body and deleted the loop.
I used an optimising compiler that ran 1 billion iterations of an empty loop at three eighths of a cycle per loop. I checked the assembler code. Compiler had decided to unroll the loop eight times. Eight times nothing is still nothing, but now it had an empty loop iterating 125 million times instead of a billion. Should have unrolled again :-)
DEC compilers were good. Cross-compilers could be...less good.
A program compiled this way for a military computer was taking far longer than the designers had claimed. So I analysed the output. Every subroutine went like this (approximately)
MOV R3, R5
MOV R4, R6
....code never using R3 to R6 inclusive
MOV R6, R4
MOV R5, R3
Once we took the assembler code and removed every single one of those redundant instructions, everything worked.
If I was a competent programmer back in my first job after graduating, it was purely by coincidence. Certainly never learned, other than on-the-job.
And a few years later when I returned to academia, I was surrounded by colleagues who had learned the same way, yet we were the ones teaching the next generation of CompSci students. I like to hope they came out competent to program (among other things) when they graduated, despite their lecturers' lack of education in the subject.
The *very first* FORTRAN compiler did strength reduction as a loop optimization - they had to do pretty well because they had to convince dedicated assembler programmers it was worth sacrificing a little performance to get easier to write code.
LOL - when I was helping teach a computer class at college back in the 80's they needed an 8080 cross-assembler for the students to use so I wrote them one in FORTRAN to run on an old Perkin-Elmer that sat in the corner of the classroom ... I laughed so hard when I finished.
I had to write a DOS TSR to handle a Pharmaceutical Wholesaler's front end to one of MD's mainframes - it ran on a '486 with 32 modems. It had to be a TSR "as the mainframe never went down", so normally my software was just transcoding the orders entered on rival wholesalers' systems (a feature creep, and a big task in itself as all the wholesalers had differing systems) and the PC's main job was as someone's word processor. The software's original task was that, if the mainframe ever went down, my front-end had to seamlessly take over and then pass on the orders when the mainframe came back up. It turns out that the mainframe was always going absent, but no-one really noticed as the Pharmacists would just blame it on the phone line and start again a few minutes later. The only time anyone from MD would entertain a call was Monday morning New Zealand time (late Sunday night for me) and it was weeks before they fixed the issue :-(
IME, internally developed mainframe code is just code done with internal budgets and internal deadlines. Same as .net or C# or VBA code used within an organisation.
Optimisation is almost always cut to hit the deadline or the budget or both, and spending developer time optimising old code is very hard to justify. Some devs know enough to optimise the code as they write it, but then someone changes the specification and invalidates the gains...
"Optimisation is almost always cut to hit the deadline or the budget or both"
If it's developed for use in house and the house is paying mainframe prices for MIPS any sensible beancounter should insist on optimisation. But then there's a shortage of sensible beancounters.
And even when not, I've had to optimize systems that others had written.
A financial reporting system that took over 4 hours to pack up the data for transmission to HQ. 2 days work on that system got it down to under 20 minutes. Just a few simple changes. That saved 4 hours times 5 reporting programs, times 12 months, times over 260 reporting sites. Those 2 days re-programming and testing saved the company over 62,000 hours lost productivity a year.
Another time, the customer had already invested in 4 servers and a load-balancer for their commercial website, but it still keeled over and died when more than 25 users per server tried to load a page. The programmers were good at web coding, but hadn't a clue about optimization. They had just thrown new indexes at the database to speed up loading the menu structure, but it hadn't helped. It still took over a minute to load the menu under load.
A quick run through the code to re-order the WHERE clause so the database could process it optimally - as opposed to the programmers being able to read it in "front-to-back" order (I changed it from human understanding drop-through to database starting with the shortest exclusion and working outwards). A few changes to the deeply nested surrounding IF statements in the PHP code and hey presto, the query sank from over 1 minute to execute to under 500 milliseconds. The page load time dropped to under 3 seconds.
Sometimes a company is more than willing to pay for optimization. Spending a few days of developer time to save 7 man years of accountants' time per year, for example, or customer satisfaction at not having to wait for pages to load, is peanuts in comparison to what the company will have to give out otherwise or could lose due to poor customer experience.
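For anyone wanting to check the reporting-system sums above, here is the multiplication written out in Python. It assumes, as the original multiplication implies, that each of the 5 programs runs once a month at each of the 260 sites; the variable names are mine.

```python
HOURS_BEFORE = 4      # time to pack the data before the rewrite
MINUTES_AFTER = 20    # time after the rewrite ("under 20 minutes")
PROGRAMS = 5          # reporting programs per site
MONTHS = 12           # assumed: one run per program per month
SITES = 260           # reporting sites

runs_per_year = PROGRAMS * MONTHS * SITES

# Headline figure from the comment: the full 4 hours counted per run.
gross_hours = HOURS_BEFORE * runs_per_year

# Slightly stricter figure: subtract the 20 minutes each run still takes.
net_hours = (HOURS_BEFORE * 60 - MINUTES_AFTER) * runs_per_year / 60

print(runs_per_year, gross_hours, net_hours)
```

That gives 15,600 runs a year and 62,400 hours on the headline count, or 57,200 hours once the remaining 20 minutes per run are deducted. Either way, two days of developer time looks cheap.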
Best bit of optimization I did was to speed up a shell script which produced loads of configuration files from a large flat file of data which took over half an hour to run.
I converted it to perl. I got it working with some small amount of test data, then I ran it with the full data set.
It took three seconds. It took me a few attempts to convince myself that I had improved the speed by a factor of over 600...
I did enjoy demonstrating that. I told them it was a bit quicker, and that I thought they'd be pleased, and then ran it. The expression on their faces was a picture to behold.
We had a system that For Reasons (TM) worked on networked Acorn Archimedeses and was written in Basic. The key mathematical bit was rewritten in C++ to run on a PC (not by me, by a very clever if paranoid bloke). It was surprisingly hard to get people to understand that the rewrite was 300000 times faster - this is probably about 15 years ago - my recollection is that I got them to put the main calculation loop in a loop that did every step 100000 times - they were then prepared to believe me.
Not quite the same thing, but Vax 11/780 at college with 40+ students all trying to login with custom LOGIN.COM scripts at the same time = several minutes of wait time for the command prompt. Me, I spent an afternoon reading the system call documentation and rewrote my login script in assembler, including a 1 of 100 screen name picker. (Paranoia name format for max geekiness.)
Up and running in under a second.
Quite a few years ago I was the universal techie at a small university research institute.
We had a famous (within his field...) professor who was very proud of his "find all instances of something in a database", which took about 48 hours to run. It really thrashed the disk all that time. At this time I was working at optimizing disk cache systems, and thought this was a problem that might be solved with some intelligent selection of program memory versus disk cache.
The professor was asked how much memory was used, and he said about 250KB.
I put all of the memory in the machine up as a disk cache, and the program finished in about ten minutes.
The professor nearly died, and jumped on the phone to call his department at MIT with orders to inform his techs how close they were to being fired.
One bottle of reasonable champagne was handed over to me, and the professor moaned over how much of his life had been wasted waiting for his program to finish....
Grrr. Your example is something I see *every day*, because so-called 'programmers' stopped bothering to learn how the platform they were writing on actually works and instead rely on their favorite library-of-the-day to just make it all work.
Guess I shouldn't complain too much -- if they'd do it right I'd not have the job -- but there are days I could do with less mess to clean up.
But what do I know. I've only been doing this stuff for going on 30 years. THEY have DEGREES.
The mainframe tech sounded like, sorry cover your eyes and ears, a cunt. Being in IT support, I really can't stand techs like that. It's not like he did it on purpose. Everyone makes mistakes. All I'd have said is "At least it will make you double check the command next time :)". There is no need to be nasty.
It's why I love where I am now. You are forgiven if you make a genuine fuck up. There is no benefit in berating people or making them feel small. You just advise them to be more careful and check their work.
Assume the person who decided to leave someone new to the job unsupervised had words said to them?
I recently moved a handful of users to a new test OU to test out my new group policy. Forgot to apply direct access to this OU. So those few couldn't connect in while at home. Oops. Nothing I could do, I made my apologies and they were fixed the next day when they got the updated policy. That's it. No bollocking required. I learned from my fuck up and haven't done it since.
Everyone makes mistakes, and *when* they do it'll usually be pretty big!
Not so long ago we had a network contractor who (with a straight face) said he never made mistakes. That didn't last long as he somehow scheduled some work using CET and not UK time. Cue some systems going down unexpectedly at the end of the trading day...
somehow scheduled some work using CET and not UK time
Perhaps one of those people who thinks that GMT just means "UK time", and is unaware that Daylight Saving time means that the UK is on GMT+1 in the summer? I get USAnians like that who schedule meetings for "3pm GMT", then show up at 3pm BST and complain that no-one else is there.
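The GMT/BST mix-up is easy to show with fixed offsets. This is only a sketch: the date and meeting time are made up, and real code would normally use named zones (for example Europe/London) rather than the hard-coded summer offsets assumed here.

```python
from datetime import datetime, timedelta, timezone

# Fixed summer offsets, assumed for illustration only.
GMT = timezone.utc
BST = timezone(timedelta(hours=1), "BST")    # UK summer time, GMT+1
CEST = timezone(timedelta(hours=2), "CEST")  # Central European summer time

# The invite says "3pm GMT" on a (hypothetical) summer day.
meeting = datetime(2019, 7, 1, 15, 0, tzinfo=GMT)

# What UK attendees reading the invite literally will see on their clocks.
meeting_uk_clock = meeting.astimezone(BST)

# The organiser who thinks GMT means "UK time" turns up at 3pm BST instead.
organiser_arrives = datetime(2019, 7, 1, 15, 0, tzinfo=BST)
early_by = meeting - organiser_arrives

# Same trap with CET vs UK time: a job set for 18:00 Central European time
# fires at 17:00 on UK clocks.
job = datetime(2019, 7, 1, 18, 0, tzinfo=CEST)
job_uk_clock = job.astimezone(BST)

print(meeting_uk_clock.strftime("%H:%M %Z"), early_by, job_uk_clock.strftime("%H:%M %Z"))
```

So the organiser is an hour early to an empty room, and the contractor's "evening" job lands an hour sooner than anyone in the UK expected.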
Cue the semi-regular retelling of my old IT manager "Dick" (surprising how many people called him that behind his back with just enough emphasis to make it an insult).
Can't remember the system, but it was running a piece of warehousing and billing software written in MUMPS. IT manager was apparently so intelligent he wrote part of the billing routine for the software. It only ran for one customer, and guess which customer's bills were always wrong?
Basically the routine was supposed to re-allocate payments to the oldest outstanding bill, then recalculate what was remaining and allocate to the next outstanding bill, and so on. Instead it generated random credits and debits against the account, sometimes to the tune of thousands of pounds. It did this for years, before I started looking into the issue and narrowed down where the errors were coming from. Instead of being praised for finding out what the issue was and getting the vendor to fix it, the IT manager decided to blame me for the now correct bills and tried to get me fired because the client now had several tens of thousands of pounds worth of bills outstanding that they hadn't known about previously. He'd signed off on the fix, and the report to the client, then refused to admit that he'd done anything. To the point that he even tried to claim that the fix had actually broken the billing. Needless to say in the HR meeting I didn't hold back with my opinion of him.
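For what it's worth, the intended behaviour (pay off the oldest outstanding bill first, then carry the remainder to the next) is only a few lines. This is a minimal Python sketch of that rule, not the vendor's MUMPS routine; the function name and data layout are assumptions.

```python
from decimal import Decimal


def allocate_payment(bills, payment):
    """Apply `payment` to bills oldest-first.

    `bills` is a list of (iso_date, outstanding) pairs with Decimal amounts.
    Returns (bills sorted oldest-first with new balances, unallocated remainder).
    """
    remaining = Decimal(payment)
    updated = []
    for bill_date, outstanding in sorted(bills, key=lambda b: b[0]):
        applied = min(remaining, outstanding)  # never overpay a bill
        updated.append((bill_date, outstanding - applied))
        remaining -= applied
    return updated, remaining


bills = [
    ("2019-03-01", Decimal("50")),
    ("2019-01-01", Decimal("100")),
    ("2019-02-01", Decimal("80")),
]
result, leftover = allocate_payment(bills, Decimal("150"))
```

With a payment of 150, January is cleared, February drops to 30 and March is untouched. No random credits or debits required.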
I used to keep pet rats. It was fascinating watching them learn from each other's mistakes.
Smart little rodents. Smarter than many people I've worked with.
/me looks in mailbox and learns that giving the new web hosts line-by-line CLI instructions to install the certificate has, in fact, worked. Unlike expecting them to do their damn jobs.
"The great engineer learns from other people's mistakes"
I second that. Here's an example. It takes about 10 minutes watching YouTube to learn how to safely cut down a tree. But you should NOT stop learning there. Take another hour watching videos to learn how to unsafely cut down a tree. There are numerous examples. Often involving ladders. Never use a chainsaw up a ladder unless you're thinking of a career playing the bad guy in "The Fugitive".
So true. My brother is a professional arborist, so I know that they do not use ladders. They just climb the fscking tree, attach themselves to a bit that they're not cutting off, and then proceed to cut off the bit that they are cutting off!
And the answer to the title is that the one thing they have in common is that they both fsck up trees!
Gotta be careful though - if you were to put a rope around yourself and the tree's trunk, and the tree decides to split down the trunk while you're tied to it; you might find yourself suddenly quite narrow at the waist as your body attempts to arrest the split. Results may vary.
They just climb the fscking tree
Or use a Hiab arm with a stable platform on the end.. (oldest brother is a tree surgeon, owns his own company and his daughter also works for him. She's going to take over when he's had enough of swinging around in trees..)
Which is why there are separate chainsaw certifications for working at height. As you can imagine, H&S at a tree-surgery company is somewhat strict. In the 10 years of having his own company, there's never been a serious injury.
 One of his colleagues in a previous company managed to run a stump-grinder over his own foot. Needless to say, even though he was wearing steel-toecapped boots, there wasn't much left of his foot.
Well, there was that time when I got so pissed off with my creations that I drowned them, babies and all. I thought I was doing the right thing by letting one good, righteous man and his family survive, along with a bunch of animals, but no sooner had the waters gone down than he got drunk and flashed his willy at his kids. Useless tosspot.
But if the story of Noah was true, the inbreeding would mean a population where half were so gullible that they would believe everything on FB and twitter.
Oh wait... maybe that explains Trump and Brexit.
Maybe. But one side still believes that:
§ 18B/ year or 350M/ week or 50M/ day is sent to the EU
§ 17.6M turks want to come and live here
§ Van Rompuy made a speech about backs to the wall and knives to our throats.
§ There is only one deal on offer, when the EU said at least 2 alternative deals are still on offer but the one already signed can not be altered.
The list goes on and on.
"§ There is only one deal on offer, when the EU said at least 2 alternative deals are still on offer but the one already signed can not be altered."
In the beginning, there was remain and leave. And the people voted for leave.
This was passed to the politicians and it became remain which was definitely not being done, a good deal with the EU or a no deal/WTO exit
Over time, this then became remain which is still not being done, a bad deal or leaving/WTO exit which is obviously a bad option so won't be done either.
Now we appear to be at maybe remaining "we're not remaining, we're just pausing everything for a few years", a bad deal that has been voted down 3 times, an alternative deal that is still likely to be a worse deal than remaining (i.e. a customs union with no exit clause - rule takers, not rule makers, no option to establish new trade deals and other countries have the ability to make trade deals with the EU that do not extend to the UK). And almost certainly not a WTO exit.
And a Labour party that wants an election, a SNP that wants independence, a large chunk of Ireland that wants reunification and a PM that believes being stubborn is more important than reading the writing that's been on the wall for so long I can't remember.
Oh, and an electorate that weren't very fond of the EU last time around, were gullible enough to accept the "facts" above and now view UK political parties with the same contempt as they viewed the EU.
And while there are pockets of resistance to leaving (NI, Scotland, London, metropolitan areas), the electorate is largely outside of these areas.
And while polling numbers support remain, they support it at a similar level to the lead in to the referendum.
Where will the dice land?
"And while there are pockets of resistance to leaving (NI, Scotland, London, metropolitan areas), the electorate is largely outside of these areas."
You do realise that most people live in metropolitan areas?
I completely understand and agree that many people in more rural counties and smaller/medium towns feel, quite rightly, that they have been neglected and hard done by, and they are right to be upset, but it is the UK/English government that they should be upset at. The EU and devolved governments are the ones who have actually been trying to support and invest in neglected areas.
"You do realise that most people live in metropolitan areas?"
Okay, it was meant as a summary - decide for yourself how the metropolitan areas voted (zoom in for more details):
We had an AS/400 at a client, it had been running for years. It was on a UPS and it was tested every week, you know, press the Test button on the UPS, it beeps and says everything is fine...
Cue a new IT manager and a "real" test - pull the plug and see if the warning messages get sent to the machines. Then re-attach power and carry on.
1. Remove mains power from UPS
2. The sound of "jet engines winding down"
4. Re-attach power
5. AS/400 reports broken DASD
The UPS hadn't noticed in all those years, that the batteries were at 0% charge! So, they didn't hold for an hour with their charge.
And the drive in the AS/400 wasn't used to being turned off and the bearings seized when it spun down and cooled.
I've been to vendor demos where they say that their system is fully fault tolerant and it can survive bits being switched off.
The look of horror on their faces as I reach for some random cables to unplug is always quite amusing.
Note to vendors: If you claim your system is fault tolerant, I *will* test your claims - and not in a nice controlled shutdown manner either.
One of the few times I've been mildly impressed by a salesdroid. They were demo-ing an EqualLogic SAN, and showed that when it was running you could just grab one of the controllers and pull it out and it would seamlessly fail over to the other one (ditto power etc.).
I was less impressed with an HP blade enclosure that I was expecting to run on three PSUs while I re-routed a power cable. Yep, that thing died hard. It turned out one of the PSUs was bad and could only handle about 1/3rd of its rated load.
I have a story from a usually reliable source about an AS/400 falling from a window and not only surviving, but also continuing to run without any problems.
It was a small model AS/400, parked in the window sill of an open window for a while because the cooling of the server room was down. One clumsy technician bumped the AS/400 from the window sill and it dropped outside. That technician was rather fortunate as it was a ground floor window and the AS/400 landed on top of some shrubbery. With the soft landing and the cables not being yanked out, it could keep running. The manager witnessing it didn't though, he had a heart attack (proving he had a heart after all ;) ).
There was the demo by Novell, where one of a pair of physical servers was squashed, literally, by an anvil* every hour, and no data was lost - I believe it was shown on big screen video around the exhibition site (whizzed about by ATM, IIRC).
* yes a big Acme style cartoon Anvil.
The bits of server were swept up and a new box installed...
Anyone remember where and when? It was a bit since, obvs...
the bearings seized when it spun down and cooled
I had a similar problem one time with a server: The main data disc wouldn't spin up after a controlled power off. The head of IT was getting stressed by the downtime, knowing a full restore onto a new HDD (once it arrived) would take a long time.
The head of IT started to panic when I got a hammer out!
A light touch to the side of the HDD was enough to release the bearings and the disc spun up OK. More backups were rapidly taken whilst we waited for an engineer to arrive with a new HDD.
Had a Sun E10K running hot. Read the instructions, found the filters need to be cleaned every month or three.. asked the admin and got a "what filters?" response. Obviously, at that moment I had just volunteered to do the cleaning.
So we shut the thing down, as it had been running for over a year and we were scared of moving the filters and having everything sucked in, cleaned the filters, and turned it back on.
Resulted in 3 HDDs, a switch and two of the fans (if I remember right) not getting back online, but redundancies worked pretty well at the time.
Many years ago, I worked for an IT managed services company. We were on site at a small medical clinic, late at night to do some upgrades on their server. The server we were working on had a RAID5 array with 9 drives in it. It hadn't been down in years we were told (the fear that always gives you). In those days backups would take all night, barely finishing before the offices opened in the morning, so we weren't able to wait for a backup before starting work.
So, we shut down the server and the array, add the RAM, etc. and power up the server and the external array. Sure enough, four of the drives decide not to spin up!! Once they cooled, that was it for the bearings!
I had a very young junior tech with me. I told him "quick look around under everyone's desk for a space heater". You know, in every office there is someone that is about to freeze to death at 78 degrees, and needs to have a heater under their desk (usually plugged into the same power strip as their PC). Sure enough, we found one.
I took it into the server room, laid it on its back, like a barbecue grill, and placed the four drives on "the grill". The junior tech looked horrified and said something like "what the hell are you doing?". We let them roast for a while, then put them back into the array. This time all but two drives spun up. So, those two went back on the grill. A few more minutes of roasting, and back into the array. This time only one wouldn't spin up, but with 8 drives working, the array came back up. This was about 3:00am.
We promptly ordered 9 new drives shipped overnight to replace the drives one at a time.
"A light touch to the side of the HDD was enough to release the bearings and the disc spun up OK. More backups were rapidly taken whilst we waited for an engineer to arrive with a new HDD."
Wish I knew about the hammer trick back in my AS/400 days. Dead hard drive was something like a 12 hour marathon for restore, depending on how in practice I was (IPL from 1/2" tape, 4 reels worth, then restore data from 8mm).
I did get to the point I could nap during the 1/2" tape restore steps. Not a full sleep, but a light doze until I heard the tape rewinding.
This is why tests should be as real as possible. If the full test had been planned earlier it could have all been much more controlled.
Having said that, I used to work for a well known broadcaster. Admittedly not a battery test since in the day, we didn't have 100 MWh batteries.
We had a regular three-monthly power fail test, fully planned, monitored and logged. Everything went perfectly for several years, until the diesel failed to start. Luckily, it was a test so we just switched back to incoming power and went home. Subsequent investigation revealed that all those 15 minute tests had emptied the fuel tank. The fuel level was not on the schedule for checking as it was a mechanical check, not a system check, and there was no point paying overtime for a night test of something which could be checked in daylight.
...and the genset that was supposed to start up while the UPS was holding the system up
Or, like one incident, started up happily but died after 10 minutes when the small tank of fuel was exhausted. It was supposed to be kept topped up from the much, much bigger tank under the carpark but that had developed a leak and all the fuel had drained away and no-one had noticed because the fuel level in that tank wasn't monitored.
It was afterwards.
In the late nineties working on a biggish site, one of my team was working somewhere amongst the dozen buildings, and I needed to speak to him urgently. One of my crew (who was much less of a n00b than I was) said "I know, we'll send a message with NET SEND". Tap tap tap:
net send dar1a2 /ASGARD "Please call extension 34567 immediately"
"Won't putting the domain name in send that to everyone?" I ask. "No", sez he, "it just limits the username to the ASGARD domain." <CR><LF>
So help me, the phone rang inside two seconds - just the most caffeinated and twitchy of our 4k or so logged in users, every one of whom had seen the same message pop up on their screen.
while phone_rings: answer; apologise; hang up; end_while
Actually, we just unplugged the phone and slunk away for an hour. God knows how much aggregate time we wasted.
"The UPS hadn't noticed in all those years, that the batteries were at 0% charge!"
We've just replaced the batteries in 4 of our offices; the batteries were around the 4.5 year old mark. At one of the offices it took serious effort to remove them: the battery pack had "ballooned" and we were damned lucky that they didn't explode... We should have replaced them last year but finance said they could wait... We were lucky that we didn't have to put them into use....
Thanks to having recently replaced all our old servers with VMs, our main server room has gone from 6 hrs to 24 hrs of autonomy (two very large cabinets full of batteries which are maintained twice a year in order to avoid the dead battery scenario). We had the "opportunity" to test it after a short circuit in a PDU knocked out one of the breakers. Everything stayed up except for one NAS that only had a single power supply (obviously it was on the same phase as the short-circuited PDU, Murphy smiled again...)....
Surprising how much we rely on UPS working 24/24 365/365 and it's not an easy decision to decide to test them...
Caution. Beancounters might just have something to do with your payroll.
They do, but I never encountered one smart enough to read an article on this site, leave alone the comments. And even if one did, it wouldn't understand the term "bean counter" or that it applied to it.
but I never encountered one smart enough to read an article on this site
No, they are probably all on an equivalent beancountery-type site, complaining about those frivolous and wasteful IT people who insist on replacing stuff *before* it's broken..
And what a wild and exciting site it must be, full of articles on crazy new developments in amortisation theory and ways of restricting spending on non-essentials like IT kit and techies..
Surprising how much we rely on UPS working 24/24 365/365 and it's not an easy decision to decide to test them...
That's also exactly why it isn't easy to test them, shirley? Because if something goes wrong you're in for at least a day of misery, and if Murphy is really having a go possibly a good sacking too. *
On second thought the UPS is working fine. See all the little green lights? Step away from the transfer breakers...
* In reality we are one of the few organizations that prefers to do a real test on schedule, when the right people are actually around to fix anything that goes wrong and when the mains are hopefully available for a quick automated switch back if the genny or UPS AC modules fail. Because the alternative, having it happen over the holidays on night shift when all those critical people are too pissed to see the road, let alone a terminal, is (shockingly) a rather bad idea!
My lot did it on a weekend because while there was a UPS, it was for the servers. everything else in the building got a few seconds of dead power while the generator spun up.
On the plus side, they weren't doing anything important most weekends so if it didn't come up, we could fix it on Monday no issue, and it never failed. So much no issue that Facilities stopped telling IT when the tests were happening.
And then, one day, the UPS exploded.
I've seen the small charred fragments (some still reading "23000v") from an exploded UPS cabinet, complete with the expensive smells associated.
I also recall being told by our site engineer at another job, that the switches used to swap between external power, generator, and batteries also wear out fairly quickly and are not intended to cope with repeated use. Test them, or use them, either way seems perilous. I quickly arranged for the site engineers to be part of the CAB process where all sorts of entertaining discussions came to the table.
Change xxx on day yyy, that's a Sunday so that'll be quiet... Site engineer says "Water will be off that day, so don't use the toilets and bring your own drinking water, that's our scheduled disinfection of tanks etc." so it worked both ways... I learned a lot more than I ever thought I needed to know about the engineering behind my facility at that role.
I also recall being told by our site engineer at another job, that the switches used to swap between external power, generator, and batteries also wear out fairly quickly and are not intended to cope with repeated use.
That's interesting. Normally switchgear is rated in the thousands of cycles at minimum, so even a weekly test should yield lifetime in decades (probably other parts of the apparatus start failing before the contacts do). Of course, this is nonlinear so overloading the switchgear and trying to switch under maximum load would shorten lifetime considerably, but still...of all the moving parts in a redundant power setup, the switchgear is one of the lesser likely bits of kit to fail.
When it does fail, though, the results can be ... spectacular.
Be thankful your average facility UPS doesn't handle that kind of power!
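For anyone who wants to sanity-check the "lifetime in decades" claim, here is a rough sketch of the arithmetic. The 3,000-operation rating and the two operations per test (transfer away from mains, then back) are assumptions picked purely for illustration, not figures from any datasheet, and the derating for switching under heavy load is not modelled.

```python
# Back-of-envelope estimate of transfer-switch contact life under routine testing.
# All figures are illustrative assumptions, not values from any datasheet.

def years_of_service(rated_cycles: int, tests_per_year: int,
                     operations_per_test: int = 2) -> float:
    """Return how many years of scheduled tests the rated cycle count covers.

    operations_per_test defaults to 2: one transfer to the generator or UPS,
    and one transfer back to the mains.
    """
    if tests_per_year <= 0 or operations_per_test <= 0:
        raise ValueError("tests_per_year and operations_per_test must be positive")
    return rated_cycles / (tests_per_year * operations_per_test)


if __name__ == "__main__":
    # Assumed rating of 3,000 operations ("thousands of cycles at minimum").
    for label, tests in (("weekly", 52), ("monthly", 12), ("quarterly", 4)):
        print(f"{label:>9}: {years_of_service(3000, tests):.0f} years")
```

With those assumed numbers, a weekly test works out to a little under 29 years of contact life, so the contacts are unlikely to be the first thing to wear out.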
If your budgeting system requires "a project" for everything, yes. Tracking all your costs and not just giving Engineering a bucket of money to use as they want inevitably leads to finance saying "not this fiscal quarter".
On the other hand if you're working in the finance industry, you may well be in a position where there's so much data processing going on that finance say "no" because they can't permit any mainframe downtime, lest the delays down-stream break something expensive and regulatory.
It's that tension between "working collaboratively with the teams to schedule downtime when it's least disruptive" and "sod it, this has to go in now before the whole thing falls over on its own".
It's always the same thinking by beancounters. It's working, right, so why do regular maintenance on the batteries.
I have 6 UPS across my site, carrying 46 batteries, rated 120Ah. I requested new batteries $Deity knows how many times, until one UPS started going down within a few minutes of operation, before the Diesel gen would kick in, and they got their wake-up call. Minimum age of batteries was 5 years by that time.
We do regular tests though. But not by choice. Power fails at least once a month because of lightning strikes at the local substation, or just because of shoddy infrastructure problems.
Fun extra, the diesel gen doesn't supply power to drive the building airconditioning, so humidity shoots up pretty fast. Don't turn machines back on (after controlled shutdown) right after a power restore. Condensation is a killer...
We do regular tests though. But not by choice. Power fails at least once a month because of lightning strikes at the local substation, or just because of shoddy infrastructure problems.
That isn't testing, that is real life lending a hand. Just keep the figures at hand that show bean counters cost more money than they save (unexpected down time vs timely replaced batteries in your case). Hit them with the figures after the fact. And suggest the next cost cutting should be an overly redundant bean counter.
The only problem is that upper manglement trusts the beancounters for their bonuses. The rest of us are just cannon fodder for them to use to get said bonuses. In a perfect world, the brass would bow to the computer services that generate the profit and not to the clown car that counts it.
The only problem is that upper manglement trusts the beancounters for their bonuses. The rest of us are just cannon fodder for them to use to get said bonuses. In a perfect world, the brass would bow to the computer services that generate the profit and not to the clown car that counts it.
No problem whatsoever, just have a nice, friendly chat with the auditors and let them point out in their report that the bean counters are a large risk to the continued existence of the company, figures attached as evidence. The real problem here is that most people are afraid of auditors or (mostly in IT) actively dislike them. Once you overcome that problem, auditors are just another tool for anybody with the proper xxFH1) certification.
1) xx From Hell, most famous for the Bastard Operator, but I like my Senior Programmer From Hell certificate just fine.
"He was made to promise that he wouldn’t do it again, and “that was that.” At least as far as the bosses were concerned."
This isn't so much a part of the anecdote, more a succinct analysis of who really was to blame and why. This is beginning to become a predictable common pattern in most of these stories. The fact that the AS/400 admin blamed the inexperienced guy, rather than the management and / or himself is pretty telling too, although it's what I'd expect from an AS/400 admin who called himself "the mainframe guy".
This story reminds me of my very first "IT job". Written in quotes as I was actually a trainee engineer at the time. What was interesting was that after a year of writing reports on the AS400 I made the discovery that I actually knew more about the data structure of the system than the IT guys did. Probably not that surprising given my day to day job was pretty much data analysis by that point. Did discover that myself and the guy who I'd previously been asking for help on the reports had very similar tastes in TV shows (the Mark Dacascos series of The Crow was out at the time)
Don't you love that feeling when you realize you are just entering the command line of doom, but you cannot stop your fingers in time and have to experience that dreadful finger-touches-enter-presses-enter-fully-down-FUUUUUUUUUUUUUUUUUUCK!
Followed by a light drizzle of adrenaline.
Followed by a light drizzle of adrenaline.
And a quick mental search of "when was this backed up, where are the tapes, when were they last tested." The brown underwear moment follows the realization the answers are "15 years ago", "in the cellar of the previous location", and "never", and immediately precedes deciding whether you will drive the CEO or CTO's car at full speed into the datacenter in true BOFH style...
Inspired by the discussion here about "UPS batteries don't last forever". UPSes never seem to kick in when they really need to: we've all heard stories of the UPS silently failing, or ruthlessly shutting down the server and causing its own damage, or the battery pack expanding until one day it looks like Jabba and can't do anyone any good, etc etc etc.
I posit that UPSes are actively harmful these days and what you should do is design applications and server subsystems with redundancy. e.g. independent power feeds into your data center, app pools/clusters that can survive nodes going AWOL, etc.
I could be convinced, but whenever I bring this up to colleagues they look at me with horror. Am I insane? (Don't answer that last bit.)
All of those things cost money. Some of them (facility changes) cost a LOT of money and still won't help if the local substation gets hit by lightning / a bus / whatever. Some add operational cost (hot redundant kit) and may even serve to create security holes. And as the cherry on top, the world of proprietary software loves to charge an incredible fee for said features as only banks and the like currently require them.
If your organization is using open source software, then a middle solution may be possible -- two UPSes feeding two rows / blocks / whatever of racks in failover / mesh topology. It'll still cost double to run (hot spares and all) but if you're concerned about the UPS exploding something tells me you're chasing 5 nines or similar. Microsoft can't even seem to do one nine, what makes you think the organization's paying clients expect (or more importantly will pay more for) anything better than that?
While I can share your deep mistrust of UPSes, particularly with all the stories on here, they certainly have a great deal of value. Independent power feeds are only truly independent if they're coming from wildly separate sources - different ends of the building, connected to different power companies, etc. Horrifically expensive, and still prone to the occasional blip. Whereas a routinely maintained and tested UPS - tested as in true, full-load "will it take over and run for a few minutes" test - is far cheaper and, arguably, more reliable.
I personally have a tiny consumer-grade UPS (APC, but a little one) at home, running my home server and router. All I'm looking for is about 30 seconds of extra power, as most of our outages are only for a fraction of a second. Being a mini desktop with low power requirements, though, I really get about an hour. It's prevented a LOT of reboots.
> I posit that UPSes are actively harmful these days and what you should do is design applications and server subsystems with redundancy. e.g. independent power feeds into your data center, app pools/clusters that can survive nodes going AWOL, etc.
Dual power feeds, ideally from different substations, are a good idea; however, the switch that switches from one to the other is not instantaneous but needs to see a loss of 4 - 6 complete cycles before triggering. So, unless you have a UPS, your kit has to be able to survive say 8 cycles (to be safe) with no power. When you unbox a new server from your preferred vendor how do you test that it meets their claims regarding surviving brownouts without specialised test kit? And then, after six months, 12 months, 18 months etc of operation how do you test to confirm that each server continues to meet that specification?
Secondly, redundant servers etc cost money both in terms of kit and licenses (especially if active/active). So at some point it becomes more cost-effective to have a UPS than design all systems to be redundant.
That said, I've specced UPSes for control systems, but rarely for non-control - mostly because the clients have DCs with UPS+generators so individual UPSes aren't required.
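To put the "4 - 6 cycles" transfer delay and the 8-cycle safety margin mentioned above into milliseconds, here is a minimal sketch. The function name is made up for illustration, and the 50 Hz and 60 Hz mains frequencies are assumptions, since the comment doesn't say which grid is involved.

```python
# Convert a number of lost AC cycles into the hold-up time a power supply
# would need to ride through, for a given mains frequency.

def ride_through_ms(cycles: float, mains_hz: float) -> float:
    """Return the duration of `cycles` full AC cycles in milliseconds."""
    if mains_hz <= 0:
        raise ValueError("mains_hz must be positive")
    # One cycle lasts 1000 / mains_hz milliseconds.
    return cycles * 1000.0 / mains_hz


if __name__ == "__main__":
    for hz in (50, 60):  # assumed grid frequencies
        for cycles in (4, 6, 8):
            print(f"{cycles} cycles at {hz} Hz: {ride_through_ms(cycles, hz):.0f} ms")
```

So the 8-cycle margin means roughly 160 ms without power on a 50 Hz supply, or about 133 ms at 60 Hz, which is the figure to compare against whatever hold-up time the server vendor claims.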
"not quite like the mighty S360 VM/CMS universe"
According to Wikipedia:
"The 1967 IBM System/360 Model 91 could do up to 16.6 million instructions per second. The larger 360 models could have up to 8 MB of main memory,"
Mostly worked with tape drives I think, though drum memories up to 4 MB were available.
So a fraction of the computer power of most el-cheapo smartphones. But with software good enough to run banks, etc. Feeling nostalgic now.
My over zealous (and unthinking) issue also involved an IBM AS400 point of sale system. I was writing a complete and critical upgrade of the security and one Monday morning a stressed looking admin asked me, as I arrived at my desk, if I had backed up my work off system. I had and he looked very relieved. It seems the admin staff had scheduled a full UNIX update that weekend. They had not mentioned this to anyone outside of their little clique. Nothing in users' space was backed up and the update involved a complete drive wipe. "Well, that takes time we don't have and we are not responsible for the backups of others." I can imagine them self-righteously declaring to themselves. There were more than a few screams in IT that day.
One more time, the IBM i (formerly known as AS400) is NOT and has never been a mainframe. Mainframe is 'z'.
OS400 is a wonderful operating system (i5 OS now?)
I don't know what the mainframe guys do except charge more, but I had customers with an AS400 in a closet, they come in all sizes. And still do. I'm working at a midsized bank that runs its entire Banking, ATM, POS, Internet Banking, Teller and more on one.
Biting the hand that feeds IT © 1998–2020