Back when OpenStack was launched with NASA, you literally ...
... had to be a rocket scientist to run it
He wasn't trolling you, was he?
Friday has arrived once again with a tale from the smouldering world of On Call. Today's remembrance comes from "Phil", who a few short years ago found himself supporting an unnamed public cloud vendor that decided to base its product on OpenStack Grizzly. It's safe to say that it wasn't a pleasant experience. "OpenStack," …
It seems that some get offended by silly single-entendres and are unable to recognise that there are some who occasionally appreciate a wallow in gutter-level humour.
They may well want to shut down most comedy and stand-up shows too.
That might explain why the office automation job I applied for at the end of my degree course started with "are you prepared to sign the official secrets act" and it quickly became obvious the interviewer was from the military end of the company.
Finding out about my copious university bar time makes more sense than a degree almost designed to turn out weapon guidance system engineers!
> My uncle was permanently drunk from his early teens to his sad early end and still managed to fit in an unrealistically successful career in rocket science - software programming for ballistic missiles.
In a galaxy long ago, I was housesharing with someone who'd been writing software for the Eurofighter - something to do with the avionics, iirc, twenty-something years on. But he'd quit and was slowly working his way down the alcohol food chain - I don't think he was quite at the two-litre cider bottles yet, but he wasn't far off. Equally, I'm not sure if he quit his job because he was an alcoholic, or if the job had pushed him in that direction.
Either way, an odd chap, especially when you factor in that he had the "charm" gene cranked to 11; he went out pretty much every night and came back with a different lady each time.
As a young and naive kid straight out of university, I missed most of the overtones. As an older and allegedly wiser person, I wish there had been a way to help steer him away from that spiral...
HP kit is pretty resilient.
I worked at one company, who thought the ideal computer room was the top floor, south facing room with floor-to-ceiling windows and no AC! In summer, the first person in the building went into the computer room and opened the windows...
When I started, the first thing I told the CEO was that we needed AC in the room, or to move the computers into the basement. Both were vetoed, the AC budget was exhausted, because the CEO needed AC in his office and the mirrored SQL Server was already in the basement, eggs in one basket and all that... Plus, the servers had never had problems in the past.
Yeah, because they were newer and not full of dust!
I quickly put a thermometer in the middle of the rack. In winter, it was reading over 40°C, with an open window.
Summer came and, lo and behold, the temperature in the space between the servers exceeded 60°C, but there was still stony silence on the need for AC... Until one of the financial servers went tits-up - I came in one morning to screaming fans; well, screaming a bit more than usual - I have the admin-gene and could detect it! :-D
A quick status check and it was confirmed, the server was not responding. I forced the power off, pulled it out and waited... It eventually cooled down to under 40°C and I took the lid off, thick dust everywhere. With just a can of air, I sprayed the worst out and managed to get our external support company to come in on the weekend with an air-compressor and we went through the whole rack and cleaned all the servers.
But even so, only one server crashed, even though the room temperature was over 40°C and the rack temperature was over 60°C. I quickly left the company and found another job. But that HP kit is tough!
Similar story in my past.
Moved site to a "new" building (an old factory, gutted and with some newish desks). The secure server room was a windowless box made of bricks in the centre of the building.
When we were planning the move I added up the power inputs of the servers, networking kit etc. and asked for aircon to match (on the grounds that 1kW of power in needs 1kW of cooling out).
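For what it's worth, that rule of thumb sizes out easily: every watt the kit draws ends up as heat. A minimal sketch, using made-up example loads (the wattages below are hypothetical, not from the story):

```shell
# Rough aircon sizing: cooling capacity should match power input,
# since effectively all power drawn by the kit becomes heat.
servers_w=3500      # hypothetical total server draw, in watts
network_w=500       # hypothetical switches/routers draw, in watts

total_w=$((servers_w + network_w))
# 1 kW of heat is roughly 3412 BTU/hr, the unit aircon is usually sold in
btu_per_hr=$((total_w * 3412 / 1000))

echo "Heat load: ${total_w} W (~${btu_per_hr} BTU/hr of cooling needed)"
```

In practice you'd add headroom for UPS losses, people, and sun through any windows the architect insisted on.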
A somewhat patronising refusal was given by the finance director [obviously a well respected architect/electrician in his spare time !!]. In his expert opinion, a big aircon was a luxury and a cost that couldn't be justified. Fortunately this was in a meeting and was duly recorded as part of the official minutes.
Fast forward a couple of months when we've moved in and the inevitable happened. The server room was over heating and the aircon was struggling away, noisily dripping algae laden condensate (known locally by the charming nickname of 'elephant snot'). As others have reported in their experiences, the main engineering design servers, database server and e-mail systems were too hot to touch (and the threshold of pain is generally reckoned to be 60°C).
Still no movement by the FD until... somehow in a server room rearrangement the finance server was moved under the aircon (wonder how that happened). After a few drips landed on it (and evaporated away quickly), it was shut down "as a safety preventative measure" (and logged in the safety incidents and risk registers) -- just a few days before the corporate quarterly return, a VAT return and a customer status report were due; strange how coincidences happen.
The FD was a little upset but on entering "the oven" and seeing the server draped in green gunk did have the grace to admit that "perhaps we did need more air-con" and ask brazenly "how did you let it get into this state?" The old meeting minutes and printouts of his e-mails were presented to him in a folder, which just happened to be at hand. The presence of several witnesses was a great help.
I believe that the purchase processing for the new aircon broke all company records.
But yes, HP and Sun Microsystems [remember them?] did make some good kit that survived abuse.
"The old meeting minutes and printouts of his e-mails were presented to him in a folder, which just happened to be at hand. The presence of several witnesses was a great help."
Always know where the bodies are buried and have documentation. Hallmark of true BOFH.
"The old meeting minutes and printouts of his e-mails were presented to him in a folder, which just happened to be at hand".
Ah yes, the "I TOLD YOU SO" folder. That has saved my skin many times over the years.
Remember: If you make a stupid decision, I WILL keep records, and I WILL make very sure that your words are carefully archived. I do this because I expect you will do the same against me anyway.
Ah, but management sometimes do worse (especially to other, more junior managers). They'll archive comments out of context and bring them out with a nasty twist to the meaning when required.
I have had several examples pulled on me over the years.
I once said to my line manager (respectively head and deputy head), while chatting in the staffroom, that I didn't think we should be micro-managing our highly skilled and professional specialist teaching teams, but that they should be self-managing, with us setting and monitoring targets for performance etc.
This was pulled out, several years later, in a performance review as my having told her that I didn't believe in managing our staff. This despite (or, I'd hazard, because of) the fact that during that same year I'd managed to get rid of one very poor and incompetent teacher in a matter of weeks - someone she'd failed to remove for many years - precisely by setting and monitoring targets for performance. He couldn't meet them, and I could prove that they were a minimum professional standard. Whereas for years they'd micro-managed this specimen, checking every dot and comma of his work, and for a few weeks at a time he'd duly and temporarily comply. Pretty easy for him, since they'd told him exactly what to do almost hour by hour but never laid down any standards he should achieve.
A leisure swimming pool was building its new site, and during planning we said, "You need to put the server room in the roof space. You're next to the sea and all the water-plant equipment is in the basement."

As always, we were ignored. They put said room in the basement. A month later the kit in there was starting to rust, thanks to the chlorine and sea air seeping into the room. Several years on, they're still having to replace the kit regularly.
If only they had listened they wouldn't be consistently pissing money away.
Corrosive air can elude the thought processes of the unwary ... In the mid-80s, I was working for a company that built gear to dynamically allocate bandwidth between voice and data.
Incredibly Big Monster of a company started getting weird bit errors on their global T1 (E1, T3 etc ... ) network. I was assigned to track down the problem after lower level techs couldn't figure it out.
Going thru' the data, I discovered that once the problem started occurring at any one site, it gradually became worse ... It was never bad enough to actually take down a connection, but network errors ramped up over time.
Further review showed that the same team of installers had installed the gear at all the sites with the problem.
I flew out to Boca and discovered that they had installed punch-down blocks in a janitor's closet ... directly over a mop bucket full of ammonia water. It seems that was almost universally the only unused wall space in such rooms.
Blocks relocated and corroded wire replaced, no more bit-errors ...
I have a (somewhat) similar story as well, though from the cellular industry.
Working in equipment installation and maintenance, my team went on site to facilities all over the region that we took care of. One of the sites had a single HVAC unit, with the whole site located on the south facing end of the maintenance floor, next to the elevator room on top of an office tower in the middle of downtown in that city.
Of course the single point of failure HVAC chose to do just that one summer.
When we got into the room, ambient inside temperatures were in the 120°F range, and all the coax insulation was melting and dripping off of the cables.
The single HVAC was promptly replaced with a pair, and operated with a fail-over load balancing controller, which was also able to report failures...
The disaster recovery of the broadcast equipment was an interesting mess in its own right.
"you will get a contact nicotine buzz"
No, I will not. I refuse to work on that kind of hazmat, and have done since I first started working on computers. The interior of a smoker's computer is the epitome of narsty ... Several people I know quit smoking when I pointed out that their lungs undoubtedly looked and smelled worse than the mess inside their computers.
I've previously documented one machine that was sat on a carpet of cigarette ash. I had to work on that machine wearing a mask and thick latex gloves, and it was totally yellowed with all the nicotine.
By "fix", I meant removing the hard disk and cleaning it outside, placing the rest of the machine in a thick bin bag, and sealing it up tight. "The fans are gunked, the processor is totally fried, and this kind of damage voids the warranty. You'll have to buy a new one and it's not coming out of our budget".
Hard drives are (usually) at the front of a rack server, with fresh/cool air pulled over them by the fans at the back. The other gubbins is all at the back of the server, and heat mostly rises.
That's assuming it was all in one server rather than the disks being in an external enclosure (I'm guessing it was, otherwise our hero would have just plugged the enclosure in to the new server rather than moving disks.)
Go back a few years and you meet half-height 3.5-inch drives (i.e. double the current standard height) with a normal operating temperature of 60°C when cooled, and 70°C when not. I remember those monster 4 GB SCSI/SCA drives. And those were the "modern silent" ones.

Newer hard disks have a ton of sensors and adjust themselves to changing conditions, since a few degrees' difference means the tracks have moved by the width of several tracks from where they were just a few minutes ago.
OpenStack with just four people
So when one of them's on vacation and trekking the mountains, and another is having a baby, and another is having an epic session on the booze, and the fourth is stuck on jury service. OK, unlikely four-way coincidence, but ...
I hope they're not all in one workplace, where any lurgy is likely to spread and knock all of them out. The human equivalent of an overstuffed rack where the hardware fries.
Based upon the rest of the working environment, I think we all know what the answer is.
Personally, this is why I'm trying to get out of an on call role - On a quiet day, it's a nice windfall for being on call, but when you have the bad days and just want to go home and cry yourself to sleep, you don't get to.
I recognise that's the reason for the big bucks but it just makes the bad days all the worse.
I used to work for a software company selling systems to the meat processing industry.
A stoppage of more than 15 minutes was a six-figure loss even at medium-sized companies. That meant 24/7 support (slaughter lines generally started work at midnight and finished around 8-10 in the morning).

Getting a call for a stopped line at 3 in the morning, with less than 15 minutes to analyse the problem and get it working again, was pretty stressful. I'm glad I'm out of there and have no real on-call any more.
I admit I have no knowledge about Openstack, but I have used RAID.
RAID is a lot less fussy these days, but even in the 90s/early 2000s it was still possible to transplant a RAID array to another system and bring it up without much hassle, although it might have been necessary to ensure firmware levels were matched. If this RAID controller is decent enough to handle dozens of drives, it probably had better firmware than some of the lower level stuff that could still be coaxed to work.
If it's software RAID rather than hardware, whilst I'm not the biggest fan of Linux RAID, all the RAID devices are GUID based and bay position does not matter.
I have performed some terrifyingly rude operations on Linux softRAID, switching drives around, failed SATA controller taking out half the disks and just forcing the damn thing to rebuild, changing the Superblock version of an unmounted RAID set, you name it. It just keeps taking the punches!
A mate of mine used a lot of eBay hardware in an academic environment, and his warranty was basically a pile of servers in a cupboard. He used MD so that he could simply haul drives from one, shove them into another, and they'd always, always boot.
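That drive-hauling trick works because md stores its metadata (including an array UUID) on the disks themselves. A minimal sketch of reassembly on the replacement box - the device names here are hypothetical:

```shell
# Sketch only: assumes the transplanted drives appear as /dev/sdb and /dev/sdc.
# md superblocks carry the array UUID, so bay/port order doesn't matter.

# Scan all superblocks and assemble whatever arrays they describe:
mdadm --assemble --scan

# Or name the members explicitly if the scan picks up too much:
mdadm --assemble /dev/md0 /dev/sdb1 /dev/sdc1

# Sanity-check before mounting anything:
cat /proc/mdstat
mdadm --detail /dev/md0
```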
It's just the sign of an experienced ops guy - the only thing that surprises us is "everything just worked".

Yeah, the RAID controller is supposed to handle a foreign config, but what if the battery backup on the fried one didn't work? What if the heat damaged the drives? What if the drive firmware had a bug that only showed up on the new controller?

All of that has happened to me, resulting in a complete reinstall (I f@#$g hate Exchange!).
I've run NetApp kit for 20+ years now, first using them when they still used the DEC StorageWorks containers for 3.5" drives on the old FAS720 systems. Good days! Anyway, we had a ClearCase (version control software) VOBs database on the NetApp, and the system crashed with two disk failures in the same RAID group. Back then, backups were to DLT7k tape drives and would have taken days to restore, and the company was desperate to get things working again without losing data. This was using NetApp's RAID4-ish WAFL layout before they went dual parity, and we'd had two disk failures close enough together to cause data loss.
Turns out one drive had crashed its heads - you could hear it screaming and grinding. The other disk had just lost its on-disk controller board, fried somehow. So, with NetApp support on the line, we ended up doing a disk-ectomy: we took the bad-platter disk out of its StorageWorks container, pulled off its controller board, and fitted that board to the second disk.
Plugged the now hopefully good disk back into the array, fired it up and damn if it didn't start serving data again and rebuilding onto a spare disk as fast as it could. I was a very happy guy to see that happen.
I've moved a zfs pool to another computer a few times, and that always works without any problems, except for one case where it didn't pick up one of the drives due to a faulty cable. Being able to get from pile of computer bits to fully working system in about 10-15 minutes is nice.
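ZFS makes this painless because pool metadata lives on the member disks, so the new machine can find them regardless of controller or cabling order. A sketch of the move - the pool name "tank" is a placeholder:

```shell
# On the old machine: clean handover, flushes state and marks the pool exported
zpool export tank

# On the new machine: list pools the attached disks are offering for import
zpool import

# Import by name, then check the devices all came along for the ride
zpool import tank
zpool status tank
```

If the old box died before a clean export, `zpool import -f` forces the import despite the pool looking "in use".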
You haven't met my user base, but most recently.......
Told for months that they have to use the two-factor authentication for the VPN, this becomes an issue*:
Last thing Friday afternoon (2 hour reset process).
Sometime Friday evening
Last thing Sunday (when something needs to be ready for first thing Monday).
The conversation will always include:
What RSA token/instructions & similar e-mails?

Oh, was that what that was about? I never set it up; I didn't think I was required to do anything with it!

I forgot my easily remembered PIN.

I can't connect, what am I doing wrong?
*Especially if they have a recently replaced Windows 10 "On-Call laptop" & didn't bother logging in before leaving branch.
Our place has set this up using a Microsoft product.
It does not challenge on activating outlook.
It does not challenge on remotely accessing the system.
It does not challenge when connecting a phone to the mail system.
It DOES challenge randomly about once every three weeks while I am in the middle of my turn at covering the servicenow tickets. If I don't respond on my cell phone (in my pocket and tangled with my keys etc) in thirty seconds it shuts down my email server connection and getting it to come back up is a journey of discovery involving clicking on links, randomly shutting down and restarting outlook and on one particularly desperate occasion clearing off a *very* busy desktop and rebooting the workstation.
The wonderful chaos that ensues when my password needs changing is a thing of beauty too, as the email is configured to demand a change (by silently disconnecting from the server and hanging) in the middle of the day, whereas the network wants it doing at my convenience but nags me for two weeks. Password aging is the cowpat in the field of computer security. A bread and circuses approach that just makes for people gaming the password vetting algorithm and database.
Where's the Tylenol?
Yeah, if you whip a drive out of an md RAID1 array and plug it into another motherboard, it just looks like a normal drive with filesystems on it, even if the partition types appear a bit suspect. Not so if you do the same with a drive out of a hardware RAID1 array.
Been bitten that way exactly once, when a RAID controller packed up, and not touched another hardware RAID controller since. (It didn't help that it had previously tried to rebuild its array by copying a pre-emptively swapped in, fresh blank drive over the top of the good one ..... Yeah, that thing you joke about. Brown trousers time when it happens for real. I ended up dd'ing the contents of the removed drive onto another new one, like you're supposed not to have to do, so it wouldn't matter if it tried the same stunt.)
You can also live-swap and grow software RAID1 systems, replacing both disks with bigger ones, with just one reboot (none if the boot drive is not part of the RAID).
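The replace-and-grow dance described above, as a minimal sketch (md device and sdX names are placeholders; one disk at a time, with a full resync between):

```shell
# Retire the first old disk from the mirror:
mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1

# ...physically swap in the bigger disk, partition it, then add it back:
mdadm /dev/md0 --add /dev/sdc1

# Wait for the resync to finish before touching the second disk:
cat /proc/mdstat

# Repeat for the other member, then tell md to use the new capacity:
mdadm --grow /dev/md0 --size=max

# Finally grow the filesystem on top (ext4 shown as an example):
resize2fs /dev/md0
```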
Wait, what? You must've dealt with some really crappy hardware raid - my sympathies.
I've pulled a RAID 1 drive out of a controller RAID and booted another server with it many times... In fact, that's one of the methods I used to use to install a two-node cluster.

Nowadays it's easier to v2p from a VM image or use proper deployment tools (bad for one's geek street cred, I think).
"Wait, what? You must've dealt with some really crappy hardware raid - my sympathies."
Well, yes, there were a lot of crappy RAID controllers out there back in the day. I was once sent to a Compaq server to replace a Compaq RAID controller. Sounds simple enough, but the caveat was firmware revisions. If the server BIOS was at the "wrong" revision level, the RAID controller firmware had to be upgraded from stock factory level before the server would even boot. Unfortunately, the server was a dev machine at the dev's house, where he worked most days, and of course the server BIOS was at the "wrong" level. He had a couple of desktops, but they didn't have the right type of expansion slots (EISA? MicroChannel?). Eventually I bit the bullet and did an enforced downgrade of the server BIOS, upgraded the RAID firmware, then put the server BIOS back to the original (and latest) version. BIOS upgrades were sphincter-tightening in those days, and downgrades were always a last resort, as the machine might not boot up afterwards.
With hardware RAID you have to trust the handful of people (if you're lucky - probably just one guy) who really understand the firmware, and if you're in real trouble late at night, that is not a good place to be! I have forced individual sectors to "act" good with the help of mdadm and assorted disk tools; I have rescued a RAID 5 array by hand-picking bad sectors and forcing a rebuild... When that shit mounted at boot and was R/W, it was the best moment of my professional life! Phew! I could NEVER have pulled that off with a crippled firmware UI from Dell Inc!
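A sketch of that "force it back together" mdadm recovery - device names are placeholders, and this is strictly a last resort on an array md refuses to start because members look failed or stale:

```shell
# See what md thinks of each member: superblock state and event counts
mdadm --examine /dev/sd[bcd]1

# Force assembly from the freshest members, accepting minor event-count skew
# (data written during the skew window may be lost or inconsistent):
mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1

# If a member got kicked out, re-add it and let the rebuild run:
mdadm /dev/md0 --re-add /dev/sdd1
cat /proc/mdstat
```

Running `fsck` before mounting read-write is cheap insurance after a stunt like this.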
I ran a small ISP in the early noughties that had a similar hardware replacement policy. Once, one of our switches broke down - one of those expensive 24-port 19" Cisco thingies. I realized that a) we didn't use all 24 ports, b) we didn't yet use any of its management facilities beyond basic port monitoring, so c) I yanked the cables from the 12-port no-name switch on my home office desk, hopped in the car, swapped the switches (the no-name one wasn't rack-mountable but luckily had magnetic feet, so I just attached it to the side of the rack enclosure) and went to bed. Next day, I dropped off the Cisco at the office, asked my admin to send it in for a warranty repair, and picked up a fresh no-name switch for home at the local PC store.
A year or so later, I noticed a box with a Cisco brand on it in our office. The admin had forgot to tell me that the repaired switch arrived (a mere week or so after the incident) and I forgot that our little ISP was still running an important chunk of traffic on a cheap no-name switch...
If I ever get dumb enough to start another company that actually has to spend $$$,$$$ on hardware, I'll make sure I have something better than a "yeah, whenever" replacement policy in place :) (I won't. Ever. The smell of data center in your clothes after another 12-hour shift standing behind a tray-mounted keyboard still makes me sick.)
I've gotten pretty lucky here at [RedactedCo]- The company has zero issues with throwing Serious Money at hardware and support contracts, and we also have things engineered for redundancy and HA where possible.
Even so, I've had to pull out the techno-necromancer's kit a few times before we got to where we currently are. Swapping CPUs between slightly similar models of PowerEdge 2950s to resurrect a failed server that blew its mainboard during a physical hardware move was interesting, but virtualizing one of the last clustered SQL servers before its shared disk packed it in was hairier still.
Some computer systems remind me of a toddler: immature, stubborn, repetitive, illogical, hard to calm down...
Me: What do you mean you want "the pink one", this IS pink!
Toddler: I want the other pink one!
Me: If you wore it yesterday, it's in the wash
Toddler: But I want the pink one!
Me: It's in the wash, how about this pink shirt
Toddler: I want the other pink one!
Me: I give up, I'll get the one from yesterday with the ice cream stain on it.
Partner's laptop on Thursday, just before a trip (short version - I tried a few combinations & other tricks during the tantrum).
OK, let's stick a nice 120GB SSD in you & a fresh Windows 10 install & bump the memory up to 8GB from 6GB.
Don't wanna boot - BSOD!
Put in original spinning rust - Don't wanna boot - BSOD!
Refit SSD & original 6Gb - OK I'll boot & start installing..... OK I'm installed.
Bump up memory - Don't wanna boot - BSOD!
Refit original memory - Don't wanna boot - I WANNA REPAIR NOW!
Repair (That took longer than a fresh install) - OK then apply usual tweaks, enhancements, software & Office 2019 Pro, all working.
Time to make backup recovery image on second partition - I don't wanna let you boot from USB or get back into the BIOS to boot off USB (Despite setting the boot menu up earlier).
Oh fuck you then - I'm going for a beer!
I'll make an image on its return........
Or indeed whether it's public or private.
There seems to have grown up a general managerial attitude that anyone who knows what they're talking about is an adversary - or, more to the point, is being adversarial - and has to be resisted.
Any budget proposed is obviously, therefore, assumed to be inflated.
Any time scale estimated must be too long.
Any risk noted too fussy.
Any scheme of work too elaborate.
So the cost estimates that go forward are made optimistic to the point of fantasy.
The time scales are inadequate to even get the preparatory work done.
Several key components will prove untenable due to unplanned-for problems.
And the approach (read shortcuts) taken to perform the task will end with some major functions having to be omitted - or postponed to some future horizon.
And this holds whether it's a computer upgrade programme or building a new school.
Biting the hand that feeds IT © 1998–2020