It Works Both Ways
He had a standard response of 'that must be a software issue' when his attempted fixes failed
How many software engineers does it take to change a lightbulb?
None - it's a hardware problem!
As Friday rolls around it's natural to feel a little low on energy. But this week's On-Call – The Register's weekly tale of tech support trauma – is positively crackling with electricity to pep you up before the weekend! This story came from a reader we'll Regomize as "Charlie" because it comes from his time working for a …
He had a standard response of 'that must be a software issue' when his attempted fixes failed
I'm surprised he attempted fixes before his declaration.
In the early 90s I was working for a national hardware maintenance company, writing the in-house database systems and supporting the network etc. We had a large number of dental practices on our books, supporting their hardware. This was thanks to an upstream supplier, which brought the dentists on board and provided the hardware and software; the software was supported by a third company. We had a very good relationship with the supplier. The software providers were always saying "it's a hardware thing". In the early days we'd have an engineer on site, unable to fix it - coz not hardware, right? The engineers would call me to check any software likelihoods: bingo! So, reading the signs on future calls, the software mob would get really narked as the supplier allowed me to speak to them directly and explain how to fix their software issue, with a barely hidden implication of "stop trying to pull this shit!". They did rein in their shoddiness quite a bit, but we eventually took over the software support too.
Reminded of our networks team who, on being advised of a network failure, would inevitably respond with "Replace your NIC". I think that worked once, although it might have been zero times. We wound up simply going into the DC and switching cables around; once it was confirmed the fault followed the cable, we gave them the incident back. Still took them months to acknowledge that one of the RIB connections to our HP server wasn't working after we'd had to swap cables around a few times to get console connections working.
Storage team would typically also have a similar response to slow responses on the EMCs. "health checks are fine on our end", until a few days of badgering later alongside IO stats from the OS, they'd respond "we found a hot spot on some disks and we've rebalanced things" and disk responses would go back to better levels.
Those were the days when hardware inventory scans were common. Kaspersky used to have that feature (before it became unstable, unreliable bloatware).
Back when lots of people built their own home PCs from components, and work 'puters were all too easy "unwilling" donors.
We once had issues with a particular drive timing out on file copy jobs and such, and after much pestering, the storage team sent us two near-identical looking latency graphs (one for the bad server, and one for a good one in the same area) showing that there was 'no hardware problem'.
Indeed, both graphs looked very similar, with very stable latency values, indicating consistent, reliable performance! But what they'd missed was that the bad server's graph had a Y axis orders of magnitude higher than the other. This drive had, for the time, latency values you wouldn't be happy with on internet connections to the other side of the globe. Once we pointed that out, they sorted out a replacement pretty quickly, but man...
After a customer moved from main frame to "commodity hardware" (lots of x86 boxes) I got dragged into discussions about a performance problem.
I could see from an I/O trace that some I/Os were taking 100 ms... the rest of the time it was below 1ms. I reported this to the virtualisation team.
The virtualisation team said "on average the response time is below 1 ms across all the machines. So not a problem. It must be a storage problem."
The storage team said "Across all our storage the average response time is below a millisecond! - not us".
This had been going on for about 3 months. Eventually the customer took drastic action and turned off auditing to disk, and other things that did I/O.
After a lot of analysis it looked like there were some hot files on one disk. If many virtual machines all did something at the same time - you got I/O delay.
All the vendors were too keen on not looking bad - rather than on resolving the problem.
Occasionally, but rarely, you come across a vendor who is more worried about the gear running smoothly than in looking bad. Those vendors will assume their gear is broken until proven otherwise, and as such are a pleasure to work with. And the benefit is, when this sort of vendor says it's not them, one is more inclined to believe them. A vendor that takes responsibility is also one I will use over and over, but one that always tries to dodge until forced to acknowledge is one I will avoid for future business.
... average response time is ...
To find the problem, we don't want the average values of the data. We want the extremes, and the missing data points -- in other words, knowledge that at time T there should have been a data point, but the monitors captured nothing at that time.
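The point the two posts above are making - an average below 1 ms can hide a 1% tail of 100 ms stalls, and a monitor that simply misses samples hides even more - can be sketched in a few lines. This is purely illustrative; the function names, thresholds, and synthetic data are made up.

```python
# Why averages hide trouble: 1,000 synthetic I/O latency samples (ms),
# 99% sub-millisecond, 1% stalling at 100 ms. The mean looks healthy;
# the p99 and max do not.
import statistics

def summarise(samples_ms):
    return {
        "mean": statistics.mean(samples_ms),
        "p99": statistics.quantiles(samples_ms, n=100)[98],
        "max": max(samples_ms),
    }

def find_gaps(timestamps, expected_interval, tolerance=1.5):
    """Flag spans where the monitor should have captured a point but didn't."""
    return [(a, b) for a, b in zip(timestamps, timestamps[1:])
            if (b - a) > expected_interval * tolerance]

samples = [0.8] * 990 + [100.0] * 10   # 1% of I/Os take 100 ms
stats = summarise(samples)             # mean ~1.8 ms, p99 ~99 ms, max 100 ms

# A monitor sampling every second that silently dropped points:
gaps = find_gaps([0, 1, 2, 5, 6], expected_interval=1)  # the 2..5 span is missing data
```

"On average the response time is below 2 ms across all the machines" would be literally true of that data set - and useless.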
Mentioned this before - back when I used to get my hands dirty with HP LaserJet 4000 printers... I had just one spare printer network card, so when I had network problems on a printer that I couldn't solve, I would use the spare and leave the faulty card on the shelf. After a few weeks, when I had another fault, I would use the one off the shelf - which would then work fine! So the faulty one would go on the shelf and become the new spare.
Just occurred to me - is this just switching it off and on again with style????
During my 20s I was a software diagnostician for a mainframe manufacturer. I'd often get roped in when a new installation had a 'software problem'. I had numerous fallings-out with engineers who would swear on their mothers' lives that the hardware was good; unfortunately, I'd seen most of the common problems. If I could talk to all the discs in a cabinet up to the controller you'd just installed, it wasn't going to be a dead controller - it was going to be the fibre optic cable you strained as you joined the two cabinets together. Passing on this news to an engineer with 30 years' experience who really, really didn't want to take the cabinet apart to plug in a fibre connector was always going to be painful. Likewise trying to convince an engineer that the poor system performance was caused by a too-cold CPU (really), because I could get the real CPU temperature from the Node Support Computer he didn't have access to. Or that I could go three levels deeper in the diagnostic process and prove that a front end processor failure in the engineering logs was caused by read errors on a disk volume timing out, causing the OS to time out, which resulted in the front end app thinking there was a comms fault. On the bright side, I could give him a disk volume name so he could perform hardware diagnostic tests on the disk.
because I could get the real CPU temperature from the Node Support Computer he didn't have access to
That's kinda bogus though, and really can't be blamed on the engineer.
There's been times when I've done my diagnostic best and someone's told me "oh it's not that, it's this"
"Why?"
"Because such-and-such says so-and-so"
"OK, how do I get to such-and-such to see what it's saying?"
"You don't"
"OK. Fuck you."
I used to work for a company that was a reseller of video compression hardware. The boards cost a few grand each, and part of the software was a h/w diagnostic program. If I had a faulty board, I would run this and send the results to the company to get it fixed. Only to be told "that is not a valid fault description"...
On my first tech job, doing support, we often used serial cables that had been put together by guys working on another floor. That surprises me now, but I had no standard of comparison then. Once, when I complained to one of them about a cable that didn't work, telling him just how the pin-out was wrong, he told me that it was only one pin off. The benefit of all the odd cables was that I got a fair bit of practice with breakout boxes.
Very first batch of a new board, and we're all gathered round as board 000001A gets powered up, six pairs of hopeful eyes on the LED that the firmware preloaded into the flash chip was supposed to blink. Plug in the 12v supply and the sacrificial diode pops and smokes: it was the wrong way round on the schematic, and the fabricators had faithfully reproduced that for us. Through the wisps of diode smoke comes the voice of our esteemed Engineering Director, who was not at all a cock-bag: "Software problem: when will you have a fix?"
At a previous company 00001A always gathers a crowd at first power on.
Cue - "that's a bright orange light at the rear of the board, it's outshining the blue eye-burners!"
H/W Eng - "I don't have any orange LEDs - Argh!!!" and turning off the PSU with haste.
The orange glow was a shorted out PSU pin on the backplane gently burning. We never used a 21 slot rack for testing 00001A boards ever again!
Obvious icon ============>
this all sounds like the litany delivered by our former production engineer before he left last year
Any failure was due to my crappy code, and certainly not due to him buying the wrong size grippers for the robot so it drops the part every time it's supposed to take it in/out of the machine.
One of the reasons I hated him so much
There's an essential conflict between hardware and software people, which I've experienced at first hand. In the very early 1980s, my colleague and I developed this: https://www.cambridge.org/core/journals/annals-of-glaciology/article/digital-radio-echosounding-and-navigation-recording-system/48C0D56C413BA23F23F92996868E1E96
I developed the software; my colleague designed and built the hardware. If I had a fiver for every time we had the routine "It's hardware!" "No, it's software!" exchange, I'd own my own super-yacht! Fortunately, we were able to remain friends - just as well, as we were also the crew operating the system in the Arctic!
Of course, I only recall the occasions when it WAS the hardware! But we depended on the Z80 interrupt system, which was very sensitive to any noise on the edge that triggered the interrupt. Debouncing circuits abounded in the final system!
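The debouncing the poster mentions was done in hardware, but the same idea is often done in software: only accept a level change once it has been stable for several consecutive samples, so a noisy edge can't fire the interrupt handler repeatedly. The sketch below is my own illustration of that filter, not the poster's Z80 code; all names and the sample stream are made up.

```python
# Software debounce sketch: accept a signal level change only after it
# has held for `stable_count` consecutive samples.

def debounce(samples, stable_count=3):
    """Return the debounced signal level for each raw sample."""
    level = samples[0]
    run = 0
    out = []
    for s in samples:
        if s == level:
            run = 0                    # back at the accepted level: reset
        else:
            run += 1
            if run >= stable_count:    # change held long enough: accept it
                level = s
                run = 0
        out.append(level)
    return out

noisy = [0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0]  # a bouncy rising edge, then a clean fall
clean = debounce(noisy)  # the bounce at sample 2 is ignored; one clean edge each way
```

A hardware debounce circuit does the equivalent with an RC filter or a latch, but the principle - ignore transitions shorter than some stability window - is the same.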
From the time when they used cable bridges and jumpers to program.
Punched-card accounting machinery a/k/a "PCAM," was programmed by jumpers on plugboards. Our local college had one programmed to simply copy an input card deck to a line printer (an "80/80 listing"?). It was great for getting an initial printout of your card deck, so you could fix the obvious typos without wasting CPU time on the "real" computer.
I worked in a tech support team with a bloke whose only response to a problem was to re-install the software. No matter what the problem was, even when I knew it was something trivial from just hearing his half of the conversation, his answer was always re-install. When the customer complained that they had already performed two re-installs, as per his previous instructions, his response was to arrange a new copy of the install disks (yeah, long time ago) to be sent out - case closed. I swear he never actually solved a single problem, just created work for the shipping department.
When I used to work in a Uni computer lab, one task the students had to complete annually was an end-of-year show to show off their projects. They were doing a "multimedia" degree, so the projects could be games, interactive encyclopedias (this was the 90s, when these were a thing), sound or video. A couple of the lecturers organised it, and invited people in industry to the opening night, with the aim that students could get jobs if they impressed. A few students did, so it did work.
The students were given space in one of our labs, and a PC or Mac to show their work on. They were given administrator access to this machine (the machines were re-imaged as soon as the show finished). They could do what they wanted with the space, and could borrow extra equipment if needed. The students were responsible for setting everything up, running it (we still provided tech support) and packing the show up. In all the years we ran the show, I can count the number of students that actually turned up to pack up the show on the fingers of one hand; we usually ended up (with the lecturers) packing up the show and disposing of anything that wasn't ours.
One student had a need for a load of cardboard boxes in his show, but didn't turn up apart from to drop them off.
One day, one of our more straitlaced technicians went out into the labs, piled the boxes up into a column, and taped them together. He then climbed into the boxes, and when a given student wasn't watching, he'd lift the pile up, walk towards them, then put it down.
He did this for nearly an hour, and freaked out a *lot* of students.
Incident #1: AST 486 ISA PC with SCO Xenix randomly reboots, and also randomly changes speeds between 6/8/10MHz. PC shop: "It passes all diagnostics under DOS. It must be your operating system."
PC owner gives me the okay to work on it, so I get under the hood and make the following changes: (1) change secondary LPT to interrupt #5, from #7 [#7 was being used by primary LPT, and the PC HW at that time did not support shared interrupts]; (2) replace clone Western Digital Paradise video card with the real thing; and (3) replace Diamond Flower Industries extended memory card with one from AST. All was then good. It was the hardware, not the software.
Incident #2: Multiuser PC (brand forgotten) with SCO Xenix; I had added a couple of serial terminals within the office, with full hardware handshaking. Symptom was user would be working, then their screen would fill with garbage characters. My breakout boxes showed all the hardware handshaking signals in effect. Incident call with SCO yielded no useful help. SCO wanted to charge us for the incident, my company didn't want to pay for a no-help help call. This escalated to my company president and their vice-president ("It must be the software." "No, it must be the hardware." Yadda-yadda.) In the background I hear an SCO tech ask his VP, "Did they set RTS/CTS in their serial line config?" Me: "There's a setting in there for that? Lemme try it." (stty has a million options). SCO VP: "We don't guarantee that will work! It's not official advice!" I tried it by hand, it worked, I altered /etc/gettydefs, tested, all was good, and our company paid SCO for the support incident.
This one was the software, not the hardware.
(Icon for, "When I was your age, we had twenty or more people timesharing off a single 486 server!")
"(Icon for, "When I was your age, we had twenty or more people timesharing off a single 486 server!")"
And it wasn't slow, either. Seems like the faster the hardware gets, the worse the software gets. Can't imagine how fast Win95 would boot up on a modern computer, if it was able to use it all. IANASWG, so I don't know if it can, or if it would be an issue.
I configured a system with a 486SX and 48 serial ports connected to modems under MSDOS.
It was a pharmaceutical wholesaler's order entry front end, taking calls from multiple end user systems with many different parameters - baud 300/1200, 7 bits odd/7 bits even/8 bits no parity, 7 different communications protocols. All running as a TSR, as they ran reports from the same PC (despite my protestations, and there being only 1 PC - no load sharing or any of the normal redundancy that we would have these days).
The PC front end was needed as the mainframe (a) could not handle that many calls (b) could not handle the different protocols (c) seemed to be offline for about 30% of the time.
Having HNDs in Analogue Electronics, Digital Electronics, Communications and Software Engineering came in useful, and I could only blame myself!
And it wasn't slow...
If only our chief engineer (and business partner) had held that attitude!
So, I'm sitting coding, at the hardware maintenance business I mentioned in an earlier post. In comes said engineer, with a customer's RTB 486DX2/66 - it's the fastest thing we've seen to this point, in PC terms. He puts the repaired box on the desk next to me and says "Lil, I'm going to clock this to DX2/80. Let's see how fast that runs!". So he's leaning over the open chassis, having jiggled the jumpers to overclock it, and gives it the juice. *POP* - really rather loud! I'm looking at him, with a wisp of smoke coming from his fringe! Looking into the box, I see the CPU has a square hole in the middle of it - it blew the silicon straight out, mere mils in front of his face! It's not often you see someone with a genuine "WTF!?" stunned look like that :) That cost quite a bit to replace :D
It can be very frustrating from the user's pov when support personnel insist on everything being a "software problem" without examining the evidence.
In my pre-retirement gig, I was back in a finance/reporting/training role (being one of the few who understood the organisation's data). One morning, one of the accountants approached me - she had been trying for weeks to get her computer fixed, without success.
On examination, it turned out that, when she turned on the computer on Monday morning, it would start up and then sit there with a "no operating system" message. However, if you left it on for 30 minutes, then ctrl-alt-delete, it would boot up just fine and continue working as long as the computer was left on (no prizes for Reg readers diagnosing the issue).
An email to the "service" desk describing these symptoms produced the answer "I'll reimage the computer and that will fix it". When this inevitably didn't fix it, I emailed back that it appeared to be a hardware issue and needed a warranty call - only to have the same response.
This continued for a couple of weeks, until I finally gave up and told the accountant to leave her computer on over the weekend, and just wait until it finally gave up the ghost completely.
I once had just this symptom described to me while on a site sorting something unrelated.
Lid off the suspect PC. No obvious loose connections until I pressed on the processor.
Click.
The pins on the chip would expand when warm and make contact. When cold in the morning they were just a tad too short.
I have no idea how the chip had rattled loose, but it was a permanent fix.
The pins on the chip would expand when warm and make contact. When cold in the morning they were just a tad too short.
I have no idea how the chip had rattled loose, but it was a permanent fix.
Your first line is the answer to your second line :-)
It's usually called thermal creep. That expansion and contraction of the metal legs on the chips, and of the sockets themselves, can cause the chips to creep up out of the sockets. It's also why PAT requires fitted (i.e. not moulded) mains plugs to be opened and the screws re-tightened.
It was a regular occurrence when dealing with a decent sized fleet of early PCs where a whole 1MB of RAM was often 32 or more socketed chips along with the CPU and/or other support chips. Pressing all the chips in was often the fix. Even if the fault was obviously something else, pressing them all in anyway while the cover was off as preventative maintenance was good practice :-)
When your colleague decides to avoid testing the polarity of a cable repair and claims it must be a software failure.
The tell-tale burning components were a dead giveaway to those of us that knew what we were smelling.
He was of course promoted out of the way and continued to wreak havoc.
I left and have been stress free since !
Even for simple issues that are easily detected in hardware the modern way is to depend on software to show there is an error at all - blinkenlights are too expensive.
If the software is able to detect the hardware error condition at all, most of the time the hardware designers don't provide documentation on what to check for.
Add in time pressure and the usual programming practice that handles about 5% of the exceptional cases in any useful way and you end up with the current state of hard/firm/software.
No chance to figure out what is wrong without extra diagnostic tools (unless you have a degree in voodoo diagnostics).
5% of the exceptional cases? That's pretty damn high for much of the software vomited out by "modern" developers. Developers who believe that error trapping is something that oldsters use and that it's instead perfectly fine to just crash up an entire call stack with exceptions and as a result to report incredibly vague and useless error messages such as "an error has happened". Exception handling is there for the exceptional cases, the weird crap that is utterly unexpected - expected failures should be handled neatly and clearly and reported in a useful and clear manner.
For example, my car has a low tyre pressure warning. Each tyre has a sensor. Does the incompetently written software report which tyre has a low pressure? Nope, it just reports that "one of the tyres has a low pressure"... even though the software knows which one, because it has an independent damn sensor in each wheel.
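The kind of reporting the commenter is asking for takes only a few lines once you have per-wheel readings. This is a hypothetical sketch - no real TPMS exposes exactly this interface, and the names and threshold are made up.

```python
# Sketch: turn per-wheel pressure readings into specific, actionable
# warnings instead of a vague "one of the tyres is low".

LOW_KPA = 200  # illustrative warning threshold

def tyre_warnings(pressures_kpa):
    """Map {'front-left': 230, ...} readings to messages naming the wheel."""
    low = {wheel: p for wheel, p in pressures_kpa.items() if p < LOW_KPA}
    return [f"Low pressure on {wheel}: {p} kPa (minimum {LOW_KPA})"
            for wheel, p in sorted(low.items())]

readings = {"front-left": 230, "front-right": 185,
            "rear-left": 231, "rear-right": 229}
warnings = tyre_warnings(readings)
# -> ["Low pressure on front-right: 185 kPa (minimum 200)"]
```

The expected failure ("a tyre is low") is handled with a clear, specific message; exceptions can stay reserved for genuinely unexpected cases like a sensor not reporting at all.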
blinkenlights are too expensive
I believe blinkenlights, and the appropriate front-panel switches, are no longer too expensive as such (LEDs these days are so cheap), but modern PC CPU designs make blinkenlights and switches far less potentially useful than was possible in days gone by. I've seen systems with stupidly limited front panels, from which one could only set memory, display memory, change the program counter, and start or halt the CPU. The drop in potential light/switch usefulness is because so much of a modern CPU's operation and circuitry stays on-chip.
Older-era systems let you single-step, by instruction and even by clock-cycle, through a program, showing what was moving through memory, with full CPU and I/O status. Some systems had the ability for a user to change the clock frequency via a potentiometer. I don't think you can do anything like that on X86, ARM, or M1 CPUs. Maybe someone could rig up some sort of JTAG thing, but I don't know the capabilities and limitations of JTAG.
Not hardware vs. software, but network vs. hosts. In a previous job I was on the desktop team dispatched by the helldesk. There was also a network team. When our diagnostics ruled out any problems with the user's computer, we called the network team. Some of their members' standard test was ping. Their answer was "it answers ping, not a network problem", click. Fortunately for the users, some of us on the desktop team knew a bit about networking and called their bluff.
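A ping reply only proves ICMP gets through to the host; it says nothing about the service the user actually needs. A TCP connect to the relevant port is a stronger (though still incomplete) check. A minimal sketch, with hypothetical hostnames and ports:

```python
# "It answers ping" is not the same as "the service is reachable".
# A quick TCP connect test for the port that actually matters.
import socket

def tcp_reachable(host, port, timeout=2.0):
    """True if a TCP connection to host:port completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# A host can be "up" by ping yet still fail this for the user's service:
# tcp_reachable("fileserver.example", 445)
```

A host with a wedged daemon, a firewall rule, or a half-dead NIC can happily answer ping while this check fails - exactly the gap the network team's one-line test papered over.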
From when, in my semi-professional educational IT support role, whenever I contacted a hardware supplier about an issue they'd blame software. And whenever I contacted a software supplier, they'd blame hardware. Most of the time I had a pretty good idea what the problem would be before I called, and most of the time I was right. But that still didn't save me from having to jump through hoops to get a fix. And when we got proper IT support, a few years later, they got put through the same runaround bollocks whenever they had to contact a supplier for (warranty etc.) support.
By 1999, for some reason, I had become the "go-to" guy for many of the veterinarians on The Peninsula[0]. They were all running late 1980s, early 1990s IBM PCs (486 "ValuePoint" machines, for the most part, if I recall correctly), with SCO Xenix 5.x ... The vet practice management software was provided by IDEXX, but was built by PSI ... Needless to say, IBM, SCO, PSI and IDEXX all claimed the other three were responsible for any Y2K issues that may (or may not) crop up.
Most of the Vets, assured by these four companies wonderful bedside manner, switched to Cornerstone or Avimark software running on Win98. I cheerfully set 'em up and then dropped out of that world (except for a few cases). Xenix worked great, was never an issue in all the years I took care of them. Windows ... well, you know.
[0] The Peninsula is the local name for the bit of San Andreas Fault fractured rock roughly between the Golden Gate and Palo Alto.