Why does Cisco remind me of HAL 9000...
"I've just picked up a fault in the AE-35 unit. It is going to go 100 percent failure within 72 hours."...
Cisco has issued a warning that an electronic component used in versions of its routing, optical networking, security and switch products prior to November 16, 2016 is unreliable – and may fail in the next year and a half, rendering affected hardware permanently inoperable. "Although the Cisco products with this component are …
"would you feel comfortable buying older cisco gear after reading this"
Why not, it's only their C2000-based stuff that's affected (so far), and in due course they'll all be replaced (?).
Other vendors who have used the Intel chips in question will be affected too in due course, unless there's something vendor-specific going on.
In almost all electronics manufacturing, they run tests to see how the product will fare over the long term, and simulate some aspects of that, since you can't just turn the clocks forward and see what happens. What that means is that, say, a disk drive maker will build a lab and set many of the same disk products on a mission to exercise and die, with variation in how they are run and in what constitutes death for that product. A good lab will have ovens and perhaps some freezer units to test the hardware at extremes. I got to visit a lab such as this back in the early 1990s at Apple. It basically had many versions of the computers being designed and readied for production, which goes ahead unless these, or other, pre-build tests fail. That's why you can look on the box it shipped in and it will tell you useful operating temperature data, along with other "windows of variance" like power supply inputs, etc.
With enough data you can piece together a profile of how a certain product will behave over the years, and provide your end consumers with a nice MTBF number indicating the Mean Time Between Failures. However, like with this clock signal component, you might not be able to tease out the failures of this device before 18 months have passed. It's a tricky situation and it makes sense for Cisco to replace h/w, as this does not look like something fixable in software.
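To put rough numbers on that, here is a minimal sketch of the usual MTBF arithmetic. It assumes a constant failure rate (the simple exponential model), and the lab figures (1,000 units, 1,000 hours each, 5 failures) are made up for illustration, not anything Cisco or Intel has published. The function names are mine.

```python
import math

def mtbf_hours(total_unit_hours: float, failures: int) -> float:
    """Point estimate of MTBF: accumulated operating hours per observed failure."""
    if failures <= 0:
        raise ValueError("need at least one observed failure for a point estimate")
    return total_unit_hours / failures

def survival_probability(hours: float, mtbf: float) -> float:
    """Fraction of units expected to still work after `hours`,
    ASSUMING a constant failure rate (exponential model)."""
    return math.exp(-hours / mtbf)

# Hypothetical lab run: 1,000 units, 1,000 hours each, 5 failures seen.
estimate = mtbf_hours(total_unit_hours=1_000 * 1_000, failures=5)

# Roughly 18 months of continuous operation, in hours.
eighteen_months = 1.5 * 8760

print(f"Estimated MTBF: {estimate:,.0f} hours")
print(f"Predicted survivors at 18 months: {survival_probability(eighteen_months, estimate):.1%}")
```

The catch is the assumption. A test that only runs each unit for about six weeks says nothing about a part that wears out at 18 months, so the model can look healthy right up to the point where the real hardware starts dying.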
That initialism surely means Maximum Time Before Failure...
The classic example where I work was the original-fit projectors. These were recommended by the company contracted for the fit-out and according to the "manufacturer" data (actually a badge-manufacturer) had MTBFs of 28,000 hours. In our use this would equate to about 10 years. The company (I didn't work for them back then) had nobody who really knew about these things, and I have seen one single email from one person who essentially said "there isn't a projector in the world that would last that long". As you might expect, he was ignored.
When I started here, a large proportion of the projectors (there are over 30) were showing obvious signs of LCD and colour filter failure. This after around 5,000 hours run time. When I finally tracked down the original manufacturer (the badge manufacturers were still claiming 28,000 hours), they admitted that the LCD module had an expected lifetime of just 4,500 hours.
Of course nobody had even begun to think about a capital budget to replace the projectors, and at £5,000 for a complete "optical block", repair was out of the question.
It subsequently transpired that the power supplies started failing at between 7,000 and 8,000 hours, though this was due to dodgy capacitors and relatively easily fixed. The BM quoted me €1,000 for a new PSU.
Long story short, we replaced the LCD projectors with DLP units from a different manufacturer which had (and have achieved) expected lifespans of 20,000 hours, cost about as much to buy new as the cost of an optical block for the originals and used lamps that cost less and lasted twice or three times as long. DLP has its downsides, but in our case it has worked extremely well.
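For anyone wanting to turn those hour figures into calendar time, a quick sketch follows. The only assumption is the usage rate, which I've taken from the "28,000 hours is about 10 years" figure above, i.e. roughly 2,800 running hours a year; the function name is just for illustration.

```python
# Usage rate implied by "28,000 hours ... about 10 years" (an approximation).
HOURS_PER_YEAR = 28_000 / 10

def years_of_service(rated_hours: float, hours_per_year: float = HOURS_PER_YEAR) -> float:
    """Convert a lifetime in running hours to years at the given usage rate."""
    return rated_hours / hours_per_year

figures = [
    ("Badge manufacturer's claimed MTBF", 28_000),
    ("LCD module expected lifetime", 4_500),
    ("Run time when failures showed up", 5_000),
    ("DLP replacement expected lifespan", 20_000),
]

for label, hours in figures:
    print(f"{label}: {hours:,} h = {years_of_service(hours):.1f} years")
```

At that usage rate the LCD modules were only ever good for well under two years, against the ten implied by the brochure.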
That said, I still have a couple of computers "out there" with original-fit Maxtor SCSI and SATA discs, now about 12 years old :-)
(it's ok, I'm doing it as a sort of experiment and there are spares ready-and-waiting)
M.
"For customers with affected products under warranty or covered by service contracts through November 16, 2016, Cisco intends to provide replacement products."
Under Oz laws at least you are entitled to a replacement from the vendor (and the vendor cannot simply point at Cisco) irrespective of the stated warranty so long as the item cost under $40k (which, being Cisco, probably counts out most of the items unfortunately!)
IANAL, but typically businesses buy equipment "as is". The warranty (or service contract) is there just to assure the buyer that they will get some use out of the equipment for at least the warranty/service period.
If Cisco had sold these products knowing they were destined to fail soon the customer could take them to court for fraud.
"Prize Plum"
This issue was picked up by /r/networking on Reddit, and one Redditor suggests which component is the culprit - it would be useful for the electronics industry at large if the part was officially identified to help other manufacturers and users plan scheduled maintenance for this issue before it gets worse.
https://www.reddit.com/r/networking/comments/5rmsw0/major_cisco_hardware_clock_issue_affecting/
From what I read, it could centre around the Intel Atom C2000 series. My understanding is that this includes the C2718 / Avotons, which, as I understand from other research into a low-power hypervisor host, would burn out/slow down/fail after around 12-18 months' use.
If it is around Intel kit, that would explain the obvious NDA that Cisco is under. Who else has the clout to impose such a thing on NetZilla?
Hadn't heard about this but Intel did mention it on their earnings call and set aside a reserve for it, so it is obviously something they believe will be a real problem (setting aside reserves is typically only done for major issues that will cost a lot of money, like Samsung's exploding phones or the Xbox red rings of death)
If so, this is really good news for most of us, as only enterprise equipment will be affected. No one has an Atom C2000 in their home wireless router or smartphone.
This wouldn't at all be a sneaky way of sabotaging second-hand kit, now, would it?
Highly unlikely. The healthy market in second hand gear is of great benefit to Cisco. There would be a lot fewer people with Cisco skills if there wasn't an abundance of old used Cisco kit on fleabay.
I also wouldn't be surprised if some older, but not yet quite EOL, kit didn't end up used in smaller, less affluent companies possibly even with SmartNet on top.
Last time it was a memory issue:
http://www.cisco.com/c/en/us/about/supplier-sustainability/memory.html
I vaguely recall it being a supplier issue (supplier initially provided components to Cisco's spec but at some point the spec changed and wasn't picked up during QA).
Or there's the long running capacitor plague - https://en.wikipedia.org/wiki/Capacitor_plague
I know it sucks having to replace newish equipment, but there's not much more that a vendor can do. (From memory, Cisco replaced equipment with faulty memory by sending advance spares, with the replaced components to be returned within two weeks for equipment covered by Smartnet and within a tighter return window for non-Smartnet equipment, although that might have just been because we were a large customer...)
A long time ago, the company that I worked for started having Z80 CPUs go tits-up after a while in the field; this was traced to an improper clock driver. I don't know the details, but until they redesigned the boards (a commercial hi-speed EKG analysis system) we were running around with a tube of mil-spec Z80s, which lasted a lot longer than the commercial version.
"Intel produces this SoC on a custom-tuned variant of its 22-nm fabrication process, which has some of the finest geometries in the industry and is the first process to adopt a "3D" or FinFET-style transistor structure.
We've already seen quite a few bigger cores manufactured at 22-nm, but the benefits of this process are arguably most notable for low-power chips like Avoton. Intel is taking full advantage of its celebrated manufacturing advantage here."
Those words from
http://techreport.com/review/25311/inside-intel-atom-c2000-series-avoton-processors
(and similar reports elsewhere).
Elsewhere, some chap called Prickett-Morgan says these are also aimed at microservers (HP were mentioned, both blades and standalone).
"Intel's celebrated manufacturing advantage", eh? 'Course, it generally works out better if the manufacturing people and the design people are in close contact early in the design. Does that still happen early enough when you buy in the FinFET technology from outside (GlobalFoundries? Samsung?), at quite an advanced stage, and maybe roll it out without a full understanding of the implications?
Anybody out there familiar with the potential medium term reliability issues of bleeding-edge ultrafine geometry semiconductors, topics like migration effects etc (especially the interaction with thermal effects in real hardware, maybe even with realistic workloads)? There's possibly a consultancy opportunity for you at Intel. Might be good money if you've got the right answers.
Assuming that there's some relevance in e.g.
https://www.semiwiki.com/forum/content/5031-electromigration-analysis-finfet-self-heating.html
It could obviously also be something entirely different. We'll find out one day.
For the rest of the world: if you have a product based on one of the affected chips, your vendor is waiting to hear from you.
I rewrote the Cisco press release..
______________________________________
Cisco strives to deliver technologies and services that work. However recently, Cisco became aware that Intel planted a timed obsolescence feature into its Atom 2000 that affects a large number of expensive Cisco products. In all units, we have seen the Intel Atom CPU degrade over time. Although the Cisco products with this Intel CPU are currently performing normally, we expect product failures to increase over the years as Intel built this in to sell more chips, beginning after the unit has been in operation for approximately 18 months. Once the Intel Atom has timed out, the system will become a brick, will not boot, and is not recoverable. This requires the end user to buy a new product. The Intel Atom is also used by a huge number of other vendors on a large number of products.
We have identified all Cisco products that have Intel Inside and tried to work with Intel to quickly get a chip that works however they keep asking us to provide examples of the failure. All products shipping currently do not have this issue as far as we know. To support our customers and partners, Cisco will flog Intel and recall all units under warranty or covered by any valid services contract dated as of November 16, 2016, which have Intel Inside and shove them up Intel's ass. Due to the age-based nature of the failure and the crap ton of replacements, not to mention the cost, we will be prioritizing orders based on the products’ time in operation.
Q: When did you become aware of this issue?
Cisco learned about the timed failure and potential customer impacts due to this feature in late November 2016. Cisco and Intel have been working as quickly as possible to hide the impact and scope of the issue, create and test PR releases, and put in place a plan to hide Intel from lawyers without causing undue panic or affecting Intel's stock price.
See also:
https://www.theregister.co.uk/2017/02/06/cisco_intel_decline_to_link_product_warning_to_faulty_chip/
"Intel's Atom C2000 chips are bricking products – and it's not just Cisco hit
Chipzilla and Switchzilla won't confirm connection but the writing is on the wall"
This story could run for a while.
Unlike the affected hardware.