
Learnt something new today
you can alter cpu firmware from inside an operating system
what could possibly go wrong
AMD will release on Monday new processor microcode to crush an esoteric bug that can be potentially exploited by virtual machine guests to hijack host servers. Machines using AMD Piledriver CPUs, such as the Opteron 6300 family of server chips, and specifically CPU microcode versions 0x6000832 and 0x6000836 – the latest …
It gets loaded at boot time. CPUs have been like this for a veeeery long time.
The advantage is that bugs in the microcode (such as this) can be fixed. If the CPU didn't use microcode and suffered from this bug, the only fix would be a new CPU.
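For anyone wanting to see which microcode revision their own box ended up with after boot, here is a minimal sketch. It assumes a Linux x86 system, where /proc/cpuinfo carries a "microcode" line per logical CPU. The function names are made up for illustration, and the affected set is just the two revisions quoted in the excerpt above; it does not check that the CPU is actually a Piledriver part.

```python
import re

# The two revisions quoted in the article excerpt (assumption: these are the affected ones).
AFFECTED_REVISIONS = {0x6000832, 0x6000836}


def microcode_revisions(cpuinfo_text):
    """Collect every microcode revision found in /proc/cpuinfo-style text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        match = re.match(r"microcode\s*:\s*(0x[0-9a-fA-F]+)", line)
        if match:
            revisions.add(int(match.group(1), 16))
    return revisions


def is_affected(cpuinfo_text):
    """True if any logical CPU reports one of the quoted revisions."""
    return bool(microcode_revisions(cpuinfo_text) & AFFECTED_REVISIONS)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file; returns an empty string where it does not exist."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""
```

Calling `is_affected(read_cpuinfo())` on the machine in question gives a quick yes/no; on non-Linux systems it simply returns False because there is nothing to parse.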
Disadvantages - microcode means having instruction translation units in one's CPU, which require a bunch of transistors, which take power to run. ARMs don't use microcode, which is one of the many architectural features that help them beat x86 CPUs on power consumption.
"And not only the CPU At Linux boot time, there may be several "firmware updates" of various stuff of the machine innards."
It's certainly quite astonishing to see just how many files are lurking in /lib/firmware. Makes one wonder if there's any such thing as a real piece of hardware these days. Seems like "hardware" is now firmware running on some sort of micro / cpu / etc. I guess the best I can hope for is that some of it is FPGA or CPLD images, but even that doesn't feel like a good old fashioned collection of dedicated transistors.
"AFAIK no one has ever successfully tinkered with microcode. It's a security through obscurity thing on a very large scale."
My first reaction reading this was that someone who was able to get the old firmware loaded could then trigger the exploit. But I suppose anyone with that level of access wouldn't need to worry about finding exploits to use.
when u say "AFAIK no one has ever successfully tinkered with microcode." you may be interested to know that the NSA have been <redacted> ever since as long ago as <redacted> and can demonstrate the ability to completely <redacted> your <redacted> using nothing more than <redacted>, <redacted> and a cucumber.
The problem is that a lot of so-called "sysadmins" don't bother to upgrade their systems' firmware - sometimes they learn the hard way, like the GitHub people; sometimes they never learn. So the OS tries to apply a patch.
Once we were handed three almost-new big servers for our lab because "they don't work with the new OS". They just needed a firmware upgrade, as I quickly discovered. When the sysadmin who gave them to us saw them happily running the new OS, he asked for them back, because he had to replace them with smaller ones for budget reasons... the answer was a big "f**********". When he realised he would have to admit that he wasn't able to manage his systems properly and had spent more money because of his ineptitude, he preferred to keep quiet and leave us the servers...
Lovely analysis and, unlike so many tech blog posts including ones right here on El Reg, a useable explanation including the register dumps with an explanation of the suspected logic flow error thereby giving readers more than just "It went boom, now it's fixed" insight.
Well done!
There's something Not Quite Right with the state of IT when an article here thinks it has to explain what a stack is, though.
I don't expect everyone to have done assembly / Forth / PostScript / etc where using a stack is an essential part of the job description, but they're a basic data structure that anyone writing software should know about, aren't they?
Next week: what an 'array' is...
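For readers who don't write code, a stack really is about this small. The sketch below is a generic last-in, first-out stack in Python, not anything taken from the article; the class name and the example values are purely illustrative.

```python
class Stack:
    """A minimal last-in, first-out stack."""

    def __init__(self):
        self._items = []

    def push(self, item):
        """Place an item on top of the stack."""
        self._items.append(item)

    def pop(self):
        """Remove and return the most recently pushed item."""
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

    def __len__(self):
        return len(self._items)


def demo():
    """Push a return address, then a saved register; pop them in reverse order."""
    stack = Stack()
    stack.push("return address")
    stack.push("saved register")
    first_out = stack.pop()   # the saved register comes off first
    second_out = stack.pop()  # then the return address
    return first_out, second_out
```

The point of the ordering is that whatever was pushed last must come off first; pop the wrong item and the program "returns" to somewhere it never came from.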
And you are assuming all of the readers write code?
For me at least the bulk of the article was WAY over my head. I do a bit of scripting here and there, but mostly server, network and storage management for the past 20 years (there is a big bright line between scripting and coding I refuse to cross; knowing the ins and outs of what I know is a hell of a lot already). Sort of reminds me of what I used to browse over in BYTE magazine many years ago.
I am aware of the stack terminology, though my understanding stops right about there. It's just not knowledge that is helpful for what I do.
"...extra pins..."
The ASCC Harvard Mark 1 was about 50 feet long, plenty of room for pins... ...as well as relays, switches, clutches, drive shafts...
But yes, point taken. Of course, as history proves.
Still, it is annoying how today's software generally scours any data file looking for something to execute (a huge oversimplification, but might as well be true). "Ooh look, code! RUN!"
Why on Earth do we need to scan DATA files for malicious executable code?
"Why on Earth do we need to scan DATA files for malicious executable code?"
Partly history, partly convenience (I know you know the answer, I'm just amplifying a bit.)
In the ancient days of slow cpus, little memory and two-digit dates, some CPUs actually required self-modifying code. On the PDP-8, for instance, you put the return address (if I remember correctly) immediately before the start of a subroutine, so if the program is in ROM, no subroutines. Of course in those days we were taught that self modifying code was very bad but sometimes it was the only way to get the job done. (On the PIC, which is a degenerate Harvard architecture, program flexibility can be obtained by putting a new PC value into a register and branching to it, and there is an evil patch of program-writable ROM so you could in unlikely circumstances get a virus of sorts into an embedded microcontroller.)
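To make the PDP-8 point concrete, here is a toy model of its subroutine call as I understand it: the JMS instruction writes the return address into the first word of the subroutine and starts executing at the following word, and the subroutine returns with an indirect jump through that first word. Everything here (class and function names, the octal addresses, the ROM range) is a simplified illustration that ignores paging and memory fields.

```python
class ReadOnlyMemoryError(Exception):
    """Raised when the model tries to write to a ROM address."""


class Memory:
    """Word-addressed memory of 12-bit words with optional read-only ranges."""

    def __init__(self, size, rom_ranges=()):
        self.words = [0] * size
        self.rom_ranges = list(rom_ranges)

    def write(self, addr, value):
        if any(addr in rom for rom in self.rom_ranges):
            raise ReadOnlyMemoryError(f"address {addr:o} is in ROM")
        self.words[addr] = value & 0o7777

    def read(self, addr):
        return self.words[addr]


def jms(memory, pc, target):
    """Model JMS: store the return address at `target`, continue at target + 1.

    `pc` is the address of the JMS instruction itself.
    """
    memory.write(target, pc + 1)
    return target + 1


def return_from(memory, target):
    """Model the indirect jump back: the new PC is whatever was stored at `target`."""
    return memory.read(target)
```

With the subroutine in RAM, `jms(mem, 0o200, 0o400)` continues at 0o401 and `return_from(mem, 0o400)` comes back to 0o201. Put 0o400 inside a ROM range and the very first step raises, which is the "no subroutines in ROM" problem in miniature.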
But the real rot set in with Microsoft's original idea of run everything everywhere. Microsoft programs had links all over the place - some undocumented of course - to enable unlikely things to be done. It provided lots of flexibility but had all the security of a safe made of gallium, resulting in much pain as Windows acquired security.
As an offshoot of this, what happened if you wanted a document with a new document model and didn't want the inconvenience of having to install new code to handle it? In the days before widespread Internet, the answer was obvious: include the necessary extensions to use the new model in the document. And bingo, chocolate bank vault time again.
Now we see the effect of trying to keep Moore's Law going as if it was a law of nature and not an enthusiastic guideline. CPU progress is so rapid that stuff enters the wild with microcode bugs. OK, borrow a trick from mainframes and make the microcode modifiable. If we have to release completely tested CPUs we'll never get anything out of the door ahead of the competition. Heck, cars get released nowadays and the bugs get fixed during servicing.
One good thing about mobile phones is that their CPUs aren't a near monoculture with just a minor digression. We have Apple designs, Qualcomm, Kirin, Mediatek and Exynos, and probably others. If someone comes up with a truly terrible exploitable whoopsie, there's an alternative. So there's hope for the future, of a sort.
Actually, the protected mode of the x86 architecture allows for a clear distinction between code (executable) segments and data segments. You can also have read-only data segments. IIRC, you can have segments that are executable but not readable as "data", let alone writeable.
But every OS preferred to get rid of the more complex (and somewhat slower) segmented model - and not only Windows, but Linux and others as well. All of them preferred simpler compilers and more portable kernels and applications, and so preferred to load the segment registers at the beginning - addressing the whole space - and then forget about them.
Sure, there is a speed penalty when a segment register is loaded, exactly because of the added security. You see here that the later-added NX bit - required exactly because of the lack of "proper" segment usage - spotted the jump into data and stopped it.
Sure, there is some software, i.e. scripting languages, and even p-code ones, that could need to execute "data" - but not all software needs it, and any software requiring it should be handled in a properly sandboxed environment.
I think it's obviously a fairly trivial and straightforward exercise for the Test Dept boffins at AMD to semi-automatically create (even hundreds of MB of) self-test machine code. The sort of code that would exercise their CPU designs or prototypes and would report back such errors. Given how trivial and semi-automatic this exercise should be by now, the 'coverage' of their test code suite should be very nearly 100% by now. And to preempt the too-predictable rebuttal about 'obscure timing of interrupts' etc., the Test Code (and associated hardware) can be left running for weeks x GHz clock speed. Test coverage should be a long string of '9's.
A blatant error such as popping the incorrect item off of a stack is the sort of thing that should be caught quite early and reliably in the process. I can't imagine why it wouldn't be caught, assuming that their approach to testing is as it really should be.
My conclusion is that this is a double failure: 1) a bug, and 2) the failure of the AMD Test Dept to catch it.
I'd almost be more worried about the latter.
I'm unhappy about the latter, but then I consider the recent history of Intel CPU bugs that have been discovered, admittedly more in computational accuracy than basic stack operation, and I wonder about the test process in both cases.
I do remember one of the P4 architects describing how they could no longer mentally anticipate how the CPU was going to behave in some circumstances though, so maybe this kind of thing is now just really really hard, and I don't have a good enough understanding of how one might go about designing a test suite.
xenny - how one might go about designing a test suite.
For Functional (i.e. card edge, as opposed to In-Circuit with probes) testing of logic circuits, you (your tool) build a set of input 'vectors' that can be propagated through the hardware to ensure that every point is exercised (0 & 1) *and* is propagated to the outputs. For a qualification of a CPU design, that would be the first infinitesimal % of the test.
The book 'Fatal Defect' mentions that 'randomized' testing might be better than 'designed' testing, because it avoids making the same erroneous assumptions at Test Design as were made at UUT Design, so the randomized test might stumble across something unforeseen. Obviously it compares the UUT outputs against the 'Requirements', in an automated fashion. I'd say do both.
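A toy version of the designed-versus-randomized argument, for illustration only. The unit under test is an invented 8-bit adder with a planted fault (the carry out of bit 3 is dropped), the 'Requirements' are a plain reference adder, and the directed vectors are a hypothetical set written by someone who shared the designer's blind spot. None of this reflects how any CPU vendor actually tests.

```python
import random


def reference_add8(a, b):
    """The 'Requirements': 8-bit addition with wrap-around."""
    return (a + b) & 0xFF


def buggy_add8(a, b):
    """Invented unit under test: the carry from the low nibble is lost."""
    low = ((a & 0xF) + (b & 0xF)) & 0xF
    high = ((a >> 4) + (b >> 4)) & 0xF
    return (high << 4) | low


def run_vectors(uut, vectors):
    """Return every input pair where the unit under test disagrees with the reference."""
    return [(a, b) for a, b in vectors if uut(a, b) != reference_add8(a, b)]


def random_vectors(count, seed=0):
    """Generate reproducible random input pairs."""
    rng = random.Random(seed)
    return [(rng.randrange(256), rng.randrange(256)) for _ in range(count)]


# Hypothetical directed vectors; none of them happens to carry across the nibble boundary.
DIRECTED_VECTORS = [(0, 0), (1, 1), (0xF0, 0x0F), (0x10, 0x20)]
```

Here `run_vectors(buggy_add8, DIRECTED_VECTORS)` comes back empty, while a thousand random pairs turn up plenty of mismatches. The planted fault is deliberately common, though; a fault that needs a rare coincidence of inputs would be far less likely to be stumbled upon.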
Testing of hardware triggered interrupts would require specialized stimulation hardware to endlessly walk the timing of the interrupt back and forth relative to the clock phase, in tiny fractions of a picosecond (fun design that; perhaps mechanical!). This should be SOP due to the fundamentals of Setup and Hold timings for such asynchronous inputs; it must be checked.
Since the interrupts are using the stack, then that area must be fully explored. As it's automated and GHz clock, that might need maybe an hour of run time.
As an example, when one is testing memory, one doesn't test every possible combination of bits due to 'age of Universe' etc. (something I had to work out once, when asked how long it would take). One tests with Walking Ones, Walking Zeroes, etc. etc. Minutes, not 'age of Universe'.
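A sketch of what Walking Ones and Walking Zeroes look like, run against a simulated memory with one invented stuck-at-0 bit. The class, the fault model and the sizes are assumptions for illustration; a real memory test would also go after address-line and coupling faults, which this per-cell loop does not.

```python
class FaultyMemory:
    """Simulated RAM with an optional single bit stuck at 0."""

    def __init__(self, size, width=8, stuck_addr=None, stuck_bit=None):
        self.mask = (1 << width) - 1
        self.cells = [0] * size
        self.stuck_addr = stuck_addr
        self.stuck_bit = stuck_bit

    def write(self, addr, value):
        value &= self.mask
        if addr == self.stuck_addr and self.stuck_bit is not None:
            value &= ~(1 << self.stuck_bit)  # the faulty bit can never be set
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]


def walking_patterns(width):
    """Walking ones (a single 1 bit) followed by walking zeroes (a single 0 bit)."""
    mask = (1 << width) - 1
    ones = [1 << bit for bit in range(width)]
    zeroes = [mask ^ pattern for pattern in ones]
    return ones + zeroes


def walking_test(memory, size, width=8):
    """Write and read back each pattern in each cell; return the mismatches."""
    failures = []
    for addr in range(size):
        for pattern in walking_patterns(width):
            memory.write(addr, pattern)
            got = memory.read(addr)
            if got != pattern:
                failures.append((addr, pattern, got))
    return failures
```

The cost is 2 x width write/read pairs per cell, so it grows linearly with memory size, whereas trying every combination of bits across a memory of size x width bits means 2^(size x width) states. That is the difference between minutes and 'age of Universe'.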
As mentioned above, the test software executing on the UUT CPU should be massive by now (the year of our Lord, 2016). There's no reason it shouldn't be hundreds of MB of encoded tests, covering all the boundary conditions and lessons learned. They've been doing this for a while.
In the 1980s I was taught how to develop microcode for a processor built by Norsk Data. It was hard. There were different objects within the CPU addressed by different fields within a very long instruction word. These objects had to be kept working together consistently, and with regard to their timing needs.
It makes me wonder if things have evolved since then; whether perhaps one can do a software emulation of microcode; and whether such an emulation could be more rigorously tested.
"I think it's obviously a fairly trivial and straightforward exercise for the Test Dept boffins at AMD to semi-automatically create..."
This isn't a bug that's caused by "Execute instruction X followed by instruction Y and get the wrong results".
It's a very precise timing bug between an NMI and the processor being in a certain state. These corner cases are *very* hard to find and reproduce.
As several other commentards have noted, CPUs nowadays are *very* complex and testing every state is almost impossible.
A Non e-mouse... ...timing bug between an NMI and the processor being in a certain state...
Again!, "...And to preempt the too-predictable rebuttal about 'obscure timing of interrupts' etc., the Test Code (and associated hardware) can be left running for weeks x GHz clock speed. Test coverage should be a long string of '9's."
I shouldn't have to explain this. The interrupts' timing can be (should be) walked back and forth. We did exactly this only weeks ago, admittedly at a coarser time scale appropriate to our purposes. This is Test 101, really basic.
Re: "...And to preempt the too-predictable rebuttal about 'obscure timing of interrupts' etc., the Test Code (and associated hardware) can be left running for weeks x GHz clock speed. Test coverage should be a long string of '9's. [...] I shouldn't have to explain this. The interrupts' timing can be (should be) walked back and forth."
You're assuming that AMD (and the other CPU vendors) don't already do this - they do. Specifically random code running on a huge number of systems for weeks as well as all through the design process. And also directed tests where "the interrupt timing is walked backward and forward".
But even with this it's not possible to cover every possible bug. While I've not seen the full details of this specific bug, the article contains this hint from the VMware/ESXi bug report: "Under a highly specific and detailed set of internal timing conditions, the AMD Opteron Series 63xx processor may read an internal branch status register while the register is being updated, resulting in an incorrect RIP".
So this is more complicated than just walking the NMI timing -- it only fails if the timing also hits "while the BSR is being updated", so you need other specific unlucky event(s) as well, and possibly requires other specifics such as a particular set of cache hits/misses or VM state to trigger the failing case. Put another way, the fact that Piledriver has been shipping for years with this bug only found now means that "running prototypes for weeks" does not cover everything, because there has been an enormous amount of random customer code running for a lot more than "weeks" on a lot more than prototypes before this bug was found.
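A toy model of why walking one variable is not enough. Everything in it is invented for illustration: the bug fires only when the NMI lands on the same cycle as the register update and some other piece of state (called cache_state here) has one particular value. It says nothing about the real Piledriver internals.

```python
from itertools import product


def bug_fires(nmi_cycle, update_cycle, cache_state, bad_cache_state=3):
    """Invented failure condition: two timings coincide AND another state matches."""
    return nmi_cycle == update_cycle and cache_state == bad_cache_state


def sweep_nmi_only(window, update_cycle, cache_state):
    """Walk the NMI across every cycle in the window with everything else held fixed."""
    return [cycle for cycle in range(window)
            if bug_fires(cycle, update_cycle, cache_state)]


def sweep_everything(window, cache_states):
    """Enumerate every combination of NMI cycle, update cycle and cache state."""
    return [combo
            for combo in product(range(window), range(window), range(cache_states))
            if bug_fires(*combo)]
```

With a window of 100 cycles and 8 cache states, walking the NMI alone finds nothing unless the test workload happens to sit in the one bad cache state, and only 100 of the 80,000 combinations fail at all. Each extra condition a real bug depends on multiplies the space again, which is how something can survive both directed timing sweeps and years of customer workloads.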
Well, it works like this: You have good QA, the process works, you have good people and very few recalls. Then the bean-counters arrive. They quickly conclude that since there are very few recalls and other disasters, there is not an urgent need for QA so the budget is cut ... and cut ... and cut. Then there is a major recall and scores of consultants come in to set up a QA system ....
"They quickly conclude that since there are very few recalls and other disasters, there is not an urgent need for QA so the budget is cut ."
Good job this kind of management behaviour is completely unacceptable in highly regulated safety critical industries such as avionics and automotive (can't speak for medical and other potential candidates).
Now, where's my flying pig.
:(
Recently I posted that a "firewall" running on the same host as other VMs wasn't really a firewall. A number of people were kind enough to upvote me: I got at least one response saying that what I described couldn't happen. One was from someone who claimed to have been involved in writing hypervisors.
I hope that guy is reading this article.
I am reading it :)
You got correctly downvoted then and there. Let me explain why (as someone who has _WRITTEN_ hypervisor software in use for virtual routing and firewalls).
This is no different from any firmware or CPU bug. You can break out of protected mode, exploit buggy network card firmware, etc. If anything, virtualization, when used correctly provides an _ADDITIONAL_ layer of protection.
By the way, from that perspective, in the specific cases of virtual routing and firewalls you are better off considering forms of virtualization which use as little as possible in terms of hardware accel features. Sure, you pay in terms of absolute performance. You get it back in terms of maintainability and security. If you do it _THAT_ way, your virtual firewall is actually more secure than one running on bare metal as you have one more layer of "defense in depth". It is more maintainable too. That is also exactly the use case I would advocate for (and what I used to do for a living). I would also not go schadenfreude-ing on every single firmware bug as the reason to invalidate the whole concept.
This is no different from the argument which Cisco tried to mandate to all of its indoctrinates ~ 10 years ago that they answer that PIX is more secure than firewalls which use combined kernelspace + userspace mode because it runs everything privileged in a monolithic system. That as we all know is bollocks. Sure you get a bug from splitting things once in a while - that is still better than doing everything in one blob.
By the way, looking at the bug, it offers a specific exploitation route in kvm. That does not mean that it does not have an exploitation route outside virtualization domain. There is a gazillion ways to trigger an NMI on a NUMA system. In fact, I have some userspace, unprivileged code lying around somewhere which will kill any older (and probably newer) 2+ socket Xeon running Linux within 15 seconds by hard fault through NMI storm. It is not that difficult.
>consider forms of virtualization which use as little as possible in terms of hardware accel features
Not really related but I know the people working on fq_codel didn't have a lot of nice things to say about NIC offloads and what they did to latency in the name of throughput.
The other ingredient in this saga is virtualization: the OpenSUSE build server was compiling GDB and testing it in a QEMU-KVM virtual machine. That means an unprivileged user in a guest virtual machine merely building software was able to trigger an "oops" in the host server's kernel. That's not good.
Is there an rapid escalation and elevation and expansion of the facility/utility/vulnerability/call it whatever you will, whenever an underprivileged user, not merely just building virtual kernel software with SsecuredDwares, realises the triggering potential and APT ACT portal is always present and a critical element/component in Intel designs?
Is the logical solution, to mitigate and cover the risk whenever the problem is inherent and core value, to raise said underprivileged user privileges to build ….. well, to be more than just sure of future security in all manner of practically real and virtual machine operating systems, both guests and safer hardened kernels?
Or would a real world fear of remote transfer of virtual command and control to unknown forces and anonymous sources cause an almighty all systems crash?
Wow ….. that be so much more than just a right bugger of a bug in systems to debug, methinks. Wherever would one start? And that make it a very valuable, fortunate weapon too, methinks. Do you also think it so?
Here's somebody else who realises the problems ahead ... http://www.zerohedge.com/news/2016-03-06/were-eye-storm-rothschild-fears-daunting-litany-problems-ahead
Hey man, can I have some of your microcode? I promise not to inject it.
Agree, it reminds me of when I first encountered the Intel 86/88/186 bug where contrary to the documentation, interrupts were enabled after the stack segment register is loaded. Get an interrupt at this point and the SS:SP aren't pointing to a valid stack...
I seem to remember spending the best part of three weeks getting to the bottom of that one, particularly as it only seemed to happen when the CPU was running at full speed and not when running in ICE mode and hence had to rely on first generation CPU monitors/analyzers to get any real idea of what was happening inside the CPU.
Ooops. Off by one there. Oh well.
Mind you I should probably deduct at least five marks for the video not working. All I get when clicking Play is an instant Error #2035.
I'm guessing it's this hour-long video, or one of its close relatives:
https://www.youtube.com/watch?v=eDmv0sDB1Ak
I'm also guessing based on https://support.mozilla.org/en-US/questions/987691 that some debugging is required for my combination of Firefox and AdBlock. So maybe the five point deduction isn't needed. But even with Adblock globally disabled it still doesn't play, and the Register experience becomes really quite unpleasant.
Still, finding out why it doesn't work can't be as hard as debugging a CPU hardware design hiccup can it.