Seems you can have fast or (maybe) secure, not both.
Oh well, I guess it is Intel's turn to laugh.
AMD has started issuing some patches for its processors affected by a serious silicon-level bug dubbed Zenbleed that can be exploited by rogue users and malware to steal passwords, cryptographic keys, and other secrets from software running on a vulnerable system. Zenbleed affects Ryzen and Epyc Zen 2 chips, and can be abused …
It is very useful to have your bank site open and at the same time be looking at your orders on the retailer's website (ugh, I did pay that much), plus the tax advice website (blast, I thought the Customs Exempt threshold was higher) whilst IMing (we said we'd go halfsies but I don't see the transfer from your account) and so on and so forth.
Logging out of the bank every time you want to double check a line in the statement - not going to be a popular move.
It's a feature I have used before for that purpose, but it does absolutely nothing for this vulnerability. If you want to have both windows open, the level of isolation required is a window in another browser running on a different computer. Anything where the same processor is in use is potentially vulnerable to the attack.
... the level of isolation required is a window in another browser running on a different computer.
Which is why it's specifically a cloud vulnerability. The "other computer" could easily be the same one.
You're all looking at this the wrong way. This is a corporate espionage / government hacker level exploit for secure cloud systems, not a tab spying one for s'kiddies.
It could be both. The first documented use of EternalBlue was North Korea failing to do ransomware*. Not too long after that came a Russian government attack on Ukraine and sort of anyone who looked a bit like Ukraine. Others figured out pretty quickly that this vulnerability was nice and started using it as well. If it becomes known, someone will try to use it, and unless it's fiendishly difficult, that includes smaller criminal organizations and individuals.
* Well, they succeeded at infecting a bunch of places, but failed to keep their malware functional by leaving in a kill switch and failed to collect very much money.
Something that can let a script running in one browser tab see the password entered into a bank's website in another tab would definitely not be of medium but of screaming-high severity.
From the article:
And once the code is written, it can be dropped into as many (hijacked) web pages as are willing to pay the fee to get one copy.
There's a lively discussion on Ars about the practicality of exploiting it from JS.
https://arstechnica.com/civis/threads/encryption-breaking-password-leaking-bug-in-many-amd-cpus-could-take-months-to-fix.1494795/post-42056956 in particular seems logical. A sane JS interpreter/JIT is unlikely to emit the machine code required, but will instead optimize it away.
Hopefully, but as with every weird attack vector things get, well, weird.
The key line in the post you reference appears to be
> which the JIT engine should known has just been zeroed and have optimized away.
Hmm, does the JIT engine really track every value to see if it is zero? It makes sense to have an iszero bit in hardware as it works in parallel and can be applied to every value passing through a register. You can track some values during compilation, but add enough indirection and you can get a zero that isn't evident at compile time. Ahhh, something else to add to the "have a look at this in one's copious spare time" list :-)
However, those bent on attacking will be looking at the results of JIT as well and can apply weird methods: they know the sequence of operations they want to come out of the compiler and can fling stupid-looking JS code at it until they get a workable sequence out.
The bottom line: fingers crossed nobody gets it to work in JS but plan for the worst case.
The finder of the vuln and publisher of exploit code never said that web browser was a vector. Someone with insufficient standing made that claim and everyone else gobbled it up.
If it were web browser exploitable, it would probably score a solid 10, and it would be a "patch now or disconnect from the internet" kind of vulnerability.
As El Reg said, they couldn't trigger this via Qemu, consider how much more abstracted a web browser is vs Qemu.
They couldn't run this through Qemu because Qemu specifically implements an abstraction that prevents it. Other software isn't designed for simulating CPUs and wouldn't be doing that. It's rather obvious when you know what Qemu is used for and how it's implemented. It's true that, as far as I know, no code has been written publicly to exploit this from a browser, and you'd probably have to look hard at the output of WASM engines and JS compilers to make it work well, but that is not the same as it not being possible. Web browsers have no hardware abstraction compared to what Qemu does. They don't allow access to some classes of hardware (more if it's Firefox, less if it's Chrome), but the CPU is not one of them.
My question about these exploits is - how do you make sense of whatever happens to be in the registers when you get the chance to read them? How can the exploit code know what the register values that have been put there by some other thread actually mean in the absence of context? Or is this just a case of dumping everything that's recorded as ascii and a person looking for signs of P455w0rD in there?
For that matter, how does the exploit manage to arrange a read of contiguous register values?
I'd assume the miscreants simply harvest as much data as they can and then sift through looking for obvious security items, maybe anything with "BEGIN OPENSSH PRIVATE KEY" or "password=Passw0rd" in an HTTP POST. Gather enough data over enough time, apply some sensible search terms and you've got a chance of getting something useful. It doesn't sound like you can direct it to gather specific data, but hoover up enough and you'll find something. I guess the difficulty in getting "good" data is why it's only rated as a "medium" risk.
You don't have to get lucky every time.
 not *you* you, of course, I'm not trying to suggest that you are interested in any of this for any reason other than to protect yourself.
 on the other hand, your eyes are a bit close together and have a shifty look
I'd guess that the problem is getting the computer to run the exploit code. On shared servers I would definitely not say that this is "medium risk" though. In general: running native code should not allow you to grab information from other processes (obviously). Personally I'd go for "high risk" even if it is hard to exploit e.g. using a web page.
The description of this got me thinking. The write up seems to suggest that this vulnerability affects just the vector registers, and I wondered how the vector registers were actually used.
The vector units in Zen processors are designed to do single operations over multiple data values (Single Instruction Multiple Data or SIMD).
So how many string operations are likely to be processed as vector operations? I'm no expert, but the way that I thought these units were used was mainly for mathematical operations, like array processing. I'm sure that there are some applications where you may apply integer arithmetic operations to strings of characters (things like encryption and compression spring to mind, but I'm sure there are inventive people out there who can think of novel uses). But nowadays many processors feature specific instructions, not involving the vector units, to do these operations, although I suppose that those instructions may still use the vector registers, as they are convenient large registers (256-bit in Zen 2, I believe). That means each can hold a maximum of 32 8-bit bytes: long enough for a password, but probably too short to hold the entirety of a decent-length key or certificate.
So the question in my mind is how often the vector units are actually used to process data that could be easily identified as useful? Although it may be possible to get quite large amounts of data, I'm not sure that it would be that easy to consolidate data from each vector register capture into larger units of data from consecutive areas of memory (I don't think that the vector contains anything indicating the address of the data in memory), and I don't think that the software implementing the capture has huge control over what the vector registers were used for after being freed up.
But user IDs and passwords or pass phrases may be damaging enough.
> The write up seems to suggest that this vulnerability affects just the vector registers, and I wondered how the vector registers were actually used.
> I'm no expert, but the way that I thought these units were used was mainly for mathematical operations, like array processing.
Me too. But, from the github page for the exploit - "The AVX registers are often used for high performance string processing by system libraries. This means that very high volumes of sensitive data pass through them."
Yes. Accelerating string copy or string search is the one scenario where it is really easy to use these weird instructions to accelerate code that was never designed for them. The OS provides an updated version of a language's standard library and applications feel the benefit with no changes to their own code at all.
Like you know, the cryptographic and hash functions you would use on bits and bytes of a password? That at least at one point start with the plaintext as an input? ;-)
This looks like with a little massaging and good timing it could leak all sorts of things you'd hate to see get off box. What a PITA. Interesting to see if the bleeding stops here or if subsequent attacks develop along the lines of the litany of Spectre variants that leaked out over the years following the initial sets.
Guess reading a raft of patch notes is on my todo list for the next couple weeks.
Cryptography computations typically run with special hardware flags to avoid the known pitfalls, including issues with speculative execution (see e.g. ssbd, psfd). These computations are a lot less likely to be affected by this new bug.
I'm not sure how far this extends to login fields, but they could technically be protected too. The only thing to worry about when applying protection is the performance hit, which I doubt would be that big if applied with precision.
SIMD, Single Instruction Multiple Data, is the bread and butter of encryption / decryption.
I have no idea how the vector registers work, or how the exploit works, but if I was looking for data that was being encrypted / decrypted, or signature generation or checking, the vector registers would be the first place I looked.
Well, on a system with larger than 8-bit GP registers, almost all load and save instructions work a word at a time!
In a word-aligned system, it's actually much more difficult to address individual bytes in a word. Some systems have hardware support for this (but it's still slower); in others it involves a load, an AND and (three times out of four on 32-bit, or seven times out of eight on 64-bit architectures) a shift. If there are fused instructions for AND and shift, this can be made slightly more efficient.
So for these types of architecture (almost everything after the 8-bit microprocessors), it is vital for efficiency to access strings as words.
This is hidden on most architectures by the compilers. It's only really people who write assembler or even more hair shirt, machine code who see this level of detail (the exception is when using pointers, where word aligned data structures can be critical on some architectures).
But you're using the general instruction set to do this, not any SIMD extensions.
The example of strlen() was actually in the article. If we were living in the 8-bit ASCII world still, I would accept that this is a scenario. But we're not. UTF-8 or sometimes ISO8859 are the order of the day, and just counting along an array of bytes until you get to a zero byte is no longer enough to work out the character length of a string. It's not that simple any more (although I'm sure a single byte with value 0x00 still represents a NULL in most collating sequences other than EBCDIC).
I'm also not sure about how vector instructions can make conditional decisions on part of a vector register. I very much doubt that it is efficient to treat each byte in a vector register with SIMD instructions. So for both of these reasons I doubt that SIMD is being used for strlen(). Maybe I should look at the source; after all, it's generally available.
I can definitely see that doing 256 bit load/save operations, using the registers but not SIMD arithmetic instructions could be a real timesaver (so strcpy, or better, strncpy), but still UTF-8 makes life a bit more tricky. It is interesting to see whether this is more efficient than the variable length instructions (basically implied loops) that were explicitly added to many instruction sets for this purpose (although I'm not sure x86-64 has them). I'm too out of touch with modern processors.
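For what it's worth, the zero-byte hunt can be done a word at a time without any per-byte conditionals, which is one reason libc strlen implementations do use wide registers. A portable-C sketch of the classic trick (real implementations are assembly/SIMD; the buffer-padding assumption is noted in the comments):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero iff some byte of v is 0x00 (the well-known
 * "haszero" bit trick: borrow into the byte's top bit). */
static int word_has_zero_byte(uint64_t v) {
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}

/* Word-at-a-time strlen. Counts BYTES, which is what most strlen
 * callers want, and works unchanged on UTF-8, since 0x00 never occurs
 * inside a multi-byte sequence. Caveat: the 8-byte inner read assumes
 * the buffer is padded to a multiple of 8; real libraries justify the
 * over-read by never letting it cross a page boundary. */
size_t my_strlen(const char *s) {
    const char *p = s;
    /* Handle the unaligned head one byte at a time. */
    while (((uintptr_t)p & 7) != 0) {
        if (*p == '\0') return (size_t)(p - s);
        p++;
    }
    /* Then scan eight bytes per iteration. */
    for (;;) {
        uint64_t v;
        memcpy(&v, p, sizeof v);
        if (word_has_zero_byte(v)) {
            while (*p) p++;
            return (size_t)(p - s);
        }
        p += 8;
    }
}
```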
Actually, I really ought to know this, but exactly how does the application development toolset cope with building programs to run on same basic architecture but different ISA features? For what I know best, Power and AIX, unless you tell the compiler otherwise, it will create code that will only include the common subset of instructions for the processor family. This counts double for libraries, as otherwise you would need to provide multiple libraries for each set of extended instructions for a particular processor model or generation, and if you generate the code to include, say, some of the Power 10 specific features, the resultant code will not run on the older processors (or will result in processor traps and software emulation of the later instructions).
The only way that I can think that this would work is if there were multiple versions of library routines in a library, and the runtime-loader decided which version to link to at execution time depending on the model of processor it was running on. I don't think you could make this decision during the program flow itself, because otherwise any code would be deciding which variant of code to run as much as running the code itself! Maybe someone can enlighten me here.
> UTF-8 or sometimes ISO8859 are the order of the day, and just counting along an array of bytes until you get to a zero byte is no longer enough to work out the character length of a string
Many (the vast majority?) of calls to str(n)len are looking for the end of a buffer to copy to/from, rather than actually caring about how many characters/codepoints are present.
 certainly in all my UTF8 code (which is all the string & text processing code I use) *all* of the str(n)len calls are for this - my code never actually worries about counting codepoints, leaving that up to the OS display code.
 I cheat - all the special sequences that are sought by the lexer are in the 7-bit ASCII range. Then again, this cheat is used by all the programming languages I'm familiar with...
UTF-8 or sometimes ISO8859 are the order of the day, and just counting along an array of bytes until you get to a zero byte is no longer enough to work out the character length of a string.
It is certainly enough to get the byte length of the string, which is what strlen is mostly used for. There are a vast number of strlen calls taking place in modern OSes. And, in fact, on EBCDIC OSes too, since code point 00 is NUL in EBCDIC as well as in ASCII, and C is quite often used on the remaining EBCDIC systems (z, i, etc).
"The only way that I can think that this would work is if there were multiple versions of library routines in a library, and the runtime-loader decided which version to link to at execution time depending on the model of processor it was running on. I don't think you could make this decision during the program flow itself, because otherwise any code would be deciding which variant of code to run as much as running the code itself! Maybe someone can enlighten me here."
I don't know how common it is, but where I've seen it done, they do exactly this, except that to save time they figure out which versions of the functions they will call at the beginning of execution and cache those. The rest of the program doesn't need to spend any time checking the processor and branching to the different versions, since that work was done once at startup and the result stored in memory, meaning each call can be reduced to a couple of instructions.
> they figure out which versions of the functions they will call at the beginning of execution and cache those
That trick has been in use since the very first set of SIMD extensions were added (MMX to Intel x86 CPUs in 1996).
For example, have a look at https://libjpeg-turbo.org/ a JPEG library that was extended as new extensions came along.
The same basic trick works for the flashy stuff we have nowadays.
You don't even have to drop down to assembler to take advantage: with a suitable compiler, you can just recompile the same C code several times with different processor-specific extensions enabled via cli args, give each variant a different function name (also via cli args), and use a macro generator to spit out the wrapper function (which checks the CPU, selects the appropriate worker function, and caches that choice). Rinse and repeat.
how do you make sense of whatever happens to be in the registers when you get the chance to read them?
You do realize there has been a tremendous amount of published research on this, right? That we've had unauthorized-read vulnerabilities for decades, and people have been turning them into exploits for nearly as long?
Tools for extracting useful secrets from corpora of exfiltrated data are widely available. They use pattern-matching and various heuristics.
There are examples of this for Zenbleed specifically in Ormandy's article. This information is not hard to find.
The way this is explained, I hear it as being the victim process that executes "vzeroupper", similar to undoing "free" and therefore getting a register full of someone else's data. But I can't reconcile this explanation because -
1. It seems to me it would be the aggressor process that would deliberately trigger this bug in a loop in order to peer at the register full of another process's data. Why is this view wrong?
2. Why doesn't this bug cause programs to crash frequently when in normal operation a register unexpectedly contains incorrect data?
What am I missing?
Those are good questions.
Not particularly, since they're answered by Ormandy's article. I don't understand why people take the time to post questions but not to check the primary source.
When vzeroupper is speculatively executed, and then that branch is discarded, the zero-upper bit for that YMM register is cleared. That should restore the previous upper 128 bits of the register; but in some cases that portion of the register file has already been reused by some other thread (because the zero-upper bit was set during spec-ex and the Zen 2 RTT incorrectly considers that committed), and in that case the upper 128 bits that were "restored" to the thread that performed vzeroupper will be the data stored by the other thread.
Hardly dumb, if this stuff wasn't complicated it wouldn't be happening, and it would have been spotted sooner, or before the parts ever shipped.
It looks like the instruction is dispatched speculatively by the attacking process. So the attacking process doesn't actually call it, it stages the instruction in the pipeline and then waits, which causes the speculative instruction to trigger out of what, the branch predictor?
The attack relies on the fact that modern systems are busy and that other processes will seize the core and start doing work when the attacking thread releases. The instruction isn't really safe for speculative execution, so when the attacking process DOESN'T actually trigger the instruction, the CPU tries to roll it back (which it does incorrectly) and leaves the attacking process with access to information that wasn't flushed and that it shouldn't be able to see.
The two key pieces being that it's not actually purging the data in the register at a hardware level (it's using a flag to mark it as deleted), and the logic for handling this doesn't safely restore state when multiple processes access the resource during speculative instructions. So the data stays in the register, the delete flag that would have blocked access is removed, and the attacking process now has access to some of the other process's data.
If I missed the mark a proper low-level coder will deservedly smack me; it was WAY before the Zen core that I last had to wrangle stuff down in bare metal.
You are quite close. The missing piece is that the values in the register file are not supposed to be read before they are written. In fact, "rolling back" the instruction in this context doesn't even mean what you think. The error is in clearing the zero bit! Rolling back in a register file is a matter of repointing the register in question and marking the file entry available. By "available", I mean "something can write to it". The zero bit has no business being touched at that point.
I was never a designer (I was a validator), and the hell of it is that I can understand why each of these decisions was made in isolation. I would like to think that if I'd had a chance to look at the design, I would have noticed this one, but uggh...
> The error is in clearing the zero bit!
> I can understand why each of these decisions was made in isolation
I wish I could!
I've read through the report once so far and I'm clearly missing something, as I can not see why that zero bit needs to be reset - i.e. missing the "in isolation" usefulness of doing so.
AFAIK it is only a speed-up, to avoid *really* writing zeroes into the register; if the speed-up didn't exist then the register really would contain zeroes, just as it would contain any non-zero value. If the rollback required the zero bit to be cleared, to revert it to the way it was, that implies (to me) that any non-zero value would also have to be reverted to the way it was.
So either there is no need to touch anything - nothing needs to be reverted - or there is a shadow copy mechanism that would revert back a non-zero correctly but for some reason is broken/incomplete with respect to the zero bit, causing it to be cleared every time. A broken mechanism doesn't seem like something that would be considered useful "in isolation".
Probably just (!) have to re-read that report a half-dozen more times and it may sink in.
Or I could not bother and just marvel that any of this stuff works as well as it does!
I just hope I can figure out how to apply any of the amelioration methods - like, are microcode patches a one-time thing or do they have to be re-applied on every power up?
I.e. can I boot a patched OS once then go back to the grotty stuff, or must all the OSes be capable of uploading the microcode patch (and contain a copy of it!)? If the latter, then running some of the oddball OSes that Liam Proven tells us of would be a pain.
For that matter:
> If you stick any emulation layer in between, such as Qemu, then the exploit understandably fails.
Presumably that *doesn't* work for Qemu/kvm as then you are running on the real hardware, just a subset of it. You have to force it to, um, actually emulate the x64 architecture whilst running on an x64 - tell it to emulate an Intel device instead? My little head hurts.
Yikes, there is so much about the modern CPU guts I know nothing about! Maybe if I just glue 48 Motorola 6809s together with some SRAM onto Veroboard...
Any amd64 emulation would defeat it, because those implement what an amd64 processor like an Intel or AMD is specified to do - not what they actually do, internally.
So while the emulation will go ahead and use the real hardware in a fairly efficient way, the exploit sees what should happen, not what does happen due to bugs, er, imperfectly hidden implementation details.
There are two points of view: It is both a bug and an exploit.
In “bug” mode, a vzeroupper in your code shouldn’t be executed, but is actually executed by branch misprediction. When this misprediction is fixed, data from any process that happened to write to an xmm register may have overwritten your register. That’s obviously a bug. But it seems this is rare: I have the impression another process must write to a rename register just between the CPU mispredicting a branch around a vzeroupper instruction and fixing the misprediction, so only a handful of cycles.
In “exploit” mode the malware does exactly the same, but intentionally, and actually hopes that its data gets overwritten, because it knows some other process had written that data.
The reason why this doesn’t happen with ordinary registers is that they are protected from being written to while a predicted branch is running, and for some reason this doesn’t happen for vzeroupper.
>> Probably not; as there's no reason here to blame Open Source.
You mean that Open Source that is often claimed to be so much better at security simply because of the many eyes that go over the source code? How did that work out (not just Heartbleed, there have been embarrassing bugs in the Linux kernel which were overlooked for years as well)?
Heartbleed directly leaked all information within OpenSSL, including private keys, through a ping reply. That's a different order of threat than machine-level instruction execution, and much easier to exploit, especially from a distance. Basically anybody in control of the network routing could steal your private key and act as a bonafide server, requiring a new key ceremony and certificate. Furthermore, it would be very hard to determine if the private key was leaked in the first place.
For system admins basically the sky *was* falling, and - serious as this is - this is not on the same scale. Currently you are the only one that tries to sensationalize the news.
> This is just nonsense.
Phew, glad to hear that. Guess the same goes for the other data leaking hardware bugs, eh, so we can go back to running at full speed without any of those pointless ameliorations in place.
On the other hand...
The example shown on the github page looks pretty convincing (but that could all be faked, I guess).
True, you can't see anything too incriminating in that short example (but if they'd captured their own password would they have published that as-is?), but if you let it run and just get lucky, what wonders could you see?
Parsing the extracted data could be an interesting challenge, but anything with a pattern to it...
The malefactors don't have to hit too many jackpots to make the attempt worthwhile. Unless someone wants to invoke some "targeted scenario" where they are after something specific from a special machine within a particular company, but how many of those are there in reality, compared to mass "give it a go, see what we get" approaches (like targeting all of the machines in the company)?
Real-world data always has REDUNDANCY. This will allow heuristic algorithms to "fish out" secret information such as your Bank Account Number, your Banking PIN etc.
PLUS - if it is a crypto key, the attacker can simply brute-force with the collected "key candidates". Testing 1000000 potential keys is done in a few seconds.
In other words, your notion is badly wrong.
The idea of time-sharing a large CPU with lots of state (from registers to caches) might be inherently insecure.
As Gernot Heiser states somewhere in his lectures (can't definitely find it), CPU state must be completely flushed (zero out caches and all types of registers!) if you want to have a secure time-shared CPU. There is no other way to be really sure if you want to provably deny side-channel attacks.
So maybe computers should evolve to the transputer concept of loosely coupled CPU cores ? One process, one CPU, one register set, one cache, no time sharing ?
Data Center and Cloud Computing based on the Time Sharing idea looks very questionable. The short term fix would be a move to many small machines, which can be easily rented out individually. There should be very fast interconnects between them.
Or maybe deinstall all browsers, PDF viewers, email clients and office packages from the Command Computer. Will definitely increase security.
Imagine your company "A" runs an accounting system on a Cloud Computer "CC". Everything very secure, properly patched, good security practices. Then there are companies "B" and "C" which have similar good security. Finally, there is Joe "J", who uses CC for throw-away computing experiments.
J runs an outdated version of Apache, which has exploitable bugs. He does not care, because for him it is an experimental system.
Then comes along "I", who is a criminal hacker who runs scripts to probe all IPv4 addresses for exploitable Apache instances. He finds J's Apache on CC. I can now insert an exploit which contains ZenBleed in order to undermine all of CC, including the VMs of A, B, and C. I can read out ssh keys and get access to the processes of A, B, and C. GAME OVER!
In other words, if you run a cloud system for anything business-critical, you should be very worried now.