
I trust AMZN will be reducing AWS prices pro-rata and sending the bill for the difference to Intel.
Log-sniffing vendor SolarWinds has used its own wares to chronicle the application of Meltdown and Spectre patches on its own Amazon Web Services infrastructure, and the results make for ugly viewing. The image below, for example, depicts the performance of what SolarWinds has described as “a Python worker service tier” on …
Sadly the end result of this is likely to be Amazon spending a lot more money with Intel for new kit and passing the bill on to us.
A bit like the banking crisis: causing a massive screw-up doesn't preclude you from benefiting from it, because you're the only people who understand the mess and can sort it out.
The performance issue is apparently centered on context switching. The initial patches brought performance down, as was forecast.
I wonder what kind of wizardry could alleviate the performance issue while still preserving the security side of the operation?
The problem is that context switching is on the order of 100x slower in a VM than on bare metal (adding microseconds to what used to take nanoseconds).
That is why some workloads virtualize with minimal performance hit (few threads, low concurrency, mostly userspace CPU burn), and some virtualize extremely poorly, taking a huge performance hit even without Meltdown patches (anything highly concurrent, such as compile farms or databases). I have measured the performance hit from virtualization on some such workloads at upwards of 30% - and that was before the Meltdown patches came into play.
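For anyone who wants to put a rough number on that for their own kit, a minimal sketch along these lines will show the gap: a parent and child bounce a byte over a pair of pipes, forcing a context switch on every round trip, and you run it once on bare metal and once inside a VM. The iteration count is an arbitrary choice, and pinning both processes to one core (e.g. with taskset) gives a cleaner figure.

/* Rough context-switch microbenchmark: parent and child ping-pong a byte
 * over two pipes, forcing a switch on every round trip. Run on bare metal
 * and again inside a VM (before and after patching) to compare.
 * For a cleaner number, pin both processes to one core, e.g. with taskset. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

#define ROUNDS 100000   /* arbitrary */

int main(void)
{
    int ping[2], pong[2];
    char byte = 'x';

    if (pipe(ping) || pipe(pong)) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {                       /* child: echo everything back */
        for (int i = 0; i < ROUNDS; i++) {
            if (read(ping[0], &byte, 1) != 1) _exit(1);
            if (write(pong[1], &byte, 1) != 1) _exit(1);
        }
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {    /* parent: one round trip per loop */
        if (write(ping[1], &byte, 1) != 1) { perror("write"); return 1; }
        if (read(pong[0], &byte, 1) != 1) { perror("read"); return 1; }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    wait(NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.0f ns per round trip (~2 context switches + 4 syscalls)\n",
           ns / ROUNDS);
    return 0;
}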
It could be that they had not previously enabled Process-Context Identifiers (PCID), or possibly the virtual machine manager allowed it to be virtualised.
PCID is a relatively recent x86-64 addition. The PCID is a tag on each Translation Lookaside Buffer entry that acts as a filter, saying 'this TLB entry belongs to this process'. The hardware will only use a TLB mapping if its tag matches the current process's tag. This allows the TLB to hold mappings for multiple processes or contexts at once.
Traditionally, on a context switch between processes, the whole TLB had to be flushed, all entries discarded (or marked invalid). That meant that for the initial memory accesses that the process performed, including the instructions to be executed, the hardware would have to walk the page tables to find the mappings from virtual to physical memory, even if it was something that the process had recently accessed the last time any of its threads ran.
With PCID, the OS doesn't have to flush the TLB on a process switch - only if it's reusing a PCID value from a different process. It can selectively flush entries for a process if it's changing that process's address map, using the INVPCID instruction. This would normally happen in response to a page fault exception.
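To make the tagging idea concrete, here is a toy model of a PCID-tagged TLB. It is purely illustrative - the structures and names are made up, and this is not how any real hardware or kernel is coded - but it shows translations for two address spaces coexisting, a 'context switch' that is just a tag change, and a selective flush standing in for what INVPCID does.

/* Toy model of a PCID-tagged TLB (illustrative only). Each cached
 * translation carries the PCID of the address space it belongs to; a
 * lookup only hits if the tag matches the current PCID, so a context
 * switch becomes a tag change rather than a wholesale flush. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 8

struct tlb_entry {
    bool     valid;
    uint16_t pcid;   /* which address space this translation belongs to */
    uint64_t vpage;  /* virtual page number */
    uint64_t ppage;  /* physical page number */
};

static struct tlb_entry tlb[TLB_ENTRIES];
static uint16_t current_pcid;

/* Returns true on a TLB hit for the current address space. */
static bool tlb_lookup(uint64_t vpage, uint64_t *ppage)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].pcid == current_pcid && tlb[i].vpage == vpage) {
            *ppage = tlb[i].ppage;
            return true;
        }
    return false;
}

/* Stand-in for a selective INVPCID-style flush: drop only one address
 * space's entries, leaving everyone else's cached translations intact. */
static void tlb_flush_pcid(uint16_t pcid)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].pcid == pcid)
            tlb[i].valid = false;
}

static void tlb_fill(int slot, uint16_t pcid, uint64_t vpage, uint64_t ppage)
{
    tlb[slot] = (struct tlb_entry){ true, pcid, vpage, ppage };
}

int main(void)
{
    uint64_t ppage;

    tlb_fill(0, 1, 0x10, 0xAA);   /* process A's mapping of page 0x10 */
    tlb_fill(1, 2, 0x10, 0xBB);   /* process B's mapping of the same page */

    current_pcid = 1;             /* "switch" to process A: no flush needed */
    printf("A: %s\n", tlb_lookup(0x10, &ppage) ? "hit" : "miss");

    current_pcid = 2;             /* "switch" to process B: no flush needed */
    printf("B: %s\n", tlb_lookup(0x10, &ppage) ? "hit" : "miss");

    tlb_flush_pcid(2);            /* B's address map changed: flush only B */
    printf("B after selective flush: %s\n", tlb_lookup(0x10, &ppage) ? "hit" : "miss");
    return 0;
}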
You can mark pages as global in the x86 architecture, which means that when you switch to a new process context - register CR3 changed to point to a different set of page tables, causing a TLB flush - the TLB entries for those global pages are retained. Since it's common that the incoming thread was already executing in kernel mode - for many workloads the thread is blocked on a kernel operation, not having been pre-empted in user mode - this saves having to walk the page tables to find the kernel code.
However, we're now putting kernel code into a separate address space altogether, so that the processor can't speculatively load from kernel addresses. That causes an address space switch on every user->kernel and kernel->user transition, which itself causes a TLB flush on older hardware or with PCID disabled. So, if the processor doesn't support PCID or it's turned off, the newly-entered kernel code causes page table walks, and then on return to user mode it has to walk the page tables again.
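On kernels new enough to carry the fix (4.15 and the distro backports), whether you're actually paying for page-table isolation is reported under /sys/devices/system/cpu/vulnerabilities/, so a check is just a matter of reading one file - a plain cat of the same path does the job too. A minimal sketch:

/* Quick check of whether the running kernel has page-table isolation.
 * The sysfs file below exists on kernels that carry the Meltdown fix;
 * on older kernels it simply won't be there. */
#include <stdio.h>

int main(void)
{
    char buf[128];
    FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/meltdown", "r");

    if (!f) {
        puts("no vulnerabilities file - kernel predates the reporting interface");
        return 1;
    }
    if (fgets(buf, sizeof buf, f))
        printf("meltdown: %s", buf);   /* e.g. "Mitigation: PTI" */
    fclose(f);
    return 0;
}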
TL;DR check that your processor supports PCID and the INVPCID instruction, and that it's enabled.
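A quick way to do the support check is to ask CPUID directly: leaf 1, ECX bit 17 advertises PCID, and leaf 7, EBX bit 10 advertises INVPCID. A minimal sketch using the GCC/Clang <cpuid.h> helpers follows; on Linux, grepping /proc/cpuinfo for the pcid and invpcid flags gives the same answer. Note this only tells you the CPU is capable - whether the kernel has actually turned PCID on (CR4.PCIDE) isn't directly visible from user space.

/* Check whether the CPU advertises PCID and INVPCID via CPUID.
 * Assumes a GCC/Clang toolchain on x86-64 recent enough to provide
 * __get_cpuid_count in <cpuid.h>. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 1, ECX bit 17 = PCID support. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        printf("PCID:    %s\n", (ecx & (1u << 17)) ? "yes" : "no");

    /* Leaf 7 (sub-leaf 0), EBX bit 10 = INVPCID support. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        printf("INVPCID: %s\n", (ebx & (1u << 10)) ? "yes" : "no");

    return 0;
}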
Agreed, who actually uses PV these days? HVM has been available for years and clearly you are going to need to be on HVM to get the benefits of PCID - which is the only way not to get clobbered by the fixes.
PV - makes a nice headline for ops software companies but in real life not so much.
Remember those graphs are for the Solarwinds guest - not the hypervisor. Do we have any visibility of what has happened for AMZN/MS/GCP at the hypervisor level? You could say Solarwinds are using inappropriate instances if they aren't hammering the CPU they've bought off the CSP.
If I was AMZN/MS/GCP I'd be aiming for a system-wide run queue approximately equal to the number of cores. I'd be aiming for 100% CPU, with runnable guests waiting one tick to get onto a CPU core. With good tuning of the system tick this would mean effective use of all resources on the hypervisor. I suspect CSPs try to achieve the same thing - full use of the resource - probably through serious levels of overcommit (why haven't we been doing this on-prem?). I suspect the CPU charts for the actual hypervisors in a CSP would 'scare' your typical on-prem sysadmin. ;-)
Is it likely the patching has only changed the CSP 'overcommit' ratios, not the actual hypervisor CPU usage? That is probably 80-100%, 24x7, on all cores.
If the hypervisor was planned for 80-100% 24x7 and they installed a patch that caused everyone's load to increase by 20%, they would be royally fucked!
I know of course the 25% was the vCPU, not the actual CPU, which as you say will be more heavily loaded. The customers want low utilization in their vCPUs so they have spare capacity for when they hit their peaks. The provider needs to look at trends to ensure they have high utilization (otherwise they have idle resources, which is wasted capital), but not too high - otherwise they could get caught with their pants down if something triggers peak loads for all their customers at once: a major news item like a missile being fired at Hawaii, a president being assassinated, a stock market crash, etc.
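The back-of-envelope arithmetic is simple enough to sketch. The ratios below are invented purely for illustration, but they show how a fixed relative overhead on every guest can push an already warm host past 100%:

/* Back-of-envelope overcommit arithmetic (all numbers invented for
 * illustration). Host utilisation ~= overcommit ratio x average guest
 * vCPU utilisation; a fixed relative overhead on every guest pushes an
 * already-hot host over the top. */
#include <stdio.h>

int main(void)
{
    double overcommit     = 4.0;   /* vCPUs sold per physical core (assumed) */
    double guest_util     = 0.22;  /* average vCPU utilisation (assumed) */
    double patch_overhead = 0.20;  /* +20% CPU per guest after patching (assumed) */

    double host_before = overcommit * guest_util;
    double host_after  = host_before * (1.0 + patch_overhead);

    printf("host utilisation before patches: %.0f%%\n", host_before * 100);
    printf("host utilisation after patches:  %.0f%%\n", host_after * 100);
    /* 4 x 22% = 88% before; +20% => ~106%, i.e. guests now queue for CPU. */
    return 0;
}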
More interestingly, what about the power and thermal design requirements across whole data centre sites? Usually providers will have plenty of spare capacity available, but if, say, overall power usage or temperatures increased by more than a quarter across the whole site, this could have serious implications for power or HVAC systems working efficiently, especially if any are close to their rated capacity. My team once added a few extra racks of power-hungry servers (maybe ten kilowatts) without bumping up the power supply to the same floor; we literally blew up the main site distribution board and had to call in electrical experts to deal with it at the substation level while all the other services on site were down waiting for their power to be recovered. Clients weren't prepared to pay for site upgrades till forced.
While the initial hack to protect against meltdown was justified, it's an expensive fix, loading up every system call for the 99.99....9 % of innocent applications.
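That per-syscall cost is easy to see for yourself: time a trivial system call in a tight loop on a patched box and an unpatched one and compare. A minimal sketch - the iteration count is arbitrary, and syscall(SYS_getpid) is used because glibc may cache getpid() itself:

/* Time a trivial system call in a tight loop to see the fixed per-call
 * cost of entering and leaving the kernel. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define CALLS 1000000   /* arbitrary */

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < CALLS; i++)
        syscall(SYS_getpid);              /* cheapest real kernel round trip */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.0f ns per system call\n", ns / CALLS);
    return 0;
}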
The pattern of operations needed to provoke the bug and probe kernel memory is pretty specific: hitting blocks of memory with huge numbers of illegal accesses in order to measure fetch times. This pattern ought to be detectable.
Perhaps a cheaper approach would be to monitor the rate of such accesses - which themselves raise an exception - and quarantine the guilty applications by both a tarpit approach (make illegal accesses themselves expensive) and enabling kpti for them.
This would still allow the attack, but would force it to operate slowly in order to evade detection. So slow, perhaps, that it could no longer read a useful section of kernel memory in the time before it changed enough to make the operation worthless. This appears to be a strategy used against rowhammer (https://lwn.net/Articles/704920/).
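As a userspace toy to illustrate the tarpit idea - a real defence would have to live in the kernel's page-fault path, and the threshold and delay here are invented numbers - you can count the faults and make each one artificially expensive once the rate looks like probing:

/* Toy illustration of the tarpit idea: count faults and add an artificial
 * delay once the rate looks like probing. Userspace sketch only. */
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <time.h>

#define FAULT_THRESHOLD 1000   /* faults before the tarpit kicks in (arbitrary) */

static sigjmp_buf recover;
static volatile sig_atomic_t fault_count;

static void on_segv(int sig)
{
    (void)sig;
    fault_count++;
    siglongjmp(recover, 1);            /* skip over the faulting access */
}

int main(void)
{
    struct sigaction sa;
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGSEGV, &sa, NULL);

    volatile char sink;
    for (int i = 0; i < 2000; i++) {
        if (sigsetjmp(recover, 1) == 0) {
            sink = *(volatile char *)1;             /* deliberate illegal access */
        } else if (fault_count > FAULT_THRESHOLD) {
            struct timespec delay = { 0, 1000000 }; /* 1 ms penalty per fault */
            nanosleep(&delay, NULL);
        }
    }
    printf("handled %d faults\n", (int)fault_count);
    return 0;
}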
Right now, we are not seeing good statistics on just how much damage the patches are causing. The data points are few and seem to be biased towards the worst cases. However, I suspect the effects will range from none/not detectable to eye-popping, but the key is the distribution across the server farms once people figure out how to work around the problems. I would not be surprised by something that resembles a Weibull distribution, or a mirror of a Weibull (an asymmetric cluster with most values at one end and a longish tail).