but surely
their servers have failover? So why can't you patch one, reboot it, and meanwhile the others pick up the load? Then do the next one... etc.
Unless of course they don't really have redundancy or failover...
The Xen project has asked for help to ensure future bugs aren't as disruptive as the XSA-108 flaw that saw major cloud operators reboot an awful lot of servers. XSA-108 emerged in late September and saw the likes of AWS, SoftLayer and Rackspace patch and reboot many servers. Such reboots are just the kind of thing that cloud …
VM migration isn't quite as transparent as it could be, which means that AWS / Rackspace customers still have to plan for potential downtime.
Moreover, you need a sufficient number of patched hosts, otherwise you have to move VMs from host 'A' to vulnerable host 'B' before you can patch 'A'. Then you have to move the VMs from 'B' back to 'A'. This results in annoyed customers, because their machines were moved twice rather than once.
And now you have to do all of this within an announced and (hopefully) short maintenance window, so that your customers can make sure all of their devops guys are on deck for it.
The AWS fix for this was spread out over several days, with several multi-hour maintenance windows. It didn't impact us greatly, but it could have been a lot worse. For some customers, it might well have been.
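The double-move problem described above can be sketched with a toy model. This is only an illustration under assumed rules, not how any provider actually schedules maintenance: hosts are patched one at a time in list order, every host has the same hypothetical `capacity`, and each VM being evacuated prefers an already-patched host and falls back to a still-vulnerable one. The function name `rolling_patch` and the example loads are made up for this sketch.

```python
def rolling_patch(loads, capacity):
    """Return how many times each VM is moved while patching hosts in order.

    loads[i] is the number of VMs initially on host i; capacity is the
    (assumed identical) maximum number of VMs per host.
    """
    # hosts[i] is the list of VM ids currently running on host i.
    hosts, next_id = [], 0
    for n in loads:
        hosts.append(list(range(next_id, next_id + n)))
        next_id += n

    moves = [0] * next_id
    patched = set()
    for src in range(len(hosts)):
        for vm in list(hosts[src]):
            # Any other host with a free slot is a candidate destination.
            others = [h for h in range(len(hosts))
                      if h != src and len(hosts[h]) < capacity]
            if not others:
                raise RuntimeError("no free capacity: VM would have to be shut down")
            # Prefer an already-patched host; fall back to a vulnerable one.
            dst = min(others, key=lambda h: (h not in patched, h))
            hosts[src].remove(vm)
            hosts[dst].append(vm)
            moves[vm] += 1
        patched.add(src)  # host is now empty, so it can be patched and rebooted
    return moves


# One empty spare host up front: every VM moves exactly once.
print(rolling_patch([0, 2, 2], capacity=2))
# No spare host: the first VM lands on a vulnerable host and moves again later.
print(rolling_patch([1, 1, 1], capacity=2))
```

In this model a single empty host is enough to keep every VM at one move, while starting with no patched capacity forces at least one VM to be moved twice.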
I remember about 4 years ago I met with the head of EC2 and their chief scientist (the head of EC2 was the brother of the CEO of the company I worked for at the time). We vented our frustrations with their platform (it wasn't the first time my boss had met with them to say how unhappy he was as a customer). And they acknowledged the problems (but most likely still haven't fixed most of them to this day).
ANYWAY the topic of vMotion came up in the discussion and they said "oh well when you move a VM you get a spike in latency for a few seconds, we don't want our customers to endure that so we don't have it" (I think that was more of a cheap excuse, because they probably weren't technically able to do it due to the outdated and hacked Xen that they had, and probably still have).
As a customer I'd be more than happy to endure some latency during a VM migration if it means avoiding hard downtime of that VM, easy choice to make.
Moved my company out of the EC2 cloud about 3 years ago. I couldn't help but laugh when, about a year afterwards, a developer was doing something (I forget what) and I asked them why, and they said they were doing it in case we had to rebuild a server (not too uncommon in public cloud). I laughed; to this day, 3 years later, we have not had to rebuild a server for any reason.
Lots of apps have single points of failure; not everything is a stateless web server. One of my big arguments against such cloud providers is just that -- most developers don't build for that, even the dozens I have worked with who built their apps from the ground up in cloud environments. It's simply not a priority. It makes sense to me anyway: you have to choose between building features for the users and making the app really robust, and I'd wager more than 95% of orgs choose the former (in all of the "web" startups I've worked at over the past 15 years that number is 100%). Only when the latter becomes really, really painful do most orgs invest in it (I have yet to work for a company that did), because it's quite complicated to get right, and in many cases it is simpler and more cost effective to provide higher levels of availability in the infrastructure than in the app. That usually flips around when you get to really large scale (systems numbering in the thousands at least) -- but most orgs aren't at that scale and never will be.
Most management types see cloud and think of it as a utility that will never go down, some magical thing that just works... There's a ton of work that has to be done to leverage such cloud providers properly, and most folks never get around to doing it. It certainly never was, and still is not, worth the effort for my org or any company I have worked for in the past; the return is simply not there given all the other compromises (both in cost and tech) that must be made to make things work.
There are good reasons for not supporting migration in some cases - doing so means you need shared storage, which typically means decreased performance (unless you put in extremely expensive storage equipment to handle the same number of IOPS as fast local disks).
There's also the issue that some workloads won't tolerate live migration, not just because of latency issues but because they rely on fairly accurate timing, and can then crash - not good as a cloud provider if your customers start complaining after you do some maintenance that their apps have crashed...
"There are good reasons for not supporting migration in some cases - doing so means you need shared storage, which typically means decreased performance (unless you put in extremely expensive storage equipment to handle the same number of IOPS as fast local disks)."
Yeah, unless you're using something like, say, Microsoft Hyper-V, where shared storage is not required. That's one of the things I like most about it, precisely because I don't need another really expensive storage system that's always the cause of downtime and hassle.
Well yes, other hypervisors can do that (often called storage motion), but to do migration on any serious scale you need shared storage. Otherwise the time spent transferring disk images around makes each migration take so long that evacuating one host in order to do maintenance on it becomes unreasonable (and a rolling upgrade of all your hosts in a cloud scenario would take weeks or months).
There's also the fact that as you migrate the I/O performance will drop significantly (as writes have to be mirrored to both current and new host), which may be noticed by users...
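To put rough numbers on the point above, here is a back-of-the-envelope sketch. All figures are assumptions chosen for illustration (20 VMs per host, 500 GB of local disk and 16 GB of RAM per VM, a dedicated 10 Gbit/s migration link); none of them come from this thread or from any provider. The helper `evacuation_hours` is hypothetical, and it ignores re-copying of dirtied memory pages and the write-mirroring slowdown mentioned above, so real times would be longer.

```python
def evacuation_hours(vm_count, gb_per_vm, link_gbps, efficiency=1.0):
    """Hours needed to copy vm_count * gb_per_vm gigabytes over one link.

    link_gbps is the link speed in gigabits per second; efficiency is the
    assumed fraction of that speed actually achieved (1.0 = line rate).
    """
    gigabits = vm_count * gb_per_vm * 8          # gigabytes -> gigabits
    seconds = gigabits / (link_gbps * efficiency)
    return seconds / 3600


# Without shared storage: every VM's disk image has to cross the network.
disk_copy = evacuation_hours(vm_count=20, gb_per_vm=500, link_gbps=10)
# With shared storage: only each VM's memory has to be transferred.
ram_copy = evacuation_hours(vm_count=20, gb_per_vm=16, link_gbps=10)

print(f"copying disks: {disk_copy:.2f} h per host")
print(f"copying RAM only: {ram_copy * 60:.1f} min per host")
```

Under these assumed numbers, evacuating a single host takes hours when the disks must be copied and minutes when only memory moves; multiply the per-host figure by the size of the fleet to see why a rolling upgrade without shared storage stretches out.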
If you're predicating a "cloud" application on the ability for a VM to be transparently migrated from one physical server to another then you are Doing It Wrong.
If you even *considered* EC2 for this kind of application (never mind had a meeting with their top brass) then you are In The Wrong Job.