Re: no pay for what you use in many cases
[revised my comment a bit]
You can say you are laying claim - as in I expect to have those resources available to me at any given time.
The critical failing is the assumption that the resources will be in use (on a larger scale) for long periods of time (certainly some applications work this way and in that case Amazon may be fine).
In a typical VMware infrastructure (or Hyper-V or anything hosted locally) you have say 100 VMs, and X amount of CPU resources. The assumption is made that the VMs can use X amount of CPU or I/O but they won't all request it simultaneously.
Thus you can significantly "oversubscribe" or over-provision the system with that in mind.
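To put rough numbers on that, here is a minimal sketch of the oversubscription arithmetic. The VM count, vCPU sizes, core count and the 25% "busy at the same time" figure are all made up for illustration and are not taken from any real environment.

```python
# Rough sketch of CPU oversubscription math. All numbers are made up for
# illustration; they are not measurements from any real environment.

def oversubscription_ratio(vcpus_per_vm, physical_cores):
    """Ratio of vCPUs promised to VMs versus physical cores on the host."""
    return sum(vcpus_per_vm) / physical_cores

def peak_demand_cores(vcpus_per_vm, busy_fraction):
    """Cores needed if the VMs only use busy_fraction of their vCPUs at once."""
    return sum(vcpus_per_vm) * busy_fraction

vms = [4] * 20   # hypothetical: 20 VMs with 4 vCPUs each
cores = 32       # hypothetical: 32 physical cores on the host

ratio = oversubscription_ratio(vms, cores)
needed = peak_demand_cores(vms, busy_fraction=0.25)
print(f"promised {sum(vms)} vCPUs on {cores} cores, ratio {ratio:.1f}:1")
print(f"cores needed at 25% simultaneous demand: {needed:.0f}")
```

As long as the cores needed at peak stay under the physical core count, the host can promise more vCPUs than it has cores without the VMs noticing.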
Amazon lacks the technology in their hypervisors to be able to do this properly. One key aspect in allowing this capability in an environment where workloads are not predictable is the ability to migrate workloads between physical servers to re-balance the environment. VMware does this of course with DRS; other hypervisors have similar (though perhaps less sophisticated) capabilities.
Though if your workloads are predictable, as mine are (and have been for many years across multiple companies and dozens of applications), you don't even need something like DRS (though it's nice to have). I look at my VMware servers today and the *physical hardware* sits at under 20% CPU utilization (peaks to maybe 45%). That is with a dozen or two dozen VMs/server (and north of 150 gigabytes of memory/server). The only vMotions in my environment in the past 6 months have been manual load balancing (memory utilization hovering around 85%).
But even if you throw all of that out .. Amazon's VM infrastructure does not handle competing workloads effectively (maybe this has changed significantly in the past year but I doubt it). On Linux at least (other OSs may have similar properties) there is a CPU metric called "% Steal", which basically means the % of CPU time the underlying hypervisor has stolen from the VM to run something else. On Amazon under many circumstances involving high CPU load I have seen this number spike to as high as *30%*, which means you're paying for your CPU but you're only able to get 70% of the effective capacity out of it (at that point in time).
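If you want to check this yourself, the number comes from the aggregate "cpu" line in /proc/stat on Linux, where steal is the eighth counter. Below is a minimal sketch that computes % steal between two snapshots of that line. The function names are my own and the two sample lines are made up, not real measurements.

```python
# Sketch: compute % steal between two snapshots of the first "cpu" line of
# /proc/stat on Linux. Field order (after the "cpu" label) is:
# user nice system idle iowait irq softirq steal guest guest_nice

def parse_cpu_line(line):
    """Return (steal, total) jiffies from an aggregate 'cpu' line."""
    fields = [int(x) for x in line.split()[1:]]
    if len(fields) < 8:
        raise ValueError("this kernel does not report steal time")
    # guest and guest_nice are already counted in user and nice,
    # so only the first eight fields are summed.
    return fields[7], sum(fields[:8])

def steal_percent(before, after):
    """Percentage of CPU time stolen between two 'cpu' lines."""
    s0, t0 = parse_cpu_line(before)
    s1, t1 = parse_cpu_line(after)
    elapsed = t1 - t0
    return 100.0 * (s1 - s0) / elapsed if elapsed > 0 else 0.0

def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line from a live Linux system."""
    with open(path) as f:
        return f.readline()

# Made-up sample lines, not real measurements.
sample_before = "cpu  1000 0 500 8000 100 0 50 350 0 0"
sample_after = "cpu  1500 0 600 8100 100 0 50 650 0 0"
print(f"{steal_percent(sample_before, sample_after):.1f}% steal")
```

On a live box you would call read_cpu_line() twice a few seconds apart and feed both lines to steal_percent(). top and vmstat report the same counter as "st".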
Since we moved out of Amazon, the % Steal CPU time has been 0 - never once has it gone above 0. Well, one exception might be last year with the leap second Linux kernel bug, which caused every VM in my infrastructure to jump to 100% CPU simultaneously (fortunately things held together fine).
Memory utilization is different though -- at least for production workloads. Applications, whether they are active or idle, generally have a fairly stable memory usage profile. Hence the bottleneck on most VM infrastructures is memory rather than CPU or disk I/O. For non-production workloads some people may lean towards allowing VMs to swap more to free up memory -- for me, I don't like swap in any environment unless it is for emergencies. I'd rather have a VM die because it runs out of memory than swap till the cows come home, because swapping will not only essentially render the VM useless by slowing it to a grinding halt but may impact several other VMs in the process. From what I have seen, Amazon VMs by default come with 0 swap configured (likely for the same reason).
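A quick back-of-the-envelope way to see why memory is the limit: the sketch below estimates how many more VMs a host could take before each resource runs out, assuming every new VM costs the current per-VM average. The 45% CPU peak and 85% memory figures are the ones I quoted above; the VM count of 20 and the helper name are just assumptions for the example.

```python
# Sketch: which resource runs out first on a host? The utilization figures
# echo the ones quoted in this comment; the VM count is a hypothetical.

def extra_vms(current_vms, utilization, ceiling=1.0):
    """VMs that still fit if each new VM costs the current per-VM average."""
    per_vm = utilization / current_vms
    return int((ceiling - utilization) / per_vm)

vms_on_host = 20                        # hypothetical VM count
by_cpu = extra_vms(vms_on_host, 0.45)   # sized against the CPU peak
by_mem = extra_vms(vms_on_host, 0.85)   # memory hovering around 85%

bottleneck = "memory" if by_mem < by_cpu else "cpu"
print(f"room for {by_cpu} more VMs by CPU, {by_mem} more by memory")
print(f"bottleneck: {bottleneck}")
```

It is a crude linear model, but it shows the shape of the problem: the host runs out of memory long before it runs out of CPU.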
When operating a public cloud you may not be able to make that sort of assumption about the resource utilization of individual servers. So in that circumstance you either need something like DRS, or you need to be able to give the customer what are essentially physical servers (not directly, but dedicated hardware within which they can provision what they wish) and allow them to manage the workload distribution between them.
You can also control things in VMware (perhaps other hypervisors as well but not Amazon) with resource pools, limiting aggregate CPU/memory utilization for a particular group of systems. In my environment, for example, which runs a dozen test environments alongside production (test envs run in different subnets of course), production gets priority. So if something like that leap second bug comes back the test environments may go nuts but they are only allowed X% of CPU time (I think in my case maybe 30% of cluster capacity). So production can keep going (during that incident our web site slowed down but did not go offline - other larger sites had significant downtime).
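The effect of that cap can be sketched with a toy model. To be clear, this is not VMware's scheduler or its API, just the arithmetic of "production is served first and test gets at most X% of the cluster". The function name and the demand numbers are hypothetical.

```python
# Toy model of capping a low-priority pool. This is NOT VMware's scheduler or
# API; it is just the arithmetic of "test gets at most X% of the cluster".

def allocate(capacity, prod_demand, test_demand, test_cap_percent):
    """Return (prod, test) CPU granted, in the same units as capacity."""
    # Test is never allowed more than its cap, however much it asks for.
    test_ceiling = min(test_demand, capacity * test_cap_percent / 100)
    # Production is served first from the whole cluster.
    prod = min(prod_demand, capacity)
    # Test gets what is left over, up to its ceiling.
    test = min(test_ceiling, capacity - prod)
    return prod, test

# Hypothetical cluster: 100 units of CPU, test pool capped at 30%.
# The test environments go haywire and ask for everything.
prod, test = allocate(capacity=100, prod_demand=60, test_demand=100,
                      test_cap_percent=30)
print(f"production gets {prod}, test gets {test}")
```

Even with the test environments demanding the whole cluster, production still gets everything it asked for.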
Moving out of Amazon pays for itself in less than 1 year in all of the analysis I have done at my companies, with actual cost savings easily half a million to several million in the first year alone.
Hope this helps.
I am happy to talk more about it. If you have further questions, I have a blog where I have covered many of these topics in the past, but I don't link to it here since El Reg doesn't like that.
You can reach me via email at nate (@t) nateamsden (d0t) com if you have further questions and/or wish the link to my site.