Single Point of Failure
It'll get'cha every time!
AWS has delivered the one message that will strike fear into the hearts of techies working through the day before the Thanksgiving holidays: the US-EAST-1 region is suffering a "severely impaired" service. At fault is the Kinesis Data Streams API in that, er, minor part of the AWS empire. The failure is also impacting a number of other …
Root cause:
At 9:39 AM PST, we were able to confirm a root cause, and it turned out this wasn’t driven by memory pressure. Rather, the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration. As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters.
https://aws.amazon.com/message/11201/
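For anyone curious where that kind of ceiling lives on their own kit, here is a minimal Python sketch, assuming a generic Linux host (the paths and limits below are standard Linux knobs, not anything specific to Amazon's front-end fleet). If a server's thread count grows with the size of the fleet it talks to, adding capacity marches every box toward whichever of these limits it hits first.

import resource

def read_int(path):
    # Read a single integer from a /proc sysctl file.
    with open(path) as f:
        return int(f.read().split()[0])

# System-wide ceiling on the total number of threads (kernel.threads-max).
threads_max = read_int("/proc/sys/kernel/threads-max")

# Largest PID/TID the kernel will allocate; this also bounds thread count.
pid_max = read_int("/proc/sys/kernel/pid_max")

# Per-user limit on processes/threads (ulimit -u, i.e. RLIMIT_NPROC), a
# common culprit when a service's thread count scales with fleet size.
soft_nproc, hard_nproc = resource.getrlimit(resource.RLIMIT_NPROC)

print(f"kernel.threads-max = {threads_max}")
print(f"kernel.pid_max     = {pid_max}")
print(f"RLIMIT_NPROC       = soft {soft_nproc}, hard {hard_nproc}")

Whichever of those a process trips over, the symptom looks much like the post-mortem describes: thread creation fails, the work those threads were meant to do (here, building the shard-map cache) never completes, and the server limps on in a broken state rather than crashing outright.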
I guess that's all I had to say. I moved the org I work for out of their cloud about 9 years ago now, saving roughly $1M/year in the process. Some internal folks over the years have tried to push to go back to a public cloud because it's so trendy, but they could never make the cost numbers come close to making it worthwhile, so nothing has happened.
What a load of tosh. So your company was all in on public cloud 9 years ago, were they? Must have been a real trailblazer to have been spending enough to ‘save’ that much money by moving off. And what exactly were they using that they managed to save $1M/year by rebuilding the whole lot on-prem? And I guess you’ve not had to refresh/repair/service any of that hardware 2-3 times in the last 9 years?
Regardless of what you think about cloud, if your bean counters can’t make the numbers stack up then you need some new bean counters. There are reasons not to go to cloud, but your post is complete fiction.
Actually, we just retired some of our earliest hardware about a year ago: a bunch of DL385 G7s, an old 3PAR F200, and some QLogic fibre channel switches. I have Extreme 1 and 10 gig switches that are still running from their first start date of Dec 2011 (they don't go EOL until 2022). HP was willing to continue supporting the G7s for another year as well; I just didn't have a need to keep them around anymore. The F200 went end of life maybe 2016 (it has been on 3rd-party support since).
Retired a pair of Citrix NetScalers maybe 3 years ago now that were EOL; the current NetScalers go EOL in 2024 (bought in 2015), and I don't see a need to do anything with them until that time. Also retired some VPN and firewall appliances over the past 2-3 years as they went EOL.
I expect to need major hardware refreshes starting in 2022 and finishing in 2024; most gear that will get refreshed will have been running for at least 8 years at that point. I have no pain points for performance or capacity anywhere. The slowdown of "Moore's law" has dramatically extended the useful life of most equipment, as the advances have been far less impressive these past 5-8 years than they were in the previous decade.
I don't even need one full hand to count the number of unexpected server failures in the past 4 years. Just runs so well, it's great.
As a reference point, we run around 850 VMs of various sizes, and probably 300-400 containers now too, many of which are on bare-metal hardware. Don't need hypervisor overhead for bulk container deployment.
The cost savings are nothing new; I've been talking about this myself for about 11 years now, since I was first exposed to the possibility of public cloud. The last company I was at was spending upwards of $300k/mo on public cloud. I could have built something that could handle their workloads for under $1M. But they weren't interested, so I moved on, and they eventually went out of business.
Your tone is a little supercilious, sorry, but:
> most gear that will get refreshed will have been running for at least 8 years at that point
I'm not at all in the end of the business that needs massive data centres, so I'm not really qualified to comment, other than saying that "everything in (someone else's) cloud" has always seemed wrong to me, but I think the above is a key point.
From the dawn of "commodity" hardware (i.e. since running your business on a minicomputer stopped being the done thing) until about ten or twelve years ago, a 24 to 36 month refresh cycle for most kinds of commodity IT kit would bring huge step-improvements each time for just about any common workload. Faster processing, faster storage, faster networking, lower power consumption, smaller physical size, lower purchase cost and all this at a time when the amount of "data" handled by typical businesses was growing exponentially.
At that point, one big selling point of renting other people's infrastructure was that you could benefit from this refresh cycle without doing anything more than paying the bills. The downsides - primarily loss of control - didn't seem so bad.
These days? I don't know for sure, but I get the feeling that the amount of data handled by typical businesses is still growing, but more slowly, while the service life of commodity hardware components has increased and speeds - of processing, networking, storage - remain "good enough" for much longer. Crumbs, it's twelve years (I think) since I bought my first gigabit switch at home and I am not desperately saving up for 10Gbit kit (though I am considering adding a second network connection to the NAS). At my place of work, gigabit has only become available at the desktop within the last 18 months, interconnects are only now being upgraded to 10Gbit (requires fibre replacement), and most desktops are actually still connected at 100Mbit and boot from SATA-connected spinning rust. It's "good enough" for Outlook, Teams, Word, Excel, Powerpoint, web, some minor photo editing etc.
Thus one of the big selling points of cloud has disappeared, and the main remaining attraction is that when something goes wrong it is "someone else's problem". It's a different equation.
At this point it's usually worth considering the comparison between cloud services and organisations which outsource building maintenance and cleaning, or even making a comparison with the Private Finance Initiative for public projects. In many cases it has been proven that the projected savings are not realised in practice, and the inconvenience is often a major factor - consider a large school which no longer has a "caretaker" on premises during open hours, for example. Some child breaks a tap in the chemistry lab and there's water spraying everywhere? Fred the caretaker would be there in five minutes with a spanner, maybe even a spare tap. Mitie on a four-hour call-out probably arrive after the end of the school day and have sent an electrician to do a plumbing job because at least they meet their four-hour commitment.
M.
> The last company I was at was spending upwards of $300k/mo on public cloud. I could have built something that could handle their workloads for under $1M.
Does that include staffing your solution? And was their cloud solution cost-optimised, or could you have got that cost down as well?
Replace your bean counters until the numbers stack up...... What a telling suggestion.
There are reasons people want to move to public cloud, some rational, some irrational (I've heard a few about strategy, but they are mainly about wanting opex or being "on trend"), but if you do it to save money, you'll likely be disappointed, or have to spend a lot of money to refactor all your apps in the first place.
I get that startups love public cloud - they don't have the finances to build infrastructures to deliver what they hope they might need, but if they get large enough, and need to save money, they'll likely do a "Dropbox".
I like the idea of hybrid cloud (without getting yourself locked in to any particular public cloud) - otherwise you are basically opening your chequebook whilst handcuffing yourself to your chosen cloud vendor. Who in their right mind would do that (other than someone who didn't know what they were doing)?
Public cloud is seldom cheaper (which I agree doesn't stop people moving there), but I do wish more people could at least count and understand what they were signing up for, and if they don't, perhaps they should stop suggesting people fire their bean counters for being able to count / being honest (directly aimed at the AC comment).
Often bean counters' hands are tied by company quarterly results and other rules that mean OPEX is easier to sanction and handle than CAPEX, which can take time to realize its potential.
I've seen opportunities for cost savings and better technical environments to be overlooked because funds are in "the wrong budget".
I did read it and wonder too.
However, our experience is that you do get better real performance and save costs with on premises.
We too were pretty much fully cloudy in 2012 and reversed a lot of that.
I agree about the slowdown in CPU speed/power saving. So the rush to upgrade is smaller.
For most larger SMEs, a mirror DC can be a closet-sized rack with 1U servers. With the growing ubiquity of 1G and 10G FTTP, the mirror site does not need to cost a fortune. There is generally a lot of space in existing server rooms as things have shrunk. I just asked around a few other CEOs I knew and we agreed a mutual zero-cost deal.
With a bit of imagination it is all doable.
The trick is to keep the upgrade cycle rolling slowly - no Big Bang, must-spend-a-gazillion-in-CAPEX-this-year stuff. That is what sets off Bean Counter Central... I’m lucky I’m the majority shareholder...
We would not even look at a service unless it was in three separate DCs.
I know that Amazon's regions & AZs don't map directly to Google's DCs, but if you haven't learned by now, US-East alone cannot provide five nines. You MUST go multi-region at least for that. Last I heard, AWS charges the big bucks for inter-regional data flows. Enough that it is probably worthwhile to look into the competition.
BTW, I'm available if you need detailed instruction/help to make it happen... ;)
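To put a rough number on the "must go multi-region" point, here is a back-of-the-envelope Python sketch. The 99.9% per-region figure is purely illustrative (not an AWS SLA), and it assumes regional failures are independent with instant failover, which this very outage suggests is optimistic, since plenty of services carry hidden dependencies on a single region.

import math

# Illustrative per-region availability; not an SLA or a measured figure.
per_region = 0.999

for regions in (1, 2, 3):
    # The service is down only if every region is down at the same time
    # (assumes independence and instant failover, both optimistic).
    overall = 1 - (1 - per_region) ** regions
    nines = -math.log10(1 - overall)
    print(f"{regions} region(s): {overall:.9f} available (~{nines:.0f} nines)")

The independence assumption is the part that breaks first in practice (shared control planes, shared deployment pipelines, the same humans pushing the same change everywhere), but the arithmetic makes the basic point: a single region sitting at roughly three nines can never reach five on its own.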
Having seen what is required to reliably provide a legally mandated five-9s service, I believe that no cloud, even multi-region, multi-cloud I've looked at comes close. I've said for years that AWS delivers about 99.7% uptime, year over year. A service that needs five nines, meaning only 5 minutes of downtime (planned or not) per year? That takes a very different mindset about processes like change management, redundancy, etc.
Smaller services are more likely to deliver higher reported uptimes because there are fewer moving parts that can cause widespread failures, but *any* issue impacting the overall service impacts your availability numbers. I don't give credit for "planned" service downtime.
The rule of thumb I still teach is every nine after the first one adds a zero to your cost estimate.
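The arithmetic behind those budgets is simple enough to sketch in Python; the cost column below is just that rule of thumb applied mechanically, not a measured number.

MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines in range(1, 7):
    unavailability = 10 ** -nines               # e.g. five nines -> 0.00001
    downtime_minutes = MINUTES_PER_YEAR * unavailability
    cost_multiplier = 10 ** (nines - 1)         # "add a zero per extra nine"
    print(f"{nines} nines: {1 - unavailability:.6f} available, "
          f"{downtime_minutes:8.2f} min/year downtime, "
          f"~{cost_multiplier}x baseline cost")

Five nines works out to a little over five minutes a year, which is why refusing credit for "planned" downtime matters: one routine maintenance window can blow the entire annual budget.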
Some day, I'd love a discussion/briefing on what it takes to deliver the six-nines service level.
On the inside, Google's distributed configuration data storage service actually delivered so much reliability that they would deliberately take it down for one minute at the end of each quarter. This was because the system was NOT specced at 100% uptime, but they were concerned that if it never went down, people would come to expect that.
Those deliberate outages were the only ones that it had for the sixteen months I was there.
I've not compared GCP's public promises against what is/was going on with Google's internal systems. In the four years since, the reported outages suggest that they are still having some difficulties driving the software engineering principles required for success.
I've heard that "10x a nine" rubric before. Not having personally driven a service from two nines to five, I cannot say with certainty, but there is clearly something like that involved. With two nines and a tight ship, you can get by with planned outages. To actually deliver three nines, you need no planned outages, 24/7 on-call, and DCs that are isolated geographically and electrically. To deliver four nines, you require automated rollbacks, switchovers, and scaling, and on-call people awake at all times, probably with "cover me--I'm going to use the facilities". Yeah, five nines requires all of that fully matured, plus full redundancy in your inter-DC communications.
And since SRE hasn't been around for a decade yet, six nines is a theoretical thing.
Contact iRobot Customer Care
An Amazon AWS outage is currently impacting our iRobot Home App. Please know that our team is aware and monitoring the situation and hope to get the App back online soon. Thank you for your understanding and patience.
Was about to factory reset my Roomba when I came upon this note. Had to run the vacuum cleaner the old-fashioned way!