* Posts by Nate Amsden

2356 publicly visible posts • joined 19 Jun 2007

This typo sparked a Microsoft Azure outage

Nate Amsden Silver badge

one of the many reasons I hate cloud

is they never stop fucking with it. I don't want stuff changing constantly for no reason.

Same goes for SaaS services that feel the need to update their UIs and force the changes on customers (vs on prem, where you can opt to delay any such upgrades until you are ready for them). One exception to that in my world was Dynect, which maintained (as far as I could tell) an identical user experience from when I first started using it in about 2009 until I migrated off late last year. Oracle acquired them and folded the technology into their general cloud DNS offering years ago, cut the price by 98%, and I assume shut the original Dyn infrastructure off in the past couple of months if they stuck to their schedule. I haven't had the pleasure of dealing with IaaS in over a decade, my on prem stuff hums along perfectly, and I have been successful in defending against folks who wanted to bring cloud back again and again over the last decade (when they see the costs they have always given up, since they don't have unlimited money).

Broadcom says Nvidia Spectrum-X's 'lossless Ethernet' isn't new

Nate Amsden Silver badge

kind of wonder how it is different

from this lossless ethernet that was pitched more than a decade ago as part of Fibre Channel over Ethernet

https://en.wikipedia.org/wiki/Data_center_bridging

"DCB aims, for selected traffic, to eliminate loss due to queue overflow (sometimes called lossless Ethernet) and to be able to allocate bandwidth on links. Essentially, DCB enables, to some extent, the treatment of different priorities as if they were different pipes. To meet these goals new standards are being (or have been) developed that either extend the existing set of Ethernet protocols or emulate the connectivity offered by Ethernet protocols."

Unlike what's mentioned in the article, DCB seems to be available from many vendors (not that I have used it, or have had a need for it myself; regular ol' dedicated fibre channel is still fine for me).

Side note: if they are going to be developing semi-unique (proprietary?) ethernet setups, I wonder if there is any benefit in increasing the frame size. I was using 9k jumbo frames for some things back on 1G links 15 years ago (and on 10G links still today), and these new AI servers have links upwards of 100G (perhaps more), so it seems like things could benefit from larger frame sizes. Though from what I've read just now, ~9k really is still the max that is available.
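
To put rough numbers on why frame size matters at those speeds, here's a back-of-envelope sketch (my assumptions: standard Ethernet header/FCS plus preamble and inter-frame gap, no VLAN tags or encapsulation, and line-rate traffic of a single frame size):

    # Rough packets-per-second math for different MTUs at various link speeds.
    # Assumes 18 bytes of L2 header+FCS plus 20 bytes preamble/inter-frame gap
    # per frame; real traffic mixes frame sizes, so treat these as upper bounds.

    def max_frames_per_sec(link_gbps, mtu_bytes):
        overhead = 18 + 20              # Ethernet header/FCS + preamble/IFG
        wire_bytes = mtu_bytes + overhead
        return (link_gbps * 1e9) / (wire_bytes * 8)

    for link in (1, 10, 100):
        for mtu in (1500, 9000):
            print(f"{link:>3}G, MTU {mtu}: ~{max_frames_per_sec(link, mtu) / 1e6:.2f} Mpps")

    # At 100G that's roughly 8.1 Mpps at MTU 1500 vs 1.4 Mpps at MTU 9000 that
    # every hop has to process, which is why bigger frames keep coming up.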

MariaDB CEO: People who want things free also want to have very nice vacations

Nate Amsden Silver badge

losing battle

Going up against the big clouds, on their own platforms with something as "basic" as MySQL or Postgres based services is very likely to fail. Maybe they can get some customers here and there, but end of the day they don't have control over the costs of the underlying platform or capabilities, so are at a severe disadvantage.

Been using MariaDB at the orgs I've been at for several years now (everything is "on prem"); we switched from Percona after they increased their support fees something like 800-1200% overnight (and we rarely filed support tickets). I'm not the DBA and didn't drive the transition, but it seems to work fine, not many issues, though we're not signed up for any formal support.

If you don't brush and floss, you're gonna get an abscess – same with MySQL updates

Nate Amsden Silver badge

Re: Well...

512MB, wow that is tiny. I used to run utility servers (DNS/NTP/email) on 1GB 32-bit VMs in 2010 and before (maybe even 512/768MB back then, I don't remember). In 2011 (at a new position) I decided to standardize on 64-bit, and 1GB wasn't really enough anymore (the system would start swapping), at least at some points in their life (especially doing package updates - also assuming swap was only to be used in emergency situations), so I upgraded them to 2GB. More recently (Ubuntu 20.04 with the default 5.4 kernel), due to kernel memory leaks, I had to upgrade to 3GB (or reboot more often). Not that I am low on memory, I have a couple TB available.

My smallest mysql db (at home) is 500MB of data and has 2GB of memory allocated to it. One of my smallest DBs at work is 1.4GB of data and has 5GB memory(started with 4, increased to 5 to get it to stop swapping).

AMD probes reports of deep fried Ryzen 7000 chips

Nate Amsden Silver badge

Re: Stupid "Optimized defaults" nonsense.

Certainly have been aware for years that CPUs have integrated memory controllers, and of CPU and bus speeds (been building my own computers since 486 days). The misleading part is that XMP made it seem like I needed this memory to run at the best speed that the CPU supported out of the box. No overclocking was implied (in the manual or in the BIOS settings). It seemed like just a setting to tell the CPU "hey, this memory is fast enough".

Nate Amsden Silver badge

Re: Stupid "Optimized defaults" nonsense.

Count me in on the clueless I guess.

I bought a Ryzen 3700X on the day it was released along with a Gigabyte "X570 I AORUS PRO WIFI", which I think was the only Mini ITX board available at the time. I bought good Micron memory and enabled the XMP feature to run the memory at the right speed (which to me wasn't overclocking, as that was specifically what the memory was sold to operate at). Looking at the manual again, there is no indication XMP had any effect on the CPU; I always assumed it was a memory-specific function. I'm not one to overclock, the only time I ever intentionally overclocked was a P200MMX to 233MHz (or was it 225MHz..).

I did fry that board after less than a year (December 2019), the first time in 20+ years I've had a motherboard fail on a personal system, that I can recall anyway. After the system froze overnight (during encoding), I power cycled it, but it would not turn on again. I tried many times and eventually the magic smoke came out of a component on the lower right side of the board, with tiny sparks. Gigabyte accepted the RMA without question (or explanation) and replaced the board (with the same revision number; I was hoping it would be a newer revision that fixed some known issue), and it has been working fine since - no other components were damaged. The only thing that system does is encode video with handbrake (otherwise it stays powered off). Though my backlog has been clear for some time, it's probably encoded a few thousand hours worth of stuff.

Had a good quality power supply (PC Power & Cooling, reused from earlier Athlon build and still in use today, total of over 12 years of service and counting), and it was connected to a double conversion UPS, so power quality was as good as it could get.

Mandiant's 'most prevalent threat actor' may be living under your roof – the teenager

Nate Amsden Silver badge

Re: A monster of your own creation

some kids have always been the sources of this stuff going back a few decades at least. Obviously only a tiny portion of kids (or adults even) have an interest to pursue this kind of activity.

Was curious so I poked at this article: https://en.wikipedia.org/wiki/Phreaking

"The tone was discovered in approximately 1957,[7] by Joe Engressia, a blind seven-year-old boy. Engressia had perfect pitch, and discovered that whistling the fourth E above middle C (a frequency of 2637.02 Hz) would stop a dialed phone recording. Unaware of what he had done, Engressia called the phone company and asked why the recordings had stopped. Joe Engressia is considered to be the father of phreaking."

Social engineering was more commonly known as scams.

Stratus ships latest batch of fault-tolerant Xeon servers

Nate Amsden Silver badge

interesting concept

Wonder how it works in real life. Protecting against component failures is nice, but there's no real mention of software reliability. If, for example, you are running vSphere on it and you need to install patches, that is still downtime (you could mitigate that further with a pair of systems, but that seems quite overkill). Or worse, if the OS crashes (host or guest).

Closest comparison I can think of off top of my head https://en.wikipedia.org/wiki/NonStop_(server_computers)

But that system seems to have software tolerance as well

"NonStop OS is a message-based operating system designed for fault tolerance. It works with process pairs and ensures that backup processes on redundant CPUs take over in case of a process or CPU failure. Data integrity is maintained during those takeovers; no transactions or data are lost or corrupted. "

But of course you probably can't run things like vSphere on NonStop.

vSphere itself has had fault tolerance for a long time, which would cover a lot of use cases where you have to be protected against component failure, but there are of course limitations (fewer now than originally: it was limited to 1 vCPU, and it looks like the current limit is 8 vCPUs) - https://www.vmware.com/products/vsphere/fault-tolerance.html

Fortinet's latest firewall promises hyperscale security while sipping power

Nate Amsden Silver badge

if you are pushing that much throughput

Nobody is going to blink at using 7kW of power, even if it means having to draw power from 2 racks.

Likely any such facility hosting such a piece of equipment, with bandwidth needs in excess of 100Gbps, will easily have enough power budget in a rack to run it. One thing to be careful of, though, would be to ensure the cabinet has 3 PDUs fed from 3 different sources of power/generators, and to put two PSUs on each PDU. Technically, per the specs, you could draw 7kW on 3 PSUs on a single feed, but personally I wouldn't feel comfortable pushing things to that kind of limit. Though likely the real power draw will be far less than the peak, so it probably doesn't matter.
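
To sketch out the feed-failure math (the 6 PSU / 2-per-PDU layout and the per-PSU capacity here are my assumptions, derived from the "7kW on 3 PSUs" figure rather than the actual Fortinet spec sheet):

    # Back-of-envelope: can the chassis ride out the loss of one power feed?
    # Assumes 6 PSUs split 2-per-PDU across 3 independent feeds, and that any
    # 3 PSUs can carry the full 7kW peak (i.e. ~2.33kW usable per PSU).

    PEAK_DRAW_W    = 7000
    PSUS_TOTAL     = 6                # assumption
    FEEDS          = 3
    PSUS_PER_FEED  = PSUS_TOTAL // FEEDS
    PSU_CAPACITY_W = PEAK_DRAW_W / 3  # "7kW on 3 PSUs"

    surviving_psus     = PSUS_TOTAL - PSUS_PER_FEED          # lose one whole feed
    surviving_capacity = surviving_psus * PSU_CAPACITY_W

    print(f"Capacity after losing a feed: {surviving_capacity:.0f}W "
          f"vs {PEAK_DRAW_W}W peak -> "
          f"{'OK' if surviving_capacity >= PEAK_DRAW_W else 'NOT enough'}")

    # With 2 PSUs per feed you keep ~9.3kW of PSU capacity, comfortably above
    # the 7kW peak; with everything on one feed, one outage takes the box down.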

I had one service provider several years ago go down at one site (not a site I was hosted at, but it still impacted the routing in the region), and it was due to a power outage. More specifically, whoever set up the equipment had, for whatever reason, hooked all of their core router PSUs to the same UPS/generator (when the facility they were at had multiple). So it was clearly set up incorrectly and eventually bit them in the ass when that one feed went down and took out their core router(s). The whole facility didn't go dark, just one set of feed(s).

They owned up to it quickly after I drilled them, and committed to fixing it fast. Stupid mistake, but the provider has a 100% uptime SLA with their customers, so not a mistake that should have ever happened.

Just because on-prem is cheaper doesn’t make the cloud a money pit

Nate Amsden Silver badge

IaaS cloud is a money pit

Probably in ~75% of the situations out there. Maybe on prem is a money pit in ~25% of the situations out there.

Back in 2016, El reg reported on Snapchat's IPO filing where they admitted they were committed to $400M/year in google cloud spend over the next few years(the most extreme example I can think of), and I poked at their financials earlier today and they still seem to be losing upwards of $1B/year. Obviously an org that doesn't care about burning money.

Imagine what one could accomplish with $400M a year "on prem". Likely real on prem costs for them would be in the $50-70M/year range at the most?

The last company I was at was a "small" company (employee-count wise); I moved them out of cloud in early 2012, when their cloud bill was ~$70k/mo (and they had just launched about 4 months prior). Over the following decade they grew a lot, but never needed more than four cabinets. The company prior to that was even smaller, maybe 60 employees or less, and had a cloud bill at times over $400k/mo. They never moved out of cloud despite my efforts; I had everyone on board including the CEO and CTO, but the board of directors wouldn't budge. That company is long since dead now.

IaaS is not for the faint of heart; it requires real skills and experience to get going right, and even then you can fail (cost wise), as Basecamp has shown (most recently). Being an infrastructure person myself, I realized this back in 2010. Perhaps the new normal of higher interest rates and no more "free money" will shift the tides heavily away from at least IaaS (I can hope anyway, haha).

SaaS is a totally different beast of course, with a totally different cost model (for end customers), and in many cases if you want/need that software platform you may be stuck using their model. I have seen time and again SaaS platforms launched over the years as an excuse to make poor (quality) software products that their customers couldn't otherwise hope to operate on their own.

I'd like to think a small company I worked at more than 15 years ago was an early pioneer in that (and we didn't even know it at the time). The company made software/services for big telcos. At every major company meeting, a common strategic goal was to make the product something the customer could operate themselves (that never happened while I was there). At one point it was my project to help demonstrate to the largest carrier in the U.S. that they could in fact host and operate our software on their own gear. The customer agreed the demonstration was a success (it took 1-2 months) and paid the company I worked for the $1M fee (or so I was told), but I could see the sheer horror in the eyes of their employees as they tried to understand how that super complicated software stack worked (by far the most complicated of any stack I've ever worked on; two people on my team quit within their first few months because they felt they couldn't keep up, something I've never seen before or since).

As far as I know they never took the software in house, and always chose to operate it as SaaS from the company that made it. I'm guessing usage of that software stack was mostly phased out a decade or more ago as Google and Apple took over the markets it was targeted towards. At least one of their applications is still online and shows the same name for the copyright in the code as it did in 2005 (except the date has been changed to 2002-2023, even though they got acquired in 2006). Another one of their portals is online but branded with the name of the company that acquired them, though the underlying host/domain of the site is the original from 2003 - they don't even redirect to a new name.

I've been working on high availability internet-hosted application stacks for 20 years (as of next month), and have never once seen a situation where IaaS (as sold by the big cloud companies at least) would be the right choice. I'm sure such situations exist, but they are few and far between (as I called out in 2010), UNLESS, of course, you don't care about burning money.

IaaS to me would be real useful if your workload is VERY variable. Needing thousands+ of CPU cores and terabytes of memory for short periods of time(guesstimate less than 7 hours/day on average over a period of time).

You can do on prem poorly/expensive, but it's my opinion it's far easier/more common to do cloud poorly/expensive.

HPE lobs scale-out storage services into GreenLake subscription vehicle

Nate Amsden Silver badge

conflicting stories

This story claims the block storage is based on Nimble, but an article linked to on El reg's home page says the block storage in Alletra MP is based on 3PAR

https://www.nextplatform.com/2023/04/04/hpe-converges-3par-block-and-vast-data-file-onto-one-alletra-mp-platform/

Maybe both Nimble and 3PAR will be used in Alletra MP, I don't know either way; it's just not common for there to be seemingly conflicting things on the front page.

Google Cloud's US-East load balancers are lousy with latency

Nate Amsden Silver badge

Re: make a hasty move to another region

That's not really how the big IaaS clouds work. The customer is responsible for movement of data, configs, servers etc, if you really want to move to another zone or region. This can be a significant amount of work unless you prepared for such a situation in advance by doing the work beforehand. You can certainly choose whichever zone or region you want when provisioning a resource(and be aware of any excess costs from cross zone/region data transfers), but "vmotion-like" moving is not possible with the standard stuff(by design).

I haven't checked recently, but at one point Amazon's SLA was such that they didn't consider it a breach unless you were unable to spin up resources in another zone in the region. So if you lose a dozen systems in zone A (due to, say, power loss or any reason really) but can build new systems in zone B, then they didn't consider that an SLA violation.

A new version of APT is coming to Debian 12

Nate Amsden Silver badge

Re: Ubuntu video drivers

Been using Ubuntu with Nvidia for close to 15 years. It's always just worked for me, though I have not used bleeding edge Nvidia chips. Was Nvidia not included at all in Xubuntu? (Never used it.) Normally I think Ubuntu fires up with the nouveau driver, then it's a few mouse clicks to enable the proprietary Nvidia driver, which Ubuntu automatically downloads and installs.

Or maybe whatever card you have was too new for the drivers that were included. The last time I manually downloaded and installed the Nvidia drivers was probably around 2008, and at that time I just installed them and let them overwrite whatever was needed to get things working. It was easy at the time: just download the driver, which itself was a self-executing installer, and it prompted you through the process. Unsure if that has changed since.

Been using Nvidia on linux almost exclusively since about 1999.

Save $7 million on cloud by spending $600k on servers, says 37Signals' David Heinemeier Hansson

Nate Amsden Silver badge

Re: Clouds aren't greedy....cloud CEOs are!

Caveat as there are several forms of "cloud". IaaS cloud, which is what Basecamp is mainly referring to, has been super expensive for at least the past 13-14 years, probably longer. It's not a big deal if you are very small, i.e. if your bill is say under $1k/mo - big whoop for a typical company. But it's easy to get to $100k/mo for even a small company on IaaS. I saw it myself for the first time back in 2010, with all the cloud ROIs being between 5-10 months for bringing (or keeping) stuff "on prem" (aka co-location), and that was giving cloud every benefit by assuming things would operate AT LEAST as well as on prem, which everyone knows would not be the case given the availability model of the big IaaS clouds (I have called it "built to fail" since 2011, indicating you need to do more work to cover for infrastructure failures in public cloud by design).

(before anyone tries to imply data centers fail often and you need to protect against that "on prem") -- I've experienced 2 data center failures in the past 20 years of professional co-location, both at the same data center, both in 2007 I believe it was, and I was already planning to move my org out of that facility before the failures hit(I inherited the facility after starting at a new company), it was a badly designed facility. By moving out I avoided a much longer outage at that facility that hit a couple of years later when they had a fire in the power room. So data center failures are extremely rare in my experience. That said, I would probably exclude automatically a good 75% of the world's data centers from my list of suitable locations off the bat just for inadequate design(that includes 100% of the big IaaS player's data centers for the same reason). So choose wisely. I don't consider on site "server rooms" to be data centers, data centers are purpose built facilities(preferably standalone) with dedicated staff, backup generators with fuel contracts, and routine testing of systems.

SaaS can be effective, as the billing is totally different, usually per account or some other metric unrelated to infrastructure. Much easier to justify (or not) in that case, especially with the value add of the software itself. That is assuming, of course, the SaaS provider operates the system properly with security, backups, availability etc (and in almost all cases you can never be sure they are doing it properly, you just hope they are and that they have an SLA that meets your needs).

VMware turns 25 today: Is it a mature professional or headed back to Mom's house?

Nate Amsden Silver badge

Re: That brings back memories

Back in the late 90s, Linux users especially wanted easier/better ways to run Windows apps. WINE was around then but not too good. I do recall trying out Bochs at one point in, I guess, 1998; I found my screenshots (http://elreg.nateamsden.com/bochs/ censored some stuff from them), probably running Debian 2.0 with Linux 2.1.123, and probably a beta of KDE (that was the only time I seriously used KDE, I switched to AfterStep not long after).

My main memory of Bochs was that it was SO SLOW. Totally unusable for anything outside of taking screenshots to say hey, yes, I did this. From an unrelated screenshot from May 1998, I found my hardware at the time was a P233MMX (overclocked from 200, the only time I ever overclocked), 128MB EDO RAM, and a Number 9 Imagine 128 Series 2 video card with AcceleratedX 4.1, and as of May 1998 running KDE Beta 4 (before 1.0 I assume) on a 14" monitor.

When Vmware for linux came out it was fast, very usable. Stable too, I really don't recall many, if any issues with it crashing or anything. I don't recall my hardware changing too significantly from late 1998 (when the Bochs screenshots were taken) to sometime in 1999 when I got Vmware for linux.

Nate Amsden Silver badge

customer for ~23 years

Been a customer since 1999, when VMware was a desktop product for Linux hosts only (if I recall right); Windows support didn't come till 2000 maybe? For whatever reason I started keeping historical copies of my VMware desktop versions on my server, which turned into a mini hoard of sorts, but I am just a bit fascinated by the sheer number of builds it has gone through and the size of the resulting installer package over the years.

The oldest copy of VMware I have is 2.0.3 build 799, which is 5.9MB in size and has a file timestamp of Jan 12 2001 (files inside appear to be dated Nov 2000). The README says their officially supported distros were Caldera 1.3->2.4, Red Hat 5.x->6.2, and SuSE 6.0->6.4 (I ran it on Debian 2.x). Looks like this could even be a beta; there is a CHANGES file inside that says "VMware Beta 2.0 for Linux contains many improvements over VMware 1.x for Linux", though perhaps they just forgot to remove that reference from the final release. I assume it's not a beta since it is 2.0.3, not 2.0.0.

I had at one point a "VMware for Linux 1.0.2" CD, kept it for a long time, then I lost it somehow a decade ago.

By ~2006 I have VMware Workstation 4.5.3 build 19,414, which is 41MB in size. Looks like it took the name "Workstation" starting with 3.0.0 (build 1455, 9.3MB, Nov 2001).

The latest version of vmware I have is Workstation 16.2.4(I know 17 is out already of course), build number 20,089,737, and is 523MB. Timestamp of Nov 1 2022, though I'm sure it's older than that.

Just sort of blows my mind they have apparently run almost 20 million builds of vmware between these 2 versions.

Also ran GSX, later VMware server, then started with ESX with version 3.5(for me around 2006/7), though others at my org at the time were using earlier versions of ESX in their test labs, and they made extensive use of GSX too back then (~2004).

I saw recently that VMware's product page has nearly 180 products on it, though I only use Workstation, ESXi, and vCenter. Some of the other products look neat, but they're so far down the list of priorities as far as budgeting goes that I've never considered purchasing them.

The VMware hypervisor products have been among the most solid pieces of software I've used in my career, which has made me very loyal to the platform. I know many folks can get burned on their more bleeding edge releases, but I haven't been on the bleeding edge since ESX 4.0 came out (I was very excited for that release, and none since). So very, very few issues, and my configurations are very conservative as well. I waited till far after ESX (didn't use/want ESXi, I liked the thick hypervisor) 4.1 was EOL before installing (not upgrading to) ESXi 5.5, then waited till a year or so after that was EOL before installing (not upgrading to) 6.5, which is where things are today still. I've missed out on all the early "fun" bugs in 7.x (and probably the earlier releases too, skipping 5.0, 5.1 and 6.0). I didn't upgrade to Workstation 16 until a couple of weeks before 17 came out; I have a license for 17, just no immediate plans to use it, 16 does everything I need (as did 15, the only reason I changed was I got a new computer, though the new computer runs the same Linux Mint+MATE 20 that the old computer did, and will run Mint 20 until at least 2025).

Ransomware scum launch wave of attacks on critical, but old, VMWare ESXi vuln

Nate Amsden Silver badge

Re: Attack Surface

If it is the SLP issue then you don't even need to patch, just turn SLP off. I turned mine off in late Oct 2020. The guide is here https://kb.vmware.com/s/article/76372

As far as I could tell SLP was never used on my systems, the only connection attempt logged lined up with the date/time of the boot-up of the server.

Zero impact, and zero impact since.

Of course running esxi exposed on the internet is a bad thing in any case.

Bill shock? The red ink of web services doesn’t come out of the blue

Nate Amsden Silver badge

It's amazing to me how some folks try to justify cloud. Not long ago I saw a post saying they suggest cloud because otherwise you need several 24/7 people to take care of your facility. Even things like roof repairs and stuff. I reminded them colo has been a thing since before cloud and solves that aspect fine. All big and probably most small cloud providers leverage colo in at least some of their markets. I've been in colo for 20 years across 5 companies.

Equally amazing: I remember back in 2010 I had a Terremark cloud rep try to challenge me on managing my own gear. I told them my solution cost about $800k at the time. Their solution was either $272k/mo OR about $120k/mo with a $3 million install fee. They didn't think I'd be able to manage $800k of gear. It was 2 racks of equipment. So easy. But the sales guy was confused that I could do it and not need to outsource to them or another provider.
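
For anyone curious how fast that math pays off, here's the back-of-envelope version using those 2010 quotes (a rough sketch: it ignores colo space/power/support on the self-managed side, which obviously add something):

    # Rough payback comparison: ~$800k of owned gear vs the two Terremark quotes.

    capex_own       = 800_000      # one-time hardware purchase
    cloud_monthly_a = 272_000      # option A: straight monthly fee
    cloud_monthly_b = 120_000      # option B: lower monthly fee...
    cloud_install_b = 3_000_000    # ...plus a $3M install fee

    months_to_payback_a = capex_own / cloud_monthly_a
    print(f"Option A pays for the hardware in ~{months_to_payback_a:.1f} months")

    # Option B starts $3M in the hole before the first monthly bill, so the
    # owned gear is ahead from day one.
    print(f"Option B year-1 cost: ${cloud_install_b + 12 * cloud_monthly_b:,} "
          f"vs ${capex_own:,} of hardware")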

Nate Amsden Silver badge

The last company I was at was a greenfield cloud thing. They had no app stacks, everything was brand new. Their existing technology was outsourced, and that company did everything from software dev to hosting and support etc. At one point before I started, the company felt they had outgrown that outsourced provider and wanted their own tech team to build their own app stack. So they hired a CTO, he built a team, and they started building the new software stack.

He hired a former manager of mine who had hired me at the previous company; I worked with him only a couple of months, but that was enough I guess. That previous company was hosted in Amazon cloud (also greenfield). This manager saw the pitfalls of that and wanted me at the new company mainly to move them OUT of the cloud (they had yet to actually launch production).

They launched production in Sept 2011 (I joined May 2011), after doing many weeks of their best efforts at performance/scale testing (I was not involved in any of that part). All of those results were thrown in the trash after a couple of weeks and the knobs got turned to 11 to keep up with the massive traffic. Costs skyrocketed as well, as did annoying problems with the cloud. We started ordering equipment for our small colo (2 racks, each roughly half populated initially) in early Nov 2011, installed it in mid Dec 2011, and then moved out of Amazon to those two racks in early Feb 2012 (I was a bit worried as there was a manufacturing flaw in our 10Gig Qlogic NICs that had yet to be solved; it ended up not causing any impacting issues though). I repeated a similar process for their EU business, which had to be hosted in the Netherlands, moving them out in July 2012 to an even smaller infrastructure, probably about half a rack at the time. In both cases, the equipment was at a proper co-location, not in a server room at an office.

The project was pitched by my manager as having a 7-8 month ROI, and the CTO got on board. It wasn't easy convincing the board, but they went with it. The project was a huge success. I dug up the email the CTO sent back in 2012 and sent it to the company chat on the 10th anniversary last year. He said in part "[..] In day 1, it reduced the slowest (3+ sec) Commerce requests by 30%. In addition, it reduces costs by 50% and will pay for itself within the year."

I believe we saved in excess of $12M in that decade of not being hosted in cloud (especially considering the growth during those years), and meanwhile had better performance, scalability, reliability, and security. The last/only data center failure I've experienced was in 2006 or 2007, Fisher Plaza in Seattle. I moved the company I was at out of there quite quickly after that (they were already there when I started). Remember that cloud data centers are built to fail (a term I started using in 2011), meaning they are lower tier facilities, which is cheaper for them and is a fine model at high scale, but you have to have more resilient apps or be better prepared for failure vs the typical enterprise on prem situation.

So count me as someone who disagrees, greenfield cloud is rarely the best option.

Basecamp details 'obscene' $3.2 million bill that caused it to quit the cloud

Nate Amsden Silver badge

Re: Hiring impact

That is very interesting, and unfortunate for the customers. Sounds like that is not a real SaaS stack? Perhaps some hacked together stuff operated as a managed service?

I would not expect that in a SaaS environment a customer would even be able to look at the underlying infrastructure metrics or availability; it's just not exposed to them. I know I got frustrated using IaaS years ago because not enough infrastructure data was available to me.

Nate Amsden Silver badge

Getting good at storing files doesn't have to be basecamp's business.

People seem to jump to the extreme conclusions, that either you build everything yourself or you use a public cloud, and I just don't understand why. There is a massive chasm of space in between those two options in the form of packaged solutions from vendors like HPE, Dell, and others, with many different tiers of hardware, software and support.

Nate Amsden Silver badge

8PB is a lot, but it's not really a lot for object storage. The HPE Apollo 4510 is a 4U server that can have up to 960TB of raw storage (so ~10PB per rack, assuming your facility supports that much power per rack). Depending on performance needs, 60 x 16TB drives may not provide enough speed for 960TB by itself; you'd probably want some flash in front of that (handled by the object storage system). Of course you would not be running any RAID on this; data protection would be handled at the object layer. Large object storage probably starts in the 100s of PB, or exabytes.
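
Rough math on the footprint (a sketch: the protection overhead and usable rack units are my assumptions, and real deployments would leave growth headroom):

    import math

    # How much raw capacity does ~8PB usable need, and how many 4U/960TB nodes
    # is that? A ~1.5x protection overhead is an assumed erasure-coding scheme;
    # 3-way replication would be 3x instead.

    usable_pb     = 8
    ec_overhead   = 1.5        # assumed protection overhead
    node_raw_tb   = 960        # e.g. HPE Apollo 4510, 60 x 16TB, 4U
    node_size_u   = 4
    usable_rack_u = 40         # leave some U for switches etc (assumption)

    raw_tb = usable_pb * 1000 * ec_overhead
    nodes  = math.ceil(raw_tb / node_raw_tb)
    racks  = math.ceil(nodes * node_size_u / usable_rack_u)

    print(f"{raw_tb:.0f}TB raw -> {nodes} nodes -> {racks} rack(s)")
    # ~12000TB raw -> 13 nodes -> about 2 racks with these assumptions, or one
    # very dense, very heavy, very power-hungry rack if the facility allows it.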

There's no real need to use Ceph, which is super complex (unless you like that). Probably better to be using something like Cohesity or Scality (no experience with either), both available for HPE Apollo (and other hardware platforms I'm sure). There are other options as well.

I think I was told that Dropbox leveraged HPE Apollo or similar HPE gear +Object storage software when they moved out of Amazon years ago. As of 2015 Dropbox had 600PB of data according to one article I see here.

I'm quite certain it would be easy to price a solution far less than S3 at 8PB scale, even less scale. You also don't need as much "protection" if you choose proper facilities to host at. Big public cloud providers cut corners on data center quality for cost savings. It makes sense at their scale. But users of that infrastructure need to take extra care in protecting their stuff. Vs hosting it yourself you can use the same model if you want, but if you are talking about 8PB of data that can fit in a single rack(doing that would likely dramatically limit the number of providers you can use to support ~20kW/rack? otherwise split into more racks), I would opt for a quality facility with N+1 power/cooling. Sure you can mirror that data to another site as well, but no need for going beyond that (unless geo latency is a concern for getting bulk data to customers).

Nate Amsden Silver badge

Re: Open source

You got a bunch of down votes but you are right for the most part. A lot of the early open source business models were: release the source for free and then build a business around supporting it. Not everyone would sign up as customers, but the good will from releasing the source would attract users. It worked well for several companies, and of course public cloud is taking that away from a lot of these orgs, which is unfortunate. And as El Reg has reported, several such companies have been very vocal about this situation.

Obviously in many (maybe all?) cases the license permits this usage (at the time anyway, some have introduced licensing tweaks since to prevent it), but I'm quite sure if you went back in time ~15ish years and asked the people making the products whether they anticipated this happening, they would probably say no in almost all cases (perhaps they would have adjusted their licenses if they had viewed that possibility as a credible threat). At the end of the day, the big difference between these cloud companies and earlier generations of "mom & pop" ISPs that were using Apache or whatever to host their sites is just massive scale.

Those licensing their code in BSD licensing or similarly completely open licensing probably wouldn't/shouldn't care anyway.

Similarly for the GPL, a trigger of sorts for making the GPLv3 was TiVo "exploiting" a loophole in the GPLv2. So GPLv3 was made to close that hole (and perhaps others). There's even a term coined for it, "Tivoization":

https://en.wikipedia.org/wiki/Tivoization

"In 2006, the Free Software Foundation (FSF) decided to combat TiVo's technical system of blocking users from running modified software. The FSF subsequently developed a new version of the GNU General Public License (Version 3) which was designed to include language which prohibited this activity."

Nate Amsden Silver badge

Re: Hiring impact

I think you are confusing SaaS and IaaS in your statement.

Mom and Pop shops that have minimal IT needs will likely have almost zero IaaS, because they can't manage it. IaaS (done right) IMO requires more expertise than on prem, unless you have a fully managed IaaS provider. The major players don't really help you with recovery in the event of failure; it's on the customer to figure that out. Vs on prem with VMware, for example: if a server fails, the VMs move to another server; if the storage has a controller failure or a disk failure, there is automatic redundancy. That doesn't protect against all situations of course, but far more than public cloud does out of the box. If a Mom & Pop shop just has a single server with no redundant storage etc, and that server has a failure, they can generally get it repaired/replaced with minimal to no data loss. Vs in the major clouds, a server failure is generally viewed as normal operations and the recovery process is more complex.

I've been calling this model "built to fail" since 2011, meaning you have to build your apps to handle failure better than they otherwise would need to be. Or at least be better prepared to recover from failure even if the apps can't do it automatically.

SaaS is a totally different story, where the expertise of course is only required in the software being used, not any of the infrastructure that runs it. Hosted email, Office, Salesforce, etc etc..

On prem certainly needs skilled staff to run things, but doing IaaS public cloud(as offered by the major player's standard offerings) right requires even more expertise(and more $$), as you can't leverage native fail over abilities of modern(as in past 20 years) IT infrastructure, nor can you rely on being able to get a broken physical server repaired(in a public cloud).

Nate Amsden Silver badge

Re: Cloud Vs On-Prem

Should do the math for how bursty is bursty. At my last company I'd say they'd "burst" 10-30X sales on high events, but at the end of the day the difference between base load and max load was just a few physical servers(maybe 4).

IMO a lot of pro cloud folks like to cite burst numbers but probably are remembering the times of dual socket single core servers as a point of reference. One company I was at back in 2004 we literally doubled our physical server capacity after a couple of different major software deployments. Same level of traffic, just app got slower with the new code. I remember ordering direct from HP and having stuff shipped overnight (DL360G2 and G3 era). Not many systems, at most maybe we ordered 10 new servers or something.

Obviously modern servers can push a whole lot in a small (and still reasonably priced) package.

A lot also like to cite "burst to cloud", but again you have to be careful: I expect most transactional applications to have problems with bursting to remote facilities simply due to latency (whether the remote facility is a traditional data center or a cloud provider). You could build your app to be more suited to that, but that would probably be quite a bit of extra cost (and ongoing testing), not likely to be worthwhile for most orgs. Or you could position your data center assets very near to your cloud provider to work around the latency issue.

Now if your apps are completely self contained, or at least fairly isolated subsystems, then it can probably work fine.

At one company I was at, the front end systems were entirely self contained (no external databases of any kind), so scalability was linear. When I left in 2010 (the company is long since dead now), costs for cloud were not worth it vs co-location. Their front end systems at the time consisted of probably at most 3 dozen physical servers (Dell R610 back then; at their peak each server could process 3,000 requests a second in tomcat) spread across 3-4 different geo regions (for reduced latency to customers as well as fail over). The standard front end site deployment was just a single rack at a colo. There was only one backend for data processing, which was about 20 half populated racks of older gear.
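
"Do the math for how bursty is bursty" really is the whole game, so here's a sketch of that math (the per-server figure is roughly what those R610-era tomcat front ends did; the base load and headroom numbers are hypothetical, plug in your own):

    import math

    # How many extra physical boxes does a traffic peak actually cost, given a
    # per-server capacity and some headroom? Base/peak numbers are made up.

    per_server_rps = 3000
    headroom       = 0.7      # plan to run boxes at ~70% of capacity
    base_rps       = 4_000    # hypothetical steady-state load

    def servers_for(rps):
        return math.ceil(rps / (per_server_rps * headroom))

    baseline = servers_for(base_rps)
    for peak_multiplier in (2, 5, 10):
        peak  = base_rps * peak_multiplier
        extra = servers_for(peak) - baseline
        print(f"{peak_multiplier:>2}x peak ({peak:>6} req/s): "
              f"{extra} extra server(s) over the {baseline} baseline")

    # Whether bursting (to cloud or anywhere else) is worth the latency and
    # complexity depends entirely on whether that number is 4 or 400.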

Nate Amsden Silver badge

nice to see

Nice to see them go public about this. Not many companies are open about this kind of stuff. Another one I like to point out to people (but with far less detail, mainly just a line item in their public budget at the time) is this: https://www.geekwire.com/2014/moz-posts-2013-5-7m-loss/ . They don't call it out in the article text, but there is a graphic there showing their budget breakdown, with their cloud services taking between 21-30% of their REVENUE (cloud spend peaking at $7M), and you can see in the last year they were moving out, as they had a data center line item.

I moved my last org out of cloud in early 2012, easily saved well over $12M for a small operation in the decade that followed. I had to go through justification again and again as the years went on and new management rolled in(and out) thinking cloud would save them money. They were always pretty shocked/sad to see the truth.

At the org previous to that, I proposed moving out but the board wasn't interested (everyone else was, including the CEO and CTO, but not enough to fight for the project); they were spending upwards of $400-500k/mo at times on cloud (and having a terrible experience). I left soon after, and the company is long since dead.

You can do "on prem" very expensive and very poorly but it's far easier to do cloud very expensive and very poorly.

Cisco warns it won't fix critical flaw in small business routers despite known exploit

Nate Amsden Silver badge

Re: White Box Switches and Cumulus Linux

People could have the same issue here depending on their hardware. When Nvidia bought Mellanox they killed off support for Broadcom chips in Cumulus Linux, which left a lot of upset users. Looks like Cumulus 4.2 was the last release to support Broadcom. (I have never used Cumulus/Mellanox or white box switches in general myself.)

Assuming you purchased your gear before the acquisition (2020), since you said "a few years ago", hopefully your switches are not Broadcom based if you ran them with Cumulus.

Nate Amsden Silver badge

Re: Time to dump Cisco

Curious can you name any such products especially in the networking space? I've been doing networking for about 20 years and haven't heard of any vendor/product remotely approaching 15 years of support after end of sale, at most maybe 5 years?

After long delays, Sapphire Rapids arrives, full of accelerators and superlatives

Nate Amsden Silver badge

intel wasn't thinking straight

Realized this and wanted to post about it. These new chips are nice, but obviously one of the big users of the chips will be VMware customers. New VMware licensing comes in 32-core increments (and I think MS Windows server licensing is 8 core increments after the first 16 cores?)

Intel says (according to HP) that the "P" series Xeons are targeted at cloud/IaaS systems. There's only one P series chip (at least for the DL380 Gen11), and it has 44 cores. So you're having to license 64 cores to use that processor but of course only have 44 available (and I believe a dual socket 44-core system (88 cores) would require 128 cores of VMware licensing, as they track cores per socket, not cores per system, according to their docs). At the top end of 60 cores, you're paying for 64 cores of VMware licensing for 60 cores of capacity (or again 128 cores of licensing for a dual socket 60-core system). Intel's previous generation had 32 and 40 core processors (fewer than 32 would probably be a waste unless you are also having to license Windows or Oracle and are concerned about their per-core licensing).
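
The core-count math, as I understand the current per-CPU licensing (each license covers one socket with up to 32 cores, and a socket with more cores needs additional 32-core increments for that socket) - a sketch, so double check against your own agreement:

    import math

    # How many 32-core vSphere licenses does a given box need, and how many
    # licensed cores go unused? Licensing is per socket in 32-core increments.

    def vsphere_licenses(sockets, cores_per_socket, increment=32):
        per_socket = math.ceil(cores_per_socket / increment)
        licenses   = per_socket * sockets
        wasted     = licenses * increment - sockets * cores_per_socket
        return licenses, wasted

    configs = [
        ("Intel 'P' series 44-core, 2 sockets",    2, 44),
        ("Intel 8490H 60-core, 2 sockets",         2, 60),
        ("AMD EPYC 96-core, 2 sockets",            2, 96),
        ("AMD EPYC 64-core (prev gen), 2 sockets", 2, 64),
    ]
    for name, sockets, cores in configs:
        lic, wasted = vsphere_licenses(sockets, cores)
        print(f"{name}: {lic} x 32-core licenses, {wasted} licensed cores unused")

    # The 44-core parts license 128 cores to use 88; the AMD 64/96-core parts
    # waste nothing.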

Vs AMD, whose latest gen has CPUs with 32/64/96 cores, all of which divide quite well into 32-core licensing. On the previous generation, AMD had both 32 and 64 core processors.

A vSphere enterprise license(1 socket/32-cores) with 3 years production support was about $10k last I checked(that's before adding anything like NSX, vSAN, vROPS or whatever else, for me I just use the basic hypervisor), which is a good chunk of the cost of the server.

Nate Amsden Silver badge

basic comparisons

Obviously the raw specs don't indicate what the actual performance is, but was just curious looking at the HPE DL380 Gen11 vs the DL385 Gen 11. A few things that stand out to me -

Intel says that cloud/IaaS workloads benefit from fewer cores/higher frequencies (the Intel P series Xeons), which seems strange to me; of course it depends on the app. But as an ESXi user for many years I'll take more cores any day (especially if I don't have to worry about Windows/Oracle licensing..). They say databases benefit from more cores (the H series Xeons).

Intel's top end 8490H 60-core 350W processor runs at 1.9GHz and has 112MB of L3 cache.

AMD's top end 9654 96-core 360W processor runs at 2.4GHz and has 384MB of cache (HPE isn't specific for AMD whether that is L3 or the aggregate of all caches on the DL385).

On the flip side, the DL380 Gen11 has 16 memory slots per socket, vs only 12 on the DL385 Gen11 (the DL385 Gen10+ V2 has 16/socket); I can only assume the new AMD chips are physically much larger than the Intel chips.

AMD's previous generation chips (in the DL385 Gen10+ V2) topped out at 280W, while the new chips go to 360W (I have an earlier version of the Gen11 data sheet that actually says they draw 400W; I had to double check against the current data sheet).

Intel's previous generation chips (in DL380 Gen10+) topped out at 270W, vs the new chips at 350W.

I *think* I'd rather take the previous generation 64-core/socket DL385Gen10+ V2 with the extra memory slots and less power usage/server(and probably better stability with the firmware/etc being around for longer). Would be a tough choice. Key point for me is my workloads don't even tax the existing DL380Gen9 servers CPU wise(but I want lots of cores for lots of VMs/host)

Amazon slaps automatic encryption on S3 data

Nate Amsden Silver badge

Re: Really?

The way I believe most object storage works on the backend is that blocks are replicated between nodes. So even if someone were to get their hands on unencrypted drives that were used for S3 on some nodes, they'd only get partial bits of data; maybe a determined attacker could get something useful out of those partial bits, but it would be a PITA.

Same reason I have never been concerned about not running encryption at rest on 3PAR: if someone got a "pre-failed" disk from one of my arrays, they'd just have random 1GB chunks of filesystems. Maybe you get lucky and find something useful, but anyone willing to go to those lengths would have to be very determined, and there are probably much easier ways to compromise security. Of course there are industries/audit processes that require encryption at rest just for the checkbox.

Rackspace blames ransomware woes on zero-day attack

Nate Amsden Silver badge

Re: Not us then

Sure thing. I didn't know it myself until about a month ago (not my fault, as I have never been responsible for Office 365 or Exchange in my career). I knew there were Office 365 backup solutions out there, and was looking into them a bit more out of curiosity when I saw them quote that Microsoft site.

It's pretty bad that most Office 365 admins don't seem to understand it, and are just assuming MS is invincible and they don't have to worry about backups. At least in my experience, when people write "you should just move to Office 365", almost never have I seen them also say "oh, but you need to keep your own backups too".

I am not sure if Rackspace had any formal way for customers to take proper backups (aside from outlook archives).

Nate Amsden Silver badge

Re: Not us then

Microsoft would say the same if you use Office 365, backups are the responsibility of the customer.

https://learn.microsoft.com/en-us/azure/security/fundamentals/shared-responsibility

Welcome to cloud.

(Myself, I have self-hosted email for 24 years, and haven't been responsible for corporate email since 2002; at that point I ran email with Postfix/Cyrus IMAP, which is what I still use at home.)

Elon Musk's cost-cutting campaign at Twitter extended to not paying rent, claims landlord

Nate Amsden Silver badge

Re: Inevutable

I left a position 2 companies ago(~12 years ago). I've always made it a point not to sign contracts myself even if I had authority(and in many cases the companies I worked for didn't care if I signed provided it was approved, but often times I just prefer not to personally sign regardless).

Anyway, after I left, the CTO tried to terminate one of the contracts, I think either for the DNS provider Dyn or for Webmetrics website monitoring. They were under contract but claimed the contracts weren't valid because I wasn't authorized to sign. The funny thing is the vendor pulled up the contract and found/showed my former employer (who is long out of business now) that in fact it was my Senior Director who had signed the contract in question (that same director essentially resigned a day after I left, later tried to recruit me to join him at Oracle cloud, which I declined, and he later retired), not me, so their argument was invalid. I got a good laugh out of that story.

Stolen info on 400m+ Twitter accounts seemingly up for sale

Nate Amsden Silver badge

Re: 400m users

Why would you think that? Twitter probably has far more than 400m accounts (I'd be surprised if they had less than 1.2 billion including bots/fake accounts/etc), the article does not indicate any of the accounts were active. Likely there are a bunch in the list that were, but maybe it's only 10-20% of "active" accounts. Or maybe a higher number, or a lower number..

Linux kernel 6.2 promises multiple filesystem improvements

Nate Amsden Silver badge

This is not accurate. I've seen people write this 1GB per TB tons of times.

The 1GB per TB was always about ZFS with dedupe enabled. Without dedupe you can get by with much less.

Myself on my laptop I still use ext4 despite having 128G of ram, just because it's simpler.

I do use ZFS in some cases, mainly at work, mainly for less used MySQL servers with ZFS compression enabled(and I use ZFS as a filesystem only, RAID is handled by the underlying SAN which is old enough not to support compression).

My home server runs an LSI 8-port SATA RAID card with a battery backup unit, 4x8TB drives in hardware RAID 10 with ext4 as the filesystem and weekly scrubs(via LSI tools). I used ZFS for a few years mainly for snapshots on an earlier home server(with 3Ware RAID 10 and weekly scrubs) but ended up never needing the snapshots, so I stopped using ZFS.

I do have a Terramaster NAS at a co-location for personal off site storage, which runs Devuan and ZFS RAID 10 on 4x12TB disks. The boot disk is an external USB HP 900G SSD, with ext4 again. That's the only place I'm using ZFS RAID.

Haven't used anything but RAID 1/10 at home at least since about 2002, which was a 5x9GB SCSI RAID with a Mylex DAC960 RAID card. Largest number of disks in array at home since has been 4.

At work I'm still perfectly happy with the 3PAR distributed sub disk RAID 5 (3+1) on any of the drives I have spinning or SSD.

openSUSE Tumbleweed team changes its mind about x86-64-v2

Nate Amsden Silver badge

Re: Sensible

I remember back in the 90s efforts to optimize by compiling for i586 or i686, for example; then there was the egcs compiler (which I think eventually became gcc?), and then Gentoo came around at some point, maybe much later, targeting folks who really wanted to optimize their stuff. FreeBSD did this as well to some extent with their "ports" system (other BSDs did too, but FreeBSD was the most popular at the time, probably still is). I personally spent a lot of time building custom kernels, most often static kernels - I didn't like to use kernel modules for whatever reason. But I tossed in patches here and there sometimes, and only built the stuff I wanted. I stopped doing that right when the kernel got rid of the "stable" vs "unstable" trees as the 2.4 branch was maturing.

Myself, I never really noticed any difference. I've said before to folks that if there's not at least, say, a 30-40% difference then likely I won't even notice (not referring specifically to these optimizations, but to upgrading hardware or whatever). A 20% increase in performance, for example, I won't see. I might see it if I am measuring something, such as encoding video. But my computer usage is fairly light on multimedia things (other than handbrake for encoding - I have ripped/encoded my ~4000 DVD/BD collection, but encoding is done in the background, so 20% faster doesn't mean shit to me; double the speed and I'll be interested, provided quality is maintained). All of my encoding is software, I don't use GPU encoding.

I haven't gamed seriously on my computer in over a decade, I don't do video editing, or photo editing, etc etc. I disable 3D effects on my laptop (Mate+Mint 20), even though I have a decent Quadro T2000 with 4G of ram(currently says 17% of video memory is used for my 1080p display). I disable them for stability purposes(not that I had specific stability problems with them on that I recall, I also disable 3D acceleration in VMware workstation for same reason). I've never had a GPU that required active cooling and I have been using Nvidia almost exclusively for 20 years now (I exclude laptops since pretty much any laptop with Nvidia has fans, but the desktop GPUs I have bought, none have ever had a fan).

I really don't even see much difference between my 2016 Lenovo P50 with an i7 quad core, SATA boot SSD (+2 NVMe SSDs), 48G of RAM and an Nvidia Quadro M200M, and my new (about 2 months old now) Lenovo P15 with an 8-core Xeon, 2 NVMe SSDs, 128G of ECC RAM and a Quadro T2000. It's a bit faster in day to day tasks, but I was perfectly happy on the P50.

My new employer insisted they supply me with new hardware so I said fine, if you want to pay for it, this is what I want. They didn't get it perfect, I replaced the memory with new memory out of pocket and bought the 2nd NVME SSD(not that I needed it, just thought fuckit I want to max it out). I was open this time around to ditching Nvidia and going Intel video only, but turns out the P15 laptop I wanted only came with Nvidia (even though it's hybrid, I think..). Since the Nvidia chip is there anyway I might as well use it, I've never had much of an issue with their stuff unlike some others that like to run more bleeding edge software. I expect a 6-10 year lifespan out of this laptop so I think it's worth it.

On the 12th day of the Rackspace email disaster, it did not give to me …

Nate Amsden Silver badge

Don't forget

For those saying Office 365 is the best thing to use: MS is up front about their stuff too:

https://learn.microsoft.com/en-us/azure/security/fundamentals/shared-responsibility

The "information and data" section is shown as the sole responsibility of the customer, not of MS.

"Regardless of the type of deployment, the following responsibilities are always retained by you:

Data

Endpoints

Account

Access management"

There are backup solutions for Office 365 for a reason.

I'm assuming here probably greater than 90% of their customers don't realize this.

(I don't vouch for any provider in particular; I'm a Linux/infrastructure person who has never touched Exchange in my life, and I've been hosting my own personal email on my own personal servers (co-lo these days) since 1997.)

Nate Amsden Silver badge

Re: Right.

I have read people claim ransomware can sit waiting for upwards of 90+ days before striking. I used that justification to finally get a decent tape drive approved a few years ago for my last company's IT dept, then they used Veeam to backup to tape.

I suppose in theory if you restored old data onto a server WITH A CLOCK SET TO THE RIGHT TIME (not current time), then perhaps it could be fine, but of course systems don't often behave well when their clocks are out of sync.

So for example, if you restored data from 45 days ago onto a server with its clock set to 45 days ago, then you may be OK; if you restore it to a server with the current time, then perhaps the existing ransomware will see the strike time has passed and activate again. I've never been involved in a ransomware incident myself, so I don't know how fast it acts.

I'd assume this was a highly targeted attack against rackspace not a drive by thing.

I also read years ago, on multiple occasions, claims from security professionals that on average intruders had access to a network for roughly 6 months before being detected. I first saw this claim reported by the then CTO of Trend Micro; I saw a presentation from him at a conference - normally I hate those kinds of things, but that guy seemed quite amazing. I was shocked to see him admit on stage that "if an intruder wants to get in, they will get in, you can't stop that", and not try to claim his company's products can protect you absolutely. I posted the presentation in PDF form (probably from 2014) here previously, though it loses a lot of its value without the dialog that went along with it:

http://elreg.nateamsden.com/TH-03-1000-Genes-Keynote_Detect_and_Respond.pdf

Nate Amsden Silver badge

Re: So where are the backups?

At my first system admin job back in 2000, one of the managers there (not someone I reported to) would on occasion ask me to restore some random thing. I thought these were legitimate requests so I did (or tried to; sometimes I could not, depending on the situation). Later he told me he didn't actually need that stuff restored, he was just testing me - which I thought was interesting. I wasn't mad or anything. I've never had a manager do that again, or at least never admit to it.

At one company I was at, we finally got a decent tape drive and backup system in place. I went around asking everyone what they needed backed up, as we didn't have the ability to back up EVERYTHING (most of the data was transient anyway). Fast forward maybe ~6-9 months and we had a near disaster on our only SAN. I was able to restore everything that was backed up; some requests did come in to restore stuff that was never backed up, and I happily told them, sorry, I can't get that because you never requested it be part of what was backed up. In the end there was minimal data loss from the storage issue, but there was several days of downtime to recover.

My first near disaster with storage failure (I wasn't on the backend team that was responsible) was in 2004, I believe: a double controller failure in the SAN took out the Oracle DB. They did have backups, but they knowingly invalidated those backups every night by opening them read/write for reporting purposes. The team knew this and made the business accept that fact. So when disaster struck, it was much harder to restore data, as you couldn't simply copy data files over or restore the whole DB, since the reporting process was rather destructive. Again, multiple days of downtime to get things going again, and I recall still encountering random bits of corruption in Oracle a year later (it would result in an ORA-600 or some such error, and the DBA would then go in and zero out the bad data or something).

My most recent near storage disaster was a few years ago at my previous company. Their accounting system hadn't been backed up in years apparently. IT didn't raise this as a critical issue; if they had raised it with me I could have helped them resolve it, it wasn't a difficult problem to fix, just one they didn't know how to do themselves. Anyway, the storage array failed, again a double controller failure, on an end of life storage array in this case. They were super lucky that I was friends with the global head of HPE storage at the time, and after ~9 hours of our 3rd party support vendor trying to help us I reached out to him in a panic and he got HPE working around the clock. It took about 3 days to find and repair the metadata corruption, with minimal data loss (no data loss for the accounting folks). Was quite surprised when I asked for a mere $30k to upgrade another storage system so we could move the data and retire the end of life one, and the same accounting people who almost lost 10 years of data with no backups told me no.

IBM to create 24-core Power chip so customers can exploit Oracle database license

Nate Amsden Silver badge

Re: For now...

Kind of surprised they haven't already. I moved a company from Oracle EE to Oracle SE back in about 2008 for exactly this reason (did so as a result of the company failing their 2nd Oracle audit, after ignoring my advice to make the change after they failed the first audit, where they were caught running Oracle EE when they only had a license for Oracle SE One, not even SE, SE ONE). I remember when buying our servers for Oracle EE I opted for the fastest dual core processors; by the time we switched to SE quad core had come out so I changed them to single socket quad core. We even encountered a compatibility issue on our early DL380G5s when upgrading to quad core, they didn't work without a motherboard replacement, and HP later realized this and updated their docs. I don't think they charged us for the replacement since they told us it would work when we ordered the new parts, and it was their staff doing the hardware change. I remember AMD talking shit about Intel's quad core chips not being true quad core but a pair of dual core chips (early version of "chiplet" maybe?).

Also remember having to "school" Oracle's own auditors regarding Oracle SE licensing, specifically the unlimited cores per socket, which they didn't believe until they looked it up themselves, and their hearts sank when they realized they could not do per-core licensing on that.
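On the subject of per-socket licensing, the socket count is easy enough to verify yourself on a Linux box before an auditor does it for you. A quick sketch (x86 Linux assumed, counting distinct "physical id" values in /proc/cpuinfo):

    # Sketch: count physical sockets and logical CPUs on x86 Linux, handy when
    # licensing is per-socket (as Oracle SE was) rather than per-core.
    sockets = set()
    logical = 0
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("physical id"):
                sockets.add(line.split(":")[1].strip())
            elif line.startswith("processor"):
                logical += 1
    print(f"sockets={len(sockets)} logical_cpus={logical}")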

Back then you could technically run Oracle Enterprise Manager with the performance tools (which were great IMO) on Oracle SE, even though from a licensing standpoint you could not (those packs are only licensable on Oracle EE). If they ever audited again I could easily uninstall OEM with a simple command (which I had to do regularly for various reasons on test systems). Newer versions of Oracle made this trick impossible (at least for me). Oracle did not audit again before the company shut down. Note: I am not a DBA but sometimes I fake it in my day job.

Also did the same for some of our VMware hosts; VMware at the time required purchasing licensing in pairs of sockets, and they said they did not support single-CPU systems, though I assumed at the time that really meant single socket, single core.

Then I combined the two, at least in a couple of cases: Oracle and VMware on top of a single socket DL380G5 with a quad core CPU, I don't remember how much memory or disk. Some of the systems were connected to my first 3PAR, a tiny E200. Probably totally unsupported by anyone officially, Oracle's policy at the time was that you had to reproduce the issue on a supported system. But I don't recall ever having to open a support case with either company while I was there, at least not for the ones running on VMware; I did have some support cases for production, which ran on bare metal.

At the time Oracle SE was max 2 sockets per system(no core restrictions) and max 4 sockets in a RAC. Haven't looked at their model since but it sounds like it's probably the same today.

Oracle clouds never go down, says Oracle's Larry Ellison

Nate Amsden Silver badge

Re: IaaS loves to blame the customer

One more story, I like talking/writing about this kind of stuff. This was from a former co-worker, who said he used to work as a tech at some data center. It was a small one, no big names. But his story (from some time before 2010) was that they had a generator on site in a building or some structure to protect it from the elements. They ran load tests on it semi-regularly, but the load tests were only for a few minutes.

One day a power outage hit, the generator kicked on as expected, then shut down after something like 15-30 minutes as it overheated (I think he said the overheating was related to the enclosure the generator was in). So in that situation they had bad design and bad testing policies, and fixing either should have caught the issue long before it impacted customers.

Another case of bad design IMO is any facility using flywheel UPS. My thought there is I want technical on-site staff 24/7 at any facility, people able to respond to stuff. Flywheel UPS only gives enough runtime for a few seconds, maybe a minute, for generators to kick on. That is not enough time for a human to respond to a fault (such as the switch that starts the generators failing, which happened at a San Francisco facility that used flywheels back in 2005ish). I was touring a new (at the time) data center south of Seattle in 2011, a very nice facility; Internap was leasing space there and I was talking with them about using it. I mentioned my thoughts on flywheels and the person giving the tour felt the same, and said it was Microsoft, I think, that had a facility nearby at the time that used flywheels, and he claimed they had a bunch of problems with them.

Not that UPSs are flawless by any stretch, I just would like to see at least 10 minutes of power available between failure and generator kick-on; how that power is provided is less important as long as it works. Flywheels (the ones I'm aware of anyway) don't last long enough. Certainly there will be situations where a failure cannot be fixed in 10 minutes, but I'm confident there are at least some scenarios where it can be (the biggest being an automatic transfer switch that doesn't "switch" automatically and needs someone to manually switch it).
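The arithmetic behind that 10 minute target is simple enough, it's really just a question of how much battery you're willing to pay for. A back-of-the-envelope sketch (every number here is invented for illustration):

    # Back-of-the-envelope UPS ride-through estimate; the numbers are made up
    # for illustration, plug in your own load and usable battery energy.
    load_kw = 400.0      # critical IT load
    usable_kwh = 80.0    # usable battery energy after derating/aging
    runtime_min = usable_kwh / load_kw * 60
    print(f"~{runtime_min:.0f} minutes of ride-through at {load_kw:.0f} kW")  # ~12 minutes here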

Nate Amsden Silver badge

Re: IaaS loves to blame the customer

dug up the power issue from Amsterdam, which was previously a Telecity data center that Equinix had acquired by that point (2018):

"DESCRIPTION: Please be advised that Equinix and our approved contractor will be performing remedial works to migrate several sub-busbar sections back from there temporary source to the replaced main busbar which became defective as reported in incident AM5 - [5-123673165908].

During the migration, one (1) of your cabinet(s) power supplies will be temporary unavailable for approximately six (6) hours. The redundant power supply(s) remains available and UPS backed. "

But this power incident wasn't critical for me since everything was redundant on my end. I'm not a power expert so I certainly can't say for sure whether a better power design could have allowed this kind of maintenance to be done without taking power circuits down to customers. But I can say I've never had another facility provider need to take power offline for maintenance for any reason in almost 20 years. Perhaps for this particular activity it was impossible to avoid, I don't know.

After Equinix acquired Telecity I noticed the number of customer notifications went way up; Telecity had a history, with me at least, of not informing customers of stuff. I hated that facility and staff AND policies so much. Only visited it twice, before Equinix took over, and according to my emails it looks like we moved out less than 3 months after the above power issue (the move was unrelated to that).

Nate Amsden Silver badge

Re: IaaS loves to blame the customer

I don't agree there at all. Good infrastructure management is good management. Having a properly designed facility is a good start. Well trained, knowledgeable staff is also important. Having and following standards is also important.

That Fisher Plaza facility in Seattle, at the time and as far as I recall, had issues such as:

* Staff not replacing UPS batteries before they expired

* Not properly protecting the "Emergency Power Off" switch (in one power incident a customer pressed it to find out what would happen; after that all customers required "EPO training")

* Poor design led to a fire in the power room years after I moved out, which caused ~40 hours of downtime and months of running on generator trucks parked outside. A couple of years later I saw a news report of a similar fire at a Terremark facility; in that case they had independent power rooms and there was zero impact to customers.

* Don't recall the causes of the other power outages there, if there were any other unique causes.

Another facility I was hosted at in Amsterdam had an insufficient power design as well, and poor network policies:

* The network team felt it was perfectly OK to do maintenance on the network, including at one point taking half of their network offline WITHOUT TELLING CUSTOMERS. They fixed that policy after I bitched enough. My normal carrier of choice is Internap, which has a 100% uptime SLA and has been excellent over the past 13 years as a network customer. Internap was not an option in Amsterdam at the time so we went with the facility's internet connection, which was wired into the local internet exchange.

* At one point they told customers they had to literally shut off the "A" power feeds to do something, then the following week they had to shut off the "B" power feeds to do the same thing to the other side. I don't recall what it was, but obviously they didn't have the ability to do maintenance without taking power down (so I am guessing no N+1). No real impact from either event on my end, though we did have a few devices with only 1 PSU (with no option on those models for a 2nd), so we lost those; however they had redundant peers so things just failed over. In nearly 20 years of co-location only that facility ever had to take power down for maintenance.

One company I was at moved into a building (this was 18 years ago) that was previously occupied by Microsoft. We were all super impressed to see the "UPS room"; it wasn't a traditional UPS design from what I recall, just tons of batteries wired up in a safe way I imagine. They had a couple dozen racks on site. It wasn't till later that the company realized most/all of the batteries were dead, so when they had a power outage it all failed. None of that stuff was my responsibility, all of my gear was at the co-location.

My first data center was in 2003, an AT&T facility. I do remember one power outage there, my first one. I was walking out of the facility and was in the lobby when the lights went out. I remember the on-site staff rushing from their offices to the data center floor, and they stopped to assure me the data center floor was not affected (it wasn't). Power came back on a few minutes later; I don't recall if it was an issue local to the building or a wider outage.

My first server room was in 2000. I built it out with tons of UPS capacity and tons of cooling. I was quite proud of the setup, about a dozen racks. Everything worked great until one Sunday morning I got a bunch of alerts from my UPSs saying power was out. Everything still worked fine, but about 30 seconds later I realized that while I had ~45 min of UPS capacity I had no cooling, so I rushed to the office to do graceful shutdowns of things. Fortunately things never got too hot, I was able to be on site about 10 mins after the power went out. There was nothing really mission critical there; it was a software development company and the majority of the gear was dev systems, though the local email server (we had 1 email server per office) and a few other things were there as well.

There are certainly other ways to have outages. I have been on the front lines of 3 primary storage array failures in the last 19 years, arrays which had no immediate backup, so all of the systems connected to them were down for hours to days for recovery. And I have been in countless application-related outages as well, the worst of which dates back 18 years: an unstable app stack down for 24+ hours and the developers not knowing what to do to fix it. At one point there we had Oracle fly on site to debug database performance issues too. I've caused my own share of outages over the years, though I probably have a 500:1 ratio of outages I've fixed or helped fix vs outages I've caused.

My original post, in case it wasn't clear, was specific to facility availability and to a lesser extent network uplink availability.

Nate Amsden Silver badge

IaaS loves to blame the customer

That's something that surprised me a lot back when I first started using cloud 12 years ago (haven't used IaaS in a decade now). Some of their SLAs (perhaps most) are worded in ways that say, oh well, if this data center is down it's not really down for you unless you can't fire up resources in another data center; if you don't have your data in multiple data centers, well, that's your fault and we don't owe you anything.

Which to some degree makes sense, given customers using cloud often don't know how to do it "right" (because doing it right will just make it more expensive in many cases, and certainly more complex). Most traditional providers (whether datacenter, network or infrastructure) will of course advise you similarly, but they will often take much greater responsibility when something bad happens even if the customer didn't have better redundancy.

Myself, I haven't been hosted in a data center (for work) that had a full facility failure since 2007. That's 15 years of co-location with zero facility outages, so forgive me if I'm not going to get super stressed over not having a DR site. That data center in 2007 (Fisher Plaza in Seattle, and I moved the company out within a year of starting my new position there) remains the only facility I've been with, going back to 2003, that had serious issues.

Of course not all facilities are the same. The facility I use for my personal co-location HAS had several power outages in the past decade (it went a good 5-6 years after I became a customer before the first one). But they are cheap, and otherwise provide decent service. I can live with those minor issues (probably still better uptime than Office365 over the years, even with my SINGLE server, not that I'm tracking). I need only walk into that facility to immediately rule it out for anything resembling mission critical, or for anything that isn't run fully active-active across multiple facilities. They don't even have redundant power (the facility dates to the 90s).

I've said before I would probably guesstimate that I'd rule out 60-75% of data centers in the world for mission critical stuff(Bing tells me there are ~2500 global data centers). All of the big cloud providers design their systems so their facilities can fail, it's part of their cost model, so naturally I am repelled by that.

VMware loses three top execs who owned growth products

Nate Amsden Silver badge

Re: Troubled phrasings ?

I'd expect most customers are on maintenance contracts so the new versions would be provided to them free as part of maintenance. At least that's how it works with ESXi and vCenter.

Cloud customers are wasting money by overprovisioning resources

Nate Amsden Silver badge

Re: I have wondered about de-dupe

I don't believe most IaaS clouds do dedupe for storage, at least not the big ones. The enterprise clouds I'm sure do. I'd expect customers not to see any line items on their bills related to dedupe; the providers would just factor their typical dedupe ratios into the cost to the customers.

But forget dedupe, I'd expect most cloud providers to not even do basic thin provisioning and reclamation (except enterprise clouds again, for the same reasons). Thin provisioning AFAIK was mainly pioneered by 3PAR back around the 2003ish time frame; I started using them in 2006, and thin reclaim didn't appear until about late 2010 I think (and it took longer to get that working right). Then discard at the OS/hypervisor level took time to implement as well (3PAR's original reclaim was "zero detection", so prior to discard being available I spent a lot of time with /dev/zero writing zeros to reclaim space, also sdelete on Windows).
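For anyone who never had to do it, the zero-writing trick looked roughly like this. A sketch only: the mount point is hypothetical, filling a filesystem to 100% on a live system is obviously something to do with care, and on a modern stack you'd just run fstrim or mount with discard instead:

    # Sketch: fill free space with zeros so a zero-detecting thin-provisioned
    # array (like the older 3PARs mentioned above) can reclaim it; roughly the
    # Linux equivalent of sdelete -z on Windows. Hypothetical mount point.
    import os

    path = "/mnt/data/zerofill.tmp"
    block = b"\0" * (4 * 1024 * 1024)  # write 4MiB chunks of zeros
    try:
        with open(path, "wb") as f:
            while True:
                f.write(block)  # stops with ENOSPC once the filesystem is full
    except OSError:
        pass
    finally:
        if os.path.exists(path):
            os.remove(path)  # delete the file; the array's zero-detect reclaims the blocks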

For my org's gear we didn't get end-to-end discard on all of our Linux VMs (through to the backend storage) until moving to Ubuntu 20 (along with other hypervisor VM changes) in late 2020. I had discard working fine for a while prior on some VMs that used raw device mappings. I know the technology was ready far before late 2020, but to make the changes to the VMs it was better to wait for a major OS refresh (16.04 -> 20.04 in our case) rather than shoehorn the changes in. Wasn't urgent in any case.

I remember NetApp pushing dedupe hard for VMware stuff back in the 2008-2010 time frame; I never really bought into the concept for my workloads. I'm sure it makes a lot of sense for things like VDI though. When I did eventually get dedupe on 3PAR in 2014 (16k fixed block dedupe, I don't know what NetApp's dedupe block size was/is) I confirmed my original suspicions: the dedupe ratio wasn't that great, since there wasn't much truly duplicate data (which would have been OS data, and a typical Linux OS was just a few gigs). I expected better dedupe on VMware boot volumes (boot from SAN); initially the ratio was great (don't recall what exactly), but my current set of boot LUNs were created in 2019 and the current dedupe ratio is 1.1:1, which is basically no savings, so next time around I won't enable dedupe on them. (ESXi 6.5 here still, I read that ESXi 7 is much worse for boot disk requirements.) The average VMware boot volume is 4.3GB of written data on a 10G volume.

Equinix to cut costs by cranking up the heat in its datacenters

Nate Amsden Silver badge

Re: We make a rod for our own backs...

Google did something more creative than that; at one point, looks like back in 2009, they released info showing that they were building servers with batteries built in (instead of large centralized UPSs), the justification being that most power events only last a few seconds, so they could cut cost/complexity with that design.

Don't know how long that lasted or maybe they are still doing that today. Never recall it being mentioned since.

Nate Amsden Silver badge

Re: We make a rod for our own backs...

oh yeah, that's right, sorry was a long time ago!

Nate Amsden Silver badge

Re: This is not how data centers work

Be sure to deploy your own environmental sensors. Most good PDUs have connections for them. I have at least 4 sensors (2 front/2 back) on each rack (2 PDUs * 2 sensors each). They monitor temperature and humidity.
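If the PDUs expose those sensors over SNMP (most do), a trivial poller with thresholds goes a long way. A sketch only: the hostname and OID below are made-up placeholders (the real OID is vendor-specific and comes from your PDU's MIB), and it assumes the net-snmp tools are installed:

    # Sketch: poll a PDU temperature sensor via SNMP and complain when it is out
    # of range. Hostname and OID are placeholders; check your PDU vendor's MIB.
    import subprocess

    PDU_HOST = "pdu-rack1-a.example.com"      # hypothetical hostname
    TEMP_OID = "1.3.6.1.4.1.99999.1.2.3"      # placeholder OID
    LOW_C, HIGH_C = 18.0, 27.0                # pick thresholds that suit your racks

    out = subprocess.run(
        ["snmpget", "-v2c", "-c", "public", "-Oqv", PDU_HOST, TEMP_OID],
        capture_output=True, text=True, check=True,
    )
    temp_c = float(out.stdout.strip())
    if LOW_C <= temp_c <= HIGH_C:
        print(f"OK: {temp_c:.1f}C")
    else:
        print(f"ALERT: {PDU_HOST} inlet temp {temp_c:.1f}C outside {LOW_C}-{HIGH_C}C")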

I remember the first time my alarms for them tripped, I opened a ticket with the DC asking if anything had changed. It wasn't a big problem (the humidity had either dropped below or exceeded a threshold, I forget which), I was just curious about the dramatic change in readings. They responded that they had just activated their "outside air" cooling system or something, which was the cause of the change in humidity.

Had major thermal issues with a Telecity facility in Amsterdam, no equipment failures, just running way too hot in the cold aisle. Didn't have alerting set up for a long time; when I happened to notice the readings it started a several-months-long process to try to resolve the situation. It never got resolved to my satisfaction before we moved out.

I remember another facility in the UK, at another company, that was suffering equipment failures (well, at least one device, their IDS failed more than once). The facility insisted the temperature was good, then we showed them the readings from our sensors and they changed their tune. They manually measured, confirmed the temps were bad and fixed them. I was never on site at that facility so I'm not sure what they did, perhaps just opened more floor panels or something.

But just two facilities with temperature issues over the past 19 years of using co-location.

Power and cooling: two things most people take for granted when it comes to data centers (myself included), until you've had a bad experience or two with either. Then you stop taking them for granted.