Oracle execs are all smiles following a stellar showing for their cloud operations in the latest full quarter, and Larry Ellison is obviously feeling a little dizzy, telling the world – or anyone who would listen – that Big Red's cloud never fails. Founder and CTO of the Texas-based business, Ellison bottles the Koolaid he …
That's something that surprised me a lot back when I first started using cloud, about 12 years ago now (I haven't used IaaS in a decade). Some of their SLAs (perhaps most) are worded to say: well, if this data center is down, it's not really down for you unless you can't fire up resources in another data center. If you don't have your data in multiple data centers, that's your fault and we don't owe you anything.
Which to some degree makes sense, given how many customers use cloud without knowing how to do it "right" (because doing it right makes things more expensive in many cases, and certainly more complex). Most traditional providers (whether data center, network, or infrastructure) will of course advise you similarly, but they will often take much greater responsibility when something bad happens, even if the customer didn't have proper redundancy.
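As a rough illustration of the math behind that "run in multiple data centers" pitch, here's a back-of-the-envelope sketch. The availability figures are made up for the example, not anyone's actual SLA numbers:

# Rough back-of-the-envelope sketch (illustrative numbers only): how running
# in two independent data centers changes the availability math that cloud
# SLAs lean on. The 99.7% figure is a made-up example, not any provider's SLA.
single_dc_availability = 0.997          # hypothetical availability of one facility
hours_per_year = 24 * 365

downtime_single = (1 - single_dc_availability) * hours_per_year
# If failures are independent, both sites must be down at once for a full outage.
dual_dc_availability = 1 - (1 - single_dc_availability) ** 2
downtime_dual = (1 - dual_dc_availability) * hours_per_year

print(f"one site:  {single_dc_availability:.3%} ~ {downtime_single:.1f} h/year down")
print(f"two sites: {dual_dc_availability:.5%} ~ {downtime_dual:.2f} h/year down")

Of course "independent" is doing a lot of work in that assumption, and running everything in two places is exactly the extra cost and complexity mentioned above.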
Myself, I haven't been hosted in a data center (for work) that had a full facility failure since 2007. That's 15 years of co-location with zero facility outages. So forgive me if I don't get super stressed over not having a DR site. That 2007 facility (Fisher Plaza in Seattle; I moved the company out within a year of starting my new position there) remains the only one I've been with that had serious issues, going back to 2003.
Of course, not all facilities are the same. The facility I use for my personal co-location HAS had several power outages in the past decade (it went a good 5-6 years without one after I became a customer). But they are cheap and otherwise provide decent service, so I can live with those minor issues (probably still better uptime than Office 365 over the years, even with my SINGLE server, not that I'm tracking). I only need to walk into that facility to immediately rule it out for anything resembling mission critical, or anything that isn't fully active-active across multiple facilities. They don't even have redundant power (the facility dates to the '90s).
I've said before that I'd guesstimate I would rule out 60-75% of the data centers in the world for mission-critical stuff (Bing tells me there are ~2,500 data centers globally). All of the big cloud providers design their systems so their facilities can fail; it's part of their cost model, and naturally that repels me.
I don't agree there at all. Good infrastructure management is good management. Having a properly designed facility is a good start. Well-trained, knowledgeable staff are also important, as is having and following standards.
That Fisher Plaza facility in Seattle, as far as I recall, had issues at the time such as:
* Staff not replacing UPS batteries before they expired
* Not properly protecting the "Emergency Power Off" switch (during one power incident, a customer pressed it to find out what would happen; after that, all customers were required to take "EPO training")
* Poor design led to a fire in the power room years after I moved out, which caused ~40 hours of downtime and months of running on generator trucks parked outside. A couple of years later I saw a news report of a similar fire at a Terremark facility; in that case they had independent power rooms and there was zero impact to customers.
* I don't recall the causes of the other power outages there, if there were any other unique ones.
Another facility I was hosted in, in Amsterdam, had an insufficient power design as well, plus poor network policies:
* The network team felt it was perfectly OK to do maintenance on the network, at one point taking half of their network offline WITHOUT TELLING CUSTOMERS. They fixed that policy after I bitched about it enough. My carrier of choice is Internap, which has a 100% uptime SLA and has been excellent over the past 13 years as a network customer. Internap was not an option in Amsterdam at the time, so we went with the facility's internet connection, which was wired into the local internet exchange.
* At one point they told customers they had to literally shut off the "A" power feeds to do something, and the following week they had to shut off the "B" power feeds to do the same thing to the other side. I don't recall what the work was, but obviously they couldn't do maintenance without taking power down (so I'm guessing no N+1). Neither event had real impact on my end, though we did have a few devices with only one PSU (and no option for a second on those models), so we lost those; they had redundant peers, so things just failed over. In nearly 20 years of co-location, only that facility ever had to take power down for maintenance.
One company I was at moved into a building (this was 18 years ago) that was previously occupied by Microsoft. We were all super impressed to see the "UPS room"; it wasn't a traditional UPS design from what I recall, just tons of batteries wired up in (I imagine) a safe way. They had a couple dozen racks on site. It wasn't until later that the company realized most or all of the batteries were dead, so when they had a power outage it all failed. None of that stuff was my responsibility; all of my gear was at the co-location facility.
My first data center was in 2003, an AT&T facility. I do remember one power outage there, my first: I was walking out of the facility and was in the lobby when the lights went out. The on-site staff rushed from their offices to the data center floor, stopping to assure me the floor was not affected (and it wasn't). Power came back a few minutes later; I don't recall whether it was local to the building or a wider outage.
My first server room was in 2000. I built it out with tons of UPS capacity and tons of cooling, and I was quite proud of the setup: about a dozen racks. Everything worked great until one Sunday morning I got a bunch of alerts from my UPSs saying power was out. Everything still worked fine, but about 30 seconds later I realized that while I had ~45 minutes of UPS capacity, I had no cooling, so I rushed to the office to do graceful shutdowns. Fortunately things never got too hot; I was on site about 10 minutes after the power went out. There was nothing really mission critical there. It was a software development company, and the majority of the gear was dev systems; the local email server (we had one email server per office) and a few other things were there as well.
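For what it's worth, here's a crude worst-case sketch of why losing cooling bites so much faster than the UPS runtime number suggests. The room size and load below are hypothetical, and a real room heats more slowly because racks and walls absorb heat, but the trend is the point:

# Crude worst-case sketch of why losing cooling is more urgent than losing UPS
# runtime: how fast room air heats up if all IT load dumps into it. Numbers
# (room size, load) are hypothetical; real rooms heat more slowly because racks,
# walls and raised floors soak up heat, but the trend is the point.
room_volume_m3 = 100.0      # hypothetical small server room
it_load_watts = 20_000.0    # hypothetical load for about a dozen racks of dev gear
air_density = 1.2           # kg/m^3
air_heat_capacity = 1005.0  # J/(kg*K)

air_mass = room_volume_m3 * air_density
temp_rise_per_min = it_load_watts * 60 / (air_mass * air_heat_capacity)
print(f"~{temp_rise_per_min:.1f} degC/min rise with zero heat absorbed elsewhere")
# Even if the real number is a fraction of this, a 45-minute UPS window
# is not the binding constraint once the cooling is down.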
There are certainly other ways to have outages. I have been on the front lines of three primary storage array failures in the last 19 years, arrays with no immediate backup, so all of the systems connected to them were down for hours to days for recovery. I have been in countless application-related outages as well, the worst of which dates back 18 years: an unstable app stack down for 24+ hours with the developers not knowing how to fix it. At one point there we had Oracle fly on site to debug database performance issues too. I've caused my own share of outages over the years, though I probably have a 500:1 ratio of outages I've fixed or helped fix versus outages I've caused.
My original post, in case it wasn't clear, was specific to facility availability and to a lesser extent network uplink availability.
I dug up the power issue from Amsterdam, which by that point (2018) was an Equinix data center, previously Telecity:
"DESCRIPTION: Please be advised that Equinix and our approved contractor will be performing remedial works to migrate several sub-busbar sections back from there temporary source to the replaced main busbar which became defective as reported in incident AM5 - [5-123673165908].
During the migration, one (1) of your cabinet(s) power supplies will be temporary unavailable for approximately six (6) hours. The redundant power supply(s) remains available and UPS backed. "
But this power incident wasn't critical for me since everything was redundant on my end. I'm not a power expert, so I certainly can't say whether a better power design could have allowed this kind of maintenance to be done without taking power circuits down to customers. But I can say I've never had another facility provider need to take power offline for maintenance, for any reason, in almost 20 years. Perhaps this particular activity was impossible to avoid; I don't know.
After Equinix acquired Telecity I noticed the number of customer notifications went way up; Telecity had a history, with me at least, of not informing customers of things. I hated that facility, its staff AND its policies. I only visited it twice, before Equinix took over, and according to my emails it looks like we moved out less than three months after the above power issue (the move was unrelated).
One more story, since I like talking/writing about this kind of stuff. This one is from a former co-worker who used to work as a tech at some data center, a small one, no big names. His story (from some time before 2010) was that they had a generator on site, inside a building or some structure to protect it from the elements. They ran load tests on it semi-regularly, but the load tests only lasted a few minutes.
One day a power outage hit. The generator kicked on as expected, then shut down after something like 15-30 minutes because it overheated (I think he said the overheating was related to the enclosure the generator was in). So in that situation there was bad design and bad policy, either of which should have caught the issue long before it impacted customers.
Another case of bad design, IMO, is any facility using flywheel UPS. My view is that I want technical on-site staff 24/7 at any facility, staff able to respond to things. A flywheel UPS only gives enough runtime for a few seconds, maybe a minute, for the generators to kick on. That is not enough time for a human to respond to a fault (such as the switch that starts the generators failing; this happened at a San Francisco facility that used flywheels back in 2005 or so). I was touring a new (at the time) data center south of Seattle in 2011, a very nice facility; Internap was leasing space there and I was talking with them about using it. I mentioned my thoughts on flywheels and the person giving the tour felt the same; he said it was Microsoft, I think, that had a facility nearby at the time that used flywheels, and he claimed they had a bunch of problems with them.
Not that UPSs are flawless by any stretch; I just want to see at least 10 minutes of power available between failure and generator start. How that power is provided matters less, as long as it works, but the flywheels I'm aware of don't last long enough. Certainly there will be failures that can't be fixed in 10 minutes, but I'm confident there are scenarios where they can be (the biggest being the automatic transfer switch failing to switch automatically and needing someone to switch it manually).
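To put rough numbers on that, here's a toy timeline check; all of the durations are illustrative assumptions, not vendor specs:

# Toy timeline check (all durations are illustrative assumptions, not vendor
# specs): does the UPS ride-through cover a failed automatic transfer plus a
# human manually switching over to the generator?
ride_through_s = {"flywheel": 45, "battery_bank": 10 * 60}  # assumed windows

generator_auto_start_s = 15        # normal case: ATS works, generator starts
manual_intervention_s = 5 * 60     # failure case: someone has to walk over and switch it

for ups, window in ride_through_s.items():
    covers_auto = window >= generator_auto_start_s
    covers_manual = window >= generator_auto_start_s + manual_intervention_s
    print(f"{ups:12s}: normal generator start covered={covers_auto}, "
          f"failed-ATS + manual response covered={covers_manual}")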
A recent story from last month:
A customer gets a new cluster and a new switch. The UPSes already in place are supposed to have 20% to 30% headroom depending on the system load. I push for testing, and we do. Test with UPS 1: it shows "12" minutes. We unplug it from the 16A 240V feed, and an alarm goes off: "too much load for battery". Huh, what? The 12 minutes count down a bit faster than expected, and after 72 seconds it shuts off. Of course there is a second UPS, which was still plugged in, so there was no downtime anywhere. We turned UPS 1 back on, waited until the devices (all dual power supply) showed no alarms, waited an extra five minutes, and tested UPS 2. Same alarm, but we did not wait for it to shut itself off.
Result: the UPSes would be strong enough for the load, but need a battery extension pack. Actually two per UPS, as I recommended, and I hope they buy two for each.
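One plausible explanation for a displayed "12 minutes" collapsing to 72 seconds is that lead-acid capacity drops sharply at high discharge rates (Peukert's law), so an overloaded battery string dies far faster than a linear estimate suggests. A rough sketch with hypothetical battery figures, not the actual UPS in that story:

# Hedged sketch of why a UPS "minutes remaining" estimate collapses when the
# load exceeds what the battery string was sized for: lead-acid capacity drops
# at high discharge rates (Peukert's law). All figures below are hypothetical.
def peukert_runtime_hours(rated_capacity_ah, rated_hours, load_amps, k=1.25):
    """Runtime at a given discharge current for a battery rated at
    rated_capacity_ah over rated_hours, with Peukert exponent k."""
    return rated_hours * (rated_capacity_ah / (load_amps * rated_hours)) ** k

rated_capacity_ah = 9.0   # hypothetical small UPS battery, rated at the 20 h rate
rated_hours = 20.0
for load_amps in (0.45, 5.0, 20.0, 40.0):
    t = peukert_runtime_hours(rated_capacity_ah, rated_hours, load_amps)
    print(f"{load_amps:5.1f} A load -> ~{t * 60:7.1f} min runtime")
# Runtime falls much faster than linearly as the load climbs, which is why
# an overloaded UPS can drop out in barely over a minute.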
I love real tests, even if it means downtime.
Check out ORCL's history; for something that never goes down, it's got a lot of entries...
ocistatus.oraclecloud.com/#/history
That being said, Oracle OCI/Bare Metal/BMCS is a decent cloud. It could be said that the support organization is suffering immensely due to the growth that is going on. OCI is growing at a fair clip, and it hosts all of Oracle's Fusion, Oracle CX/HCM/ERP/SCM/EPM, NetSuite, Analytics, Cerner (in process), Autonomous DB, ExaCS, etc.
Oracle support for OCI can be brutal. Basically it's SEV1/24x7 on a war room, SEV2, or never. Customers seem to have to use sales and partners to beg for support via back channels.
The SLA for IaaS is hard to collect on in general, and the SLA legalese for Oracle OCI will make it impossible to extract any useful amount of money out of it.
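To illustrate why, here's roughly what the service-credit math looks like under a typical tiered IaaS SLA. The tiers, spend, and outage length below are hypothetical, not Oracle's actual terms:

# Illustrative sketch (hypothetical tiers, not Oracle's actual SLA terms) of
# why IaaS SLA payouts rarely amount to much: credits are a percentage of the
# monthly bill for the affected service, not compensation for business impact.
def monthly_uptime_pct(outage_minutes, minutes_in_month=30 * 24 * 60):
    return 100.0 * (1 - outage_minutes / minutes_in_month)

def credit_pct(uptime_pct):
    # Hypothetical tier table, loosely modeled on common cloud SLA structures.
    if uptime_pct >= 99.95:
        return 0
    if uptime_pct >= 99.0:
        return 10
    return 25

monthly_bill = 50_000.0        # hypothetical spend on the affected service
outage_minutes = 4 * 60        # a four-hour outage
uptime = monthly_uptime_pct(outage_minutes)
credit = monthly_bill * credit_pct(uptime) / 100
print(f"uptime {uptime:.3f}% -> credit ${credit:,.0f} on a ${monthly_bill:,.0f} bill")

Even a multi-hour outage tends to net a credit that is a rounding error next to the business impact, and you still have to file the claim within whatever window the contract allows.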
So the question is: when do Oracle and Microsoft merge? Maybe not "on paper", since that would draw antitrust attention, but the ultimate Hotel California for the enterprise would be a Microsoft/Oracle hybrid. Oracle and Microsoft seem to want to be co-opetition/frenemies at this point. Not a bad thing, really.
Corey Quinn (Duckbill Group / lastweekinaws) is a cloud industry blowhard; he comments on all the clouds, and he has given Oracle OCI some credit, which suggests OCI isn't complete dog crap. There have been a few articles and mentions of OCI here and there without any sponsorship talk. For how late to the game Oracle was, and how horrific the first generation of Oracle Cloud (OPC) was, along with any number of strange things Oracle was doing before OCI, this newer version is halfway OK.
A new AI database is used by ChatGPT. Ideologically, it is opposed to SQL, since all entries in it are annotated with text, and this is done automatically, while in SQL the same records are annotated manually and almost never with text. Therefore Oracle will soon lose SQL as the main source of its income. Oracle will also lose Java very soon because, for example, DeepMind (Google) already knows how to do without manual code and programmers. As for ads, Oracle would have to compete with the same ChatGPT/Microsoft technology. The same goes for Oracle healthcare: Microsoft and ChatGPT. What's left?
Only AI; that is, all that's left for Oracle is tough competition with Microsoft.