
But but but....
Software as a Service and the Cloud are good things, right? Right? Riiiight?
Yeah, right. Right up until they're not.
The great Atlassian outage is stumbling into a new week, with the biz reporting it has "rebuilt functionality for over 35 percent of the users who are impacted by the service outage," meaning the majority of those afflicted remain unable to access their sites. At this point it is fair to say the problem is severe. It kicked …
Software as a Service and the Cloud are ...
One of the beancounters' ultimate wet dreams.
But it is nothing but that.
What are Atlassian's clients asking their beancounters now?
"Say, how much is this royal fuck-up going to cost us and just how are we going to pay for it?"
When everything goes haywire and there's no solution in sight after a week (cue Atlassian) ...
What options do you think you really have?
Just three, last one optional:
1. grin
2. bear it
3. lay back and think of England and/or Boris Johnson
When you have in-house services you are certainly not exempt from problems.
But if your IT line, from the manager to the last PFY, is populated by technically competent people being paid a decent wage, be sure they will take care of it and things will be up and running as before with minimum downtime.
More expensive?
Sure.
More reliable?
Yes.
By a great many orders of magnitude.
O.
Amen. I'm in IT (now, again, whatever) but I come from an Emergency Management background. I really hope all of these affected customers had a decent continuity plan that had been exercised realistically and not merely as a means of checking off some VC's checklist to get funding. My standards are probably a bit high, but I think I know the answer to that, even by a more reasonable measure, if some of the Twitter threads I've read are any indication.
BCP is like how security used to be back in the day: nobody took it seriously until it started to cost more to not give a shit.
Really, it comes down to too many eggs in one basket. Certainly service failures can occur on premises, but pretty much universally those failures affect only a single organization. Granted, there can be times when multiple companies are experiencing problems, but it's still tiny compared to the blast radius of a SaaS provider having a problem.
My biggest issue with SaaS, at least from a website perspective, is the seemingly constant need the provider feels to change the user interface around, convinced everyone will love the changes. Atlassian has done that tons of times and it has driven me crazy. Others are similar, so convinced all customers will appreciate the changes.
Go change the back end all you want as long as the front end stays consistent please.
At least with on prem you usually get to choose when you take the upgrade, and in some cases you can opt to delay indefinitely (even if it means you lose support).
Just now I checked again to confirm. Every few months I go through and bulk close resolved tickets (in Jira) that have had no activity for 60 days. I used to be able to add a comment to those tickets; I would say "no activity in 60 days, bulk closing". Then one day this option vanished. I asked Atlassian support what happened and they said that functionality was not yet implemented on their new cloud product (despite us having been hosted in their cloud product for years prior). I can only assume it is a different code base to some extent. Anyway, that was probably 3-5 years ago, and we still don't have that functionality today. (There is an option to send an email to those people when the ticket closes. I don't want that, I just want to add a comment to the ticket.)
Don't get me started on the editor changes in Confluence in recent years: just a disaster. Fortunately they have backed off of their plans to eliminate the old editor (for how long I don't know, but it seems like it's about 2 years past when I expected them to try to kill it).
Then there was the time they decided to change the page width on everything in Confluence (I assume to try to make it printable). At least in that case they left a per-user option to disable that functionality (it messed up tons of pages that weren't written for it).
The keyboard shortcut functionality in Confluence drove me insane as well. For years it was not a problem, assuming it was even there before (I don't know; I never used keyboard shortcuts in Confluence going back to my earliest days of using it in 2006). But in the past couple of years I would inadvertently trigger a series of unwanted events on documents just by typing. I was able to undo it every time, and finally disabled the keyboard shortcuts a few months ago.
When writing contract specs, our legal dept. insist on putting in KPIs to keep the vendor honest. The vendor duly agrees to these KPIs when they sign the contract.
I asked our legal people: What would happen if the vendor breached the KPIs? Would we sue them? Terminate the contract? All I got in return was a shrug. The legals are keen to add all this boiler-plate into the contract, but not keen on actually doing anything when asked to.
The vendor knows our legal team aren't keen on taking any action, so they don't deliver on the KPIs anyway.
Cloud isn't cheaper - you're paying for all the bodies to do the hard graft of rebuilding a broken system for you and taking the political flak.
The on-premises alternative means you need knowledgeable staff available (not off sick with Covid/on holiday) plus spares for any server/network tin/data centre pieces/rooms/... and you take the political flak...
Pick your risk profile...
And big enough to have:
* Enough knowledgeable staff to cover for sickness, holiday, COVID, etc
* Enough kit/capacity to cope with system(s) going down
You pick the right tool for the job. I work for a large company and we have a mixture of on-prem and cloud. On-prem when the problem is big enough to tick the boxes above; Cloud when the product/service is too small/niche for us to keep skilled up to manage.
Seen a Reddit comment - Link
got email from the community manager, that some instances can be down for further two weeks.
This is not how a billion dollar company build the system or handles recovery, I am going to look for an alternative and dump Atlassian as soon as possible.
==== snip of the email I got ====
What this means for your company
We were unable to confirm a more firm ETA until now due to the complexity of the rebuild process for your site. While we are beginning to bring some customers back online, we estimate the rebuilding effort to last for up to 2 more weeks.
I know that this is not the news you were hoping for. We apologize for the length and severity of this incident and have taken steps to avoid a recurrence in the future.
"We suspect that the "dedicated team" Atlassian assigned to sorting out the problem has yet to take down the bunting from World Backup Day before the incident occurred. [...] The irony of [Jira Service Management] collapsing into a heap due to an issue with a maintenance script will not have been lost on the affected users."
^^^ This is why I read El Reg. ^^^
I bet there are a lot of customers renegotiating their contracts to start running local copies of Confluence and Jira again, and I bet they're going to get a nice deal on those licenses under threat of taking their PM software needs elsewhere.
Before Atlassian ask, "Where you going to go?": right now, you've got nothing, no service and no app, so anything's gotta be better than the absolutely nothing that Atlassian are offering for their £20,000 a year JIRA cloud license that's worth less than the paper it's printed on.
The Cloud is somebody else's computer. And most likely, the contract that was signed, absolved said cloud operator from any blame should services go TITSUP.
A BOFH's wet dream. Punt a service, have people use it, make lots of easy money, and when things go TITSUP, you just point to the relevant contract clauses.
Lather, rinse, repeat.