Oh, really?
"Azure DevOps has tests to catch such issues"
I call BS.
Microsoft Azure DevOps, a suite of application lifecycle services, stopped working in the South Brazil region for about ten hours on Wednesday due to a basic code error. On Friday Eric Mattingly, principal software engineering manager, offered an apology for the disruption and revealed the cause of the outage: a simple typo …
Test one: Peer review.
Works fine if the reviewer pays attention, rather than skim-reading (if that) and giving things a green light.
Then it needs testing. Even a process or ad-hoc script is first run against a test system to ensure a) it works, b) it works as expected and c) you've a rollback that also works where possible. Then take a backup before putting anything live. That's SOP in most places (a rough sketch of that flow follows at the end of this comment).
That's the most basic form of tests. However, it doesn't matter what tests you have if people skip them thinking 'what could possibly go wrong'. Now they know.
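For the avoidance of doubt about what that SOP looks like in practice, here's a rough Python sketch of the flow - the deploy.sh / smoke_test.sh / backup.sh / rollback.sh script names are placeholders for whatever your shop actually uses, not anyone's real tooling:

# Rough sketch of the test-first, backup, deploy, roll-back-on-failure flow.
# All the script names below are placeholders, not real tooling.
import subprocess, sys

def run(cmd: list[str]) -> bool:
    """Run a command and report success/failure instead of raising."""
    return subprocess.run(cmd).returncode == 0

def release(change: str) -> int:
    # a) does it work at all? b) does it do what we expect? -- on the TEST system
    if not run(["./deploy.sh", "--env", "test", change]):
        print("failed on the test system; stopping here")
        return 1
    if not run(["./smoke_test.sh", "--env", "test"]):
        print("deployed to test but behaved wrongly; stopping here")
        return 1
    # backup before touching live
    if not run(["./backup.sh", "--env", "prod"]):
        print("no backup, no deployment")
        return 1
    # c) the rollback has to work too, so bail out to it on any failure
    if not run(["./deploy.sh", "--env", "prod", change]) or \
       not run(["./smoke_test.sh", "--env", "prod"]):
        print("production deployment failed; rolling back")
        run(["./rollback.sh", "--env", "prod"])
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(release(sys.argv[1]))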
I see this as one of the biggest problems with Agile and, to some extent, DevOps. Because they deploy so regularly, they rely on automated testing that is often good at trapping issues that have been seen before, but not so good at discovering new problems.
It's only when you roll out to the production environment that you get to see some of these new problems, and often by then it's too late to prevent the problem happening, and you're into recovery mode.
Another thing that I've noticed is that quite often the provided backout plan does not cover major unexpected issues, merely trying to undo the change that was made, which again means that you're back in panic mode when things go wrong.
Looks like Azure is getting more and more complex, combined with a desire to change it frequently with agile "sprints".
Frequent change combined with increasing complexity is only ever going to go one way: more disasters and decreasing reliability. And it sounds like their response to this problem is to keep applying sticking plasters. A more fundamental rethink of approach is going to be necessary at some stage to prevent it falling apart.
“This resulted in a large pull request of mechanical changes swapping out API calls.”
Sounds complicated. Why does swapping out API calls cause mechanical changes?
That's pretty much it. And Microsoft has a habit of reimplementing APIs like that, either because their first try had issues, or because there is some new latest-and-greatest way to do things. This leads to huge changes like this one, where nothing really changes in behaviour but pretty much all the code is modified.
It's interesting to note that they're doing this now. When I tried to make the same library upgrade on some tooling I'd made, I had to largely abandon it because the changes made to the auth mechanism were so severe that I couldn't figure out whether it was even possible to use the new API. I ended up replacing a fully automated system with a half-manual one. I wonder whether this means that it's now time to try to persuade my boss to allocate some time to revisit the issue and myself that my sanity could survive the attempt.
As a colleague once observed "we can't be sprinting all the time". To blank stares from the scrum people.
My observation is that a scrum is a situation in which everyone pushes in opposite directions, goes nowhere and then collapses in a heap, possibly resulting in severe injuries.
I've always assumed that the S word comes from Rugby. If you know otherwise, reply below.
-A.
> This religious devotion to the Agile process is becoming a cult.
Ah. I hadn't realised that there was an actual process.
Our version of your wall-head interface is the insistence that the number of "story points" for a "ticket" can only be a Fibonacci number. Our "sprints" are nominally two weeks (10 working days) long, so if a story point is one day (which, mysteriously, it both is and isn't) then a "ticket" can only take a maximum of 8 days, systematically rounded down to 5.
What can I do in 5 days?
Well, plenty, of course. I could fix some outstanding bugs. Develop a new feature that has suddenly become urgent. Move a button three pixels to the left. Oddly enough, it is literally only the last that this "process" addresses.
What should I really be doing?
Rewriting a VB6 app in a "modern" language which permits deployment online (or at least to Macs). How many story points? Hmm, at least a hundred. Doesn't fit in a "sprint". Doesn't get done.
Agile my arse.
-A.
"customers can't revive Azure SQL Servers themselves"
--and--
"Even after databases began coming back online, the entire scale unit remained inaccessible even to customers whose data was in those databases due to a complex set of issues with our web servers."
Hands up those who remember the days of the Service Bureau, and later timesharing.
And why we don't do that anymore.
I won't ask WTF web servers have to do with databases ... I don't want to know. Sounds unnecessarily painful.
> The most scary thing about the cloud and DevOps is that most files are, shall we say, rather terse: just lots of magic key/value entries that don't actually mean anything in a code review.
What's new is old again, and they used to complain about Bash scripts.
Bash scripts can be commented - can these key/value pair files?
I've seen far too many data files being "parsed" by some uber-trivial sscanf()-like calls instead of at least using a simple lexer that can allow comments and whitespace handling so the files can be laid out in a readable fashion.
And for anyone who leaps in and says they are all using JSON files now, so there is a proper parser, comments and all: that means they're probably using something that is vast overkill! So are they correctly limiting the complexity of the data structure allowed, 'cos now the parser will happily let you define a list but your code only eats a single value: bet it uses the first item and doesn't raise an error that it is ignoring the rest!
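To make that concrete, a minimal Python sketch (the "port" key is invented for illustration) of the gap between "the parser accepted it" and "the config is actually valid":

# Minimal sketch: the JSON parser will happily hand you a list where your
# code expects a single value; whether that gets caught is entirely up to
# the validation you bolt on afterwards. The "port" key is invented.
import json

def load_config(text: str) -> dict:
    raw = json.loads(text)          # syntactically fine, structurally unknown
    cfg = {}
    port = raw.get("port")
    if not isinstance(port, int):
        raise ValueError(f"'port' must be a single integer, got {port!r}")
    cfg["port"] = port
    return cfg

good = '{"port": 8080}'
bad  = '{"port": [8080, 9090]}'     # a list where a scalar is expected

print(load_config(good))            # {'port': 8080}
print(load_config(bad))             # raises instead of silently taking 8080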
I retired in 2014 and now all my programming is for personal use and exclusively in Excel VBA. I had a data source which supplied a simple json string of key-value pairs, dead easy to read and to parse by humans and computers.
Now that data source is supplied as a complex JSON string, with many nested layers and full of arrays. Not only that, but if a value is not available then the key is still supplied. Trying to loop through this mess whilst testing for null or empty values properly has occupied my brain for over a week now. The only way I can read it is to dump the string into an online JSON parser. Easy to read by humans and computers? No chance! Complexity for the sake of it.
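For what it's worth, the shape of the loop is much the same in any language; here is a rough Python sketch - the sample document is invented, since I don't know the actual feed - of walking arbitrarily nested JSON while skipping null and empty values:

# Rough sketch: flatten an arbitrarily nested JSON blob into dotted
# key/value pairs, dropping nulls and empty strings along the way.
# The sample document is invented, not the poster's actual data feed.
import json

def flatten(node, prefix=""):
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{prefix}{key}.")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from flatten(value, f"{prefix}{i}.")
    elif node not in (None, ""):          # skip nulls and empty strings
        yield prefix.rstrip("."), node

doc = json.loads('{"quote": {"price": 1.23, "volume": null, '
                 '"ticks": [{"bid": 1.22}, {"bid": ""}]}}')
for key, value in flatten(doc):
    print(key, "=", value)
# quote.price = 1.23
# quote.ticks.0.bid = 1.22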
YAML is a terrible format. Using whitespace to carry the structure is a terrible idea; it's just too easy to misalign text. Try and review text columns on GitHub - the simple answer is you can't.
Even GitHub doesn't support or report any schema-related errors in the YAML files for its own GitHub Actions, and that's a simple example of how a massive org can get something so basic wrong. Put a key at the wrong level, and it's silently ignored.
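A quick illustration in Python (it assumes PyYAML is installed, and the workflow snippet is made up rather than a real Actions file) of how a mis-indented key sails straight through the parser:

# Minimal sketch (assumes PyYAML is installed): a key indented at the wrong
# level still parses cleanly -- it just ends up attached to the wrong parent.
import yaml

doc = """
jobs:
  build:
    runs-on: ubuntu-latest
  timeout-minutes: 5    # meant for 'build', but sits one level too high
"""

parsed = yaml.safe_load(doc)
print(parsed)
# {'jobs': {'build': {'runs-on': 'ubuntu-latest'}, 'timeout-minutes': 5}}
# No error, no warning -- the misplaced key is only caught if something
# downstream validates against a schema.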
Oh for pity's sake - you mean I've been fooled into thinking JSON is actually *better* than it officially is because of sheer luck in which parser library was chosen by the project?
Thanks for putting me straight on that.
What a totally moronic decision - taking a pure text format derived from JavaScript and deliberately stripping out comments!
> It was only later that it got used for configuration files, where the lack of comments is a nightmare
Dunno about anyone else[1], but when I've chosen the format for files that are used "only" to talk between an application and a server, as often as possible[2] they are text-based *and* damn well allow for comments!
There is this little thing we like to call "testing" where we keep lots and lots of files in the repo with subtle differences between them, all commented up to the nines to point out those differences and the expected results[3]. Which are fired off/caught back using curl.
For that matter, on the last paid gig where I created such data (used to send a few values between a set of peer servers, "no human intervention required"), the *generated* messages contain comments (telling the reader what server subsystem created the file and guiding them to the docs in the repo - 'cos I'd wasted so much time trying to figure out similar info from other traffic in that project!); the "waste" in doing so was well within the available bandwidth for the task, and the payback in live tests made *my* life easier[4], so win.
[1] in particular, it seems, the people who came up with JSON and *all* of the early users who should have been screaming at them!
[2] i.e. if the time and processing resources allow - e.g. between two 8 bit MCUs with 2K RAM I'll probably forego the comments (and shorten the keywords).
[3] what, put those comments into another file? It is hard enough to get people to change comments when they change the primary file, trusting a secondary will be up to date is lunacy!
[4] my bit is working, so it *is* your code going mad; toodles.
Please take the following as aimed at Crockford et al not your good self:
<rant>
JavaScript can just eval JSON; any other language is going to *need* a separate parser (and, of course, in real life, JavaScript *ought* to be using a separate parser too! Eval'ing text from an unknown source...).
If anyone is writing a parser for a text format, using *any* programming language (even VB6 or, gasp, JavaScript), and cannot figure out how to allow for comments then - well, let us just say they should not be doing so professionally or for anything that is expected to be released for more than one other person[1] in the world to use.
Even if the claim was that JSON was meant to be viable on a low-resource MCU in an IoT device (bleeugh), JSON is complex enough already that the extra states needed to handle comments are trivial (see the sketch after the footnote below).
</rant>
[1] their lab partner in the exercise after the second lecture on basic compiler techniques.
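To put a number on "trivial": here is a rough Python sketch of a comment-stripping pre-pass - not a full parser - whose only extra state is "am I inside a string literal?":

# Minimal sketch, not a full JSON parser: strip // and # comments from a
# JSON-ish document before handing it to the real parser. The only extra
# state needed is "am I inside a string?" -- which is the point: supporting
# comments costs almost nothing.
import json

def strip_comments(text: str) -> str:
    out, in_string, escaped, i = [], False, False, 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif ch == "#" or text[i:i + 2] == "//":
            while i < len(text) and text[i] != "\n":
                i += 1          # drop everything up to the end of the line
        else:
            out.append(ch)
            i += 1
    return "".join(out)

doc = '{ "retries": 3, // how many times to retry\n  "url": "http://x/#frag" }'
print(json.loads(strip_comments(doc)))   # {'retries': 3, 'url': 'http://x/#frag'}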
What's new is old again, and they used to complain about Bash scripts
And config.sys..
(Yes - I remember hand-optimising them so that you could load all the required network drivers and bindings to actually work - and that was just NetBEUI! Then some bright spark wanted to add TCP/IP to the stack. Ah, the joys of working out which of the various bits were happy with loadhigh and the ones that would either crash the PC immediately or, most fun of all, work for a while *then* lock it up..)
My fun one was a super-fast (at the time) server to be used as a desktop PC by a VIP. Ethernet card, Novell IPX/SPX, and TCP/IP networking. If you executed the CONFIG.SYS commands by hand, it worked. If you stepped through CONFIG.SYS, it worked. If you let it run automatically, it did not work. It was a timing issue. My workaround was to replace protocol-handler-name.SYS in CONFIG.SYS with this series of commands:
protocol-handler-name.SYS
protocol-handler-name.SYS /U (unloads the handler)
protocol-handler-name.SYS
The article claims there was a typo, but it doesn't actually detail the line of code with the error. The post mortem also doesn't say what caused the problem. A lot of obviously prepared statements that say a lot of words but don't actually mean anything.
Sounds a lot like most *.yaml files... lots of values, but most people aren't sure what half of them mean.
In the same way that a missing jar in your pantry leaves a space on the shelf, there’s a large set of spaces in their story.
First, when they talk about testing, it sounds more like they’re referring to code coverage or build-time coverage as opposed to an actual test suite; otherwise they would have tested the snapshot code with an actual snapshot and seen the server get removed.
Secondly, they talk about this being some edge case while at the same time the support team uses this mechanism daily, which leads me to the exact same place: they tested the code for “single digit typos” and spelling, but since they ran no actual unit testing they didn’t notice that the word server and service (or something to that extent) had been swapped. I’m not even sure that counts as a typo, and why someone would design an API where the call for the hosting service and the call for the unit service are one or two keystrokes apart blows my mind.
Third, the thing about the exponential back-off tells me that they didn’t know how to modify their own code in an emergency to either remove that dependency or restart the service. It sounds an awful lot like they had to block everyone at the firewall level, which makes me think again that they didn’t have a team capable of performing unit testing, because that’s a fairly complex operation in an operation like Azure. (A generic sketch of the kind of back-off-with-an-off-switch I have in mind follows after this comment.)
Just spitballin’
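For what it's worth, none of the following is from the post-mortem. It's just a generic Python sketch of exponential back-off with a cap, jitter, and an operator-controlled kill switch - the sort of thing you'd want to be able to reach for in an emergency rather than blocking everyone at the firewall. The environment variable name and the example callable are made up:

# Generic sketch (not from the post-mortem): exponential backoff with a cap,
# full jitter, and an external kill switch so retries can be turned off in an
# emergency without blocking clients at the firewall.
import os, random, time

def call_with_backoff(operation, max_attempts=8, base=0.5, cap=60.0):
    for attempt in range(max_attempts):
        if os.environ.get("RETRIES_DISABLED") == "1":   # the emergency off switch
            raise RuntimeError("retries disabled by operator")
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, delay))        # full jitter

# usage (fetch_database_state is a made-up callable that raises on failure):
# result = call_with_backoff(lambda: fetch_database_state())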
That bit where they said "ring 0" is a reference to the test group. Where the patch is tested on a small group of servers (ring 0) before releasing it to a larger group (ring 1), before pushing it to general scheduled release (ring 2). And ring 0 was "internal", and contained more than one server, but ... did not contain any servers which had the use-case covered by this bug. Specifically, did not contain any servers with "snapshot databases".
This is the standard test-and-patch service used by Microsoft internally, also released for enterprise customers some time ago.
I would hope that finding that their patch-test environment of internal servers is not representative of their DevOps customer base should encourage some thought and reflection.
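The obvious-with-hindsight check is to confirm, before a change is promoted at all, that ring 0 actually exercises every feature the change touches. A rough sketch with invented ring, server and feature names:

# Rough sketch with invented data: before promoting a change, confirm that
# the ring 0 fleet covers every feature the change touches.
RINGS = {
    0: [{"name": "int-01", "features": {"plain-db"}},
        {"name": "int-02", "features": {"plain-db", "geo-replication"}}],
    1: [{"name": "eu-01", "features": {"plain-db", "snapshot-db"}}],
}

def uncovered_features(ring: int, change_features: set[str]) -> set[str]:
    covered = set()
    for server in RINGS.get(ring, []):
        covered |= server["features"]
    return change_features - covered

change = {"plain-db", "snapshot-db"}       # what the pull request touches
missing = uncovered_features(0, change)
if missing:
    print(f"ring 0 never exercises {missing}; do not promote yet")
# ring 0 never exercises {'snapshot-db'}; do not promote yet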
My former company had a service desk in South America, but they needed some server space, so on top of the single-storey service desk building they built another floor - for the servers - not understanding that lots of computers are heavier than people. Fortunately they noticed the ceiling over the ground floor rooms bulging before anything actually collapsed.
Over here in Brazil they placed a famous gym franchise on the upper floor of a shopping mall. Barbells, and stuff, in a building with long free spans for the shops below.
Luckily the owners were well aware of the issue, and placed all the weights right next to the structural beams, running along them and the walls. The real issue wasn't the total weight - there was also a movie theater up top, something that gathers a lot of people - but the fact that all those machines with weights and barbells, tightly concentrated in the center of the spans, would have exceeded the architectural limits of the building.
What set people off were the vibrations: a single barbell hitting the floor would send an audible thud across the building.
This time, they had people with common sense in key positions...
I would have used the IT? icon, but y'know, server rooms are important too.
A scene in 'Peter's Friends' springs to mind:
https://www.imdb.com/title/tt0105130/characters/nm0748852
Collecting their luggage from baggage reclaim at Heathrow airport:
Andrew : [Struggling with Carol's suitcase] What the fuck have you got in here? Weights?
Carol : Yes.
Andrew played by Kenneth Branagh, and Carol by Rita Rudner.
When my team first deployed web-based apps we set up a test environment in parallel with production. This seemed like a terrible waste, but OK.
To this day, every test deployment I've seen has made it through to production unmolested.
Our scrummy/devops overlords insisted on inserting "staging" and "qa" as well (and some others). I still don't really understand the difference.
When we started out we were used to testing code to destruction on our workstations before releasing for desktop deployment on the great unwashed. Pushing apps on to the web made a testing environment somewhat useful. The others? I can only imagine that virtualisation and cloud instances induce laziness.
Latterly the corporate accountants have started bleating about the cost.
Well, yes.
-A.
The biggest problem with the cloud is they never stop fucking with it. I don't want stuff changing constantly for no reason.
Same goes for SaaS services that feel the need to update their UIs and force the changes on the customers (vs on-prem, where you can opt to delay any such upgrades until you are ready for them). One exception to that in my world was Dynect, which maintained (as far as I could tell) an identical user experience from when I first started using it in about 2009 until I migrated off late last year (Oracle acquired them and put their technology into their general cloud DNS offering years ago, cut the price by 98%, and I assume shut the original Dyn infrastructure off in the past couple of months if they stuck to their schedule). I haven't had the pleasure of dealing with IaaS in over a decade, my on-prem stuff hums along perfectly, and I have been successful in defending against folks who wanted to bring cloud back again and again in the last decade (when they see the costs they have always given up, since they don't have unlimited money).
I openly denigrate Cloud, here, on LinkedIn, in the office which is CloudFirst. It's a joke!
Even something as simple as the portal interface... it changes almost WEEKLY. I tried some Azure learning material, thinking about doing the certifications, before deciding sod it, stick with VMware. Even the courses are out of date before they're published, because some DevOps chimp has decided to move the controls in an "agile way" - because THEY will never need to use the front end, and THEY will definitely not need to use the front end in an emergency when half the C-suite are screaming at you!
I've always referred to developers as children who love the new shiny-shiny, but giving them control of your infrastructure in the way that you do in a cloud environment, ESPECIALLY when they are hidden behind the layers of what Microsoft hilariously calls "customer support", is lunacy.
As a retired big shop guy, with a focus on things like D/R and database, I worry that unless your role in IT is data-focused you are insufficiently paranoid when it comes to making changes that impact data.
Losing data can be an extinction event for a company.
That's why ransomware is so effective.
That is why I once had to let an insufficiently paranoid DBA go.
I'm sure that somewhere deep in the legalsleeze is a line saying something to the effect of "If we screw up and lose all your data, our financial responsibility begins and ends with 'Oops, sorry about that.' " So off your data goes to the cloud, perhaps never to be seen again, and there goes your business WHEN, not if, it disappears.
Tests that don't cover edge cases are not tests.
Tests that aren't functional tests on real hardware in simulated production environments are not tests.
If you don't have those, your product is not being tested.
30 years in QA; I've seen this blow up at every company I've ever worked for.