code and infrastructure are complicated enough that the time since September 10th wasn’t enough to do the whole job.
Or maybe they were just hoping it would go away, followed by a mad rush as oh crap we really have to obey somebody else's laws...?
Google has revealed the cause of its very unwelcome Gmail outage and on The Register’s reading of the situation it boils down to forgetting to take an obsolete version of software out of production. Google’s shorter explanation for the mess is: “Credential issuance and account metadata lookups for all Google user accounts …
If one wants to do critical infrastructure right, all new code and configurations have to make their way from a testing setup to a staging setup before getting even close to production. Deployment is then a simple matter of copying the tested-and-verified configuration on staging to production. Staging and production environments have to be identical for this reason, to prevent any surprises.
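As a rough illustration of the "staging and production have to be identical" point, here is a minimal Python sketch that compares two configuration dictionaries and reports any drift. The function name `config_drift`, the `allowed` parameter, and all the settings shown are hypothetical, invented for this example; a real setup would load both sides from whatever the deployment tooling actually stores.

```python
_MISSING = "<missing>"  # placeholder shown when a key exists on one side only


def config_drift(staging, production, allowed=frozenset()):
    """Return {key: (staging_value, production_value)} for every mismatch.

    Keys listed in `allowed` (hostnames, credentials and the like) are
    expected to differ between environments and are skipped.
    """
    drift = {}
    for key in sorted(set(staging) | set(production)):
        if key in allowed:
            continue
        s_val = staging.get(key, _MISSING)
        p_val = production.get(key, _MISSING)
        if s_val != p_val:
            drift[key] = (s_val, p_val)
    return drift


# Hypothetical settings, invented purely for illustration.
staging = {"app_version": "2.4.1", "quota_backend": "new", "db_host": "stg-db"}
production = {"app_version": "2.3.9", "quota_backend": "new", "db_host": "prod-db"}

print(config_drift(staging, production, allowed={"db_host"}))
# prints {'app_version': ('2.4.1', '2.3.9')}
```

Run as a gate before promotion, an empty result would mean the two environments agree on everything that is supposed to match.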
It sounds more like Google practices the time-honoured tradition of 'production is staging', however. To have an old version of the software lying around suggests that they are not using an automatic deployment script (unless it's an intern called 'Gary') and instead probably have a haphazard semi-automatic (or even manual) procedure for updating production.
When working for large companies, this behaviour is not unusual, though. Most places I have worked for never had a 'testing' environment, instead using 'staging' as the testing environment, and production for staging. It does increase code throughput and makes it seem like everything is moving faster instead of being stuck in 'staging' for weeks while issues are discovered and fixed. The trade-off with omitting staging is the influx of tickets and angry phone calls in the hours and days after deployment to production, of course.
Stuff that slipped past unit tests and local testing would end up in production and explode in spectacular fashion, to the point of devices rebooting (watchdog timer) and functionality suddenly being broken due to environment detection gone wrong or similarly simple issues.
Omitting a staging phase in deployment is like omitting the 'are you sure?' dialogue box before a disk-erasing operation. Better keep those backups updated (and tested).
In the armored vehicles biz, "staging" is known as the prototype(s). It's ours, the customer can see it but can't have it, and we test all fixes on it. We keep it around for experiments and further testing like cyber and fault induction/detection.
Then, right off the production line(s), the first # vehicles are labelled "initial" and go to the customer's test sites. Only after that do production lines really get going making vehicles for actual field use.
My current work is more on Army Watercraft -- ships that can move these vehicles (many from my former employer) around the globe. There's not as many ships, and we're only installing small systems -- not like a full ship overhaul -- so it's 1) testing (on land), 2) "staging" is called "fit check" and 3) "production" is called "installation".
No matter the project/scope, anyone who skips the middle step(s) is going to regret it when they get to the final step.
Engineering is the same way: derive requirements, design/develop, analyze if requirements are met and any design holes (e.g. vulnerabilities) exist BEFORE moving to test. Don't skip the analysis!
Nobody seems to test software these days for boundary conditions, despite it being fundamental to creation of robust code. In order to provide adequate assurance, testing should not only check the results of what should happen, but also the results of what should not.
Technically, this would be range checking: making sure the values you are manipulating are within an expected range. In this case, > 0. A pretty common occurrence is list handling. Users shouldn't ever see 'subscript out of range', for example. But coding practices have largely become 'let the shit fall where it may, and we'll address it in a future release'.
Bounds checking, OTOH, or the lack of it, is the more dangerous incompetency. This is where no one is checking if the data will fit in the memory allocated, and it leads to buffer overflows and exploits.
Delphi had both bounds checking and range checking since the 90s. Other languages don't. Either because someone sees them as syntactic sugar, or because it slows down their app. Seriously, there is enough other crap slowing down apps much more than bounds checking. It's time for some improved tools.
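For what it's worth, the range checking and boundary testing discussed above might look roughly like this in Python. It is only a sketch: `checked_quota`, `item_at`, `is_rejected` and the upper limit of 1,000,000 are assumptions made up for the example, not anything taken from Google's code. The tests at the end check both what should happen and what should not.

```python
def checked_quota(value, upper=1_000_000):
    """Range check: accept only whole numbers with 0 < value <= upper."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"quota must be an int, got {type(value).__name__}")
    if not 0 < value <= upper:
        raise ValueError(f"quota {value} outside expected range 1..{upper}")
    return value


def item_at(items, index, default=None):
    """Subscript check: return `default` instead of raising IndexError."""
    if 0 <= index < len(items):
        return items[index]
    return default


def is_rejected(value):
    """Helper for the tests below: True if checked_quota refuses the value."""
    try:
        checked_quota(value)
    except (TypeError, ValueError):
        return True
    return False


# Boundary tests: values on the edges must pass, values just outside must fail.
assert not is_rejected(1)
assert not is_rejected(1_000_000)
assert is_rejected(0)
assert is_rejected(-1)
assert is_rejected(1_000_001)
assert is_rejected("10")

# Subscript tests: out-of-range indices never reach the user as an error.
assert item_at(["a", "b"], 1) == "b"
assert item_at(["a", "b"], 2) is None
assert item_at(["a", "b"], -1) is None
```

Python already bounds-checks list access at runtime, so the memory-safety half of the argument does not apply here; the sketch only covers the range-checking half.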
Without wanting to touch on the idiotic near-religious flame war that is exception vs error handling... the default approach, because it's lazy as hell, is to do no error handling at all and just let exceptions handle everything. Doesn't matter if it's a relatively expected error, it will be left to raise a cascade of meaningless exceptions until some unfortunate component in the stack-o-horrors traps the exception and will duly mask it with the ever helpful "an error has occurred" message and then fail.
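Without taking a side in that argument, here is a small Python sketch of the alternative to the catch-all: expected failures are checked where they occur and reported with a specific message, so nothing downstream has to mask them. `ConfigError`, `read_port` and `describe` are hypothetical names invented for this example.

```python
class ConfigError(Exception):
    """An expected, explainable problem with the supplied settings."""


def read_port(settings):
    """Return a valid TCP port from `settings`, or raise ConfigError."""
    raw = settings.get("port")
    if raw is None:
        raise ConfigError("missing 'port' setting")
    try:
        port = int(raw)
    except (TypeError, ValueError):
        raise ConfigError(f"'port' must be an integer, got {raw!r}") from None
    if not 1 <= port <= 65535:
        raise ConfigError(f"'port' out of range 1..65535: {port}")
    return port


def describe(settings):
    """Caller that reports the specific problem instead of a generic failure."""
    try:
        return f"listening on port {read_port(settings)}"
    except ConfigError as exc:
        return f"configuration problem: {exc}"


print(describe({"port": "8080"}))
print(describe({}))
print(describe({"port": "abc"}))
```

Only the failures that are actually anticipated get turned into `ConfigError`; anything genuinely unexpected still propagates as a normal exception with its traceback intact.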
There's a reason for that.
Once upon a time I needed to deal with files generated by a very popular sound-editing application. The file format was very well documented. I (and a "second witness") very carefully checked that my code matched the spec. Unfortunately, it immediately started reporting errors in nearly every file that had passed through that application (and only that app, not, e.g. files written by the authors of the standard). A few minutes with a hex editor revealed the problem: the very popular app did not follow the standard (basically, it started an index at zero, rather than the specified 1).
So I had to change my code to match what the app did, rather than what the spec wanted. It's not like I could force the app's developers to meet the spec, and my customers were unlikely to value my compatibility with spec over their use of the app.
I suspect a lot of this goes into the decision to elide error checking. Much like when one could choose to follow the (IIRC) ESMTP RFCs or interoperate with Exchange.
The thing about Facebook is that because of their size and dominance, people assume they're professionals and know what they're doing. But I get the distinct impression that they're mostly staffed by excitable but amateur coders who think they're pioneers and treat everything as an opportunity to 'do something cool' which amounts to badly reinventing the wheel. The bugs and weird behaviours that occur in Facebook's mobile app and web front-end often suggest that the architecture is an unholy, untameable mess.
Ahh. Unit testing. Functional testing. User Acceptance testing. All phases of deployment most companies don't even bother to pay lip service to any more.
Had it fairly recently.... the conversation went along the lines of me asking about testing, some mumbling from the developers about 'kind of' testing it so it *should* work, at which point I asked "If that's the case, why am I on a call at 23:00 and we're discussing your code not running properly because you haven't performance tested it against a production-sized dataset?"
The corollary to this is the client insisting that the updates go live NOW despite having not been tested.
That's easy: insist that they sign a note saying that they requested that untested code be run live and that they bear all responsibility for any problems arising.
No signed request: no untested code gets run.
That should make even the stupidest PHB reconsider whether this is really a good idea.
I had a client call me at 8am one morning asking for updates he'd thought of overnight for his web platform. I told him we could do the updates but that we wouldn't have time to test them if they needed to go live immediately. He said "just do it," so we did.
A little while later I got a rather angry phone call after he had a rather bad meeting with an investor where his site didn't work properly in Firefox. I told him we were still testing for issues and it was his decision to put the changes live before full testing. He told me he'd sue me and make sure nobody ever worked with me again (I was running a small development company at the time).
I offered to continue testing and fix the issues but he wanted me to stop all work and hand over all assets, which I did.
Not long after I received court papers which were from a company we'd never worked for. I looked it up and the company didn't even exist. I got the case thrown out before court without ever involving a solicitor and never heard from him again. I only lost a few hundred on unpaid invoices and walked away with the determination that I would never put anything live again - no matter the customer demands - without full testing.
I also got a quiet chuckle from the fact the customer was a former law enforcement officer who couldn't complete a simple court document correctly - or perhaps was doing so fraudulently. I also consider that I got away lightly for dropping my standards.
Anon, because, well, it wasn't my finest moment.
Given such a remit the actual users will specify that they want a text input box to be a bit bigger and for the colours to be tweaked. Users rarely specify what they need; this usually requires someone with cross-discipline skills to specify. Often, of course, the management of said users will specify exactly how they want something to work, platform and all, rather than what they need the system to do. Usually because they don't really know.
There's a lot in this. Users can specify what they do: how they do it and why they need to do it. Management can specify what the outcomes are meant to be. It takes both of these to develop a software tool. Developers seem (in my experience of working with some of them) to produce what they think users ought to want, based on managers' explanation of what they think the users do. The latter is seldom a full story. The former a mere fantasy.
When I was doing internal development creating an inventory system, I realized I needed to get input from practically every class of employee just to get the full picture. First you talk with the stock girl, because she actually knows the current inventory system the best. Then of course the shop techs, who actually need to use the system to order more parts, and to track remaining quantities needed for their future delivery deadlines. They might give you unique ideas, like an option to put a part on hold that's still being ordered, so they're the first to get it. Then of course management, they can be clueless but if you offer up simple features like, "what about a cash flow report, that shows the cash value of all parts flowing into and out of the inventory that month?" it helps them feel included, 'special'. And the CEO, sometimes they may actually have a simple, good idea, like getting that same cost report the other managers get, but produced weekly, and automatically emailed to him Monday morning.
Then after merging all your notes from these different people, you may have enough info to finally start designing this system on paper! (I couldn't imagine what it's like doing that from the external perspective.)
This is an update to an authentication service, not an end user facing application. So there is little use for any "staging" or "validation" environment for anyone to test. The engineers are more than able to validate the change by themselves during service testing and load/fail over testing.
And in case you've not read the article, the failure was due to leaving the old quota system in place instead of switching it off, and that happened back in October. Yes, that was a time bomb, but when doing this kind of infrastructure upgrade, leaving the legacy system switched on instead of turning it off is best practice, as it speeds up the rollback in case you need it. The only fault on Google's side is that they forgot to switch the old system off after more than a month of running the new one, and that caused the outage.
It is just impossible to detect that kind of omission in any "staging" or "testing" or "pre-anything" environment. The only purpose of having the legacy system running in these environments is to test that you can roll back your changes quickly without breaking it. Unless you leave such an environment running for more than a month while injecting a realistic workload, you'll never, ever catch the omission. Not that it makes much sense to spend two months running a simulated production environment, as you only have to make a note to switch off the legacy system once everything is working smoothly with the new one. And remember to do it. Which wasn't the case.
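The "make a note and remember to do it" step can at least be made harder to forget. The Python sketch below flags legacy systems that are still enabled after their rollback window has passed. Everything in it is an assumption for illustration: the function `overdue_legacy_systems`, the 30-day window, the system names and the dates are invented and say nothing about how Google tracks this.

```python
from datetime import date, timedelta


def overdue_legacy_systems(systems, today, rollback_window_days=30):
    """Return names of legacy systems still enabled after the rollback window."""
    overdue = []
    for name, info in systems.items():
        deadline = info["cutover"] + timedelta(days=rollback_window_days)
        if info["still_enabled"] and today > deadline:
            overdue.append(name)
    return sorted(overdue)


# Hypothetical inventory, invented purely for illustration.
systems = {
    "legacy-quota": {"cutover": date(2021, 1, 4), "still_enabled": True},
    "legacy-reports": {"cutover": date(2021, 2, 20), "still_enabled": True},
    "legacy-mailer": {"cutover": date(2021, 1, 4), "still_enabled": False},
}

print(overdue_legacy_systems(systems, today=date(2021, 3, 1)))
# prints ['legacy-quota']
```

Wired into a daily job or a CI check, a non-empty list becomes the reminder that the rollback safety net has outlived its purpose.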
I have no problems at all with Facebook developers being forced to miss Christmas. I would like to see it. I have a long list of other punishments I would like to see them subjected to as well. I would also like to see the entire company fail but I guess I should just stick to small dreams.
Biting the hand that feeds IT © 1998–2021