This happens so often...
You should make a macro for this article...
Just automatically fill in the date and their canned twitter response...
Online and mobile banking services from RBS Group subsidiaries NatWest, Royal Bank of Scotland, and Ulster Bank crashed at around 5am this morning and remain down. Reports of the digital account blackout began surfacing on Downdetector early this morning (see here, here and here) and a mouthpiece at NatWest ‘fessed up to …
For a start, the web server allows client-initiated renegotiation, which is NOT good at all...
Although the option does not pose a confidentiality risk, it does make a web server vulnerable to DoS attacks within the same TLS connection. Therefore you should not support it.
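If you want to check that yourself, here's a rough probe sketched in Python, assuming the stock openssl command-line tool is installed (its s_client man page says a line containing just "R" requests renegotiation, TLS 1.2 and below). The hostname is a placeholder, and you'd read the output by eye rather than trust any parsing:

    # Rough probe for client-initiated TLS renegotiation via the openssl CLI.
    # HOST is a placeholder; inspect the printed output manually.
    import subprocess

    HOST = "www.example.test"

    try:
        proc = subprocess.run(
            ["openssl", "s_client", "-connect", f"{HOST}:443"],
            input=b"R\n",   # a line of just "R" asks s_client to renegotiate
            capture_output=True,
            timeout=20,
        )
        print(proc.stdout.decode(errors="replace"))
        print(proc.stderr.decode(errors="replace"))
    except subprocess.TimeoutExpired:
        print("Probe timed out - the server may just have held the connection open")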
they have not enabled DNSSEC... so you can trivially spoof it even if you're using the latest and greatest security!
maybe they should look at the top-level domain .bank, which requires security...
http://go.ftld.com/dnssec-implementation-guide
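A quick way to check for yourself whether answers for a zone come back DNSSEC-validated, sketched in Python assuming the third-party dnspython package (the domain here is just a placeholder): send a query with the DO bit set to a validating resolver and look for the AD flag.

    # Query with DNSSEC OK (DO) set and check the AD (authenticated data) flag.
    import dns.flags
    import dns.message
    import dns.query

    DOMAIN = "example.com"
    RESOLVER = "8.8.8.8"  # a public validating resolver

    query = dns.message.make_query(DOMAIN, "A", want_dnssec=True)
    response = dns.query.udp(query, RESOLVER, timeout=5)

    if response.flags & dns.flags.AD:
        print(f"{DOMAIN}: answers validated by the resolver (AD flag set)")
    else:
        print(f"{DOMAIN}: no DNSSEC validation - answers could be spoofed")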
It was a botched firewall change which has now been backed out. Things (as of 10AM) are back up. This is actually an example of things working pretty well: a mistake was made in a configuration, but the change had a proper backout process which worked and recovered service. Of course, it's RBS, so everyone (understandably - their crimes were historically very great) likes to be nasty about them, but, well, compare this to what happened to TSB (or to RBS in the big incident a few years ago): it's never OK to lose service, but realistically that does happen sometimes. The real test is whether service can be recovered quickly without losing transactions or leaking information, which it was in this case.
Disclaimer: I don't work for RBS any more but I do know people who do hence AC post.
Agreed that it's good to have an effective rollback procedure. That said, five hours seems like a long time to roll back a firewall change, and in terms of PR, the status page seems to have been behind the borked firewall and therefore unavailable, and the customer support bods didn't seem to have a clue what was happening either.
The 5 hours probably wasn't spent on the reversion of the firewall configuration. It was most likely spent tracking down why the system wasn't working.
Remember that the guys who make firewall changes rarely understand the system and how it works.
Somebody would have reported that some functionality wasn't working, but if you've just carried out a large deployment, the firewall change is just one small part of that.
I guess people are rightly annoyed because:
- RBS / NW / UB are closing branches in many towns and telling people to use internet banking instead; people rely on the system, and the system goes down.
- It has happened before, and was a massive failure.
I left the group in my late teens when I went to (what was then) a branch asking for a proper bank account (they'd had me on a basic account since my schooldays) and got a "computer says no" response. Went to a competitor bank and they were happy to throw all sorts of banking features at me (plus £100!). So I have no love for them.
That said, why was the firewall change not tested in a sandbox / production mirror first?
I don't know the answer to these questions: chances are it was tested but someone made a mistake in the final implementation I suppose. Note I'm not claiming that outages are acceptable, just that they will happen occasionally because people do make mistakes, and it's important to have working backout mechanisms, which it seems like they did.
(Same AC as original comment, I won't followup further as I don't want to speculate on things I don't know about.)
So you make a firewall change.
The alarms and monitors all go off that your outside connectivity is now non-functional since the change.
You wait 30 seconds to see if it's just the config taking effect.
The alarms are still going off.
You go to your change management log, see that the change in question is the cause of the problems in question (and it's not just a lucky time correlation), and back out the change made.
That should *not* take five hours. On a multi-million pound banking system. With a competent team and proper processes. Where it's literally *costing you money* each second it's down.
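That whole loop is small enough to script. Here's a minimal sketch in Python, with hypothetical apply/backout scripts and a hypothetical probe URL - not anyone's actual tooling:

    # Sketch of an automated post-change check with backout.
    # The apply/backout scripts and probe URL are placeholders.
    import subprocess
    import time
    import urllib.request

    APPLY_CMD = ["./apply_fw_change.sh"]
    BACKOUT_CMD = ["./backout_fw_change.sh"]
    PROBE_URL = "https://status.example.test/health"
    GRACE_SECONDS = 30

    def connectivity_ok() -> bool:
        """True if the external probe answers with HTTP 200 within 5 seconds."""
        try:
            with urllib.request.urlopen(PROBE_URL, timeout=5) as resp:
                return resp.status == 200
        except OSError:
            return False

    subprocess.run(APPLY_CMD, check=True)
    time.sleep(GRACE_SECONDS)  # give the new config time to take effect

    if connectivity_ok():
        print("Change verified: outside connectivity is fine")
    else:
        # Alarms still firing: treat the change as the cause and revert it.
        subprocess.run(BACKOUT_CMD, check=True)
        print("Connectivity check failed after the grace period; change backed out")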
Well, assume that the monitors do go off, and that they go off promptly. If they do, then you don't just reverse whatever change you made: you have to fill in a great mass of forms which describe what you're going to do, apply for the access which lets you do what you're going to do, get approval from a bunch of very cautious people many of whom don't understand what you did to break it, how your proposed fix is going to fix it, or indeed any of the technical details of the thing at all, but who have burned into their brains the memory of a previous instance where someone 'backed out a change' which then took the bank down for days and are really concerned that this should not happen again. This takes time. It would be much quicker if you could just apply the fix you know will solve the problem, but the entire system is designed to make that hard.
Yes, it would be easier, and much quicker, if all this laborious process did not get in the way. But no-one really knows how to do that: the laborious process makes it hard to do bad things as it's intended to do, but it also makes it hard to do anything at all as a side effect. It's like chemotherapy: it kills the bad stuff, but it nearly kills the good stuff too. I think this is an open problem in running large systems: people like Googlebook claim to have solved it ('we can deploy and run millions of machines') but they do that by accepting levels of technical risk which a financial institution would find terrifying (and there's a big implied comment here about people (like banks) moving services into the clown and hence implicitly accepting these much higher levels of technical risk which this margin is not large enough to contain).
"Well, assume that the monitors do go off, and that they go off promptly. If they do, then you don't just reverse whatever change you made: you have to fill in a great mass of forms which describe what you're going to do, apply for the access which lets you do what you're going to do, get approval from a bunch of very cautious people many of whom don't understand what you did to break it, how your proposed fix is going to fix it, or indeed any of the technical details of the thing at all, but who have burned into their brains the memory of a previous instance where someone 'backed out a change' which then took the bank down for days and are really concerned that this should not happen again."
Um....no.
When I put in a request to make a network change, the last page is the backout plan; that gets approved along with the change. In fact it's part of the process (roughly sketched as a record after the list):
1) reason for change
2) when will the change occur
3) customer impact, including will the change cause an outage
4) if this will cause an outage, how long will the outage be
5) where the change will be made
6) steps to effect the change
7) steps to verify the change was successful
8) steps to undo the change
9) steps to verify that it works again
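For illustration only, the same record expressed as a plain data structure in Python - field names are invented, not any particular bank's change tooling:

    # Illustrative change record matching the nine steps above.
    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class ChangeRequest:
        reason: str                      # 1) reason for change
        scheduled_for: datetime          # 2) when the change will occur
        customer_impact: str             # 3) impact, incl. whether it causes an outage
        outage_minutes: int              # 4) how long any outage will be
        location: str                    # 5) where the change will be made
        change_steps: list = field(default_factory=list)          # 6) steps to effect it
        verify_steps: list = field(default_factory=list)          # 7) verify success
        backout_steps: list = field(default_factory=list)         # 8) undo the change
        backout_verify_steps: list = field(default_factory=list)  # 9) verify it works again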
Clearly not - and be grateful. It involves spending a lump of time (usually in the middle of the night) talking to generic managers who don't know or understand what your specific area of technology is about, never mind your change, so that they can make the decision to do the thing you knew about several hours ago.
Yes, approval to back out. The technical person or people implementing the change are absolutely not in a position to make decisions which could influence the functioning of the organisation, especially where the functioning or otherwise of the organisation is going to be in the papers. That's why there are elaborate governance structures in banks.
And back-out plans mean the worst that happens is the upgrade doesn't go through tonight; try again tomorrow.
In case you didn't notice - ABSENCE of a rapid, pre-approved back-out plan... got them into the papers.
I'd be much more worried about a place that requires approval before backing out (rather than taking care to only approve plans with a safe back-out) - when the change is slowly churning through the entire database, causing widespread corruption and affecting more and more records, and you have to wait for "approval" from someone to back that out.
Hey... maybe that explains TSB, eh?
@tfb - "moving services into the clown"
I know that sketch, it's the one with the ladder, bucket of whitewash, and the hilariously large syringe. This way is a lot more fun than running your services on someone else's computer.
Yes, the large coat with the flower in the buttonhole. Would you like to smell the flower?
Re:
Quote:If they do, then you don't just reverse whatever change you made: you have to fill in a great mass of forms which describe what you're going to do, apply for the access which lets you do what you're going to do, get approval from a bunch of very cautious people many of whom don't understand what you did to break it, how your proposed fix
End Quote
Not really. Because when you book in a change window, that change window should allow for reversion of the system. And the post-deployment testing should be conducted within that window too, so the failure should have been detected before the window closed.
There's probably a little huddle of people that occurs to make a decision to revert, but it won't be any lengthier than that.
>chances are it was tested but someone made a mistake in the final implementation I suppose.
Chances are it was not tested. Why? Because in my experience, test environments usually contain the applications and the logical solution architecture, not the real physical hardware with the firewalls.
The network infrastructure element of the production system is rarely duplicated in a test environment, or not duplicated with sufficient fidelity to reality.
Clearly a fluff-piece response by the 'anon' former employee. 'Look, TSB were worse' is not an acceptable response; neither is a non-redundant major firewall change with a 5-hour rollback.
Most likely, as with others - outsourced IT for critical business functions, pushed by lazy, bonus-sucking senior managers.
You pay less, you get less - that is outsourcing.
Resilience models are well known, understood and documented.
Monitoring tools are well known, understood and documented.
So why is this so hard for people to get right?
Or is it the quick change to fix X that ends up breaking Y because insufficient testing was performed?
If we're going to have to rely on Internet-based services to run our lives, then at least the companies making mega profits can do the right thing and build them in a manner where they are rock solid.
Oh, and give us a workable Plan B for when you screw up: local branches, cash machines, you know, that sort of complicated stuff.
Yes your profits might be a bit lower, but your customers will be able to get on with their lives when you screw up again.
It's hard to get right because we're dealing with systems which are at or beyond the ability of humans to understand them.
And 'the companies making mega profits' are companies like Google, Intel, Facebook & Apple: not, for instance, RBS. Of course, those highly profitable companies never ever make mistakes. No company ever shipped several generations of processors with catastrophic security flaws, for instance.
It's hard to get right because we're dealing with systems which are at or beyond the ability of humans to understand them.
I disagree.
Any given system, however complex, can be broken down into a number of sub-components / sub-systems of lower complexity whose expected functionality can be documented and understood.
This process can be repeated multiple times until it's obvious that the system is just a big pile of little systems all working in harmony. Understanding what each one does and how they interact makes it easy to get to the right bit when something goes wrong; similarly, it makes it easier to assess the impact if you need to change something on a component for whatever reason (patch, upgrade, new functionality, etc.).
Take some time to read up on architectural frameworks.
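To make that concrete with a toy sketch in Python (component names invented, and no claim about how any bank actually models its estate): treat the system as components with dependencies and a health check each, then walk the graph to find the deepest failing piece.

    # Toy decomposition: components, dependencies, and a health check each.
    from dataclasses import dataclass, field
    from typing import Callable, List, Optional

    @dataclass
    class Component:
        name: str
        healthy: Callable[[], bool]
        depends_on: List["Component"] = field(default_factory=list)

    def first_failure(component: Component) -> Optional[Component]:
        """Return the deepest unhealthy component reachable from `component`."""
        for dep in component.depends_on:
            found = first_failure(dep)
            if found is not None:
                return found
        return None if component.healthy() else component

    # Invented example: a web front end behind a firewall and a core API.
    firewall = Component("firewall", healthy=lambda: False)   # the broken bit
    core_api = Component("core-api", healthy=lambda: True)
    web = Component("web-frontend", healthy=lambda: False,
                    depends_on=[firewall, core_api])

    culprit = first_failure(web)
    print("Start looking at:", culprit.name if culprit else "nothing is failing")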
@tfb
Your original point was about being able to understand complex systems, not that code will contain bugs or that people will misinterpret the specification.
The good thing about having a functional diagram when these things happen is that it's clear where the defective component is and what needs to be done to fix it. Comprehensive testing can also help.
Failing to test adequately - particularly after a change - is just inexcusable and lazy.
@Dwarf. Thus spake the theoretician; take some time to look at architecture in the real world.
I disagree with your disagreeing.
Firstly, any given system in any big corporate is a hodgepodge of 20-30 years of tech, large parts of which are complete black boxes with the techies responsible for them long gone.
Secondly - nearly every IT professional I know has had a WTF? moment when some extreme edge case has hit them, causing unexpected results.
Thirdly - do some research on emergent behaviour.
@Gordon10
I take your point about corporate systems and the way they evolve; however, even breaking down the black-box systems into what each black box does and its inputs and outputs will help in understanding the functionality of a system.
However, failing to document a system, or being unable to properly support it when it goes wrong, particularly when that subsystem is business-critical, is completely inexcusable.
Often, when there is "that thing that nobody understands any more", grabbing a hot coffee and having a good poke around in the system will yield a lot of information, and arguably that forms the next level of decomposition of what that black box does - it's actually a collection of 6 smaller black boxes or functions.
One thing's for certain, ignoring the problem isn't going to make it go away.
If it is sensitive to some form of edge case, then ensure that the upstream interfaces are specified, coded and documented accordingly to prevent that edge case from making it to the black box.
None of these are complex engineering challenges, and often the problems come from a predecessor taking a shortcut and leaving the next guy to sort out the mess. I'd prefer to be on the other side of the fence, doing it right and breaking down some of the perceived barriers.
I guess that's what makes me a bad fit for the new fangled "agile" make it up as you go along project delivery approach, but that fad will soon pass.
You almost certainly will not have the access for "a good poke around in the system" at a bank to figure out what is going on. Especially if no-one there already understands how it works.
You would first have to build a business case to allocate you the time to document how it works, then get access to the test system - hard, as the test system will already be in use for production releases and patching. Likely you will have to apply to share the test system's resources with devs, who will not be happy if your poking around breaks it and delays releases.
"Firstly any given system in any big corporate is a hodgepodge of 20-30 years of tech, many of which are complete black boxes with the techies responsible for them long gone."
I hate to get all theoretical, but if a big corporate really finds itself in a position where it does not know how its systems work then it is no longer in control of whether they do actually continue to work, precisely because of your second and third points. The entire company could cease trading tomorrow and never be able to restart. Are the management and shareholders OK with that?
Oh, Dwarf clearly knows all the theory but has little practical experience on large production systems of high complexity.
Often documentation is missing; it shouldn't be, but that's the real world. And even if the documentation is present, that's not the answer. At the end of the day it's down to people, what they know about the system, and keeping information in their heads for fast recall. Understanding doesn't always come from reading a document; it comes from real-world, hands-on practical experience of a system.
On one system on which I work, it has literally taken me several years of daily use to build up knowledge and understanding of the system, such is its complexity.
It wasn't only Barclays' "online" stuff that was out yesterday. My wife had to make an urgent payment on a property purchase and ended up going to the local main branch. They were dead as well - Barclays branches have, of course, been largely automated in recent years.
So there was no way to use any Barclays retail banking services yesterday. Luckily they fixed it in the late afternoon.
Today the RBS group of banks (which all use the same firewall, with such a single point of failure?), Barclays yesterday, Lloyds not so long ago, along with Halifax. And so the list of names goes on. Seems to be becoming more prevalent - and at a time when King Cash is being threatened. It does make you wonder if somewhere in the world there is a rubbing of hands.
Since NatWest closed their local branch I've been looking at moving my account.
This is more encouragement.
Test driving the Nationwide to see if they are any good before deciding if I should switch.
As you get older, though, banks that you have...errr....loved and lost get to be the majority.
Barclays, Halifax, Santander, all screwed me over to a greater or lesser extent in the past.
Not sure how long to hold a grudge, but I haven't run out of banks yet. Probably best to have funds in at least two banks if you can afford it. I have more than one credit card via different suppliers (and a mix of Visa and Mastercard) so in theory the only choke point is when they are paid off at the end of the month.