
Kettle
...Test Systems Better, IBM tells UK IT meltdown bank TSB...
Pot, meet kettle...
A report into the IT meltdown at TSB has suggested the British bank did not carry out rigorous enough testing and that the problems went beyond previously reported middleware issues. The chaos at the bank, a subsidiary of the Spanish Sabadell Group, saw many customers unable to access services for a week at the end of April …
During my IBM Tenure, some of which was spent as a Test Manager, the majority of project squeeze was in the testing arena due to overrunning development / reduced timeline for deployment etc etc... This was always sold as 'taking a pragmatic view with Risk Based Testing'
Or cutting corners...
In my time at a large British defence company, management wanted the design specs approved asap, and accused me of delaying the project when I refused to approve them.
I particularly objected to documents that were just a restatement of the requirements, with no discussion of how to achieve them in programming terms. It might almost have been better to just start coding than to write those documents.
Not surprisingly, the coding and module testing phases overran, leaving little time for system testing.
I blame the "gee, this is easy" attitude referred to by some as "positive thinking".
You've got to enjoy the moment - being the first to give the authoritative, "No", in a "Just Do It" environment.
Defending your decision to people who assume they know more than you but don't realise it isn't your first rodeo and you read everything including the project history before you turned up on site is just the icing on the cake.
In a canned statement, TSB said: “The IBM document contained a preliminary work plan with very early hypotheses based on observations to date, that were produced after only three days of engagement with TSB. To present this document as a clear view on what went wrong wouldn’t be a fair reflection. Similarly it isn’t a fair reflection of what actions may or may not subsequently have been taken.”
Seems like someone doesn't dig what the IBM report has to say...
So according to TSB, the document didn't present a clear view on what went wrong, and isn’t a fair reflection of what actions may or may not subsequently have been taken, but TSB chose to release this to the Treasury Committee two months after the debacle, and then, having released a document saying "we're idiots", try to deny it.
That Pester is a class act - although apparently representative of the depth of talent in financial services. When are we due our next financial services induced crash?
> So according to TSB, the document didn't present a clear view on what went wrong, and isn’t a fair reflection of what actions may or may not subsequently have been taken
Not quite so much fun, but that is exactly correct.
TSB didn't elect to release it. The Treasury Select Committee did, over the objection of both TSB and IBM, on the grounds that it covered working hypotheses not conclusions and even by 6 June when it was provided, it was already out of date. Both TSB and IBM asked for the disclaimer text on it because otherwise even more people than have done so far would be assuming it was a factually complete assessment.
What is more interesting is why the TSC chose 20 June to release an incomplete assessment from 29th April (3 days after IBM started getting acquainted with the debacle, and two weeks after the TSC received it). This release seems mostly useful to sustain some news and media pressure on Pester for political reasons - it certainly doesn't inform the discussion about what went wrong, why and who is to blame.
Ref: http://data.parliament.uk/writtenevidence/committeeevidence.svc/evidencedocument/treasury-committee/service-disruption-at-tsb/written/85691.html
Pester is extremely talented. Head bullshitter in a nation of bullshitters.
Everyone is playing Cost Externalisation. Either profit by making someone else pay, or to be scapegoated.
Let's not pretend that the banks caused the last crash, everyone was trying to get richer off the demand created by the unpayable loans and mortgages racket.
Having your properties triple in value and then complaining that 'The banks made me do it' at the cost of the debts then dumped on society is simply mass delusion. We live by these myths to absolve ourselves of our group shame.
People are seething because they got outsmarted by the pros, but given the same scenario again we will all shaft the next generation and then demand that society fix the problem caused by the 'nasty tories' and 'fat cat bankers'.
Spot on mate. People are seething because they got outsmarted by the pros -- which wasn't much of a hurdle to jump.
No wonder they refuse to build council houses (not near me, no way) when they could boast at dinner parties about how much they are now worth. Don't you dare build on that field -- that's my view and is worth X thousand pounds.
Brexit is another. I want to pay less and less, and I don't care that people's jobs here have to be shafted by getting foreigners to do the work for less (I know they are all here keen to improve their English, just as long as I don't have to pay). Oh, and keep cutting those taxes.
Hang on! The country's full of them. Who let them in?
It's the public, you and me, not the politicians or bankers, that are the real villains.
The article pulls out the increased fraud risk associated with the recommendation to reduce the use of Actimize, but misses the implications of the OpenAM recommendation - effectively turning off authentication of internally accessible microservices, if I read it right.
If 'internally accessible' means accessible only to other microservices within an isolated network for the backend banking platform and nothing else, that might, just possibly, be an acceptable short term risk to run, but is very very bad practice. One microservice breach would open up every unsecured microservice.
If 'internally accessible' means accessible to every other server in the datacentre (probably including a lot of poorly secured legacy), or - worse still - the general internal bank networks with all of their desktops etc., then ... wow. Just wow.
Also, most sensible microservice authentication methods (JWT etc.) are stateless, so shouldn't require calling out to OpenAM, which would obviously result in scalability issues.
Possibly the 'validation' that's been talked about here is a belt-and-braces check for token revocation or similar. Turning that off would be more reasonable from a security perspective. Possibly ... but I bet it isn't. Normally access tokens would not be revocable, only refresh tokens, and those would not be checked on "every interaction".
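To illustrate the "stateless" point above, here is a minimal sketch of a signed token that a service can verify locally, with no call-out to a central server such as OpenAM. It is a simplified stand-in for JWT, not the real JWT format, and nothing here is taken from TSB's actual setup: the shared secret, the claim names and the function names are all assumptions made for the example.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"demo-shared-secret"  # hypothetical key, for illustration only


def _b64(data: bytes) -> str:
    """URL-safe base64 without padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    """Inverse of _b64: restore the padding, then decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(payload: str) -> str:
    return _b64(hmac.new(SECRET, payload.encode(), hashlib.sha256).digest())


def issue_token(subject: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Return 'payload.signature' carrying the subject and an expiry time."""
    now = time.time() if now is None else now
    payload = _b64(json.dumps({"sub": subject, "exp": now + ttl_seconds}).encode())
    return f"{payload}.{_sign(payload)}"


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Check signature and expiry locally; return the subject, or None if invalid."""
    now = time.time() if now is None else now
    try:
        payload, signature = token.split(".")
    except ValueError:
        return None  # not exactly two dot-separated parts
    if not hmac.compare_digest(signature.encode(), _sign(payload).encode()):
        return None  # tampered with, or signed with a different key
    try:
        claims = json.loads(_unb64(payload))
    except ValueError:
        return None
    if claims["exp"] < now:
        return None  # expired
    return claims["sub"]
```

The trade-off is the one described above: a token checked this way can't be revoked before it expires, which is why access tokens are normally kept short-lived and any revocation check is reserved for refresh tokens.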
About Actimize, whatever software they're running to detect fraud (if they're running any) isn't up to the task of an avalanche of spam, texts, phone calls, and possibly website hacking targeting TSB that's still relieving people of their money over two months later.
"About Actimize, whatever software they're running to detect fraud..."
I've worked on the fraud detection system that Barclays uses for online and mobile banking. Without giving away too many details, I would call it "trusted and proven". If TSB's system can't handle the volume, they need some different software or to stop running it on a Raspberry Pi.
"IBM suggested that the bank's testing was not up to scratch, saying it "has not seen evidence of the application of a rigorous set of go-live criteria to prove production readiness"."
I could have told them that for free, no need to spend a couple of £100k on IBM "expertise"...
I assumed that they called in IBM to tell them the bloody obvious just as a PR exercise, a way of saying to everyone: "You see, we take this seriously and we've called in someone that everyone* knows is** a serious IT expert"
* everyone outside the IT world.
** No, it isn't, and hasn't been for eons.
This post has been deleted by its author
I'm going to bite, and assume you mean 'agile' as in 'Agile software delivery process'. In most of these discussions, someone pops up and says 'Agile means no testing'. That's incorrect.
Waterfall can be done well or badly - and in the latter case testing is usually the thing that gets squeezed, as it's towards the right hand side of the plan. Agile can be done well or badly - if the latter it's usually a poor team that uses Agile as an excuse and synonym for "no process and chaos" rather than actually understanding it. Using Agile requires more discipline than waterfall, not less.
In my experience of Agile being done well, there's a *lot* of investment into testing, with test specialists embedded into every team complemented by specialised test teams.
Stories (functional specs) are expressed in an executable form using Cucumber etc. for full traceability through test.
Extensive automated tests - unit tests (with continuous code coverage monitoring), code-level integration tests, UI tests, external interface tests, as much non-functional testing as possible - are integrated into a continuous delivery pipeline so you know ASAP when regressions occur. Exploratory testing by specialist testers complements the automated tests.
A formal 'Definition of done' for development work means it can't be marked as complete until the associated automated test artefacts are there.
'Show and tells' get early informal feedback from real users, with regular more formal 'business proving' tests combined with old fashioned UATs before big releases. You shouldn't find anything much by the time you do the UAT, but it's good as a belt and braces. Any non-functional testing (often security) or interface testing (e.g. some hoary old system where you have to book the test environment a month in advance) that can't be easily automated gets done periodically, but multiple times through the delivery.
Basically, the whole *point* of Agile is to surface problems - both functional and non-functional - as soon as possible so they can be dealt with, rather than leave all of the surprises until just before you go live - as so frequently happens with Waterfall.
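As a rough illustration of what "executable" acceptance criteria can look like, here is a minimal sketch in plain Python rather than Cucumber itself. The login-ID rule, the function names and the criteria are all hypothetical, invented for this example. Each criterion attached to the story becomes a named check, and the story can't be marked done while any check fails.

```python
def is_valid_login_id(login_id: str) -> bool:
    """Hypothetical rule: 6-20 ASCII letters or digits; all-numeric IDs allowed."""
    return 6 <= len(login_id) <= 20 and login_id.isascii() and login_id.isalnum()


# Acceptance criteria for the (hypothetical) story, as (description, input, expected).
ACCEPTANCE_CRITERIA = [
    ("all-numeric ID is accepted", "12345678", True),
    ("alphanumeric ID is accepted", "abc12345", True),
    ("too-short ID is rejected", "abc", False),
    ("ID containing a space is rejected", "abc 12345", False),
]


def run_acceptance_criteria() -> list:
    """Return the descriptions of any failing criteria (empty list = story passes)."""
    return [
        description
        for description, value, expected in ACCEPTANCE_CRITERIA
        if is_valid_login_id(value) != expected
    ]
```

Wired into a delivery pipeline, a non-empty result from `run_acceptance_criteria()` would fail the build, which is the "know ASAP when regressions occur" part.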
And using an Agile methodology probably wouldn't have helped. Some of the ideas and technologies often associated with Agile (but not really Agile per se - frequent build and automated testing, and frequent communication, to give two examples) might have helped. An incremental platform to platform migration, with a solid, tested back out plan would most definitely have helped.
> I'm going to bite, and assume you mean 'agile' as in 'Agile software delivery process'.
No, I meant "agile" as in the way it's used by managers - at best a meaningless buzzword and at worst an admission that they don't know what's going on.
Agile, when used within IT departments, normally means "we have standup meetings". I've yet to encounter anyone who is actually doing agile.
@Martin M
"The only difference is you’re doing it just in time rather than all up front."
No, you're cramming it into a sprint having played ludicrous poker to come up with a number in story points which doesn't mean hours but you have to fit the right number of story points into a sprint that is set at 2 weeks because that's what the guys at corporate do when they put out the company website and they're really efficient and it hardly ever breaks.
The practice of agile (and scrum) is a fiasco. It is far, far, worse than everything else that has gone before it. It is being used to control, to artificially measure (velocity is all that matters apparently), and to fire people.
As you say, pick the project. The reality in a large organisation is "do agile for everything, that's the company operating model, if you're not with the team, you're a loose cannon".
Was time and money budgeted for testing? Were the specs locked down? Were the needs clearly identified?
No, no, and nope.
TSB thinking: "Banking system needs are going to be the same whatever country you're in, and our Proteo system is proven. There's a few local interfaces to change, that's all. As for customer migration, well that's just like clicking and dragging a file in Windows. Why would we need testing?"
Of course, Pester also made certain things were going to go wrong by tempting fate. Some choice quotes in this: https://www.bankingtech.com/2017/12/tsb-unveils-new-banking-tech-platform-proteo4uk/
Seen it on £100k projects, seen it on £500m projects.
Build a clique of other bullies who value hype over substance and sideline anyone who says "Yeah, but". Always make sure you have an exit strategy and someone else to blame. Keep going till the organisation fails, and use your connections to rinse and repeat the same pattern until your retirement fund is ready.
Welcome to British Business.
Good link thanks! TSB unveils new banking tech platform, Proteo4UK
The core originates from Proteo, the platform of TSB’s Spanish parent, Sabadell. In turn, Proteo’s roots are in the Alnova retail core banking system supplied by Accenture. Sabadell has been developing the system under its own steam for a number of years and owns the IP.
Lots of hype in there. It will be interesting to get a final report on what went wrong eventually.
The previous set-up had a complex architecture, resulting in duplicate technologies, multiple data sources and heavy reliance on end-user computing (EUCs).
The new structure has simple architecture, is centralised and offers single version of truth. For example, change of address – a point of frustration for both customers and employees, as it had to be done multiple times in multiple systems – can now be done just once across all systems.
Teradata’s tech supports data acquisition, IBM Infosphere provides data integration and MicroStrategy enables data exploitation.
TSB has also invested in data quality tools and has created a data catalogue about its data (meta data).
@Martin M
Stories are not functional specifications.
Show and tell to "real users" - you mean customers? Really?
Some products are not suited to agile. Anything that can't be developed by a single team that does it all, it sometimes seems, though that may just be my own poor experience of it.
To me, regardless of the method, inexperienced/badly led/poorly trained/poorly tested cannot be saved by process and this dive to lower common denominator code monkeys is bad news for all involved.
@yoganmahew
“Stories are not functional specifications.”
The highest level story title certainly isn’t, but once you get to the start of the Sprint in which they’re developed, they really should have been elaborated to contain pretty much everything a traditional functional spec has, including detailed acceptance criteria, error paths etc. As mentioned, Cucumber executable specs and the like often play a big part, but ultimately anything software engineers need to design and build, and testers need to create test scripts, should be attached to the story. The only difference is you’re doing it just in time rather than all up front.
“Show and tell to real users”. Yes, absolutely, if at all possible and I’ve seen it done several times - it works very well. Sometimes it’s not possible, in which case it’s really important to find the best proxy users possible. E.g. TSB could have dragooned random cashiers who also bank with them if they didn’t feel they could use customers.
Absolutely agree that process cannot compensate for people with insufficient expertise. IMO waterfall is possibly slightly more resilient to idiots than Agile, but if your team is comprised of idiots you’re probably screwed either way.
With a good team, well run Agile is generally my preference for most projects. Would I develop a compiler or nuclear power station control system that way? No. Horses for courses.
> TSB thinking: "Banking system needs are going to be the same whatever country you're in, and our Proteo system is proven. There's a few local interfaces to change, that's all. As for customer migration, well that's just like clicking and dragging a file in Windows. Why would we need testing?"
That's Sabadell thinking. They decided TSB's migration would be free by spending the £450m Lloyds gave them for IT costs on the migration to their own system and job's a goodun.
It said that a “limited number of services” - including mortgage origination and ATM and head office functions - had been launched on the new platform and a broader set of services to about 2,000 TSB partners.
The wording on the slide is a bit obscure. Given the preceding "IBM would expect world class design rigour, test discipline, comprehensive operational proving, cut-over trial runs and operational support set-up:" I read this as saying it's what they would have expected to happen, not what did happen.
I have actually had some good experiences with IBM et al. The trick is that someone actually needs to know what they want, assuming they can tell their head from their arse.
The other issue I see is I'm not sure why IBM continues to take on these types of jobs as this isn't the first one botched. You start a project and you find through the discovery process that this thing is going south on you. As a PM you pull the project and rescope. Not wait until the explosion to find out you can't make it work.
> The other issue I see is I'm not sure why IBM continues to take on these types of jobs as this isn't the first one botched.
As far as I can see IBM are (for a rare change) not implicated in a vast IT fiasco. They were appointed to try and sort out the pig's ear that TSB had made of it, and to report on what the root causes were.
Presumably the discussion was:
CEO: "Which IT integrator has the most experience of fuck-ups?"
CIO: "IBM - they've been at the centre of loads"
CEO: "Great, hire them! They must be experts in finding what's gone wrong and fixing it"
CIO: "Ermmmmm...."
In IBM's defence, they are a large organisation and do have significant technical strengths.
Most of the flak IBM gets centres around the following:
- they are the victims of grossly incompetent management (I'll leave it to the reader to determine which level of management and remind them not to forget "all" as a potential answer)
- they tend to lag a decade or so behind the rest of the IT world these days
- they are more of a financial services company than a technology company. They purchase companies for the return. If that return is only possible by not investing in new products or meaningful development other than rebranding, so be it...
> The other issue I see is I'm not sure why IBM continues to take on these types of jobs as this isn't the first one botched.
Pester had a pre-existing relationship with IBM's head of finance and banking (Hurst) and was in a deep dark hole. Pester no doubt had to agree to some lucrative terms for the initial investigation and IBM have also been handed the remediation work plus post-mortem activities. I suspect IBM are rather enjoying this project.
I'll back you up on that. Most of what people think of as IBM's fuck-ups are the result of an improper specification. Garbage in, garbage out. IBM feel it's not their job to tell you you've specced something wrong, just their job to attempt to deliver the impossible or terminally broken and get paid for it.
"IBM also told TSB to prioritise telephony and branch channels"
These are only useful if sufficiently trained and helpful (a contradiction in terms?) staff are available, and in particular the second is only useful if the customer's local branch hasn't been closed to save on staffing costs. These considerations are not unique to TSB.
Knowing someone that works for TSB and was getting it in the ear from angry customers, the branches were having as many issues as the customers were.
They reported issues where possible, however as most of the error messages were in Spanish they weren't sure what they were, so had to take photos on their personal phones and then email them to themselves to forward on to the support teams.
"IBM also told TSB to prioritise telephony and branch channels"
My reading of this was that the active-active systems couldn't handle the load of all of the channels (web/mobile/branch/telephone) and that IBM was suggesting directing web/mobile to one active system and telephony/branch to the other active system to allow sufficient resources to allow staff to help customers rather than leaving customers AND staff with non-functional systems.
They also suggested further load shedding at the load balancers (F5's) knowing that it would result in poor customer experiences (and resultant bad press) but not having a better way of getting out of the hole.
My reading of those points suggests that actual load >> designed load >> tested load. While I expect a portion of the load to have been caused by failed interactions and customer services, the time frames for fixes were beyond "quickly put more bigger boxes in".
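For anyone unfamiliar with load shedding, a minimal sketch of the idea follows. This is not how an F5 is configured and not what IBM actually proposed in detail. The channel priorities, the 50% cap and the class name are all assumptions made for illustration. The point is only that low-priority channels get refused first so that staff-facing channels keep some headroom.

```python
from dataclasses import dataclass

# Hypothetical priorities: 0 = staff-assisted channels (shed last), 1 = self-service.
CHANNEL_PRIORITY = {"branch": 0, "telephony": 0, "web": 1, "mobile": 1}

# Assumed share of total capacity that self-service channels may use.
LOW_PRIORITY_SHARE = 0.5


@dataclass
class LoadShedder:
    capacity: int       # concurrent requests the backend can serve (assumed known)
    in_flight: int = 0  # requests currently admitted and not yet released

    def admit(self, channel: str) -> bool:
        """Admit a request if its channel still has headroom, else shed it."""
        priority = CHANNEL_PRIORITY.get(channel, 1)  # unknown channels are low priority
        if priority == 0:
            limit = self.capacity
        else:
            limit = int(self.capacity * LOW_PRIORITY_SHARE)
        if self.in_flight >= limit:
            return False  # shed: the caller sees an error or a "try later" page
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        self.in_flight = max(0, self.in_flight - 1)
```

With `capacity=4`, web and mobile requests start being refused once two requests are in flight, while branch and telephony requests are still admitted up to four. None of this helps if the real capacity is far below actual load, which is the "actual load >> designed load >> tested load" problem.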
It's good grist for the media mill but the report from IBM has little direct value in terms of being a diagnosis of the problems. The text 'IBM has not seen evidence of' is exactly the text to use in similar system reviews when you have not seen the evidence nor have reasonable grounds (yet) to confirm non-existence.
A preliminary report only ever says, "We've started; this is where we are looking", and given the select committee involvement, the release of a 29th April doc (that's a full three days into the crisis, but two months old, so why no real update?) looks more likely intended to maintain political pressure on Pester/TSB than to provide any useful information about what happened.
In practice there are a few things you can posit up front for a failure like this, and the evidence normally comes after some considerable time pulling management teeth. Yes, this is all going to be down to a lack of proper governance at the tail end of the delivery cycle seasoned with some good old planning farce, when the business pressure to just carry on regardless overrides the common sense observations of any competent delivery managers or experienced tech leads.
So no real evidence in this report, but shaping up to be the same old plot line, different characters. From personal experience, it is depressingly familiar.
"I'm sorry but if you haven't seen the docs after 3 days, or even 3 hours for that matter, there is a deep, underlying problem."
It depends if you've decided on who you're going to blame. Sometimes the tough part of the preliminary investigation isn't the technical issue, it's deciding who will get to wear the Teflon coat and who gets to be covered in excrement and escorted from the building...
So IBM didn't do the original job, and got hired when things got bad?
Well, there's a reason that firms like IBM exist, and part of it is that if you hire them before things go wrong, they can advise you on how to do them correctly.
To then hire them and ask "what did we do wrong" and then quibble with the answer is a bit like running over your own dog, taking it to the vets and then quibbling with the diagnosis, saying the dog needs to be treated for fleas...
The report being discussed is the preliminary report after three days of work (so doesn't even cover identifying root causes, just candidates). Hiring somebody new to the project to lift the drains when the smelly stuff is pooling around your feet is actually not a bad idea but you do have to wait some time for the answers on anything complicated. Then you can start quibbling properly.
FWIW hiring IBM before things go wrong is not a completely untrodden path.
Their changes also triggered a validation error that locked me out. My Internet Banking registration of all numeric characters failed validation tests and I had to register with a new alphanumeric identifier. Additionally not all of the first line support team were aware and took 3 calls and 2+ hours of holding before I got lucky with someone that knew what they were doing.
Again this hints at the real issues.
An untested system goes wrong and you then rely on untrained customer services staff to support the mess.
The customer service staff didn't need training because there were no real changes with this new system and it will probably work brilliantly.
The system didn't need testing cause it was massive and could easily cope with the test loads even if they didn't test at least half the system.
Once you're past the initial wave of problems there's.....
....two more months of problems, crap and then everything begins to go black as you can no longer see the light at the end of the tunnel. And when you do see light, you know it's going to hit hard and hurt even more.
If you can't read them, why are you on a redtop IT site? There are lots of websites dealing with marketing, which you will find more to your liking.
They indicate a serious lack of due diligence and a "damn the torpedoes" attitude of the C-level.
IBM has not seen evidence of the application of a rigorous set of go live criteria to prove production readiness
Maybe they were still looking for it?
A combination of new applications, advanced use of microservices, combined with use of active-active data centres, have resulted in compounded risk in production
I can imagine microservice architecture being hard to test, deploy, maintain or even plan infrastructure usage for. They are far more like a biological system than a business system. Use only if you are DARPA, writing a PhD, or a hardass. Stay monolith if you can until industry practice has evolved.
It is time for agilist Bob Martin on "what happened when they switched it on" (an excellent lecture from 2015).
Seven weeks it took to get my business account sorted. Problem was whenever I phoned I was met with "we're having IT issues, it's all part of the bigger issue it will get sorted out in time" when in fact what was wrong is the migration of my data had somehow put my company's start date in the future... Online banking won't let you use an account for a company that's not yet trading so my access got knocked out and couldn't be reinstated...
It took two 4 hour sessions of me sitting in a branch refusing to leave to get it sorted.
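A future start date is the kind of thing a post-migration sanity check can catch before customers do. A minimal sketch, assuming each migrated record is a dict (the field names `account_id` and `company_start_date` are hypothetical and not taken from TSB's actual schema):

```python
from datetime import date


def find_future_start_dates(records, today):
    """Return the IDs of migrated accounts whose company start date is after today."""
    return [
        record["account_id"]
        for record in records
        if record["company_start_date"] > today
    ]


# Example with made-up data: the second record has been mangled by the migration.
migrated = [
    {"account_id": "A1", "company_start_date": date(2009, 3, 1)},
    {"account_id": "A2", "company_start_date": date(2090, 3, 1)},
]
suspect = find_future_start_dates(migrated, today=date(2018, 4, 22))
```

Here `suspect` holds just the second account, so it could be flagged for correction rather than having online access silently knocked out.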
This screams of agile gone wrong. Small teams with no view on the overall project goals or architecture coming up with their own MVPs and only caring about the happy path. I've seen it happen many times at the start of migration projects - projects where the uncertainty is close to zero and indeed a bad fit for the agile process in the first place. They drag on for a huge amount of time missing out crucial user journeys and I've seen crucial backend work farmed out to UI centric teams full of junior members that make disastrous architectural decisions that are only noticed when the whole site goes bang.