Yeah, get rid of Oracle...
...use MySQL instead. Wait....what?
JP Morgan Chase online banking services crashed in New York last week, and the outage has been laid at Oracle's door. The bank's online services were unavailable for Monday night and Tuesday, with service restored very early on Wednesday. The crash prevented the bank's 16.5 million online customers carrying out online banking …
Whoever decided to put Oracle RAC on an 8-node cluster of T5420's should be fired.
First of all the T chips have terrible single thread performance, the chips are made in China and the systems are made in Mexico.
RAC has terrible scalability with the performance of 3 systems in a 4 system cluster.
I bet Oracle will view this as an opportunity to sell Exadata....the customer should look at a better class of hardware....or maybe even moving to the mainframe.
> and this corruption was replicated in the hot backup.
Uh, well, duh! yeah. These things only protect against hardware faults. Against buggy software or human error (or malice) they are useless, since whatever gets done to the production instance is automatically done to your copy (maybe if they'd used the word "copy" in their systems architecture, rather than "backup" the apparentness of the shortcoming would have been spotted).
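To put the copy-versus-backup point in toy form: the sketch below is purely illustrative and says nothing about how Chase's systems were actually built. The `Primary` class, its `standby` dict, and `take_backup` are all made-up names. The hot standby mirrors every write, good or bad, while a point-in-time copy stays as it was.

```python
import copy


class Primary:
    """Toy primary database that mirrors every write to a hot standby."""

    def __init__(self):
        self.data = {}
        self.standby = {}  # hypothetical hot "backup" (really just a live copy)

    def write(self, key, value):
        self.data[key] = value
        # Replication: the standby receives every write, valid or not.
        self.standby[key] = value


def take_backup(db):
    """Point-in-time copy, detached from any later writes."""
    return copy.deepcopy(db.data)


db = Primary()
db.write("profile:1", "alice")
saturday_backup = take_backup(db)

# Buggy software or a fat finger writes bad data to production.
db.write("profile:1", "\x00garbage")

standby_is_corrupt = db.standby["profile:1"] == db.data["profile:1"]
backup_is_clean = saturday_backup["profile:1"] == "alice"
print(standby_is_corrupt, backup_is_clean)
```

Only the detached copy survives the bad write, which is the whole argument for calling the other thing a "copy".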
It does sound like a strange choice of resilience - given that the system which failed was already clustered and presumably the EMC storage was RAIDed to hell and back. So any single hardware failure could already be detected and hopefully mitigated in the server cluster or the SAN. Sounds a lot like "management by glossy brochure" rather than a professionally designed _and_ _tested_ failover system.
Remote replication of data is standard - just ask those that have lost a building if it's a good idea.
It's also a regulatory mandated configuration in a good deal of circumstances.
It can make a good deal of sense to have multiple live copies of your databases - not only for Disaster Recovery or Business Continuity but also as further sources for backups, reports etc.
However, it is certainly not a zero-cost administration configuration and does require a good deal of tech know-how.
How is it a folly to run hot backups?
So let me get this straight, they restored the backup and rolled forward all the transactions and that cured the problem?
Well, in that case the Oracle database didn't corrupt the records, as replaying exactly the same transactions a second time (from the redo logs) would have resulted in the same corruption happening.
Sounds far more likely to be a hardware problem to me (memory, SAN controller, etc).
Not exactly.
You could have a data corruption that occurs independent of the raw data. So it could be software and not hardware.
The issue is setting the database back to a good state and then rolling forward a series of transactions. They could be in the database log or it could be an external log of the transactions.
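A toy version of the restore-then-roll-forward argument, for what it's worth. This is a sketch under my own assumptions (a dict as the "database", a list of key/value pairs as the "log", hypothetical function names), not a description of Oracle's redo mechanism. If the damage came from something that was never logged, replaying the log over a good backup gives a clean result; if a logged transaction itself were the culprit, the replay would reproduce the damage.

```python
def apply_txn(state, txn):
    """Apply one logged transaction, given as a (key, value) pair."""
    key, value = txn
    state[key] = value


def restore_and_roll_forward(backup, log):
    """Start from a known-good backup and replay every logged transaction."""
    state = dict(backup)
    for txn in log:
        apply_txn(state, txn)
    return state


backup = {"acct:1": 100}
log = [("acct:1", 120), ("acct:2", 50)]

# The live system applied the same transactions, and then something
# outside the log (bad hardware, a stray edit) damaged a record.
live = restore_and_roll_forward(backup, log)
live["acct:2"] = None  # damage that never appears in the log

recovered = restore_and_roll_forward(backup, log)
print(recovered)
```

The recovered state differs from the damaged live state precisely because the damage was not one of the replayed transactions.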
> Four files in the database were awry
I wonder if "awry" is a code word for "accidentally deleted"?
There are so many possibilities, such as a tablespace that couldn't auto-extend, database files that had permissions or ownership changed. I doubt that we'll ever be told what really went wrong unless this gets taken to court and all the gory details get reported in the technical press.
Just look out for DBA, storage admin or sysadmin vacancies at Morgan Chase - that'll probably be the only clue we'll get.
Agreed, I hate to say this but in my view this is not an Oracle error, but an operator error. If the Oracle DB was somehow at fault then merely replaying exactly the same transactions through exactly the same systems would generate the same error or hold the risk of generating the same error, so you simply wouldn't do it. My guess is the DB admins or sysadmins screwed up and manually "corrupted" the file, probably with a hasty vi edit, then couldn't spot the problem as everything came back as status OK. A user "corrupted" file will pass all tests as it is not corrupted, it simply has incorrect data in it. The interesting bit now will be how Larry responds - he doesn't like anyone saying nasty things about his products, so will he simply swallow it seeing as Chase must be a big customer, or does he let rip and sue them?
Each weekday morning, members of the clearing house clear debits and credits between each and every bank at the start of the business day.
If a bank is late, for whatever reason, interest is incurred until settlement is made!
Poetic justice, the banks screw the retail customer as well as their own kind!
Your big bill comes at the end of term on a swap/swaption cap/caption or whatever hedging derivative they use to raise capital.
So when you have a 50 million dollar deal and you're at end of term, that's where you start to pay big bucks.
But I digress. That would be a different system than the one they use to manage their retail clients.
At least it should be....
Don't know why they would be looking at DB2.
JPMChase already has Informix's IDS in house which is IBM's *other* database and is actually a much better performer in terms of HA and OLTP processing.
But then again, JPMC will take the cue from their IBM client rep and his cadre of simple minded sales reps who'll sell JPMC whatever they think will make them the most money, regardless of what is really best for the customer.
Do I sound a bit jaded? Maybe. But then again, I know more than I care to talk about. ;-P
-G
What exactly is a T4520? Certainly not a Sun/Oracle box. A quick google shows me that a T4520 is a toner cartridge from Toshiba... Might explain a bit.
Seriously though, it humors me how a company can blame a specific vendor like this and then state how they are actively looking for another vendor. Funny 'cuz the largest most successful companies in the world seem able to use Oracle every day without issue. How strange.
I have tried to deal with Chase several times and their IT department has always been suspect. For instance, it takes 3-5 days to reactivate an account where most banks can do this in real time. Anytime I have asked questions about IT policies and why processing takes so long it has generally taken me 4 calls and up to 10 forwards to get an answer from anyone with a clue.
In short it seems that Chase is generally clueless about their own IT, have conflicting policies and in general do not care much about making it work, much less making it work better.
I have worked with IT for several large banks and Chase by far seems to be the most behind and screwed of them.
Pretty much all of the companies out there will spend far more development time and money on "adding" and "enhancing" features to their products (be it hardware or software) than they will on making their hardware bullet proof, or writing software that actually checks that the inputs and outputs are valid.
If the hardware failed - then shouldn't both the hardware and the software have picked that up?
If the software failed - then I'd have expected a "sanity" check would have flagged that too.
So basically - something screwed up ... and didn't flag a warning. That's pretty much par for the course a lot more often than you'd think.
This is my take on this:
I am not going to blame Oracle for this misery - quite the opposite actually.
The story unfolds:
"El Reg is told that Oracle support staff pointed the finger of blame at an EMC SAN controller but that was given the all-clear on Monday night."
However:
"Monash subsequently posted that the outage was caused by corruption in an Oracle database which stored user profiles. Four files in the database were awry and this corruption was *replicated* in the hot backup."
Sweet, isn’t it? And this is what distinguishes a decent snapshot from replication - it will not inherit a corruption which just hit primary data, because it’s read only, full stop.
And the rest is obvious:
"Recovery was accomplished by restoring the database from a Saturday night backup, and then by reapplying *874,000* transactions during the Tuesday."
So my bottom line is:
Had they had properly implemented snapshot protection in place (e.g. on NetApp storage), the extent of this outage would quite likely have been substantially smaller, as a recovery to, say, an hour-old snapshot would mean replaying only a handful of logs compared to the 874,000 transactions collected between the Saturday backup and the Monday crash...
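Some back-of-envelope arithmetic on that. Only the 874,000 figure comes from the article; the 48-hour window (Saturday night to Monday night) and the perfectly even transaction rate are my own assumptions, and `transactions_to_replay` is a made-up helper. Real traffic would be lumpier, so treat the result as an order of magnitude at best.

```python
def transactions_to_replay(total_txns, window_hours, recovery_point_age_hours):
    """Estimate the replay volume, assuming a uniform transaction rate."""
    rate_per_hour = total_txns / window_hours
    return round(rate_per_hour * recovery_point_age_hours)


TOTAL_TXNS = 874_000  # from the article
WINDOW_HOURS = 48     # ASSUMPTION: Saturday night backup to Monday night crash

from_saturday_backup = transactions_to_replay(TOTAL_TXNS, WINDOW_HOURS, 48)
from_hourly_snapshot = transactions_to_replay(TOTAL_TXNS, WINDOW_HOURS, 1)
print(from_saturday_backup, from_hourly_snapshot)
```

Under those assumptions the hour-old recovery point cuts the replay work by a factor of 48, which is the point being made, whatever the true numbers were.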
More than likely, they are using EMC's SRDF (assuming it's on a Symmetrix) in asynchronous mode for replication. When that is done, a "snapshot" can be made on a BCV volume, which is then replicated to the secondary Symmetrix for playback onto another BCV, then to the R2 device. Since the BCV is a complete copy of the volume, there is a huge difference between doing that and making a simple snapshot on a NAS box.
And besides, I have never met a database administrator that DIDN'T jump and scream "IT'S THE STORAGE THAT DID IT!" when something went wrong. It's something in their genes that makes them blame the hardware.
NetApp? Puh-leeeze.
Hmm, interestingly enough Oracle internally uses petabytes of NetApp storage - more accurately in their Austin DC, where they use NetApp "NAS boxes" to e.g. host Oracle on Demand.
Saying that NetApp is "just a NAS" is a plain lie (they can actually do NAS and SAN), but interestingly enough, Oracle uses NFS on NetApp as a storage protocol of choice, because it works *best* for them.
EMC may have a lot of fancy replication tools (with even fancier acronyms), yet the plain fact is - they can't match NetApp's robust snapshotting technology, which addresses most of the day-to-day backup & recovery needs (including the example we are talking about here).
Labour would-be government (Our Tony): No top-up fee charges -> Labour Government: Top-up fee charges -> Extra Student loans -> More business for J P Morgan Chase -> Ha Ha! Who is J P Morgan's High Profile recruit a decade later? Coincidence?
Trebles all round, as Private Eye would say.