Lost in Translation?
From TFA: "data that had accumulated in the database was deleted and organized"
How does one "organize" data which has been "deleted" from a database? (I'm presuming "organized" in this case means, "re-organized".)
Toyota has revealed a server running out of disk space after botched maintenance was the cause of an outage that forced it to shut down 14 manufacturing plants across Japan last week. “The system malfunction was caused by the unavailability of some multiple servers (sic) that process parts orders,” states a company …
I read that as being dumbed down for Manglement. Sounds like a combination of purging data, which expanded the transaction logs, and then re-indexing, which ran the storage into the ground. It's a fail for the DBA team: they didn't know the consequences of their actions and hadn't done proper resource planning.
This is a technique I have been known to use at times based upon a rough translation of the Christmas song "Let it snow, let it snow, let it snow"
"Let it fail, let it fail, let it fail".
If management just will not grasp the consequences of doing nothing, do nothing and let it all crash and burn. After ensuring you have ample CYA (cover your ass) evidence to prove you warned them, and a well-prepared plan for GYOOTS - Getting Yourself Out Of The Shit.
Or maybe they knew they would need a new server soon and decided to demonstrate why.
Seems like a destructive, and career-limiting, way to make your point. DBA is a professional position. It's their job to keep data safe and available to the business. At the end of the day, in most IT organizations we make recommendations and then work with the tools that we have.
How does one "organize" data which has been "deleted" from a database? (I'm presuming "organized" in this case means, "re-organized".)
Some of the data was deleted.
Some of the data was organised (possibly moving datafiles or logs to different volumes).
Two separate actions allow sense to be made of the sentence.
If they're using an ERP, most of which are very "delete-averse", it could make sense. You don't delete things so much as you "mark" them for deletion, which just sort of makes them not show up in a lot of reports, but the info is still floating around. Generally only a very select few individuals can actually delete anything from an ERP instance. At least assuming it's set up with anything even remotely resembling proper division of labor.
Working for an ERP software developer: the main reason you do this isn't to recover the information, it's because an update is far faster than a delete action. Thus for the most part you update the record to hide it when a user deletes things, and then later on run maintenance to clear out the fragments when it's quiet.
Not to mention, can you imagine the mess the indices would get into if you had users deleting records constantly during the working day?
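For anyone who hasn't worked with one, the pattern looks roughly like this. A minimal sketch in Python with sqlite3 so it's self-contained; the table and column names (orders, is_deleted) are invented for illustration, and real ERPs bury this under far more machinery:

```python
import sqlite3

# Toy table with a soft-delete flag (names are made up for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT, is_deleted INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO orders (payload) VALUES (?)", [(f"order {i}",) for i in range(1000)]
)
conn.commit()

def user_delete(order_id: int) -> None:
    # What the user-facing "delete" actually does: a cheap UPDATE that hides the row.
    conn.execute("UPDATE orders SET is_deleted = 1 WHERE id = ?", (order_id,))
    conn.commit()

def overnight_maintenance(batch_size: int = 100) -> None:
    # Quiet-hours job: physically remove the flagged rows, a small batch at a time.
    while True:
        cur = conn.execute(
            "DELETE FROM orders WHERE id IN "
            "(SELECT id FROM orders WHERE is_deleted = 1 LIMIT ?)",
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            break

user_delete(42)
overnight_maintenance()
```

The daytime "delete" stays a cheap flag update; the real deletes happen in small batches during quiet hours, which is also when the indexes get tidied up.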
it's because an update is far faster than a delete action
Depends on the situation. Often an update is really an insert and delete. It may feel faster because it works around page locking. IE, the updated row is inserted, the delete is dropped on the queue, and the result is passed back to the app. Then when the lock is released the old row will be deleted. Note that even a delete isn't removing the row, it's just removing it from the index and returning the space so that a new row can be written over top of it.
Not to mention, can you imagine the mess the indices would get into if you had users deleting records constantly during the working day?
Assuming we're talking traditional relational databases, deleting a row isn't a big deal. It will just leave a blank spot in the index that will be cleaned up when you optimize/reorg your indexes. On a non-clustered index it's just a pointer; on a clustered index it is the actual row.
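For the curious, that optimize/reorg step is usually just a scheduled job. A hedged sketch of what it might look like on SQL Server, driven from Python via pyodbc; the connection string and thresholds are assumptions for illustration, while the DMV and ALTER INDEX statements are standard T-SQL:

```python
import pyodbc  # assumes an ODBC driver for SQL Server is installed

# Hypothetical connection details.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dbhost;DATABASE=erp;Trusted_Connection=yes"
)
cur = conn.cursor()

# Find indexes that have accumulated gaps from day-to-day deletes and updates.
cur.execute("""
    SELECT OBJECT_SCHEMA_NAME(ips.object_id) AS schema_name,
           OBJECT_NAME(ips.object_id)        AS table_name,
           i.name                            AS index_name,
           ips.avg_fragmentation_in_percent  AS frag
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN sys.indexes AS i
      ON i.object_id = ips.object_id AND i.index_id = ips.index_id
    WHERE i.name IS NOT NULL
      AND ips.avg_fragmentation_in_percent > 10
""")

for schema, table, index, frag in cur.fetchall():
    # Common rule of thumb: reorganize light fragmentation, rebuild heavy.
    action = "REBUILD" if frag > 30 else "REORGANIZE"
    cur.execute(f"ALTER INDEX [{index}] ON [{schema}].[{table}] {action}")
    conn.commit()
```

REORGANIZE is the gentle online option, REBUILD the heavier one; either way it's routine housekeeping, not an emergency.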
I'm not a DBA, but I do believe that those people keep an eye on that sort of thing, normally. And, if I'm not mistaken, there should be notifications about storage space, and regular reports. You don't just run out of space one morning.
So, if all of that is true (and I have no reason to believe that Toyota DBAs are incapable of using such tools), then how on Earth can this have happened?
Could it be that the DBA was clamoring for the budget to augment storage and was being basically ignored?
If so, I don't think they'll ignore him again.
It's late, a few beers & it's not an easy song to adapt so I didn't get very far, so to the tune of "Oh What A Circus" from Evita....
Oh no more Prius, oh what a blow
Toyota systems have gone down
Over the crash of an update applied at company Nippon
Dishonorable staff got lazy
Dossing about at day and working all night
Falling over themselves to get the data restore done right
Oh no more bZ4X's, they ain't gonna go
When they're crashing the servers down
With management demands that logs will be buried like Eva Peron
It's quite a shutdown
And bad for the company, but in a roundabout way
We've at least made front page of The Register today
Sorry not sorry.....
I wouldn't necessarily let the DBA off the hook. If they are running a maintenance event to purge a large block of old data, the DBA should have been aware that the transaction logs would balloon from all the rows deleted. Storage usage will actually increase, because deleting rows doesn't necessarily shrink the DB files (you don't want to autoshrink and then re-grow when new rows are added; it fragments your data files on the disk), but all the delete transactions end up in the transaction logs. If they were running tight on storage, they should have purged some rows, then run backups to empty the transaction logs. Rinse, repeat, over several days if possible.
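In SQL Server terms, that rinse-and-repeat purge looks something like the sketch below. The server, database, table, retention cutoff, batch size and backup path are all invented for illustration, and it assumes the FULL recovery model, where it's the log backup that actually lets log space be reused:

```python
import time
import pyodbc  # assumes an ODBC driver for SQL Server is installed

# Hypothetical connection; autocommit because BACKUP statements can't run
# inside a transaction.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dbhost;DATABASE=parts;Trusted_Connection=yes",
    autocommit=True,
)
cur = conn.cursor()

BATCH = 50_000  # small enough that the log only grows by one batch at a time

while True:
    # Purge one slice of old rows (hypothetical table and retention rule).
    cur.execute(
        "DELETE TOP (?) FROM dbo.parts_orders "
        "WHERE order_date < DATEADD(year, -5, GETDATE())",
        BATCH,
    )
    if cur.rowcount == 0:
        break  # nothing left to purge

    # Back up the log before the next slice; this is what allows truncation.
    cur.execute(r"BACKUP LOG parts TO DISK = N'\\backupbox\sql\parts_log.trn'")
    while cur.nextset():  # drain BACKUP's informational result sets
        pass

    time.sleep(5)  # spread the purge out; over several days if need be
```

Parameterising TOP keeps each transaction, and therefore each burst of log growth, bounded.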
Yes, in fact transaction logs grow large, and the process that truncates them and keeps them in trim is the backup. The backup writes the data off to another location. But the larger-than-usual transaction logs use up more of the space in the backup target volume, which eventually fills. So the transaction log backups don't have enough space to write to, so they start failing. So now nothing is truncating the transaction logs, so they grow and use up their volume (or file size quota). When the transaction logs can't grow anymore the DBMS stops doing stuff.
Or the transaction logs are set to a fixed max size and they hit that before the next periodic log backup.
You need to alert not only on remaining available space, but also on the rate at which space is being consumed.
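A minimal sketch of that kind of rate-based alert, in plain Python; the volume path, thresholds and the alert hook are placeholders:

```python
import shutil
import time

VOLUME = "/var/opt/mssql"      # hypothetical data/log volume
SAMPLE_INTERVAL = 15 * 60      # seconds between samples
MIN_FREE_BYTES = 50 * 2**30    # absolute floor: 50 GiB
MIN_DAYS_LEFT = 7              # alert if we'd fill up within a week

def alert(msg: str) -> None:
    print(f"ALERT: {msg}")     # stand-in for mail/pager/monitoring hook

prev_used = shutil.disk_usage(VOLUME).used
while True:
    time.sleep(SAMPLE_INTERVAL)
    usage = shutil.disk_usage(VOLUME)
    growth = usage.used - prev_used          # bytes consumed this interval
    prev_used = usage.used

    if usage.free < MIN_FREE_BYTES:
        alert(f"{VOLUME}: only {usage.free / 2**30:.1f} GiB free")

    if growth > 0:
        seconds_left = usage.free / (growth / SAMPLE_INTERVAL)
        if seconds_left < MIN_DAYS_LEFT * 86400:
            alert(f"{VOLUME}: filling in ~{seconds_left / 86400:.1f} days at current rate")
```

Real monitoring stacks do this for you, but the point stands: "X GB free" means nothing without knowing how fast X is shrinking.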
The geek that was the DBA was fired or moved on. His assistant didn't really have a handle on everything but took over at half the salary cost; management was happy. A year or so later the new guy realizes he's grossly underpaid, asks for more money, is denied, and finds another gig elsewhere. The clerk who always wanted to be in IT takes over; fortunately everything is in good shape, so things continue to run smoothly even though there is not a good understanding of all the processes. Over time, minor changes cause a disruption, nobody with the skills to notice, and it blows up.
Or maybe, new DBA was hired to replace the guy who left, but has no idea the process exists, and nobody is complaining, and the error messages were being sent to an email address that no longer exists. . .
Or it was outsourced to India, and we are in the process of training up a 4th set of contractor folk who are going to leave in a year or so. . .
This crap happens all the time. Welcome to IT.
I'm not a DBA, but I do believe that those people keep an eye on that sort of thing, normally. And, if I'm not mistaken, there should be notifications about storage space, and regular reports. You don't just run out of space one morning.
What does "people", and especially "decision makers" do when they are told something that they don't want to hear?
They will perform their finest workings to ... ignore it. If they are "decision makers" they ignore it with the flourish of sending it off to die in for the most urgent consideration of some desolate comittee staffed only with accountants and dementors!
I saw a good one with a RAID 5 array storing many years of sensor and production data for windmills; each customer could fetch their data from a web portal. One disk started whinging, BOFH requests authorisation to replace the disk, PHB doesn't like it because it sounds like it might take a while and even cost money. Second disk starts whinging. BOFH is again on PHB to order something and get it done. Now, facing the costs of two enterprise-rated disks, PHB wants a report written to evaluate the specific need for storage solutions, to maybe aggregate the data into a "more secure, and standardised, corporate environment". Two weeks later one disk croaks, and replacements are ordered by BOFH.
The very next day, the second disk dies, right before the spares arrive. Poof! Six years of customer-facing data gone!! Heheheheeh!
Of course there is no tape backup, "because RAID".
My guess is this: let's say you have snapshots running on your storage. A DBA anticipates a large load of new data coming in and decides to move data around, delete a bunch of existing data and rebuild indexes, a fairly normal and expected action. However, that DBA doesn't understand things end to end, not knowing that deleting in the database doesn't free the space in the array, and that all the new inbound data plus the rebuilt indexes will consume even more space, filling the array up. You are now in a situation with a full array, in the middle of rebuilding database indexes, on a type of array where deleting snapshots takes a long time.
These sorts of problems are worsened by cost-driven, bare-functional-minimum-type equipment provisioning, and by too many don't-worry-your-pretty-head-about-how-it-works subsystems. You don't have "discs", you have "vdisks" (or whatever they're called in your SAN vendor's terminology) in a SAN, whose underlying workings you frequently cannot know due to secret-sauce algorithms and software. It's subsystem-turtles all the way down.
You have to have some excess disc space, channel capacity, and CPU capacity to use to shuttle data around when you run into unexpected problems, or else you'll end up spending a long time fixing/recovering from these sorts of problems.
But so many never learn, and never will, as learning implies mistakes have been made. As for the PHBs, they can't learn either, because that would involve admissions of failure to the execs, which would see the PHB being blamed. Better to just keep passing the blame downwards instead of taking responsibility and fixing systemic problems.
a master of just-in-time manufacturing
Can someone remind me what the point of this system is? In my mind it translates as "Making your process as flimsy and unresilient as possible"
.
.
disclaimer:
The word "unresilient" is not a real word in English. If you are trying to express the opposite of "resilient," you could use words like "fragile," "weak," or "easily discouraged." For example, "The plants in this garden are very unresilient, so they need extra care and protection.".
It's a pretty lengthy subject, you should defo read up on it, but basically you save money by reducing inventory waiting to be processed/shipped, eliminate waste (quality problems, time stock is sat waiting, over-production) and get an overall more productive manufacturing system.
It's how Toyota beat US car makers back in the day, and it has been applied to many other sectors, heavily influencing software development (e.g. moving from big-batch development deployed once every 6 months, to single-piece flow deployed multiple times a day).
"Making your process as flimsy and unresilliant as possible"
It does quite the opposite: it enables and encourages workers to make processes less flimsy and more resilient.
JIT only works when the supply doesn't get broken for any extended time.
Examples: we've had customers shut down because customs decided they couldn't be bothered to process components for 4 weeks; our heat treatment company had issues, pushing supplies to us back 2 weeks; lorries crashing, scrapping 50% of parts; ferries held up for days.
And let's not forget the current chip shortage when they cancelled all the orders.
It works when it works, but is chaos when it doesn't.
Another thing, at least in the US, is you get seriously burned on taxes for spare inventory
See the "Thor Tools" decision: https://en.wikipedia.org/wiki/Thor_Power_Tool_Co._v._Commissioner
It's why books still go out of print, because publishers can't afford to keep them around as inventory.
JIT removes the need to hold stock at your manufacturing plant, thus either reducing 'useless' floor space or freeing that floor space for more production.
It also decreases cost, as you only pay for what's delivered rather than 40 000 widgets on a bulk order.
It also removes the 'stock' cost and places it on your suppliers: we can't make 40 000 of your widgets all at once while keeping 4 other customers supplied with 40 000 widgets each... so we tend to batch produce them, say 10 000 per month, ready for the JIT delivery, which would be something like 1000 every 2 days (this is agreed in the contract.... because we can't cope with sudden changes in demand, like another supplier going down and then having the delivery quantities increased to 1000 every day to make up for it).
JIT when it works is a good system.... when it doesn't work.... it's chaos... as one of our customers found out when their assembly line went down. We still have a contract to fill, and we have nowhere to store the stuff either, because we're used to it flying out the door every other day.....
PS I hate some of our customers....
I used to work in a car factory and it was always a joy when JIT became no parts. The numbers left (eg, engines) would race up the track and you would 'work back' to gain a little extra time, then off to make a cuppa.
I would imagine, however, that the production staff for Toyota would get out the brooms and dusters and clean the shop floor if their production stopped.
Stop and think a little... who has been preaching about how wonderful the Toyota system is? Their parasitic leadership, overhyping their supposed stewardship... the same leadership that couldn't figure out that spending a few dollars on more disks might be a smart thing to do rather than the significantly more expensive cost of freezing production.
"Since these servers were running on the same system, a similar failure occurred in the backup function"
Ah yes, been there, done that. You should never have a distinct backup system doing nothing, right? It's a waste for the bean counters.
So let it be the same as the live system, or even better, put all sorts of critical apps on it to make sure it will fail when the need for it arises!
Curious to see the cost comparison between this situation and the ideal one with a real backup system...
The description here is slightly ambiguous, but I think what they mean was: the backup system was physically distinct but not logically distinct. It was specced to be the exact same system but running on a different server box, so that in the event a hardware failure took out System A, System B could immediately step in and take over, because it's "the same system" just on a different box. But in this case the hardware was fine, and a software failure took out System A, and when System B tried to step in... it immediately suffered the same software failure, exactly because it's "the same system" running the same actions.
(Really "software failure" here means "design failure that triggered a software failure when faced with hardware limitations", but that's too wordy and gets us to the same result anyway.)