Toyota servers ran out of storage, crashed production at 14 plants in Japan

Toyota has revealed a server running out of disk space after botched maintenance was the cause of an outage that forced it to shut down 14 manufacturing plants across Japan last week. “The system malfunction was caused by the unavailability of some multiple servers (sic) that process parts orders,” states a company …

  1. An_Old_Dog Silver badge
    Headmaster

    Lost in Translation?

    From TFA: data that had accumulated in the database was deleted and organized

    How does one "organize" data which has been "deleted" from a database? (I'm presuming "organized" in this case means, "re-organized".)

    1. HereIAmJH

      Re: Lost in Translation?

      I read that as being dumbed down for Manglement. Sounds like a combination of purging data expanded the transaction logs and then re-indexing ran the storage into the ground. It's a fail for the DBA team, they didn't know the consequences of their actions and hadn't done proper resource planning.

      1. Spazturtle Silver badge

        Re: Lost in Translation?

        "they didn't know the consequences of their actions and hadn't done proper resource planning."

        Or maybe they knew they would need a new server soon and decided to demonstrate why.

        1. WonkoTheSane
          IT Angle

          Re: Lost in Translation?

          You're saying this may be the actions of BOFH-san?

          Coming soon to El Reg's "Who me?" thread (maybe).

        2. Anonymous Coward
          Anonymous Coward

          Re: Lost in Translation?

          I've done exactly this in the past. You know it'll run out of space, so to bring it forward you 'encourage' a collapse sooner, because it *is* the lesser of the evils

          Works every time.....

          *anonymous for obvious reasons

          1. ITMA Silver badge
            Devil

            Re: Lost in Translation?

            This is a technique I have been known to use at times based upon a rough translation of the Christmas song "Let it snow, let it snow, let it snow"

            "Let it fail, let it fail, let it fail".

            If management will just not grasp the consequences of doing nothing, do nothing and let it all crash and burn. After ensuring you have ample CYA (cover your ass) evidence to prove you warned them, and a well prepared plan for GYOOTS - Getting Yourself Out Of The Shit.

        3. HereIAmJH

          Re: Lost in Translation?

          Or maybe they knew they would need a new server soon and decided to demonstrate why.

          Seems like a destructive, and career limiting, way to make your point. DBA is a professional position. It's their job to keep data safe and available to the business. At the end of the day, in most IT organizations we make recommendations and then work with the tools that we have.

          1. el_oscuro

            Re: Lost in Translation?

            I am a DBA, that is the literal description of the job. Make sure you have backups, and more importantly, make sure you can restore from them. If you haven't practiced a restore in the last 6 months, you aren't doing your job.

    2. Anonymous Coward
      Anonymous Coward

      Re: Lost in Translation?

      How does one "organize" data which has been "deleted" from a database? (I'm presuming "organized" in this case means, "re-organized".)

      Also, I would posit it would be organiSed, but it appears Toyota speaks American rather than English :)

      1. MJB7

        Re: Lost in Translation?

        > Also, I would posit it would be organiSed, but it appears Toyota speaks American rather than English :)

        Or maybe they speak proper English, as recommended by the Oxford English Dictionary? en.en-gb-oxendict ftw!

        1. RegGuy1 Silver badge

          Re: Lost in Translation?

          Now, now. I know quite a lot of British (or even UK) citizens[1] read this site, and quite a few of them -- well the 'older' ones -- still think they are an exceptional bunch. Don't go upsetting them.

          [1] Technically subjects of his Majesty, the King. But hey...

    3. werdsmith Silver badge

      Re: Lost in Translation?

      How does one "organize" data which has been "deleted" from a database? (I'm presuming "organized" in this case means, "re-organized".)

      Some of the data was deleted.

      Some of the data was organised (possibly moving datafiles or logs to different volumes).

      Two separate actions allows sense to be made of the sentence.

    4. collinsl Bronze badge

      Re: Lost in Translation?

      Someone probably hit "reorganise pages before shrinking database" in M$ SQL

    5. Cris E

      Re: Lost in Translation?

      It is sooo much easier to reindex, run stats, reorg a table once the data has been removed. The system just flies.

    6. aerogems Silver badge
      Holmes

      Re: Lost in Translation?

      If they're using an ERP, most of which are very "delete averse", it could make sense. You don't delete things so much as you "mark" them for deletion. Which just sort of makes them not show up in a lot of reports, but the info is still floating around. Generally only a very select few individuals can actually delete anything from an ERP instance. At least assuming it's set up with anything even remotely resembling proper division of labor.

      1. mattaw2001

        Re: Lost in Translation?

        To be fair the number of times I have to "undelete" things is so common I don't delete but merely hide things until they are really old.

      2. Paul 87

        Re: Lost in Translation?

        Working for an ERP software developer, the main reason you do this isn't to recover the information, it's because an update is far faster than a delete action. For the most part you update the record to hide it when a user deletes things, and then later on run maintenance to clear out the fragments when it's quiet.

        Not to mention, can you imagine the mess the indices would get into if you had users deleting records constantly during the working day?

        1. HereIAmJH

          Re: Lost in Translation?

          it's because an update is far faster than a delete action

          Depends on the situation. Often an update is really an insert and delete. It may feel faster because it works around page locking. I.e., the updated row is inserted, the delete is dropped on the queue, and the result is passed back to the app. Then when the lock is released the old row will be deleted. Note that even a delete isn't removing the row, it's just removing it from the index and returning the space so that a new row can be written over top of it.

          Not to mention, can you imagine the mess the indices would get into if you had users deleting records constantly during the working day?

          Assuming we're talking traditional relational databases, deleting a row isn't a big deal. It will just leave a blank spot in the index that will be cleaned up when you optimize/reorg your indexes. On a non-clustered index it's just a pointer. A clustered index it is the actual row.
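
The "hide now, purge later" pattern discussed in this sub-thread might look roughly like the following. This is a minimal sketch using Python's built-in `sqlite3`, not anything from a real ERP: the table `orders`, the column `deleted_at`, and the function names are all assumptions made up for illustration.

```python
import sqlite3

def make_db():
    # Hypothetical table; "deleted_at" stays NULL while the row is live.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, part TEXT, deleted_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders (part) VALUES (?)", [("bolt",), ("nut",), ("washer",)]
    )
    conn.commit()
    return conn

def soft_delete(conn, order_id):
    # What the user sees as "delete" is only an UPDATE that flags the row.
    conn.execute(
        "UPDATE orders SET deleted_at = datetime('now') "
        "WHERE id = ? AND deleted_at IS NULL",
        (order_id,),
    )
    conn.commit()

def visible_orders(conn):
    # Reports and screens filter flagged rows out.
    rows = conn.execute(
        "SELECT part FROM orders WHERE deleted_at IS NULL ORDER BY id"
    )
    return [r[0] for r in rows]

def purge_deleted(conn):
    # Quiet-hours maintenance: physically remove the flagged rows.
    cur = conn.execute("DELETE FROM orders WHERE deleted_at IS NOT NULL")
    conn.commit()
    return cur.rowcount
```

The real delete still has to happen eventually; the pattern only moves it to a time when nobody is waiting on it.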

  2. Yet Another Anonymous coward Silver badge

    Not just a Japanese problem

    A similar thing happened at Morgan, where they had to wait until the foreman had finished the whole packet before they could have the back of it to write on.

  3. Pascal Monett Silver badge

    Out of space

    I'm not a DBA, but I do believe that those people keep an eye on that sort of thing, normally. And, if I'm not mistaken, there should be notifications about storage space, and regular reports. You don't just run out of space one morning.

    So, if all of that is true (and I have no reason to believe that Toyota DBAs are incapable of using such tools), then how on Earth can this have happened ?

    Could it be that the DBA was clamoring for the budget to augment storage and was being basically ignored ?

    If so, I don't think they'll ignore him again.

    1. The Oncoming Scorn Silver badge
      Coat

      Re: Out of space

      It's late, a few beers & it's not an easy song to adapt so I didn't get very far, so to the tune of "Oh What A Circus" from Evita....

      Oh no more Prius, oh what a blow

      Toyota systems have gone down

      Over the crash of an update applied at company Nippon

      Dishonorable staff got lazy

      Dossing about at day and working all night

      Falling over themselves to get the data restore done right

      Oh no more bZ4X's, they ain't gonna go

      When they're crashing the servers down

      With management demands that logs will be buried like Eva Peron

      It's quite a shutdown

      And bad for the company, but in a roundabout way

      We've at least made front page of The Register today

      Sorry not sorry.....

      1. spireite Silver badge
        Joke

        Re: Out of space

        I suspect beancounters saw the cost and said.....

        "We not prepared to pay that sort of Prius for the upgrade!

    2. HereIAmJH

      Re: Out of space

      I wouldn't necessarily let the DBA off the hook. If they are running a maintenance event to purge a large block of old data, the DBA should have been aware that the transaction logs would balloon from all the rows deleted. Storage usage will actually increase because deleting rows doesn't necessarily shrink the DB files. (You don't want to autoshrink and then re-grow when new rows are added; it fragments your data files on the disk.) But all the delete transactions end up in the transaction logs. If they were running tight on storage, they should have purged some rows, then run backups to empty the transaction logs. Rinse, repeat, over several days if possible.
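
The "purge some rows, back up, repeat" routine above could be sketched like this. It is an illustration only: SQLite stands in for whatever DBMS Toyota runs, the table `part_orders` and its `created` column are invented, and `backup_log` is a hypothetical hook where a real system would run its transaction log backup.

```python
import sqlite3

def purge_in_batches(conn, cutoff, batch_size, backup_log):
    """Delete rows older than `cutoff` a batch at a time.

    `backup_log` is a caller-supplied callable (hypothetical) that is run
    after each committed batch, so log space can be reclaimed before the
    next batch adds to it.
    """
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM part_orders WHERE id IN ("
            "SELECT id FROM part_orders WHERE created < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()              # keep each transaction small
        if cur.rowcount == 0:
            break                  # nothing old left to purge
        total += cur.rowcount
        backup_log()               # give the log a chance to be truncated
    return total
```

Something like `purge_in_batches(conn, "2022-01-01", 10000, run_log_backup)` would then be scheduled over several quiet windows rather than run as one giant delete.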

      1. werdsmith Silver badge

        Re: Out of space

        Yes, in fact transaction logs grow large and the process to truncate them and keep them in trim is the backup. The backup writes the data off to another location. But the larger than usual transaction logs use up more of the space in the backup target volume which eventually fails. So the transaction log backups don't have enough space to write to, so they start failing. So now nothing is truncating the transaction logs so they grow and use up their volume (or file size quota). When the transaction logs can't grow anymore the DBMS stops doing stuff.

        Or the transaction logs are set to a fixed max size and they hit that before the next periodic log backup.

        You need to alert not only on remaining available space, but also on the rate that space is being consumed.
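
A rough sketch of alerting on both remaining space and the rate of consumption, as suggested above. The `Sample` type, the thresholds and the straight-line extrapolation are all assumptions for illustration; a real monitoring tool will have its own way of expressing this.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    t_hours: float   # when the reading was taken
    used_gb: float   # space in use at that time

def hours_until_full(samples, capacity_gb):
    """Straight-line extrapolation from the first to the last sample.

    Returns None when usage is flat or shrinking.
    """
    first, last = samples[0], samples[-1]
    elapsed = last.t_hours - first.t_hours
    if elapsed <= 0:
        return None
    rate = (last.used_gb - first.used_gb) / elapsed   # GB per hour
    if rate <= 0:
        return None
    return (capacity_gb - last.used_gb) / rate

def should_alert(samples, capacity_gb, min_free_fraction=0.10, min_hours_left=24.0):
    # Alert if space is already low OR if it will run out soon at this rate.
    free_fraction = 1 - samples[-1].used_gb / capacity_gb
    eta = hours_until_full(samples, capacity_gb)
    return free_fraction < min_free_fraction or (eta is not None and eta < min_hours_left)
```

A volume that is only 60% full looks healthy on a free-space check alone, but if it gained 100 GB in the last two hours the second condition is the one that fires.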

      2. KSM-AZ

        Re: Out of space

        The geek that was the DBA was fired or moved on. His assistant didn't really have a handle on everything but took over at 1/2 the Salary cost, management was happy. A year or so later the new guy realizes he's grossly underpaid, asks for more money, is denied, and finds another gig elsewhere. Clerk who always wanted to be in IT takes over, fortunately everything is in good shape so things continue to run smoothly even though there is not a good understanding of all the processes. Over time, minor changes cause a disruption, nobody with the skills to notice, and it blows up.

        Or maybe, new DBA was hired to replace the guy who left, but has no idea the process exists, and nobody is complaining, and the error messages were being sent to an email address that no longer exists. . .

        Or it was outsourced to India, and we are in the process of training up a 4th set of contractor folk who are going to leave in a year or so. . .

        This crap happens all the time. Welcome to IT.

    3. CowHorseFrog Silver badge

      Re: Out of space

      Probably some genius management board decided to save thousands on a few disks...

    4. Anonymous Coward
      Anonymous Coward

      Re: Out of space

      I'm not a DBA, but I do believe that those people keep an eye on that sort of thing, normally. And, if I'm not mistaken, there should be notifications about storage space, and regular reports. You don't just run out of space one morning.

      What do "people", and especially "decision makers", do when they are told something that they don't want to hear?

      They will perform their finest workings to ... ignore it. If they are "decision makers", they ignore it with the flourish of sending it off to die "for the most urgent consideration" of some desolate committee staffed only with accountants and dementors!

      I saw a good one with a RAID 5 array storing many years of sensor and production data for windmills; each customer could fetch their data from a web portal. One disk started whinging, BOFH requests authorisation to replace the disk, PHB doesn't like it because it sounds like it might take a while and even cost money. Second disk starts whinging. BOFH is again on PHB to order something and get it done. Now, facing the costs of two enterprise-rated disks, PHB wants a report written to evaluate the specific need for storage solutions, to maybe aggregate the data in a "more secure, and standardised, corporate environment". Two weeks later one disk croaks, and replacements are ordered by BOFH.

      The very next day, the second disk dies, right before the spares arrive. Poof! Six years of customer-facing data gone!! Heheheheeh!

      Of Course there is no tape backup "because RAID".
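
Why the second dead disk was fatal: RAID 5 keeps one parity block per stripe, so it can rebuild exactly one missing member. A toy illustration with XOR parity over three made-up data blocks (real arrays rotate parity across disks and work on much larger stripes):

```python
from functools import reduce

def xor_blocks(blocks):
    # Byte-wise XOR of equally sized blocks.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks plus their parity block (illustrative values).
data = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data)

# One disk lost: XOR of the survivors and the parity rebuilds it.
recovered = xor_blocks([data[0], data[2], parity])

# Two disks lost: the survivors only yield the XOR of the two missing
# blocks, which is not enough to recover either of them.
leftover = xor_blocks([data[2], parity])
```

Here `recovered` equals the lost block, while `leftover` is merely the two missing blocks XORed together. Hence "RAID is not a backup".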

      1. HereIAmJH

        Re: Out of space

        Of Course there is no tape backup "because RAID".

        Sounds like they need to shop for a new BOFH too. Because even a PFY should know that RAID is not a backup.

        1. Richard 12 Silver badge

          Re: Out of space

          If PHB won't buy a disk, then they sure as heck aren't buying a tape deck.

        2. 43300 Silver badge

          Re: Out of space

          And if the building burns down, a RAID isn't going to help!

      2. werdsmith Silver badge

        Re: Out of space

        Tape backup. How long ago did you retire?

        1. 43300 Silver badge

          Re: Out of space

          We still use them - you can get a lot of data onto an LTO8 tape, and there's no clearer air-gap than putting the tape in its box and taking it off-site.

        2. Anonymous Coward
          Anonymous Coward

          Re: Out of space

          Tape backup. How long ago did you retire?

          You've obviously never been in a bank's data centre. That huge room-within-the-room device in the corner is their tape backup.

        3. An_Old_Dog Silver badge

          Re: Out of space

          What are you using for enterprise-level backup media?

      3. Colin Bull 1

        I have said it before ...

        I have said it before and will say it again ..

        http://www.baarf.dk/BAARF/BAARF2.html

        :-)

    5. InsaneGeek

      Re: Out of space

      My guess is this: let's say you have snapshots running on your storage. A DBA anticipates a large load of new data coming in and decides to move around and delete a bunch of existing data and rebuild indexes, a fairly normal and expected action. However, that DBA doesn't understand things end to end, not knowing that deleting in the database doesn't free the space in the array, and that all the new inbound data of the rebuilt indexes will consume even more space, filling the array up. You are now in a situation with a full array, in the middle of rebuilding database indexes, and you are on a type of array where deleting snapshots takes a long time.

    6. AMBxx Silver badge

      Re: Out of space

      Could even be running in a virtual environment with the database's drives being thin-provisioned. The DBA would be unaware of any space limitation until the underlying hardware ran out of space.

      (BI developer pushing the problem down to infrastructure)
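
The thin-provisioning trap can be shown with a little arithmetic. The numbers and volume names below are invented for illustration; the point is only that each guest sees free space against its provisioned size, while the physical pool underneath may be nearly full.

```python
def overcommit_report(physical_gb, volumes):
    """volumes maps a name to (provisioned_gb, written_gb)."""
    provisioned = sum(p for p, _ in volumes.values())
    written = sum(w for _, w in volumes.values())
    return {
        # How much more has been promised than physically exists.
        "overcommit_ratio": provisioned / physical_gb,
        # What the storage admin sees.
        "physical_free_gb": physical_gb - written,
        # What each guest (and its DBA) sees.
        "guest_visible_free_gb": {
            name: p - w for name, (p, w) in volumes.items()
        },
    }

report = overcommit_report(
    physical_gb=1000,
    volumes={"db": (800, 600), "backup": (800, 350)},
)
```

In this made-up case the database volume reports 200 GB free, yet the pool has only 50 GB left, so a 100 GB index rebuild would fail even though the guest thinks it has room.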

      1. An_Old_Dog Silver badge

        Re: Out of space [minimal equipment provisioning and subsystem-turtles all the way down]

        These sorts of problems are worsened by cost-driven, bare-functional-minimum-type equipment provisioning, and by too many don't-worry-your-pretty-head-about-how-it-works subsystems. You don't have "discs", you have "vdisks" (or whatever they're called in your SAN vendor's terminology) in a SAN, whose underlying workings you frequently cannot know due to secret-sauce algorithms and software. It's subsystem-turtles all the way down.

        You have to have some excess disc space, channel capacity, and CPU capacity to use to shuttle data around when you run into unexpected problems, or else you'll end up spending a long time fixing/recovering from these sorts of problems.

  4. ChoHag Silver badge
    FAIL

    Even an executive should be able to do this sum, now that they have a practical example:

    money saved by penny pinching critical hardware < cost of global production going down for a day

    1. CowHorseFrog Silver badge

      Executives are the new tax on everyone... demanding lots of money and fucking everything up all the time and costing even more money.

      1. Anonymous Coward
        Anonymous Coward

        That's why you called them 'executives' rather than 'sane'..

    2. Howard Sway Silver badge

      But so many never learn, and never will, as learning implies mistakes have been made. As for the PHBs, they can't learn either, because that would involve admissions of failure to the execs, which would see the PHB being blamed. Better to just keep passing the blame downwards instead of taking responsibility and fixing systemic problems.

  5. Prst. V.Jeltz Silver badge

    a master of just-in-time manufacturing

    Can someone remind me what the point of this system is ? In my mind it translates as "Making your process as flimsy and unresilliant as possible"

    .

    .

    disclaimer:

    The word "unresilient" is not a real word in English. If you are trying to express the opposite of "resilient," you could use words like "fragile," "weak," or "easily discouraged." For example, "The plants in this garden are very unresilient, so they need extra care and protection.".

    1. Usually 1027309

      It's a pretty lengthy subject, you should defo read up on it, but basically you save money by reducing inventory waiting to be processed/shipped, eliminate waste (quality problems, time stock is sat waiting, over production) and get an overall more productive manufacturing system.

      It's how Toyota beat US car makers back in the day, and has been applied to many other sectors, heavily influencing software development (e.g. moving from big batch development deployed once every 6 months, to single piece flow deployed multiple times a day)

      "Making your process as flimsy and unresilliant as possible"

      It does quite the opposite, it enables and encourages workers to make processes less flimsy and more resilient..

      1. Anonymous Coward
        Anonymous Coward

        JITS only work when the supply doesn't get broken for any extended time.

        Examples: We've had customers shut down because customs decided they couldn't be bothered to process components for 4 weeks, our heat treatment company had issues, pushing supplies to us back 2 weeks, lorries crash scrapping 50% of parts, ferries held up for days.

        And let's not forget the current chip shortage when they cancelled all the orders.

        It works when it works, but is chaos when it doesn't.

      2. Gene Cash Silver badge

        Another thing, at least in the US, is you get seriously burned on taxes for spare inventory

        See the "Thor Tools" decision: https://en.wikipedia.org/wiki/Thor_Power_Tool_Co._v._Commissioner

        It's why books still go out of print, because publishers can't afford to keep them around as inventory.

      3. mirachu

        Workers being able to make processes more resilient isn't a feature of JIT. It's another system Toyota uses.

      4. 43300 Silver badge

        "It does quite the opposite, it enables and encourages workers to make processes less flimsy and more resilient.."

        But it doesn't take much supply-chain disruption to bugger it up. One part suddenly becoming unavailable can stop the job.

    2. Boris the Cockroach Silver badge

      JIT removes the need to hold stock at your manufacturing plant, thus either reducing 'useless' floor space or being able to use that floor space for more production

      It also decreases cost as you only pay for what's delivered rather than 40 000 widgets on a bulk order.

      It also removes the 'stock' cost and places it on your suppliers, as we can't make 40 000 of your widgets all at once while keeping 4 other customers supplied with 40 000 widgets each... so we tend to batch produce them, say 10 000 per month, ready for the JIT delivery, which would be something like 1000 every 2 days (this is agreed in the contract.... because we can't cope with sudden changes in demand, like another supplier going down and then having the delivery quantities increased to 1000 every day to make up for it).

      JIT when it works is a good system.... when it doesn't work.... it's chaos... as one of our customers found out when their assembly line went down. We still have a contract to fill, because we have nowhere to store the stuff either, because we're used to it flying out the door every other day.....

      PS I hate some of our customers....

  6. Dunstan Vavasour
    Facepalm

    Nostalgia Alert

    I suppose it's good to hear a Golden Oldie every now and then. "Out of disc space" takes me back to my youth.

  7. terry 1

    Ahh, the old days

    I used to work in a car factory and it was always a joy when JIT became no parts. The numbers left (eg, engines) would race up the track and you would 'work back' to gain a little extra time, then off to make a cuppa.

    I would imagine, however, that the production staff for Toyota would get out the brooms and dusters and clean the shop floor if their production stopped.

  8. Kev99 Silver badge

    And Toyota and other car companies want more and more computers in vehicles? OY VEY!!

  9. Munchausen's proxy
    Facepalm

    From part of that description, it sounded like they might have had Prod and Backup on two separate VMs on the same host, maybe with guest storage limits calculated a little too precisely?

  10. LateAgain

    I wonder how long the urgent requests for a new server have been ignored.

  11. BridgesRBetter

    A single server disk full issue brings down 14 production plants all over the country? Whatever happened to their somewhat famous KanBan system? Can't we have better IT system & network design than a single point of failure idling 14 production plants?

    1. CowHorseFrog Silver badge

      Stop and think a little... who has been preaching about how wonderful the Toyota system is? Their parasitic leadership, overhyping their supposed stewardship... the same leadership that couldn't figure out that spending a few dollars on more disks might be a smarter thing to do than bearing the significantly more expensive cost of freezing production.

  12. Anonymous Coward
    Anonymous Coward

    backup systems

    "Since these servers were running on the same system, a similar failure occurred in the backup function"

    Ah yes, been there, done that. You should never have a distinct backup system doing nothing, right ? It's a waste for bean counters.

    So let it be the same as the live system or even better, put all sorts of critical apps on it to make sure it will fail when the need for it will arise !

    Curious to see the cost comparison between this situation and the ideal one with a real backup system ...

    1. Cxwf

      Re: backup systems

      The description here is slightly ambiguous, but I think what they mean was, the backup system was physically distinct but not logically distinct. It was specced to be the exact same system but running on a different server box, so that in the event a Hardware failure took out System A, System B could immediately step in and take over, because it's "the same system" just on a different box. But in this case the hardware was fine, and a software failure took out System A, and when System B tried to step in...it immediately suffered the same software failure, exactly because it's "the same system" running the same actions.

      (Really "software failure" here means "design failure that triggered a software failure when faced with hardware limitations", but that's too wordy and gets us to the same result anyway.)
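
The "physically distinct but not logically distinct" reading above can be shown with a toy model. Nothing here reflects Toyota's actual setup: the `Node` class, capacities and workload are invented, and the only point is that a standby fed the same operations with the same limits fails at the same step.

```python
class DiskFull(Exception):
    pass

class Node:
    def __init__(self, name, capacity_gb):
        self.name = name
        self.capacity_gb = capacity_gb
        self.used_gb = 0

    def apply(self, size_gb):
        # Refuse any operation that would exceed this node's storage.
        if self.used_gb + size_gb > self.capacity_gb:
            raise DiskFull(self.name)
        self.used_gb += size_gb

def replay(nodes, workload):
    """Feed every operation to every surviving node, as a mirrored pair would.

    Returns the names of the nodes that failed, in the order they failed.
    """
    failed = []
    for size_gb in workload:
        for node in nodes:
            if node.name in failed:
                continue
            try:
                node.apply(size_gb)
            except DiskFull:
                failed.append(node.name)
    return failed
```

With two identical 100 GB nodes and a workload of `[40, 40, 40]`, both fail on the third operation. Give the standby more headroom and only the primary goes down, which is the difference between redundancy against hardware faults and redundancy against a shared design limit.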
