Trail of Bits security peeps emit tool to weaponize Python's insecure pickle files to hopefully now get everyone's attention

Evan Sultanik, principal computer security researcher with Trail of Bits, has unpacked the Python world’s pickle data format and found it distasteful. He is not the first to do so, and acknowledges as much, noting in a recent blog post that the computer security community developed a disinclination for pickling – a binary …

  1. jake Silver badge
    Pint

    Perfect.

    Pleasing.

    Perfunctory potable?

  2. Pascal Monett Silver badge
    Stop

    "they'll think about adding additional warnings"

    Oh, that's just great.

    There's already a warning nobody's paying any attention to. The solution is obviously to add another warning.

    Have the maintainers of Python been bitten by the worm of administrative thinking? You need to lock that functionality down. You need to ensure that pickled files are encrypted or something. You need to bake security into it somehow. I don't know, I don't have the answer, but just slapping another warning on and calling it a day is not the solution.

    1. Tom 38

      Re: "they'll think about adding additional warnings"

      The maintainers of Python don't need to do anything; pickling is a great way of persisting private blobs of data. The maintainers of PyTorch definitely need to do something because they are using it as the primary method of data interchange between users. That's the messed up part.

    2. Brewster's Angle Grinder Silver badge

      pwned by default

      "You need to ensure that pickled files are encrypted or something."

      I'm not a python user, but it sounds like the problem isn't lack of encryption. The problem seems to be the format can contain automatically executed code; they're doing the equivalent of a javascript programmer parsing a JSON file with eval().

      There's even a hint they might be using these hooks, when all they are doing is transferring data.
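
A minimal sketch of the eval() comparison made above, using only the standard library. The class name `Payload` is hypothetical and the expression is deliberately harmless; the point is that `pickle.loads` calls whatever callable the stream names and hands back its result, not an instance of the pickled class.

```python
import pickle


class Payload:
    """Hypothetical object whose pickled form is 'call eval on this string'."""

    def __reduce__(self):
        # pickle records this (callable, args) pair instead of the object's state.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading runs eval("6 * 7"); no Payload instance is ever reconstructed.
result = pickle.loads(blob)
print(result)
```

Swapping the harmless expression for something hostile is all an attacker would need to do, which is why the format is only suitable for data you produced yourself.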

      1. vtcodger Silver badge

        Re: pwned by default

        Upvoted because I'm pretty sure you're right. It'd be like .zip or .tgz files being able to run code from their package and doing so by default as opposed to only after asking or, in a script, in response to an enabling command line parameter.

        from the Python 3.9 documentation (in red)

        Warning

        The pickle module is not secure. Only unpickle data you trust.

        It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.

        Consider signing data with hmac if you need to ensure that it has not been tampered with.

        Safer serialization formats such as json may be more appropriate if you are processing untrusted data. See Comparison with json.

        Probably not a super situation for files distributed to/by world+dog.
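
The hmac suggestion in the quoted warning can be sketched roughly as follows. The helper names `dump_signed` and `load_signed` and the key are assumptions for illustration. Note that this only helps when sender and receiver share a secret key, so it does little for files published to the world at large.

```python
import hashlib
import hmac
import pickle

# Hypothetical shared secret; in practice this must be kept private.
SECRET_KEY = b"replace-with-a-real-secret"
TAG_LENGTH = hashlib.sha256().digest_size  # 32 bytes


def dump_signed(obj, key=SECRET_KEY):
    """Pickle obj and prepend an HMAC-SHA256 tag over the pickle bytes."""
    payload = pickle.dumps(obj)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def load_signed(blob, key=SECRET_KEY):
    """Verify the tag first; unpickle only if it matches."""
    tag, payload = blob[:TAG_LENGTH], blob[TAG_LENGTH:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)
```

The verification happens before `pickle.loads` is reached, so a tampered file is rejected without any of its contents being interpreted.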

        1. Brewster's Angle Grinder Silver badge
          Joke

          Re: pwned by default

          "Probably not a super situation for files distributed to/by world+dog."

          Don't be a pessimist. Word had autorun macros for years (and so did CDs and USB drives) and none of that ever caused a problem. Who's going to be interested in a few high value researchers and their carefully curated hoard of personal data?

      2. Charlie Clark Silver badge

        Re: pwned by default

        It's actually wrong to think of pickles as data because it's actually the serialisation format for Python objects and used, for example, in multiple process environments. But, as objects, they can most definitely contain executable code.

        They are a necessary evil but as such, they should always be used with caution and never for data transfer, for which they are also too fragile.

        1. Brewster's Angle Grinder Silver badge

          A fly in an ice cube in a microwave.

          I can't think of a use case where I'd want to serialize and deserialize live code.

          When I persist data, it's deserialized and validated, and then the objects are reconstructed - not because I'm worried about being pwned, but because I want to know I'm in a known good state with all my invariants intact. ("Have you tried turning it off and turning it on again?")

          Okay, maybe I'm a bit sloppy at times and a certain pragmatism creeps in. But the intent is there (and there are always "fixme" comments in the code...)
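
The deserialize, validate, then reconstruct approach described above might look something like this with JSON. The `Reading` type and its fields are made up for the example; the idea is that objects are rebuilt through their normal constructor so invariants are checked on the way in.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """Hypothetical record type; invariants are enforced at construction."""

    sensor: str
    value: float

    def __post_init__(self):
        if not isinstance(self.sensor, str) or not self.sensor:
            raise ValueError("sensor must be a non-empty string")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(self.value, bool) or not isinstance(self.value, (int, float)):
            raise ValueError("value must be a number")


def load_readings(text):
    """Parse JSON text, check its shape, and rebuild Reading objects."""
    raw = json.loads(text)
    if not isinstance(raw, list):
        raise ValueError("expected a JSON array")
    readings = []
    for item in raw:
        if not isinstance(item, dict) or set(item) != {"sensor", "value"}:
            raise ValueError("unexpected record shape")
        readings.append(Reading(item["sensor"], item["value"]))
    return readings
```

Nothing in the file can name a callable, so the worst a bad file can do here is fail validation.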

          1. doublelayer Silver badge

            Re: A fly in an ice cube in a microwave.

            It usually comes down to laziness. There's probably harmless laziness, like using pickle to automatically serialize something because you don't want to write the thing which converts it to XML or JSON or something. Then there's harmful laziness, where people pickle code just because that makes it easier to import without giving people the actual code.

            What people receiving such models should keep in mind is that they're getting binaries, and those binaries should be treated with the same mistrust as a more typical one. If you wouldn't run an executable from these people, maybe don't run their different-format executable just because it takes a few more steps to execute.
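
One way to act on that mistrust is to look at a pickle's opcodes before loading it. The sketch below uses the standard `pickletools.genops`, which walks the stream without executing it. The `SUSPICIOUS` set and the function name are assumptions chosen for illustration; a clean result is not proof of safety, and legitimate pickles of custom classes will also trip it.

```python
import pickle
import pickletools

# Opcodes that import names or call/construct things (illustrative selection).
SUSPICIOUS = {
    "GLOBAL", "STACK_GLOBAL", "INST", "OBJ",
    "NEWOBJ", "NEWOBJ_EX", "REDUCE", "BUILD",
}


def suspicious_opcodes(blob):
    """Return the sorted names of import/call opcodes found in a pickle."""
    found = {
        opcode.name
        for opcode, _arg, _pos in pickletools.genops(blob)
        if opcode.name in SUSPICIOUS
    }
    return sorted(found)


plain = pickle.dumps({"a": [1, 2.0, "x"]})
# A hand-written protocol 0 stream that would call len('abc') if loaded.
calls_something = b"cbuiltins\nlen\n(Vabc\ntR."

print(suspicious_opcodes(plain))
print(suspicious_opcodes(calls_something))
```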

      3. doublelayer Silver badge

        Re: pwned by default

        This is correct. Pickles are just serialized objects. And that means basically any object. If you pickle a function, then it unpickles into runnable code. If you're not careful what you do with it, you could run it. For ML models, this can end up being the intent; you just load your preprocessor, run it, then run the model. If the attacker submits a preprocessor function which does other things, you don't know what it's going to do and should protect yourself or not run it at all. The same issue occurs everywhere where you can serialize something which can execute. Unless you're careful about using it later, you could end up executing something malicious.
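
One nuance worth illustrating: with the standard library's pickle, a function is stored as a reference (module plus name), not as its body, so the receiver's own copy of that function is what gets looked up. A small sketch, using `json.dumps` purely as a convenient example of an importable function:

```python
import json
import pickle

blob = pickle.dumps(json.dumps)
restored = pickle.loads(blob)

# The very same function object comes back: it was found by name on load.
same_object = restored is json.dumps
# The stream carries the name, not the function's bytecode.
contains_name = b"dumps" in blob


def try_pickle_lambda():
    """A lambda has no importable name, so standard pickle refuses it."""
    try:
        pickle.dumps(lambda x: x + 1)
    except (pickle.PicklingError, AttributeError) as exc:
        return type(exc).__name__
    return "pickled"


print(same_object, contains_name, try_pickle_lambda())
```

So the risk is less about function bodies travelling in the file and more about which existing callables the file names and has invoked. Third-party serializers that do ship function bodies behave differently.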

        1. sw guy

          Re: pwned by default

          So, pickles are kinda executable files.

          One should trust / un-trust them on the same basis as any executable files.

          1. doublelayer Silver badge

            Re: pwned by default

            Not exactly. Just unpickling one can't run code. It can produce an object that is runnable. It should be treated like anything that can be executed, but not like something which automatically executes. It's one level below a document which can run code just by opening it.

            1. Michael Wojcik Silver badge

              Re: pwned by default

              "Just unpickling one can't run code."

              Except when it does. See Marshalling Pickles (AppSecCali 2015).

              This has been a well-known issue for over five years. And it's not just Python.
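
To make the "except when it does" point concrete: a pickle is a small program for a stack machine, and it can be written by hand without any Python class being involved. The bytes below are a hand-assembled protocol 0 stream that calls a harmless builtin; the variable name is arbitrary.

```python
import pickle

# c  GLOBAL   push builtins.len
# (  MARK
# V  UNICODE  push 'abc'
# t  TUPLE    build ('abc',) from everything since the mark
# R  REDUCE   call len('abc') and push the result
# .  STOP
HANDWRITTEN = b"cbuiltins\nlen\n(Vabc\ntR."

# The call happens inside loads(); the caller never invokes anything itself.
result = pickle.loads(HANDWRITTEN)
print(result)
```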

      4. bombastic bob Silver badge
        Meh

        Re: pwned by default

        I've never tried torch.save or torch.load and so I didn't even know about this...

        I can understand how a binary data format would be better. But NOT one with executable CODE in it, which is where LOTS of back doors have crept in... Word doc... Excel spreadsheet... enhanced metafiles... flash... the list goes on.

        *IF* the parser is re-written to convert old style files into harmless data [without executing functions] then it could continue, but I expect it will convert slower in the process.

        Better still, bite the performance bullet and use XML or tab-delimited columnar text or some OTHER standard data-only interchange format (though I'm not a fan of JSON) to store and load this kind of data. NOT that hard, and if the interpreter is [intelligently] written in C, it might be just as fast on large data sets [ones limited by disk access speed], though I expect data compression might be needed to keep the file sizes small. gzip works pretty well for that on text data.

        /me points out that CPU-piggy C++ code that relies on exception handling and uses 'new' a lot would NOT qualify as "intelligently written".
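
The idea of a parser that reads old files as harmless data without executing functions can be approximated by overriding `find_class` on `pickle.Unpickler`, the hook through which the unpickler resolves every named global. The class and function names below are hypothetical, and this is a sketch rather than a hardened loader: it does nothing about oversized or malformed input.

```python
import io
import pickle


class DataOnlyUnpickler(pickle.Unpickler):
    """Loads plain containers, strings and numbers; refuses every global."""

    def find_class(self, module, name):
        # Any attempt to reference a class or function ends the load.
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def loads_data_only(blob):
    """Unpickle blob, rejecting streams that name any class or function."""
    return DataOnlyUnpickler(io.BytesIO(blob)).load()
```

Dicts, lists, strings and numbers round-trip as usual, while a stream that names any callable fails with `UnpicklingError` before the call is made. Files that contain instances of custom classes would be rejected too, so an allow-list inside `find_class` would be needed for those.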

  3. Mike 137 Silver badge

    A great way to ensure both quality and progress

    "ML practitioners prefer to share pre-trained pickled models rather than the data and algorithms used to train them"

    So any imperfections or errors get perpetuated without anyone being able to find out.

    1. bazza Silver badge

      Re: A great way to ensure both quality and progress

      Yes, but in that world errors are seemingly considered as acceptable...

  4. cantankerous swineherd

    torch.data

  5. CrackedNoggin Bronze badge

    This makes me recall a recent article

    JavaScript could usurp Python for mathematical programming, claims Deno team

    https://www.theregister.com/2021/03/04/deno_10_gpu_ml/

    in this journal.

    I think the general idea is that with Deno, it would be possible to unpickle-equivalent in safe mode at the per-command level, protecting the file system.

    Deno's built-in security, combined with the practical convenience of a concurrent language, may indeed make Deno a better system-level language than Python. Perhaps wedged between Python and the system.

    1. doublelayer Silver badge

      I'm not seeing it in the article. It perhaps runs in a sandbox, though I'm guessing, but the problem with malicious code is that it's run in the first place. It's not hard to put untrusted pickles in a sandbox, but if you don't, or if they can do whatever they want to do from in there, it hasn't fixed anything. The best way to handle this is to create a restricted language which can be serialized and runs only in an interpreter which has no OS access. It only does math and has no hooks elsewhere. That would work, but nobody would end up using it because people who so far don't have any problem unpickling random things and running them aren't going to go to extra effort for provable security, especially if it means not using one of the libraries they're used to.
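
As a toy illustration of the restricted-language idea, here is a sketch of an evaluator that accepts only arithmetic on numbers and named variables. The expression travels as plain text, and anything that is not arithmetic (calls, attribute access, imports) is rejected. The function name and the supported operator set are assumptions; exponentiation is left out on purpose to avoid runaway computations.

```python
import ast
import operator

_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def evaluate(expression, variables=None):
    """Evaluate an arithmetic expression; raise ValueError on anything else."""
    variables = variables or {}

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.Name) and node.id in variables:
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported syntax: {type(node).__name__}")

    return walk(ast.parse(expression, mode="eval"))


print(evaluate("2 * x + 1", {"x": 3}))
```

A real model format would need far more than this, which is rather the point made above: the safe option costs effort that few people currently spend.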

  6. bazza Silver badge

    Uh Oh

    I can smell yet another software package format coming up soon...

    1. DWRandolph

      Re: Uh Oh

      Obligatory XKCD - How Standards Proliferate

      https://xkcd.com/927/

      1. jake Silver badge

        Re: Uh Oh

        The nice thing about standards is that there are so many of them to choose from. —Andrew S. Tanenbaum

  7. John Savard

    Not Good News

    This isn't just a proof-of-concept, it makes actual attacks easier.

    It seems irresponsible to release it, therefore.

    The only way there would be an "easy win for security" is if a drop-in replacement for pickling were available now. If that were the case, the article would have mentioned it.

    It was, though, made clear in the article why pickling is essential, as it's the only way to distribute AI models without giving away their internals. The only alternative offered is switching to another language.

    1. Michael Wojcik Silver badge

      Re: Not Good News

      The problem was described at length by Lawrence and Frohoff in 2015. This new tool might help the dimmest of skiddies, but it's really nothing more than a reminder for those who refuse to pay attention.
