
Perfect.
Pleasing.
Perfunctory potable?
Evan Sultanik, principal computer security researcher with Trail of Bits, has unpacked the Python world’s pickle data format and found it distasteful. He is not the first to do so, and acknowledges as much, noting in a recent blog post that the computer security community developed a disinclination for pickling – a binary …
Oh, that's just great.
There's already a warning nobody's paying any attention to. The solution is obviously to add another warning.
Would the maintainers of Python be bitten by the worm of administrative thinking? You need to lock that functionality down. You need to ensure that pickled files are encrypted or something. You need to bake security into it somehow. I don't know, I don't have the answer, but just slapping another warning on and calling it a day is not the solution.
The maintainers of Python don't need to do anything; pickling is a great way of persisting private blobs of data. The maintainers of PyTorch definitely need to do something, because they are using it as the primary method of data interchange between users. That's the messed-up part.
"You need to ensure that pickled files are encrypted or something."
I'm not a Python user, but it sounds like the problem isn't a lack of encryption. The problem seems to be that the format can contain automatically executed code; they're doing the equivalent of a JavaScript programmer parsing a JSON file with eval().
There's even a hint they might be using these hooks, when all they are doing is transferring data.
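A minimal sketch of that eval()-style problem, using pickle's __reduce__ hook; the class name and the harmless print call are invented stand-ins for what an attacker would actually plant:

    import pickle

    class Innocuous:
        def __reduce__(self):
            # pickle will call print(...) while *loading* the data; an attacker
            # would substitute os.system or similar instead of print.
            return (print, ("code ran during unpickling",))

    payload = pickle.dumps(Innocuous())

    # The victim only "parses the data", yet the callable runs immediately:
    pickle.loads(payload)    # prints: code ran during unpickling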
Upvoted because I'm pretty sure you're right. It'd be like .zip or .tgz files being able to run code from their package and doing so by default as opposed to only after asking or, in a script, in response to an enabling command line parameter.
from the Python 3.9 documentation (in red)
Warning
The pickle module is not secure. Only unpickle data you trust.
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
Consider signing data with hmac if you need to ensure that it has not been tampered with.
Safer serialization formats such as json may be more appropriate if you are processing untrusted data. See Comparison with json.
Probably not a super situation for files distributed to/by world+dog.
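For completeness, a sketch of the hmac suggestion from that warning. The key handling and helper names are illustrative only, and note that a signature merely proves the pickle came from someone holding the key; it does nothing for files distributed to/by world+dog.

    import hashlib
    import hmac
    import pickle

    SECRET_KEY = b"shared-secret"   # illustrative only; manage real keys properly

    def dump_signed(obj) -> bytes:
        # Pickle the object and prepend an HMAC-SHA256 tag so tampering is detectable.
        data = pickle.dumps(obj)
        return hmac.new(SECRET_KEY, data, hashlib.sha256).digest() + data

    def load_signed(blob: bytes):
        # Verify the tag before unpickling; refuse anything that doesn't match.
        tag, data = blob[:32], blob[32:]
        expected = hmac.new(SECRET_KEY, data, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            raise ValueError("signature mismatch -- do not unpickle")
        return pickle.loads(data)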
"Probably not a super situation for files distributed to/by world+dog."
Don't be a pessimist. Word had autorun macros for years (and so did CDs and USB drives) and none of that ever caused a problem. Who's going to be interested in a few high value researchers and their carefully curated hoard of personal data?
It's wrong to think of pickles as plain data, because pickle is actually the serialisation format for Python objects, used, for example, in multi-process environments. But, as objects, they can most definitely contain executable code.
They are a necessary evil but as such, they should always be used with caution and never for data transfer, for which they are also too fragile.
I can't think of a use case where I'd want to serialize and deserialize live code.
When I persist data, it's deserialized and validated, and then the objects are reconstructed - not because I'm worried about being pwned, but because I want to know I'm in a known good state with all my invariants intact. ("Have you tried turning it off and turning it on again?")
Okay, maybe I'm a bit sloppy at times and a certain pragmatism creeps in. But the intent is there (and there are always "fixme" comments in the code...)
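That validate-then-reconstruct habit looks roughly like this; the Account type and its invariants are made up purely for illustration:

    import json
    from dataclasses import dataclass

    @dataclass
    class Account:
        name: str
        balance: int

    def load_account(text: str) -> Account:
        raw = json.loads(text)                       # deserialize plain data first
        if not isinstance(raw, dict):
            raise ValueError("malformed account record")
        name, balance = raw.get("name"), raw.get("balance")
        if not isinstance(name, str) or not isinstance(balance, int) or balance < 0:
            raise ValueError("invariants violated")  # fail fast, known-good state only
        return Account(name=name, balance=balance)   # reconstruct the object last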
It usually comes down to laziness. There's probably harmless laziness, like using pickle to automatically serialize something because you don't want to write the thing which converts it to XML or JSON or something. Then there's harmful laziness, where people pickle code just because that makes it easier to import without giving people the actual code.
What people receiving such models should keep in mind is that they're getting binaries, and those binaries should be treated with the same mistrust as a more typical one. If you wouldn't run an executable from these people, maybe don't run their different-format executable just because it takes a few more steps to execute.
This is correct. Pickles are just serialized objects. And that means basically any object. If you pickle a function, then it unpickles into runnable code. If you're not careful what you do with it, you could run it. For ML models, this can end up being the intent; you just load your preprocessor, run it, then run the model. If the attacker submits a preprocessor function which does other things, you don't know what it's going to do and should protect yourself or not run it at all. The same issue occurs everywhere where you can serialize something which can execute. Unless you're careful about using it later, you could end up executing something malicious.
Not exactly. Just unpickling one can't run code. It can produce an object that is runnable. It should be treated like anything that can be executed, but not like something which automatically executes. It's one level below a document which can run code just by opening it.
"Just unpickling one can't run code."
Except when it does. See Marshalling Pickles (AppSecCali 2015).
This has been a well-known issue for over five years. And it's not just Python.
I've never tried torch.save or torch.load and so I didn't even know about this...
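For anyone else who hasn't tried them, the round trip is roughly as below (the tiny Linear model and file names are just for illustration). Both variants go through pickle under the bonnet, so the "only load data you trust" warning applies to downloaded checkpoints either way.

    import torch
    import torch.nn as nn

    model = nn.Linear(4, 2)

    # torch.save pickles whatever it is handed -- the whole model object here...
    torch.save(model, "model_full.pt")

    # ...or just the weights, the commonly recommended pattern:
    torch.save(model.state_dict(), "model_weights.pt")

    # Loading still runs the pickle machinery in both cases.
    fresh = nn.Linear(4, 2)
    fresh.load_state_dict(torch.load("model_weights.pt"))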
I can understand how a binary data format would be better. But NOT one with executable CODE in it, which is where LOTS of back doors have crept in... Word docs... Excel spreadsheets... enhanced metafiles... Flash... the list goes on.
*IF* the parser is re-written to convert old-style files into harmless data [without executing functions] then the format could continue to be used, but I expect it will convert more slowly in the process.
Better still, bite the performance bullet and use XML or tab-delimited columnar text or some OTHER standard data-only interchange format (though I'm not a fan of JSON) to store and load this kind of data. NOT that hard, and if the interpreter is [intelligently] written in C, it might be just as fast on large data sets [ones limited by disk access speed], though I expect data compression might be needed to keep the file sizes small. gzip works pretty well for that on text data.
/me points out that CPU-piggy C++ code that relies on exception handling and uses 'new' a lot would NOT qualify as "intelligently written".
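A sketch of that data-only route, using JSON over gzip; the file name and toy weights are illustrative. Nothing in such a file can name a callable, and the compression claws back much of the text-format size penalty.

    import gzip
    import json

    weights = {"layer1": [[0.1, 0.2], [0.3, 0.4]], "bias": [0.0, 0.0]}

    # Write plain JSON through gzip: data only, no executable hooks.
    with gzip.open("weights.json.gz", "wt", encoding="utf-8") as fh:
        json.dump(weights, fh)

    # Read it back the same way.
    with gzip.open("weights.json.gz", "rt", encoding="utf-8") as fh:
        restored = json.load(fh)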
This makes me recall a recent article
JavaScript could usurp Python for mathematical programming, claims Deno team
https://www.theregister.com/2021/03/04/deno_10_gpu_ml/
in this journal.
I think the general idea is that with Deno it would be possible to do the unpickle equivalent in a safe mode, at the per-command level, protecting the file system.
Deno's built-in security, combined with the practical convenience of a concurrent language, may indeed make Deno a better system-level language than Python. Perhaps wedged between Python and the system.
I'm not seeing it in the article. It runs in a sandbox, perhaps, though I'm guessing; but the problem with malicious code is that it gets run in the first place. It's not hard to put untrusted pickles in a sandbox, but if you don't, or if they can do whatever they want from in there, it hasn't fixed anything.
The best way to handle this is to create a restricted language which can be serialized and runs only in an interpreter with no OS access. It only does math and has no hooks into anything else. That would work, but nobody would end up using it, because people who so far have no problem unpickling random things and running them aren't going to go to the extra effort for provable security, especially if it means not using one of the libraries they're used to.
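Short of a whole restricted language, the nearest off-the-shelf compromise is along the lines of the "Restricting Globals" recipe in the pickle documentation: subclass Unpickler and refuse to resolve anything outside an allow-list. The allow-list below is illustrative; this blocks the usual "call os.system for me" payloads, but only if people actually bother to use it.

    import builtins
    import io
    import pickle

    SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "list", "dict", "tuple"}

    class RestrictedUnpickler(pickle.Unpickler):
        def find_class(self, module, name):
            # Only allow a handful of harmless builtins to be referenced.
            if module == "builtins" and name in SAFE_BUILTINS:
                return getattr(builtins, name)
            raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")

    def restricted_loads(data: bytes):
        return RestrictedUnpickler(io.BytesIO(data)).load()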
This isn't just a proof-of-concept: it makes actual attacks easier.
It seems irresponsible to release it, therefore.
The only way there would be an "easy win for security" is if a drop-in replacement for pickling were available now. If that were the case, the article would have mentioned it.
It was, though, made clear in the article why pickling is essential, as it's the only way to distribute AI models without giving away their internals. The only alternative offered is switching to another language.
int main(enter the void)
...