This malicious PyPI package mixed source and compiled code to dodge detection

Researchers recently uncovered the following novel attack on the Python Package Index (PyPI). ReversingLabs detected a Python package in April that mixed malware with compiled code as a way to evade detection by security tools that only check source code files and not compiled output. "It may be the first supply chain attack …

  1. that one in the corner Silver badge

    Why have pyc files in a package anyway?

    Any Python install that can pull down a package and expect to run it should have the tools to generate pyc files locally.[1] That is unlike C extensions, where we can no longer expect every machine to have a C compiler.

    So why are pyc files allowed as part of the package in the first place, as everyone appears to know that the scanners won't work on them?

    (Or have I missed the part where the pyc was cunningly disguised and encrypted to prevent anyone spotting it is pyc in the first place?)

    [1] e.g. python.exe with the appropriate arguments!
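
    As a rough illustration of the point about generating pyc files locally, here is a minimal sketch using the standard library's py_compile module. The file name hello.py and the helper compile_locally are made up for the demo.

```python
import py_compile
import tempfile
from pathlib import Path


def compile_locally(source_path: str) -> str:
    """Compile one .py file to bytecode and return the path of the .pyc written."""
    # doraise=True raises py_compile.PyCompileError instead of printing the error
    return py_compile.compile(source_path, doraise=True)


# Demo on a throwaway module (hello.py is a hypothetical file name)
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "hello.py"
    src.write_text("GREETING = 'hi'\n")
    pyc_path = compile_locally(str(src))
    pyc_name = Path(pyc_path).name
    pyc_exists = Path(pyc_path).exists()
```

    The same thing is available from the command line as python -m compileall, which walks a whole directory tree.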

    1. A Non e-mouse Silver badge

      Re: Why have pyc files in a package anyway?

      I am not a Python guru, but I'd be surprised if you couldn't manually read some data (e.g. "image.jpg") or a hex string embedded in a .py file and then get that executed, so skipping the .pyc file.

      1. doublelayer Silver badge

        Re: Why have pyc files in a package anyway?

        You could, but that is likely to be detectable by code analysis. This exploit didn't do that and included the compiled version directly. I think the new policy should forbid that. Not only is it a security risk as I think this incident amply demonstrates, but it also loses all of the benefits introduced by new versions of the Python compiler, which can compile the same code to faster byte code. Most .pyc files are version specific, and while there's backward compatibility, there's a good reason they are usually recompiled instead.
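
        To show why the embedded-string trick is likely to be caught, here is a minimal sketch of the kind of source check an analyser might run. It only flags direct calls to a handful of builtins; the set SUSPICIOUS_CALLS and the function find_dynamic_exec are illustrative names, not taken from any real scanner.

```python
import ast

# Assumption: a small, illustrative list of builtins that run dynamic code
SUSPICIOUS_CALLS = {"exec", "eval", "compile", "__import__"}


def find_dynamic_exec(source: str) -> list[tuple[int, str]]:
    """Return (line, name) for each direct call to a suspicious builtin."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in SUSPICIOUS_CALLS:
                hits.append((node.lineno, node.func.id))
    return sorted(hits)


# The sample is only parsed, never executed
sample = (
    "payload = bytes.fromhex('7072696e74283129')\n"
    "exec(payload)\n"
)
hits = find_dynamic_exec(sample)
```

        A check like this works on .py files only, which is exactly the gap a shipped .pyc slips through.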

    2. thames

      Re: Why have pyc files in a package anyway?

      The Python "interpreter" automatically compiles each source file as it is imported and then caches the compiled form as a ".pyc" file. This means that the second time it is executed it can skip the compile step and import the byte code binary directly. This can speed up start up time significantly. While the Python compiler is very fast, on large programs it can make a perceptible difference.

      Because of this you don't actually need the ".py" file on imported modules if the ".pyc" file is already present.
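
      A small sketch of that sourceless import, assuming CPython's usual behaviour of looking for name.pyc in the module's own directory when no name.py is present. The module name sourceless_demo is made up for the demo.

```python
import importlib
import py_compile
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sourceless_demo.py"  # hypothetical module name
    src.write_text("ANSWER = 42\n")

    # Write the bytecode next to the source rather than into __pycache__:
    # a .pyc inside __pycache__ is only used as a cache for an existing .py.
    py_compile.compile(str(src), cfile=str(src.with_suffix(".pyc")), doraise=True)
    src.unlink()  # remove the source so that only the .pyc remains

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        module = importlib.import_module("sourceless_demo")
        answer = module.ANSWER
        loaded_from = Path(module.__file__).name
    finally:
        # Leave the interpreter state as we found it
        sys.path.remove(tmp)
        sys.modules.pop("sourceless_demo", None)
```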

      This isn't something unique to Python, as many other languages have used a similar strategy.

      Some people use this as a very weak form of copy protection so they can distribute Python programs to customers without giving them source code. That isn't what it was originally intended for; it's just a side effect of having a faster start up.

      However, this does mean that there is a use case for having ".pyc" files rather than source in a package. This in turn means that having the standard installation tools exclude ".pyc" files would break at least some existing software out there.

      The solution is to simply have the code analysis tools disassemble the ".pyc" files and analyze those (the output is like a form of assembly language). A disassembler comes as part of the Python standard library.
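
      A minimal sketch of that disassembly step, using the standard library's marshal and dis modules. It assumes the 16-byte .pyc header used by CPython 3.7 and later, and a .pyc built by the same Python version that runs the scan. The marshal module is documented as unsafe for untrusted data, so a real scanner would want to do this in a sandbox. The helper name disassemble_pyc and the sample module are made up.

```python
import dis
import io
import marshal
import py_compile
import tempfile
from pathlib import Path

# Assumption: CPython 3.7+ layout (magic number, flags, then two 4-byte fields)
PYC_HEADER_SIZE = 16


def disassemble_pyc(pyc_path: str) -> str:
    """Return a text disassembly of a .pyc file without running its code."""
    data = Path(pyc_path).read_bytes()
    code = marshal.loads(data[PYC_HEADER_SIZE:])  # code object, not executed
    out = io.StringIO()
    dis.dis(code, file=out)  # also descends into nested functions and classes
    return out.getvalue()


# Demo: compile a throwaway module, then inspect the bytecode only
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sample.py"  # hypothetical file name
    src.write_text("import os\nos.getcwd()\n")
    pyc = py_compile.compile(str(src), doraise=True)
    listing = disassemble_pyc(pyc)
```

      The listing names every import and attribute the module touches, so a scanner could match on those much as it would on source.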

      1. david 12 Silver badge

        Re: Why have pyc files in a package anyway?

        there is a use case for having ".pyc" files rather than source.

        That's not a use case -- it's another side effect. You can get away with providing pyc files in a package, because python loads pyc files by preference.

        Regarding the use case of 'hiding source code', that includes "eliminating dependencies". If you merge all of your dependencies into one big ball-of-code, you can hide that mess in pyc.

  2. Plest Silver badge

    The more things change the more they stay the same

    You just know that most of this malicious crud is put into some base package that does something noddy like string conversion or float casting, 'cos they know some lazy sods can't be arsed to learn how to do it using the core library in a language. Just a modern version of a classic scam: get the lazy marks, 'cos they won't realise until it's too late.

    At this rate we're all going to have to go back to "rolling our own" in house packages again like we did years ago and stop depending on external repos, that'll be fun and kill the lightning pace most projects are forced to run at these days!

    1. FrogsAndChips Silver badge

      Re: The more things change the more they stay the same

      Don't worry, if devs need to go back to writing code themselves, ChatGPT will be there to help them!

    2. sitta_europea Silver badge

      Re: The more things change the more they stay the same

      "...At this rate we're all going to have to go back to 'rolling our own'..."

      Nope. Still rolling. And not in Python, mostly C and Perl with the odd shell script. Call me old-fashioned, but then I'm old-fashioned.

      I loved the bit the other day where ChatGPT reeled off a load of legal precedents which a lawyer then relied on in court.

      Trouble was that, as the other side pointed out, these precedents didn't actually exist. ChatGPT just made them up.

      The judge presiding called our hero into his chambers, and we now await decision on the punishment.

      I mean here, it's bad enough if you're just wearing the wrong clothes - but trying to lead the judge up the garden path is a very serious matter.

      I'm *never* going to let AI code for me.

      1. damiandixon

        Re: The more things change the more they stay the same

        ChatGPT, Bard... Google, bing... You should always check the references to make sure they make sense.

        I've used ChatGPT and Bard for research into how to program an area that I'm unfamiliar with. I did not copy any of the code snippets as I had no idea where they came from. I did do searches on the library calls though, which helped a lot in understanding call sequencing.

        They are an interesting tool and came up with more relevant information than Google/bing search.

        However the way the material is presented as an authoritative narrative is seductive.

        I've added a policy at work to ban the copying of code from the internet without a clear copyright, licence and attribution that is acceptable to the project manager and legal.

  3. ChoHag Silver badge

    A new attack technique

    Distributing malware in compiled form is new?

  4. Erik Beall

    Either decompile or block pyc

    Most .pyc files can be reversed with almost perfect fidelity to the original Python (minus comments), including function and variable names, unless precautions were taken. So they could decompile any .pyc and then scan it, but I think it's better to just block any pyc as a policy. I can't think of any side effects, which doesn't mean there aren't any, but hopefully it's that simple as a means to slightly improve PyPI.
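
    The blocking policy could be sketched roughly as below, assuming the upload is a wheel (a zip archive). An sdist tarball would need tarfile instead. The function find_pyc_entries and the member names are hypothetical, and matching on file extension alone would miss a renamed .pyc.

```python
import io
import zipfile


def find_pyc_entries(archive: zipfile.ZipFile) -> list[str]:
    """List archive members whose names end in .pyc or .pyo."""
    return [
        name for name in archive.namelist()
        if name.lower().endswith((".pyc", ".pyo"))
    ]


# Build an in-memory stand-in for a wheel (member names are made up)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("demo_pkg/__init__.py", "")
    zf.writestr("demo_pkg/__pycache__/helper.cpython-311.pyc", b"\x00")

with zipfile.ZipFile(buffer) as zf:
    offenders = find_pyc_entries(zf)

# A policy check would reject the upload when offenders is non-empty
upload_allowed = not offenders
```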
