Google open sources file-identifying Magika AI for malware hunters and others

Google has open sourced Magika, an in-house machine-learning-powered file identifier, as part of its AI Cyber Defense Initiative, which aims to give IT network defenders and others better automated tools. Working out the true contents of a user-submitted file is perhaps harder than it looks. It's not safe to assume the file …

  1. cyberdemon Silver badge

    Next week's news

    How to defeat Magika by using generative AI to create a random-looking comment block that causes your malware.sh.jpg to be misclassified by Magika as an actual JPEG.

    1. Anonymous Coward

      Re: Next week's news

      > malware.sh.jpg

      Shouldn't that be malware.jpg.sh so that it looks like a JPEG filename to all the poor suckers who still have file extensions hidden by default?

      Oops, actually it'd better be malware.jpg.bat or malware.jpg.exe [1][2]

      [1] to which Magika replies, yes, that is a perfectly well-formed exe file, nothing to report.

      [2] it is also a totally sensible name for my program, which generates a JPEG of a warning about malware, in a different style each time you run it.

      1. nickpreston24

        Re: Next week's news

        Dumb question: We have means of grepping files and I'm sure we can do it pdq, so why not use known patterns to find embedded scripts in jpeg, etc.? I don't know encoding that well, but don't you get what you put into it? So then shouldn't we, knowing what kinds of patterns we might see in an encrypted jpeg be able to mathematically (through regex or otherwise) anticipate the 'shape' of a script or exe, even if it is encoded? I mean, files are text, so... what am I missing? Thanks, and please pardon my ignorance.

  2. theOtherJT Silver badge

    I've played Magika...

    ...I expect this to make about as much sense, but be significantly less fun.

  3. Dinanziame Silver badge

    Isn't it a bit dangerous to publish your defense tool? Makes it easier for bad actors to figure out how to break it, no?

    1. doublelayer Silver badge

      If you don't publish your defense tool, it becomes less useful at defending anybody except you. It's also not something you'd use as a single line of defense, but one part of it. Someone who knows why to use this is probably using a variety of tools with this serving to improve performance and results but not necessarily bypassing their other tools.

  4. captain veg Silver badge

    sig

    I'm obviously not clever enough.

    > Basically, if someone uploads a .JPG to your online service, you want to be sure it's a JPEG image and not some script masquerading as one

    Well, I would check for the presence of a valid JPEG header sequence at the start of the file, i.e. bytes 0xFF 0xD8 rather than, say, 0x23 0x21.

    Were that to fail, it's not clear to me what would be the problem in feeding a file containing a script to a function expecting an image beyond it rendering garbage. The notion that it might actually execute the script seems, er, unlikely.
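The header check described above can be sketched in a few lines of Python. This is only an illustration of the two-byte test, not how Magika works; the function name `sniff_header` and the returned labels are made up for the example.

```python
JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker
SHEBANG = b"#!"         # 0x23 0x21, the usual start of a Unix script


def sniff_header(data: bytes) -> str:
    """Classify a buffer by its first two bytes only (hypothetical helper)."""
    if data.startswith(JPEG_SOI):
        return "jpeg"
    if data.startswith(SHEBANG):
        return "script"
    return "unknown"


print(sniff_header(b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"))  # jpeg
print(sniff_header(b"#!/bin/sh\necho hello\n"))           # script
```

A check like this only says the file starts the way a JPEG should; it says nothing about whether the rest of the file is well formed.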

    -A.

    1. Anonymous Coward

      Re: sig

      Perhaps what they meant to say is that it checks if it is one of those malformed JPEGs that triggered an exploitable overflow in one of the crappily-written decoders from a few years back (the decoder where they traded proper JPEG parsing for speed)?

      But you still don't need an LLM for that, you just do your magic bytes test then run it into a properly-written JPEG parser (no need to actually decode it fully, so pretty fast).

      The same for any of the openly documented file formats: the single hand-written rule being "if libmagic says it is format Q and I have an executable validate_q.exe then run it", where libmagic and validators for the documented formats should all be easily available.
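A rough Python sketch of that single rule, under several assumptions: `detect_format` is a hypothetical stand-in for libmagic (it only looks at the JPEG magic bytes), `VALIDATORS` plays the role of the per-format validate_q programs, and `jpeg_structure_ok` is a simplified structural check that walks marker segments up to start-of-scan without decoding any pixels. It ignores fill bytes and standalone markers, so it is far from a complete JPEG parser. Rejecting formats that have no validator is also an assumption; the rule above does not say what to do in that case.

```python
import struct


def jpeg_structure_ok(data: bytes) -> bool:
    """Walk JPEG marker segments up to start-of-scan (simplified sketch)."""
    if not data.startswith(b"\xff\xd8"):
        return False
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return False  # every segment must begin with a 0xFF marker prefix
        marker = data[pos + 1]
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2 or pos + 2 + length > len(data):
            return False  # impossible length, or segment overruns the file
        if marker == 0xDA:
            return True   # reached start-of-scan with consistent headers
        pos += 2 + length
    return False          # ran out of data before any scan segment


def detect_format(data: bytes) -> str:
    """Hypothetical stand-in for libmagic: magic bytes only."""
    return "jpeg" if data.startswith(b"\xff\xd8") else "unknown"


# One validator per documented format (the validate_q programs in the rule).
VALIDATORS = {"jpeg": jpeg_structure_ok}


def accept_upload(data: bytes) -> bool:
    """Apply the rule: identify the format, then run its validator if any."""
    validator = VALIDATORS.get(detect_format(data))
    return validator is not None and validator(data)
```

For instance, a buffer made of the start-of-image marker, one 16-byte APP0 segment and a minimal start-of-scan header passes `accept_upload`, while the same magic bytes followed by a segment claiming a length of 65535 are rejected.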

      So that gets around everything except commercial, proprietary junk such as, oooh, just picking one at random, don't read anything nefarious into this choice, full-fat docx, xlsx and whatever PowerPoint's file extension is[1]; and they probably do execute random crap hidden away and need thousands of hand-written guesses (sorry, rules) to detect.

      In other words, articles always like to blame JPEG (and mp4) as being "the problem" when only a yoghurt would be taken in by them.

      [1] ok, in the interests of fairness, how about a Jira attachment.

  5. Pete Sdev

    Poor Eeyore

    > which could later bite you in the ass

    Beware all owners of donkeys! Though I wouldn't recommend trying to bite a donkey to be honest.

    If you mean arse, don't be shy, write arse.

    1. captain veg Silver badge

      Re: Poor Eeyore

      > I wouldn't recommend trying to bite a donkey

      And if you did, it's probably safer (and slightly less utterly horribly smelly) to bite it *on* the arse rather than *in* it.

      -A.
