
Next week's news
How to defeat Magika by using generative AI to create a random-looking comment block that causes your malware.sh.jpg to be misclassified by Magika as an actual JPEG.
Google has open sourced Magika, an in-house machine-learning-powered file identifier, as part of its AI Cyber Defense Initiative, which aims to give IT network defenders and others better automated tools. Working out the true contents of a user-submitted file is perhaps harder than it looks. It's not safe to assume the file …
> malware.sh.jpg
Shouldn't that be malware.jpg.sh so that it looks like a JPEG filename to all the poor suckers who still have file extensions hidden by default?
Oops, actually it'd better be malware.jpg.bat or malware.jpg.exe [1][2]
[1] to which Magika replies, yes, that is a perfectly well-formed exe file, nothing to report.
[2] it is also a totally sensible name for my program, which generates a JPEG of a warning about malware, in a different style each time you run it.
Dumb question: We have means of grepping files and I'm sure we can do it pdq, so why not use known patterns to find embedded scripts in jpeg, etc.? I don't know encoding that well, but don't you get what you put into it? So then shouldn't we, knowing what kinds of patterns we might see in an encrypted jpeg be able to mathematically (through regex or otherwise) anticipate the 'shape' of a script or exe, even if it is encoded? I mean, files are text, so... what am I missing? Thanks, and please pardon my ignorance.
If you don't publish your defense tool, it becomes less useful at defending anybody except you. It's also not something you'd use as a single line of defense, but as one part of it. Someone who knows why to use this is probably already using a variety of tools, with this one improving performance and results rather than replacing the others.
I'm obviously not clever enough.
> Basically, if someone uploads a .JPG to your online service, you want to be sure it's a JPEG image and not some script masquerading as one
Well, I would check for the presence of a valid JPEG header sequence at the start of the file, i.e. bytes 0xFF 0xD8 rather than, say, 0x23 0x21.
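For illustration, a minimal Python sketch of that two-byte test. The function names are made up for this example, and it checks nothing beyond the first two bytes, so it is only the first hurdle rather than a full validation.

```python
JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker
SHEBANG = b"#!"         # 0x23 0x21, how a Unix script usually starts


def looks_like_jpeg(data: bytes) -> bool:
    """True if the data starts with the JPEG start-of-image marker."""
    return data[:2] == JPEG_SOI


def looks_like_script(data: bytes) -> bool:
    """True if the data starts with a shebang."""
    return data[:2] == SHEBANG
```

Reading just the first two bytes of the upload and passing them to `looks_like_jpeg` is enough for this test; anything that fails it can be rejected before a decoder ever sees it.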
Were that to fail, it's not clear to me what would be the problem in feeding a file containing a script to a function expecting an image beyond it rendering garbage. The notion that it might actually execute the script seems, er, unlikely.
-A.
Perhaps what they meant to say is that it checks if it is one of those malformed JPEGs that triggered an exploitable overflow in one of the crappily-written decoders from a few years back (the decoder where they traded proper JPEG parsing for speed)?
But you still don't need an LLM for that, you just do your magic bytes test then run it into a properly-written JPEG parser (no need to actually decode it fully, so pretty fast).
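A rough sketch of what "magic bytes, then parse without decoding" could look like in Python. This is an assumption-laden illustration, not a properly-written parser: `jpeg_structure_ok` is a made-up name, it only walks the marker segments up to start-of-scan, and it insists the file ends with the end-of-image marker, which would reject real files that carry trailing data.

```python
import struct

# Markers that carry no length field: TEM (0x01) and RST0..RST7 (0xD0..0xD7).
STANDALONE = {0x01} | set(range(0xD0, 0xD8))


def jpeg_structure_ok(data: bytes) -> bool:
    """Walk JPEG marker segments up to start-of-scan; no pixel decoding."""
    # Magic bytes first: SOI at the start, EOI at the end (strict assumption).
    if data[:2] != b"\xff\xd8" or data[-2:] != b"\xff\xd9":
        return False
    pos = 2
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            return False          # every segment must begin with 0xFF
        marker = data[pos + 1]
        if marker == 0xFF:        # fill byte before a marker, skip it
            pos += 1
            continue
        pos += 2
        if marker in STANDALONE:
            continue
        if marker == 0xD9:        # EOI reached before any scan data
            return False
        if pos + 2 > len(data):
            return False
        (length,) = struct.unpack(">H", data[pos:pos + 2])
        # The length field counts its own two bytes and must stay in bounds.
        if length < 2 or pos + length > len(data):
            return False
        if marker == 0xDA:        # SOS: entropy-coded data follows, stop here
            return True
        pos += length
    return False
```

A shell script with `0xFF 0xD8` glued on the front fails at the first segment, and a segment whose declared length runs past the end of the file is rejected without ever reaching a decoder.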
The same for any of the openly documented file formats: the single hand-written rule being "if libmagic says it is format Q and I have an executable validate_q.exe then run it", where libmagic and validators for the documented formats should all be easily available.
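That one rule, sketched in Python under some loud assumptions: `validate_upload` and the `validate_<format>` naming convention are hypothetical, the `detect` callable stands in for whatever libmagic binding you use, and a validator is assumed to exit with status 0 when the file is well-formed.

```python
import shutil
import subprocess
from typing import Callable, Optional


def validate_upload(
    path: str,
    detect: Callable[[str], str],
    which=shutil.which,
    run=subprocess.run,
) -> Optional[bool]:
    """Apply the rule: detect the format, then run its validator if one exists.

    Returns True/False from the validator, or None when no validator is
    installed for the detected format (the caller decides what to do then).
    """
    fmt = detect(path)                 # e.g. "jpeg"; stand-in for libmagic
    exe = which(f"validate_{fmt}")     # hypothetical naming convention
    if exe is None:
        return None
    result = run([exe, path], capture_output=True)
    return result.returncode == 0      # assumes exit status 0 means valid
```

The `which` and `run` parameters default to the real standard-library functions and exist only so the rule can be exercised without any validators installed.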
So that gets around everything except commercial, proprietary junk such as, oooh, just picking one at random, don't read anything nefarious into this choice, full-fat docx, xlsx and whatever PowerPoint's file extension is[1]; and they probably do execute random crap hidden away and need thousands of hand-written guesses (sorry, rules) to detect.
In other words, articles always like to blame JPEG (and mp4) as being "the problem" when only a yoghurt would be taken in by them.
[1] ok, in the interests of fairness, how about a Jira attachment.
int main(enter the void)
...