Reply to post: Re: C/P from PDFs can be interesting...

Trojan Source attack: Code that says one thing to humans tells your compiler something very different, warn academics

Anonymous Coward

Re: C/P from PDFs can be interesting...

Actually, PDF's font model predates widespread Unicode adoption (it is inherited from PostScript), so while many modern PDFs are constructed in a way that maps glyphs to Unicode, some aren't. Even if done properly, cut/paste is a bit of an issue. The main problem is that there's no such thing as a word in PDF, just a glyph at a particular location. So while most documents create their text in a predictable order, it gets trickier with columns, tables, callouts and so on.
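To make the "no words, just positioned glyphs" point concrete, here is a rough Python sketch of what an extractor has to do. The `Glyph` record, the `glyphs_to_lines` function and both thresholds are made up for illustration; real extractors use far more elaborate heuristics.

```python
from typing import NamedTuple

class Glyph(NamedTuple):
    """Hypothetical record for one positioned glyph (not a real library type)."""
    char: str
    x: float      # left edge
    y: float      # baseline; PDF y grows upward
    width: float

def glyphs_to_lines(glyphs, line_tol=2.0, gap_ratio=0.3):
    """Guess lines and word breaks from glyph positions alone."""
    # Top of page first, then left to right.
    ordered = sorted(glyphs, key=lambda g: (-g.y, g.x))
    lines = []
    for g in ordered:
        # Same line if the baseline is within tolerance of the line's first glyph.
        if lines and abs(lines[-1][0].y - g.y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])
    out = []
    for line in lines:
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A wide enough gap is assumed to be a space; this is a guess.
            if gap > gap_ratio * prev.width:
                text += " "
            text += cur.char
        out.append(text)
    return out

page = [
    Glyph("k", 8, 80, 8), Glyph("y", 20, 100, 8), Glyph("H", 0, 100, 10),
    Glyph("o", 0, 80, 8), Glyph("i", 10, 100, 4), Glyph("o", 28, 100, 8),
]
print(glyphs_to_lines(page))
```

Note that this naive version would happily glue the left and right columns of a two-column page into single lines whenever their baselines match, which is exactly the sort of trouble columns and tables cause.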

Other problem areas are things like bullet symbols, whitespace, and ambiguous characters: think Ohm and Omega, space and nbsp, hyphen and em dash, and so on. Things get worse for simple RTL scripts like Hebrew, worse again for Arabic, and by the time you get to Hindi, Bengali etc. you're pretty much f*cked in terms of text extraction unless the software that created the file has thought of this and done things properly. I'll save you checking; it almost certainly hasn't.
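For the ambiguous-character cases, a small Python sketch using the standard `unicodedata` module shows why pasted text may not compare equal to what you typed. The `describe` and `fold` helpers are just illustrative names, and NFKC folding is one possible cleanup, not a fix for everything.

```python
import unicodedata

def describe(ch):
    """Code point and official Unicode name for one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

def fold(text):
    """Apply NFKC normalization, which merges some lookalikes but not all."""
    return unicodedata.normalize("NFKC", text)

# Lookalike pairs mentioned above: (what the PDF may contain, what you expected).
pairs = [("\u2126", "\u03A9"), ("\u00A0", " "), ("\u2014", "-")]

for found, expected in pairs:
    same_after_fold = fold(found) == fold(expected)
    print(f"{describe(found)} vs {describe(expected)}: "
          f"equal after NFKC = {same_after_fold}")
```

The Ohm sign and the no-break space fold to Omega and a plain space, but the em dash stays an em dash, so normalization alone won't make a search for a hyphen match.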

The solution when constructing PDFs, as always, is to make sure they conform to PDF/A-3a and/or PDF/UA, both of which require tagged structure and a Unicode mapping for the text.
