Hash list
So you try and upload your image and it gets blocked; change the colour of one of the pixels slightly and bingo, a new hash.
I hope there is more to this system than a hash list.
The Internet Watch Foundation, Blighty's voluntary body for policing and filtering the 'net for child abuse images, has announced nearly 19,000 hashes of "category A" abuse images have already been stored in its new Hash List and distributed to major web firms. The abuse images are sorted into categories A, B, and C, with "A" …
I suggest you look at something like TinEye to see how accurate image detection can be. As an example, I photoshopped two generally available meme images I downloaded from the internet into one joke pic to stick on the wall at work. TinEye correctly identified which two photographs I cut parts out of to create my pic, and pointed me to where I could download them off the net. It really is quite good. From my chats with a couple of engineers who tinker with image recognition, the only thing holding them back on the facial recognition front is privacy legislation. They have had the means for some time, but are discouraged from applying it to Joe Public.
I wonder whether the people who do this research (and I am seriously glad I am not one of them) have looked at using the 'bag of vectors' approach - I saw some work a few years ago that was very impressive at identifying the same types of features, matching objects photographed at different angles. Computationally expensive, of course, but that's always something that can be dealt with. I have to say I'm not surprised to hear of the hashes (from a technical perspective), but it contrasts with what I heard on the radio the other day about how long it takes the police in the UK to search through devices. I would be interested in the accuracy of combining a cluster of approaches (whole-image hashes, partial hashes, feature hashes, even keyword analysis).
Oh, goody! MD5!
It's lucky they chose an up-to-date hash algorithm that's got no known weaknesses.
What's that, Carnegie Mellon University's Software Engineering Institute? As of 2010 you consider it "cryptographically broken and unsuitable for further use"? Oh, that's unfortunate... MD5 has been known to have collision issues since 2004? My - that is poor.
Seriously, MD5 is fine for some things. But for important things - like anything approaching censorship or criminal justice, perhaps - I don't think we should be using MD5. SHA-2 perhaps?
cryptographically broken and unsuitable for further use
"Cryptographically" being the operative word. In this case, it's not being used cryptographically.
for important things - like anything approaching censorship or criminal justice, perhaps - I don't think we should be using MD5
In their defence, it's entirely possible that they started using MD5 for this purpose before MD5 was so widely considered useless. And since it's a criminal offence to have possession of the images in question (exceptions notwithstanding), they may no longer have the source images from which to generate new hashes. However, they certainly shouldn't be using it for new images, and given the inclusion of PhotoDNA hashes in the programme, it's entirely possible they no longer do so.
That said, I would certainly hope they do a more detailed check than just comparing MD5 hashes, before breaking your door down in the middle of the night.
I'm a dreamer, I know.
MD5 is perfectly fine for this; the odds of accidentally stumbling over a collision in real-world data are insanely small. They're not going to prosecute you for it; it's obviously used for flagging up images.
That being said, you could find your image deleted from Facebook or something - but it's not as if you won't be able to say "well, this is wrong".
When I say MD5 is fine, I mean on a technical level: not using file hashes at all would generally be a stupid idea for this purpose, for obvious reasons, but SHA-512 isn't exactly going to improve the process - there are fewer theoretical collisions, but..
Remember you're talking about a 128-bit hash space, so..
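Back-of-envelope, treating MD5 as if it were a random 128-bit function (which is fine for accidental collisions, whatever its cryptographic flaws):

    # Birthday bound: the chance of any accidental collision among n random
    # 128-bit hashes is roughly n^2 / 2^129.
    n = 19_000                        # roughly the size of the IWF list
    print(f"{n * n / 2**129:.1e}")    # ~5.3e-31, i.e. never happening by chance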
I would be less worried about accidental collisions and more worried about the possibility of someone intentionally crafting an innocent-looking file that has the same MD5 hash as something on the watch list. (Strictly, MD5's practical breaks are collision attacks rather than preimage attacks, so the attacker would craft the innocent file and its contraband twin together, then arrange for the latter to be listed.) Said file could be used for any number of mischievous and/or nefarious purposes, details of which are left as an exercise for the reader.
The IWF are not encrypting things, they are searching for things that MAY match a known MD5 hash.
Why are people posting "oh my god, they should use XXX encryption" as if they were talking about securing bank accounts? They want to find images that are ALREADY out there, not securely distribute new ones.
Suppose I actually wanted to distribute this stuff (I don't !)
All you'd have to do would be to write a script that changed one pixel of the image (depending on the compression), resized it, changed an irrelevant byte in the internal format, or just about anything. You could probably convert a PNG to a JPG and back again and not get the same file, because of the lossy compression. Or you could zip the file. Or add an extra 256 bytes of random values to the end of the file, which another program could strip off.
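A minimal sketch of the last trick (Python; the filename is made up), showing that even a single appended byte yields a completely unrelated digest:

    import hashlib

    data = open("picture.jpg", "rb").read()         # any file will do
    print(hashlib.md5(data).hexdigest())
    print(hashlib.md5(data + b"\x00").hexdigest())  # one extra byte, totally different hash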
Am I wrong here ?
https://en.wikipedia.org/wiki/PhotoDNA
PhotoDNA ... works by computing a hash that represents an image. This hash is computed such that it is resistant to alterations in the image, including resizing and minor color alterations. It works by converting the image to black and white, re-sizing it, breaking it into a grid, and looking at intensity gradients or edges.
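PhotoDNA itself isn't public, but you can get the flavour from a toy "difference hash". A minimal sketch in Python (Pillow + NumPy; my own simplification, not Microsoft's algorithm): greyscale, shrink, record the sign of adjacent-pixel gradients, then compare hashes by Hamming distance, so near-duplicates land close together instead of producing unrelated digests the way MD5 does.

    from PIL import Image
    import numpy as np

    def dhash(path, size=8):
        # Greyscale, shrink to (size+1) x size, then record whether each
        # pixel is brighter than its left neighbour: 64 gradient-sign bits.
        img = Image.open(path).convert("L").resize((size + 1, size))
        px = np.asarray(img, dtype=np.int16)
        bits = (px[:, 1:] > px[:, :-1]).flatten()
        return sum(1 << i for i, b in enumerate(bits) if b)

    def hamming(a, b):
        # Near-duplicate images differ in only a few bits.
        return bin(a ^ b).count("1")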
PhotoDNA was developed by Microsoft and offered to law enforcement agencies worldwide at no cost. Yes, that's another reason to hate MSFT!
I think the hate will start when its use will be legally mandated and ISPs find that:
1 - it may be free for law enforcement, but not for ISPs and
2 - it requires Windows to run on and will be as resource-hungry as anything else MS has ever produced.
If you deduce from that that I do not trust MSFT for one cent, you're right. They've played that game many, many times and they're getting desperate now.
@AC -
1. It is free for both law enforcement AND ISPs. Even "service providers" such as Facebook use it. For free. It used to require hardware on-site but that's now been changed to a RESTful Azure service to remove the need for hardware and admins.
Now somebody will claim MS are collecting all the child porn because they like it, I suppose.
2. See above regarding the Azure service. It requires absolutely nothing to run it.
Yes, I saw a post about that after I'd posted.
Problem is, the stuff they are trying to stop is not an issue of recognition, it is an issue of distribution.
What they want is to stop the stuff being stored on ftp servers and so on, right ?
So how does a "PhotoDNA" algorithm cope if the thing is not actually a photo? Suppose the raw bytes in the file are scrambled according to some key sequence, and that key sequence is distributed separately?
Anything based on pictorial recognition will not work if the thing is not a photo, surely ?
19,000 images? That's a pitifully small number. Besides, this measure only eliminates existing, known images; it does absolutely nothing to prevent the production of new works. Maybe even the opposite: by destroying the existing stock, they could be stoking demand for new material. Great move.
"19,000 images? That's a pitifully small number."
Beat me to it.
I have a friend who is a professional photographer; he does weddings, portraits and suchlike, but he also likes to take his camera with him when he walks the dog (sadly Duster passed away a month ago). He asked me to get him some stuff to back up his picture collection: all 400,000-plus images from since he went digital.
When I suggested that he might want to trim it down a bit just to keep the good stuff, his reply was that he doesn't have time to go through and sort them out. It's just easier to keep everything.
It is easier to keep everything... says the poor sod with 568,345 (at the last count) digital images stored away. Takes up nearly two NAS units. 15.6TB. 980GB so far this year.
Well, sorting that out will give me something to do when I retire (that's my excuse and I'm sticking to it)
According to TFA, the 19,000 figure is just for the "worst of the worst". The PhotoDNA Wikipedia article mentions that "Project Vic" has a database of hashes running into the millions.
" by destroying the existing stock, they could be stoking demand for new material"
Just as the big drug bust has been shown to create a gang war, with the attendant shootings, as the wannabes battle it out to become the successors to the trade. Unintended consequences.
The threat of this approach might deter the computer-illiterate, but then Darwin was already looking after people who try sharing kiddie porn via Facebook.
First off, how is this hash to be generated? Will Google et al calculate a hash for every image before it can be uploaded and simply not accept (sight unseen) anything that produces a hash they don't like? The first time you can't upload your holiday snaps will be the last time you use their service, so that is not a runner. Any hash will have to be calculated after upload, which means the company is now in possession of the suspect image. In most jurisdictions, possession of kiddy porn (knowingly or otherwise) is a serious criminal offence, and I am not sure safe harbour rules apply if the company is aware of the content.
What happens when a matching hash is detected? Do they send the 'suspect' image to someone else to verify? In which case they will be knowingly participating in the transport and distribution of what they believe to be kiddy porn across state and national boundaries! Try explaining that to the company lawyers.
Perhaps they have a human verify the image before they alert the authorities? In which case they must have paid employees looking at kiddy porn on company computers, on company time, with the company's knowledge and, worse, consent! I wonder how HR will fill that vacancy: "Wanted: child porn expert, equal opportunity employer".
If you are serious about stopping child exploitation, then stop this techno bullshit and actively support genuine child protection organisations.
"First off, how is this hash to be generated? Will google et al calculate a hash for every image before it can be uploaded and simply not accept (sight unseen) anything that produces a hash they don't like?"
Generate hash on client. If it matches, silently dial 999 (or whatever) whilst not doing anything to make the (suspected) criminal suspicious. Send data to law enforcement.
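Something like this hypothetical client-side check, presumably (Python sketch; the blocklist and the reporting step are made up):

    import hashlib

    BLOCKLIST = set()   # hypothetical: populated from the distributed hash list

    def flag_if_listed(path):
        # Hash the file about to be uploaded and look it up locally.
        digest = hashlib.md5(open(path, "rb").read()).hexdigest()
        return digest in BLOCKLIST   # True would trigger the silent report

Though shipping the list to the client also hands it to anyone who wants to test their evasions offline, which is presumably why it's only distributed to the web firms.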
Negative only in some senses; half of them are mistaken about MD5, so it's fine.
And no. Only doing something effective is worth doing. Thankfully this may well be effective, so it's worth doing.
But in general, looking at emotional tone and saying "anything is better than nothing" aren't strong bases for analysis.
There are a lot of reasons why doing "anything" isn't a good idea.
I suggest you look up scope creep or feature creep, and remember what fantastic success we have had with the anti-terror laws. It does little good with no transparency whatsoever, as the IWF is by design set up to stop people asking questions.
PhotoDNA is good, but it should be applied to already-uploaded pictures, and humans are generally required in this process to verify matches; the IWF has its own setup for that. Hopefully they are now beyond the original 5 people and actually process false positives.
Doing something is only better than doing nothing if it is actually helpful. The money, time and resources being spent on this unworkable idea are money, time and resources that are not being spent on actually helping children. At its very best, if everything works properly and all the technical and legal issues are overcome, then a small number of computer illiterates who share old kiddy porn will be stopped. Or at least slowed down.
Not one child will be protected from being exploited.
Not one image will be taken out of circulation.
Not one image will be prevented from getting into circulation.
One final technical point. If the technology actually worked as advertised, why isn't it being exploited by people who could profit from it? Where are the hundreds of millions of legitimate, copyrighted images being used illegally that this technology should be able to track down? Why aren't the courts backed up with claims for compensation for provable copyright infringements? The licence fees alone for this technology should be able to fund major child protection efforts.
Not one child will be protected from being exploited.
Not one image will be taken out of circulation.
Not one image will be prevented from getting into circulation.
AIUI, part of the posited benefit comes from providing a means to distinguish between previously seen and new images, which should allow LEOs to focus any child-rescue resources on identifying previously unknown victims.
I recently looked at an issue involving fake LinkedIn profiles. I was getting nowhere with a reverse image search of the profile images using the usual technologies until somebody suggested flipping the image... and all of a sudden the reverse image search started working.
That was a relatively simple circumvention technique. I'm sure there are plenty of reversible techniques you could apply to a picture that would screen it from this sort of detection. But it would probably still stop quite a lot of this material from being circulated.
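For what it's worth, the flip itself is a one-liner with Pillow (filenames made up):

    from PIL import Image, ImageOps

    # Mirror the image left-to-right before re-running the reverse search.
    ImageOps.mirror(Image.open("profile.jpg")).save("profile_flipped.jpg")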
I'm sure the people involved with this predominantly have good intentions, but I find the total secrecy around the subject disturbing. They tell us it's "child abuse images", and the worst of the worst at that, but we have absolutely no way of knowing. Obviously we can't see the images, and it doesn't sound like ordinary people can even get their hands on PhotoDNA or the technology behind it to verify that it won't falsely match legal pictures. Nor have they told us what exactly they intend to do if they get a match. Would it tell the uploader what happened? Would it just silently fail? Would they log the attempt and put the user on some kind of evil-list? The whole thing is too much in the dark for my liking.