still deciding on whether to release the code
How about, "No."
Please?
OpenAI researchers believe they have discovered a shockingly easy way to hoodwink their object-recognition software, and it requires just pen and paper to carry out. Specifically, the lab's latest computer vision model, CLIP, can be tricked in what's described as a “typographical attack." Simply write the words ‘iPod’ or ‘ …
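For anyone curious about the mechanics behind the quote: CLIP-style zero-shot classification embeds the image and each candidate label into the same vector space, scores each label by cosine similarity to the image embedding, and softmaxes the scaled scores. A minimal sketch with invented toy vectors (in the real model these are learned ~512-dimensional embeddings, and the scale of ~100 approximates CLIP's trained logit scale):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings, invented purely for illustration. In CLIP the image
# encoder and text encoder produce these; writing "iPod" on the apple
# drags the image embedding towards the "iPod" text embedding.
image_embedding = [0.1, 0.9, 0.2]            # apple with "iPod" written on it
label_embeddings = {
    "Granny Smith": [0.8, 0.3, 0.1],
    "iPod":         [0.1, 0.95, 0.15],
    "pizza":        [0.2, 0.1, 0.9],
}

scores = [100 * cosine(image_embedding, v) for v in label_embeddings.values()]
probs = dict(zip(label_embeddings, softmax(scores)))
best = max(probs, key=probs.get)
```

With these made-up vectors the "iPod" label wins with near-total probability, which is the shape of the 99.7% result the article describes.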
"OpenAI is still deciding whether or not to release the code." - Then maybe they should change their name? If you want to be "open" then work openly. If your work is too dangerous, important, racist, or crap to be released, then change your company name to "BiasedAI", "ClosedAI", or "YAFAC - Yet Another Fucking AI Company".
This is research - not for production. Research captures the imperfections and seeks solutions to them.
You can't paper over the world's imperfections; you identify them, point them out, and fix them.
Such imperfections and weaknesses of AI (or any science or technology, for that matter) should be brought out and publicised, and that openness encouraged, not swept under the carpet. And abuse of technology is also real.
Narrow minded thinking like yours would have called fire and smallpox too dangerous to research and left it at that.
Ignorance is not a defense
The form of words for UK classified documents is "This page is intentionally blank", with the same result of silicon paradox paralysis for any overly-literal AI reading it! The reason originally, of course, is that it caught errors with photocopying - if there was a truly blank page you knew there was something wrong. Photocopiers should have been set up to print "This page is unintentionally blank" on mis-feeds, I suppose.
Interesting point, human. Now I ask 'How did the general public at the time react to "this is not a pipe"'? #
Better yet, it always amuses me how humans first reacted to film of oncoming trains, and the like. Or is that one of those wrong-facts that you humans keep learning and parroting despite all the evidence? Either way, your vaunted non-artificial "intelligence" clearly still has some work to do. #
:-) #
Especially since the train doesn't actually come towards the spectators: the camera is on the platform, and the train moves (slowing to a stop) to the left.
There is no reason to be afraid, even if silent, noisy B/W images of moving trains generally scare you...
"Simply write the words ‘iPod’ or ‘pizza’ on a bit of paper, stick it on an apple"
Since the piece of paper in the photo largely obscures the apple, I'm not a bit surprised, as the AI essentially sees (and we see) a large label with little bits of apple round it. I guess a human's attention would focus on the label rather than the apple too. How would the AI perform if the label were smaller and the apple were mostly visible?
The problem is not what it 'sees', the problem is an inadequate representation of what it is 'seeing'.
If one were to take many apples, and attach a similar-size label to each with different words on them, e.g. 'screen', 'keyboard', 'mouse', 'cpu', a human would see a set of apples with labels on them. The AI might well report the image as being 'a computer', or if less clever, a collection of objects like a fire-screen, a piano-keyboard, a small furry creature and an AMD Zen processor. The problem is not the size of the label, but the almost context-free processing of the information it is gaining from the analysis of the image.
Allowing a hand-written label to override what is actually there simply does not make sense. It doesn't look like the AI will easily acquire the necessary domain knowledge on its own, either.
It's also down to the entirely daft problem being solved: the question being asked is not "what is in this image", it is "what single object is in this image". Which fails very, very quickly. For example, give it the picture of the labelled apple: will it respond "apple", or with a composite description of the two objects involved?
Even a young child when asked what is in the example image in this story would likely say something like "an apple with a piece of paper/label stuck to it", adding that there is writing on the paper/label if they are older. This is a description of a multi-object scene and includes adequate description to relate the subject for most situations and also shows context. Expecting singular object returns is self-defeating and shows a blinkered approach that is never going to work for anything other than clean, sanitised images of single, non-composite objects.
The wider problem is that no-one seems to be training AIs on wider problems. A 1yo child has experience of the world through vision, sound, taste, smell, interaction, and (one hopes) the beginnings of a system of externally imposed behavioural constraints. When such a child sees a picture, they know it is a picture rather than the real thing but they can also understand that it can stand in for the real thing in some contexts, such as a conversation.
I haven't seen any reports of AIs being trained on such a broad range of inputs, so I'm not surprised that they are still so easily led astray. I do wonder, though, whether the hardware is now beefy enough to start planning such experiments.
"Even a young child when asked what is in the example image in this story would likely say something like "an apple with a piece of paper/label stuck to it", adding that there is writing on the paper/label if they are older. This is a description of a multi-object scene and includes adequate description to relate the subject for most situations and also shows context. Expecting singular object returns is self-defeating and shows a blinkered approach that is never going to work for anything other than clean, sanitised images of single, non-composite objects."
I wonder if the problem is that it takes the first likely answer and that text recognition has a higher priority than image recognition. I wonder how the "AI" would respond to an apple with the word pizza written directly on it with a marker pen?
Now there's some testing that could happen!
What it should respond with is "apple with the word pizza written on it" however as the question being asked is "what single object is in this scene" (single word responses please), the answer should be "apple" however "writing" would be a valid response as it's an identifiable object in the scene.
One positive thing about all this though, is that the text recognition is working well.
I suspect that the issue is the text recognition having a much higher level of confidence. So the system may see an apple with 25% confidence, but because it reads whatever word is written with 90% confidence, it decides that is the best answer.
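That hypothesis, two detectors with very different confidences feeding a naive winner-takes-all, can be sketched in a few lines. The numbers come from the comment above; the max-based fusion rule is purely an assumption for illustration:

```python
# Hypothetical per-label confidences from two separate signals, as the
# comment describes: weak object recognition vs. strong text recognition.
object_scores = {"apple": 0.25, "iPod": 0.05, "pizza": 0.02}
text_scores   = {"apple": 0.01, "iPod": 0.90, "pizza": 0.01}  # reads the label

# Naive fusion: take the best score per label from either signal,
# then pick the overall winner. The written word dominates.
combined = {k: max(object_scores[k], text_scores[k]) for k in object_scores}
best = max(combined, key=combined.get)
print(best)  # iPod
```

Under any fusion rule that doesn't down-weight the text channel, the 90% read of the written word beats the 25% view of the apple.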
Guys, did you miss the question? At no point does the AI say the object in the picture is an iPod; it only says "from the list of classifiers I have (pizza, iPod, toaster), the term iPod seems to fit best". Which is definitely true.
People who know me around here know I'm really not an AI apologist, but one has to admit that at no point could the answer to "What is the keyword for that picture?" be "apple". You barely see it, not much more than the surface it's standing on. So, among the software's choices, "iPod" is clearly the most appropriate, even if it's not what the programmers were expecting. The problem was not the AI's answer, but their expectation that it might solve the philosophical question (some have already mentioned Magritte's pipe) of what reality in a picture is...
It pains me to say, but in this instance it's AI 1, humans 0...
Not having read the research itself, I think you are on the right track. The problem is perhaps more with the question which was asked.
If the software is being used to classify images - to list labels which are appropriate for a given image - then "iPod" is valid for the second image and "Granny Smith" probably is not. If the list of possible labels is as limited as it appears, then it definitely is not. What "iPod" isn't, from our point of view, is a useful label.
As humans we don't find the label "iPod" to be at all useful because the key things in the image are the label and the item to which it is attached. More useful tags would be "paper" or "label" with a sub-tag of "text" with a sub-sub tag of "iPod", and a secondary tag of "apple" (this should be recognisable even from the small amount visible). You could add a descriptive sub tag of "green", maybe "waxy". Given that there are scores of mostly green apple varieties, "Granny Smith" is pushing it a bit far for the second image, maybe even for the first.
M.
"The problem is perhaps more with the question which was asked."
Next up is the Jeopardy[*] AI - given an object, what question could we ask about it such that OpenAI gives an answer that is not surprising to humans?
[*] Jeopardy is probably trademarked. No problem. We just need an AI that can come up with a suitable synonym for the Jeopardy AI.
Or, perhaps, the problem is with the humans who don't understand the *actual* question that was asked, they only understand what they *think* they asked. :-)
Remember, your code is not required to do what you want it to do. It only does what you tell it to do (i.e. what you have coded it to do).
"one has to admit that at no point the answer of "What is the keyword to that picture" could be "apple"."
Of course it could. As other comments have noted, the answer a human would give would likely be along the lines of "an apple with a label saying "ipod" stuck on it". Only a small part of the apple is visible, but it's still plenty for a human to clearly recognise it as the primary subject of the photo. This isn't a philosophical debate, it's the entire point of this kind of machine learning development - how to get a computer to actually recognise what is in a picture. It's really quite weird to complain that the developers have the wrong expectations, when getting the system's answers to match their expectations is the sole goal of the research.
> the answer a human would give
Sure, and the answer a fish would give is "...". The problem is that in this case the software was built to choose among a list of labels (check the picture; you can see part of the list) and use the most fitting one. From this perspective its answer is without any doubt the most pertinent: the thing characterizing that picture, the first thing one notices, is "iPod". And, coincidentally, there is a fitting label for that.
Also, even if it weren't just asked to label the picture, the notion of a fruit onto which one can fasten another object, which in turn carries coded information, is miles beyond a labeling software's intellectual capacities. In short, a human might indeed grasp it; an "AI" definitely will not.
I think what mike was trying to say, is how do you know whether it was identifying *the object* iPod, or *the word* iPod.
If it was the latter, then I would say that it was entirely correct and not a hack at all.
If the same engine can recognize both words and objects, then outputting just "iPod" rather than "word: iPod" or "object: iPod" is the mistake, not that it misidentified what it saw.
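Making that distinction explicit in the output is trivial to represent; the point is that a flat label erases it. A sketch of what a disambiguating result type might look like (these class and field names are made up, not part of any real CLIP API):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """A single recognition result tagged with what kind of thing it is."""
    kind: str          # "object" or "word"
    label: str
    confidence: float

    def __str__(self):
        return f"{self.kind}: {self.label} ({self.confidence:.0%})"

# The apple-with-a-label scene would then yield two results, not one,
# and the reader can see that the high-confidence hit is just text.
results = [
    Detection("object", "apple", 0.25),
    Detection("word", "iPod", 0.90),
]
for r in sorted(results, key=lambda r: r.confidence, reverse=True):
    print(r)  # "word: iPod (90%)" then "object: apple (25%)"
```

With output in that shape, "word: iPod" stuck on "object: apple" is an entirely correct description of the scene, and the "attack" evaporates.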
> failed to recognize the apple, but that it was 99.7% sure it had recognized an iPod
It didn't "recognize" anything at all; it had the task of putting a label on that picture, and the label "iPod" is clearly the best choice. Nobody can deny it.
As I said above, the philosophical joke of Magritte's pipe is way beyond the capacities of a simple multiple-choice software.
I'm pretty sure that, confronted with Magritte's pipe painting, it would have labeled it "pipe", oblivious to the hint that it isn't actually a pipe, just a painting of one. The day some AI can play with such abstract notions isn't anywhere on the calendars yet...
As mentioned, if you view the process as a classifier rather than "AI", it's pretty clear that the result is expected. So...either the researchers are being stupid, or...they are drumming up publicity for themselves for whatever reason.
A good demonstration of naivete; they need to include a bastard network to come up with reasonable sceptical options.
That network already exists, but is recruiting new members: those annoying captcha trainers that ask you to identify every image with a bicycle in it. I've often wondered how many times you'd have to misidentify those images to pollute the AI's neurons. I do my bit to help find out.
But I'm curious whether they're trying to get too specific in training this AI. I can kind of see how it might think 'library' given the background, but it seems a bit odd that it might pick toaster over some other variety of apple.