
Turing test
I love it! After years of Turing contests, no one could make a computer act like a human, but in a few years, with financial incentive, spammers can write programs to read text and hear words that I can't even figure out!
Computer scientists have developed software that easily defeats audio CAPTCHAs offered on account registration pages of a half-dozen popular websites by exploiting inherent weaknesses in the automated tests designed to prevent fraud. Decaptcha is a two-phase audio-CAPTCHA solver that correctly breaks the puzzles with a 41- …
One Finnish news site I frequent uses this idea when submitting reader comments. It presents a trivial arithmetic question ("what is two plus five?") as Finnish words. I have yet to see any comment spam there, even though the "puzzle" has remained exactly the same for years. Probably having it in Finnish is enough to deter most spammers.
It would not work in English: you could simply use Wolfram Alpha to break it. Just tried asking it "what is two plus five" and the answer came back, in several representations of the number 7.
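To show how little it takes in English, here is a minimal sketch of a parser for that one question shape. The function name, the filler list and the zero-to-ten vocabulary are my own assumptions for illustration; this says nothing about how Wolfram Alpha works, and anything outside the "what is X plus Y" pattern is simply rejected.

```python
import re

# Hypothetical vocabulary: number words zero..ten only.
NUMBERS = {word: value for value, word in enumerate(
    "zero one two three four five six seven eight nine ten".split())}

# Hypothetical operator words mapped to the arithmetic they name.
OPS = {
    "plus": lambda a, b: a + b,
    "minus": lambda a, b: a - b,
    "times": lambda a, b: a * b,
}

# Words that carry no meaning in "what is X plus Y".
FILLER = {"what", "is"}


def solve_word_arithmetic(question):
    """Return the answer to '<num> <op> <num>', or None if not understood."""
    words = [w for w in re.findall(r"[a-z]+", question.lower())
             if w not in FILLER]
    if len(words) != 3:
        return None
    left, op, right = words
    if left not in NUMBERS or op not in OPS or right not in NUMBERS:
        return None
    return OPS[op](NUMBERS[left], NUMBERS[right])


print(solve_word_arithmetic("what is two plus five?"))  # 7
```

A fixed question is the easy case; a phrasing the sketch has no rule for comes back as None, so the real work for an attacker is covering the variations.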
On a website that had gone for the belt-and-braces approach of a question plus an image captcha.
The question was "what color are clouds". To which "white", "grey", and "gray" (given "color") would all seem to be possible answers. I can't remember which I chose initially but it didn't like it....
What is the success rate for breaking "question captchas"? For example, "half of six times one". Are natural language processing techniques good enough to parse the meaning of such text?
(and if not, as VoodooTrucker pointed out, I'm sure the academics would appreciate some help from the spammers)
This is not new. Search Google and you'll find many examples of this being done for years. Why do you say it's irreparable? That's nonsense. They'll just improve the audio captchas, the same way as image captchas have improved over the years.
They don't need to make them 100% uncrackable. They just need to slow people down. At the end of the day, any serious spammer pays for captcha cracking nowadays. 1000 captchas for $1 is the going rate. Usually outsourced to the Philippines/India.
Because of known research into noise reduction, we already have robust mechanisms for audio analysis. So computing capability runs headlong into legal obligations. Many sites MUST have audio CAPTCHAs to accommodate laws protecting the disabled, but this obligation itself becomes a tool that can be used by miscreants. It's like having a GPS that knows how to get you home...and then someone steals it.
I'm not too clear as to what kind of Semantic Noise Google uses, but I think what they're saying is that you can't use visual techniques in audio CAPTCHAs. You have to go about them from a completely different angle: instead of the CAPTCHA just saying a few words, have it speak a specific task to perform (and to defeat simple speech recognition, it cannot be a simple instruction; give an instruction that involves some thinking like, "Enter the sum of 2 and 8, as a word." or "Type, in order, all the vowels in the word 'deviate'.").
That you're providing a free decoding service for the Google book scanning project. The secondary purpose is to prevent comment spam.
I used to do my part to get rid of the easy to decode captchas by spending 3 hours/night decoding them (yes, I was bored and suffering insomnia). Now, even I have a hard time deciphering them, since they appear to have run out of English words.
Alas, you can't beat the sweatshops of the world, and people are supposed to beat your "hard for computer" tests (though in truth, sometimes I take a few goes myself at some of the mangled text tests these days!).
If only people didn't follow spam email / links and the market would dry up and die...
http://xkcd.com/810/
It seems that captchas are trying to keep ahead of research into computer comprehension of text and spoken voice.
It would be cool if websites identified legitimate human beings using visual or auditory illusions which can be picked up by the human brain, but not easily deciphered by a computer. I'm not even sure if these exist -- certainly I imagine that using binaural beats would not work, since I suspect that the frequency of the beats is easy for a computer to calculate.
"It would be cool if websites identified legitimate human beings using..."
Sorry to pick on you, but no it wouldn't.
Even a 100% foolproof mechanism (no false positives and no false negatives) is doomed to failure because (as an earlier comment pointed out) the best scripts actually use human beings to break through the captchas. Captchas solve the wrong problem.
With the world as it is, you have an army of poor people willing to help the crooks spam the rich-but-stupid people. To truly solve the problem, you'd have to either solve world poverty or get rid of all the stupid people. Take your pick, and good luck with either, but don't hold your breath.
As I understand it, a bot scrapes a website, fills out whatever form it is trying to fill out, and passes the captcha to the human 'solvers.' They solve the captcha, and the bot takes the response and feeds it back into the form.
However, if the mechanism related to something on the website itself (eg, "What animal is this website's mascot", "type the name of this website into this box", "What color is the background of the logo", "What font most resembles the website's font", and so on), that approach won't work. Taking the captcha out of the website will make the answer impossible to find.
Of course, chances are that the 'solvers' the crooks are using only speak one language, so using a captcha that uses natural language would defeat them handily enough in every language but their own. If you have a different captcha (or set thereof) for each language you support, and you notice one language group has a lot more bots making it through, you'll get a pretty good idea as to where the 'solvers' are from... dunno what you'd do with that information, but it would be interesting data, anyway.
The botherders aren't THAT stupid. If the CAPTCHA requires context, they provide the mooks with the appropriate context (such as a picture of the site scrape). And as for the language barrier, they simply make sure their mooks are from certain countries or are of a certain level of language comprehension. A little more work, but nothing compared to the rewards.
Instead, try: "type the word that was followed by a cat sound." That way, you could have a lot of captchas. Even "What animal was guessed correctly?" and play animal sounds and animal names together ("This is a dog, baaa, this is a cat, meow, this is a bird, growl")
Plus, it could be extended when they start profiling the sounds (like the staccato sound of the sheep). A pneumatic road digger, put through the right filters, would be hard for a computer to tell apart from the bleat of a sheep.
Failing that, get someone with a strong accent to read out the words on the current captchas.
A human can (usually) work out what is being said.
When they start encoding speech to text to handle UK accents, it will be a bonus for anyone who's tried voice recognition booking systems.
Half of them are bloody unreadable. I know the idea is so computers can't read them and thus it prevents bots, but half the time I can't read them either. They're especially annoying when paired with websites that reset the password fields on failure and websites which don't check if a username is free before you hit the submit button....it's a barrel of laughs trying 7 captchas, finally getting one right and then being told your username choice is already taken...
First they note that all current audio captchas _except_ Recaptcha can be defeated. They then go on to conclude: -
"As a result, we suspect that it may not be possible to design secure audio captchas that are usable by humans using current methods. It is therefore important to explore alternative approaches."
Excuse me? They've demonstrated that many sites use an easily defeated approach, when there's one available that's still undefeated. Isn't semantic noise a "current method"? What alternative approaches need to be explored exactly?
Any unique or novel system you employ for a smallish website will work simply because they won't bother putting in the man hours to create a bot just for your site.
When it comes to signing up for webmail/IM accounts ... the same tricks simply won't work. Create 1000 questions, someone will make a database of the 1000 matching answers, ask math questions and they will write code to parse the equations and solve them. Asking the user any multiple choice question (like which image is a cat) fails because they can simply guess and still get an economically viable success rate.
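A back-of-envelope sketch of why blind guessing stays viable: with N equally likely choices, a random guess succeeds with probability 1/N, so a bot needs N attempts per success on average. The function names and the figures in the example are made up for illustration, not measurements from any real site.

```python
def expected_attempts(num_choices):
    """Average number of blind guesses per success.

    Each guess succeeds with probability p = 1 / num_choices, and the
    mean of a geometric distribution with success probability p is 1 / p.
    """
    p = 1.0 / num_choices
    return 1.0 / p


def successes_per_day(attempts_per_day, num_choices):
    """Expected accounts created per day by pure guessing."""
    return attempts_per_day / num_choices


# Hypothetical figures: a 4-way "which image is a cat" question and a
# bot that submits 100,000 sign-up attempts a day.
print(expected_attempts(4))           # 4.0
print(successes_per_day(100_000, 4))  # 25000.0
```

Unless failed attempts cost the bot something (rate limits, lockouts), a one-in-four hit rate is a rounding error to whoever runs it.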
So if you are a little guy ... yeah you can come up with something unique and clever and be spam free, if you are Microsoft, Yahoo, Google ... there are no easy answers.
Or rather, you use the database to hold PIECES of your puzzle. To construct the actual puzzle, you take the pieces and mix them together. Then the number of combinations can add up dramatically. Add different rules for each possible phrase (such as switching between stating all the vowels to stating the sum to stating letters 5-7). The more arrangements you make, the trickier it becomes for a speech recognizer to pick out the task to do. You can also use phrases that can change depending on context ("recognize speech" vs. "wreck a nice beach") so can easily trip up speech recognition.
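A rough sketch of the pieces-and-rules idea, generating only the text of the prompt and its expected answer. The word pool, the three rules and all the names are hypothetical; turning the prompt into distorted speech is the hard part and is not shown.

```python
import random

# Hypothetical pool of pieces; every word has at least 7 letters so the
# "letters 5 to 7" rule always has an answer.
WORDS = ["deviate", "captcha", "spammer", "register"]

NUMBER_WORDS = ("zero one two three four five six seven eight nine ten "
                "eleven twelve thirteen fourteen fifteen sixteen "
                "seventeen eighteen").split()


def vowels_of(word):
    """Vowels of the word, in the order they appear."""
    return "".join(c for c in word if c in "aeiou")


def vowels_rule(rng):
    word = rng.choice(WORDS)
    return f"Type, in order, all the vowels in the word '{word}'.", vowels_of(word)


def sum_rule(rng):
    a, b = rng.randint(0, 9), rng.randint(0, 9)
    return f"Enter the sum of {a} and {b}, as a word.", NUMBER_WORDS[a + b]


def letters_rule(rng):
    word = rng.choice(WORDS)
    return f"Type letters 5 to 7 of the word '{word}'.", word[4:7]


RULES = [vowels_rule, sum_rule, letters_rule]


def make_puzzle(rng):
    """Pick a rule at random and fill it with randomly chosen pieces."""
    return rng.choice(RULES)(rng)


prompt, answer = make_puzzle(random.Random())
print(prompt, "->", answer)
```

Even this toy version yields 4 + 100 + 4 = 108 distinct prompts from a handful of pieces, and each new word or rule adds to that count, which is the point about not storing whole puzzles in the database.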
Years ago a blogger I read complained that his site was being botspammed by over a thousand stupid comments a day. He implemented a captcha. By day three the captcha was always the same and easily guessable even if the distortion was too bad to read the damned thing. Seems it wasn't worth the bots' time and they just moved on no matter what was used. His (s)hit rate dropped from 1K+ to ~3 per day.
I've always used a policy of moderating posts until the poster says something remarkably on-topic. This has successfully blocked bots, idiots and boring people from ruining the peace and quiet, but I can see the approach wouldn't scale to TwitFace levels.
I find the conclusion strange too, when the article obviously states that Google's reCAPTCHA system hasn't been defeated.
It's interesting because technically, even the system itself does not know the answer to one of the words presented! This is because Google is actually using us, the recaptcha users, as a way to recognize words from old books and print that have failed OCR. See:
http://www.google.com/recaptcha/learnmore
Up until now I had no idea how much wood a woodchuck could chuck...
Anyway, back to the topic. It's amazing that there are actually still sites out on the net that don't implement any form of captcha. I remember once suggesting it in an email to gumtree, and the reply I received was so snotty that I never visited the site again. Hopefully the moron who replied to me has been sacked for incompetence by now, and they have implemented something, but I won't hold my breath.