Tuesday, March 24, 2009

Worst CAPTCHA ever?

For those of you who don't know, a CAPTCHA is a program that can tell whether its user is a human or another computer. They're those little images of distorted text that you translate when you sign up for Gmail or leave a comment on someone's blog. Their purpose is to make sure that someone doesn't use a computer to sign up for millions of online accounts automatically, or to spam blogs randomly with cries for help from Nigerian princes. They make sure an actual human is filling out the form. They work based on the principle that no computer program can (currently) read distorted text as well as a human can.

I have a confession to make. I fail this scaled-down version of the Turing test on a regular basis. There's probably software out there that can outperform my current success rate of just about 2/3. Then there are those challenges that are impossible for both human and computer. Just today I was presented with the following CAPTCHA, presumably containing two words.

The image was presented by reCAPTCHA, probably the best anti-bot service in widespread use on the internet. reCAPTCHA works by taking scanned words from books that Optical Character Recognition (OCR) software finds difficult or impossible to read and presenting them in CAPTCHAs deployed all over the internet in order to get humans to read the words that the OCR software can't. The goal is to get a bunch of scanned books efficiently translated into digital form. It's a really clever and elegant solution to two problems in one.
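To make the two-problems-in-one idea concrete, here's a rough Python sketch of how such a scheme might be wired up. It assumes each challenge pairs a word the service already knows (which decides whether you pass) with a scanned word it doesn't (where your answer is just collected as a guess). The function names and the three-vote threshold are made up for illustration; this isn't reCAPTCHA's actual code or API.

```python
from collections import Counter


def normalize(text):
    """Ignore case and surrounding whitespace when comparing answers."""
    return text.strip().lower()


def submit(control_word, control_answer, unknown_answer, votes):
    """Grade only the known word; record the unknown word as a guess.

    `votes` is a Counter of guesses for the scanned word (hypothetical
    storage; a real service would keep this in a database).
    """
    passed = normalize(control_answer) == normalize(control_word)
    if passed:
        # Only trust guesses from users who got the known word right.
        votes[normalize(unknown_answer)] += 1
    return passed


def consensus(votes, min_votes=3):
    """Return the most common guess once enough people agree, else None.

    The threshold of 3 is an arbitrary assumption for this sketch.
    """
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None
```

With something like this, a user's answer for the scanned word can never make them fail the test; it only nudges the transcription toward whatever most humans read.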

Fortunately for enterprising young programmers, the CAPTCHA half of this problem is far from being solved. The line between what a human can read easily and what a software application can read is blurred and constantly changing. New ideas in the battle against Nigerian princes are always welcome.


UPDATE: After receiving a comment from Ben Maurer, the chief engineer at reCAPTCHA (and after Googling him to make sure he really is), it seems I have to eat a little crow. After spending enough time on the reCAPTCHA site for a statistically significant sample, I've determined that the image above is an outlier. My guess is that the word on the left is a scanned image or drop cap. Thanks, Ben, for straightening me out on this.

2 comments:

Ben Maurer said...

Hi,

I'm the chief engineer on reCAPTCHA. The word on the left is one that comes from books and that we're trying to OCR. As such, it is not graded.

Over 96% of reCAPTCHAs are solved correctly -- try submitting your best guess for each CAPTCHA on our site, and I think you'll find your success rate is extremely high.

Bill the Lizard said...

Ben,
Thanks for the reply. After spending a little bit of time on your site I found that you're absolutely right. I got 30 out of 30 correct.