A couple of days ago Melanie commented that she'd gotten a weird image in the anti-spam verification box here:
Seriously, my first captcha had Greek letters! I don't know how to do that on my keyboard. Was I supposed to transliterate to English?
I've never seen Greek letters, but the question led me to idly click around to find out exactly what the captcha program, which is called reCAPTCHA, is doing. I had a vague memory that it was supposed to be using the human entries to digitize scanned content, but I didn't know how it worked and had wondered: if the content hasn't been digitized yet, how does it know if your entry is correct?
The explanation is on the reCAPTCHA website and also in a 2008 article that appeared in Science (pdf), and it turns out to be the reason you get two words.
- One of the two words is already known to the software, and it serves as the shibboleth that proves you are human (which lets you post your comment) and verifies that you can probably decipher a distorted word in print.
- The other word is unknown to the software and represents a part of a scanned document that optical character recognition (OCR) has failed to decode satisfactorily.
Or, as the reCAPTCHA people put it,
Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
There, isn't that interesting? And this explains how Greek letters made it in, probably via one of the unknown words: OCR couldn't recognize the word not because it was distorted, but because it was in the Greek alphabet.
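For the programming-minded, the two-word scheme in the quoted description can be sketched in a few lines of Python. This is only a rough illustration of that description, not reCAPTCHA's actual code: the class name `TwoWordVerifier`, the `normalize` helper, and the agreement threshold of 3 are all hypothetical, and the post does not say how many matching answers the real system requires.

```python
from collections import Counter, defaultdict


def normalize(text):
    """Compare answers case-insensitively and ignore surrounding spaces
    (an assumption; the real matching rules are not given in the post)."""
    return text.strip().lower()


class TwoWordVerifier:
    """Hypothetical sketch of the known-word / unknown-word check."""

    def __init__(self, agreement_threshold=3):
        # Assumed number of matching answers needed before we trust a reading.
        self.agreement_threshold = agreement_threshold
        # unknown image id -> tally of the answers people typed for it
        self.votes = defaultdict(Counter)

    def submit(self, known_answer, known_guess, unknown_id, unknown_guess):
        """Return True if the user passes the test.

        The user passes only by getting the known word right. If they do,
        their answer for the unknown word is recorded as one vote.
        """
        if normalize(known_guess) != normalize(known_answer):
            return False
        self.votes[unknown_id][normalize(unknown_guess)] += 1
        return True

    def accepted_reading(self, unknown_id):
        """Return the agreed reading of the unknown word, or None if
        not enough people have given the same answer yet."""
        tally = self.votes.get(unknown_id)
        if not tally:
            return None
        guess, count = tally.most_common(1)[0]
        return guess if count >= self.agreement_threshold else None
```

Note that in this sketch a wrong answer on the known word means the answer for the unknown word is thrown away, which matches the quoted idea that the system only trusts the new word when the known one is solved. A word in Greek letters would simply be one more unknown image collecting votes.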