I was cruising the web tonight and ended up at reCAPTCHA.net, the image verification service run by Carnegie Mellon University. They claim that over 60 million reCAPTCHAs are now completed every day, and they've launched a program that supplies the reCAPTCHA word images from books that are currently not machine readable and therefore not available in digital format. Each successful reCAPTCHA is a transcription of two words into digital form, and thus over time an entire book can be digitized.
I have to admit this is a pretty amazing concept. It would never have occurred to me to leverage the image verification process to harness the semi-unwitting cooperation of millions of us in what would otherwise be the completely monotonous, boring and thankless task of converting old books into digital format.
From their website:
To archive human knowledge and to make information more accessible to the world, multiple projects are currently digitizing physical books that were written before the computer age. The book pages are being photographically scanned, and then transformed into text using "Optical Character Recognition" (OCR). The transformation into text is useful because scanning a book produces images, which are difficult to store on small devices, expensive to download, and cannot be searched. The problem is that OCR is not perfect.
reCAPTCHA improves the process of digitizing books by sending words that cannot be read by computers to the Web in the form of CAPTCHAs for humans to decipher. More specifically, each word that cannot be read correctly by OCR is placed on an image and used as a CAPTCHA. This is possible because most OCR programs alert you when a word cannot be read correctly.
Learn more about the book digitization project at http://recaptcha.net/learnmore.html
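Here's a rough sketch in Python of how the flow described in the quote might look. It is only an illustration, not reCAPTCHA's actual code: the `OcrWord` and `Challenge` classes, the `confidence` score, the 0.8 threshold and the "three matching answers" rule are all assumptions made up for the example. The only parts taken from the quote are that the OCR program flags words it cannot read and that those words are handed to humans as CAPTCHAs.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class OcrWord:
    image_id: str      # hypothetical identifier for the cropped word image
    text: str          # the OCR program's best guess
    confidence: float  # hypothetical 0.0-1.0 score reported by the OCR program


def words_needing_humans(words, threshold=0.8):
    """Return the words the OCR program flagged as unreliable.

    The threshold is an assumed cut-off, not a documented value.
    """
    return [w for w in words if w.confidence < threshold]


@dataclass
class Challenge:
    """One unreadable word being shown to people as a CAPTCHA."""
    unknown: OcrWord
    answers: Counter = field(default_factory=Counter)

    def record(self, typed):
        # Normalise case and surrounding whitespace so equivalent answers match.
        self.answers[typed.strip().lower()] += 1

    def resolved(self, min_votes=3):
        """Return the agreed transcription, or None if not enough people agree yet.

        Requiring min_votes identical answers is an assumed rule for this sketch.
        """
        if not self.answers:
            return None
        answer, votes = self.answers.most_common(1)[0]
        return answer if votes >= min_votes else None


# Example: one word read cleanly, one flagged by the (imaginary) OCR pass.
page = [OcrWord("w1", "the", 0.99), OcrWord("w2", "rnorning", 0.41)]
flagged = words_needing_humans(page)
challenge = Challenge(flagged[0])
for typed in ["morning", "Morning ", "morning"]:
    challenge.record(typed)
```

Once enough people have typed the same thing, `challenge.resolved()` returns the agreed word, which could then replace the OCR program's bad guess in the digitized text.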








