Sunday, November 2, 2008

Google turns on OCR for scanned PDFs

By David Chartier

Google has covered quite a lot of turf during the march toward its goal of making every last bit of the world's information searchable. But considering all the ground that has yet to be covered—especially in the realms of offline data and paper documents—we weren't surprised when Google began dabbling with OCR technologies over the last couple of years. Now, the search giant has officially launched its next attempt to handle some of this previously unsearchable content.

As announced on the Official Google Blog, the company is now performing optical character recognition (OCR) on documents that it indexes and identifies as scanned PDFs. Google has indexed documents that were saved as text-based PDFs for quite some time. But many documents wind up being made into PDFs through scans, which store the text as images. Google has now decided that its open-source OCRopus technology, based on software called "Tesseract" that HP developed, is up to the task of indexing scanned documents that can contain any mixture of text, images, and coffee stains.
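To make the difference between text-based and scanned PDFs concrete, here is a minimal Python sketch of how a crawler might guess which kind of file it is looking at. This is purely illustrative and is not Google's method: the function name `looks_scanned` and the byte-marker heuristic are our own assumptions. It checks only whether the raw file mentions image objects but no fonts.

```python
import re

# Hypothetical heuristic, not Google's pipeline: a text-based PDF normally
# declares font objects, while a pure scan usually carries only page images.
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image")
FONT_MARKER = re.compile(rb"/Font")


def looks_scanned(pdf_bytes: bytes) -> bool:
    """Return True when the bytes mention image objects but no fonts."""
    has_image = IMAGE_MARKER.search(pdf_bytes) is not None
    has_font = FONT_MARKER.search(pdf_bytes) is not None
    return has_image and not has_font
```

A file flagged this way would be a candidate for OCR. The check is crude: PDFs that pack their objects into compressed streams hide these markers, and a scan that already carries a hidden text layer declares fonts, so a real indexer would parse the file properly instead of matching bytes.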


Google's servers no longer need to be afraid of these warnings

We went hands-on with the first alpha of OCRopus in October 2007 and found it to be hit and miss: it had trouble with fonts other than sans-serif and with type set in smaller sizes. But Google has since set a few engineers on the task of updating Tesseract for the 21st century. The company is obviously confident that OCRopus now has the ability to index a whole new library of texts, papers, and medical journals that previously were locked tomes as far as Google's servers were concerned.

Google didn't respond to our request for comment in time for publication, so we don't know what percentage of Google's massive index OCRopus has already crawled through. Google provides a few examples of the benefits of indexing scanned PDFs, though, such as this search for "repairing aluminum wiring." The first result is a PDF from the US Consumer Product Safety Commission with that exact title; the scan has all the quintessential blurry and blotchy text that makes OCR a nightmare.

Google's "View as HTML" feature is quite useful for these documents, especially if you need to copy portions of them for notes. Notably missing from Google's native text views of scanned PDFs, however, are any of the images or diagrams included in the original document. Amusingly, though, any text that Google is able to parse out of images embedded in PDFs, such as diagrams or graphs, is also indexed and available in the HTML view.

While adding OCR to Google's indexing engine will certainly make more information searchable and accessible, Google may run into opposition from organizations or universities with scanned PDFs that were placed online specifically for humans, not machines. Google has undoubtedly stumbled across PDFs that include copyrighted material and personal information; it has now made those things much easier to find.

Original here
