Tuesday, June 21, 2011

Google Docs and Time-Saving Tips on Encoding Very Long Text Documents

As I have written before, I still keep a day job, and I recently took on the alter ego of a startup apartment owner. With that role comes the need to draft a fair rental agreement that is friendly to both tenant and owner. Thanks to Landlording: A Handymanual for Scrupulous Landlords and Landladies Who Do It Themselves for the great assistance.

How does Google Docs figure in this equation? Through its new feature, the tongue-twisting Optical Character Recognition, better known as OCR, which came in very handy when I needed to encode a long document from the aforementioned book as a model for the rental contract. In this article I'll discuss the steps I took, how it went, the pluses and minuses of the process, and the overall benefits.

The Prerequisites
Just like any process, this one has particular requirements to generate the expected results. Since Google Docs is an online app, we need consistent broadband internet access. We definitely need a scanner to scan the printed document. In this article, I used a Canon Pixma MP140, which I discussed here, to have print, copy and scan capability automagically in Ubuntu 10.04 LTS. For software, we need a program to manage the scanned images. In Ubuntu 10.04, this is usually Simple Scan, free software that is either already installed or can be freely installed since we have internet access.

The Process
Using Simple Scan, we scan the printed text into the computer, making sure that the text on every scanned page is in the same upright orientation. That means scanning the first page so that its text reads from top to bottom, then scanning all the other pages the same way. You can ignore text that looks slightly slanted in the scan; just make sure all of it will be readable to the OCR of Google Docs. It is advisable to scan and save one page per PDF, since Google Docs will only accept PDFs of up to 2 megabytes (MB) for OCR. Yes, this is tedious, but given that the OCR will encode most of the text for you, the extra effort is minor.
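Before uploading, it can help to check which scanned PDFs are over the size limit. Here is a minimal Python sketch, assuming the 2 MB figure mentioned above still applies and that the scans are saved together in one folder. The function name `oversized_pdfs` and the constant `MAX_OCR_BYTES` are made up for this illustration and are not part of Google Docs or Simple Scan.

```python
from pathlib import Path

# OCR upload limit as stated in this article (2 MB); an assumption that may change.
MAX_OCR_BYTES = 2 * 1024 * 1024


def oversized_pdfs(folder, limit=MAX_OCR_BYTES):
    """Return the sorted names of PDFs in `folder` larger than `limit` bytes."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf" and p.stat().st_size > limit
    )
```

Calling `oversized_pdfs("scans")`, where `scans` is a hypothetical folder holding the saved pages, lists the files that should be rescanned at a lower resolution or split further before uploading.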

Once you have logged into http://docs.google.com, click on the Upload->Files button to choose and upload the PDF of your choice. Once you have selected the PDF to be uploaded, you will be presented with this popup window:
To activate the OCR functionality, make sure the checkbox beside "Convert text from PDF and image files to Google documents" is ticked, as illustrated above.
After you click on the "Start Upload" button as illustrated, you will see a small progress popup window on the bottom right of your Google Docs session window, like below:
In the example above, you can see the PDF's filename and the progress of the upload. In the case of the illustration, the upload and conversion were already 43% complete.
After this process, you'll see the uploaded file listed as one of your documents, as shown below:
Note that the filename of the PDF you uploaded is retained as is.

Notes and Warnings
If the PDF you uploaded has some dirt or markings within the paragraphs and text, that portion of the document may not be properly recognized, or may not be recognized at all. Your mileage may also vary on the accuracy of the OCR, but given the time saved on encoding a very long document, I think you will spend your time proofreading the text rather than encoding it.
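As a rough proofreading aid, a few lines of Python can flag words that mix letters and digits, a pattern that sometimes shows up when OCR confuses similar-looking characters. This is only a heuristic sketch: the function name `suspicious_words` is made up for this example, it assumes you have copied the converted text out of Google Docs into a string, and it will also flag legitimate words such as ordinals ("1st").

```python
import re

# Matches whole words that contain at least one letter and at least one digit.
MIXED = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")


def suspicious_words(text):
    """Return words containing both letters and digits, in order of appearance."""
    return MIXED.findall(text)
```

Anything it returns still needs a human eye; it only narrows down where to look first.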

Having this OCR facility in place makes Google Docs a very practical and indispensable tool for your home office, document management and text encoding. And to get this for free? I think there's very little, if anything, to complain about.
