This article came out of my experience with a copy-protected PDF file that I was using for one of the cases assigned to me in the Business Modeling class I am attending. The case requires that I determine whether a series of account balances matches expected cash flows. I am to use regression and other statistical tools to determine that, but before I can run my analysis I must first enter all of the data into a spreadsheet. Easy, I thought, until I found out that I could not copy the table of data because it was protected! Now imagine the amount of time it would take to enter 150 rows and 6 columns of data manually. I began thinking of ways to get around this protection and the bulb came on… OCR!
OCR stands for Optical Character Recognition. Put simply, OCR software recognizes text in an image; this technology is frequently used by scanners and copiers that can automatically scan a document into an editable Word document. So how does all this fit in? What we are going to do is take a screenshot of a page from the protected PDF document, then open that image in an OCR program, which will recognize the text and save it to an editable document that you can use.
In order to do this you will need an OCR application and, of course, Acrobat Reader to open the PDF. In my case I have Adobe Acrobat 8 Professional, so I will be doing the OCR and writing this guide with that software. At the end of this article you will find a list of OCR software you can download for free. If you have any questions or comments, feel free to post them below.
Note: I used SnagIt to capture the screenshots and Acrobat Professional to read the image files.
***This guide is intended only for PDFs that disallow copying of data or printing***
The example below demonstrates how to pull data from a table and add it to a spreadsheet in Excel.
1. With the PDF open, center the screen so that you can clearly see the information you would like to capture. Once you’ve done that, press your Print Screen button to activate screen capture.
Drag and select the area you wish to capture. Save your capture to a local directory.
2. With Acrobat open, click on File > Create PDF > From File…
3. Browse and locate the screen capture from Step 1, click Open to open the file.
4. The image file you selected should now be open in Acrobat.
5. Click on Document > OCR Text Recognition > Recognize Text Using OCR…
6. Click OK in the Recognize Text dialog box.
Note: If you are having trouble recognizing the text, you can click on the Edit… button to set a higher resolution.
7. Once Acrobat (or your OCR package) is done performing text recognition, you should be able to select the data within the program. Select (highlight) the information you wish to copy.
8. Once selected, click on Edit > Copy to copy the information to the clipboard.
9. Open Excel with a blank spreadsheet, click on the Paste button and select Paste. This pastes the information you copied from the PDF you created onto the spreadsheet.
10. Once the data is pasted onto the spreadsheet, you will most likely have to format it so that it matches the layout of the original file.
Click on the Paste option box that appears and select Use Text Import Wizard… from the drop-down menu.
11. For this particular set of data I chose Delimited (in other words, characters such as spaces, commas or hyphens separate the data).
Click Next to proceed.
12. The data in this table is separated by spaces (most of the time). Add a check mark next to the delimiter used in your data set, then click Next to proceed.
13. Most of the time not all of the data will line up, so you might want to complete the formatting only once all the data is correctly aligned. Click Finish to exit the wizard.
14. As you can see, the data did not line up perfectly; this is where Find and Replace comes in handy. Once you’ve aligned your data you can format it, and you’re done. You’ve successfully pulled data from a protected PDF file.
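If you would rather script Steps 11 to 14 than work through them by hand, here is a minimal sketch in Python. It assumes the OCR text has been pasted into a string with one table row per line, that columns are separated by spaces, and that every column is numeric with 6 columns per row (as in my case). The function names, the substitution table and the sample rows are all made up for illustration; adjust them to your own data.

```python
import csv
import io

EXPECTED_COLUMNS = 6  # assumption: the table has 6 columns per row

# Letters that OCR commonly confuses with digits (illustrative, not exhaustive).
DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})


def fix_digits(cell):
    """Apply the substitutions, but keep them only if the result is a number."""
    candidate = cell.translate(DIGIT_FIXES)
    try:
        float(candidate.replace(",", ""))  # tolerate thousands separators
    except ValueError:
        return cell  # still not a number, so leave the cell untouched
    return candidate


def split_rows(ocr_text, expected=EXPECTED_COLUMNS):
    """Split each line on runs of whitespace (the Space delimiter).

    Returns (good, bad): rows with the expected column count, and
    (line_number, raw_line) pairs that need to be aligned by hand.
    """
    good, bad = [], []
    for line_number, line in enumerate(ocr_text.splitlines(), start=1):
        cells = line.split()
        if not cells:
            continue  # skip blank lines
        if len(cells) == expected:
            good.append([fix_digits(cell) for cell in cells])
        else:
            bad.append((line_number, line))
    return good, bad


def to_csv(rows):
    """Serialize rows as CSV text that Excel can open directly."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical OCR output: row 2 has a stray space inside "210.00".
sample = """\
1 1O0.50 200.00 3l.25 4.00 5.00
2 110.75 2 10.00 32.00 4.50 5.25
3 120.00 220.00 33.00 5.00 S.50
"""

good_rows, misaligned = split_rows(sample)
print(to_csv(good_rows))
for line_number, line in misaligned:
    print(f"line {line_number} needs manual review: {line}")
```

Rows with the wrong number of columns are reported rather than guessed at, because a stray space inside a number cannot be repaired reliably without looking at the original page. The letter-to-digit substitutions are only safe on columns you know to be numeric.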
The steps below summarize the use of this workaround to copy a protected paragraph from a PDF into a Word document.
Before proceeding, repeat Steps 1-6 using a paragraph instead of a table.
15. Once you’ve selected the paragraph you wish to copy, click on Edit > Copy File to Clipboard
16. Open Word and create a new file, then right-click on the blank document and click on Paste in the menu that appears.
17. The text you copied should now be available to you in Word. You’ve successfully pulled a paragraph from a protected PDF file.
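Text copied out of an OCR'd page may carry over the line breaks of the original layout, including words hyphenated across lines. If that happens, a small script can reflow the paragraph before you paste it into Word. The following is a minimal sketch in Python; the function name `reflow` and the sample text are hypothetical, and the hyphen rule is a simplification.

```python
import re


def reflow(ocr_text):
    """Join hard-wrapped lines into paragraphs separated by one blank line."""
    paragraphs = re.split(r"\n\s*\n", ocr_text.strip())
    cleaned = []
    for paragraph in paragraphs:
        # Rejoin words split by a hyphen at a line end: "recog-\nnizes".
        # Caveat: this also merges genuine compounds such as "copy-\nprotected".
        text = re.sub(r"(\w)-\n(\w)", r"\1\2", paragraph)
        # Collapse remaining line breaks and repeated spaces into one space.
        text = re.sub(r"\s+", " ", text).strip()
        cleaned.append(text)
    return "\n\n".join(cleaned)


# Hypothetical OCR output with hard line breaks and a split word.
sample = "OCR software recog-\nnizes text from an\nimage.\n\nSecond   paragraph\nhere."
print(reflow(sample))
```

Because the hyphen rule cannot tell a line-break hyphen from a real one, it is worth proofreading the result against the original page.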
Free OCR Software