A fix for cleaning up OCR'd PDFs

I recently had a problem with a set of PDF'd text that I needed a fix for, and I thought I'd share the problem along with the solution. When PDFs get digitized they often retain elements of their paragraph formatting. The end of each line, which is set by the width of the original paragraph, is converted into a paragraph break when in fact it is not a real paragraph break. This does not appear to cause any problems with text to speech, but it does cause a problem when viewing the title as an eBook, especially an eBook with enlarged text. Correct formatting will also help future-proof these documents, so I suggest these few basic steps to remove the introduced paragraph breaks.

Use MS Word's find and replace to remove the extra paragraph breaks using special Word symbols

 

Control+H to open find and replace (in recent versions of Word, Control+F opens find only)

^p = Paragraph break

^l = Line break (manual line break; note that ^m is Word's code for a manual page break)

 

We want to keep the real paragraph breaks and remove the fake extra paragraph breaks. To do this we have to convert the double paragraph breaks into something else that is unique, remove the single paragraph breaks, and then convert the unique characters that were double paragraph breaks into new single paragraph breaks. It is best to do this at the beginning of the text correction stage, as it appears to mess with existing formatting styles.

 

  1. Find all double paragraph breaks
    1. find = ^p^p
  2. Replace them with a unique symbol or code, e.g. ' xswedc '
    1. (I found that placing a space before and after helps make it even more unique and stops it bunching up with other double paragraphs.) There isn't anything special about these letters, other than that they are a unique string of letters we can search on later.
  3. Find and replace all remaining single paragraph breaks
    1. find = ^p, replace = [single keyboard space]
  4. Find all the double paragraph breaks you previously changed into the special symbol or code and change them back to a single paragraph break
    1. find = ' xswedc ' (whatever code you chose, including its spaces), replace = ^p
  5. Find and remove all line breaks, changing them into single or double paragraph breaks instead (find = ^l, which is Word's code for a manual line break, replace = ^p)
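
For anyone working with the OCR output as a plain text file rather than in Word, the same placeholder trick can be sketched in Python. This is only a rough illustration, not part of the Word method above: it assumes a real paragraph break shows up as a blank line (two newlines) and a fake one as a single newline, and the function name remove_fake_breaks is made up for this example.

```python
import re

# The unique marker from the steps above; any string that never
# occurs in the text would do (assumption: "xswedc" is not in the text).
PLACEHOLDER = " xswedc "


def remove_fake_breaks(text: str, placeholder: str = PLACEHOLDER) -> str:
    """Join lines split by fake breaks, keeping real paragraph breaks.

    Assumes real breaks are double newlines and fake ones single newlines.
    """
    if placeholder.strip() in text:
        raise ValueError("placeholder already occurs in the text; choose another")

    # Normalise Windows and old Mac line endings to "\n" first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Steps 1 and 2: protect the real (double) paragraph breaks.
    text = text.replace("\n\n", placeholder)

    # Step 3: turn every remaining single break into a space.
    text = text.replace("\n", " ")

    # Step 4: turn the placeholder back into a single paragraph break.
    text = text.replace(placeholder, "\n")

    # Tidy up: collapse runs of spaces and trim spaces around breaks.
    text = re.sub(r" {2,}", " ", text)
    text = re.sub(r" *\n *", "\n", text)
    return text


if __name__ == "__main__":
    sample = (
        "The quick brown\nfox jumps over\nthe lazy dog.\n\n"
        "A second paragraph\nstarts here."
    )
    print(remove_fake_breaks(sample))
```

As in Word, the order matters: the double breaks have to be hidden behind the placeholder before the single breaks are removed, otherwise every paragraph would be merged into one.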

Thanks Anthony. Definitely a great tip.

While it uses slightly different syntax, for those using OpenOffice or LibreOffice, install the "alt search replace" plugin to do the same thing.