Digital Conversion OCR & Text Encoding

Digitizing the pages of a book makes them available online, but on its own this only lets users look at each page image. They cannot search the text or jump to particular pages or features without going through the book page by page.

OCR, or Optical Character Recognition, is a process that recognizes alphanumeric characters in printed page images and converts them to a machine-readable text file. OCR is what makes it possible to search through the text of a digitized book.

We currently use a commercial OCR package called PrimeOCR, which performs character-by-character recognition using six different OCR programs and employs voting technology along with artificial intelligence algorithms to achieve better accuracy than a conventional single-program OCR could.
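
The general idea of voting across several OCR programs can be illustrated with a minimal Python sketch. This is not PrimeOCR's actual algorithm, which is proprietary and not described here. The function name `vote_characters` and the sample readings are hypothetical, and the sketch assumes the readings are already aligned and of equal length, which real OCR output generally is not.

```python
from collections import Counter


def vote_characters(readings):
    """Combine equal-length OCR readings by per-position majority vote.

    Hypothetical helper for illustration only; assumes the readings
    have already been aligned character by character.
    """
    if not readings:
        return ""
    length = len(readings[0])
    if any(len(r) != length for r in readings):
        raise ValueError("this sketch assumes pre-aligned, equal-length readings")

    result = []
    for chars in zip(*readings):
        # Pick the character reported by the most programs at this position.
        # On a tie, the character seen first among the tied ones wins.
        winner, _count = Counter(chars).most_common(1)[0]
        result.append(winner)
    return "".join(result)


# Six made-up readings of the same word, each with at most one error.
readings = [
    "Optical",
    "0ptical",
    "Optical",
    "Optica1",
    "Optical",
    "Opt1cal",
]
print(vote_characters(readings))  # prints "Optical"
```

Because each program tends to make different mistakes, an error made by one program is outvoted by the others, which is why a combined result can be more accurate than any single program.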

The effectiveness of OCR depends on the quality of the originals and of the scanned images. OCR is not effective on hand-printed or handwritten manuscripts, items with text running in multiple directions, or certain languages and fonts, such as cursive scripts. See our list of language sets for PrimeOCR.

Text encoding goes a step beyond OCR and merges searchable text with functional metadata to create XML markup. This XML markup enables the user to locate specific pages like the “Index” and “Table of Contents” as well as individual page numbers within each object.

Text encoding at U-M Library is achieved manually: staff identify pages and add "tags" to a database for each item, and the tags are then merged with the OCR-generated text.
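
As a rough illustration of merging page tags with OCR text, the following Python sketch builds a small XML document and then looks up a page by feature. The element and attribute names (`object`, `page`, `seq`, `n`, `feature`), the function names, and the sample data are all assumptions made for this example. They do not represent the actual schema or database used at U-M Library.

```python
import xml.etree.ElementTree as ET


def encode_pages(ocr_pages, page_tags):
    """Merge per-page OCR text with manually assigned tags into XML.

    ocr_pages: list of OCR text strings, one per scanned page, in order.
    page_tags: dict mapping page sequence number (1-based) to a dict with
               optional "number" and "feature" keys (hypothetical layout).
    """
    root = ET.Element("object")
    for seq, text in enumerate(ocr_pages, start=1):
        tags = page_tags.get(seq, {})
        page = ET.SubElement(root, "page", seq=str(seq))
        if "number" in tags:
            page.set("n", tags["number"])  # printed page number
        if "feature" in tags:
            page.set("feature", tags["feature"])  # e.g. "Index"
        page.text = text
    return root


def find_feature(root, feature):
    """Return the sequence numbers of pages tagged with the given feature."""
    matches = root.findall(f"./page[@feature='{feature}']")
    return [int(p.get("seq")) for p in matches]


# Made-up sample data for a three-page item.
ocr_pages = ["Table of Contents ...", "Chapter 1 ...", "Index ..."]
page_tags = {
    1: {"number": "iii", "feature": "Table of Contents"},
    2: {"number": "1"},
    3: {"number": "2", "feature": "Index"},
}

root = encode_pages(ocr_pages, page_tags)
print(find_feature(root, "Index"))  # prints [3]
print(ET.tostring(root, encoding="unicode"))
```

Once the tags travel with the text, a viewer can jump straight to the index or to a printed page number instead of stepping through every image.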

Page maintained by Lara Unger
Last modified: 08/18/2014