Digital Conversion Services: Optical Character Recognition

Overview of Services

OCR, or Optical Character Recognition, is a process that recognizes characters from an image file of a printed page and converts them to a machine-readable text file. We currently use a commercial OCR package called PrimeOCR, which performs character-by-character rendering using six different OCR programs and employs voting technology along with artificial intelligence algorithms to achieve better accuracy than conventional OCR. The output of the OCR process can be used as is (uncorrected or "raw") or edited.
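The voting idea can be illustrated with a minimal Python sketch. This is not PrimeOCR's actual algorithm, which is proprietary and not described here; it is a plain majority vote, and it assumes the outputs of the engines are already aligned character by character. The function name `vote_characters` is hypothetical.

```python
from collections import Counter


def vote_characters(engine_outputs):
    """Return the majority character at each position across engine outputs.

    Assumption: every engine produced a string of the same length, so
    position i in one output corresponds to position i in all others.
    """
    if not engine_outputs:
        return ""
    length = len(engine_outputs[0])
    if any(len(text) != length for text in engine_outputs):
        raise ValueError("this sketch assumes the outputs are already aligned")
    voted = []
    for candidates in zip(*engine_outputs):
        # most_common(1) returns [(character, count)]; on a tie, the
        # character seen first (the earliest engine in the list) wins.
        voted.append(Counter(candidates).most_common(1)[0][0])
    return "".join(voted)
```

For example, if three of four engines read "hello" and one reads "he1lo", the vote at the third position favors "l". A real voting system would also need to align outputs of different lengths and could weight engines by confidence.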

We can provide OCR services in conjunction with our scanning service, or you can supply your own scanned digital image files as input to the process. Output options vary depending on whether DLPS will be hosting your content. Our OCR services do not include proofing or editing.

Clients served
University of Michigan; External non-profit organizations

Suitable Materials
Modern (19th century and beyond) typefaces captured in high-quality page images.

Limitations

1. Languages

Our OCR software will recognize text in the following languages:

  • Danish
  • Dutch
  • French
  • German
  • Italian
  • Norwegian
  • Portuguese
  • Spanish
  • Swedish
  • U.K. English
  • U.S. English

Our software can recognize text in languages other than those listed above, but it will not accurately render all accented characters. Materials can be processed in only one language, so bilingual texts and foreign-language quotations will be processed and rendered as if written in the primary processing language.

2. Character sets

Materials must be printed in the Latin-1 character set; our software will also produce limited results for Latin-2. For a description of the Latin-1, Latin-2, and other ISO 8859 character sets, please see:

http://www.w3.org/TR/html401/sgml/entities.html

http://www.w3.org/International/O-charset-lang.html
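If you already have a transcription or sample of the text, a quick way to see which character set it falls into is to try encoding it. The following is a minimal Python sketch using the standard `latin_1` (ISO 8859-1) and `iso8859_2` (ISO 8859-2) codecs; the function name `charset_support` and its return labels are hypothetical, and the result says nothing about how well the OCR software will handle a given typeface.

```python
def charset_support(text):
    """Classify text as 'latin-1', 'latin-2', or 'unsupported'.

    'latin-1' means every character exists in ISO 8859-1;
    'latin-2' means the text does not fit Latin-1 but fits ISO 8859-2.
    """
    for label, codec in (("latin-1", "latin_1"), ("latin-2", "iso8859_2")):
        try:
            text.encode(codec)  # raises if any character is outside the set
            return label
        except UnicodeEncodeError:
            continue
    return "unsupported"
```

For instance, "café" falls within Latin-1, "Łódź" requires Latin-2, and Greek text fits neither set.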

3. Font size

Type sizes within the following ranges, which depend on the scanning resolution, will be accurately recognized:

Scan resolution    Type size
200 dpi            12-40 points
300 dpi            8-35 points
400 dpi            6-30 points
600 dpi            3-20 points
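The ranges above can be turned into a simple lookup when planning a scan. This is a minimal Python sketch; the names `POINT_RANGES` and `type_size_supported` are hypothetical, and only the four resolutions listed in the table are covered.

```python
# Recognized type-size ranges in points, keyed by scan resolution in dpi.
# Values are copied from the table above.
POINT_RANGES = {
    200: (12, 40),
    300: (8, 35),
    400: (6, 30),
    600: (3, 20),
}


def type_size_supported(dpi, points):
    """Return True if type of the given point size falls in the listed range."""
    if dpi not in POINT_RANGES:
        # The table gives no range for other resolutions, so refuse to guess.
        raise ValueError(f"no type-size range is listed for {dpi} dpi")
    low, high = POINT_RANGES[dpi]
    return low <= points <= high
```

For example, 8-point footnotes are within range at 300 dpi but fall below the 12-point minimum at 200 dpi, which suggests scanning small type at a higher resolution.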

4. Image size

For best results, images should not be more than 5600 pixels in either dimension.

5. Resolution

Image resolutions of 200, 240, 300, 400, and 600 dpi (dots per inch) are supported. The horizontal and vertical resolutions must be the same. For digital library applications, 600 dpi is recommended.
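The image size and resolution limits can be checked before files are submitted. The following is a minimal Python sketch that works on values you have already read from the image (for example, from your scanning software); the names `SUPPORTED_DPI`, `MAX_PIXELS`, and `check_image` are hypothetical.

```python
SUPPORTED_DPI = {200, 240, 300, 400, 600}  # resolutions listed above
MAX_PIXELS = 5600                          # recommended limit per dimension


def check_image(width_px, height_px, x_dpi, y_dpi):
    """Return a list of problems; an empty list means the image passes."""
    problems = []
    if x_dpi != y_dpi:
        problems.append("horizontal and vertical resolution differ")
    for dpi in sorted({x_dpi, y_dpi}):
        if dpi not in SUPPORTED_DPI:
            problems.append(f"{dpi} dpi is not a supported resolution")
    if max(width_px, height_px) > MAX_PIXELS:
        problems.append(f"image exceeds {MAX_PIXELS} pixels in one dimension")
    return problems
```

Note that a page's pixel dimensions are its size in inches multiplied by the resolution, so an 8.5 x 11 inch page scanned at 600 dpi is 5100 x 6600 pixels and exceeds the recommended limit, while the same page at 400 dpi (3400 x 4400 pixels) does not.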

6. Format and Compression

Preferred format for image files: bitonal TIFF, Version 5 and above, either uncompressed or with CCITT Fax 4 compression. Although PrimeOCR can handle other formats, our license does not cover them. See the Client Responsibilities section for additional information on how data is to be delivered to DLPS.
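To confirm that a file is bitonal and uses one of the preferred compression schemes, the relevant tags can be read from the TIFF header. This is a minimal Python sketch using only the standard library. It relies on tag numbers from the TIFF 6.0 specification (258 BitsPerSample, 259 Compression, 277 SamplesPerPixel; Compression value 1 is uncompressed and 4 is CCITT Group 4), inspects only the first image in the file, and does not attempt to verify the TIFF version. The function names are hypothetical.

```python
import struct


def read_tiff_tags(data):
    """Return {tag: value} for single-value SHORT/LONG tags in the first IFD."""
    if data[:2] == b"II":
        endian = "<"   # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"   # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    tags = {}
    for i in range(entry_count):
        pos = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, field_type, count = struct.unpack_from(endian + "HHI", data, pos)
        if count != 1:
            continue  # multi-value fields are not needed for this check
        if field_type == 3:    # SHORT: value stored in the entry itself
            (tags[tag],) = struct.unpack_from(endian + "H", data, pos + 8)
        elif field_type == 4:  # LONG: value stored in the entry itself
            (tags[tag],) = struct.unpack_from(endian + "I", data, pos + 8)
    return tags


def is_preferred_tiff(data):
    """True if the image is 1-bit, single-sample, uncompressed or Group 4."""
    tags = read_tiff_tags(data)
    bits_per_sample = tags.get(258, 1)    # TIFF default is 1
    samples_per_pixel = tags.get(277, 1)  # TIFF default is 1
    compression = tags.get(259, 1)        # TIFF default is 1 (uncompressed)
    return (bits_per_sample == 1 and samples_per_pixel == 1
            and compression in (1, 4))
```

To check a file on disk, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to `is_preferred_tiff`.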

7. Image Quality

The quality of the scanned image affects the quality of the OCR: if the type is either too light or too dark, recognition accuracy may suffer. If you are concerned about image quality, please contact us to discuss.

Contact:
dlps-digitization@umich.edu

If you prefer to contact us by phone, please consult our Staff Listing for current staff and phone numbers.
