DCS Home | Overview | Project Initiation | Production Methods | Client Responsibilities | Workflow | Deliverables & Costs
OCR, or Optical Character Recognition, is a process that recognizes characters from an image file of a printed page and converts them to a machine-readable text file. We currently use a commercial OCR package called PrimeOCR, which performs character-by-character rendering using six different OCR programs and employs voting technology along with artificial intelligence algorithms to achieve better accuracy than conventional OCR. The output of the OCR process can be used as is (uncorrected or "raw") or edited.
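To illustrate the general idea behind voting, here is a minimal sketch of character-level majority voting across several engine outputs. It is not PrimeOCR's actual algorithm, which is proprietary and not described here. The function name `vote_characters`, the assumption that all readings are already aligned to the same length, and the tie-breaking rule are ours, for illustration only.

```python
from collections import Counter


def vote_characters(readings):
    """Pick the most common character at each position across engine outputs.

    `readings` is a list of equal-length strings, one per (hypothetical) engine.
    Ties go to the earliest engine in the list (an arbitrary choice).
    """
    if not readings:
        return ""
    length = len(readings[0])
    if any(len(r) != length for r in readings):
        raise ValueError("all readings must have the same length")

    result = []
    for chars in zip(*readings):
        counts = Counter(chars)
        best = max(counts.values())
        # The first engine whose character reaches the top count wins.
        result.append(next(c for c in chars if counts[c] == best))
    return "".join(result)


# Six hypothetical engine readings of the same word, each with at most one error.
print(vote_characters(["c1ear", "clear", "clear", "cleor", "clear", "clesr"]))
```

A single engine's misreading is outvoted as long as most engines agree at that position, which is why combining engines can be more accurate than any one of them alone.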
We can provide OCR services in conjunction with our scanning service, or you can supply us with your scanned digital image files as input to the process. There are a variety of output options depending on whether or not DLPS will be hosting your content. Our OCR services do not include proofing or editing.
Clients served
University of Michigan; External non-profit organizations
Suitable Materials
Modern (19th century and beyond) typefaces captured in high-quality page images.
Limitations:
1. Languages
Our OCR software will recognize text in the following languages:
Our software is able to recognize languages other than those listed above, but will not accurately render all accented characters. Each item can be processed in only one language, so bilingual texts and foreign-language quotations will be processed and rendered as if written in the primary processing language.
2. Character sets
Materials must be printed in the Latin-1 character set; our software will also produce limited results for Latin-2. For a description of the Latin-1, Latin-2, and other ISO 8859 character sets, please see:
http://www.w3.org/TR/html401/sgml/entities.html
http://www.w3.org/International/O-charset-lang.html
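If you have a sample of the text in electronic form (a transcribed title page, for example), a quick way to see whether it stays within Latin-1 is to try encoding it as ISO 8859-1. The sketch below uses only Python's built-in codec; the function name `non_latin1_characters` is ours, and the check says nothing about how well the OCR software will read the page images themselves.

```python
def non_latin1_characters(text):
    """Return the distinct characters in `text` that ISO 8859-1 cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


print(non_latin1_characters("café"))   # French sample: every character is in Latin-1
print(non_latin1_characters("Łódź"))   # Polish sample: some characters fall outside Latin-1
```

An empty list means every character in the sample is in Latin-1; any characters returned are ones that fall outside it.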
3. Font size
Font or type sizes within the following ranges will be accurately recognized:
200 dpi 12-40 points
300 dpi 8-35 points
400 dpi 6-30 points
600 dpi 3-20 points
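The ranges above can be expressed as a small lookup table, which may be handy when choosing a scanning resolution for a known type size. This is only a sketch: the names `SUPPORTED_POINT_SIZES`, `point_size_supported`, and `usable_resolutions` are ours, and resolutions that do not appear in the table are rejected rather than guessed at.

```python
# Supported type-size range (in points) for each listed scanning resolution.
SUPPORTED_POINT_SIZES = {
    200: (12, 40),
    300: (8, 35),
    400: (6, 30),
    600: (3, 20),
}


def point_size_supported(dpi, points):
    """Return True if `points` falls within the listed range for `dpi`."""
    if dpi not in SUPPORTED_POINT_SIZES:
        raise ValueError(f"no type-size range is listed for {dpi} dpi")
    low, high = SUPPORTED_POINT_SIZES[dpi]
    return low <= points <= high


def usable_resolutions(points):
    """Return the listed resolutions whose range includes `points`, lowest first."""
    return [
        dpi
        for dpi, (low, high) in sorted(SUPPORTED_POINT_SIZES.items())
        if low <= points <= high
    ]


print(usable_resolutions(7))   # small footnote type
print(usable_resolutions(36))  # large display type
```

Note that the ranges shift downward as resolution increases, so very large type can fall outside the range at 600 dpi while very small type falls outside it at 200 dpi.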
4. Image size
For best results, images should be no more than 5600 pixels in either dimension.
5. Resolution
Image resolutions of 200, 240, 300, 400, and 600 dpi (dots per inch) are supported. The horizontal and vertical resolution must be the same. For digital library applications, 600 dpi is recommended.
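The image size and resolution limits in items 4 and 5 can be checked before files are submitted. The sketch below takes the pixel dimensions and resolution as plain numbers, so it does not depend on any particular imaging library; the function name `check_scan_geometry` and the wording of the messages are ours.

```python
SUPPORTED_DPI = {200, 240, 300, 400, 600}
MAX_PIXELS = 5600  # recommended upper bound for width and for height


def check_scan_geometry(width_px, height_px, x_dpi, y_dpi):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if width_px > MAX_PIXELS or height_px > MAX_PIXELS:
        problems.append(
            f"image is {width_px} x {height_px} pixels; "
            f"{MAX_PIXELS} or fewer in each dimension is recommended"
        )
    if x_dpi != y_dpi:
        problems.append(
            f"horizontal ({x_dpi}) and vertical ({y_dpi}) resolution differ"
        )
    elif x_dpi not in SUPPORTED_DPI:
        problems.append(f"{x_dpi} dpi is not a supported resolution")
    return problems


# An 8.5 x 11 inch page at 300 dpi and at 600 dpi.
print(check_scan_geometry(2550, 3300, 300, 300))
print(check_scan_geometry(5100, 6600, 600, 600))
```

As the second call shows, a letter-size page (8.5 x 11 inches) scanned at 600 dpi is 5100 x 6600 pixels, so its height exceeds the 5600-pixel guideline; it is worth checking page dimensions when planning scans of larger originals at that resolution.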
6. Format and Compression
Preferred format for image files: bitonal TIFF, version 5 or later, either uncompressed or with CCITT Group 4 fax compression. Although PrimeOCR can handle other formats, our license does not cover them. See the Client Responsibilities section for additional information on how data is to be delivered to DLPS.
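In terms of the standard TIFF header fields, the preferred format corresponds to one bit per sample, one sample per pixel, and a Compression tag value of 1 (none) or 4 (CCITT Group 4). The sketch below checks those three values once they have been read from a file by whatever tool you use; the function name `is_preferred_tiff` is ours, and the check does not cover the TIFF version requirement.

```python
COMPRESSION_NONE = 1      # TIFF Compression tag (259): no compression
COMPRESSION_CCITT_G4 = 4  # TIFF Compression tag (259): CCITT T.6, Group 4 fax


def is_preferred_tiff(bits_per_sample, samples_per_pixel, compression):
    """Return True for a bitonal image that is uncompressed or Group 4 compressed.

    Arguments are the values of the BitsPerSample (258), SamplesPerPixel (277),
    and Compression (259) tags.
    """
    bitonal = bits_per_sample == 1 and samples_per_pixel == 1
    return bitonal and compression in (COMPRESSION_NONE, COMPRESSION_CCITT_G4)


print(is_preferred_tiff(1, 1, COMPRESSION_CCITT_G4))  # bitonal, Group 4
print(is_preferred_tiff(8, 1, COMPRESSION_NONE))      # 8-bit grayscale
```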
7. Image Quality
The quality of the scanned image affects the quality of the OCR: if the type is too light or too dark, recognition accuracy may suffer. If you are concerned about image quality, please contact us to discuss.
Contact:
dlps-digitization@umich.edu
If you prefer to contact us by phone, please consult our Staff Listing for current staff and phone numbers.
