APPENDIX B – Technical standards
Image Capture: Methods, formats, image quality and quality control
DLPS's Image Services will undertake the digitization of the various photographic formats and manuscript pages of field notes selected for inclusion in the project. They will employ appropriate conversion methods from among the following options. Current hardware is noted, though all DLPS hardware is on a three-year replacement schedule and may be upgraded during the term of the grant.
- grayscale imaging using a flatbed scanner: suitable for newspaper clippings and manuscript pages that are not visually distinctive enough to merit color scanning. These items will be done with a high-speed grayscale scanner with limited tonal range (Fujitsu M3096GX);
- continuous tone imaging using a flatbed scanner: suitable for flat materials such as photographs or manuscripts that are distinctive in part for their visual attributes, such as color, and which require high quality conversion methods (Agfa DuoScan);
- continuous tone imaging using a digital camera: suitable for materials that cannot be captured on a flatbed scanner and that require high quality color conversion methods (Kontron ProgRes 3012);
- continuous tone color imaging using 4x5 film as an intermediate form for scanning with a transparency scanner: suitable for exceptionally small or exceptionally large objects, manuscripts, and photographs that are otherwise out of range for the digital camera or flatbed scanner (Imacon Flextight Precision II);
- continuous tone color imaging using a 35mm slide scanner (Polaroid Sprintscan 35 Plus).
Image Services creates digital images at the highest resolution appropriate to the material and to the available conversion method. Conversion methods include digital photography (Kontron), flatbed scanning of originals, and flatbed scanning of film intermediaries. Master images are stored as TIFF files and written to gold CD-ROMs using ISO 9660 specifications. Kodak color and grayscale targets are included in scans of original materials. Scanner settings are stored on CD-ROM with the images. Image Services controls and monitors quality with a variety of methods including hardware color calibration, color imaging targets, image file format integrity, and visual inspection. Methods and the extent to which they are applied vary by project based on appropriateness.
Manuscript and photographic images will be stored on and served from a Sun Ultra 450 with dual 300 MHz processors, using the Solaris 7 operating system. DLPS Image Services uses MrSID from LizardTech (wavelet compression) for server storage and JPEG/JFIF format and compression for delivery to the user. Image processing is done with ImageMagick. The digital images will be available in a range of resolutions. Typical viewing sizes include thumbnail (184 x 113 pixels), small (368 x 226 pixels), medium (737 x 452 pixels), and large (1475 x 905 pixels); panning and zooming tools allow a high level of investigation and study. In addition, users will have access to other existing Image Services functionalities, such as comparative search and display. Images will be integrated into the unified project Web presence, where they can be searched and displayed in combination with textual materials.
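The four viewing sizes above share the same aspect ratio, so each derivative is simply the master scaled to fit a bounding box. A minimal sketch of that fit arithmetic follows; the actual resampling in the project is done with MrSID and ImageMagick, so this only illustrates the size computation.

```python
# Bounding boxes for the four delivery sizes named in the text (width, height).
SIZES = {
    "thumbnail": (184, 113),
    "small": (368, 226),
    "medium": (737, 452),
    "large": (1475, 905),
}

def fit_within(master_w, master_h, box_w, box_h):
    """Scale (master_w, master_h) to fit inside (box_w, box_h), keeping aspect ratio."""
    scale = min(box_w / master_w, box_h / master_h)
    return max(1, round(master_w * scale)), max(1, round(master_h * scale))

def derivative_sizes(master_w, master_h):
    """Compute the pixel dimensions of every delivery derivative for one master."""
    return {name: fit_within(master_w, master_h, w, h)
            for name, (w, h) in SIZES.items()}
```

For example, a 2950 x 1810 master yields a 184 x 113 thumbnail and a 1475 x 905 large view.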
Text Scanning and OCR
The monographs selected for digitization in the proposed project will undergo a simple and automatic conversion process, following procedures laid down for the University Library's Making of America IV project. Project staff in DLPS will prepare materials, making every reasonable effort to ensure each volume is complete before conversion. Project staff will note the pagination structure and will highlight special features such as title pages, tables of contents, and indexes. Staff will enter this basic structural metadata into a database, a copy of which will be sent to the scanner operator with the volumes to be converted. Project staff will use a flatbed raster scanner to scan all pages as bitonal (one-bit) images (one page per image file) at a resolution of 600 dpi. In some cases, an automatic document handler can be used for the page imaging. Scanning specifications developed by DLPS and the Library's Preservation Division for the Making of America IV and other large-scale digital reformatting projects will be applied to this project as well.
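The structural metadata captured per volume can be pictured as a small record like the following sketch; the field names and identifier scheme are illustrative, not the project's actual database schema.

```python
from dataclasses import dataclass, field

@dataclass
class VolumeRecord:
    """Basic structural metadata noted before a volume goes to the scanner operator."""
    volume_id: str
    pagination: list            # pagination structure, e.g. ["i-xii", "1-348"]
    features: dict = field(default_factory=dict)  # feature name -> page sequence numbers

# A hypothetical entry for one monograph:
record = VolumeRecord(
    volume_id="vol-0001",
    pagination=["i-xii", "1-348"],
    features={
        "title_page": [3],
        "table_of_contents": [7, 8],
        "index": [355, 356, 357],
    },
)
```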
Quality Control. Project staff will inspect 5% of the converted images to ensure completeness, legibility, and placement of the images. This level of random sampling is already used in the University Library conversion efforts, and is a statistically valid method that ensures a high level of compliance with our standards. When all images inspected on the random sampling basis pass quality control requirements, the entire batch of files is accepted. Should the sampling technique indicate even one unacceptable image, 100% of the images in that batch will be reviewed and sub-standard images replaced. A second sampling and quality control inspection will follow; if all images in the sample inspected meet quality control guidelines, the batch will be accepted. If not, further re-scanning and re-inspection will be done until the staff are satisfied with the quality of the images captured. Staff will then write the images to CD-ROM and pass them on for post-image capture processing. Experience has shown that the single largest quality issue is skew of the digital image.
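The acceptance logic above can be sketched as a small routine: inspect a random 5% of the batch; if every sampled image passes, accept the whole batch, otherwise inspect 100% and flag sub-standard images for re-scan. The inspect() callback is a hypothetical stand-in for the staff member's visual check.

```python
import math
import random

def qc_batch(images, inspect, sample_rate=0.05, seed=None):
    """Two-stage acceptance sampling for a batch of page images.

    Returns (accepted, images_needing_rescan)."""
    rng = random.Random(seed)
    n = max(1, math.ceil(len(images) * sample_rate))   # at least one image
    sample = rng.sample(images, n)
    if all(inspect(img) for img in sample):
        return True, []                 # clean sample: accept the entire batch
    # Any failure in the sample triggers 100% inspection of the batch.
    failures = [img for img in images if not inspect(img)]
    return False, failures
```

The returned failure list would drive the re-scan step; the second sampling pass described in the text is then just another call on the corrected batch.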
Post-capture processing. DLPS staff will process the page image files to generate OCR and simple SGML to enable search and navigation. SGML headers for the monographs are created from the MARC records for those volumes. This method is used in University Library text conversions and has proven reliable and cost-effective.
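Deriving headers from existing MARC records might look like the following sketch. The tag names and the handful of MARC fields used here are illustrative only; the project's actual DTD and full MARC mapping are not specified in this appendix.

```python
def sgml_header(marc):
    """Build a minimal SGML-style header from a dict of MARC fields.

    Uses MARC 245 (title), 100 (main entry/author), and 260 (imprint)
    as an illustrative subset; missing fields yield empty elements."""
    def esc(s):
        # Escape markup-significant characters for SGML/XML content.
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    return (
        "<HEADER>"
        f"<TITLE>{esc(marc.get('245', ''))}</TITLE>"
        f"<AUTHOR>{esc(marc.get('100', ''))}</AUTHOR>"
        f"<IMPRINT>{esc(marc.get('260', ''))}</IMPRINT>"
        "</HEADER>"
    )
```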
Text delivery system. TIFF page images and the simple SGML files will be put online in the DLPS production environment. In order to deliver the TIFF page images to the end user, they are transformed on-the-fly to GIF images, using Tif2Gif software.
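Conceptually, the on-the-fly delivery step is a lazy conversion with a small cache, so popular pages are not reconverted on every request. In the sketch below, convert() stands in for the Tif2Gif step (whose actual interface is not documented here) and load_tiff() for retrieval of the master page image.

```python
import functools

def make_page_server(convert, load_tiff, cache_size=128):
    """Return a function mapping a page id to GIF bytes, converting lazily.

    convert:   hypothetical stand-in for the Tif2Gif TIFF -> GIF step
    load_tiff: hypothetical loader returning TIFF master bytes for a page id"""
    @functools.lru_cache(maxsize=cache_size)
    def serve(page_id):
        # Master TIFF stays untouched; only the derivative is produced on demand.
        return convert(load_tiff(page_id))
    return serve
```

Caching the derivative rather than storing it permanently keeps the master image authoritative, consistent with the preservation strategy described later in this appendix.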
Descriptive metadata for published and manuscript-like materials being digitized will include TEI headers. All metadata imported into the access system will be coded in SGML with a project-specific DTD. Where standard vocabularies are used, they will be tagged with identifiers. All dates will be converted to ISO standard format, and location coordinates will be tagged in a fashion that permits export into standard GIS utilities. Selected metadata from records will be mapped to Z39.50 attributes to permit Z39.50-compliant searching.
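Converting free-form dates to the ISO standard form (ISO 8601, YYYY-MM-DD) can be sketched as a pattern-matching pass like the one below; the set of input patterns handled is illustrative, since the appendix does not enumerate the date forms found in the field notes.

```python
from datetime import datetime

# Illustrative input patterns; real field notes may need a larger set.
_PATTERNS = ["%B %d, %Y", "%d %B %Y", "%Y-%m-%d", "%m/%d/%Y"]

def to_iso_date(raw):
    """Normalize a date string to ISO 8601 YYYY-MM-DD, or raise ValueError."""
    for fmt in _PATTERNS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")
```

Dates that match no pattern would be flagged for manual review rather than guessed at.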
Plans for submitting collection-level descriptive records to bibliographic networks
Most of the contents of the project collections are not bibliographic in nature; for those that are, records will be contributed either directly into OCLC or via OCLC's CORC project. Records will also be created and shared for the Web interfaces created by the project.
Plans for Preservation and Maintenance of Digital Files
The University Library has established a range of strategies, guidelines, practices, and policies that define and support its initiatives and programs aimed at conversion to digital format. For text delivery systems, the image storage and access strategy for converted text privileges the master image by storing it online in the access system. The access system itself depends upon the presence of the master image file to deliver information to the user. The master image always resides in the most current technologies and moves forward through technologies in the University's dynamic computing environment. This avoids problems of earlier digital efforts, where the critical version of the page image resided in an off-line medium requiring continual refreshing. The page images are stored in redundant arrays of independent disks (RAID) at level 5. The system is mirrored in a geographically separate area of campus.
In nearly all cases, DLPS maintains multiple copies of digital masters in a variety of parallel environments, a practice that we believe further contributes to the long-term viability of the digital resource. The data stored using RAID technology is written to digital linear tape (DLT) at least monthly; in the case of SGML/XML data and bitonal image files, the data on tape are indistinguishable from the master. All data, whether SGML/XML data or images, are written to CD-ROM at the point of final quality control acceptance. SGML/XML text files and bitonal image files are stored redundantly, with the production version in a staffed, secure, climate-controlled machine room, and a secondary version on a development server in a secure, climate-controlled (but not staffed) machine room. Although we are not yet in a position to store masters of continuous tone (photographic and manuscript) image files on production or development servers, three identical copies of all derivatives from continuous tone image files are stored by DLPS (again, a production version, a secondary version in a second machine room, and a copy on tape). Masters of continuous tone image TIFF files are stored on gold CD-ROM in ISO 9660 format.

In addition to text and image files on production servers, other files, including system software, are copied to digital linear tape on a frequent basis, and the development version of the system is backed up every week. To ensure that accidental or unauthorized changes or replacements do not occur, a high level of security and a rigorous permissions system is in place, allowing only designated staff members to alter the files. Moreover, once files are passed from production staff to technical staff, they are never passed back; to make further changes, the production staff must give the technical staff a new file that supersedes the previous one. Storing the data in a variety of media will help ensure future access.
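Confirming that a secondary copy is "indistinguishable from the master" is, in practice, a fixity check: compare checksums of the copy against the master, file by file. A minimal sketch follows; the directory layout and choice of SHA-256 are illustrative, not the project's stated procedure.

```python
import hashlib
from pathlib import Path

def checksum(path):
    """SHA-256 digest of a file, read in chunks to handle large TIFFs."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_copy(master_dir, copy_dir):
    """Return relative paths whose copies are missing or differ from the master."""
    master_dir, copy_dir = Path(master_dir), Path(copy_dir)
    bad = []
    for master in master_dir.rglob("*"):
        if not master.is_file():
            continue
        rel = master.relative_to(master_dir)
        copy = copy_dir / rel
        if not copy.is_file() or checksum(copy) != checksum(master):
            bad.append(str(rel))
    return sorted(bad)
```

An empty result means the copy matches the master bit for bit; anything listed would be recopied from the production version.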
Also, by using industry standards for imaging and SGML/XML encoding for text, we ensure that we are able to more easily migrate data into future formats and systems.