Eighteenth Century Collections Online

PRODUCTION

Why Do ECCO Texts Need To Be Keyboarded?
The page images that make up ECCO Online are displayed in a form legible to human users, but computers themselves cannot always see the words on ECCO's pages as clearly. As a result, the search tools that users have come to associate with word processing and computer-based indices cannot sort the information on ECCO pages.

In many cases, Optical Character Recognition (OCR) software can transform image files into text files. When a text's lettering is not in a recognizable font, however, or when a page is obscured by ink smudges or wormholes, OCR software does not always render the image accurately into a usable, searchable text. Given the many variations in seventeenth- and eighteenth-century typefaces, keyboarding by a person trained to identify the features of these texts actually proves the most cost-effective way to make the ECCO texts completely readable and searchable in the ways most useful to modern users.
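
As a rough illustration of why recognition errors matter for searching, the Python sketch below compares a keyword lookup against a line as a person might key it and against the same line as OCR might render it. The sample strings and the `contains_word` helper are invented for this illustration. They assume the long s of period typefaces has been misread as "f", a commonly reported OCR error, and are not taken from ECCO data.

```python
# Hypothetical sample: one line as keyed by a person, and the same line as an
# OCR engine might render it after misreading the long s as "f".
KEYED = "the blessings of peace and safety"
OCR = "the bleffings of peace and fafety"


def contains_word(text, word):
    """Return True if `word` appears as a whole word in `text` (case-insensitive)."""
    return word.lower() in text.lower().split()


for word in ("blessings", "safety", "peace"):
    # Words containing a misread letter are found in the keyed line only.
    print(word, contains_word(KEYED, word), contains_word(OCR, word))
```

Words that escape misrecognition are still found in both versions, which is one way average accuracy can look acceptable while individual searches fail.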

ECCO images will be accessible through fully searchable bibliographic records and through OCR text running behind the page images. (Most OCR-based products do not show the text they are searching, because the prevalence of errors is thought to be distracting to users.) The reliability of the OCR for searching purposes remains an open question because of the nature of the images being used. Gale is doing its best to optimize OCR accuracy and reports positive results. Nonetheless, accurate OCR on materials of this nature is not easily accomplished, and even when the average results are good, accuracy is bound to be uneven (some very good sections or texts and some very poor ones). It is recognized that OCR provides cost-effective access to a large corpus like ECCO, and the library community stands ready to support Gale's efforts. That being said, there is benefit in creating an accurately keyboarded and SGML/XML-tagged subset of ECCO texts that allows more certain searching and other features that fully support research and instructional uses of this important corpus.

Why SGML Encoding?
SGML encoding marks the structure and parts of a text, which enables easy and sophisticated searching. While simply being able to pick out keywords from an eighteenth-century text is a remarkable step forward, SGML encoding allows users to focus their queries more specifically. The tags added during the encoding process can, for example, permit users to look for the occurrence of a word only in the marginal notes of ECCO texts, for non-English terms as they appear in stage directions, or for proper names appearing in epigraphs.
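
The kinds of query described above can be sketched in a few lines of Python using the standard library's `xml.etree.ElementTree`. The sample document, the element and attribute names (`NOTE PLACE="marg"`, `STAGE`, `FOREIGN`, `EPIGRAPH`, `NAME`) and the `find_in` helper are assumptions made for illustration; they are TEI-like but are not presented as the project's actual tag set.

```python
import xml.etree.ElementTree as ET

# Hypothetical, TEI-like sample; the tag names are assumptions for illustration.
SAMPLE = """
<TEXT>
  <EPIGRAPH><P>So spake <NAME>Milton</NAME> of liberty.</P></EPIGRAPH>
  <P>Liberty is the subject of this chapter.
     <NOTE PLACE="marg">See Locke on liberty.</NOTE></P>
  <STAGE>Exit, <FOREIGN LANG="lat">solus</FOREIGN>.</STAGE>
</TEXT>
"""


def find_in(root, path, word):
    """Return the text of each element matched by `path` that contains `word`."""
    hits = []
    for elem in root.iterfind(path):
        # Join all nested text and normalize whitespace before matching.
        text = " ".join("".join(elem.itertext()).split())
        if word.lower() in text.lower():
            hits.append(text)
    return hits


root = ET.fromstring(SAMPLE)

# A word only where it occurs in marginal notes.
marginal = find_in(root, ".//NOTE[@PLACE='marg']", "liberty")
# Non-English terms inside stage directions.
foreign_in_stage = [e.text for e in root.iterfind(".//STAGE/FOREIGN")]
# Proper names inside epigraphs.
names_in_epigraphs = [e.text for e in root.iterfind(".//EPIGRAPH//NAME")]

print(marginal, foreign_in_stage, names_in_epigraphs)
```

A plain keyword search for "liberty" would also match the epigraph and the body paragraph; the tags are what let the query be restricted to the marginal note alone.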

Document Type Definition (DTD)
A DTD, or document type definition, provides a guide to the various tags that may be used in encoding an XML/SGML text, showing when and how these tags may be used. While DTDs generally follow a standard form, they can be modified to fit the demands of an individual project.
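
To give a sense of what "when and how these tags may be used" means in practice, the Python sketch below mimics one kind of rule a DTD can express: which elements may appear inside which others. The `ALLOWED_CHILDREN` table and the `violations` function are hypothetical and greatly simplified. A real DTD is written in its own declaration syntax, also governs ordering and attributes, and is checked by a validating parser rather than by code like this.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: the child elements each element may contain.
ALLOWED_CHILDREN = {
    "TEXT": {"HEAD", "P", "EPIGRAPH"},
    "EPIGRAPH": {"P"},
    "P": {"NOTE"},
    "HEAD": set(),
    "NOTE": set(),
}


def violations(elem):
    """Recursively list undeclared elements and misplaced child elements."""
    problems = []
    allowed = ALLOWED_CHILDREN.get(elem.tag)
    if allowed is None:
        problems.append(f"<{elem.tag}> is not declared")
    for child in elem:
        declared = child.tag in ALLOWED_CHILDREN
        if allowed is not None and declared and child.tag not in allowed:
            problems.append(f"<{child.tag}> is not allowed inside <{elem.tag}>")
        problems.extend(violations(child))
    return problems


good = ET.fromstring("<TEXT><HEAD>Title</HEAD><P>Body <NOTE>n</NOTE></P></TEXT>")
bad = ET.fromstring("<TEXT><NOTE>stray</NOTE><P><FOO/></P></TEXT>")

print(violations(good))
print(violations(bad))
```

Modifying a DTD for a particular project corresponds, in this simplified picture, to adding or removing entries in the table.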

Because ECCO contains so many different types of text, and because the corpus contains so many page images, the Text Creation Partnership DTD Working Group determined that the DTD should reflect a low and fairly generic level of tagging. This practice, the Group decided, would move texts through the keyboarding process more quickly while also allowing for additional tagging in the future.