PRODUCTION
Why Do ECCO Texts Need To Be Keyboarded?
The page images that make up ECCO Online are displayed in a form legible to human users, but computers themselves
cannot always see the words on ECCO's pages as clearly. As a result, the search tools that users have come to
associate with word processing and computer-based indices cannot sort the information on ECCO pages.
In many cases, Optical Character Recognition (OCR) software can transform image files into text files; when a text's
lettering is not in a recognizable font, however, or when a page is otherwise obscured with ink smudges or wormholes, OCR
software does not always accurately render its image into a usable, searchable text. In order to make the ECCO texts
completely readable and searchable in ways that will be most effective for modern users, and given the many variations in
seventeenth- and eighteenth-century typefaces, keyboarding done by a person trained to identify the features of these texts
actually proves most cost-effective.
ECCO images will be accessible through fully searchable bibliographic records and OCR running behind the page images
(most OCR-based products do not show the text they are searching because the prevalence of errors is thought to be
distracting to users). The reliability of the OCR for searching purposes remains an open question because of the nature
of the images being used. Gale is doing its best to optimize OCR accuracy and reports positive results. Nonetheless,
accurate OCR on materials of this nature is not easily accomplished, and even when the average results are good, they
are bound to be uneven (some sections or texts very good, others very poor). It is recognized that OCR provides
cost-effective access to a large corpus like ECCO, and the library community stands ready to support efforts at Gale.
That being said, there is benefit in creating an accurately keyboarded and SGML/XML-tagged subset of ECCO texts that
allows more certain searching and other features that fully support research and instructional uses of this important
corpus.
Why SGML Encoding?
SGML encoding marks the structure and parts of a text, which enables easy and sophisticated searching. While simply
being able to pick out keywords from an eighteenth-century text is a remarkable step forward, SGML encoding allows users
to focus their queries more specifically. The tags added during the encoding process can, for example, permit users
to look for the occurrence of a word only in the marginal notes of ECCO texts, for non-English terms as they appear
in stage directions, or for proper names appearing in epigraphs.
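
As a rough illustration of this kind of tag-restricted search, the following Python sketch uses only the standard library's xml.etree.ElementTree module. The sample text, the element names (NOTE, STAGE, EPIGRAPH), the PLACE="marg" attribute, and the helper find_in_elements are assumptions made for this example; they are loosely modeled on TEI-style tagging and are not taken from the actual ECCO DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoded fragment; tag and attribute names are illustrative only.
SAMPLE = """
<TEXT>
  <EPIGRAPH><P>As Virgil wrote of Rome.</P></EPIGRAPH>
  <P>The city of Rome was visited in the spring.
    <NOTE PLACE="marg">See the account of Rome in Livy.</NOTE>
  </P>
  <STAGE>Enter the Ambassador from Rome.</STAGE>
</TEXT>
"""


def find_in_elements(root, tag, word, attrs=None):
    """Return the text of each <tag> element (with matching attrs) containing word.

    Matching is a case-insensitive substring test, which is cruder than a
    real search engine's word matching but enough to show the idea.
    """
    attrs = attrs or {}
    hits = []
    for elem in root.iter(tag):
        # Skip elements whose attributes do not match the requested values.
        if any(elem.get(key) != value for key, value in attrs.items()):
            continue
        # Collect all text inside the element and normalize whitespace.
        text = " ".join("".join(elem.itertext()).split())
        if word.lower() in text.lower():
            hits.append(text)
    return hits


root = ET.fromstring(SAMPLE)

# A plain keyword search sees every occurrence, wherever it appears.
full_text = " ".join("".join(root.itertext()).split())
keyword_count = full_text.count("Rome")

# A tag-aware search can be limited to marginal notes alone.
marginal_hits = find_in_elements(root, "NOTE", "Rome", {"PLACE": "marg"})
```

In this sketch the plain keyword search counts every occurrence of the word in the fragment, while the tag-aware search returns only the marginal note. The same helper could be pointed at STAGE or EPIGRAPH elements to narrow a query in the other ways described above.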
Document Type Definition (DTD)
A DTD, or document type definition, provides a guide to the various tags that may be used in encoding an XML/SGML
text, showing when and how these tags may be used. While DTDs generally follow a standard form, they can be modified
to fit the demands of an individual project.
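
To make the idea concrete, here is a minimal Python sketch of the kind of rule a DTD expresses: which elements may appear inside which. The ALLOWED_CHILDREN table, the element names, and the check_structure helper are hypothetical and greatly simplified. A real DTD also constrains element order, repetition, and attributes, and is written in its own declaration syntax rather than in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a DTD: each element name maps to
# the set of child elements it is allowed to contain.
ALLOWED_CHILDREN = {
    "TEXT": {"DIV"},
    "DIV": {"HEAD", "P", "DIV"},
    "HEAD": set(),
    "P": {"NOTE"},
    "NOTE": set(),
}


def check_structure(elem, path=""):
    """Return a list of messages for elements the table does not permit."""
    here = f"{path}/{elem.tag}"
    if elem.tag not in ALLOWED_CHILDREN:
        return [f"{here}: unknown element"]
    problems = []
    for child in elem:
        # A known element in the wrong place is reported against its parent.
        if child.tag in ALLOWED_CHILDREN and child.tag not in ALLOWED_CHILDREN[elem.tag]:
            problems.append(f"{here}: <{child.tag}> not allowed here")
        # Recurse so that nested problems (and unknown elements) are found too.
        problems.extend(check_structure(child, here))
    return problems


valid = ET.fromstring(
    "<TEXT><DIV><HEAD>Chap. I</HEAD><P>Text <NOTE>a note</NOTE></P></DIV></TEXT>"
)
invalid = ET.fromstring("<TEXT><DIV><HEAD><P>misplaced</P></HEAD></DIV></TEXT>")

valid_problems = check_structure(valid)
invalid_problems = check_structure(invalid)
```

Under these assumed rules the first fragment passes with no messages, while the second is flagged because a P element appears inside a HEAD. Modifying a DTD for a particular project corresponds, in this toy model, to editing the entries of the table.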
Because ECCO contains so many different types of text, and because the corpus contains so many page images, the
Text Creation Partnership DTD Working Group determined that the DTD should reflect a low and fairly generic level of tagging.
This practice, the Group decided, would move texts through the keyboarding process more quickly while also allowing for
additional tagging in the future.