ECCO-TCP

The University of Michigan, the University of Oxford, and Gale (part of Cengage Learning) have cooperated in a Text Creation Partnership to make freely available 2,231 accurately keyed and fully searchable SGML/XML text editions from among the 150,000 titles available in the Eighteenth Century Collections Online (ECCO) database. ECCO is an important research database that includes every significant English-language and foreign-language title printed in the United Kingdom during the 18th century, along with thousands of important works from the Americas. ECCO contains more than 32 million pages of text and over 205,000 individual volumes, all fully searchable. ECCO is published by Gale, part of Cengage Learning.

Gale has been a generous partner. According to Maria Bonn, Associate University Librarian for Publishing, "Gale's support for the TCP's ECCO project will enhance the research experience for 18th century scholars and students around the world."

Laura Mandell, Professor of English and Digital Humanities at Miami University of Ohio, says, "The 2,231 ECCO texts that have been typed by the Text Creation Partnership, from Pope's Essay on Man to a 'Discourse addressed to an Infidel Mathematician,' are gems." Mandell, director of 18thConnect, says that the TCP is "a groundbreaking partnership that is creating the highest quality 18th century scholarship in digital form."

Because there are no longer any restrictions on how the ECCO-TCP texts may be used and shared, many users have already begun to make this data and metadata available in various forms and formats around the Web:

We hope that at least one of these options will meet your research needs, but we also welcome your questions, suggestions, and requests for alternatives. (Or, we hope, you'll build something that works for you and let us know about it!)

Goals and Strategies

The University of Michigan, in cooperation with Thomson-Gale and with the financial support of libraries worldwide, is creating accurately keyboarded and SGML/XML-encoded text editions for a significant portion of the Eighteenth Century Collections Online (ECCO) corpus. Known as the Eighteenth Century Collections Online Text Creation Partnership (ECCO-TCP), this cooperative academic initiative is producing legible and searchable encoded texts that link to corresponding page images from Gale's ECCO product. For students and scholars, this allows immediate search access to the content of thousands of historically significant works, while retaining the cultural context of the original print representation of the material. The ECCO Text Creation Partnership offers a number of important benefits to the library community:

  • Entrusts conversion of important but difficult works to the university community, supporting appropriate scholarly review and intervention;
  • Draws upon community expertise to develop the scope and standards underlying such projects;
  • Carries forward the work in a cost-effective manner by distributing the costs across many academic institutions, as well as encouraging substantial contributions from commercial partners;
  • Ensures that Partner libraries co-own the resulting text file with robust rights to manage, re-use, and distribute the file as they see fit, including the right to distribute texts beyond their own campus or community of authenticated users to other partner institutions.

Creating the Text file:

We are often asked whether it would be possible to make ECCO texts searchable through optical character recognition. Our belief is that OCR would not produce an acceptable or cost-effective result. Keyboarding and tagging also provide the following benefits, which are particularly well suited to early texts in a large corpus like ECCO:

  • Because the text is accurate, it can be displayed (unlike in most OCR-based projects). It therefore provides a legible reading copy of ECCO texts whose early fonts and printing can make the originals difficult for novice readers to decipher.
  • Word and phrase searching is more accurate, and results are displayed in the context of the surrounding text to help readers sort through a large number of hits.
  • Tagging allows for more precise searching such as limiting searches to titles, headings, notes, stage directions, captions, acts, verses, etc.
  • Tagging also gives any text a browsable structure, analogous to a table of contents, by producing a hierarchy of titles, sections, chapter headings and subheadings.
  • Because the accurately keyboarded texts can be displayed, the reader can access a list of all words in the corpus (or in a designated work) that can serve as an index, a concordance, or a check on variant spellings and word forms.
  • Standard tagging of the texts allows the corpus to be combined with other corpora tagged to the same standard, hence allowing the reader to search across multiple collections.
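
As a rough illustration of how tagging supports the more precise searching described above, the following Python sketch searches a small encoded sample either everywhere or within one kind of element only. The sample text and the element names (TEXT, DIV1, HEAD, P, NOTE) are assumptions chosen to resemble TEI-style markup; they are not taken from the actual ECCO-TCP files, and the search function is hypothetical rather than part of any TCP tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style sample; element names are illustrative only.
SAMPLE = """<TEXT>
  <DIV1 TYPE="chapter">
    <HEAD>Of Liberty</HEAD>
    <P>The love of liberty is natural to mankind.</P>
    <NOTE>See also the chapter on liberty of the press.</NOTE>
  </DIV1>
  <DIV1 TYPE="chapter">
    <HEAD>Of Commerce</HEAD>
    <P>Commerce thrives where liberty is secure.</P>
  </DIV1>
</TEXT>"""


def search(root, word, tag=None):
    """Return (element name, text) pairs whose text contains `word`.

    If `tag` is given, only elements with that name are searched;
    otherwise every element in the document is searched.
    """
    needle = word.lower()
    hits = []
    for elem in root.iter(tag):  # tag=None walks all elements
        text = (elem.text or "").strip()
        if text and needle in text.lower():
            hits.append((elem.tag, text))
    return hits


root = ET.fromstring(SAMPLE)
all_hits = search(root, "liberty")               # whole document
head_hits = search(root, "liberty", tag="HEAD")  # headings only

for name, text in head_hits:
    print(f"{name}: {text}")
```

An unrestricted search finds the word in headings, paragraphs, and notes alike, while the restricted search returns only the heading. Plain untagged text could not support that distinction.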

ECCO Content

The ECCO collection consists of the works represented in the English Short Title Catalogue between 1701 and 1800. The corpus contains a wide range of materials, including not only books and broadsides but also Bibles, tract books, sermons, and printed ephemera by many well-known and lesser-known authors. The 150,000 works of the ECCO corpus capture the essence of the Enlightenment in Great Britain, and are essential for understanding the context of the French, Industrial, and American Revolutions.

Licensing and Access

The ECCO-TCP project is no longer actively seeking new partners. Restrictions on access and use of these files have been lifted, and they are now available to the general public through several different channels (described above).

Benefits for Scholarly Researchers:

Word and phrase searching of the ECCO corpus provides a new research dimension never before available through print, microfilm or digital page facsimiles. Scholars are now able to pinpoint references to subjects, people or places that would not be indicated in a brief bibliographic citation. The search interface also allows scholars to uncover word patterns and other literary or linguistic forms across texts. Whether a user is seeking contemporaneous references to people or events, tracing citations to authors like Edmund Burke, or quickly finding known quotes, these thousands of searchable texts open up an array of research possibilities that was unthinkable when the texts were only accessible by author, title and broad subject.

Benefits for Teachers and Students:

The ease of access that ECCO and the ECCO-TCP offer to texts once confined to rare originals and microfilm will make the corpus a significant part of classroom teaching on a number of campuses. As these ECCO works become searchable, they will become even easier to use in the classroom. Students will readily find references to the French Revolution, or remedies for common diseases, benefiting from clearly legible text with instant access to original illustrations and typefaces.

Production Information

Why Do ECCO Texts Need To Be Keyboarded?

The page images that make up the ECCO archive are pictures, not text: the human eye interprets them as containing letters and words, but computers see them simply as pictures. As a result, the search tools that users have come to associate with word processing and computer-based indices cannot retrieve the textual information on ECCO pages.

Optical Character Recognition (OCR) software is usually the most efficient way to turn images into machine-readable text. And indeed, the ECCO images are already accessible on the Gale site via both fully searchable bibliographic records and 'hidden' text files generated by OCR from the page images. OCR is the only practical way to create searchable indices to very large corpora of page images. In the worst case, if even half the words on a page are legible to the OCR, that is still a great many usefully searchable words. Gale does much better than that: they have done their best to optimize OCR accuracy, and report positive results in many cases. Unfortunately, OCR stumbles over many 18th-century books (especially over the ligatured tall s) and can produce very inconsistent results, depending on the typography, layout, and quality of the original. Eighteenth-century typography and orthography can be highly variable, often quite unfamiliar to software built for modern books, and not infrequently ambiguous; the page layout can be challenging; and the image quality often reveals the ravages wrought by time on books that may not have been all that well printed to start with. Pages are cropped, folded, and torn; letters are underinked, overinked, or missing altogether. In such circumstances, OCR may be the only practical option, but it is rarely the best one. While it is conceivable that OCR software could eventually be modified to "read" ECCO texts with uniform success, the great variety in early modern typefaces makes this an unrealistic option for the present. Manual keying, performed by staff trained to identify the features of early modern texts, usually proves more cost-effective.

Why not 'plain text'?

Accurate transcription of letters and words is only half the story, because it represents only half the information in the text (if that). Making sense of the text requires that at least its most salient structural features be recorded as well: the text's discrete parts (paragraphs, lines, headings, lists and tables, notes and quotes, etc.) need to be distinguished, and the relationships between them indicated. The print original expressed this information implicitly by variations in typeface and layout, sometimes conventionally and sometimes idiosyncratically; sometimes manifestly, sometimes subtly.

TCP uses explicit embedded 'markup' instead, a fixed and limited vocabulary of markers that describe the pieces of the text and relate them to each other. Converting from the implicit language of physical presentation to explicit markup is analogous to the conversion of image to machine-readable text. Computers are much better at reading explicit markup than they are at divining the meaning in subtle and ambiguous changes in font or layout. TCP prefers when possible to record the structural role of a given piece of text ("this is a chapter heading"); when that is not feasible, it is willing to fall back on describing its physical appearance ("this appears in the third column of this table" or "this is in a typeface different from the text around it").

The precise format in which this information is recorded is less important than its transparent and unambiguous character: once the information has been captured, it can be converted into any markup language that may come along. TCP was born in the SGML era, and its native markup language is SGML, in particular an SGML DTD ('document type definition') based on TEI version P3; but it has moved with the times and routinely converts all of its texts into XML as well, and back and forth between SGML and XML as need dictates.

The TCP Markup Scheme

A markup scheme dictates the kind of features that can be distinguished, and some of the possible (and impossible) relationships between them. It may declare (for example) that list items can only appear within a list, or that a poetic stanza must contain at least one line of verse. Creating a scheme for the TCP texts (the same scheme is used for ECCO as for Evans and EEBO) was a particular challenge, since the scheme needed to be clear enough to be applied by external data-conversion firms working to a compact set of instructions; simple enough to be reviewed quickly in thousands of books by a distributed set of editors; and flexible enough to be applicable to almost any conceivable kind of book produced by a long and extremely inventive period of publishing history. The decisions made at the outset by the TCP DTD Working Group determined that the TCP markup should be relatively sparse, its tag set relatively generic, and its practice constrained closely by the principle that (to the extent practicality permitted) those features should be marked up that would allow for intelligible display of the text, efficient navigation through the text, and the most widely useful kinds of searches within the text. Only such a user-centered policy, the Group decided, would move texts through the keyboarding process quickly while also allowing for optional and incremental enhancement of the tagging in the future.
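
To make the idea of a scheme's structural rules concrete, here is a minimal Python sketch that checks the two example constraints mentioned above: list items may appear only within a list, and a stanza must contain at least one line of verse. In practice an SGML or XML validator enforces such rules from the DTD itself. The check_structure function is hypothetical, and the element names (LIST, ITEM, LG, L) are assumed TEI-style names used for illustration, not a statement of what the TCP DTD actually contains.

```python
import xml.etree.ElementTree as ET


def check_structure(root):
    """Return a list of rule violations found in the tree under `root`.

    Rules checked (illustrative only):
      1. an ITEM must be a direct child of a LIST
      2. an LG (stanza) must have at least one L (verse line) child
    """
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    problems = []
    for elem in root.iter():  # document order
        if elem.tag == "ITEM":
            parent = parent_of.get(elem)
            if parent is None or parent.tag != "LIST":
                problems.append("ITEM outside LIST")
        if elem.tag == "LG" and elem.find("L") is None:
            problems.append("LG without L")
    return problems


# Hypothetical samples: one that follows both rules, one that breaks both.
good = ET.fromstring(
    "<BODY><LIST><ITEM>one</ITEM></LIST><LG><L>A line of verse</L></LG></BODY>"
)
bad = ET.fromstring("<BODY><P><ITEM>stray</ITEM></P><LG></LG></BODY>")

good_problems = check_structure(good)
bad_problems = check_structure(bad)

print(good_problems)
print(bad_problems)
```

A sparse, generic tag set keeps the number of such rules small, which is part of what lets keyers apply the scheme from a compact set of instructions and editors review it quickly.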