The Once and Future Text Encoding Model

[Image: a map image, its corresponding XML, and web search results]

Transcribed data behind the scenes makes maps searchable in the Michigan County Atlases in a way OCR would not.

Lately I’ve been looking back through the past of the Digital Library Production Service (DLPS) -- in fact, all the way back to the time before DLPS, when we were the Humanities Text Initiative -- to see what, if anything, we’ve learned that will help us as we move forward into a world of Hydra, ArchivesSpace, and collaborative development of repository and digital resource creation tools. One of the overarching questions for us now is that of a content model for encoded text, and how much flexibility is available within the model: can we create encoding consistency, what is the minimal required structure and metadata, and how do we provide more features when more of either (or both) is available?

Twenty years ago, when we first started delivering full-text resources like the American Verse Project, the King James Bible, and the Oxford English Dictionary via the web, we weren’t thinking about abstract content models. We built a separate CGI script for each collection, based on the markup used for that collection. With only a few collections, this was a reasonable approach. As more encoded content became available and we added more collections, including licensed content we hosted ourselves, like Chadwyck-Healey’s English Poetry Database, and locally-created collections like the Corpus of Middle English Prose and Verse, it became clear that we were reimplementing the same search and display functions over and over again. While we used the Text Encoding Initiative (TEI) tagset for our locally-created texts, other collections used slightly different markup, and our programmers needed to keep writing new code to implement the same search and display functions depending on whether a poem was encoded as a <POEM> or a <SONNET> or a <DIV TYPE="poem">.

By 1999, we had enough material and enough experience to realize that if similar materials were encoded similarly, our programmers could spend their time implementing new functionality, not chasing minor nomenclature differences throughout the code. In addition, if all of the poetry collections were consistently encoded, we could search them all at once: users would not need to know whether a poem was in our own American Verse Project or in Chadwyck-Healey’s American Poetry or Twentieth Century American Poetry collections -- they could search all the poetry together. And so our first content model for text, the Grand Unified Markup Scheme (GUMS), was born! As it matured, it outgrew the silly name it had been given and became the more dignified Text Class we know as part of DLXS today.

When we received a new text collection, I looked over its Document Type Definition (DTD) and documentation (if any) and mapped its tags into the Text Class DTD, an “optimized for delivery” variant of the TEI DTD that we used to encode our texts. This worked well: because the TEI relies on generic divisions with a type attribute describing the material, a specific element named “poem” in another tagset could be readily converted to a division element with a “poem” type. As mass digitization became our focus with the advent of the Making of America (MoA), having all encoded texts in one class meant a user could search all our nineteenth-century collections together, regardless of whether they were the more structured collections, with separate DIVs for each chapter or poem or play in a book, or, like MoA, encoded with one DIV for all the text in the entire book with the OCR for each page poured in.
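That mapping step can be sketched in a few lines. This is a hypothetical illustration rather than actual DLPS code: the element names and the `TAG_MAP` table are assumptions standing in for whatever a given collection’s DTD really used.

```python
# Hypothetical sketch of mapping a collection-specific tagset into
# generic typed divisions, in the spirit of the Text Class conversion.
# The tag names below are illustrative assumptions, not an actual DTD.
import xml.etree.ElementTree as ET

# Collection-specific element name -> TYPE value on a generic DIV.
TAG_MAP = {
    "POEM": "poem",
    "SONNET": "poem",   # a sonnet is still searched as a poem
    "PLAY": "play",
}

def to_text_class(elem):
    """Recursively rewrite mapped elements as <DIV TYPE="...">."""
    for child in elem:
        to_text_class(child)
    if elem.tag in TAG_MAP:
        elem.set("TYPE", TAG_MAP[elem.tag])
        elem.tag = "DIV"
    return elem

source = ET.fromstring(
    "<BODY><SONNET><L>Shall I compare thee to a summer day?</L></SONNET></BODY>"
)
print(ET.tostring(to_text_class(source), encoding="unicode"))
# <BODY><DIV TYPE="poem"><L>Shall I compare thee to a summer day?</L></DIV></BODY>
```

Once every collection passes through a table like this, the search and display code only ever sees `<DIV TYPE="...">`, which is the whole point of a unified scheme.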

This was a fairly novel approach, but we were in a novel position -- by that point we were, after all, the Digital Library Production Service. We were not looking to build websites for one-off projects. We were also out in front in having so much material online from so many disparate sources, including material we did not create and markup we could not control.

Things have changed since then, and the question arises whether what we did with Text Class is still a relevant approach today. HathiTrust uses a variation of the approach, deriving a HathiTrust Metadata Encoding and Transmission Standard (METS) file optimized for display from the archived source METS files provided by Google or other digitization partners. This cannot be considered an independent confirmation of the concept, though, as it is likely influenced by HathiTrust’s history as an outgrowth of MBooks, which was more or less a next-generation DLXS Text Class implementation. The archival community seems to be addressing variation in Encoded Archival Description (EAD) tagging practice through the creation of tools like Archivists’ Toolkit, Archon, and now ArchivesSpace, which standardize encoding in a way that best practice guidelines cannot. This is not a realistic strategy for texts, which can be encoded in a number of different standard schemas depending on the material, the community, and even the intended delivery device.

We are still well away from implementing Hydra for delivery of encoded text, or even ingesting texts with encoded “logical” structures like poems into HathiTrust, but I suspect that a grand unified markup scheme could benefit us in either of these workflows. I will suggest we go even further than we did in rationalizing content encoding; it might be the best use of everyone’s time to explore the mapping of content to a standard interchange encoding, like ePub3. This would allow us to integrate results for digitized print books, using ePub with print pagination markers, with born-digital materials, using ePub with flowing text. We would be sure to keep the source files in the repository for preservation and migration to whatever should come next, because one thing we have definitely learned is that we are never done working on a digital collection.
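As a concrete sketch of what “ePub with print pagination markers” could look like: EPUB 3’s structural semantics vocabulary defines a pagebreak term, so the flowing XHTML of a digitized print book can record where each print page began. The page numbers and ids below are illustrative assumptions, not drawn from any of our collections.

```xml
<!-- Illustrative XHTML fragment for an EPUB 3 content document.
     The epub:type="pagebreak" markers record where print pages began,
     so a search result can still cite a print page inside flowing text. -->
<section xmlns:epub="http://www.idpf.org/2007/ops">
  <p>...the end of the text printed on page 23.</p>
  <span epub:type="pagebreak" role="doc-pagebreak" id="page24" aria-label="24"/>
  <p>The text that began on print page 24...</p>
</section>
```

A born-digital book would simply omit the markers, and the same search and display code could serve both kinds of material.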
