One project of HathiTrust’s U.S. Federal Government Documents Program is a metadata registry of every work published by the Federal government. I am its principal data wrangler and mangler. Measuring the completeness of the corpus, and therefore our success or failure, is difficult. One way to measure it is to evaluate a smaller, known subset: are we missing records for agencies that should be represented in the corpus, or are there suspicious gaps in the publication history? Before we can answer that question, we have to be able to reliably identify authors.
Ideally, we would have a complete list of Federal authors (departments, agencies, offices, employees) and a map of their relationships. The ability to reliably identify authors and their relationships would allow us to properly classify suspected Federal documents, match duplicate bibliographic records, and inform decisions regarding collection development.
Coping with MARC Authority Records
We have attempted to address parts of these challenges using MARC authority records from the Library of Congress. “MARC authority records contain the standardized names, or ‘authorized headings’ for people, corporate bodies (societies, businesses, institutions, etc.), meetings, titles, and subjects” as used in MARC bibliographic records. The Library of Congress Name Authority File is the result of years of work by expert catalogers with an obsessive attention to detail. While comprehensiveness may not be possible, the content of the Authority File is impressive.
For the uninitiated, MARC stands for MAchine-Readable Cataloging. Arguably, MARC records are neither machine readable nor even standard, and they are “cataloging” only insofar as a record describes what to print on a 3”x5” card. While many fields resemble entity-relationship data, they are more akin to an arcane templating system. MARC bibliographic records describe the appearance of a cataloging record and only incidentally the underlying work. Likewise, MARC authority records describe the textual format of names and subjects, not the person, entity, or topic that bears that name. Forgetting this point leads to misuse and abuse of MARC metadata, and a bad time for the data programmer.
I frequently have a bad time. After much trial and error, I was able to extract 8.2 million records from the Name Authority File into a queryable database. Of the extracted records, 6.6 million records are Personal Name Headings (field 100), e.g. “Kilroy, Bill”, and 1.6 million records are Corporate Name Headings (field 110), e.g. “United States Coast Guard”.
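As a rough illustration of what “a queryable database” can mean here, the following is a minimal sketch, not the actual extraction code. It assumes the headings have already been parsed out of the authority file into simple tuples. The record identifiers, table layout, and function names are all hypothetical; only the field tags (100, 110) and the sample headings come from this post.

```python
import sqlite3

# Hypothetical, simplified rows: (record_id, field_tag, heading_text).
# A real pipeline would first parse these out of the MARC authority file.
SAMPLE_HEADINGS = [
    ("rec1", "100", "Kilroy, Bill"),
    ("rec2", "110", "United States. Coast Guard"),
    ("rec3", "110", "United States. 127th General Hospital"),
]

# Field 100 holds personal name headings, field 110 corporate name headings.
HEADING_TYPES = {"100": "personal", "110": "corporate"}


def build_database(rows):
    """Load heading rows into an in-memory SQLite table and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE headings ("
        "record_id TEXT PRIMARY KEY, tag TEXT, heading_type TEXT, heading TEXT)"
    )
    conn.executemany(
        "INSERT INTO headings VALUES (?, ?, ?, ?)",
        [(rid, tag, HEADING_TYPES.get(tag, "other"), text) for rid, tag, text in rows],
    )
    conn.commit()
    return conn


def count_by_type(conn):
    """Return {heading_type: count} for every heading type in the table."""
    cursor = conn.execute(
        "SELECT heading_type, COUNT(*) FROM headings GROUP BY heading_type"
    )
    return dict(cursor.fetchall())


conn = build_database(SAMPLE_HEADINGS)
print(count_by_type(conn))
```

Run against the full extract, the same kind of query would yield the 6.6 million and 1.6 million counts reported above.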
In theory, MARC authority records contain “see also” tracing fields for related authorized headings (field 510). These fields should provide connections to an author’s alternate names, parent organization, or employer using relationship subfields. In practice, this information is usually omitted or left ambiguous.
Explicit relationships in our database of 8.2 million records:
- Successor: 126,448 records
- Predecessor: 122,122 records
- Hierarchical Superior: 81,609 records
- Employee: 134 records
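Counts like these can be produced by classifying each 510 tracing by its relationship subfields. The sketch below is an assumption about how that might be done, not the method actually used for the figures above. It relies on my reading of the MARC 21 authority format, in which the first character of subfield w encodes the relationship (a for earlier heading, b for later heading, t for immediate parent body, r for a designation spelled out in subfield i). The sample tracings are invented placeholders.

```python
from collections import Counter

# Assumed mapping from the first character of subfield w in a 510 tracing
# to the relationship labels used in the list above.
W_CODES = {
    "a": "predecessor",            # earlier heading
    "b": "successor",              # later heading
    "t": "hierarchical superior",  # immediate parent body
}

# Hypothetical tracings, each a dict of subfield code -> value.
SAMPLE_TRACINGS = [
    {"w": "a", "a": "Example Bureau (earlier name)"},
    {"w": "b", "a": "Example Bureau (later name)"},
    {"w": "t", "a": "Example Department"},
    {"w": "r", "i": "Employer:", "a": "Example Agency"},
    {"a": "Example Office"},  # no subfield w: relationship left ambiguous
]


def classify_tracing(subfields):
    """Return a relationship label for one tracing, or 'unspecified'."""
    w_code = subfields.get("w", "")[:1]
    if w_code in W_CODES:
        return W_CODES[w_code]
    # Code r defers to a free-text designation in subfield i.
    if w_code == "r" and subfields.get("i"):
        return subfields["i"].strip().rstrip(":").lower()
    return "unspecified"


tally = Counter(classify_tracing(t) for t in SAMPLE_TRACINGS)
print(tally)
```

The “unspecified” bucket is the interesting one: it is where the omitted and ambiguous relationships end up.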
For example, the aforementioned record for Bill Kilroy omits his employment with the Forest Service, information that would help in assembling a complete collection of the Forest Service’s publications. The record for the United States Army Special Epidemiologic Team lacks tracings to the authorized headings for the United States Army or its immediate parent, the United States Army Special Forces.
Computing Implicit Relationships From Authorized Headings
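The lack of explicit relationships forced me to get a little creative with (that is, misuse and abuse) implicit parent/child relationships. Typically, corporate headings are split into subfields in order of descending hierarchical level. The heading “United States. Coast Guard. Airborne Radiation Thermometer Program”, for example, runs from the government down to the agency and then to the program.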
Unfortunately, there are frequent exceptions to this, marked by indicators for “direct order”, “indirect order”, and “jurisdiction”. You can find an explanation (but no excuse) for these in the AACR2.
Even when a hierarchical field is given, the authorized heading, such as “United States. Coast Guard”, usually obscures the intermediary organizations. Using just the authorized heading, more than 7,500 corporate names are represented as first-level organizations in the United States government. Again, the heading is designed to describe a name to go into a bibliographic record display, not the entity.
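The implicit parent/child idea can be sketched in a few lines. This assumes a heading has already been split into a tuple of subfield values in descending hierarchical order, and it ignores the direct order, indirect order, and jurisdiction exceptions entirely. The function names are mine.

```python
def implied_parent(subfields):
    """Return the heading one level up, or None for a single-level heading."""
    if len(subfields) < 2:
        return None
    return subfields[:-1]


def is_first_level(subfields, jurisdiction="United States"):
    """True if the heading sits directly under the jurisdiction."""
    return len(subfields) == 2 and subfields[0] == jurisdiction


def format_heading(subfields):
    """Join subfield values back into display form."""
    return ". ".join(subfields)


program = ("United States", "Coast Guard", "Airborne Radiation Thermometer Program")
parent = implied_parent(program)

print(format_heading(parent))
print(is_first_level(parent))
```

Here the implied parent of the program is “United States. Coast Guard”, which in turn counts as a first-level organization. That is exactly the flattening described above: the heading carries no trace of whatever sits between the government and the agency.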
Computing Implicit Relationships from Unauthorized Headings
After working with MARC bibliographic records for several years, it should not have surprised me that most of the authoritative data found in MARC authority records can be found in the unauthorized headings (field 410). When the 510 tracings fail, this is where we can usually find the more complete and useful name.
The “United States. 127th General Hospital” lacks explicit relationships in 510 tracing fields, and the authorized heading omits bureaucratic superiors. Fortunately, there are two unauthorized headings in 410 fields and one of them tells us where to place the 127th General Hospital within the hierarchy even if it lacks the precision we would like.
a| United States. b| Army. b| 127th General Hospital
a| United States. b| General Hospital, 127th
Unauthorized headings rarely give us the exact nature of their relationship with the authorized heading (previous name, more complete name, acronym), but we weren’t getting that from the 510s anyway.
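One way to put the 410 headings to work is to look for a variant that begins and ends like the authorized heading but has more levels in between. The following is a minimal sketch under that assumption. The parser handles only the `a| ... b| ...` notation used in this post, and the function names are hypothetical.

```python
import re


def parse_heading(text):
    """Split 'a| X. b| Y' notation into a tuple of subfield values."""
    parts = re.split(r"\s*[a-z]\|\s*", text)
    return tuple(p.strip().rstrip(".") for p in parts if p.strip())


def fuller_paths(authorized, variants):
    """Return variants that share the first and last levels but add levels between."""
    return [
        v
        for v in variants
        if len(v) > len(authorized)
        and v[0] == authorized[0]
        and v[-1] == authorized[-1]
    ]


authorized = parse_heading("a| United States. b| 127th General Hospital")
variants = [
    parse_heading("a| United States. b| Army. b| 127th General Hospital"),
    parse_heading("a| United States. b| General Hospital, 127th"),
]

for path in fuller_paths(authorized, variants):
    # Everything but the last level is the inferred chain of superiors.
    print(". ".join(path[:-1]))
```

For the 127th General Hospital this picks out the first variant and infers “United States. Army” as the parent. The second variant is only an inverted form of the name, so it adds nothing to the hierarchy.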
Fuzzy MARC Is a Total Bear: Dates
Data regarding founding/birth, dissolution/death, and merger dates are very rare in MARC authority records. Of the 8.2 million records, just 240,000 (about 3%) have “start periods” and 50,000 (under 1%) have “end periods”. That scarcity is a real problem: authority records aren’t very authoritative if we can’t be certain which person or entity we are naming.
Even when a record is more complete, like the record for the United States Coast Guard, ambiguity is inevitable. Five of the United States Coast Guard’s hierarchical superiors can be found in “510” tracing fields. If it weren’t for a text note in a “670” field, we would not know that the Coast Guard currently falls under the Department of Homeland Security. This has consequences for establishing relationships amongst the Coast Guard’s subordinate organizations.
The record with an authorized heading of “United States. Coast Guard. Airborne Radiation Thermometer Program” is clearly a subunit of the Coast Guard. However, if we wanted every Department of Homeland Security subunit or publication, would we include the Airborne Radiation Thermometer Program? Probably not, because a cursory search of WorldCat does not find publications for the Airborne Radiation Thermometer Program or its immediate parent, the Oceanographic Unit, later than 1983, twenty years before the creation of the Department of Homeland Security.
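If the dates were available, the decision could be reduced to an interval check: a subunit belongs under a superior only if their periods of activity overlap. The sketch below assumes periods are (start year, end year) pairs with None for an unknown or open end. The only figures used are the 1983 date above and the 2003 creation date it implies (1983 plus twenty years).

```python
import math


def periods_overlap(period_a, period_b):
    """True if two (start, end) year ranges share at least one year.

    None stands for an unknown or open-ended bound.
    """
    start_a, end_a = period_a
    start_b, end_b = period_b
    latest_start = max(
        -math.inf if start_a is None else start_a,
        -math.inf if start_b is None else start_b,
    )
    earliest_end = min(
        math.inf if end_a is None else end_a,
        math.inf if end_b is None else end_b,
    )
    return latest_start <= earliest_end


# Last known publication year of the program, per the WorldCat search above.
program_activity = (None, 1983)
# Period during which Homeland Security could be a superior (2003 is inferred).
homeland_security = (2003, None)

print(periods_overlap(program_activity, homeland_security))
```

The check says the two periods never overlap, which matches the “probably not” above. The catch is that it needs dates, and as noted, very few authority records have them.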
Fuzzy MARC Is a Total Bear: Relationships
Corporate relationships are complex and can’t always fit within the limited vocabulary of “superior”, “successor”, or “predecessor”. Usually authority records are missing data, but occasionally they try to be a little too helpful and end up confusing matters.
Howard University’s relationship with the Federal Government is too long and complex to be captured by a record claiming the Department of the Interior and Federal Security Agency are hierarchical superiors. Likewise, the Michigan Crop Reporting Service is described as subordinate to not only the State of Michigan’s Department of Agriculture, but also the United States Department of Agriculture. This is apparently based upon a single joint publication. These relationships are important and useful to take note of, but authority records cannot adequately describe them.
Any time I work with bibliographic metadata, or in this case authority metadata, my bar for success is relatively low. Despite my problems with authorities, we were ultimately able to produce a list of 50,000 United States government authors. This list leverages the domain expertise of countless catalogers to find most of the nooks and crannies of the Federal government. There are some good and bad surprises. I never would have identified the “IFPRI/ISRA Project on Consumption and Supply Impacts of Agricultural Price Policies in Senegal” as being within the scope of the Registry, but its authority record reveals it to be a United States Agency for International Development project. On the other hand, I have to deal with the Howard University and Michigan Crop Reporting Service records.
More than 90% of the Federal Documents Registry’s bibliographic records contain a matching authorized heading. That qualifies as a success considering the state of the bibliographic records. We can now produce reasonably accurate collections of agency publications, allowing us to construct publication histories and evaluations of comprehensiveness. Documents that had previously slipped through the cracks due to errant cataloging are being detected and will soon be brought into the Registry.
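A figure like that 90% can be computed by normalizing headings on both sides and checking each bibliographic record for at least one hit. The sketch below is only an assumed approach: the normalization is deliberately crude, the function names are hypothetical, and “Unknown Body” is an invented non-matching heading.

```python
import re


def normalize(heading):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", heading.lower())
    return " ".join(cleaned.split())


def match_rate(bib_records, authorized_headings):
    """Share of records with at least one heading found among the authorized ones."""
    if not bib_records:
        return 0.0
    authorized = {normalize(h) for h in authorized_headings}
    matched = sum(
        1
        for headings in bib_records
        if any(normalize(h) in authorized for h in headings)
    )
    return matched / len(bib_records)


authorized_headings = [
    "United States. Coast Guard",
    "United States. 127th General Hospital",
]
# Each bibliographic record is reduced to its list of author headings.
bib_records = [
    ["United States. Coast Guard."],
    ["UNITED STATES.  COAST GUARD", "Kilroy, Bill"],
    ["Unknown Body"],
]

print(f"{match_rate(bib_records, authorized_headings):.0%}")
```

Two of the three sample records match despite differences in case, spacing, and trailing punctuation, which is the kind of errant cataloging a normalization step is meant to absorb.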
Note on the Coast Guard record: in practice, because the Homeland Security relationship sits in a free-text note, any computational process still doesn’t know about it.