Problems with Authority

An essentially meaningless graph of the United States Federal Government and its constituent organizations.

One of the projects of HathiTrust’s U.S. Federal Government Documents Program, for which I am the principal data wrangler and mangler, is to develop a metadata registry of every work published by the Federal government. Measuring the completeness of the corpus, and therefore our success or failure, is difficult. One way to measure completeness is to evaluate a smaller known subset. Are we missing records for agencies that should be represented in the corpus, or are there suspicious gaps in the publication history? Before we answer that question, we have to be able to reliably identify authors.

Ideally, we would have a complete list of Federal authors (departments, agencies, offices, employees) and a map of their relationships. The ability to reliably identify authors and their relationships would allow us to properly classify suspected Federal documents, match duplicate bibliographic records, and inform decisions regarding collection development.

Coping with MARC Authority Records

We have attempted to address parts of these challenges using MARC authority records from the Library of Congress. “MARC authority records contain the standardized names, or ‘authorized headings’ for people, corporate bodies (societies, businesses, institutions, etc.), meetings, titles, and subjects” as used in MARC bibliographic records.[1] The Library of Congress Name Authority File is the result of years of work by expert catalogers with an obsessive attention to detail. While comprehensiveness may not be possible, the content of the Authority File is impressive.

For the uninitiated, MARC stands for MAchine-Readable Cataloging standards. Arguably, MARC records are not machine readable nor even standard, and “cataloging” only so far as a record describes what to print on a 3”x5” card. While many fields resemble entity-relationship data, they are more akin to an arcane templating system. MARC bibliographic records describe the appearance of a cataloging record and only incidentally the underlying work. Likewise, MARC authority records describe the textual format of names and subjects, not the person, entity, or topic that bears that name. Forgetting this point leads to misuse and abuse of MARC metadata, and a bad time for the data programmer.

I frequently have a bad time. After much trial and error, I was able to extract 8.2 million records from the Name Authority File into a queryable database. Of the extracted records, 6.6 million records are Personal Name Headings (field 100), e.g. “Kilroy, Bill”, and 1.6 million records are Corporate Name Headings (field 110), e.g. “United States Coast Guard”.
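A minimal sketch of that kind of tally, assuming each authority record has already been reduced to an identifier, the tag of its heading field, and the heading text. The table name, column names, and sample identifiers are all hypothetical; this is not the schema of the actual database.

```python
import sqlite3

# Hypothetical, already-extracted records: (record id, heading tag, heading text).
SAMPLE_RECORDS = [
    ("rec-1", "100", "Kilroy, Bill"),
    ("rec-2", "110", "United States. Coast Guard"),
    ("rec-3", "110", "Library of Congress"),
]


def build_database(records):
    """Load records into an in-memory SQLite database (schema is an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE authority (id TEXT PRIMARY KEY, tag TEXT, heading TEXT)"
    )
    conn.executemany("INSERT INTO authority VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


def count_by_tag(conn):
    """Return {heading tag: record count}; 100 = personal name, 110 = corporate name."""
    rows = conn.execute("SELECT tag, COUNT(*) FROM authority GROUP BY tag")
    return dict(rows.fetchall())


if __name__ == "__main__":
    conn = build_database(SAMPLE_RECORDS)
    print(count_by_tag(conn))
```

Keeping the heading tag as its own column makes the personal/corporate split a single GROUP BY rather than a pass over the raw records.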

Explicit Relationships

In theory, MARC authority records contain “see also” tracing fields for related authorized headings (field 510). These fields should provide connections to an author’s alternate names, parent organization, or employer using relationship subfields. In practice, this information is usually omitted or left ambiguous.

Explicit relationships in our database of 8.2 million records:

  • Successor: 126,448 records
  • Predecessor: 122,122 records
  • Hierarchical Superior: 81,609 records
  • Employee: 134 records
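Counts like these can be produced by tallying the relationship label on each 510 tracing. The sketch below assumes the records have been simplified to dictionaries and that the label sits in subfield i (for example “Hierarchical superior:”); the sample records and the "unspecified" bucket are hypothetical.

```python
from collections import Counter

# Hypothetical simplified records: an id plus a list of 510 tracings,
# each tracing being a dict of subfield code -> value.
SAMPLE = [
    {"id": "rec-1", "510": [{"i": "Hierarchical superior:", "a": "United States.", "b": "Army"}]},
    {"id": "rec-2", "510": [{"i": "Successor:", "a": "Agency B"},
                            {"i": "Successor:", "a": "Agency C"}]},
    {"id": "rec-3", "510": [{"a": "Agency D"}]},  # tracing with no stated relationship
    {"id": "rec-4", "510": []},                   # no tracings at all
]


def relationship_label(tracing):
    """Normalize the subfield i label; tracings without one become 'unspecified'."""
    label = tracing.get("i", "").strip().rstrip(":").strip().lower()
    return label or "unspecified"


def records_per_relationship(records):
    """Count records (not fields) that carry each kind of relationship."""
    counts = Counter()
    for record in records:
        # A set ensures a record with two successor tracings is counted once.
        labels = {relationship_label(t) for t in record.get("510", [])}
        counts.update(labels)
    return counts


if __name__ == "__main__":
    for label, n in records_per_relationship(SAMPLE).most_common():
        print(f"{label}: {n} records")
```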

For example, the aforementioned record for Bill Kilroy omits his employment with the Forest Service, information that would help in assembling a complete collection of the agency’s publications. The record for the United States Army Special Epidemiologic Team lacks tracings to the authorized headings for the United States Army or its immediate parent, the United States Army Special Forces.

Computing Implicit Relationships From Authorized Headings

The lack of explicit relationships meant I was forced to get a little creative with (that is, misuse and abuse) implicit parent/child relationships. Typically, corporate headings are split into subfields in order of descending hierarchical level. For example:

    |a United States. |b Coast Guard

    |a University of Michigan. |b Department of Electrical and Computer Engineering. |b Radiation Laboratory

Unfortunately, there are frequent exceptions to this, marked by indicators for “direct order”, “indirect order”, and “jurisdiction”. You can find an explanation (but no excuse) for these in AACR2.

    |a Library of Congress

    |a Abernathy Fish Technology Center (U.S.)

    |a AIDS Drug Assistance Program (U.S.)

Even when a hierarchical heading is given, as in “United States. Coast Guard”, the authorized heading usually obscures the intermediary organizations. Using just the authorized heading, more than 7,500 corporate names appear as first-level organizations of the United States government.[2] Again, the heading is designed to describe a name for display in a bibliographic record, not the entity.
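A rough sketch of reading a parent/child chain out of a heading, assuming the heading is available as a string in the “|a … |b …” notation used in the examples above. The function names are my own for illustration, and the handling of trailing periods is a simplification (a name that legitimately ends in an abbreviation would lose its final period).

```python
import re


def parse_subfields(heading):
    """Split '|a United States. |b Coast Guard' into (code, value) pairs."""
    parts = re.findall(r"\|(\w)\s*([^|]*)", heading)
    # Drop the period that separates hierarchical levels.
    return [(code, value.strip().rstrip(".").strip()) for code, value in parts]


def implied_chain(heading):
    """Names from the top of the implied hierarchy down to the unit itself."""
    return [value for code, value in parse_subfields(heading) if code in ("a", "b")]


def implied_parent(heading):
    """Immediate parent implied by the heading, or None for a one-level heading."""
    chain = implied_chain(heading)
    return chain[-2] if len(chain) > 1 else None


if __name__ == "__main__":
    for h in (
        "|a United States. |b Coast Guard",
        "|a University of Michigan. |b Department of Electrical and Computer "
        "Engineering. |b Radiation Laboratory",
        "|a Library of Congress",
    ):
        print(implied_chain(h), "parent:", implied_parent(h))
```

Run against the examples, the Coast Guard comes out as a direct child of “United States” and the Library of Congress has no parent at all, which is exactly the flattening described above.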

Computing Implicit Relationships from Unauthorized Headings

After working with MARC bibliographic records for several years, it should not have surprised me that most of the authoritative data found in MARC authority records can be found in the unauthorized headings (field 410). When the 510 tracings fail, this is where we can usually find the more complete and useful name.

The “United States. 127th General Hospital” lacks explicit relationships in 510 tracing fields, and the authorized heading omits bureaucratic superiors. Fortunately, there are two unauthorized headings in 410 fields and one of them tells us where to place the 127th General Hospital within the hierarchy even if it lacks the precision we would like.

    |a United States. |b Army. |b 127th General Hospital

    |a United States. |b General Hospital, 127th

Unauthorized fields rarely give us the exact nature of the relationship with the unauthorized heading (previous name, more complete name, acronym), but we weren’t getting that from the 510s anyway.
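One way to sketch this fallback: treat each heading as a list of levels and prefer an unauthorized heading that names the same unit under the same top-level body but with more levels in between. The selection rule and function names here are assumptions for illustration, not a description of the Registry's actual logic.

```python
def fuller_hierarchy(authorized, variants):
    """Pick the longest variant that shares the authorized heading's first and
    last levels; fall back to the authorized heading if none qualifies."""
    candidates = [
        v for v in variants
        if len(v) > len(authorized)
        and v[0] == authorized[0]
        and v[-1] == authorized[-1]
    ]
    return max(candidates, key=len, default=authorized)


def implied_parent(chain):
    """Immediate parent in a list of levels, or None for a one-level heading."""
    return chain[-2] if len(chain) > 1 else None


if __name__ == "__main__":
    authorized = ["United States", "127th General Hospital"]
    variants = [
        ["United States", "Army", "127th General Hospital"],
        ["United States", "General Hospital, 127th"],
    ]
    chain = fuller_hierarchy(authorized, variants)
    print(chain, "parent:", implied_parent(chain))
```

The inverted form “General Hospital, 127th” is ignored because its last level does not match the authorized name; a more forgiving comparison would need name normalization.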

Fuzzy MARC Is a Total Bear: Dates

Data on founding/birth, dissolution/death, and merger dates are very rare in MARC authority records. Of the 8.2 million records, just 240,000 have “start periods” and 50,000 have “end periods”, which makes it hard to tell when a given person or body was active. Authority records aren’t very authoritative if we can’t be certain which person or entity we are naming.

Even when a record is more complete, like the record for the United States Coast Guard, ambiguity is inevitable. Five of the United States Coast Guard’s hierarchical superiors can be found in “510” tracing fields. If it weren’t for a text note in a “670” field, we would not know that the Coast Guard currently falls under the Department of Homeland Security.[3] This has consequences for establishing relationships among the Coast Guard’s subordinate organizations.

The record with an authorized heading of “United States. Coast Guard. Airborne Radiation Thermometer Program” is clearly a subunit of the Coast Guard.[4] However, if we wanted every Department of Homeland Security subunit or publication, would we include the Airborne Radiation Thermometer Program? Probably not: a cursory search of WorldCat finds no publications by the Airborne Radiation Thermometer Program or its immediate parent, the Oceanographic Unit, later than 1983, twenty years before the creation of the Department of Homeland Security.
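The reasoning in that example amounts to an interval check. Here is a minimal sketch, assuming we treat the last known publication year as a stand-in for a unit's end date and take 2003 (twenty years after 1983) as the start of the Coast Guard's time under Homeland Security. The Period class and may_overlap function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Period:
    start: Optional[int]  # None means the year is unknown
    end: Optional[int]


def may_overlap(a, b):
    """True unless the known years prove the two periods are disjoint."""
    if a.end is not None and b.start is not None and a.end < b.start:
        return False
    if b.end is not None and a.start is not None and b.end < a.start:
        return False
    return True


if __name__ == "__main__":
    # Assumption: last publication year stands in for the program's end date.
    thermometer_program = Period(start=None, end=1983)
    under_homeland_security = Period(start=2003, end=None)
    undated_unit = Period(start=None, end=None)

    print(may_overlap(thermometer_program, under_homeland_security))  # excluded
    print(may_overlap(undated_unit, under_homeland_security))         # cannot be ruled out
```

With so few records carrying any dates, most units fall into the “cannot be ruled out” case, which is why the date problem is hard to solve from the authority file alone.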

Fuzzy MARC Is a Total Bear: Relationships

Corporate relationships are complex and can’t always be fit within the limited vocabulary of “superior”, “successor”, or “predecessor”. Usually authority records will be missing data, but occasionally they try to be a little too helpful and end up confusing matters.

Howard University’s relationship with the Federal Government is too long and complex to be captured by a record claiming the Department of the Interior and Federal Security Agency are hierarchical superiors. Likewise, the Michigan Crop Reporting Service is described as subordinate to not only the State of Michigan’s Department of Agriculture, but also the United States Department of Agriculture. This is apparently based upon a single joint publication. These relationships are important and useful to take note of, but authority records cannot adequately describe them.

Mission Accomplished?

Any time I work with bibliographic metadata, or in this case authority metadata, my bar for success is relatively low. Despite my problems with authorities, we were ultimately able to produce a list of 50,000 United States government authors. This list leverages the domain expertise of countless catalogers to find most of the nooks and crannies of the Federal government. There are surprises, both good and bad. I never would have identified the “IFPRI/ISRA Project on Consumption and Supply Impacts of Agricultural Price Policies in Senegal” as being within the scope of the Registry, but its authority record reveals it to be a United States Agency for International Development project. On the other hand, I have to deal with the Howard University and Michigan Crop Reporting Service records.

More than 90% of the Federal Documents Registry’s bibliographic records contain a matching authorized heading. That qualifies as a success considering the state of the bibliographic records. We can now produce reasonably accurate collections of agency publications, allowing us to construct publication histories and evaluations of comprehensiveness. Documents that had previously slipped through the cracks due to errant cataloging are being detected and will soon be brought into the Registry.


[1] https://www.loc.gov/marc/uma/pt1-7.html

[2] Technically, “United States” is a Geographic heading (field 151) instead of a Corporate heading. Records like “United States. Embassy (Botswana)” suggest this might be a bad idea.

[3] In practice, because it’s in a free-text note, a computational process still can’t know this.

[4] Actually, it contains an unauthorized heading indicating it is more precisely part of the Coast Guard’s Oceanographic Unit.

5 Comments

Stephen Hearn
on Aug. 10, 2:17pm

There are several points here which need some rethinking. "... MARC records are not machine readable nor even standard, and “cataloging” only so far as a record describes what to print on a 3”x5” card." Be fair. HTML could encode the formatting needed to print cards with none of the semantic coding which MARC provides, however imperfectly. And to say that a standard as long-lived and widely used in libraries as MARC is not "even standard" seems dubious. "These fields should provide connections to an author’s alternate names, parent organization, or employer using relationship subfields. In practice, this information is usually omitted or left ambiguous." MARC has been around a long time. The addition of fields for recording parent organization, employer, etc. is comparatively very recent. Likewise for corporate body dates. To say that such information is "usually omitted or left ambiguous" over the whole body of MARC authorities indicates a shortsighted understanding of the history of the MARC authority format and the LC Name Authority File. More generally, library headings are intended to serve users who generally do not want to look up corporate names by their full hierarchical string. Direct entry is often preferred. The truncation of such names in their heading form is a labor-saving device for users. The authority often provides a more complete hierarchy among the variant names as was observed, just not as the preferred term. Catalogers are not omniscient. Generally anyone doing library authority work is gleaning all the information they can within the constraints of maintaining an efficient workflow and in a context of standards and practices which has changed a lot over time. Often when information is lacking in the authority, the reason is usually that it was either not available or not accommodated in the format at the time--not that a cataloger ignored it. 
It's great to see the LCNAF being mined and enhanced for non-library projects, and I'm sure such projects face much frustration. But please bear in mind that library authority practice has only recently shifted its focus from name authorization for library catalog use to entity description. There's lots of work for all of us to do in support of that transition. Hopefully HathiTrust's metadata registry for US Federal agencies will be published as open linked data so that its agency metadata can be linked to library authorities.

Karen Coyle
on Aug. 11, 1:36pm

Thanks for the extensive exposition of your struggles! "Headings" (even as they are created in library cataloging today) are designed to be encountered by humans who are browsing the catalog alphabetically, and who, in doing so, note the context of the preceding and succeeding entries* and, using their intelligence, understand the relationships that are being implied. Computers instead are "Hah! Bits!" So I'm not surprised at what you found but am totally impressed at the match rates that you achieved. However, wouldn't it be nice if work with our data didn't have to be prefaced with "In spite of ...."? * why is preceding "ede" and succeeding "eed" when they really are just a difference of direction? More presumed logic that escapes me.

Minjie Chen
on Aug. 15, 2:35pm

I wonder if the case suggests that the LC name authority file, designed to fulfill whatever its original purpose is and constructed with the limited resources and time constraint by juggling professional librarians, is being asked to perform new and more powerful functions not foreseen pre-linked data era. What rich and consistent information programmers now wish to harvest from the authority file--precise hierarchical and historical relationships, beginning and ending date of an agency, etc.--may fall into the scope of a dedicated reference tool on US federal government agencies if such a book/database exists. Librarians have not been asked to compile encyclopedia entries in addition to cataloging books, although the RDA standards seem increasingly and implicitly expect us to become mini-bio writers. As a Chinese cataloger, I have noticed similar flaws in LC NAF for Chinese geographical names and I am not the first one at that. In terms of hierarchical relationships, a heading for a district (qu), a county (xian), or a city (shi) supplies information about which Chinese province it is subordinate to only haphazardly. I am talking about human-readable text, much less structured, machine-actionable data. Similarly, the historical relationships between geographical names are unclear. Through an administrative change, a county may have been renamed as a district, reflected in two separate headings. It is hard to tell from NAF when that change happens, and catalogers have to be extra careful about choosing the right heading to assign to bibliographical metadata. Here is a made-up use case: suppose we want to locate local gazetteers at all levels about a certain Chinese province named Zhejiang. To do a comprehensive search, you will need to list all the place names in the province and Boolean search them all (by the "or" operator). Even for the provincial capital city of Hangzhou, NAF merely provides a cryptic note, if any, that the city is in the province of Zhejiang. 
In comparison, the Wikipedia entry of Zhejiang seems to understand the relationship between subordinate places better, suggested by the "County-level divisions of Zhejiang Province" knowledge bar that can be unfolded at the bottom of the page. Our imaginary search will be more efficient if we could leverage relationships already specified elsewhere and combine that with our own authorized headings, which apparently are not very interested in socializing with each other…

Chris Lemery
on Sept. 1, 8:20am

Thanks for this insight into your work, Joshua. I echo Karen's thoughts on your match rate and I congratulate you. I get the heebie-jeebies just thinking about trying to sort out the organizational relationships in the federal government, but I think the registry will be better for your work!
