Improving Search and Discovery of Digital Resources Using Topic Modeling
Project Personnel: Youn Noh (PI at Yale University), John Weise (co-PI at University of Michigan), Kat Hagedorn (Investigator at University of Michigan), Dave Newman (Investigator at University of California, Irvine)
Sponsor: National Leadership Grant from the Institute of Museum and Library Services (IMLS)
Dates: October 1, 2008 to September 30, 2011
Award Amount: $749,990 (matching amount: $351,958)
In 2008, Yale University, the University of Michigan, and the University of California, Irvine were awarded a National Leadership Grant from the Institute of Museum and Library Services to research how to improve search and discovery of digital resources using topic modeling. Topic modeling is a method of clustering documents with similar subject matter, using an algorithm developed at the University of California, Irvine. See the Yale topic modeling grant page for further information.
Digital image collections may contain hundreds of thousands of images, from images of paintings and drawings to posters and photographs. With limited text metadata, how can we better organize these collections to help users search and discover images? Multiple large-scale efforts to digitize entire library collections are producing millions of electronic volumes. But can we do better than keyword search to find relevant and interesting books?
We are investigating how topic modeling can be used to answer the above questions. Topic modeling is a relatively new, completely automated text mining technique that can extract semantic topics from any large collection of text items. These semantic topics can be useful as subjects and can be used to organize poorly categorized collections of objects. In this project, we will be applying topic modeling to three important classes of digital library resources: images, full-text books, and tagged objects. After running the topic model on each collection of objects, domain experts will interpret the automatically learned topics and evaluate their usefulness as subjects. We will then use these learned topics for ranking of search results, summarization, faceting, and seeding of tag clouds.
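To make the idea of "learning topics from text" concrete, here is a minimal sketch in Python. It is not the project's software: this page does not name the algorithm used, so the sketch assumes latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling, one common topic model, as a stand-in. The function name `fit_lda`, its parameters, and the tiny metadata records are all hypothetical and exist only for illustration.

```python
import random


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200,
            top_n=3, seed=0):
    """Fit a small LDA model by collapsed Gibbs sampling (illustrative only).

    docs is a list of token lists. Returns (theta, top_words), where
    theta[d][k] is the estimated share of topic k in document d and
    top_words[k] lists the most frequent words assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)] # tokens per (topic, word)
    topic_total = [0] * num_topics                             # tokens per topic
    assignments = []                                           # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions; each row sums to 1.
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    top_words = [
        [vocab[v] for v in sorted(range(vocab_size),
                                  key=lambda v: topic_word[k][v],
                                  reverse=True)[:top_n]]
        for k in range(num_topics)
    ]
    return theta, top_words


# Made-up, already tokenized metadata records (not project data).
docs = [
    "gothic cathedral stone facade architecture".split(),
    "cathedral nave architecture stone vault".split(),
    "oil painting portrait canvas artist".split(),
    "portrait painting artist canvas watercolor".split(),
]

theta, top_words = fit_lda(docs, num_topics=2)
for k, words in enumerate(top_words):
    print(f"topic {k}: {', '.join(words)}")
for d, row in enumerate(theta):
    print(f"doc {d}: {[round(p, 2) for p in row]}")
```

The per-document proportions in `theta` are the kind of signal that could feed the uses listed above, for example grouping records by dominant topic for faceting or comparing topic proportions when ranking results. The word lists in `top_words` are what a domain expert would inspect when judging whether a learned topic is meaningful as a subject.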
We will be building prototypes of user interface applications that implement topics and the above search and discovery functionality. We will then test our prototypes to assess the value of topic modeling for end users, using well-established testing methods and assessment measures. For each application developed, we will define a control group (with no access to topics or topic-based functionality) and experimental groups (using topics and topic-based functionality) and test the performance of these groups on search and discovery tasks. We have selected collections for applications based on the availability of data and suitable experimental groups.
To broaden the impact of our work, we will develop open source software and tools for topic modeling. These software tools will include the topic model, a topic browser, preprocessing and processing scripts, a topic classifier, and a metadata enhancer. The tools will be developed for a general target audience of digital libraries using current metadata standards. We will also develop course materials for a workshop that we will deliver on how to integrate topic modeling into any digital library. Software developed for the project will be used in the workshop. Finally, our research findings will be disseminated at a variety of conferences and publication venues.
Topic models have been developed using both metadata from image collections and the full text of HathiTrust texts, in the areas of art, architecture, and art history. Several user tests have been developed and run to assess the quality of the topic models. In June and July 2010, and again in November 2010, we ran unmoderated user tests to learn more about the topic models and how effectively users can work with them in a prototype interface. We are currently analyzing the many hundreds of test recordings we received.
We also performed face-to-face tests with a number of users in February and March 2011 to gain qualitative information about how well topics work in our test interface, and how they compare to Library of Congress Subject Headings (LCSH).
The grant team gave a presentation on our work to date on October 7, 2010, in the Hatcher Gallery from 9:00 to 10:30 am. We also ran a tutorial workshop on the software underpinnings of topic modeling on October 7, 2010, in the ULIC from 1:00 to 3:00 pm. Registration was required for the tutorial.
Kat Hagedorn led a poster session at the Digital Library Federation (DLF) Fall Forum in Palo Alto, November 1-3, 2010. The poster was designed to be presented entirely on an iPad and showed, through images only, how we used Morae software to run our unmoderated user tests, including both the pros and cons of that approach.
Kat Hagedorn presented a poster at the ACRL 2011 conference in Philadelphia, March 30-April 2, 2011. The poster further described our unmoderated testing process and included the results of our analysis to date.
Kat Hagedorn, Michael Kargela, Youn Noh and Dave Newman wrote an article detailing the unmoderated and moderated testing results. The paper was published in D-Lib Magazine in the September/October 2011 issue.
As the final part of the University of Michigan's work, Youn Noh, Dave Newman, and Kat Hagedorn presented the project's three years of work at Yale University on October 4, 2011.