Practical Relevance Ranking for 11 Million Books

Tom Burton-West, one of the HathiTrust developers, has been working for a number of years on practical relevance ranking for all the volumes in HathiTrust. In May and June of 2014 he published two posts on the HathiTrust Large-Scale Search blog. He expects to continue writing about relevance ranking, and we will repost those posts here.

Practical Relevance Ranking for 11 Million Books, Part 1

Practical Relevance Ranking for 11 Million Books, Part 2: Document Length and Relevance Ranking

Relevance is a complex concept which reflects aspects of a query, a document, and the user as well as contextual factors. Relevance involves many factors such as the user's preferences, task, stage in their information-seeking, domain knowledge, intent, and the context of a particular search.

Excerpts from Tom's posts

HathiTrust full-text search uses the Solr/Lucene open-source search engine software. We believe that Lucene’s default relevance ranking algorithm does not work well with our book-length texts because Lucene’s algorithm tends to rank short documents too high... with a few exceptions, all the published research on relevance ranking algorithms has been done on relatively short documents.
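To illustrate how short documents can end up ranked above books, here is a minimal Python sketch of two components of Lucene's classic TF-IDF similarity: the term-frequency factor sqrt(freq) and the length norm 1/sqrt(number of terms in the field). It is a simplification and not HathiTrust's actual configuration: idf, query normalization, coordination, boosts, and Lucene's lossy one-byte norm encoding are all left out, and the function names and the page and book sizes are hypothetical, chosen only for illustration.

```python
import math


def classic_tf(freq: int) -> float:
    """Term-frequency factor of Lucene's classic TF-IDF similarity: sqrt(freq)."""
    return math.sqrt(freq)


def classic_length_norm(num_terms: int) -> float:
    """Length normalization of the classic similarity: 1 / sqrt(num_terms)."""
    return 1.0 / math.sqrt(num_terms)


def tf_times_norm(freq: int, num_terms: int) -> float:
    """Simplified per-term score contribution (idf and boosts omitted)."""
    return classic_tf(freq) * classic_length_norm(num_terms)


# Hypothetical documents: a single page and a whole book.
page_score = tf_times_norm(freq=4, num_terms=200)
book_score = tf_times_norm(freq=400, num_terms=100_000)

print(f"page: {page_score:.4f}  book: {book_score:.4f}")
```

In this simplified form the product reduces to sqrt(freq / num_terms), so the score depends only on how dense the term is in the document. The hypothetical page outscores the book even though the book contains the term 100 times as often.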

...the possible range of term frequencies is about two orders of magnitude greater for book length documents than for smaller documents such as the newswire articles used in the TREC ad hoc collections or the HathiTrust documents indexed on the page level. We believe that it is unlikely that algorithms that have been tested and tuned on smaller documents will work well with large documents.

... idf weighting is unlikely to be effective with long documents. We also suspect that the relationship between term frequency and relevance is fundamentally different for short documents, such as the newswire articles and truncated web pages used in research test collections, than for long documents like books. The relationship between term frequency and relevance will be discussed further in future blog posts.
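The excerpt does not give its reasoning about idf, so the following Python sketch is only one possible intuition and not taken from Tom's posts. It uses the idf formula of Lucene's classic similarity, 1 + ln(N / (df + 1)), and compares a rarer and a more common term when the same text is indexed page by page versus book by book. Everything numeric is a loudly hypothetical assumption: the collection size, the pages per book, the per-page probabilities, and especially the unrealistic assumption that a term appears on each page independently.

```python
import math


def classic_idf(num_docs: float, doc_freq: float) -> float:
    """idf as in Lucene's classic TF-IDF similarity: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))


def expected_doc_freq(num_docs: float, p_page: float, pages_per_doc: int) -> float:
    """Expected number of documents containing a term, ASSUMING the term
    appears on each page independently with probability p_page."""
    p_doc = 1.0 - (1.0 - p_page) ** pages_per_doc
    return num_docs * p_doc


# Hypothetical collection and term probabilities, for illustration only.
BOOKS = 10_000
PAGES_PER_BOOK = 300
P_RARE, P_COMMON = 0.001, 0.05


def idf_gap(num_docs: float, pages_per_doc: int) -> float:
    """Difference in idf between the rarer and the more common term."""
    rare = classic_idf(num_docs, expected_doc_freq(num_docs, P_RARE, pages_per_doc))
    common = classic_idf(num_docs, expected_doc_freq(num_docs, P_COMMON, pages_per_doc))
    return rare - common


page_level_gap = idf_gap(BOOKS * PAGES_PER_BOOK, 1)  # each page is a document
book_level_gap = idf_gap(BOOKS, PAGES_PER_BOOK)      # each book is a document

print(f"page-level idf gap: {page_level_gap:.2f}")
print(f"book-level idf gap: {book_level_gap:.2f}")
```

Under these assumptions the idf gap between the two terms is smaller when whole books are the unit of indexing, because a long document is likely to contain even a fairly rare term at least once. Real text is burstier than this model, so the size of the effect in an actual collection may differ.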
