Friday, December 16, 2011

Stopwords in SearchWorks - to be or not to be?

We've been examining whether or not to restore stopwords to Stanford's SearchWorks index (http://searchworks.stanford.edu).

Stopwords are words ignored by a search engine when matching queries to results. Any list of terms can be a stopword list; most often the stopwords comprise the most commonly occurring words in a language, sometimes limited to function words (articles, prepositions) rather than content words (verbs, nouns).

The original usage of stopwords in search engines was to improve index performance (query matching time and disk usage) without degrading result relevancy (and possibly improving it!). It is common practice for search engines to employ stopwords; in fact Solr (http://lucene.apache.org/solr), the search engine behind SearchWorks, has English stopwords turned on as the default setting. We had no compelling reason to change most of the default Solr settings.  Thus, since SearchWorks's inception we have been using the following stopword list:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, s, such, t, that, the, their, then, there, these, they, this, to, was, will, with.
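To make the effect of this list concrete, here is a minimal sketch of stopword removal in Python. It is an illustration only: the function name `remove_stopwords` is made up, and it assumes simple lowercasing and whitespace splitting, which is cruder than the analysis Solr actually performs.

```python
# The stopword list quoted above.
STOPWORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or s "
    "such t that the their then there these they this to was will with".split()
)


def remove_stopwords(query: str) -> list[str]:
    """Lowercase the query, split on whitespace, and drop stopwords.

    Assumption: whitespace tokenization; a real Solr analyzer chain
    tokenizes and normalizes text differently.
    """
    return [term for term in query.lower().split() if term not in STOPWORDS]


print(remove_stopwords("to be or not to be"))   # []
print(remove_stopwords("the pearl"))            # ['pearl']
print(remove_stopwords("there will be blood"))  # ['blood']
```

With stopwords removed, "to be or not to be" has no terms left to match, and "the pearl" becomes indistinguishable from "pearl".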

What follows is an analysis of how stopwords are currently affecting SearchWorks, and what might happen if we restore stopwords to the SearchWorks index (that is, stop ignoring them), making every query term significant.

 

Executive Summary

We believe that restoring stopwords to SearchWorks could improve results in up to 18% of the searches, and will degrade results only in the small number of searches with more than 6 terms.

 

How Many Terms are there in User Queries?

Over 50% of the query strings for SearchWorks are 1 or 2 terms.
Over 75% of the query strings are 1, 2 or 3 terms.
Over 90% of the query strings for SearchWorks have 6 or fewer terms.

These figures cover query strings only; they do not include facet values or other parameters.  Here is a histogram showing the number of terms in our queries for October 2011.  Note that single term queries are split into "alphanum" and "numeric".


Source: (from Google Analytics for Oct 2011, analyzed by Casey Mullin)

 

What Percentage of Query Strings have Stopwords?

In November 2011, there were 142,869 searches.  Stopwords appeared in 26,076 of them. Thus, stopwords appeared in roughly 18% of searches.



(Per analysis of November 2011 usage statistics by Casey Mullin, sent in email on Dec 14, 2011).

 

Do the Stopwords Currently Used in Queries Imply the Users are Trying Boolean Searches?

The 10 stopwords appearing most often in queries are (for November 2011):

Stopword    Occurrences in queries
the         7578
of          6582
and         4106
in          2298
a           1137
to          1033
for          695
on           685
an           289
with         231

"or" and "not" do not appear in many queries, and "and" is neither the most frequent stopword nor close to it in occurrences. I interpret this to mean stopwords in queries are NOT intended as boolean operators.

(per analysis of November 2011 usage statistics by Casey Mullin, sent in email on Dec 14, 2011).

 

What About Minimum Must Match?

Restoring stopwords could hugely degrade precision, since stopwords occur so often.  Solr's mm setting (minimum must match) gives us a way to mitigate this problem.  In our current index, which employs stopwords, our mm threshold is 4: queries with up to 4 terms must match every term; for 5 or more query terms, 90% must match.  Given that over 90% of queries have 6 or fewer terms, 6 seems an appropriate threshold for an index that includes all words.

As it happens, increasing our mm threshold was proposed a while back, distinct from the idea of restoring stopwords to the index. 
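The mm rule described above can be sketched in a few lines of Python. This is a simplified model, not Solr's implementation: the function name `required_matches` is hypothetical, and it assumes the percentage is rounded down, which is how I understand Solr to treat an mm spec of the form `4<90%`.

```python
def required_matches(num_terms: int, threshold: int = 4, percent: int = 90) -> int:
    """Return how many query terms must match under a 'threshold<percent%' rule.

    Assumptions: at or below the threshold every term is required; above
    it, the percentage of terms is required, rounded down.
    """
    if num_terms <= threshold:
        return num_terms
    return num_terms * percent // 100


# Compare the current threshold (4) with the proposed one (6).
for n in range(1, 11):
    print(n, required_matches(n, threshold=4), required_matches(n, threshold=6))
```

Under this model a 6-term query needs 5 matching terms with a threshold of 4, but all 6 with a threshold of 6, so raising the threshold only tightens matching for 5- and 6-term queries.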


What is Improved by Restoring Stopwords to the Index?

  1. Searches consisting only of stopwords now retrieve results (improved recall) 
    • to be or not to be (with or without quotes) 
  2. Precision is greatly improved for short searches that include stopwords 
    • pearl vs. the pearl
    • the one
    • A Zukofsky (author Zukofsky, title "A")
    • there will be blood  (3 stopwords, so huge improvement)
    • OR spectrum (a periodical)
    • Jazz: an Introduction
  3. Subject links distinguish "in" from "and", etc. 
    • Archaeology in Literature is no longer conflated with Archaeology and Literature
  4. Improved results for languages having words overlapping English stopwords

 

What is Degraded by Restoring Stopwords to the Index?

  1. Long queries (over 6 terms) containing many stopwords have reduced precision ...  BUT results where the words occur as a phrase do float to the top. 
    • Lectures on the Calculus of Variations and Optimal Control Theory

 

What Else Have Testers Reported?

  • Known Item Searches: 
    • restoring stopwords tied or improved our testers' known item searches. 
    • one exception: 
      • a search for dorothy and the wizard OF oz did not retrieve the desired title, which was actually dorothy and the wizard IN oz. 
  • Series Searches, and Uniform Title:
    • "A potential problem of the stopword change is that title access points (aka uniform title) constructed according to AACR2 are without initial articles. So, for instance, the access point for the series "The NASA history series" is "NASA history series". A query that includes the initial article will not affect the search result in current production SW because "the" is eliminated as a stopword, but will affect the search result when stopwords are treated as significant words. On searchworks-test, a phrase title search for "The NASA history series" retrieves 76 records. The same search on production retrieves 125 records. The test search still retrieves some of the records that belong to this series because the transcribed series statement, which is in the 490 field, includes the initial article, but not all of them do. The series access points in the 830 field are all without the initial article. [Symphony browse series retrieves 94 results.]"
    • my reaction: in the metadata advisory group, many of the records we examined had the "wrong" information in the field (it included the initial article, and it shouldn't have). Sooo … our data is dirty -- shocking, but true. It would also be nice to know how often the affected searches are exercised, especially by end-users.

 

Additional Comments

Everything is Imperfect. 
  • SearchWorks employing stopwords gives imperfect search results. 
  • SearchWorks restoring stopwords, so that every term is significant, gives different imperfect search results.
  • Socrates (our OPAC from our ILS, Sirsi) gives yet different imperfect search results. 
The back end algorithms for determining what results match a query will always be fairly opaque to the end users - the algorithms are complicated. Moreover, users will have typos and other mistakes in their queries no matter what we do, and it seems unlikely we can consistently rescue them from themselves.

Everything Can be Changed.

Solr gives us incredible control over our search engine's algorithm. There are many many knobs we can twiddle in our quest to improve the relevancy of search results. A few of the possibilities include:
  • mm -- require a higher percentage of matching terms when there are more than 6 terms in the query
  • phrase boosting -- this floats results with the query terms occurring close together (and presumably in the same order) to the top.  Currently it seems high enough, but we have never performed any empirical tests.
  • phrase slop -- how close words must occur to each other in the results.  Our current setting is 3; it is not clear to me exactly how phrase boosting and phrase slop interact.
  • adjust the relative boosting of fields -- give even more weight to title field matches, etc.  Again, we've never performed any empirical tests.
  • indexed string length doesn't always have to matter -- adjust the situations where the length of the indexed string affects the score of matches.  E.g. query "my cat" can score higher for title "my cat" than for "my cat and dog."
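
Several of the knobs above correspond to request parameters of Solr's dismax query parser (`mm`, `qf`, `pf`, `ps`). The sketch below builds such a parameter set in Python. The field names (`title_search`, `all_search`) and the boost values are hypothetical placeholders, not the actual SearchWorks configuration; only the mm threshold of 6 and the phrase slop of 3 come from the discussion above.

```python
from urllib.parse import urlencode

# Hypothetical dismax request parameters; field names and boosts are
# illustrative assumptions, not SearchWorks' real settings.
params = {
    "defType": "dismax",
    "q": "there will be blood",
    "mm": "6<90%",                         # up to 6 terms: all required; more: 90%
    "qf": "title_search^5 all_search",     # relative boosting of fields
    "pf": "title_search^20 all_search^5",  # phrase boosting
    "ps": 3,                               # phrase slop
}

query_string = urlencode(params)
print(query_string)
```

Keeping these values in request parameters rather than baked into the index makes it cheap to try different settings against the same data, which would help with the empirical tests we have not yet run.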

 

So Where Are We Now?

The data is in, and a decision will be made soon.  I'm guessing stopwords are going to be left in our past.
