Monday, January 6, 2014

CJK with Solr for Libraries, part 4 (Edismax Woes, part 1)

This is the fourth in a series of posts about our experiences addressing Chinese, Japanese and Korean (CJK) resource discovery in SearchWorks, the Stanford University Libraries "catalog" built with Blacklight on top of our Solr index.

You might be interested in this post in particular if you use Solr's edismax query parser.

Edismax Woes, Part 1

In the third part of this series, I mentioned that we had 21 relevancy acceptance tests fail when we used the edismax Solr query parser instead of dismax.   If I inserted the "e" in my solrconfig, I would see these test failures;  if I removed the "e", the tests would pass.  I had heard "edismax is better", but here I had proof positive that edismax was WORSE.

Our relevancy problems with edismax were unexpected blockers to improving CJK resource discovery:  recall that edismax is required to fix relevancy when using the CJKBigram filter (SOLR-3589) as mentioned in part two of this series.

I believe the intent is for edismax to be an exact equivalent of dismax with additional features ... but this is not true at this time.  It turns out that edismax has a number of unresolved bugs and unimplemented features (see SOLR-2368).

Recall that our test failures were in the following categories:
  1. Journal titles
  2. Hyphens preceded by a space (but no following space)
  3. Boolean NOT
  4. Synonyms for musical keys
I will discuss categories 2 and 3 in this blog post.

Edismax Bug with Boolean Operators

One of the unresolved bugs with edismax (see SOLR-2368 for a comprehensive list) is SOLR-2649, "MM ignored in edismax queries with operators."  This bug essentially means that if a Boolean operator such as NOT or OR appears in the query string, the mm (minimum should match) setting is ignored and all terms in the query are effectively "OR'ed" together.  Note that 'AND' is unaffected;  the four operators affected are:  NOT, OR, - (prohibited), + (required).

It turns out that all of our failing hyphen tests were queries with a space before but not after the hyphen, such as 'under the sea -wind.'   Solr interprets such a hyphen as a "prohibited" operator, which is the same as NOT, and such hyphens are included in bug SOLR-2649.  Thus, our failure categories 2 and 3 are essentially the same. 
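As a rough illustration of which user queries fall within the scope of SOLR-2649, here is a minimal Python sketch that flags the four affected operators in a raw query string. The function name and regular expressions are my own, not anything from Solr. It assumes only uppercase NOT and OR count as operators, and it ignores quoted phrases and the case where every other term is explicitly required, so treat it as an approximation.

```python
import re

# Uppercase NOT / OR standing alone as a whitespace-delimited token.
# Assumption: lowercase "not" / "or" are ordinary search terms.
_WORD_OPS = re.compile(r"(?<!\S)(NOT|OR)(?!\S)")

# '-' or '+' at the start of a term: preceded by whitespace (or the start
# of the string) and immediately followed by a non-space character.
_PREFIX_OPS = re.compile(r"(?<!\S)([-+])(?=\S)")


def affected_operators(query):
    """Return the set of SOLR-2649 operators that appear in the query."""
    found = set(_WORD_OPS.findall(query))
    found.update(_PREFIX_OPS.findall(query))
    return found


print(affected_operators("under the sea -wind"))
print(affected_operators("under the sea-wind"))
```

The first query has a space before the hyphen but none after it, so the hyphen is reported as an operator; in the second the hyphen sits inside a term and nothing is reported.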

Our default operator is AND and we set lowercaseOperators to false (see http://wiki.apache.org/solr/ExtendedDisMax#lowercaseOperators), so only uppercase OR and NOT are treated as operators.  With SOLR-2649, any query containing an uppercase OR or NOT, or a '-' or '+' character immediately preceding a term, gives unexpected results unless every other term has an explicit AND or an explicit '+'.  Our mm setting is 6<-1 6<90%, which is high enough that 4 terms should all be AND'ed together.
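For anyone who wants to reproduce this kind of comparison, here is a small Python sketch that builds a debug request URL with the settings described above. The host and path are hypothetical placeholders, and passing mm and lowercaseOperators as request parameters is an assumption made for illustration; in practice they may be defaults in solrconfig instead.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: substitute your own Solr host, core and handler.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def debug_url(user_query, def_type="edismax"):
    """Build a select URL that asks Solr to show how it parsed the query."""
    params = {
        "defType": def_type,            # "dismax" or "edismax"
        "q": user_query,
        "mm": "6<-1 6<90%",             # the mm setting quoted in this post
        "lowercaseOperators": "false",  # an edismax parameter
        "debugQuery": "true",           # include the parsed query in the response
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(debug_url("customer NOT driven academic library"))
print(debug_url("customer NOT driven academic library", def_type="dismax"))
```

Requesting the same query with both parsers and comparing the parsed query in the debug output is what makes the difference visible.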

Here are some illustrative example user queries, how they are analyzed by Solr (via debugQuery=true), and the results:

DISMAX:
  q=customer driven academic library:
    +(((custom)~0.01 (driven)~0.01 (academ)~0.01 (librari)~0.01)~4) ()
    4 hits
  q=customer NOT driven academic library:
    +(((custom)~0.01 -(driven)~0.01 (academ)~0.01 (librari)~0.01)~3) ()
    96 hits
  q=customer -driven academic library:
    +(((custom)~0.01 -(driven)~0.01 (academ)~0.01 (librari)~0.01)~3) ()
    96 hits
  q=customer academic library:
    +(((custom)~0.01 (academ)~0.01 (librari)~0.01)~3) ()
    100 hits

EDISMAX:
  q=customer driven academic library:
    +(((custom)~0.01 (driven)~0.01 (academ)~0.01 (librari)~0.01)~4)  
    4 hits
  q=customer NOT driven academic library:
    +((custom)~0.01 -(driven)~0.01 (academ)~0.01 (librari)~0.01)  
    984300 hits
  q=customer -driven academic library:
    +((custom)~0.01 -(driven)~0.01 (academ)~0.01 (librari)~0.01)
    984300 hits
  q=customer OR academic OR library NOT driven:
    +((custom)~0.01 (academ)~0.01 (librari)~0.01 -(driven)~0.01)
    984300 hits
  q=customer academic library:
    +(((custom)~0.01 (academ)~0.01 (librari)~0.01)~3)
    100 hits

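You can see that the mm is missing from the middle three edismax queries (there is no trailing ~3);  this is the manifestation of the bug.  You can also see how the number of results would confuse an end user:  excluding "driven" should narrow the results (96 hits with dismax), but with edismax the same query returns 984300 hits.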
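Spotting the missing mm by eye gets tedious, so here is a minimal Python sketch that checks a parsed query string for it. It is a heuristic tied to the debug output format shown in this post, not a Solr API: it looks for a closing parenthesis followed by ~ and a whole number, which distinguishes the mm (~3, ~4) from the per-clause tiebreaker (~0.01). The function name is my own.

```python
import re

# mm shows up as ")~N)" with an integer N; the tiebreaker shows up as
# ")~0.01", where the digit is followed by "." and so does not match.
_MM = re.compile(r"\)~(\d+)\)")


def min_should_match(parsed_query):
    """Return the mm value found in a parsed query string, or None."""
    match = _MM.search(parsed_query)
    return int(match.group(1)) if match else None


dismax_not = "+(((custom)~0.01 -(driven)~0.01 (academ)~0.01 (librari)~0.01)~3) ()"
edismax_not = "+((custom)~0.01 -(driven)~0.01 (academ)~0.01 (librari)~0.01)"

print(min_should_match(dismax_not))
print(min_should_match(edismax_not))
```

A result of None for a query that should have an mm applied is the symptom of SOLR-2649 described above.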

Great - now we have learned there is a known Solr bug, and we understand its scope.  What shall we do?  The first, best option would be for someone else to fix the problem.  Given that the bug report is from July 2011 and heavily commented, that seems unlikely to happen soon.  The next best option is to fix the problem myself.  However, having already looked at the source code for edismax, I pale at the very thought.  

So I took a different tack:  I examined whether we had a significant number of user queries impacted by this bug.

Will Our Users Encounter this Bug?

Prior to the CJK improvements, SearchWorks supported boolean operators only in its Advanced search.  Our usage data from July 2013 through November 2013 shows that "advanced" is the initial search behavior somewhere between 6% and 18% of the time.   For example, here are stats for initial search behavior in July 2013:


And the analogous data for November 2013:



So let's be generous, and say that 20% of the SearchWorks queries use the boolean-enabled advanced searches.  Of those, how many contain boolean operators?

We looked at 10,000 advanced search queries culled from Google Analytics from July 2013.  (Note that this could have been done via the Solr logs, but we have a load-balanced Solr setup, so it would have involved getting logs from 3 machines and their backups ... so we just went with Google Analytics.)  Using grep, we determined the following occurrences of the affected operators in an advanced search where SOLR-2649 might apply:

Thus, the boolean operators affected by SOLR-2649 occur extremely rarely in SearchWorks advanced search queries, which are themselves at most 20% of our queries.

Armed with this data, we chose to let this remain broken in SearchWorks for the time being, so the failing relevancy tests in categories two and three above have been resolved as "do not fix" until the Solr bug is addressed.

Stay Tuned ...

In my next post, I will tackle the first category of failing relevancy tests, which required a workaround to avoid degrading relevancy for our users when we switched to edismax.
