Tuesday, January 7, 2014

CJK with Solr for Libraries, part 5 (Edismax Woes, part 2)

This is the fifth of a series of posts about our experiences addressing Chinese, Japanese and Korean (CJK) resource discovery in SearchWorks, the Stanford University Libraries "catalog" built with Blacklight on top of our Solr index, and the second in the sub-series on problems we had switching to edismax from dismax.

You might be interested in this post in particular if you use Solr's edismax query parser or if you want to get more exact matches for user queries.

Edismax Woes, Part 2

In the third part of this series, I mentioned that we had 21 relevancy acceptance tests fail when we used the edismax Solr query parser instead of dismax.  Recall that our test failures were in the following categories:
  1. Journal titles
  2. Hyphens preceded by a space (but no following space)
  3. Boolean NOT
  4. Synonyms for musical keys
This blog post will discuss how we addressed relevancy failures in the first category;  I discussed failures in categories 2 and 3 in the fourth part of this series.

Digging Into Relevancy Differences

Here is example output from some failing tests in category 1:

  rspec ./spec/journal_title_spec.rb:22 # journal titles The Nation as everything search
  rspec ./spec/journal_title_spec.rb:32 # journal titles The Nation (National stems to Nation) with format journal

I used manual searches to confirm that users would perceive the edismax results as worse than dismax.  I also tried a number of similar searches (and wrote tests) to better pinpoint this problem.

Here are the first five results for a title search on The press using dismax and edismax with the same index:



Looking at these results, I immediately have two questions:
  1. Why are the first two results of edismax not exact title matches?
  2. Why are the scores of the three documents that appear in both sets of results different?

Edismax Formula Difference

To answer these questions, I first looked at the Solr output with debugQuery=true.  The parsed queries were exactly the same EXCEPT for the difference noted in SOLR-2058, in the comment from Michael Dodsworth on 25 Sep/12.

I'll try to express it more succinctly (thanks to Tom Burton-West):

Let
  A = field1:"term1 term2"
  B = field2:"term1 term2"
  C = field3:"term1 term2"

Dismax:
    DisjunctionMaxQuery (A|B|C)
  returns the score of whatever field has the highest score.

Edismax:
    DisjunctionMaxQuery (A)
    DisjunctionMaxQuery (B)
    DisjunctionMaxQuery (C)
  returns the sum of the scores of whichever of the above queries match.  So if your phrase is in all 3 fields, you get the sum of the scores for each matched field.
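
To make the difference concrete, here is a minimal Python sketch of the two behaviors described above.  It is a toy model, not Solr or Lucene code:  the function names and the per-field scores are invented for illustration, and it ignores the tie parameter and everything else that goes into a real score.

```python
# Toy model of the phrase-query scoring difference described above.
# Function names and scores are hypothetical; this is not Solr's implementation.

def dismax_phrase_score(field_scores):
    """One DisjunctionMaxQuery over (A|B|C): the best matching field wins."""
    matched = [s for s in field_scores if s > 0]
    return max(matched) if matched else 0.0


def edismax_phrase_score(field_scores):
    """Three single-field DisjunctionMaxQuery clauses: matching scores are summed."""
    return sum(s for s in field_scores if s > 0)


# Invented scores for A, B and C when the phrase matches in all three fields.
scores = [4.0, 2.0, 1.0]
print("dismax :", dismax_phrase_score(scores))
print("edismax:", edismax_phrase_score(scores))
```

In this model a document that matches the phrase in several fields is rewarded under edismax for every one of them, while under dismax only its single best field counts.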

Unfortunately, SOLR-2058 is marked as fixed, despite this difference.  So we have an initial diagnosis, but not a treatment plan.

Visualizations To The Rescue!

I was stumped as to what to do about the above until I used the visualizations of Solr scoring data made available at explain.solr.pl.  The full visualizations I created for scoring the top 5 results of a search for 'the press' are available here:

edismax:  http://explain.solr.pl/explains/m63o1yhg
dismax:  http://explain.solr.pl/explains/a7bkurhb

Here is the visualization of the score of the first dismax result, id 9162486 with title The press:
The score is overwhelmingly dominated by the phrase match in the unstemmed short title field, title_245a_unstem_search.  In fact, this is true for all the top 5 dismax results:  if the pie charts below weren't different colors and didn't have text labels for the tiny pie pieces, I think you'd be hard pressed to tell them apart:







However, with edismax, we have two basic patterns within the first 5 results, and the phrase match in the unstemmed short title field is not nearly as dominant:






Here is a closer look at the visualization of the score of the first edismax result, id 8192320 with title MEET THE PRESS:

What I learned from these visualizations is that to make the edismax results more like dismax, I needed a way to give more weight to exact matches of the entire string in the title_245a_unstem_search field.

Another visualization example, of the top 5 results of a search for 'the nation':

edismax:  http://explain.solr.pl/explains/6bracmzw
dismax:  http://explain.solr.pl/explains/4med7pae

Tie Parameter

"DisMax" is an abbreviation of "disjunction maximum", which is a partial description of the way user queries are turned into low level Lucene queries.  From http://wiki.apache.org/solr/DisMax, dismax is "designed to process simple user entered phrases (without heavy syntax) and search for the individual words across several fields using different weighting (boosts) based on the significance of each field."  From the same document:


So basically, dismax pays attention only to the highest scoring query match in any of the document's fields.  The tie parameter, documented here, can be used to dial up the influence of other matches:


A tie value of 0.01 was used for both dismax and edismax searches.  That value is very close to zero, so the highest scoring matching clause should already be dominating the total score.  I tried a tie value of 0.99 to see if I could make a difference in the relevancy ranking this way, and while my document scores changed, the result order remained the same.   Below are results for edismax search for 'the press' with different tie values:

tie 0.01:  http://explain.solr.pl/explains/hwi1ma6x
tie 0.99:  http://explain.solr.pl/explains/jfmdli56


I coerced a colleague into trying a similar experiment, and was able to confirm the tie parameter affected his results as expected.  Further experimentation indicated that setting the tie value to something like 0.00001 for my search DID have the effect of making the highest scoring matching clause highly dominant;  this led me to the conclusion that my field boosting values were ridiculous (as high as 200,000!) and affected my ability to use a reasonable value for the tie parameter.
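
For readers who want to see how tie enters the arithmetic, here is a small Python sketch of the usual disjunction-max combination:  the best clause score plus tie times the sum of the other matching clause scores.  The function names and the clause scores are made up for illustration;  they are not taken from our index or our configuration.

```python
# Sketch of how tie blends clause scores inside one disjunction-max query.
# Names and numbers are hypothetical, chosen only to show the shape of the formula.

def disjunction_max_score(clause_scores, tie):
    """Best clause score plus tie times the sum of the remaining clause scores."""
    best = max(clause_scores)
    others = sum(clause_scores) - best
    return best + tie * others


def top_clause_share(clause_scores, tie):
    """Fraction of the combined score contributed by the best clause."""
    return max(clause_scores) / disjunction_max_score(clause_scores, tie)


# Invented scores for one query term matching in four fields.
scores = [12.0, 9.0, 7.5, 3.0]
for tie in (0.0, 0.01, 0.99):
    total = disjunction_max_score(scores, tie)
    share = top_clause_share(scores, tie)
    print(f"tie={tie}: score={total:.3f}, top clause share={share:.3f}")
```

A tie of 0 is a pure maximum and a tie of 1 is a plain sum;  how small tie must be before the top clause dominates depends on how large the other clause scores are, which is where outsized boosts come into play.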

Re-adjusting Boost Values

Our boost value settings were a big nasty kludge that I partly inherited and partly created. Simplifying them was definitely in order ... but perhaps not at the same time as a change to edismax, nor during a Solr upgrade from 3.5 to Solr 4.  Sure, having about 600 relevancy tests gave me a decent amount of confidence I could revise my boost values without degrading the user experience of relevancy, but upgrading to edismax was causing me plenty of difficulty without revising boost values.

Eventually, I did greatly simplify our boost values and our relevancy tests were indispensable to that process.   Of course I de-coupled this from fixing the problems we had with edismax, and again from the Solr upgrade from 3.5 to Solr 4.  You can see our simplified boost values in our solrconfig.xml file at https://github.com/solrmarc/stanford-solr-marc/blob/master/stanford-sw/solr/conf/solrconfig-slave.xml.   The highest boost value is now 5,000, a great reduction from 200,000.

However, in terms of addressing the relevancy test failures with edismax, adjusting the boost values and the tie parameter wasn't really part of the route taken.

Back to Edismax Woes

The visualizations of Solr scores helped us see that our edismax results needed to give more weight to exact matches of the entire string in the short title (title_245a_unstem_search) field.  The tie parameter wasn't really working for us.  What now?

Bill Dueber of the University of Michigan wrote a wonderful blog post on using "fully-anchored" text fields to get an "exactish" match.   This approach seemed well worth a try.  I will discuss our fully-anchored solution in the next post in this series.

1 comment:

  1. I confess I am curious if the tie parameter could have helped us, but Bill's blog post made me think that having an "exactish" match field would improve the user experience overall.
