Thursday, November 7, 2013

CJK with Solr for Libraries, part 3


This is the third of a series of posts about our experiences improving CJK resource discovery for the Stanford University Libraries.

We recently rolled out some significant improvements for Chinese, Japanese and Korean (CJK) resource discovery in SearchWorks, the Stanford library "catalog" built with Blacklight on top of our Solr index. If your collection has a significant number of CJK resources and they are in multiple languages, you might be interested in our recipes. You might also be interested if you have a significant number of resources in other languages with some of the same characteristics.

Relevancy Testing


In the second part of this series, I explained why SearchWorks needed to change from the Solr dismax query parser to the edismax query parser.  I would not undertake, nor recommend, a fundamental change to your Solr query processing without a good testing methodology, nor would I change an index to accommodate CJK without a way to ensure it didn't break existing functionality.  Put simply: it's a bad idea to change anything about query processing without a way to verify it doesn't degrade existing relevancy, which means you need automated relevancy acceptance testing.

Luckily for us, we have been doing automated relevancy testing for a while: first with cucumber tests inside our SearchWorks Blacklight Rails application, and now much more efficiently with rspec-solr (http://rubydoc.info/github/sul-dlss/rspec-solr), which interacts with Solr directly.  It lets our SearchWorks relevancy testing application, sw_index_tests (available at https://github.com/sul-dlss/sw_index_tests), parse Solr responses and use rspec syntax to check whatever we want about the documents returned, without going through the entire Rails stack.  I blogged about this a while back.

When we were about to switch to edismax to facilitate CJK discovery, we had around 580 relevancy tests in sw_index_tests, covering "everything," author, title, subject, and series search results, diacritics and punctuation in search terms, and journal titles, among other things.  These test searches (and their expected results) were amassed over a period of 3-4 years:  tests were written every time a tweak was made to address a problem, or when the indexing code changed for some other reason (e.g. adding call number searching).  The tests have never been comprehensive, but they are a lot better than nothing.  We run them against our live production index nightly via our Jenkins continuous integration server.  Sometimes we have to tweak the tests when records are added, removed, or changed in the production index, but that's easy.  The peace of mind that comes from knowing we have a way to do relevancy acceptance testing is well worth the trouble.

And in case you're wondering, we pass an additional http argument (testing=sw_index_tests) to Solr so we can easily segregate these test queries from actual user queries in the Solr logs:
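For example (Solr base URL elided; the query itself is whatever the individual test sends):
 http://(solr baseurl)/solr/select?q=...&testing=sw_index_tests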

A few of the categories covered by example tests in sw_index_tests: title, author, journal title, and diacritics searches.
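
To give a flavor, here is a rough sketch of what a title test looks like in the rspec-solr style; the helper methods, matcher chain, and document id below are illustrative stand-ins rather than literal copies from the suite:

  # Illustrative sketch only: solr_resp_doc_ids_only, title_search_args, and the
  # id '1234567' are placeholders for what sw_index_tests actually uses.
  describe "title searches" do
    it "'The Nation' as a title search includes the journal in the first few results" do
      resp = solr_resp_doc_ids_only(title_search_args('The Nation'))
      expect(resp).to include('1234567').in_first(3).documents
    end
  end
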
I will talk more about our CJK relevancy tests later; my main point here is that we have automated tests to help us determine whether anything breaks when we change our Solr index or configurations, and you can do it too!  Heck, you can even use ours and change the acceptance conditions to your own document ids and expected result counts.

I would love to hear from anyone else who does automated relevancy testing, as it seems to be a rare thing.

Edismax != Dismax


Technically, to switch from the dismax query parser to the edismax query parser, you need only add the "e" to your Solr request handler defType declaration:
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- ... other default params (qf, pf, mm, etc.) unchanged ... -->
  </lst>
</requestHandler>
When we tried this, we had 21 failures out of approximately 580 tests.  The failures were in four categories:

1.  Journal title failures

Examples of failing tests:
  rspec ./spec/journal_title_spec.rb:22 # journal titles The Nation as everything search
  rspec ./spec/journal_title_spec.rb:32 # journal titles The Nation (National stems to Nation) with format journal

2.  Queries having hyphens with a preceding space (but no following space)

Example of a failing test:
  ./spec/punctuation/hyphen_spec.rb:140 # 'under the sea -wind' hyphen in queries with space before but not after are treated as NOT in everything searches 

3.  Boolean NOT operator

Failing tests:
  ./spec/boolean_spec.rb:100 # boolean NOT operator  space exploration NOT nasa has an appropriate number of results
  ./spec/boolean_spec.rb:88 # boolean NOT operator  space exploration NOT nasa  should have fewer results than query without NOT clause

4.  Synonyms for musical keys.

We use Solr synonyms to equate the following for all musical keys (the full list is here); a sketch of how such synonyms are typically wired into an analysis chain appears after the failing tests below:
  f#, f♯, f-sharp => f sharp
  ab, a♭, a-flat => a flat
Failing tests:
  ./spec/synonym_spec.rb:204 # musical keys sharp keys f# major
  ./spec/synonym_spec.rb:316 # musical flat keys author-title search (which is a phrase search) b♭
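
For context, this kind of synonym mapping gets wired into a field type's analysis chain via Solr's synonym filter; the snippet below is just the general shape (our actual field type, filter settings, and file name differ):

<fieldType name="text_example" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>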

Clearly, we needed to address these problems before we could use edismax in production SearchWorks; fixing them was a prerequisite for improving CJK discovery.

How to Analyze Relevancy Problems


Thankfully, there are some excellent tools for debugging relevancy problems.

1.  Solr query debugging parameters 

If you add debugQuery=true to your Solr request, then you will get debugging information in your Solr response.  If you are at Solr release 4.0 or higher, you could use debug=query instead.  Here is an example:
My request:
 http://(solr baseurl)/solr/select?q={! qf=$qf_author}zaring&debug=query
The debug query part of the response (simplified a bit):
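In rough form (not verbatim, and the real qf_author list contains more author fields than shown), the parsed query comes back looking something like this:

<str name="parsedquery">
  +DisjunctionMaxQuery((author_1xx_unstem_search:zaring^20.0 | author_1xx_search:zare^5.0 | ...)~0.01)
</str>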

We can see exactly which Solr fields and terms are being searched, and their boost factors.  This example shows different terms being searched in the stemmed and unstemmed versions of the fields.  (Note: the decision to stem author fields was deliberate, to allow users to find, say, "Michaels, Amanda" when they query "Amanda Michael", or when they use "Crook" and the name is actually "Crooke".)

See http://wiki.apache.org/solr/CommonQueryParameters#Debugging for more information.

2.  Analysis GUI

Another Solr-supplied tool is the Analysis form in the admin GUI.  It lets you see how each step in an analysis chain (tokenizer and filters) transforms your data, for a given field, field type, or dynamic field rule in the Solr schema.

For the Solr field author_1xx_search, the form shows at which point in the analysis chain "zaring" becomes "zare".  I entered "zare" as a query value, and the faint purple highlighting of the bottom two lines on the left (field value) side shows that zare and zaring will match for field author_1xx_search.
Note that this is not an exact representation of query matching; for example, the query parser breaks the query up on whitespace before field analysis is performed in (e)dismax processing.

3.  Visualization of Individual Result Debug Information

The Solr debug output can also show how the relevancy score of each result was computed, either with debugQuery=true, or with debug=results for Solr 4.0 or higher.

Given the same Solr query string as above, with debug=results:
 http://(solr baseurl)/solr/select?q={! qf=$qf_author}zaring&debug=results
The explain part of the response (simplified a bit):
<lst name="debug">
... 
<lst name="explain">
  <str name="3928423">
9.0780735 = (MATCH) sum of:
  9.0780735 = (MATCH) max plus 0.01 times others of:
    9.056876 = (MATCH) weight(author_1xx_unstem_search:zaring^20.0 in 1075085) [DefaultSimilarity], result of:
      9.056876 = score(doc=1075085,freq=1.0 = termFreq=1.0
), product of:
        0.999978 = queryWeight, product of:
          20.0 = boost
          14.491322 = idf(docFreq=9, maxDocs=7231138)
          0.0034502652 = queryNorm
        9.0570755 = fieldWeight in 1075085, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          14.491322 = idf(docFreq=9, maxDocs=7231138)
          0.625 = fieldNorm(doc=1075085)
    2.1197283 = (MATCH) weight(author_1xx_search:zare^5.0 in 1075085) [DefaultSimilarity], result of:
      2.1197283 = score(doc=1075085,freq=1.0 = termFreq=1.0
), product of:
        0.24188633 = queryWeight, product of:
          5.0 = boost
          14.021318 = idf(docFreq=15, maxDocs=7231138)
          0.0034502652 = queryNorm
        8.763324 = fieldWeight in 1075085, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          14.021318 = idf(docFreq=15, maxDocs=7231138)
          0.625 = fieldNorm(doc=1075085)
</str>
</lst>
</lst>
This can be useful in determining why a particular document is (or isn't) included in the results, but it is difficult to eyeball the above and understand what is going on, even after you format it.

Thankfully, there is a web site in Poland, http://solr.pl/en/, that has a web service, http://explain.solr.pl/, to take your Solr explain info and visualize it as a pie chart.  This presents our result like this:


Suddenly, it is obvious why this document matches.

This tool is even more useful for more complex (e)dismax formulae with a lot of fields to match, multi-term queries, and documents matching different terms in different fields.  Check out some of our actual results while debugging the edismax difficulties here:

edismax:  http://explain.solr.pl/explains/m63o1yhg
dismax:  http://explain.solr.pl/explains/a7bkurhb


Stay Tuned ...


Now that I've explained our testing methodology and some of our debugging techniques, I'm ready to tell you how we overcame the relevancy issues we bumped into when switching to edismax.  That will be the topic of my next post(s).
