Wednesday, January 8, 2014

CJK with Solr for Libraries, part 6 (Edismax woes, part 3)

This is the sixth in a series of posts about our experiences addressing Chinese, Japanese and Korean (CJK) resource discovery in SearchWorks, the Stanford University Libraries "catalog" built with Blacklight on top of our Solr index, and the third in the sub-series on problems we had switching from dismax to edismax.

You might be interested in this post in particular if you use Solr's edismax query parser or if you want to get more exact matches for user queries.

Edismax Woes, Part 3

In part five of this series, I showed that our edismax results needed to give more weight to exact matches in the short title field in order to fix the failing relevancy acceptance tests for journal titles (and in order to behave more like dismax).

I also mentioned that Bill Dueber of the University of Michigan wrote a wonderful blog post on using "fully-anchored" text fields to get an "exactish" match.  This blog post will cover our solution to the journal title results using a fully-anchored text field.

What Do We Mean When We Say Exact?

Recall the results for an edismax title search on 'the press':

[Screenshot from part 5: the top edismax results contain the phrase "the press" but are not the journal title matches we want.]

When we say "exact" match, our spec might be this:
  1. Case insensitive.
  2. The field contains only those tokens we intend to match - it matches the entire contents of the field.
  3. Punctuation insensitive (e.g. a search for "the nation" should match "The nation.").
  4. Leading, trailing and consecutive whitespace insensitive.
  5. Unicode folding.
  6. Possibly include synonyms.
It's not a phrase match, because that doesn't cover point 2 -- we don't want the first two results returned by edismax, even though they contain the phrase "the press".

A "good enough" solution, especially for short metadata fields, is to anchor the query string to the beginning and end of the field value, and to boost this "anchored" field type very high so will dominate the score when there's a match.

Anchored Text Fieldtype

Bill's blog post does a fine job of explaining how to do this.  Essentially, you add a prefix, like 'aaaa', to the beginning of the text, and a suffix, like 'zzzz', to the end of the text.

In our case, we have an unstemmed text fieldtype like this:
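(The fieldtype name and filter options in this sketch are illustrative rather than our exact production settings; the schema.xml link near the end of this post has the real definitions.)

    <!-- sketch: unstemmed text fieldtype with whitespace tokenizing, Unicode
         folding, synonyms and word splitting, but no stemming -->
    <fieldType name="text_unstem" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                catenateWords="1" splitOnCaseChange="1"/>
      </analyzer>
    </fieldType>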


And we want to use a PatternReplaceCharFilterFactory to add the prefix and suffix to this field.  Bill's example is nice and simple:
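(Paraphrasing from memory; see Bill's post for his exact pattern and sentinel strings.)

    <!-- wrap the whole field value in sentinel strings; note the spaces
         around $1, which make the prefix and suffix separate tokens -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^(.*)$"
                replacement="aaaa $1 zzzz"/>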




Note that Bill's filter adds two tokens, which will affect mm counts.

I'm going to save you a lot of time and tell you right away: be careful with your regular expression, especially concerning punctuation and other symbol characters.  Because the Marc metadata standard was originally designed to accommodate printing catalog cards, the short title field in Marc data (245a) often ends in punctuation and/or whitespace.

Also, I didn't want to affect mm counts, because when we do bigramming for CJK characters, mm counts are going to be plenty tricky.  So here is what my version of the PatternReplaceCharFilterFactory for adding the anchoring prefix and suffix ended up looking like:
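(This is reconstructed from the walkthrough below; the schema.xml linked later in this post is the authoritative version.  Note the XML escaping: &amp;&amp; is the Java regex character-class intersection operator &&, and &lt; and &gt; are the escaped < and > characters.)

    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^\s*(.*[\S&amp;&amp;[^\.\,:;/=&lt;&gt;\(\)\[\]\&amp;\|]])[\s\.\,:;/=&lt;&gt;\(\)\[\]\&amp;\|]*$"
                replacement="aaaaaa$1zzzzzz"/>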


It's ugly, but it works.  It doesn't add tokens the way Bill's does, because it introduces no whitespace around the added prefix and suffix.

Let's walk through the pattern:
   ^\s*
says ignore whitespace at the beginning of the text.
   [^\.\,:;/=<>\(\)\[\]\&\|] 
is a character class excluding these characters: .,:;/=<>()[]&|    Let's call this character class z.

z is inside another character class:
   [\S&&z]
which uses Java regex character-class intersection (&&): it matches any non-whitespace character that is not one of the characters excluded by z.  (In schema.xml the ampersands have to be XML-escaped, which is why the pattern there reads &amp;&amp;.)  Let's call this modified non-whitespace character class \S'.

So the first part
   ^\s*(.*[\S&&[^\.\,:;/=<>\(\)\[\]\&\|]])
simplifies to
   ^\s*(.*\S')
which says: ignore leading whitespace, then capture everything after it, as long as the captured text ends with one of the characters in \S' (the modified non-whitespace character class).

The part after the capturing group contains another character class:
   [\s\.\,:;/=<>\(\)\[\]\&\|]
which includes:  whitespace as well as .,:;/=<>()[]&|.  Let's call this character class Y.  So after the capturing group, we have:
   Y*$
which is saying to ignore any trailing characters in the Y character class.  So the whole expression simplifies to:
   ^\s*(.*\S')Y*$

So it's not really so far off from Bill's pattern.  It ignores initial whitespace, it ignores trailing whitespace and punctuation, and it captures everything in between, provided it ends with a non-whitespace character other than .,:;/=<>()[]&|

So yay!  We have a charFilter to place at the beginning of the fieldtype analysis chain and we're ready to create a new field for exactish matches to address the edismax relevancy discrepancy when compared to dismax!  Our anchored text fieldtype looks like this:
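(A sketch; as with the unstemmed fieldtype above, names and filter options are illustrative, and the linked schema.xml has the real definition.)

    <fieldType name="text_anchored" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- wrap the field contents in aaaaaa ... zzzzzz, dropping leading
             whitespace and trailing whitespace/punctuation -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="^\s*(.*[\S&amp;&amp;[^\.\,:;/=&lt;&gt;\(\)\[\]\&amp;\|]])[\s\.\,:;/=&lt;&gt;\(\)\[\]\&amp;\|]*$"
                    replacement="aaaaaa$1zzzzzz"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                catenateWords="1" splitOnCaseChange="1"/>
      </analyzer>
    </fieldType>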




And we're ready to go, right?  We re-index our data, adding a Solr field:
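(The field name here is made up for illustration; use whatever naming fits your schema.)

    <field name="title_245a_exact_search" type="text_anchored"
           indexed="true" stored="false"/>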


in which we put our short title (245a for Marc data wonks).  Then we add this Solr field to our edismax formula with a higher boost value than the other fields used by edismax in the failing journal title relevancy tests.  We can tweak the boost value as needed, since we have a safety net of 600 relevancy tests to let us know if we're breaking existing functionality.
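In solrconfig.xml terms, the idea is roughly this (the field names and boost values below are illustrative only, not our production numbers; the point is simply that the anchored exactish field gets a much larger boost than the ordinary title fields):

    <str name="qf">
      title_245a_exact_search^1000  title_245a_search^100
      title_unstem_search^50  title_search^20
    </str>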

And all the relevancy tests now pass, right?

Wrong.  Do you see the problem?  I didn't.

The journal title search results passed the tests (after we addressed the problem of trailing punctuation, ahem).  But then we were left with a bunch of failing synonym tests, in addition to the synonym tests that were already among the 21 tests failing with edismax.

Synonyms and Anchoring

A synonym can be thought of as a substitution.  If you have
   c++  =>  cplusplus
in your synonym file, then analysis from the text_anchored fieldtype above will change
   Great C++ programming
to
   aaaaaagreat cplusplus programmingzzzzzz
This is perfect.  But if you start with
   C++ programming
then you get
   aaaaaac programmingzzzzzz

Why?   This is where the Analysis form in the Solr admin GUI came in handy.  It tells us the analysis for the text_anchored fieldtype above proceeds like this:
   C++ programming
   aaaaaaC++ programmingzzzzzz   <== pattern replace char filter
   aaaaaaC++ programmingzzzzzz   <== whitespace tokenizer (makes 2 tokens)
   aaaaaac++ programmingzzzzzz   <== ICU folding filter
   aaaaaac++ programmingzzzzzz   <== synonym filter  (no-op) 
   aaaaaac programmingzzzzzz     <== word delimiter filter

The synonym filter doesn't find "c++" as a token; it finds "aaaaaac++", which it doesn't know about.

Because our list of synonyms is short, and because I wanted to keep synonyms in our text_anchored fieldtype, my solution was to create additional synonym entries for the left-anchored, right-anchored and both-anchored versions of each synonym to be mapped.  This covers the cases where the synonym is the first, last or only word in the token stream for the text_anchored fieldtype.  For c++, this becomes these synonyms:

   c++  =>  cplusplus
   aaaaaac++  =>  cplusplus
   c++zzzzzz  =>  cplusplus
   aaaaaac++zzzzzz  =>  cplusplus

Here is what our text_anchored fieldtype actually looks like:


(you can also look here to see it with syntax highlighting:  https://github.com/solrmarc/stanford-solr-marc/blob/master/stanford-sw/solr/conf/schema.xml#L371-388).

Edismax Woes Begone!

Recall that our goal is to improve our edismax journal title results by giving great weight to exact-ish matches of the query string in a Solr document's short title field.  We have a field type for this, per above, that we use for an "exactish" short title field (245a for Marc data wonks).  That field goes into our edismax formula with a higher boost than the other fields involved in the failing journal title relevancy tests, and the boost can be tweaked as needed against our safety net of 600 relevancy tests, which lets us know if we're breaking existing functionality.

This is precisely how we fixed the failing journal title tests when using edismax.

Recall that our relevancy test failures with edismax were in the following categories:
  1. Journal titles
  2. Hyphens preceded by a space (but no following space)
  3. Boolean NOT
  4. Synonyms for musical keys
We addressed the journal titles with the fully-anchored exactish matching short title field.  We determined that the hyphens and boolean NOT failures were due to a Solr bug, and that we can ignore these failures for now.  And I'll go ahead and tell you that the failing synonym test was for a musical key (F#) that also maps to a computer language, and we decided to live with that failure for now.  (Maybe I'll talk more about our synonyms in a future post.)

Relevancy/Acceptance Tests

I want to make a point of mentioning that while we were trying to figure out why searches on "the nation" were failing our tests, we had human testers looking for other errors and trying to find the pattern to the problems we were already aware of.  So we got a lot more tests for similar journal titles, such as "the sentinel," "the chronicle", "the times" (which we already had), etc.  Eventually I realized the culprit for our journal title results was the punctuation at the end of the Marc 245a fields.  But in the meantime, we beefed up our test suite in this area.

Our relevancy tests were also instrumental in determining an appropriate boost value for our new short title exactish match field, both in title searches and in "everything" searches.   I was able to find out what was too low a boost and what was too high a boost by running my test suite against different values.

And of course, our relevancy tests were crucial in my effort to simplify our boost factors.

Please feel free to take ours (https://github.com/sul-dlss/sw_index_tests) and modify them for your own needs.   Ignore the ones that don't apply to you; add tests for your local needs. Improve on my methodology.  Feel free to get in touch.

Next:  Actually Working on CJK

Now that we've gotten past the difficulties with changing from dismax to edismax query parsing, we can utilize Robert Muir's fix for SOLR-3589, the bug that was setting mm to 0 for fields using CJKBigramming (this is discussed in the second post in this series).  We also upgraded to Solr 4.3 at this point (from 3.5), partly to get Robert's bug fix without having to apply a patch to Solr manually.

So the next post in this series will return to discussing work specific to CJK.

3 comments:

  1. This series is so awesome.

    Just a quick warning about your modified char replacement: not putting in the spaces is only effective at keeping the number of tokens the same if your tokenizer splits exclusively on whitespace. In particular, in the ICU tokenizer "Words are broken across script boundaries" so this technique will still add tokens for non-latin input when using a tokenizer smarter than whitespace-only.

  2. Note that you'll still get multiple tokens using this technique if your tokenizer is smarter than the whitespace tokenizer (e.g., if you're using the ICU tokenizer). I'm looking for a reasonable workaround.

  3. B makes an excellent point. If we could get our desired behaviors with the ICUTokenizer alone, we would use it exclusively; it's the special characters such as superscripts and musical sharp and flat signs that mean we'll probably stick with the WhitespaceTokenizer for a while longer.
