Wednesday, January 22, 2014

CJK with Solr for Libraries, part 12

This is the twelfth of a series of posts about our experiences addressing Chinese, Japanese and Korean (CJK) resource discovery in SearchWorks, the Stanford University Libraries "catalog" built with Blacklight on top of our Solr index.

In this post, I'll discuss an mm tweak I snuck past you in part ten of this series, our fix for CJK advanced searches, and our accommodation for variant quotation marks that may appear with CJK characters.

CJK Tweak 1.1:  mm adjustment

There is an additional mm tweak that I sort of snuck past you in part ten of this series.  It comes into play when a query string contains both CJK and non-CJK characters.  I didn't find many queries with this script combination in our logs, but when I asked our East Asia librarians whether such mixed-script queries occur, they said they do.  Unfortunately, I didn't get any examples, nor did I find any in a quick perusal of the logs.  In any case, the fix is easy enough, and you've already seen it.

You may recall from part eight that we use an mm value of 3<86% when we encounter CJK characters in the query string.  And in part ten, I showed you the following method:

The code in the purple box is as follows.  When adjusting the mm value for CJK queries, we determine how many non-CJK tokens are present in the query string.  If there are any non-CJK tokens, we increase the lower limit of the CJK mm value by that number.  So if there are 2 non-CJK tokens, the mm becomes 5<86%.  This tweak is in both our test code and our SearchWorks application code, as you would expect.
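To make the logic concrete, here's a minimal Ruby sketch of that adjustment; the method and constant names are my shorthand, not necessarily the exact SearchWorks code:

```ruby
# A minimal sketch of the mm adjustment for mixed-script queries;
# names here are assumptions, not the exact SearchWorks code.
CJK_MM = '3<86%'  # the CJK mm value from part eight

# Raise the lower limit of the CJK mm value by one for each
# whitespace-separated token containing no CJK characters.
def cjk_mm_val(query_str)
  num_non_cjk_tokens = query_str.split.count do |t|
    t !~ /\p{Han}|\p{Hiragana}|\p{Katakana}|\p{Hangul}/
  end
  return CJK_MM if num_non_cjk_tokens == 0
  "#{CJK_MM[0].to_i + num_non_cjk_tokens}#{CJK_MM[1..-1]}"
end

cjk_mm_val('歷史')                # => "3<86%"  (no non-CJK tokens)
cjk_mm_val('歷史 historiography') # => "4<86%"
cjk_mm_val('漢代 han dynasty')    # => "5<86%"  (2 non-CJK tokens)
```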

CJK Advanced Search

In part ten of this series, I showed what we needed to do in our SearchWorks application code to use some different Solr parameters when we had CJK characters in a user query string.  Unsurprisingly, we needed to do the analogous thing for the SearchWorks advanced search form:
Tests for this largely focused on the types of searches unique to advanced search: publisher, description, TOC, and combining multiple fields.  As before, we need to use the right qf and pf variables and set the mm and qs parameters properly for CJK. There are test searches with these characteristics in the sw_index_tests project under spec/cjk/cjk_advanced_search_spec.rb.
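To make that concrete, here is roughly the shape of the Solr params for a CJK title search from the advanced search form.  This is a hedged sketch: the LocalParams variable names follow the pattern from earlier posts in this series, and the exact qs value shown is an assumption:

```ruby
# Roughly the Solr params for a CJK title search from the advanced
# search form; variable names and the qs value are assumptions.
cjk_title_solr_params = {
  'q'  => '_query_:"{!edismax qf=$qf_title_cjk pf=$pf_title_cjk}三國志"',
  'mm' => '3<86%', # the CJK mm value from part eight
  'qs' => 0        # tighter query phrase slop for CJK
}
```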

But in some ways, the real concern is whether we have the advanced search form hooked up properly for CJK queries in the application code.  So we have cucumber features to do CJK advanced search integration tests in the SearchWorks Rails application.
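They look something like this sketch; the field ids, steps, and expectations are assumptions, not the actual SearchWorks features:

```gherkin
# A hedged sketch of a CJK advanced search cucumber feature; field ids,
# steps, and expectations are assumptions.
Feature: CJK Advanced Search
  Scenario: Chinese characters in the title box use the CJK Solr params
    Given I am on the advanced search page
    When I fill in "search_title" with "三國志"
    And I press "advanced-search-submit"
    Then I should get results
```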


We also have tests to make sure we can combine fields properly with AND or OR:
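Along the lines of this sketch (again, names and steps are assumptions; this scenario would live in the same feature file as the one above):

```gherkin
# A hedged sketch of a boolean-combination scenario.
  Scenario: Chinese title ANDed with Chinese author
    Given I am on the advanced search page
    When I fill in "search_title" with "紅樓夢"
    And I fill in "search_author" with "曹雪芹"
    And I select "all of" from "op"
    And I press "advanced-search-submit"
    Then I should get results
```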
The code for advanced searching in our Rails application is currently a one-off, so the details may not be all that applicable elsewhere.  Here is some of our spec code for ensuring that the controller intercepts the user query strings for each box in the search form and adjusts qf, pf, mm and qs accordingly:
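They look something like this sketch; the process_query method exercised here matches the code shown a bit further down, but the spec scaffolding and expectations are assumptions:

```ruby
# A hedged sketch of the controller specs; scaffolding and expectations
# are assumptions.  The returned _cjk field key is what ultimately
# selects the CJK qf and pf via the config object.
describe CatalogController do
  describe "process_query for advanced search boxes" do
    it "switches to the CJK field flavor and sets mm and qs for CJK text" do
      solr_params = {}
      field = controller.send(:process_query, solr_params, '三國志', 'search_title')
      expect(field).to eq 'search_title_cjk'
      expect(solr_params['mm']).to eq '3<86%'
      expect(solr_params['qs']).to eq 0
    end

    it "leaves the field, mm and qs alone for non-CJK text" do
      solr_params = {}
      field = controller.send(:process_query, solr_params, 'horses', 'search_title')
      expect(field).to eq 'search_title'
      expect(solr_params).to be_empty
    end
  end
end
```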

And we have specs ensuring we handle CJK in some search boxes and non-CJK in others appropriately; since each box's query string is processed independently, these follow the same pattern as the specs sketched above.  In case it's helpful to you, here is the code that does the CJK character detection and alters the Solr params for CJK in advanced search:
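A hedged sketch of that logic; the regex and method names are my rendering of the description, not the verbatim SearchWorks code:

```ruby
# A hedged sketch of the CJK detection and Solr param adjustment for
# advanced search; names are my rendering, not the verbatim code.
CJK_CHAR_REGEX = /\p{Han}|\p{Hiragana}|\p{Katakana}|\p{Hangul}/

# number of CJK characters (unigrams) in the user-entered string
def cjk_unigrams_size(str)
  str.scan(CJK_CHAR_REGEX).size
end

# the "first part of CJK processing": if the text from a search box
# contains CJK, set mm and qs for CJK and switch to the CJK flavor
# of the field.
def process_query(solr_params, user_query, field)
  new_field = field
  if cjk_unigrams_size(user_query) > 0
    solr_params['mm'] = cjk_mm_val(user_query) # see the mm tweak above
    solr_params['qs'] = 0
    new_field = "#{field}_cjk" # e.g. search_title => search_title_cjk
  end
  new_field # used as a key to get the "handler" from the config object
end
```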

You can see in the "first part of CJK processing" that it adjusts the mm, qs and new_field parameters when CJK characters are found.  The new_field is ultimately used as a key to get the "handler" from the config object, which is defined in the catalog controller.  Here's a snippet of the config object code:
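Something along these lines (a sketch; the exact keys and LocalParams variable names are assumptions):

```ruby
# A hedged sketch of the handler definitions in the catalog controller's
# config object; keys and LocalParams variable names are assumptions.
config.advanced_search_fields = {
  'search_title'      => { qf: '$qf_title',      pf: '$pf_title' },
  'search_title_cjk'  => { qf: '$qf_title_cjk',  pf: '$pf_title_cjk' },
  'search_author'     => { qf: '$qf_author',     pf: '$pf_author' },
  'search_author_cjk' => { qf: '$qf_author_cjk', pf: '$pf_author_cjk' }
  # ... one plain and one _cjk entry for each text box in the form
}
```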
I'm sure you can imagine the definitions for all flavors of text boxes in the advanced search form.  All of the above is analogous to the Rails application code I showed in part ten for adjusting Solr params for CJK characters in simple search.  We can be confident the code does what we want it to because we wrote all those nice tests: specs to exercise the process_query code for CJK and non-CJK user text, and cucumber features for integration testing, ensuring the Rails application gets the expected results back from Solr given user-entered text in the advanced search form.  (And yes, all the tests pass.)

CJK (and other) Variant Quote Characters

I received the following email from one of our Chinese librarians during testing:
Say what?  There are special CJK-flavored quotation marks?  I got these replies to my question:

Since Solr only interprets " (ASCII hex 22) as the character for phrase searching, we needed the SearchWorks application to substitute hex 22 for the other quote characters.  And it turns out there are many variants: per http://en.wikipedia.org/wiki/Quotation_mark_glyphs, aside from hex 22 there are 14 different paired quotation marks.  Here's our spec code (courtesy of my awesome colleague, Jessie Keck):
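In spirit, the spec does something like the following; the layout and sample strings are assumptions, and the character list is just a sample of the paired variants from the Wikipedia page:

```ruby
# A hedged sketch of the quote-normalization spec; layout and sample
# strings are assumptions, and the characters shown are only a sample
# of the paired variants.
describe "special quote character normalization" do
  %w(“ ” „ ‟ « » ‹ › 「 」 『 』 〝 〞).each do |quote_char|
    it "converts #{quote_char} to \" for phrase searching" do
      normalized = "#{quote_char}舊小說#{quote_char}".gsub(special_quote_characters, '"')
      expect(normalized).to eq '"舊小說"'
    end
  end
end
```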


The SearchWorks Blacklight Rails application has some simple code to substitute plain old " when it encounters any of the other quotation characters.  This code is called as a before_filter in the catalog controller:
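Something like this sketch (the filter method name and the list of query params are assumptions):

```ruby
# A hedged sketch; the filter method name and the list of params to
# normalize are assumptions.
before_filter :convert_special_quotes_to_ascii

def convert_special_quotes_to_ascii
  [:q, :search_title, :search_author].each do |p|
    params[p].gsub!(special_quote_characters, '"') if params[p]
  end
end
```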
The special_quote_characters method is defined in a helper:
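Along these lines; shown here as a character class covering a sample of the variants:

```ruby
# A hedged sketch of the helper: a regexp matching variant quotation
# marks (a sample; the real list covers all 14 paired variants).
def special_quote_characters
  /[“”„‟«»‹›「」『』〝〞]/
end
```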
With this code in place in the Rails application (thanks again to Jessie Keck), the spec passes.  We had the East Asia librarians confirm the fix.

That's it!

So in this twelve(!) part series, I've shared why we wanted to improve CJK discovery, why it's a hard problem to solve in a multilingual environment, the changes we applied to our Solr indexing and why, and the changes we applied to our SearchWorks web application and why.  I've shown exemplar tests for all the changes we made, and shared a lot of specifics regarding tokenizers and normalization filters chosen and why.

I've also shared how to analyze relevancy problems in general and the specifics of how I analyzed each of our problems.  I've shared my approach for working with human testers, from expectation setting to feedback methodology, and our approach for a high-level view of CJK search results quality at a glance.

Along for the ride was information about the Solr edismax request handler, some of its bugs and our workarounds (or lack thereof), as well as some information about edismax arguments like tie, mm, and qs.  There was also some fun with synonyms, with regular expressions for pattern replacement charFilters, and with edismax variable declarations used as LocalParams.

And underlying all of this, of course, is the role our relevancy tests played in discovering, diagnosing and fixing problems.  In addition to the edismax relevancy surprises, we used relevancy tests to vastly simplify our boost values, and to make it possible for us to explore the correctness of solutions without a lot of human eyeballs.

Last, but not least, there have been links to all the indexing code: from our fork of SolrMarc to our Solr config files to the CJKFoldingFilter for Solr, and, of course, to our relevancy tests.

Please feel free to fork the code, make contributions back, ask questions, and otherwise improve on this work.  I'm sure I made some stupid decisions, and made some things more complex than was necessary.  This work is by no means perfect, but our East Asia librarians are very pleased with it and believe it is the best solution available at this time.

I know the order of topics presented may not have been ideal, but the important thing was to get the information out there.
