Wednesday, January 15, 2014

CJK with Solr for Libraries, part 9

This is the ninth in a series of posts about our experiences addressing Chinese, Japanese and Korean (CJK) resource discovery in SearchWorks, the Stanford University Libraries' "catalog" built with Blacklight on top of our Solr index.

In the previous post, I described our level 1 evaluation of CJK search results using the CJK fieldtype definition from part 7 of this series, and indicated that, with a tweak to our mm setting for CJK, we had liftoff for level 2 evaluation of CJK discovery in SearchWorks: a more thorough evaluation of resource discovery in all three languages by our East Asia librarians.

Evaluating CJK Discovery, Level 2

I haven't yet shared here that our baseline objective for CJK resource discovery was parity with Socrates (the name for Stanford's catalog via the ILS).  Thus, we wanted our East Asia librarians to evaluate CJK in SearchWorks with this parity in mind.  It's all about communicating expectations, right?  What to test, how to report back, what to report back.  And most importantly: what is "good enough"?  Here's my agenda from our first meeting:

And here are the instructions I put together for our testers:
As I've stated before, my colleagues are awesome.  They did a bunch of testing, and they gave us feedback.  The feedback wasn't as specific per search as my highest hopes (rarely were the detailed questions above answered), but it was still excellent, useful feedback.  And a crucial point: our feedback link in SearchWorks includes the URL of the page from which the feedback is sent, meaning the feedback mechanism gives us some context for what the user was doing (users often leave out important details, like what kind of search they were doing or the exact query string, all of which are in the URL).

We took the spreadsheet approach started by Vitus Tang (shown in the previous post) and expanded on it as a crude way of collating the testing feedback.  I used the following color system:
  • dark green = SearchWorks is better than Socrates
  • light green = SearchWorks is at least as good as Socrates
  • dark red = Socrates is better than SearchWorks
  • light red = SearchWorks and Socrates both miss the mark
With all this testing, we also had to split each language into a separate sheet, so I'm only showing a small portion here:


I did ultimately add nearly all the queries from this spreadsheet to our CJK relevancy tests, so don't worry if you can't read it -- I'm just sharing it here to show our methodology.

To be honest, we were astonished by how positive much of the feedback was.  Some of the reported problems turned out to be new-user issues: because CJK resource discovery had always been broken in SearchWorks, the testers were new to it -- new to facets, new to relevancy-ranked results, etc.  We also educated the testers that in SearchWorks you might get too many results (too much recall), but the most relevant results should be first, which satisfied most user needs.  It's a bit of a culture shift to go from an ILS discovery interface like Socrates to SearchWorks.

With feedback messages going directly into our JIRA instance (which I watched assiduously throughout testing) and various email exchanges with the East Asia librarians, we needed only one more meeting to vet CJK resource discovery.  In that second meeting, we asked each East Asia librarian to indicate whether they felt SearchWorks had achieved parity (or better) with Socrates, and what problems had surfaced.  The verdict was unanimous: SearchWorks was at parity or better, especially once the problems noted were fixed.

The CJK testers found no blockers.
Here is the list of must-dos found:
  1. phrase searching broken
  2. author + title searches not working for CJK
  3. advanced search not working for CJK
And here are the should-dos found:
  1. deal with extra spaces in our data due to cataloging practice
  2. variant characters - while most traditional Han/Kanji characters map properly to simplified Han/modern Kanji characters, some do not.
  3. variant quote marks - it is common for CJK queries to use quote marks other than " (ASCII 0x22).
  4. add CJK tests - beef up our CJK test suite with the searches used so we don't break functionality in the future.  (see sw_index_tests cjk tests at https://github.com/sul-dlss/sw_index_tests/tree/master/spec/cjk)
  5. publish CJK solutions for others to use (that's these blog posts ...)
The would-be-nice-to-dos (identified by me):
  1. experiment with boosting bigrams more than unigrams
  2. experiment further with Unicode normalization, after automated CJK tests are beefed up
  3. experiment with ps, the phrase slop setting used for pf phrase boosting
However, there was a huge blocker identified by me:
  1. the CJK improvements must not break current production behavior (as documented by our existing relevancy tests).
A couple of the must-do issues were identified early on and fixed fairly easily.  I'll cover them first, and note here that once these two must-dos were fixed and I got past the blocker of breaking existing behavior -- all of which happened within a week -- we put what we had into production.  The timing was partly rushed to get it installed before classes started in the fall, but nevertheless, it was a major coup to have such a positive response to our work.  It was exciting to roll CJK improvements out to SearchWorks end users so quickly after expert testing.

CJK Tweak 2: qs setting for phrase searches

In fixing the phrase searching problem, we again found assumptions in our setup that worked for English, but not for CJK.

Dismax and edismax have a "query phrase slop" parameter, qs, which is the distance allowed between tokens when the query explicitly indicates a phrase search with quotation marks.  Probably as a holdover from our stopword days, we use a setting of qs=1, meaning a query of "women's literature", with the quotes, is allowed to match results containing 'women and literature' and 'women in literature' in addition to 'women's literature'.  Because of the magic of pf sorting the best matches first, this has worked just fine for our users up until now.  With CJK queries, however, this is undesirable -- an explicit phrase query in CJK should match only the exact characters entered, with nothing inserted between them.

Since we were already changing the mm setting for CJK queries (see previous post), it was a simple matter to also change qs to 0 for those queries.  This gave CJK phrase searching the desired behavior, per our testers.  Yay!
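To make the mechanics concrete, here is a minimal sketch of where these parameters live -- not our production config; the handler name and values are illustrative.  The defaults sit in solrconfig.xml, and since dismax/edismax parameters can be overridden per request, the application can send qs=0 (alongside the CJK-appropriate mm) whenever it detects CJK characters in the query string:

    <!-- a sketch, not our production solrconfig.xml: English-oriented defaults -->
    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">edismax</str>
        <!-- allow one term inserted inside an explicit phrase query, e.g.
             "women's literature" also matches 'women in literature' -->
        <str name="qs">1</str>
      </lst>
    </requestHandler>

A CJK query then goes out with qs=0 as a request parameter (just as we send the CJK mm value), so a quoted CJK phrase matches only the exact character sequence.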

CJK Tweak 3: catch-all field

The problem with author + title searching is well described in this feedback from one of our Chinese materials specialists (routed from the feedback link in SearchWorks to our JIRA project):
One of the missing records:
Note that the three characters of the author are in the linked 100 field, and the short title, 245a, starts with the five title characters in the original query.  The second missing record has the same author/title data -- the second book is a reprint of the first. 
Looking at the debugQuery analysis, we can see that even though the individual tokens are matched, the mm requirement applies to all tokens within a single field:
The clause is basically ((fld:吕 fld:吕思 fld:思 ... fld:朝)~12^20.0), where ~12 is the mm (the query is 8 characters, creating 15 tokens, and 86% of 15, rounded down, is 12) and ^20.0 is the boost we give matches in that field.   So this is looking for 12 of the query tokens to match in a single field.  The 5-character title produces only 9 tokens, and 9 < 12, so the title field alone is not enough to produce a match either.
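To spell out the arithmetic (assuming, as with the fieldtype described in part 7, that our CJK analysis emits both unigrams and overlapping bigrams, so n characters yield 2n - 1 tokens):

    tokens(n) = n + (n - 1) = 2n - 1          (unigrams + bigrams)
    tokens(8) = 15;   mm = floor(0.86 × 15) = 12
    tokens(5) = 9;    9 < 12, so the title field alone can never satisfy mm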

We have always had a "catchall" field in our Solr index, but we never had an analogous field for our vernacular script data (in the 880 fields) -- that is, we had never created a CJK-flavored catchall field.  I rectified this, and voilà!  Author + title CJK searches worked.
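As an illustration (a sketch only -- the field and type names here, text_cjk, cjk_all_search, and vern_*_search, are hypothetical, not necessarily those in our schema), a CJK catchall can be declared in schema.xml with the CJK fieldtype from part 7 and copyField directives gathering the vernacular data:

    <!-- a hypothetical CJK catchall field using the fieldtype from part 7 -->
    <field name="cjk_all_search" type="text_cjk" indexed="true" stored="false" multiValued="true"/>

    <!-- copy the vernacular (880) fields into the catchall so that an
         author + title query can satisfy mm within a single field -->
    <copyField source="vern_*_search" dest="cjk_all_search"/>

With a field like this added to the qf list, all the tokens from an author + title query can match in one field, which is exactly what the mm clause above requires.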

Stay Tuned ...

I have just explained fixes for two of the three items on the "must do" list.  The next post will address the blocker -- our CJK improvements were breaking some production behaviors, as documented by our relevancy tests.  I will follow that with posts on how we addressed the rest of the issues on the "must do" and "should do" lists, and of course I will share our current production solutions (which are already available here, with relevancy tests here).
