Tuesday, January 14, 2014

CJK with Solr for Libraries, part 8

This is the eighth of a series of posts about our experiences addressing Chinese, Japanese and Korean (CJK) resource discovery in SearchWorks, the Stanford University Libraries "catalog" built with Blacklight on top of our Solr index.

In the previous post, I described our initial Solr fieldtype for CJK text. Here's how we evaluated the outcome and analyzed the results.

Evaluating CJK Discovery, Level 1

I am very, very fortunate to have fantastic colleagues, and I am fortunate indeed to work with Vitus Tang on many issues related to SearchWorks discovery. Aside from his extensive knowledge of Stanford's MARC data and data practices, Vitus reads Chinese. Thus, he was able to evaluate Chinese search results and determine if we had anything good enough to warrant broader human review by our East Asia librarians. We also had a handful of Japanese and Korean tests from meeting with the East Asia librarians and getting their preferred priorities in writing, with some example strings.

Despite all this, there is a communication gap that is hard to bridge between a CJK-illiterate techie such as myself who groks automated relevancy testing and a CJK-literate non-techie expressing search expectations. In fact, when I started working at Stanford, SearchWorks testers believed that "correct" search results were whatever our ILS (in which the Stanford discovery interface is named "Socrates") gave us; thus it was believed that acceptance criteria should always be "match the ILS results." For CJK, this was especially true, as the East Asia librarians had spent significant earlier effort working with Sirsi/Dynix to get functional CJK searching into the Unicorn (now Symphony) ILS application. Sadly, there were no test scripts or any other testing artifacts I could leverage from that work, so we had to start over.

As I mentioned in the third post of this series, the first SearchWorks relevancy tests were written with cucumber and went through the entire SearchWorks application Rails stack.  We made an early effort to get our CJK acceptance test in the cucumber format.  We got a small number of tests like this in August 2012:
but these tests were inadequate - we were still getting far too many hits due to SOLR-3589 (mm set to zero when the analysis chain splits a token), and we weren't evaluating some key factors, such as whether the most relevant results were sorting first.

Before SOLR-3589 was fixed, we tried 10-15 different ideas for workarounds.  Vitus examined results for all of these efforts, and none of them passed muster.  Eventually, I learned he was tracking basic evaluation like this:
What's crucial here is that Vitus was only looking at numbers of results as his first pass -- if the numbers were too far off, then the trial wasn't worth further evaluation.  Clearly, instead of having a person do this, we could do it with code.  Given the information in the spreadsheet (the search flavor (title, author), the numbers of results from our ILS (Socrates), the comparison of simplified to traditional Han script, and specific query strings), a CJK-illiterate techie such as myself could get this info much more quickly by using a program.  

Since my ultimate goal with test code was to have CJK acceptance tests, I coded the above using rspec-solr (http://rubydoc.info/github/sul-dlss/rspec-solr) and grabbed the hit counts from the error messages.   Eventually, my tests looked something like this:
I'm not going to talk about all my tests right now (though they are available at https://github.com/sul-dlss/sw_index_tests);  the point is that once SOLR-3589 was fixed, we could easily tell our numbers looked a lot more promising.

The green column shows the result counts from Socrates;  the purple columns are different variants of the analysis for text_cjk fields, such as StandardTokenizer vs. ICUTokenizer and different normalization choices.  The numbers may be hard to read here, but the consensus was this was a big improvement over the previous attempts.  Yay!  And that the traditional <--> simplified Han equivalence was working - Yay!

Note that this approach also allowed me to evaluate different forms of Unicode normalization without needing Vitus's help for every iteration. As best I could determine, I got the same results regardless of canonical vs. compatibility equivalence and composed vs. decomposed, and all combinations thereof. Or, if we want to think about it differently, the normalization chosen in the text_cjk fieldtype definition of my previous post was just fine.

At this point, Vitus started looking more closely at the Chinese results.  One of his focal points was bad recall:  when Solr results were missing relevant records found by our ILS.

CJK Tweak 1: mm setting

As Vitus reported specific problems, I analyzed them with the debugging tools mentioned in my third post of this series.  Here is an example of such a problem reported by Vitus:

In this particular case, setting debugQuery=true on the Solr request showed me that our "Min Number Should Match" (mm) setting was causing us difficulty. The description of the (e)dismax mm setting is here; the spec for the format of mm settings is here. Our mm setting was 6<-1 6<90%, which translates to: for 1 to 6 clauses, all are required; for more than 6 clauses, 90% (rounded down) are required. Unless you have explicit parentheses in your query, a "clause" is a token. This was chosen based on user query analysis when we removed stopwords from our index. I even blogged about it.
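To make the mm arithmetic concrete, here is a minimal Python sketch of how a spec string such as 6<-1 6<90% can be evaluated for a given number of clauses. It reflects my reading of the mm format (conditions applied left to right, percentages truncated toward zero); the function name min_should_match is made up for this illustration and this is not Solr's own code.

```python
import re


def min_should_match(clause_count, spec):
    """Return how many clauses must match, per my reading of the mm format."""
    result = clause_count
    spec = spec.strip()
    if "<" in spec:
        # Conditional form: one or more "N<value" parts, applied left to right.
        spec = re.sub(r"\s*<\s*", "<", spec)
        for part in spec.split():
            upper, value = part.split("<", 1)
            if clause_count <= int(upper):
                return result  # at or below this bound: keep the result so far
            result = min_should_match(clause_count, value)
        return result
    if spec.endswith("%"):
        percent = int(spec[:-1])
        calc = int(clause_count * percent / 100)  # int() truncates toward zero
        return clause_count + calc if calc < 0 else calc
    number = int(spec)
    return clause_count + number if number < 0 else number


for clauses in (3, 6, 9):
    print(clauses, "clauses ->", min_should_match(clauses, "6<-1 6<90%"), "required")
```

With this reading, 6 or fewer clauses are all required, and 9 clauses require 8.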

Our mm setting makes sense for whitespace separated tokens, but for overlapping bigrams, it was causing problems.  Take the following example: 婦女與文學 (women and literature in traditional Han characters).  The correct word breaks are 婦女   與   文學.  Our text_cjk fieldtype creates four overlapping bigrams:  婦女, 女與, 與文, 文學 and five unigrams: 婦, 女, 與, 文, 學 for a total of 9 separate tokens or "clauses" as far as mm is concerned.  90% of 9 is 8, rounded down.  So that means all but one of the "clauses" must be found in the record.  Assuming all unigrams are found, that means all but one of the overlapping bigrams must be found, so Solr is requiring at least one of the nonsensical bigrams combining adjacent characters with 'and':  女與 or 與文.  Thus, Solr is only returning results where 'and' immediately precedes literature or where 'and' immediately follows women.

Another example: 董橋 (Dong Qiao, an author). This creates 3 tokens: 董橋, 董 and 橋. Our mm setting requires all clauses when there are 6 or fewer, so all of these tokens are required. This means the bigram is required, so the characters must appear in the same order as the query. Perhaps this is good for our author query ... but should it be true for all 2 character CJK queries?
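The token counts in the two examples above can be reproduced with a small Python sketch. It only mimics the counting (overlapping bigrams plus unigrams within each whitespace-separated run); it is not the actual analysis chain, it assumes every non-space character is CJK, and the function name cjk_tokens is made up for this illustration.

```python
def cjk_tokens(query):
    """List the overlapping bigrams and unigrams for a CJK-only query string."""
    tokens = []
    for run in query.split():  # bigrams are not formed across whitespace
        tokens.extend(run[i:i + 2] for i in range(len(run) - 1))  # overlapping bigrams
        tokens.extend(run)  # one unigram per character
    return tokens


for query in ("婦女與文學", "董橋"):
    tokens = cjk_tokens(query)
    print(query, len(tokens), tokens)
```

The first query yields 9 tokens and the second yields 3, which are the clause counts mm sees.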

At this point, I started wondering about the number of characters in our users' CJK queries, and how many of the bigrammed tokens should be "required" via mm.  How many CJK queries have whitespace in them?  Do CJK queries also include non-CJK characters?  How many characters are in most CJK words?

To answer these questions, I asked the Stanford ILS support folks for CJK queries in our usage logs from Socrates.  They were kind enough to provide me with 1033 (mostly) CJK searches for 2 months. I did a rough analysis of these queries, sorting them by length and getting a sort of histogram.  It's a bit tricky since we have to count characters for CJK, but some of the query strings were in quotes, e.g.  '昌化縣志', had whitespace (妇女  与  婚姻) or included boolean operators (戴季陶 AND 國民黨).  Ignoring the boolean queries and non-CJK queries, and not counting non-CJK characters such as quotes, the data looked like this:
 or, if you prefer it in a table:
Note that nearly 95% of our queries in the data sample contained 8 or fewer CJK characters.
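The rough counting described above can be sketched in Python along these lines. This is an after-the-fact illustration, not the script I actually used: the Unicode ranges are only an approximation of "CJK character", and the names cjk_length and length_histogram are made up.

```python
import re
from collections import Counter

# Approximation of "CJK character": Han ideographs (including Extension A),
# hiragana and katakana, and Hangul syllables.
CJK_CHAR = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")
BOOLEAN_OP = re.compile(r"\b(AND|OR|NOT)\b")


def cjk_length(query):
    """Count CJK characters only; return None for queries left out of the tally."""
    if BOOLEAN_OP.search(query):
        return None  # boolean queries are ignored
    count = len(CJK_CHAR.findall(query))  # quotes and whitespace are not counted
    return count or None  # non-CJK queries are ignored


def length_histogram(queries):
    """Map CJK character count -> number of queries with that count."""
    lengths = (cjk_length(q) for q in queries)
    return Counter(n for n in lengths if n is not None)


sample = ["'昌化縣志'", "妇女  与  婚姻", "戴季陶 AND 國民黨"]
print(sorted(length_histogram(sample).items()))
```

Run on the three example queries mentioned above, the quoted query counts as 4 characters, the one with whitespace as 5, and the boolean query is skipped.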
In the interests of broader validation, Tom Burton-West of HathiTrust was kind enough to share similar information. I was not able to vet the character counts in the HathiTrust queries as carefully (e.g. 婦女與文學 is five characters, but I consider 婦女  與  文學 to also be five characters, because I'm not counting the whitespace), but if you look where 10 characters crosses the graph (the horizontal line labeled 10 on the y axis), and go down two steps, you can see that well over 50% of the queries in Tom's data are 8 characters or fewer.
I'm guessing that if we removed spaces and punctuation from the char counts, the character count distribution would look more like ours.

So the data supports an assumption that the vast majority of our CJK queries will be 8 CJK characters or fewer. What does this mean about our mm setting? At this point, I checked in with Vitus to buttress another assumption: that most Chinese written words are 2 or more characters. This is true, and since our Chinese collection has more materials than our Japanese and Korean collections combined, I decided it was a good starting place for picking an appropriate mm value for CJK bigrammed + unigrammed strings.
The second column in the table shows the number of tokens from the overlapping bigrams + unigrams if all the CJK characters are adjacent -- the maximum number of tokens that will be parsed from a CJK string.  For example, I showed above that a 5 character string has 4 overlapping bigrams, and 5 unigrams for a total of 9 tokens.  If there is a space or punctuation mark within the string, there are fewer tokens, because one of the bigrams is no more:  婦女  與  文學  still has 5 unigrams, but only 2 bigrams, for a total of 7 tokens.
Next I came up with the Desired mm values by thinking of how many adjacencies I could subtract if words were 2 characters. For three and four character queries, I wanted to require all but one adjacency: 舊小說 (old fiction) can split into 舊 (old) and 小說 (fiction). For five and six character queries, I wanted to require all but two adjacencies: 婦女與文學 (women and literature) can be separated into 婦女 (women) 與 (and) 文學 (literature). This sort of thinking would have similar recall to Socrates, which is what Vitus's analysis was investigating.
Using the table above, I chose this mm setting for CJK: 3<86%
Per the mm spec, this says for three or fewer "clauses" (tokens), all are required, but for four or more tokens, only 86% (rounded down) are required.  This is perfect for CJK queries of 6 or fewer characters, a tad high for 7 characters, and perfect again for 8 characters, and seems to be the best fit available.
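The fit described above can be checked with a little arithmetic. The Python sketch below assumes the setting is written 3<86% in the mm format (all tokens required up to three, then 86% rounded down) and that the "desired" pattern continues past six characters by allowing one more missing adjacency for every two characters; that extension to 7 and 8 characters, and the helper names, are assumptions made for this illustration.

```python
def max_tokens(chars):
    """Unigrams plus overlapping bigrams when all characters are adjacent."""
    return chars + max(chars - 1, 0)


def desired_mm(chars):
    """All tokens minus the adjacencies allowed to be missing (assumed pattern)."""
    return max_tokens(chars) - (chars - 1) // 2


def mm_3_lt_86(tokens):
    """All tokens required up to 3; beyond that, 86% rounded down."""
    return tokens if tokens <= 3 else tokens * 86 // 100


print("chars  tokens  desired  86%")
for chars in range(1, 9):
    tokens = max_tokens(chars)
    print(f"{chars:5}  {tokens:6}  {desired_mm(chars):7}  {mm_3_lt_86(tokens):3}")
```

Under these assumptions the two columns agree for every length from 1 to 8 except 7 characters, where 86% of 13 tokens asks for 11 rather than 10.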
Recall that with (e)dismax's pf settings, in a result set where some results have the query characters as adjacent and some don't, the hits with the characters adjacent should score higher and sort to the top. So when we get too many results (too much recall), we can mitigate it by making sure the best (most precise) results are first.
This tweak to our mm setting for CJK queries got more test searches through the Vitus-meter.

I will talk later about how we use code in the SearchWorks application to only apply this mm setting to queries with CJK characters, retaining our original mm setting for all other queries.

Evaluating CJK Discovery, Level 2

With the text_cjk fieldtype shown in the previous post and the mm tweak above, Vitus felt we had results ready for broader testing by the East Asia librarians.  I will discuss how we approached that testing, and what else we had to tweak for CJK resource discovery in future posts in this unbelievably long series.


  1. This is really interesting, thanks for assembling all this documentation!

  2. I should have mentioned that we did do a bit of empirical testing of the mm setting; it wasn't simply "Naomi made it up."