In the previous post, I described our initial Solr fieldtype for CJK text. Here's how we evaluated the outcome and analyzed the results.
Evaluating CJK Discovery, Level 1

I am very fortunate to have fantastic colleagues, and fortunate indeed to work with Vitus Tang on many issues related to SearchWorks discovery. Aside from his extensive knowledge of Stanford's MARC data and data practices, Vitus reads Chinese, so he was able to evaluate Chinese search results and determine whether we had anything good enough to warrant broader human review by our East Asia librarians. We also had a handful of Japanese and Korean tests from meeting with the East Asia librarians, who gave us their priorities in writing along with some example search strings.
Despite all this, there is a communication gap that is hard to bridge between a CJK-illiterate techie such as myself, who groks automated relevancy testing, and a CJK-literate non-techie expressing search expectations. In fact, when I started working at Stanford, SearchWorks testers believed that "correct" search results were whatever our ILS (whose discovery interface at Stanford is named "Socrates") gave us; thus it was believed that acceptance criteria should always be "match the ILS results." For CJK this was especially true, as the East Asia librarians had spent significant earlier effort working with Sirsi/Dynix to get functional CJK searching into the Unicorn (now Symphony) ILS application. Sadly, there were no test scripts or other testing artifacts I could leverage from that work, so we had to start over.
As I mentioned in the third post of this series, the first SearchWorks relevancy tests were written with Cucumber and went through the entire SearchWorks Rails stack. We made an early effort to get our CJK acceptance tests into the Cucumber format, and had a small number of tests like this by August 2012:
Before SOLR-3589 was fixed, we tried 10-15 different ideas for workarounds. Vitus examined results for all of these efforts, and none of them passed muster. Eventually, I learned he was tracking basic evaluation like this:
Since my ultimate goal with test code was to have CJK acceptance tests, I coded the above using rspec-solr (http://rubydoc.info/github/sul-dlss/rspec-solr) and grabbed the hit counts from the error messages. Eventually, my tests looked something like this:
(The full test suite lives at https://github.com/sul-dlss/sw_index_tests.) The point is that once SOLR-3589 was fixed, we could easily tell our numbers looked a lot more promising.
The green column shows the result counts from Socrates; the purple columns are different variants of the analysis for text_cjk fields, such as StandardTokenizer vs. ICUTokenizer and different normalization choices. The numbers may be hard to read here, but the consensus was that this was a big improvement over the previous attempts. Yay! And the traditional <--> simplified Han equivalence was working. Yay!
Note that this approach also allowed me to evaluate different forms of Unicode normalization without needing Vitus's help for every iteration. As best I could determine, I got the same results regardless of canonical vs. compatibility equivalence, composed vs. decomposed forms, and all combinations thereof. Put differently, the normalization chosen in the text_cjk fieldtype definition of my previous post was just fine.
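One reason the normalization choice turned out not to matter much: typical Han ideographs have no canonical or compatibility decompositions, so every Unicode normalization form leaves them unchanged. This can be sanity-checked with Ruby's built-in String#unicode_normalize (a quick illustration, not the ICU filters actually used in the fieldtype):

```ruby
# Typical Han ideographs have no canonical or compatibility decompositions,
# so all four Unicode normalization forms leave the string unchanged.
query = "婦女與文學" # "women and literature"

forms = %i[nfc nfd nfkc nfkd].map { |form| query.unicode_normalize(form) }

puts forms.uniq.length     # all four forms collapse to one unique string
puts forms.first == query  # and that string is the original query
```

Characters that do have compatibility decompositions (e.g. fullwidth forms or CJK compatibility ideographs) would behave differently, which is why the fieldtype still normalizes at all.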
At this point, Vitus started looking more closely at the Chinese results. One of his focal points was bad recall: when Solr results were missing relevant records found by our ILS.
CJK Tweak 1: mm setting

As Vitus reported specific problems, I analyzed them with the debugging tools mentioned in my third post of this series. Here is an example of such a problem reported by Vitus:
In this particular case, setting debugQuery=true on the Solr request showed me that our "Min Number Should Match" (mm) setting was causing us difficulty. The description of the (e)dismax mm setting is here; the spec for the format of mm settings is here. Our mm setting was 6<-1 6<90%, which translates to: for 1 to 6 clauses, all are required; for more than 6 clauses, 90% (rounded down) are required. Unless you have explicit parentheses in your query, a "clause" is a token. This setting was chosen based on user query analysis when we removed stopwords from our index; I even blogged about it.
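In code form, the translation of our 6<-1 6<90% setting looks like this (a hypothetical helper mirroring the behavior described above, not anything in Solr or SearchWorks itself):

```ruby
# How our mm setting "6<-1 6<90%" determines the number of required clauses:
# for 6 or fewer clauses, all are required; above 6, 90% (rounded down).
def required_clauses(num_clauses)
  if num_clauses <= 6
    num_clauses                   # all clauses required
  else
    (num_clauses * 0.90).floor    # 90%, rounded down
  end
end

(1..10).each do |n|
  puts "#{n} clauses => #{required_clauses(n)} required"
end
```

Note how little slack this leaves: even at 9 clauses, 8 must match.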
Our mm setting makes sense for whitespace-separated tokens, but for overlapping bigrams it was causing problems. Take the following example: 婦女與文學 (women and literature, in traditional Han characters). The correct word breaks are 婦女 與 文學. Our text_cjk fieldtype creates four overlapping bigrams (婦女, 女與, 與文, 文學) and five unigrams (婦, 女, 與, 文, 學), for a total of 9 separate tokens, or "clauses" as far as mm is concerned. 90% of 9, rounded down, is 8, so all but one of the "clauses" must be found in the record. Assuming all the unigrams are found, all but one of the overlapping bigrams must also be found, so Solr requires at least one of the nonsensical bigrams combining an adjacent character with 'and': 女與 or 與文. Thus, Solr only returns results where 'and' immediately precedes literature or where 'and' immediately follows women.
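The token arithmetic above can be sketched in a few lines of Ruby (a simplified stand-in for what the bigram-plus-unigram analysis produces, not the actual Lucene filter chain):

```ruby
# Simulate bigramming + unigramming of a CJK query: overlapping bigrams
# plus one unigram per character, as our text_cjk fieldtype produces.
def cjk_tokens(chars)
  bigrams  = chars.each_cons(2).map(&:join)  # overlapping adjacent pairs
  unigrams = chars
  bigrams + unigrams
end

tokens = cjk_tokens("婦女與文學".chars)  # women + and + literature
puts tokens.inspect   # 4 bigrams and 5 unigrams
puts tokens.length    # 9 "clauses" as far as mm is concerned

# Our old "6<90%" rule: 90% of 9, rounded down
puts (tokens.length * 0.90).floor  # only 1 of the 9 tokens may be absent
```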
Another example: 董橋 (Dong Qiao, an author). This query creates 3 tokens: the bigram 董橋 and the unigrams 董 and 橋. Our mm setting requires all clauses when there are 6 or fewer, so all three tokens are required. This means the bigram is required, so the characters must appear in the same order as the query. Perhaps this is good for our author query ... but should it be true for all 2-character CJK queries?
At this point, I started wondering about the number of characters in our users' CJK queries, and how many of the bigrammed tokens should be "required" via mm. How many CJK queries have whitespace in them? Do CJK queries also include non-CJK characters? How many characters are in most CJK words?
To answer these questions, I asked the Stanford ILS support folks for the CJK queries in our Socrates usage logs. They were kind enough to provide me with 1033 (mostly) CJK searches covering 2 months. I did a rough analysis of these queries, sorting them by length to get a sort of histogram. It's a bit tricky, since for CJK we have to count characters, and some of the query strings were in quotes (e.g. '昌化縣志'), had whitespace (妇女 与 婚姻), or included boolean operators (戴季陶 AND 國民黨). Ignoring the boolean and non-CJK queries, and not counting non-CJK characters such as quotes, the data looked like this:
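The counting convention (CJK characters only; quotes, whitespace, and other punctuation ignored) can be expressed with Unicode property regexes. This is a rough sketch of that convention, not the actual script used on the logs:

```ruby
# Count only CJK characters in a query string, ignoring whitespace,
# quotes, and any other non-CJK characters.
CJK_CHAR = /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/

def cjk_char_count(query)
  query.scan(CJK_CHAR).length
end

puts cjk_char_count("'昌化縣志'")    # quotes are not counted => 4
puts cjk_char_count("妇女 与 婚姻")  # whitespace is not counted => 5
```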
In the interests of broader validation, Tom Burton-West of HathiTrust was kind enough to share similar information. I was not able to vet the character counts in the HathiTrust queries as carefully (e.g. 婦女與文學 is five characters, but I consider 婦女 與 文學 to also be five characters, because I'm not counting the whitespace), but if you look where 10 characters crosses the graph (the horizontal line labeled 10 on the y axis) and go down two steps, you can see that well over 50% of the queries in Tom's data are 8 characters or fewer.
So the data supports an assumption that the vast majority of our CJK queries will be 8 CJK characters or fewer. What does this mean for our mm setting? At this point, I checked in with Vitus to buttress another assumption: that most written Chinese words are 2 or more characters. This is true, and since our Chinese collection has more materials than our Japanese and Korean collections combined, I decided it was a good starting place for picking an appropriate mm value for CJK bigrammed + unigrammed strings.
Next I came up with the desired mm values by thinking about how many adjacencies I could subtract if words were 2 characters long. For three and four character queries, I wanted to require all but one adjacency: 舊小說 (old fiction) can split into 舊 (old) and 小說 (fiction). For five and six character queries, I wanted to require all but two adjacencies: 婦女與文學 (women and literature) can be separated into 婦女 (women), 與 (and), and 文學 (literature). This sort of thinking should give recall similar to Socrates, which is what Vitus's analysis was investigating.
Using the table above, I chose this mm setting for CJK:
3<86%

Per the mm spec, this says that for three or fewer "clauses" (tokens), all are required, but for four or more tokens, only 86% (rounded down) are required. This is perfect for CJK queries of 6 or fewer characters, a tad high for 7 characters, perfect again for 8 characters, and seems to be the best fit available.
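To see why 3<86% fits, note that an n-character CJK query yields (n - 1) bigrams plus n unigrams, i.e. 2n - 1 tokens. A small sketch (hypothetical helper, mirroring the rule as described above):

```ruby
# For an n-character CJK query, bigramming + unigramming yields 2n - 1 tokens.
# Under mm = "3<86%": 3 or fewer tokens, all required; otherwise 86%, floored.
def required_tokens(n_chars)
  tokens = 2 * n_chars - 1
  tokens <= 3 ? tokens : (tokens * 0.86).floor
end

(2..8).each do |n|
  tokens = 2 * n - 1
  puts "#{n} chars: #{tokens} tokens, #{required_tokens(n)} required"
end
```

For 3-4 characters this allows one bigram to be absent, and for 5-6 characters two, matching the desired values above; only at 7 characters does it require one more token than desired.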
Recall that with (e)dismax's pf settings, in a result set where some results have the query characters adjacent and some don't, the hits with adjacent characters should score higher and sort to the top. So when we get too many results (too much recall), we can mitigate it by making sure the best (most precise) results come first.
This tweak to our mm setting for CJK queries got more test searches through the Vitus-meter.
I will talk later about how we use code in the SearchWorks application to only apply this mm setting to queries with CJK characters, retaining our original mm setting for all other queries.
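As a preview, the core of that application-side switch is simply detecting whether a query contains any CJK characters. This is a hypothetical sketch of the idea, not the actual SearchWorks code:

```ruby
# Pick an mm value per query: CJK-friendly mm when the query contains
# any CJK characters, the original mm otherwise.
CJK_RE = /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/

DEFAULT_MM = '6<-1 6<90%'
CJK_MM     = '3<86%'

def mm_for(query)
  query.match?(CJK_RE) ? CJK_MM : DEFAULT_MM
end

puts mm_for('women and literature')  # original setting
puts mm_for('婦女與文學')            # CJK setting
```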