In the previous post, I described our level 1 evaluation of CJK search results using the CJK fieldtype definition described in part 7 of this series. I also indicated that, with a tweak to our mm setting for CJK, we had liftoff for level 2 evaluation of CJK Discovery in SearchWorks: more thorough evaluation of resource discovery in all three languages by our East Asia librarians.
Evaluating CJK Discovery, Level 2
I haven't shared here that our baseline objective for CJK resource discovery was to have parity with Socrates (the name for Stanford's catalog via the ILS). Thus, we wanted our East Asia librarians to evaluate CJK in SearchWorks with this parity in mind. It's all about communicating expectations, right? What to test, how to report back, what to report back. And most importantly: what is "good enough?" Here's my agenda from our first meeting:
and here are the instructions I put together for our testers:
We took the spreadsheet approach started by Vitus Tang, as shown in the previous post, and expanded on it as a crude way of collating the testing feedback. I used the following color system:
- dark green = SearchWorks is better than Socrates
- light green = SearchWorks is at least as good as Socrates
- dark red = Socrates is better than SearchWorks
- light red = SearchWorks and Socrates both miss the mark.
I did ultimately add nearly all the queries from this spreadsheet into our CJK relevancy tests, so don't worry if you can't read it -- I'm just sharing it here for the sake of showing methodology.
To be honest, we were astonished by how positive much of the feedback was. Some of the problems reported turned out to be issues of users being new to SearchWorks -- because CJK resource discovery had always been broken in SearchWorks, the testers were new to it -- new to facets, new to relevancy-ranked results, etc. We also educated the testers that in SearchWorks, you might get too many results (too much recall), but the most relevant results should come first, which satisfies most user needs. It's a bit of a culture shift to go from an ILS discovery interface like Socrates to SearchWorks.
With feedback messages put directly into our JIRA instance (which I watched assiduously throughout testing) and various email exchanges with the East Asia librarians, we only needed one more meeting on vetting CJK resource discovery. In the second meeting, we asked each East Asia librarian to indicate whether they felt SearchWorks had achieved parity (or better) with Socrates, and what problems had surfaced. The decision was unanimous that SearchWorks was at parity or better, especially once we had fixes for the problems noted.
Here is the list of blockers that were found by the CJK testers:
- phrase searching broken
- author + title searches not working for CJK
- advanced search not working for CJK
- deal with extra spaces in our data due to cataloging practice
- variant characters - while most Han/Kanji traditional characters are translating properly to Han simplified/Kanji modern, some characters are not.
- variant quote marks - it is common to use quote marks in CJK queries other than " (ASCII hex 22).
- add CJK tests - beef up our CJK test suite with the searches used so we don't break functionality in the future. (see sw_index_tests cjk tests at https://github.com/sul-dlss/sw_index_tests/tree/master/spec/cjk)
- publish CJK solutions for others to use (that's these blog posts ...)
- experiment with boosting bigrams more than unigrams
- experiment further with Unicode normalization, after automated CJK tests are beefed up
- experiment with boosting phrase slop setting, ps
- current production (tests) cannot be broken by CJK improvements.
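The variant quote marks item in the list above lends itself to a small illustration. What follows is only a minimal sketch of one way to handle it, by mapping quote-like characters to the ASCII double quote before the query is parsed. The particular set of characters and the name normalize_quotes are assumptions for illustration; the post does not say which characters were handled or where in the stack the fix was made.

```python
# Hypothetical set of quote-like characters seen in CJK queries.
# This list is an assumption, not the set SearchWorks actually handles.
VARIANT_QUOTES = {
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u300c": '"',  # left corner bracket
    "\u300d": '"',  # right corner bracket
    "\u300e": '"',  # left white corner bracket
    "\u300f": '"',  # right white corner bracket
    "\uff02": '"',  # fullwidth quotation mark
}

# Translation table built once and reused for every query.
_QUOTE_TABLE = str.maketrans(VARIANT_QUOTES)


def normalize_quotes(query: str) -> str:
    """Replace variant quote marks with the ASCII double quote (hex 22)."""
    return query.translate(_QUOTE_TABLE)
```

With this in place, a query wrapped in corner brackets would reach the query parser as an ordinary quoted phrase.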
CJK Tweak 2: qs setting for phrase searches
To fix the phrase searching problem, again we found assumptions in our setup that worked for English, but not for CJK.
Dismax and edismax have a "query phrase slop" parameter, qs, which is the distance allowed between tokens when the query has explicitly indicated a phrase search with quotation marks. Probably from back in our stopword days, we use a setting of qs=1, meaning a query of "women's literature", with the quotes, is allowed to match results containing 'women and literature' as well as 'women in literature' in addition to 'women's literature'. Because of the magic of pf sorting the best matches first, this has worked just fine for our users up until now. However, with CJK queries, this is undesirable -- an explicit phrase query in CJK should only match the exact characters entered, with nothing inserted between them.
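To make the effect of slop concrete, here is a toy phrase matcher. It is a simplified model, not Lucene's actual sloppy phrase algorithm: it only allows extra tokens between phrase tokens that appear in order, and it assumes the text has already been tokenized (for example, that "women's" has been reduced to "women", and that CJK text is split into single characters). The function name phrase_matches is hypothetical.

```python
def phrase_matches(doc_tokens, phrase_tokens, slop=0):
    """Return True if phrase_tokens occur in order in doc_tokens with at
    most `slop` extra tokens, in total, inserted between them."""
    if not phrase_tokens:
        return False

    def search(doc_start, phrase_idx, budget):
        # Every phrase token has been placed: the phrase matched.
        if phrase_idx == len(phrase_tokens):
            return True
        # The next phrase token may sit up to `budget` positions further on.
        for skipped in range(budget + 1):
            pos = doc_start + skipped
            if pos < len(doc_tokens) and doc_tokens[pos] == phrase_tokens[phrase_idx]:
                if search(pos + 1, phrase_idx + 1, budget - skipped):
                    return True
        return False

    return any(
        doc_tokens[i] == phrase_tokens[0] and search(i + 1, 1, slop)
        for i in range(len(doc_tokens))
    )


# English: a slop of 1 lets one word sit between the phrase tokens.
english_doc = ["women", "and", "literature"]
english_phrase = ["women", "literature"]

# CJK: with a slop of 1, an inserted character still counts as a match.
cjk_doc = list("中国的经济")
cjk_phrase = list("中国经济")
```

In this model, phrase_matches(english_doc, english_phrase, slop=1) succeeds while slop=0 fails, and the same holds for the CJK pair, which is the behavior the testers did not want.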
Since we were already changing the mm setting for CJK queries (see previous post), it was a simple matter to also change the qs setting to 0 for CJK queries. This gave CJK phrase searching the desired behavior, per our testers. Yay!
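A minimal sketch of that kind of switch might look like the following. It assumes CJK queries are detected by looking for any character in a few Unicode ranges, and that the request parameters are assembled as a plain dictionary; the function name solr_params and the detection ranges are hypothetical, and the mm change from the previous post is left out.

```python
import re

# Rough CJK detector (an assumption): CJK Unified Ideographs, Hiragana,
# Katakana, and Hangul Syllables. Real detection may need more ranges.
CJK_RE = re.compile(r"[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff\uac00-\ud7af]")


def solr_params(user_query, default_qs=1):
    """Build request parameters, tightening qs to 0 for CJK queries."""
    params = {"q": user_query, "qs": default_qs}
    if CJK_RE.search(user_query):
        # Explicit CJK phrases should match only the exact characters entered.
        params["qs"] = 0
    return params
```

Non-CJK queries keep the existing qs=1 behavior, so current production results are left alone.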
CJK Tweak 3: catch-all field
The problem with author + title searching is well described in this feedback from one of our Chinese materials specialists (routed from the feedback link in SearchWorks to our JIRA project):
Looking at the debugQuery analysis, we can see that even though the individual tokens are being matched, the mm is applying to all tokens within a single field:
We have always had a "catchall" field for our Solr index, but we never had an analogous field for our vernacular script data (in the 880s). Thus, we never created a CJK flavor catchall field. I rectified this, and voila! author + title CJK searches worked.
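The effect can be shown with a toy model. This is not Solr's matching logic, only an illustration of the failure as described above: if every query token must be found within a single field, an author + title query fails until some field holds both. The field names (author, title, cjk_all) and both function names are hypothetical.

```python
def matches_within_one_field(doc, query_tokens):
    """True if some single field contains every query token."""
    return any(
        all(token in field_tokens for token in query_tokens)
        for field_tokens in doc.values()
    )


def add_catchall(doc, name="cjk_all"):
    """Return a copy of doc with a catch-all field merging all other fields."""
    merged = set().union(*doc.values())
    return {**doc, name: merged}


# Tokens are shown as bigrams for brevity; real analysis produces more tokens.
record = {"author": {"鲁迅"}, "title": {"呐喊"}}
query = ["鲁迅", "呐喊"]  # author + title typed into one search box
```

Here matches_within_one_field(record, query) is false, because neither field holds both tokens, while the same check against add_catchall(record) succeeds.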