In this post, I'll discuss an mm tweak I snuck past you in part ten of this series, our fix for CJK advanced searches, and our accommodation for variant quotation marks that may appear with CJK characters.
CJK Tweak 1.1: mm adjustment

There is an additional mm tweak that I sort of snuck past you in part ten of this series. It comes into play when a query string has both CJK and non-CJK characters. I didn't find many queries with this script combination in our logs, but I asked our East Asia librarians whether it occurs and they said it does. Unfortunately, I didn't get any examples, nor did I find any in a quick perusal of the logs. In any case, the fix is easy enough, and you've already seen it.
You may recall from part eight that we use an mm value of 3<86% when we encounter CJK characters in the query string. And in part ten, I showed you the following method:
The code in the purple box is as follows: when adjusting the mm value for CJK queries, we determine how many non-CJK tokens are present in the query string. If there are any non-CJK tokens, we increase the lower limit of the CJK mm value by that count. So if there are 2 non-CJK tokens, the mm becomes 5<86%. This tweak is in both our test code and our SearchWorks application code, as you would expect.
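To make the adjustment concrete, here's a minimal sketch of the logic just described; the method and constant names are my own inventions, not the actual SearchWorks code.

```ruby
# Sketch of the mm adjustment described above (names assumed).
# Our base mm for CJK queries is 3<86%; each non-CJK token in the
# query raises the lower limit by one.
CJK_MM = '3<86%'.freeze

def cjk_mm_for(num_non_cjk_tokens)
  return CJK_MM if num_non_cjk_tokens.zero?
  lower_limit, percentage = CJK_MM.split('<')
  "#{lower_limit.to_i + num_non_cjk_tokens}<#{percentage}"
end

cjk_mm_for(0)  # => "3<86%"
cjk_mm_for(2)  # => "5<86%"
```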
CJK Advanced Search

In part ten of this series, I showed what we needed to do in our SearchWorks application code to use some different Solr parameters when we had CJK characters in a user query string. Unsurprisingly, we needed to do the analogous thing for the SearchWorks advanced search form:
The relevancy specs for CJK advanced search live in the sw_index_tests project under spec/cjk/cjk_advanced_search_spec.rb.
But in some ways, the real concern is whether we have the advanced search form hooked up properly for CJK queries in the application code. So we have cucumber features to do CJK advanced search integration tests in the SearchWorks Rails application.
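A cucumber feature for this sort of integration test might look like the following sketch; the step phrasing and example query are mine, not the actual SearchWorks features.

```gherkin
# Hypothetical sketch of a CJK advanced-search feature; the steps and
# example query are assumed, not the real ones.
Feature: CJK Advanced Search
  Scenario: CJK title search from the advanced search form
    Given I am on the advanced search page
    When I fill in "title" with "红楼梦"
    And I press "Search"
    Then I should get results
```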
We also have tests to make sure we can combine fields properly with AND or OR:
And here are specs for ensuring we handle CJK in some strings and non-CJK in other strings appropriately:
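As a rough sketch of what such a spec can look like (the helper method, matcher, and example data below are my inventions, and the spec requires a running Solr index, so this is a fragment rather than standalone code):

```ruby
# Hypothetical rspec sketch of a mixed-script advanced search spec;
# solr_response and include_document are assumed helper names.
describe 'CJK advanced search, mixed scripts' do
  it 'handles CJK in one field and non-CJK in another' do
    # e.g. a CJK title combined with a romanized author name
    resp = solr_response(title: '紅樓夢', author: 'Cao Xueqin')
    expect(resp).to include_document(expected_record_id)
  end
end
```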
You can see, in the "first part of CJK processing" that it is adjusting the mm, qs and new_field parameters if CJK characters are found. The new_field is ultimately used as a key to get the "handler" from the config object, which is defined in the catalog controller. Here's a snippet of the config object code:
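The snippet below is a hedged reconstruction of what such a config entry can look like in Blacklight; the field keys and LocalParams variable names are illustrative, not the actual SearchWorks values.

```ruby
# Illustrative Blacklight config fragment (field keys and LocalParams
# names are assumed); new_field selects one of these handler entries.
configure_blacklight do |config|
  config.add_search_field('search') do |field|
    field.solr_local_parameters = { qf: '$qf', pf: '$pf' }
  end
  # CJK-specific handler, used when CJK characters are detected
  config.add_search_field('search_cjk') do |field|
    field.include_in_simple_select = false
    field.solr_local_parameters = { qf: '$qf_cjk', pf: '$pf_cjk' }
  end
end
```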
This is analogous to the code shown in part ten for adjusting Solr params for CJK characters in simple search. We can be confident the code does what we want it to because we wrote all those nice tests: specs to exercise the process_query code for CJK and non-CJK user text, and cucumber features for integration testing, ensuring the Rails application was getting the expected results back from Solr, given user-entered text in the advanced search form. (And yes, all the tests pass.)
CJK (and other) Variant Quote Characters

I received the following email from one of our Chinese librarians during testing:
As http://en.wikipedia.org/wiki/Quotation_mark_glyphs shows, there are lots of quotation variants: aside from hex 22 (the plain ASCII double quote), there are 14 different paired quotation marks. Here's our spec code (courtesy of my awesome colleague, Jessie Keck):
The SearchWorks Blacklight Rails application has some simple code in the catalog controller to substitute plain old " when it encounters any of the other quotation characters. This code is called as a before_filter in the catalog controller:
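Here's a minimal sketch of that substitution; the constant and method names are mine, the character class is only a sample of the variants (not the full set of 14 pairs), and the actual code runs as a before_filter in the catalog controller rather than as a standalone method.

```ruby
# Sketch of variant-quote normalization (names assumed): replace a
# sample of variant quotation marks with the plain ASCII " character.
VARIANT_QUOTES = /[“”„‟«»「」『』〝〞]/

def normalize_quotes(query)
  query.gsub(VARIANT_QUOTES, '"')
end

normalize_quotes('「紅樓夢」')    # => '"紅樓夢"'
normalize_quotes('«guillemets»') # => '"guillemets"'
```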
That's it!

So in this twelve(!) part series, I've shared why we wanted to improve CJK discovery, why it's a hard problem to solve in a multilingual environment, the changes we applied to our Solr indexing and why, and the changes we applied to our SearchWorks web application and why. I've shown exemplar tests for all the changes we made, and shared a lot of specifics regarding the tokenizers and normalization filters chosen and why.
I've also shared how to analyze relevancy problems in general and the specifics of how I analyzed each of our problems. I've shared my approach for working with human testers, from expectation setting to feedback methodology, and our approach for a high-level view of CJK search results quality at a glance.
Along for the ride was information about the Solr edismax request handler, some of its bugs and our workarounds (or lack thereof), as well as some information about edismax arguments like tie, mm, and qs. There was also some fun with synonyms, regular expressions for pattern-replacement charFilters, and edismax variable declarations used as LocalParams.
And underlying all of this, of course, is the role our relevancy tests played in discovering, diagnosing and fixing problems. In addition to the edismax relevancy surprises, we used relevancy tests to vastly simplify our boost values, and to make it possible for us to explore the correctness of solutions without a lot of human eyeballs.
Last, but not least, there have been links to all the indexing code: from our fork of SolrMarc to our Solr config files to the CJKFoldingFilter for Solr, and, of course, to our relevancy tests.
Please feel free to fork the code, make contributions back, ask questions, and otherwise improve on this work. I'm sure I made some stupid decisions, and made some things more complex than was necessary. This work is by no means perfect, but our East Asia librarians are very pleased with it and believe it is the best solution available at this time.
I know the order of topics presented may not have been ideal, but the important thing was to get the information out there.