Wednesday, November 6, 2013

CJK with Solr for Libraries, part 2

This is the second of a series of posts about our experiences improving CJK resource discovery for the Stanford University Libraries.

We recently rolled out some significant improvements for Chinese, Japanese and Korean (CJK) resource discovery in SearchWorks, the Stanford library "catalog" built with Blacklight on top of our Solr index. If your collection has a significant number of CJK resources and they are in multiple languages, you might be interested in our recipes. You might also be interested if you have a significant number of resources in other languages with some of the same characteristics.

What Solutions Are Out There?

Of course, our first thought on how to fix our CJK discovery woes was to find existing solutions we could use.  Solr and Lucene are widely used for many different languages, so we hoped to find great ready-made solutions.

Solr Analyzers for Japanese and Chinese

Solr currently ships with a Kuromoji Japanese morphological analyzer/tokenizer (shown in the Solr example schema), and there is also support for Simplified Chinese word segmentation.  Both of these language-specific analyzers are mentioned in the README.txt of the Lucene analysis module.  It is possible there is a Korean Solr analyzer available as well, though it might take someone fluent in Korean to find it on the internet - it is not currently part of the Solr distribution files.
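The Japanese analysis chain in the Solr example schema looks roughly like the sketch below (trimmed to the essentials here; see the schema.xml shipped with your Solr release for the full version):

<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- Kuromoji morphological tokenizer, in "search" mode -->
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
    <!-- reduce inflected verbs and adjectives to their base form -->
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
    <!-- normalize half-width/full-width character forms -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>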

Utilizing language-specific analyzers may make sense if you can ask users to indicate language at query time, perhaps by selecting from a small pulldown list.  For example, we could run all our vernacular language metadata through a Japanese analysis, a Chinese analysis, and a Korean analysis, and send user queries to the appropriate set of language-specific indexed fields based on the user's indicated language.  Given the number of different languages with materials in SearchWorks, we would either have to present our users with a very long list of languages to select from, or we would need to navigate a political storm to determine which languages made the short list in the UI.  Neither of these options was desirable for us.
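To make the trade-off concrete, the per-language approach would mean something like the following in the schema: one analyzed field per language, with the application choosing which field to query based on the user's selection.  The field and type names here are hypothetical, not our actual configuration:

<!-- hypothetical per-language fields, each using a language-specific analyzer -->
<field name="vern_title_ja" type="text_ja" indexed="true" stored="false"/>
<field name="vern_title_zh" type="text_zh" indexed="true" stored="false"/>
<field name="vern_title_ko" type="text_ko" indexed="true" stored="false"/>
<!-- copy the raw vernacular metadata into each language-specific field -->
<copyField source="vern_title" dest="vern_title_ja"/>
<copyField source="vern_title" dest="vern_title_zh"/>
<copyField source="vern_title" dest="vern_title_ko"/>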

Script Translations

Since Solr release 3.1, the Solr code has been able to utilize some of the Unicode-support Java libraries from the International Components for Unicode (ICU) project.  There is a Solr tokenizer available, solr.ICUTokenizer, as well as some Solr filters for field type analysis: ICU collation, ICU character normalization/folding, and some Unicode script translations.
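A field type built purely from the ICU pieces might start out something like this sketch (the name text_icu is just for illustration):

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode-aware tokenization -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Unicode normalization plus case and diacritic folding -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>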

Han Traditional <--> Simplified
As mentioned in the first part of this series, the top priority for Chinese discovery improvements is to equate Traditional Han script characters with simplified Han script characters.  Similarly, the top priority for Japanese discovery improvements is to equate Modern Kanji (Han) characters with Traditional Kanji characters.  The ICU script translations include a translation to equate Traditional Han with Simplified Han.  In Solr, it could be specified as shown in the filter example:
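  <!-- inside the field type's <analyzer> chain: map Traditional Han characters to their Simplified equivalents -->
  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>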

So the next question is:  should we be translating from traditional to simplified, or from simplified to traditional?

Since multiple traditional characters can map to the same simplified character, we are likely to get better recall mapping from traditional to simplified than vice versa.  Some precision may be lost going from traditional to simplified, but our CJK language experts preferred this approach - they would rather get more results, even if some of them are irrelevant, than miss results.  As it happens, this approach is also taken by our ILS, which is Symphony by Sirsi/Dynix.

Katakana <--> Hiragana
The second priority for Japanese discovery improvements is to equate all scripts:  Kanji, Hiragana, Katakana and Romaji.  The only other relevant ICU script translation available is a mapping between Hiragana and Katakana.  This is a straightforward one-to-one character mapping, so it doesn't matter which direction the translation is done, as long as it is consistent between the query and the index.  The Solr filter could look like this:
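  <!-- fold Katakana and Hiragana together; the direction is arbitrary but must match between index and query analysis -->
  <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>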

Other CJK Script Translations?
The additional translations we would like are Hangul <--> Han for Korean, Kanji/Han <--> Hiragana (or Katakana), and Japanese Romaji to one of the other Japanese scripts.  Unfortunately, Solr's use of ICU only supports the ICU system transforms at this time, and none of these translations are included.  ICU itself supports user-supplied transforms, but Solr's ICU integration does not.

Still, covering the top discovery priorities for Japanese and Chinese with out-of-the-box Solr components is a huge win.

Multi-lingual Solutions

As mentioned above, language-tailored text analysis is not a good solution for SearchWorks.  So we must take a multi-lingual approach to improving CJK discovery.

We were already acquainted with the Arcadia-funded 2010 Yale University report "Investigating Multilingual, Multi-script Support in Lucene/Solr Library Applications" by Barnett, Lovins, et al. (https://collaborate.library.yale.edu/yufind/public/FinalReportPublic.pdf), which explains the problem and suggests some approaches, but gives no test-kitchen-approved recipes.  I conferred with a number of developers working in libraries with large collections of Asian materials using Solr, but none of the folks I talked to had tackled this yet, and their thoughts on how they planned to do it mirrored my own.  It was also suggested repeatedly that I talk to Tom Burton-West of the HathiTrust Digital Library (http://www.hathitrust.org), which is a very large, full-text, multi-lingual digital library containing a significant body of CJK full-text material.

Tom Burton-West was amazingly helpful, and had already documented a number of relevant issues in a blog post in December 2011:  http://www.hathitrust.org/blogs/large-scale-search/multilingual-issues-part-1-word-segmentation.  Tom's research-backed post suggests that the best way to work simultaneously with multiple CJK languages would be indexing with a combination of unigrams and overlapping character bigrams.  As an example, if the original characters are ひらがな, then the unigrams would be ひ, ら, が, な and the bigrams would be ひら, らが, and がな, and all seven of these tokens would be in the index.  If the index used only bigrams, it would not find single-character words that have no whitespace on either side; if it used only CJK unigrams, it would produce too many false drops.

Solr CJKBigram Analyzer
Handily enough, Solr makes available a CJKBigramFilter, which creates overlapping bigrams when it encounters adjacent CJK characters (it creates unigrams for CJK characters that aren't adjacent).   The example schema provided with Solr even shows one way you might configure Solr to use it:
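<!-- approximately as shipped in the Solr example schema.xml -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- normalize width before bigramming, so half-width and full-width forms match -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- lowercasing for any non-CJK text in the field -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emit overlapping bigrams for runs of adjacent CJK characters -->
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>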

(This is not the way we ultimately configured our CJK field type; I will share that in a subsequent post.)

Hooray!   We've got two script translations and a way to do multi-lingual CJK bigramming -- all with Solr out-of-the-box components!  We're ready to go! 

Except we're not.

There was also a bug in Solr affecting the search results of bigrammed fields when using the "dismax" or "edismax" query parser (https://issues.apache.org/jira/browse/SOLR-3589).  The Solr dismax and edismax query parsers provide a way to search individual words across a combination of indexed fields and boost values.  "Dismax" is an abbreviation of "disjunction maximum", which is a partial description of the way user queries are turned into low-level Lucene queries.  "Edismax" is an abbreviation for "extended disjunction maximum", where the extensions include improved punctuation handling and Boolean syntax, among other things.
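For context, a dismax-style request handler in solrconfig.xml looks something like this; the field names and boosts are illustrative, not our production settings:

<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- search each user term across these fields, with per-field boosts -->
    <str name="qf">title_t^10 author_t^5 subject_t^2 text</str>
    <!-- boost documents where the terms appear as a phrase in the title -->
    <str name="pf">title_t^20</str>
  </lst>
</requestHandler>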

SearchWorks at this time used the dismax query parser, so this Solr bug affected our CJK search results when we tried using the CJKBigramFilter.  In essence, the bug meant that all the tokens created by the filter would be combined into one big Boolean OR statement.  Using our example above, with a user query of ひらがな, we would get results as if the query were:  ひら OR らが OR がな OR ひ OR ら OR が OR な.  Clearly, this would bloat the results returned and reduce precision.

I created some test indexes, and worked closely with Vitus Tang, a Stanford metadata expert fluent in Chinese, to confirm SOLR-3589 made our CJK search results unacceptably bad.  I looked at the Solr source code and determined that I was unlikely to be able to fix the bug myself.  I thought of and tried a bunch of workarounds for this bug, but none of them produced acceptable search results.  I even thought about substituting a multilingual dictionary in one of the existing Chinese or Japanese tokenizers, but there was no easy way to modify the dictionary of either one of these analyzers.  I also conferred repeatedly with Tom Burton-West and others, but this bug was stumping us.

Lucky for us, as we asked around wondering who had the chops to fix the bug, Solr expert Robert Muir stepped up and fixed SOLR-3589 for Solr version 4.x and for the "edismax" query parser.  We were using Solr 3.5 and the "dismax" query parser at this point.  Tom Burton-West was kind enough to backport the fix to Solr version 3.6, but it still only worked for the edismax query parser.  Both he and I found it too daunting to port the fix to the dismax query parser.

At this point, the best solution was for SearchWorks to upgrade to Solr 4 or at least to Solr 3.6 with the patch, and to switch to the edismax query parser.   This was not a simple journey, as I will explain in a subsequent post in this series.
