Thursday, November 7, 2013

CJK with Solr for Libraries, part 3


This is the third of a series of posts about our experiences improving CJK resource discovery for the Stanford University Libraries.

We recently rolled out some significant improvements for Chinese, Japanese and Korean (CJK) resource discovery in SearchWorks, the Stanford library "catalog" built with Blacklight on top of our Solr index. If your collection has a significant number of CJK resources and they are in multiple languages, you might be interested in our recipes. You might also be interested if you have a significant number of resources in other languages with some of the same characteristics.

Relevancy Testing


In the second part of this series, I explained why SearchWorks needed to change from using the Solr dismax query parser to the edismax query parser.  I would not undertake or recommend a fundamental change to your Solr query processing without a good testing methodology, nor would I change an index to accommodate CJK without a way to ensure it didn't break existing functionality.  Basically, it's a bad idea to change anything about query processing without a way to ensure it doesn't degrade existing relevancy, which implies you need automated relevancy acceptance testing.

Luckily for us, we have been doing automated relevancy testing for a while: first with cucumber tests within our SearchWorks Blacklight Rails application, and now much more efficiently with rspec-solr (http://rubydoc.info/github/sul-dlss/rspec-solr), which interacts with Solr directly instead of going through the whole Rails stack.  It allows our SearchWorks relevancy testing application, sw_index_tests (available at https://github.com/sul-dlss/sw_index_tests), to parse Solr responses and use rspec syntax to check whatever we want about the Solr documents returned.  I blogged about this a while back.

When we were about to switch to edismax to facilitate CJK discovery, we had around 580 relevancy tests in sw_index_tests, including tests for everything (all-fields), author, title, subject and series search results, diacritics and punctuation in search terms, and journal titles, among other things.  These test searches (and their expected results) were amassed over a period of 3-4 years: tests were written every time a tweak was made to address a problem, or when the indexing code changed for some other reason (e.g. providing call number searching).  These tests have never been comprehensive, but they are a lot better than nothing.  We run them against our live production index nightly via our Jenkins continuous integration server.  Sometimes we have to tweak the tests when records are added, removed, or changed in the production index, but that's easy.  The peace of mind of knowing we have a way to do relevancy acceptance testing is well worth the trouble.

And in case you're wondering, we pass an additional http argument (testing=sw_index_tests) to Solr so we can easily segregate these test queries from actual user queries in the Solr logs.
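
Such a test query then shows up in the Solr request log with that extra parameter, so it is easy to filter out; a rough sketch of a log entry (the core name, query, hit count, and QTime are all illustrative):

  INFO: [searchworks] webapp=/solr path=/select params={q=The+Nation&testing=sw_index_tests} hits=2504 status=0 QTime=14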

Here are a few of the kinds of searches covered by example tests in sw_index_tests: title searches, author searches, journal title searches, and searches with diacritics in the query terms.
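
As a rough sketch of what one of these specs looks like (the request params, document id, result count, and the solr_resp_doc_ids_only helper are illustrative; the matchers approximate rspec-solr's syntax, and the real specs live in the sw_index_tests repository):

  describe "journal title searches" do
    it "'The Nation' as a title search should include the journal in the first few results" do
      # send the query to Solr; the helper wraps the response and keeps only document ids
      resp = solr_resp_doc_ids_only('q' => 'The Nation', 'qt' => 'search_title')
      expect(resp).to include('1234567').in_first(3)   # illustrative record id
      expect(resp).to have_at_least(100).documents
    end
  end
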
I will talk more about our CJK relevancy tests later;  my main point here is that we have automated tests to help us determine if anything breaks when we make changes to our Solr index or configurations, and you can do it too!  Heck, you can even use ours and change the acceptance conditions to use your own record ids and expected result counts.

I would love to hear from anyone else who does automated relevancy testing, as it seems to be a rare thing.

Edismax != Dismax


Technically, to switch from the dismax query parser to the edismax query parser, you need only add the "e" to your Solr request handler defType declaration:
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    ...
  </lst>
</requestHandler>
When we tried this, we had 21 failures out of approximately 580 tests.  The failures were in four categories:

1.  Journal title failures

Example of a failing test:
  rspec ./spec/journal_title_spec.rb:22 # journal titles The Nation as everything search
  rspec ./spec/journal_title_spec.rb:32 # journal titles The Nation (National stems to Nation) with format journal

2.  Queries having hyphens with a preceding space (but no following space)

Example of a failing test:
  ./spec/punctuation/hyphen_spec.rb:140 # 'under the sea -wind' hyphen in queries with space before but not after are treated as NOT in everything searches 

3.  Boolean NOT operator

Failing tests:
  ./spec/boolean_spec.rb:100 # boolean NOT operator  space exploration NOT nasa has an appropriate number of results
  ./spec/boolean_spec.rb:88 # boolean NOT operator  space exploration NOT nasa  should have fewer results than query without NOT clause

4.  Synonyms for musical keys.

We use Solr synonyms to equate the following (for all musical keys - the full list is here):
  f#, f♯, f-sharp => f sharp
  ab, a♭, a-flat => a flat
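
These mappings live in a synonyms file that is applied by a synonym filter in the relevant field types; a minimal sketch (the file name is illustrative):

  <filter class="solr.SynonymFilterFactory" synonyms="mus_synonyms.txt" ignoreCase="true"/>
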
Failing tests:
  ./spec/synonym_spec.rb:204 # musical keys sharp keys f# major
  ./spec/synonym_spec.rb:316 # musical flat keys author-title search (which is a phrase search) b♭

Clearly, we needed to address these problems before we could use edismax in production SearchWorks, which means they needed to be fixed as a prerequisite for improving CJK discovery.

How to Analyze Relevancy Problems


Thankfully, there are some excellent tools for debugging relevancy problems.

1.  Solr query debugging parameters 

If you add debugQuery=true to your Solr request, then you will get debugging information in your Solr response.  If you are at Solr release 4.0 or higher, you could use debug=query instead.  Here is an example:
My request:
 http://(solr baseurl)/solr/select?q={! qf=$qf_author}zaring&debug=query
The debug query part of the response (simplified a bit):
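(a sketch based on the field names and boosts that appear in the explain output further down; your parsed query will vary with your qf settings and boosts)

<lst name="debug">
  <str name="rawquerystring">{! qf=$qf_author}zaring</str>
  <str name="querystring">{! qf=$qf_author}zaring</str>
  <str name="parsedquery">+DisjunctionMaxQuery((author_1xx_unstem_search:zaring^20.0 | author_1xx_search:zare^5.0 | ...))</str>
</lst>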

We can see exactly which Solr fields and terms are being searched, and their boost factors.  This example shows different terms being searched in the stemmed and unstemmed version of the fields.  (Note: the decision to stem author fields was deliberate to allow users to find, say "Michaels, Amanda" when they query "Amanda Michael", or if they use "Crook" when the name is actually "Crooke".)

See http://wiki.apache.org/solr/CommonQueryParameters#Debugging for more information.

2.  Analysis GUI

Another Solr-supplied tool is the Analysis form in the admin GUI.  This tool lets you see how each part of your analysis chain of tokenizer and filters affects the data, for a given field, field type, or dynamic field rule in the Solr schema.

For example, entering "zaring" as a field value and "zare" as a query value for the Solr field author_1xx_search shows at which point in the analysis chain "zaring" becomes "zare"; the highlighting on the field value side shows that zare and zaring will match for that field.
Note that this is not an exact representation of query matching.  For example, for (e)dismax processing the query parser breaks the query up by whitespace before field analysis is performed.

3.  Visualization of Individual Result Debug Information

The Solr debug information can also show how the relevancy score of each result was computed, either with debugQuery=true, or debug=results for Solr 4.0 or higher.

Given the same Solr query string as above, with debug=results:
 http://(solr baseurl)/solr/select?q={! qf=$qf_author}zaring&debug=results
The explain part of the response (simplified a bit):
<lst name="debug">
... 
<lst name="explain">
  <str name="3928423">
9.0780735 = (MATCH) sum of:
  9.0780735 = (MATCH) max plus 0.01 times others of:
    9.056876 = (MATCH) weight(author_1xx_unstem_search:zaring^20.0 in 1075085) [DefaultSimilarity], result of:
      9.056876 = score(doc=1075085,freq=1.0 = termFreq=1.0
), product of:
        0.999978 = queryWeight, product of:
          20.0 = boost
          14.491322 = idf(docFreq=9, maxDocs=7231138)
          0.0034502652 = queryNorm
        9.0570755 = fieldWeight in 1075085, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          14.491322 = idf(docFreq=9, maxDocs=7231138)
          0.625 = fieldNorm(doc=1075085)
    2.1197283 = (MATCH) weight(author_1xx_search:zare^5.0 in 1075085) [DefaultSimilarity], result of:
      2.1197283 = score(doc=1075085,freq=1.0 = termFreq=1.0
), product of:
        0.24188633 = queryWeight, product of:
          5.0 = boost
          14.021318 = idf(docFreq=15, maxDocs=7231138)
          0.0034502652 = queryNorm
        8.763324 = fieldWeight in 1075085, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          14.021318 = idf(docFreq=15, maxDocs=7231138)
          0.625 = fieldNorm(doc=1075085)
</str>
This can be useful in determining why a particular document is (or isn't) included in the results, but it is difficult to eyeball the above and understand what is going on, even after you format it.

Thankfully, there is a web site in Poland, http://solr.pl/en/, that offers a web service, http://explain.solr.pl/, which takes your Solr explain info and visualizes it as a pie chart, with each slice showing how much a matching clause contributed to the score.


Suddenly, it is obvious why this document matches.

This tool is even more useful for more complex (e)dismax formulae with a lot of fields to match, multi-term queries, and documents matching different terms in different fields.  Check out some of our actual results from debugging the edismax difficulties here:

edismax:  http://explain.solr.pl/explains/m63o1yhg
dismax:  http://explain.solr.pl/explains/a7bkurhb


Stay Tuned ...


Now that I've explained our testing methodology and some of our debugging techniques, I'm ready to tell you how we overcame the relevancy issues we bumped into when switching to edismax.  That will be the topic of my next post(s).

Wednesday, November 6, 2013

CJK with Solr for Libraries, part 2

This is the second of a series of posts about our experiences improving CJK resource discovery for the Stanford University Libraries.

We recently rolled out some significant improvements for Chinese, Japanese and Korean (CJK) resource discovery in SearchWorks, the Stanford library "catalog" built with Blacklight on top of our Solr index. If your collection has a significant number of CJK resources and they are in multiple languages, you might be interested in our recipes. You might also be interested if you have a significant number of resources in other languages with some of the same characteristics.

What Solutions Are Out There?

Of course our first thoughts on how to fix our CJK discovery woes were to find existing solutions we could use.  Solr and Lucene are used widely for many different languages, so we hoped to find great ready-made solutions.

Solr Analyzers for Japanese and Chinese

Solr currently ships with a Kuromoji Japanese morphological analyzer/tokenizer (shown in the Solr example schema), and there is also support for Simplified Chinese word segmentation.  Both of these language specific analyzers are mentioned in the README.txt of the Lucene analysis module.  It is possible there is a Korean Solr analyzer available as well, though it might take someone fluent in Korean to find it on the internet - it is not currently part of the Solr distribution files.
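
For reference, here is a trimmed version of the text_ja field type from the Solr 4.x example schema (check the schema.xml shipped with your Solr release for the full analysis chain and the stopword/stoptag files it references):

<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- Kuromoji morphological tokenizer -->
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
    <!-- reduce inflected verbs and adjectives to their base form -->
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
    <!-- normalize full-width/half-width character variants -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>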

Utilizing language-specific analyzers may make sense if you can ask users to indicate language at query time, perhaps by selecting from a small pulldown list.  For example, we could run all our vernacular language metadata through a Japanese analysis, a Chinese analysis, and a Korean analysis, and send user queries to the appropriate set of language-specific indexed fields, based on the user's indication of language.  Given the number of different languages with materials in SearchWorks, we would either have to present our users with a very long list of languages to select from, or we would need to navigate a political storm to determine which languages made the short list in the UI.  Neither of these options was desirable for us.

Script Translations

Since Solr release 3.1, the Solr code has been able to utilize some of the Unicode support java libraries from the International Components for Unicode (ICU) project.  There is a Solr tokenizer available, solr.ICUTokenizer as well as some Solr filters for field type analysis: ICU collation, ICU character normalization/folding, and some Unicode script translations.
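
As a minimal sketch, an ICU-based analysis chain can be configured like this (these classes live in Solr's analysis-extras contrib, so the ICU jars must be on Solr's classpath; the field type name is illustrative):

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- tokenize using Unicode text segmentation rules rather than just whitespace -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Unicode normalization plus case and diacritic folding -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>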

Han Traditional <--> Simplified
As mentioned in the first part of this series, the top priority for Chinese discovery improvements is to equate Traditional Han script characters with simplified Han script characters.  Similarly, the top priority for Japanese discovery improvements is to equate Modern Kanji (Han) characters with Traditional Kanji characters.  The ICU script translations include a translation to equate Traditional Han with Simplified Han.  In Solr, it could be specified as shown in the filter example:
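(roughly; the id names one of the ICU system transforms)

  <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>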

So the next question is:  should we be translating from traditional to simplified, or from simplified to traditional?

Since multiple traditional characters can map to the same simplified character, we are likely to get better recall mapping from traditional to simplified than vice versa.  Some precision may be lost going from traditional to simplified, but our CJK language experts preferred this approach - they would rather get more results, with some of them irrelevant, than miss results.  As it happens, this is also the approach taken by our ILS, which is Symphony by Sirsi/Dynix.

Katakana <--> Hiragana
The second priority for Japanese discovery improvements is to equate all scripts:  Kanji, Hiragana, Katakana and Romanji.  The only other relevant ICU script translation available is a mapping between Hiragana and Katakana.  This is a straightforward one-to-one character mapping, so it doesn't matter which direction the translation is done, as long as it is consistent between the query and the index.  The Solr filter could look like this:
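(one direction shown as a sketch; Katakana-Hiragana is another of the ICU system transforms)

  <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>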

Other CJK Script Translations?
The additional translations we would like are Hangul <--> Han for Korean, Kanji/Han <--> Hiragana (or Katakana), and Japanese Romanji to one of the other Japanese scripts.  Unfortunately, Solr's use of ICU only supports ICU system transforms at this time, and none of these translations are included.  ICU itself supports user-supplied transforms, but Solr's ICU integration does not.

Still, covering the top discovery priorities for Japanese and Chinese with out-of-the-box Solr components is a huge win.

Multi-lingual Solutions

As mentioned above, language-tailored text analysis is not a good solution for SearchWorks.  So we must take a multi-lingual approach to solving CJK discovery.

We were already acquainted with the Arcadia-funded 2010 Yale University report "Investigating Multilingual, Multi-script Support in Lucene/Solr Library Applications" by Barnett, Lovins, et al. (https://collaborate.library.yale.edu/yufind/public/FinalReportPublic.pdf), which explains the problem and suggests some approaches, but gives no test-kitchen-approved recipes.  I conferred with a number of developers working in libraries with large collections of Asian materials using Solr, but none of the folks I talked to had tackled this yet, and their thoughts on how they planned to do it mirrored my own.  It was also suggested repeatedly that I talk to Tom Burton-West of the Hathi Trust Digital Library (http://www.hathitrust.org), which is a very large, full-text, multi-lingual digital library containing a significant body of CJK full-text material.

Tom Burton-West was amazingly helpful, and had already documented a number of relevant issues in a blog post in December 2011:  http://www.hathitrust.org/blogs/large-scale-search/multilingual-issues-part-1-word-segmentation.  Tom's research-backed post suggests that the best way to work simultaneously with multiple CJK languages would be indexing with a combination of unigrams and overlapping character bigrams. As an example, if the original characters are ひらがな then the unigrams would be ひ, ら, が, な and the bigrams would be ひら, らが, and がな, and all seven of these tokens would be in the index.  If the index only used bigrams, it would not find single-character words that have no whitespace on either side, and if the index only used CJK unigrams, then it would produce too many false drops.

Solr CJKBigram Analyzer
Handily enough, Solr makes available a CJKBigramFilter, which creates overlapping bigrams when it encounters adjacent CJK characters (it creates unigrams for CJK characters that aren't adjacent).   The example schema provided with Solr even shows one way you might configure Solr to use it:
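(approximately the text_cjk field type from the Solr 4.x example schema; check the schema.xml shipped with your release for the exact chain)

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- normalize width differences (full-width ASCII forms, half-width Katakana) -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- overlapping bigrams for adjacent CJK characters; lone CJK characters become unigrams -->
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>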

(This is not the way we ultimately configured our CJK field type; I will share that in a subsequent post.)

Hooray!   We've got two script translations and a way to do multi-lingual CJK bigramming -- all with Solr out-of-the-box components!  We're ready to go! 

Except we're not.

There was also a bug in Solr affecting the search results of bigrammed fields when using the "dismax" or "edismax" query parser.  (https://issues.apache.org/jira/browse/SOLR-3589).  The Solr dismax and edismax query parsers provide a way to search individual words across a combination of indexed fields and boost values. "Dismax" is an abbreviation of "disjunction maximum", which is a partial description of the way user queries are turned into low level Lucene queries. "Edismax" is an abbreviation for "extended disjunction maximum", where the extensions include improved punctuation handling and boolean syntax among other things.
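
For example, a dismax or edismax request handler's defaults typically include a qf parameter listing the fields to search and their boosts; a sketch (the field names and boosts here are illustrative):

  <str name="defType">edismax</str>
  <str name="qf">
    title_unstem_search^100  title_search^50
    author_1xx_unstem_search^20  author_1xx_search^5
  </str>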

SearchWorks at this time used the dismax query parser, so this Solr bug affected our CJK search results when we tried using the CJKBigramFilter. In essence, this bug meant that all the tokens created by the filter would be combined into a big boolean OR statement. Using our example above, with a user query of ひらがな, we would get results as if the query was this:  ひら OR らが OR がな OR ひ OR ら OR が OR な.   Clearly, this would bloat the results returned and reduce precision.

I created some test indexes, and worked closely with Vitus Tang, a Stanford metadata expert fluent in Chinese, to confirm SOLR-3589 made our CJK search results unacceptably bad.  I looked at the Solr source code and determined that I was unlikely to be able to fix the bug myself.  I thought of and tried a bunch of workarounds for this bug, but none of them produced acceptable search results.  I even thought about substituting a multilingual dictionary in one of the existing Chinese or Japanese tokenizers, but there was no easy way to modify the dictionary of either one of these analyzers.  I also conferred repeatedly with Tom Burton-West and others, but this bug was stumping us.

Lucky for us, as we asked around wondering who had the chops to fix the bug, Solr expert Robert Muir stepped up and fixed SOLR-3589 for Solr version 4.x and for the "edismax" query parser.  We were using Solr 3.5 and the "dismax" query parser at this point.  Tom Burton-West was kind enough to backport the fix to Solr version 3.6, but it still only worked for the edismax query parser.  Both he and I found it too daunting to port the fix to the dismax query parser.

At this point, the best solution was for SearchWorks to upgrade to Solr 4 or at least to Solr 3.6 with the patch, and to switch to the edismax query parser.   This was not a simple journey, as I will explain in a subsequent post in this series.

Tuesday, October 29, 2013

CJK with Solr for Libraries, part 1

This is the first of a series of posts about our experiences improving CJK resource discovery for the Stanford University Libraries.

We recently rolled out some significant improvements for Chinese, Japanese and Korean (CJK) resource discovery in SearchWorks, the Stanford library "catalog" built with Blacklight on top of our Solr index.   If your collection has a significant number of CJK resources and they are in multiple languages, you might be interested in our recipes.  You might also be interested if you have a significant number of resources in other languages with some of the same characteristics.

Disclaimer: I am not knowledgeable about Chinese, Japanese or Korean languages or scripts -- the below is an approximate explanation meant only to illustrate the complexities of CJK resource discovery.

Why do we care about CJK resource discovery?


Stanford University Libraries has over 7 million resources in SearchWorks;  over 450,000 of them are shown as resources in Chinese, Japanese, or Korean.  Of these CJK records, 85% have vernacular scripts in the metadata.

We want to leverage the CJK vernacular text to improve resource discovery for our CJK users.

Why approach CJK resource discovery differently?


1.  Meaningful discovery units (words) are not necessarily separated by whitespace in CJK text.

  • Solr/Lucene has some baked in assumptions about whitespace separating words. It'sasifthetextalwayslookslikethisbut the software expects it to look like this. 
  • This is true for user behavior as well as for resources.

2.  Search results must be as script agnostic as possible.

Chinese, Japanese and Korean each have multiple scripts or multiple character representations for each word ... and search results should include matches from all of them.

Chinese

Uses Han script only, BUT:
  • There is more than one way to write each word. "Simplified" characters were emphasized for printed materials in mainland China starting in the 1950s;  "Traditional" characters were used in printed materials prior to the 1950s, and are still used in Taiwan, Hong Kong and Macau today.  Since the characters are distinct, it's as if Chinese materials are written in two scripts.  
  • Another way to think about it:  every written Chinese word has at least two completely different spellings.  And it can be mix-n-match:  a word can be written with one traditional  and one simplified character.
  • Example:   Given a user query 舊小說  (traditional for old fiction), the results should include matches for 舊小說 (traditional) and 旧小说 (simplified characters for old fiction)

Japanese

Mainly uses three scripts:
  • Han ("Kanji")
    • Kanji characters can be "traditional" or "modern," akin to Chinese "traditional" and "simplified."  However, given a traditional Han/Kanji character, the corresponding Kanji modern character is not always the same as the Han simplified character.
    • That is, "some of the Chinese characters used in Japan are neither 'traditional' nor 'simplified'. In this case, these characters cannot be found in traditional/simplified Chinese dictionaries."  from http://en.wikipedia.org/wiki/Simplified_Chinese_characters#Computer_encoding
    • Note:  Kanji characters are still actively used in contemporary writing.
  • Hiragana
    • syllabary used to write native Japanese words.
  • Katakana
    • syllabary primarily used to write foreign language words.
Also makes some use of
  • Latin ("Romanji")

Korean

Uses two scripts:
  • Han ("Hanja")
    • some Hanja characters are still actively used by South Koreans.
  • Hangul
    • in widespread use;  it was promulgated in the mid-15th century.

Note:  Han script is used by all three CJK languages BUT:

  • the meaning of the characters is not necessarily the same in the different languages.
  • you can't translate Han characters for one language without potential degradation of results in the other languages.

3.  Multilingual indexes can't sacrifice, say, Japanese searching precision in favor of Chinese searching precision. 

4.  Automatic language detection is not possible.

  • script detection isn't sufficient.  Example: a record has Latin and Han characters.  Is it Japanese?  Or English and Chinese?   Or English and Korean?  Or English and Japanese?
  • the indicated language(s) in a MARC record may be insufficient.  For example, the record may be a Korean record for a resource that is mostly in Chinese.  
  • the user queries are short:  90% of our CJK queries are less than 25 characters; 50% have 12 or fewer chars.   (Evidence of this will be shown in another part of this series on CJK.)
  • the amount of CJK text in an individual record may also be too small.

5.  Artificial spacing may be present in Korean MARC records.

Cataloging practice for Korean for many years was to insert spaces between characters according to "word division" cataloging rules (See http://www.loc.gov/catdir/cpso/romanization/korean.pdf, starting page 16.)  End-users entering queries in a search box would not use these spaces.  It 's analog ous to spac ing rule s in catalog ing be ing like this for English.

CJK Discovery Priorities

Given the difficulties above, we asked our East Asia Librarians what their priorities were for discovery improvements.

Chinese

1.  Equate Traditional Characters With Simplified Characters

About half of our Chinese resources are in traditional characters and the other half are in simplified characters.  Queries can be in either traditional or simplified characters, or a combination of the two;  search results should contain all matching resources, whether traditional or simplified.

2.  Word Breaks

Search results should match conceptual word breaks whether or not whitespace is used to separate words in the results or the query.

Japanese

1.  Equate Traditional Kanji Characters With Modern Kanji Characters

Kanji (Han) queries can be in either traditional or modern characters, or a combination of the two;  search results should contain all matching resources, whether traditional or modern.  It is important to restate that Modern Kanji characters are not always the same as the Simplified Han characters for the equivalent traditional character.

2.  Equate All Scripts

Search results should contain matches in all four scripts: Hiragana, Katakana, Kanji, and Romanji.  Queries can be in any script, or any combination of scripts.

3.  Imported Words

Japanese represents some foreign words with Romanji and/or Katakana:
"sports" --> "supotsu" <==> スポーツ
Search results should contain matches in all representations and allow queries in any representation.

4.  Word Breaks

Search results should match conceptual word breaks whether or not whitespace is used to separate words in the results or the query.

Korean

1.  Word Breaks

Search results should match conceptual word breaks whether or not whitespace is used to separate words.

2.  Equate Hangul and Hanja Scripts

Search results should contain matches in both scripts: Hangul and Hanja (Han).  Queries can be in any script, or any combination of scripts.


Next ...

We'll be looking at what Solr offers in the way of CJK tools, and some of the recently fixed and current Solr bugs that get in the way, including two that almost sank us. We'll also examine what current CJK queries look like and where the CJK characters are in our MARC data. And of course we'll cover our testing methodology and the final recipes. No guarantees on the order of these topics!