Tuesday, October 29, 2013

CJK with Solr for Libraries, part 1

This is the first of a series of posts about our experiences improving CJK resource discovery for the Stanford University Libraries.

We recently rolled out some significant improvements for Chinese, Japanese and Korean (CJK) resource discovery in SearchWorks, the Stanford library "catalog" built with Blacklight on top of our Solr index.   If your collection has a significant number of CJK resources and they are in multiple languages, you might be interested in our recipes.  You might also be interested if you have a significant number of resources in other languages with some of the same characteristics.

Disclaimer: I am not knowledgeable about Chinese, Japanese or Korean languages or scripts -- the below is an approximate explanation meant only to illustrate the complexities of CJK resource discovery.

Why do we care about CJK resource discovery?


Stanford University Libraries has over 7 million resources in SearchWorks;  over 450,000 of them are shown as resources in Chinese, Japanese, or Korean:


Of these CJK records, 85% have vernacular scripts in the metadata:

We want to leverage the CJK vernacular text to improve resource discovery for our CJK users.

Why approach CJK resource discovery differently?


1.  Meaningful discovery units (words) are not necessarily separated by whitespace in CJK text.

  • Solr/Lucene has some baked-in assumptions about whitespace separating words. It'sasifthetextalwayslookslikethisbut the software expects it to look like this. 
  • This is true for user behavior as well as for resources.
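The mismatch can be illustrated in a few lines of plain Python (a sketch only, not Solr itself; the overlapping-bigram strategy mirrors the core idea behind Lucene's CJKBigramFilter):

```python
def whitespace_tokens(text):
    # What a whitespace tokenizer sees: no spaces means one giant "word".
    return text.split()

def cjk_bigrams(text):
    # Overlapping two-character tokens sidestep the missing whitespace;
    # this is the basic idea behind Lucene's CJKBigramFilter.
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(whitespace_tokens("舊小說"))  # ['舊小說'] -- a single, unhelpful token
print(cjk_bigrams("舊小說"))        # ['舊小', '小說']
```

With bigrams, a query and a record match as long as they share two-character sequences, with or without whitespace.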

2.  Search results must be as script agnostic as possible.

Chinese, Japanese and Korean each have multiple scripts or multiple character representations for each word ... and search results should include matches from all of them.

Chinese

Uses Han script only, BUT:
  • There is more than one way to write each word. "Simplified" characters were emphasized for printed materials in mainland China starting in the 1950s;  "Traditional" characters were used in printed materials prior to the 1950s, and are still used in Taiwan, Hong Kong and Macau today.  Since the characters are distinct, it's as if Chinese materials are written in two scripts.  
  • Another way to think about it:  every written Chinese word has at least two completely different spellings.  And it can be mix-n-match:  a word can be written with one traditional  and one simplified character.
  • Example:  Given a user query 舊小說 (traditional characters for "old fiction"), the results should include matches for both 舊小說 (traditional) and 旧小说 (simplified).
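One way to make the two spellings match is to normalize every character to a single form at both index and query time. A toy sketch (the two-entry mapping table here is illustrative only; real deployments use a full table such as the ICU Traditional-Simplified transform):

```python
# A tiny, illustrative traditional -> simplified mapping; production
# systems use a complete table (e.g. ICU's Traditional-Simplified
# transform), not a hand-built dict like this.
TRAD_TO_SIMP = {"舊": "旧", "說": "说"}  # 小 is identical in both forms

def to_simplified(text):
    # Map each character to its simplified form (or leave it alone),
    # so traditional, simplified, and mixed text all normalize alike.
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)

print(to_simplified("舊小說"))  # 旧小说 (from traditional)
print(to_simplified("旧小说"))  # 旧小说 (already simplified)
print(to_simplified("舊小说"))  # 旧小说 (mix-n-match input normalizes too)
```

Because all three inputs normalize to the same string, a query in either script (or a mix) matches records in either script.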

Japanese

Mainly uses three scripts:
  • Han ("Kanji")
    • Kanji characters can be "traditional" or "modern," akin to Chinese "traditional" and "simplified."  However, given a traditional Han/Kanji character, the corresponding Kanji modern character is not always the same as the Han simplified character.
    • That is, "some of the Chinese characters used in Japan are neither 'traditional' nor 'simplified'. In this case, these characters cannot be found in traditional/simplified Chinese dictionaries."  from http://en.wikipedia.org/wiki/Simplified_Chinese_characters#Computer_encoding
    • Note:  Kanji characters are still actively used in contemporary writing.
  • Hiragana
    • syllabary used to write native Japanese words.
  • Katakana
    • syllabary primarily used to write foreign language words.
Also makes some use of
  • Latin ("Romaji")

Korean

Uses two scripts:
  • Han ("Hanja")
    • some Hanja characters are still actively used by South Koreans.
  • Hangul
    • in widespread use;  was promulgated in the mid 15th century.

Note:  Han script is used by all three CJK languages BUT:

  • the meaning of the characters is not necessarily the same in the different languages.
  • you can't convert Han characters using rules for one language without potentially degrading results in the other languages.

3.  Multilingual indexes can't sacrifice, say, Japanese searching precision in favor of Chinese searching precision. 

4.  Automatic language detection is not possible.

  • script detection isn't sufficient.  Example: a record has Latin and Han characters.  Is it Japanese?  Or English and Chinese?   Or English and Korean?  Or English and Japanese?
  • the indicated language(s) in a MARC record may be insufficient.  For example, the record may be a Korean record for a resource that is mostly in Chinese.  
  • the user queries are short:  90% of our CJK queries are less than 25 characters; 50% have 12 or fewer chars.   (Evidence of this will be shown in another part of this series on CJK.)
  • the amount of CJK text in an individual record may also be too small.

5.  Artificial spacing may be present in Korean MARC records.

Cataloging practice for Korean for many years was to insert spaces between characters according to "word division" cataloging rules (see http://www.loc.gov/catdir/cpso/romanization/korean.pdf, starting on page 16). End users entering queries in a search box would not use these spaces.  It 's analog ous to spac ing rule s in catalog ing be ing like this for English.
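One conceivable fix (a sketch only, not a description of our production analysis chain) is to collapse whitespace between adjacent Hangul syllables before tokenizing, so catalogers' word-division spaces and users' space-free queries analyze identically. The sample word 독립운동 ("independence movement") is just an illustration:

```python
import re

# Match whitespace only when a Hangul syllable sits on both sides,
# so spacing around Latin text is left untouched.
HANGUL_GAP = re.compile(r"(?<=[\uAC00-\uD7A3])\s+(?=[\uAC00-\uD7A3])")

def collapse_hangul_spaces(text):
    return HANGUL_GAP.sub("", text)

print(collapse_hangul_spaces("독립 운동"))  # 독립운동
print(collapse_hangul_spaces("Korean history 독립 운동"))
# Korean history 독립운동 -- the Latin-side spaces survive
```

After this normalization, both the artificially spaced record text and the unspaced user query produce the same character stream for downstream tokenization.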

CJK Discovery Priorities

Given the difficulties above, we asked our East Asia Librarians what their priorities were for discovery improvements.

Chinese

1.  Equate Traditional Characters With Simplified Characters

About half of our Chinese resources are in traditional characters, the other half are in simplified characters.  Queries can be in either traditional or simplified characters, or a combination of the two;  search results should contain all matching resources, whether traditional or simplified.

2.  Word Breaks

Search results should match conceptual word breaks whether or not whitespace is used to separate words in the results or the query.

Japanese

1.  Equate Traditional Kanji Characters With Modern Kanji Characters

Kanji (Han) queries can be in either traditional or modern characters, or a combination of the two;  search results should contain all matching resources, whether traditional or modern.  It is important to restate that the modern Kanji character is not always the same as the simplified Han character for the equivalent traditional character.

2.  Equate All Scripts

Search results should contain matches in all four scripts: Hiragana, Katakana, Kanji, and Romaji.  Queries can be in any script, or any combination of scripts.

3.  Imported Words

Japanese represents some foreign words with Romaji and/or Katakana:
"sports" --> "supotsu" <==> スポーツ
Search results should contain matches in all representations and allow queries in any representation.

4.  Word Breaks

Search results should match conceptual word breaks whether or not whitespace is used to separate words in the results or the query.

Korean

1.  Word Breaks

Search results should match conceptual word breaks whether or not whitespace is used to separate words.

2.  Equate Hangul and Hanja Scripts

Search results should contain matches in both scripts: Hangul and Hanja (Han).  Queries can be in either script, or a combination of the two.


Next ...

We'll be looking at what Solr offers in the way of CJK tools, and at some recently fixed and still-open Solr bugs that get in the way, including two that almost sank us. We'll also examine what current CJK queries look like and where the CJK characters are in our MARC data. And of course we'll cover our testing methodology and the final recipes. No guarantees on the order of these topics!

1 comment:

  1. I found out yesterday that there is a Korean language analyzer in the process of getting added to the Solr code. See https://issues.apache.org/jira/browse/LUCENE-4956
