Wednesday, January 22, 2014

Searching in Solr, Analyzing Results and CJK

In my recently completed twelve-post series on Chinese, Japanese and Korean (CJK) with Solr for Libraries, my primary objective was to make information available to others in an expeditious manner. However, the organization of the topics is far from optimal for readers, and the series is too long to skim easily for topics of interest. Therefore, I am providing this post as a sort of table of contents into the previous series.

Introduction

In Fall 2013, we rolled out some significant improvements for Chinese, Japanese and Korean (CJK) resource discovery in SearchWorks, the Stanford library "catalog" built with Blacklight on top of our Solr index. If your collection has a significant number of CJK resources and they are in multiple languages, you might be interested in our recipes. You might also be interested if you have a significant number of resources in multiple languages, period.

If you are interested in improving searching, or in improving your methodology when working on searching, these posts provide a great deal of information. Analysis of Solr result relevancy figured heavily in this work, as did testing: relevancy/acceptance/regression testing against a live Solr index, unit testing, and integration testing. In addition, there was testing by humans, which was well managed and produced searches that were turned into automated tests. Many of the blog entries contain useful approaches for debugging Solr relevancy and for test driven development (TDD) of new search behavior.

Resource Discovery with CJK Improvements

The production discovery service for the Stanford University Libraries is SearchWorks, which has all the CJK improvements discussed in the previous twelve blog posts: http://searchworks.stanford.edu

Where's the Code?

https://github.com/solrmarc/stanford-solr-marc - Stanford's fork of SolrMarc (for indexing our MARC data into Solr)

stanford-sw/solr - Solr config files, jars, etc.

https://github.com/sul-dlss/sw_index_tests - over 1000 relevancy tests run against our production Solr index

spec/cjk - tests for CJK

https://github.com/solrmarc/CJKFoldingFilter - a Solr filter to get better recall for Chinese and Japanese when using solr.ICUTransformFilter translation from traditional Han/Kanji characters to simplified/modern Han/Kanji characters.
SearchWorks application code - currently available upon request (will be on github in the next few months). This is a Ruby on Rails application built with Blacklight.
Blacklight - a configurable Ruby on Rails front-end to provide a discovery UI for a Solr index.
Solr - a search platform available from the Apache Lucene project.

Fork my code (the first three bullets), use it, rip it to shreds, contribute back, ask questions and otherwise improve on it.

CJK Work

CJK Overview

Why do we care about CJK resource discovery? (part 1)
Why approach CJK resource discovery differently? (part 1)
Our CJK Discovery Priorities (part 1)

CJK Data

Where is our CJK data in our MARC records? (part 7)
Extraneous spaces in CJK MARC data (part 11)
number of characters in CJK queries (part 8)

Solutions Made Available with Solr

Solr Language Specific Analysis (part 2)
Multilingual CJK Solutions (part 2)
Solr CJK Script Translations (part 2)
ICUFoldingFilter (part 7)
CJKBigramFilter (part 2)
CJKBigramFilter prerequisites (part 2 (edismax); part 7 (tokenizer))

Our Specific CJK Solutions for Solr

text_cjk Solr fieldtype definition (part 7, part 10, part 11)
mm for CJK (part 8, part 12)
phrase searching: qs for CJK (part 9)
catch-all field (part 9)
special non-alpha chars vs. cjk (part 10)
CJKFoldingFilter for variant Han/Kanji characters (part 11)
Remove extraneous spaces in Korean MARC data (part 11)
CJK relevancy tests (part 8, part 9, part 10, part 11, part 12)

Discovery UI Application Changes

applying LocalParams on the fly (part 10)
detecting CJK in a query string (part 10)
additional quotation mark characters (part 12)
CJK advanced search (part 12)

Solr in General

Field Analysis

anchored text fieldtype (exactish match) (part 6)
anchored text and synonyms (part 6)
removing trailing punctuation with a patternReplaceCharFilterFactory (part 6)
LocalParams (part 10)
solrconfig file variables with LocalParams (part 10)
patternReplaceCharFilterFactory (part 11, part 6)
ICUTokenizer (part 7)
Unicode normalization via ICUFoldingFilterFactory (part 7)
ICUTransformFilterFactory (part 2, part 7)
synonyms (part 6)
positionIncrementGap (part 7)
autoGeneratePhraseQueries (part 7)

Debugging Relevancy

how to analyze relevancy problems (part 3, part 4, part 5, part 6, part 9, part 10)
debug argument (part 3)
analysis GUI (part 3)
relevancy score (part 3)
score visualization via http://solr.pl/en/ (part 3, part 5)

Edismax Issues

(part 3, part 4, part 5)
relevancy formula change (part 5)
relevancy workaround: anchored text fieldtype (exactish match) (part 6)
tie parameter (part 5)
boolean operators bug (and other bugs) (part 4)
split tokens bug SOLR-3589 (part 2)

Testing

first pass human testing of CJK (part 8)
broader human testing of CJK (part 9)

Relevancy/Acceptance/Regression Testing

overview (part 3)
approach for high level view of search result quality (part 9)
specific relevancy tests (part 8, part 9, part 10, part 11, part 12)
detecting CJK in a query string (part 10)

Data Based Decisions

is Boolean syntax utilized in actual user queries? (part 4)
workaround for edismax relevancy change (part 6)
what should CJK mm value be? (part 8)

CJK with Solr for Libraries, part 12

This is the twelfth of a series of posts about our experiences addressing Chinese, Japanese and Korean (CJK) resource discovery in SearchWorks, the Stanford University Libraries "catalog" built with Blacklight on top of our Solr index.

In this post, I'll discuss an mm tweak I snuck past you in part ten of this series, our fix for CJK advanced searches, and our accommodation for variant quotation marks that may appear with CJK characters.

CJK Tweak 1.1: mm adjustment

There is an additional mm tweak that I sort of snuck past you in part ten of this series. It comes into play when a query string has CJK and non-CJK characters. I didn't find many queries with this script combination in our logs, but I asked our East Asia librarians if this occurs and they said it does. Unfortunately, I didn't get any examples, nor did I find any in a quick perusal of logs. In any case, the fix is easy enough, and you've already seen it.

You may recall from part eight that we use an mm value of 3<86% when we encounter CJK characters in the query string. And in part ten, I showed you the following method:

The code in the purple box is as follows. When adjusting the mm value for CJK queries, we determine how many non-CJK tokens are present in the query string. If there are any non-CJK tokens, we increase the lower limit on the CJK mm value by that number. So if there are 2 non-CJK tokens, the mm would become 5<86%. This tweak is in both our test code and our SearchWorks application code, as you would expect.
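To make the arithmetic concrete, here is a minimal Ruby sketch of that adjustment. It is not the SearchWorks method: the names `CJK_MM`, `CJK_CHAR` and `cjk_mm_for` are made up for illustration, and it assumes a "non-CJK token" is any whitespace-separated token that contains no Han, Hangul, Hiragana or Katakana characters.

```ruby
# Hypothetical names; only the base value "3<86%" comes from the post.
CJK_MM = '3<86%'
CJK_CHAR = /\p{Han}|\p{Hangul}|\p{Hiragana}|\p{Katakana}/

# Returns the mm value for a query string that contains CJK characters.
# Assumption: a non-CJK token is a whitespace-separated token with no
# CJK characters in it; each one raises the lower limit by one.
def cjk_mm_for(query)
  non_cjk_tokens = query.split.count { |tok| tok !~ CJK_CHAR }
  return CJK_MM if non_cjk_tokens.zero?

  lower, rest = CJK_MM.split('<', 2)
  "#{lower.to_i + non_cjk_tokens}<#{rest}"
end

puts cjk_mm_for('北上次郎')            # 3<86%
puts cjk_mm_for('北上次郎 solr marc')  # 5<86%
```

The second call has two non-CJK tokens, so the lower limit moves from 3 to 5, matching the example above.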

CJK Advanced Search

In part ten of this series, I showed what we needed to do in our SearchWorks application code to use some different Solr parameters when we had CJK characters in a user query string. Unsurprisingly, we needed to do the analogous thing for the SearchWorks advanced search form:

Tests for this largely focused around the types of searches unique to advanced search: publisher, description, TOC, and on combining multiple fields. As before, we need to use the right qf and pf variables, and set the mm and qs parameters properly for CJK. There are test searches with these characteristics in the sw_index_tests project under spec/cjk/cjk_advanced_search_spec.rb.

But in some ways, the real concern is whether we have the advanced search form hooked up properly for CJK queries in the application code. So we have cucumber features to do CJK advanced search integration tests in the SearchWorks Rails application.

We also have tests to make sure we can combine fields properly with AND or OR:

The code for advanced searching in our Rails application is currently a one-off, so the details may not be all that applicable elsewhere. Here is some of our spec code for ensuring that the controller intercepts the user query strings for each box in the search form and adjusts qf, pf, mm and qs accordingly:

And here are specs for ensuring we handle CJK in some strings and non-CJK in other strings appropriately:

and in case it's helpful to you, here is the code that is doing the CJK character detection and altering the Solr params for CJK for advanced search:

You can see, in the "first part of CJK processing" that it is adjusting the mm, qs and new_field parameters if CJK characters are found. The new_field is ultimately used as a key to get the "handler" from the config object, which is defined in the catalog controller. Here's a snippet of the config object code:

and I'm sure you can imagine the definitions for all flavors of text boxes in the advanced search form. All of the above is analogous to the Rails application code I showed in part ten for adjusting Solr params for CJK characters in simple search. We can be confident the code does what we want it to because we wrote all those nice tests - specs to exercise the process_query code for CJK and non-CJK user text, and cucumber features for integration testing ensuring the Rails application was getting the expected results back from Solr, given user-entered text in the advanced search form. (And yes, all the tests pass.)
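As a rough illustration of the per-box idea (not the SearchWorks controller code), the Ruby sketch below picks a config key for each advanced search box depending on whether its text contains CJK characters. The method name `advanced_search_keys` and the `_cjk` key suffix are assumptions made for the example, and the mm and qs adjustments are left out.

```ruby
CJK_CHAR = /\p{Han}|\p{Hangul}|\p{Hiragana}|\p{Katakana}/

# fields: hash of form-field name => user-entered text.
# Returns a hash of config key => text. The "_cjk" suffix is an assumed
# naming convention for the CJK flavor of each field's handler config.
def advanced_search_keys(fields)
  fields.each_with_object({}) do |(field, text), out|
    next if text.nil? || text.strip.empty? # skip boxes left blank

    key = text.match?(CJK_CHAR) ? "#{field}_cjk" : field
    out[key] = text
  end
end

p advanced_search_keys('author' => '北上次郎', 'title' => 'grapes', 'subject' => '')
```

Each box is examined on its own, so a form with CJK in one box and non-CJK text in another gets the right flavor for each.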

CJK (and other) Variant Quote Characters

I received the following email from one of our Chinese librarians during testing:

Say what? There are special CJK flavor quotation marks? I got these replies to my question:

Since Solr only interprets " (ASCII hex 22) as the character for phrase searching, we needed the SearchWorks application to substitute hex 22 for the other quote characters. It turns out there are many variants: per http://en.wikipedia.org/wiki/Quotation_mark_glyphs, aside from hex 22 there are 14 different paired quotation marks. Here's our spec code (courtesy of my awesome colleague, Jessie Keck):

The SearchWorks Blacklight Rails application has some simple code in the catalog controller to substitute plain old " when it encounters any of the other quotation characters. This code is called as a before_filter in the catalog controller:

The special_quote_characters is defined in a helper:

With this code in place in the Rails application (thanks again to Jessie Keck), the spec passes. We had the East Asia librarians confirm the fix.
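For readers who want the gist without the application code, here is a minimal Ruby sketch of the substitution. The character list is only an illustrative subset of the variants, not the application's actual list, and `VARIANT_QUOTES` and `normalize_quotes` are hypothetical names.

```ruby
# Illustrative subset of variant quotation marks (assumption: the real
# list in the application is longer and may differ).
VARIANT_QUOTES = /[“”„«»「」『』〝〞＂]/

# Replaces any variant quotation mark with the plain ASCII double quote
# (hex 22) that Solr recognizes for phrase searching.
def normalize_quotes(query)
  query.gsub(VARIANT_QUOTES, '"')
end

puts normalize_quotes('「北上次郎」')        # "北上次郎"
puts normalize_quotes('“grapes of wrath”')  # "grapes of wrath"
```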

That's it!

So in this twelve(!) part series, I've shared why we wanted to improve CJK discovery, why it's a hard problem to solve in a multilingual environment, the changes we applied to our Solr indexing and why, and the changes we applied to our SearchWorks web application and why. I've shown exemplar tests for all the changes we made, and shared a lot of specifics regarding tokenizers and normalization filters chosen and why.

I've also shared how to analyze relevancy problems in general and the specifics of how I analyzed each of our problems. I've shared my approach for working with human testers, from expectation setting to feedback methodology, and our approach for a high-level view of CJK search results quality at a glance.

Looped in for the ride was information about the Solr edismax request handler, some of its bugs and our workarounds or lack thereof, as well as some information about edismax arguments like tie, mm, and qs. There was also some fun with synonyms and regular expressions for pattern replacement charFilters, and edismax variable declarations to be used as LocalParams.

And underlying all of this, of course, is the role our relevancy tests played in discovering, diagnosing and fixing problems. In addition to the edismax relevancy surprises, we used relevancy tests to vastly simplify our boost values, and to make it possible for us to explore the correctness of solutions without a lot of human eyeballs.

Last, but not least, there have been links to all the indexing code: from our fork of SolrMarc to our Solr config files to the CJKFoldingFilter for Solr, and, of course, to our relevancy tests.

Please feel free to fork the code, make contributions back, ask questions, and otherwise improve on this work. I'm sure I made some stupid decisions, and made some things more complex than was necessary. This work is by no means perfect, but our East Asia librarians are very pleased with it and believe it is the best solution available at this time.

I know the order of topics presented may not have been ideal, but the important thing was to get the information out there.

Tuesday, January 21, 2014

CJK with Solr for Libraries, part 11

This is the eleventh of a series of posts about our experiences addressing Chinese, Japanese and Korean (CJK) resource discovery in SearchWorks, the Stanford University Libraries "catalog" built with Blacklight on top of our Solr index.

Let's recall the list of blocker, must do, and should do problems that cropped up with level 2 evaluation of CJK discovery in part nine (after relevancy tweaks for edismax applied per part six, Solr fieldtype for text_cjk defined in part seven):

Blockers:

other production behaviors broken (fix described in part ten)

Must Dos:

CJK phrase searching broken (fix described as CJK Tweak 2 in part nine)
CJK author + title searches broken (fix described as CJK Tweak 3 in part nine)
CJK advanced search broken

Should Dos:

cope with extra spaces in our CJK data
additional "variant" Han/Kanji characters translated to simplified Han/Kanji characters
accommodate variant quote marks
beef up our CJK test suite (see https://github.com/sul-dlss/sw_index_tests/tree/master/spec/cjk)
publish CJK solutions for others to use (that's these blog posts ...)

The first two "Should Do" items are what I'll discuss here.

Extraneous Spaces in MARC CJK Metadata

As I stated in the first post of this series, cataloging practice for Korean for many years was to insert spaces between characters according to "word division" cataloging rules (See http://www.loc.gov/catdir/cpso/romanization/korean.pdf, starting page 16.) End-user queries would not use these spaces. It 's analog ous to spac ing rule s in catalog ing be ing like this for English.

My trusty Stanford-Marc-Institutional-Memory guy, Vitus Tang, assured me there had been no practice, at least for us, to introduce extra whitespace into our Chinese or Japanese data. Only Korean got the benefit of this ... er ... interesting ... approach. Here's an example of a record with extra spaces in its title (which I have emphasized with red):

Here's the particular 880 field with the four spaces in the linked 245a (also emphasized with red):

And here are the specs for how we'd like it to work:

According to our Korean librarians, users would most likely put no spaces in this query. They would next most likely put a single space between 이 and 될, or possibly a space between 이 and 될 as well as between 때 and 까.

Our full specs for Korean spacing issues are here: https://github.com/sul-dlss/sw_index_tests/blob/master/spec/cjk/korean_spacing_spec.rb.

Why are these extra spaces a problem? Think of it this way: for every space added, a potential bigram is removed. For 7 characters, we have 6 overlapping bigrams + 7 unigrams for 13 possible tokens. If we add a space, we have 5 bigrams + 7 unigrams for 12 tokens. If we add 4 total spaces, as in the record above, we have only 2 bigrams and 7 unigrams for 9 tokens. If the query is 7 characters with no spaces, and mm = 86% (as shown in part eight), we need 10 matching tokens. If we need 10 matching tokens, but there are only 9 tokens in the record itself, it will never match.
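The token counts above can be checked with a few lines of Ruby. This sketch assumes bigrams are only formed between adjacent characters inside a whitespace-delimited chunk, and it uses a placeholder seven-syllable Hangul string rather than the title from the record; `cjk_token_count` is a made-up name.

```ruby
# Counts unigram + bigram tokens for a string.
# Assumption: whitespace breaks bigram formation, so each chunk of n
# characters contributes n unigrams and n - 1 overlapping bigrams.
def cjk_token_count(text)
  chunks = text.split
  unigrams = chunks.sum(&:length)
  bigrams = chunks.sum { |chunk| chunk.length - 1 }
  unigrams + bigrams
end

puts cjk_token_count('가나다라마바사')      # 13 (no spaces)
puts cjk_token_count('가 나다라마바사')     # 12 (one space)
puts cjk_token_count('가 나 다라 마 바사')  # 9  (four spaces)
```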

Vitus Tang had a great idea for addressing the information retrieval problems introduced by these extra spaces: what if we indexed the Korean text as if it didn't have spaces?

CJK Tweak 4: removing extraneous Korean spaces

Removing spaces from a string before analyzing it was a perfect case for a pre-tokenization character filter, PatternReplaceCharFilter. This is a tricky regex because it requires lookahead - you have to go past the space to see if it should be removed. To make it easier to follow, for now we will refer to a Korean character as \k and whitespace is \s, per usual regex predefined character classes. The lookahead construct is (?=X), so we get:

\k*\s+(?=\k)

to remove any whitespace between Korean characters.

Sadly, though, Korean characters can be Hangul script, which is only used by Korean, or they can be Han script, which is used by Chinese, Japanese and Korean. And Korean strings can have a mix of Hangul and Han characters (and Arabic numerals, as shown above). We can only know for certain we have Korean text if we encounter at least one Hangul character in the string. Using Unicode script names as character classes, per Java 7, we can express \k as

  [\p{Hangul}\p{Han}]

but to guarantee a Korean phrase, we must insist on at least one Hangul character. If we require a Hangul character at the beginning of a string, the lookahead expression to eliminate internal whitespace becomes:

  (\p{Hangul}\p{Han}*)\s+(?=[\p{Hangul}\p{Han}])

If we require a Hangul character at the end of a string, the lookahead expression to eliminate internal whitespace becomes:

  ([\p{Hangul}\p{Han}])\s+(?=[\p{Han}\s]*\p{Hangul})
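Ruby's Regexp also understands Unicode script names, so the two expressions can be tried outside Solr. The sketch below applies them one after the other, replacing each match with its first capture group; that replacement, the constant names and the sample strings are assumptions for illustration, not our charFilter configuration.

```ruby
# The two patterns from the text, written as Ruby regexes.
HANGUL_FIRST = /(\p{Hangul}\p{Han}*)\s+(?=[\p{Hangul}\p{Han}])/
HANGUL_LATER = /([\p{Hangul}\p{Han}])\s+(?=[\p{Han}\s]*\p{Hangul})/

# Removes whitespace between CJK characters only when a Hangul character
# shows the text is Korean. Assumption: each match is replaced by its
# first capture group, which drops the whitespace.
def strip_korean_spaces(text)
  text.gsub(HANGUL_FIRST, '\1').gsub(HANGUL_LATER, '\1')
end

puts strip_korean_spaces('가 나 다라')  # 가나다라
puts strip_korean_spaces('韓國 가나')   # 韓國가나
puts strip_korean_spaces('中国 经济')   # 中国 经济 (no Hangul, unchanged)
```

The last call is the important one for Chinese and Japanese: with no Hangul anywhere in the string, neither pattern matches and the space is kept.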

This was non-obvious, but our tests for both Korean and Chinese helped us know when we got it wrong, and a regex tester (we used Rubular) helped us get the lookahead correct. Thanks go to Jon Deering for helping me with this. The charFilters, which go before the tokenizer in the analysis chain, look like this:

Sadly, though, we're not using Java 7; we're using Java 6. Java 6 regular expressions don't allow Unicode script names as character classes, so we have to use Unicode block names instead. This means

  \p{Hangul}

becomes

  [\p{InHangul_Jamo}\p{InHangul_Compatibility_Jamo}\p{InHangul_Syllables}]

and

  \p{Han}

becomes the awful (reformatted for readability)

Yuck. You can look at the resulting char filters in our Solr schema; I won't include them here.

When we put these charFilters in the analysis chain, like so

the Korean spacing specs passed. W00t!

Variant Han Characters

As I've mentioned before, the ICU transform provided by Solr for traditional Han characters to simplified Han characters is incomplete. Some Japanese modern Kanji characters are not included in the transform, so Japanese search results for these modern Kanji characters won't include the corresponding traditional Kanji character. The following feedback gives an example:

A corresponding relevancy test might be:

Similarly, there are Chinese Unicode variants of traditional characters in our data (and in user queries):

The corresponding relevancy test might be:

So we want to improve recall by adding some characters to the ICU transform from traditional Han to simplified Han. However, at the time, Solr only worked with the ICU System transforms. I didn't see an easy way to alter the ICU transform, but knew I could write a Solr filter that I could add to the analysis chain to accomplish our goal.

The specs for which characters to include came from our East Asia librarians. Most of the source characters are from Japanese modern Kanji, but there are also some Chinese traditional Han variant characters not covered by the ICU transform. This approach was made far less daunting when I told the librarians that we could add in more characters at any time -- they didn't need to provide me with a comprehensive list from the get-go.

One note: for a CJK-illiterate such as myself, the pattern recognition on similar CJK characters is not great, especially for traditional Han script characters. It was really helpful when the Unicode codepoint was provided, in addition to the character in question. I ended up doing a lot of web lookups of codepoint to character and vice versa.

CJK Tweak 5: CJKFoldingFilter

The same wonderful folks at http://solr.pl who created the pie chart visualization of Solr scores have two blog posts on writing Solr filters. Their first post is about writing one for Solr 3.6; their second post is for Solr 4.1. By the time I was implementing this, we were at Solr 4.4, but I found both posts useful, as well as various other sources I bumped into via Google.

Conceptually, what we want this filter to do is very simple: whenever it encounters characters with particular Unicode codepoints, substitute different Unicode characters, such that the end result is that the original character, when passed through our entire analysis chain, will match the equivalent (simplified Han/modern Kanji) character. The code for this Solr filter, complete with test code, jars for testing, and ant build scripts, is at https://github.com/solrmarc/CJKFoldingFilter.
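The idea is easy to sketch in Ruby, even though the real filter is Java. The three pairs below are well-known Japanese modern/traditional character pairs chosen purely for illustration; they are not the project's mapping table, and `VARIANT_MAP` and `fold_variants` are hypothetical names. The sketch also assumes the fold goes from the variant to the traditional form, so that a later traditional-to-simplified step treats both the same way.

```ruby
# Illustrative pairs only (Japanese modern form => traditional form).
# Assumption: NOT the actual CJKFoldingFilter table.
VARIANT_MAP = { '仏' => '佛', '壱' => '壹', '桜' => '櫻' }.freeze

# Replaces each mapped variant character and leaves all others alone.
def fold_variants(token)
  token.each_char.map { |ch| VARIANT_MAP.fetch(ch, ch) }.join
end

puts fold_variants('仏教')  # 佛教
puts fold_variants('solr')  # solr
```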

The source code itself is very small. We need a factory class that extends org.apache.lucene.analysis.util.TokenFilterFactory:

and we need the filter we defined to extend org.apache.lucene.analysis.TokenFilter. That code is a bit long to include inline, but you can see it at src/edu/stanford/lucene/analysis/CJKFoldingFilter.java in the CJKFoldingFilter project on github. The core of the Filter code is a simple mapping from each variant codepoint to another codepoint:

Note that I deliberately wrote the tests with a different notation, to hopefully catch any transcription errors:

In addition to various jars needed for testing, I found I needed some additional testing code to get my tests to work. I figured this out by modeling my tests on what was in core code, and having the compile or execution output indicate what was missing. All the additional testing code is in test/src/org/apache/lucene/analysis/util/BaseTokenStreamFactoryTestCase.java, which I shamelessly copied from the Lucene 4.4.0 analysis/common/src/test folder. The code I copied even has these cryptic comments:

Presumably this oddity has been addressed in a later release of Solr.

In any case, the CJKFoldingFilter I wrote passes all its tests, and when the CJKFoldingFilter jar file is added to the Solr lib directory (the same place where all the ICU jars go), and the filter is added to our fieldtype analysis chain:

all our Japanese and Chinese variant character tests pass.

It is crucial to note that CJKFoldingFilter must be added to the analysis chain BEFORE the ICUTransform from Traditional Han to Simplified Han.

You may have also noticed that this is the final version of the text_cjk fieldtype definition - it is what we are currently using for production SearchWorks.

What Else?

Future post(s) will cover getting CJK queries working with the SearchWorks advanced search form, accommodating variant quotation marks for Solr phrase searching, and an mm tweak I snuck past you in part ten of this series.

Thursday, January 16, 2014

CJK with Solr for Libraries, part 10

This is the tenth (gah!) of a series of posts about our experiences addressing Chinese, Japanese and Korean (CJK) resource discovery in SearchWorks, the Stanford University Libraries "catalog" built with Blacklight on top of our Solr index.

In the previous post, I described our level 2 evaluation of CJK resource discovery, and noted a significant blocker to moving forward: the CJK improvements we were making broke current production behavior.

CJK Indexing Breaks Production?

I call our sw_index_tests relevancy tests, because these tests submit Solr queries to our index and evaluate the results for "correctness." In a sense, though, they could be called regression tests - changes to the Solr index shouldn't break any of these tests. Of course, they do break sometimes when records are added to or removed from our collection, but we try to keep them from being too brittle without sacrificing much test precision.

Let me step back a minute and give a bit of context. Our data is almost entirely bibliographic metadata in the MARC format, and frequently contains data in scripts beyond English. When data is in a different script, the MARC 21 standard expects it to be put in a linked 880 field. SearchWorks has always indexed the data in the 880s, and has always recognized that the data isn't English, and therefore should not be stemmed or have stopwords applied. This is the fieldtype definition we use for these linked 880 vernacular script fields:

We also use this field type for unstemmed versions of English fields, so we can sort more exact matches first. For example, we index the title both stemmed and unstemmed, and we boost the unstemmed matches more than the stemmed ones so they are scored higher.

Note that the textNoStem fieldtype uses the WhitespaceTokenizer, so tokens are created by splitting text on whitespace. When we spun up SearchWorks five years ago, we wanted to have behaviors like those below (all of these are from sw_index_tests) for non-alpha characters due to specific resources in our collection. You can even see the bug tracker (JIRA) ids in many of the tests as tags.

numbers mashed with letters:

hyphens:

musical keys (also here):

programming languages:

As you've probably guessed, what the above tests all have in common is that they failed when we changed our textNoStem fieldtype definition to use a Tokenizer that assigns a CJKBigramFilter friendly token type. So if we used this fieldtype definition instead:

we got a number of test failures:

Most of the failures we got by substituting the text_cjk fieldtype for textNoStem are due to hyphens, programming languages with special characters, and musical keys.

Let's use the handy-dandy field analysis GUI to see what happens to a string with these special characters when it passes through the textNoStem fieldtype:

and what happens when we pass the same string through the ICUTokenizer:

Note that all of the non-alphanum characters have been dropped by the ICUTokenizer. The same happens with the StandardTokenizer:

So it seems we are unlikely to have a one-size-fits-all fieldtype definition that will work for CJK text analysis and also for non-alphanum characters. What should we do?

Context-Sensitive Search Fields

Recall that as discussed in previous posts, we need to use a different mm and qs value to get satisfactory CJK results, but we want to keep our original mm and qs values for non-CJK discovery. So we are already on the hook for the SearchWorks Rails application to recognize when it has a user query with CJK characters in order to pass in CJK specific mm and qs values. I realized that I could make CJK flavors of all qf and pf variables in the requestHandler in solrconfig.xml, like this:

So now when our Rails application detects CJK characters in a user's query, it needs to pass in special mm, qs, qf, pf, pf3 and pf2 arguments to the Solr requestHandler. So a normal Solr query for an author search might look like this (I am not url-encoding here):

  q={!qf=$qf_author pf=$pf_author pf3=$pf3_author pf2=$pf2_author}Steinbeck

while a Solr query for a CJK author search would look like this (scroll right to see the whole thing):

  q={!qf=$qf_author_cjk pf=$pf_author_cjk pf3=$pf3_author_cjk pf2=$pf2_author_cjk}北上次郎&mm=3<86%25&qs=0

You may already be familiar with Solr LocalParams, the mechanism by which we are passing in qf and pf settings as part of the q value above.
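Building those q values is simple string assembly. Here is a minimal Ruby sketch, assuming the `$qf_*`/`$pf_*` variable naming shown above; `local_params_q` is a hypothetical helper name, not a method from our code.

```ruby
# Builds the q argument for a search type (e.g. 'author'), pointing the
# qf/pf/pf3/pf2 LocalParams at the matching solrconfig.xml variables.
# Assumption: CJK flavors of the variables carry a "_cjk" suffix.
def local_params_q(search_type, user_query, cjk: false)
  suffix = cjk ? "#{search_type}_cjk" : search_type
  params = %w[qf pf pf3 pf2].map { |p| "#{p}=$#{p}_#{suffix}" }.join(' ')
  "{!#{params}}#{user_query}"
end

puts local_params_q('author', 'Steinbeck')
puts local_params_q('author', '北上次郎', cjk: true)
```

The two calls print the same q values as the two example queries above (without the separate mm and qs arguments).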

Now, how do we test this? Does it work?

Applying LocalParams on the Fly

First, I wanted to try this new approach in our relevancy tests to see if it worked. If you look at this test, you'll see there isn't anything in the test itself indicating it uses CJK characters:

This test uses the magic of rspec shared_examples, which were employed to DRY up the test code. The shared example code looks like this:

The method called by the shared example is in the spec_helper.rb file. I'll do a quick trace through here:

So the cjk_query_resp_ids method calls cjk_q_arg, which, for our author search, calls cjk_author_q_arg:

which clearly inserts the correct cjk author flavor qf and pf localparams.

What about mm and qs? These are added later in the helper code chain; the following method is also called from cjk_query_resp_ids:

which in turn calls the solr_response method:

Detecting CJK in a query string

What I've shown above is that my CJK tests use a method, cjk_query_resp_ids, to ensure the correct CJK localparams are sent in to Solr. But I haven't shown how to automatically detect if a query has CJK characters.

Look again at the solr_response method. It first calls a method that counts the number of CJK unigrams in the user query string, which is a snap with Ruby Regexp supporting named character properties:
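A minimal sketch of that kind of counting, using Ruby's Unicode script properties, might look like the following. The method name `cjk_unigram_count` is made up for this example, and treating Han, Hangul, Hiragana and Katakana as the CJK scripts is an assumption.

```ruby
# Counts the characters in a query string that belong to a CJK script.
# Assumption: "CJK" here means Han, Hangul, Hiragana or Katakana.
def cjk_unigram_count(query)
  query.scan(/\p{Han}|\p{Hangul}|\p{Hiragana}|\p{Katakana}/).size
end

puts cjk_unigram_count('北上次郎')        # 4
puts cjk_unigram_count('Steinbeck')       # 0
puts cjk_unigram_count('ひらがな solr')   # 4
```

A count greater than zero is enough to tell the caller that the query needs the CJK flavors of the Solr parameters.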

If CJK characters are found in the user query, then solr_response calls a method that returns the mm and qs parameters:

Even if you didn't follow all that, we definitely end up with the following being sent to Solr:

  q={!qf=$qf_author_cjk pf=$pf_author_cjk pf3=$pf3_author_cjk pf2=$pf2_author_cjk}北上次郎&mm=3<86%25&qs=0

Using all that helper code above, and a bit more for other types of searches (title, subject, everything ...), I can assert that we did get all our production tests to pass as well as all our CJK tests -- all the issues with hyphens and musical key signs and so on went away. Yay!

(I want to mention that the layers of indirection in that code are useful for other contexts than the particular test illustrated.)

Context-sensitive CJK in Our Web App

Great - our tests all pass, and the CJK tests we have pass too! But what about repeating all that fancy footwork in the SearchWorks web application? We need to get the correct requests sent to Solr based on automatic CJK character detection in user entered queries.

In fact, the SearchWorks Rails application uses a lot of the same code as what I use for our relevancy tests. Let's go through it.

As you know, SearchWorks is built using Blacklight, which is a Ruby on Rails application. It all starts with a "before_filter" in the CatalogController:

Which is really a method that calls another method and adds the result to the existing Solr search parameters logic (Blacklight magic):

The cjk_query_addl_params method checks the user query string for CJK characters, adjusts mm and qs accordingly, and applies the local params as appropriate for CJK flavored searches:

The two methods I highlighted with purple boxes above are exact copies of those same methods in our relevancy testing application. They live in a helper file because they are also used to tweak advanced search for CJK, which will be addressed in a future post.

We put these changes in a test version of the SearchWorks application, asked the CJK testers to try it out, and it all worked! Also, we have end-to-end test code for the application that ensured these app changes didn't negatively impact any existing functionality while it did afford better CJK results.

D'oh!

In the course of writing this post, I realized that I probably can use the same fieldtype for non-CJK 880 text and for CJK 880 text; I may only need to distinguish between un-stemmed English fields and all vernacular fields, rather than between CJK and non-CJK 880 text. Since this approach could reduce the number of qf and pf variables we have in our solrconfig.xml files, I may experiment with it in the future.

Where are we Now?

This post showed how I removed our blocker to putting CJK discovery improvements into production: how I analyzed how CJK Solr field analysis was breaking production behaviors, my chosen solution, and how we tested and applied that solution.

With the blocker removed, and two of the three items on the "must do" list fixed (see previous post), we asked the testers via email to vet our work. They agreed we had fixed the intended problems and not broken anything else. At this point, we worked out a schedule to push these fixes to production.

Well, in reality, one of the fixes on the "should do" list went to production as well, and we agreed that getting advanced searching working for CJK could be a "phase 2" improvement. Future posts will address all the other improvements: fixing advanced search for CJK and the four other tweaks we added to improve CJK resource discovery (not necessarily in that order). And of course we will share the final Solr recipes and our code (already available via github).