Discovery Grindstone: (LC) Call Number Searching in Solr

(what you really want to know: http://paste.pocoo.org/show/279886/ or http://pastie.org/1242164)

I was recently charged with fixing call number searching in SearchWorks, and it was a great opportunity to set up a proper testing framework for Search Acceptance tests.

Search Acceptance tests need to assert the following:

Given a set of Solr documents with known values
and Given a known query string
Then queries against the Solr index should return the expected results.

The testing framework allows me to experiment with different Solr configurations (e.g. field analysis or request handler settings) so I can find out how best to fulfill the acceptance criteria. Of course I made heavy use of the Solr (solr baseurl)/admin/analysis.jsp page, but I didn't want to manually check all the acceptance criteria every time I tweaked a promising configuration. I would much rather fire off automated tests to ensure I'm not breaking case B while I fix case G. And for those of you who do ad hoc testing in this situation ... do you really run through all the test cases every time you make a change? Don't lie to me - you don't do it.

My Solr acceptance test framework uses Cucumber* (http://cukes.info/), Ruby, Rake (http://rake.rubyforge.org/) and Jetty. Cucumber is run tidily from a rake task; I have Cucumber do the following**:

1. spin up a test Solr instance
2. delete all documents in Solr index
3. add test Solr documents in xml file to Solr
4. run cucumber features/scenarios
5. shut down test Solr instance
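
As a rough illustration of the delete and add steps in the list above, here is a minimal Ruby sketch that builds the two Solr XML update messages. It is not the code from the actual test suite: the helper names and the field names ("id", "callnum_search") are assumptions made up for this example. In practice the messages would be POSTed to Solr's update handler and followed by a commit.

```ruby
# Sketch only: helper names and field names are hypothetical.

# Escape the characters that are significant in XML text and attributes.
def xml_escape(text)
  text.to_s.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;').gsub('"', '&quot;')
end

# Delete step: a delete-by-query message that matches every document.
def delete_all_xml
  '<delete><query>*:*</query></delete>'
end

# Add step: an <add> message holding one <doc> per hash of field => value.
def add_docs_xml(docs)
  doc_xml = docs.map do |doc|
    fields = doc.map do |name, value|
      %(<field name="#{xml_escape(name)}">#{xml_escape(value)}</field>)
    end
    "<doc>#{fields.join}</doc>"
  end
  "<add>#{doc_xml.join}</add>"
end

test_docs = [
  { 'id' => 'doc1', 'callnum_search' => 'A1 .B2' },
  { 'id' => 'doc2', 'callnum_search' => 'N7433.4 .G73 H53 2007' }
]

puts delete_all_xml
puts add_docs_xml(test_docs)
```

Keeping the test documents as plain data like this makes it easy to see exactly which raw values each scenario is exercising.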

Now I'm ready to write my acceptance tests.

For (LC) call number searching, my test cases covered:

searches are always left anchored:

  class doesn't match against cutter (query "A1" doesn't match "B2 .A1")
  class doesn't match against volume designations (query "V1" doesn't match "A1 .C2 V.1")
  query that includes first cutter should not match wrong or second cutter
  query that includes first cutter should not match class value repeated for second cutter (query "B2 A1" should not match "A1 .B2 A1")

classifications match only like classifications:

  A1 does not match A1.1 or A11
  A1.11 does not match A11.1 or A111

searches that include only the first letter of the first cutter must match (query "A1 B" should match "A1 B2" and "A1 B77")
full call number queries must match (an example of moderate length: "N7433.4 .G73 H53 2007")
punctuation and spacing before first cutter doesn't matter (all these are equivalent: "A1 .B2", "A1.B2", "A1 B2", "A1B2")
space between class letters and digits doesn't matter ("PQ 3563" same as "PQ3563")
a very small number of real test cases

There are usually many variants of the simply stated test cases above. For example, are searches for each of the likely punctuation and spacing variants before the first cutter working correctly when the classification has more than one letter? When there is a decimal point in the classification? When the raw value to go into the Solr index has a period, but the query does not?
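
To make the "equivalent variants" idea concrete, here is a small Ruby sketch of a normalizer that maps the spacing and punctuation variants before the first cutter onto one canonical string. This is only an illustration of the desired behavior, not the Solr analysis chain from the linked field type definitions; the pattern and the method name are assumptions, and the pattern only covers simple LC-shaped values.

```ruby
# Hypothetical normalizer, for illustration only.
# Groups: class letters, class digits (optional decimal), first cutter, the rest.
LC_PATTERN = /\A([A-Z]{1,3})\s*(\d+(?:\.\d+)?)\s*\.?\s*([A-Z]\d+)?\s*(.*)\z/

def normalize_lc(raw)
  value = raw.strip.upcase
  m = LC_PATTERN.match(value)
  return value unless m # not LC-shaped: leave it alone

  letters, digits, cutter, rest = m.captures
  parts = ["#{letters}#{digits}"] # "PQ 3563" and "PQ3563" both become "PQ3563"
  parts << cutter if cutter       # period and spacing before the cutter are dropped
  parts << rest unless rest.empty?
  parts.join(' ')
end

['A1 .B2', 'A1.B2', 'A1 B2', 'A1B2'].each do |variant|
  puts "#{variant.inspect} -> #{normalize_lc(variant).inspect}"
end
puts normalize_lc('PQ 3563')
puts normalize_lc('PQ 3563.5 .A2')
```

Note that a decimal point inside the classification survives normalization, so A111, A1.11 and A11.1 stay distinct, while the period before the cutter does not.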

For each test, we need

Solr documents with appropriate fields and values to exercise correct and incorrect possibilities,
cucumber scenarios capturing the acceptance criteria as applied to the test data.

So, for various call number searching behaviors, I created xml for Solr docs to put each test case through its paces. Then I wrote cucumber scenarios to perform searches against the test data and examine the results.
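
The pairing of test data with expectations can be sketched in plain Ruby, outside Cucumber. Everything below is an assumption for illustration: the document ids, the table layout, and especially toy_search, a deliberately crude in-memory stand-in for a real Solr query so that the example runs on its own.

```ruby
# Test data: document id => call number (already normalized, for brevity).
DOCS = {
  'a1_b2'   => 'A1 B2',
  'a1_b77'  => 'A1 B77',
  'b2_a1'   => 'B2 A1',
  'a11_b2'  => 'A11 B2',
  'a1.1_b2' => 'A1.1 B2'
}.freeze

# Acceptance criteria: query, ids that must come back, ids that must not.
CASES = [
  { query: 'A1',    expect: %w[a1_b2 a1_b77], reject: %w[b2_a1 a11_b2 a1.1_b2] },
  { query: 'A1 B',  expect: %w[a1_b2 a1_b77], reject: %w[b2_a1] },
  { query: 'A1 B2', expect: %w[a1_b2],        reject: %w[a1_b77 b2_a1] }
].freeze

# Stand-in for Solr: left-anchored token match, exact on the classification,
# prefix match on the final query token when it comes after the classification.
def toy_search(query)
  q = query.split
  DOCS.select do |_id, callnum|
    d = callnum.split
    next false if q.length > d.length

    q.each_with_index.all? do |tok, i|
      last_after_class = i == q.length - 1 && i > 0
      last_after_class ? d[i].start_with?(tok) : d[i] == tok
    end
  end.keys
end

# Returns a list of failure messages; an empty list means every case passed.
def check(cases, search)
  cases.flat_map do |c|
    found = search.call(c[:query])
    missing = c[:expect] - found
    wrong = c[:reject] & found
    msgs = []
    msgs << "#{c[:query]}: missing #{missing.join(', ')}" unless missing.empty?
    msgs << "#{c[:query]}: unexpected #{wrong.join(', ')}" unless wrong.empty?
    msgs
  end
end

failures = check(CASES, method(:toy_search))
puts failures.empty? ? 'all criteria pass' : failures
```

Listing the ids that must NOT match alongside the ones that must is the important part: most of the call number criteria above are about avoiding false matches.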

I am now ready to determine the best Solr text analysis to fulfill the desired call number searching behaviors.

I wasn't sure which of a variety of Solr analysis possibilities would work best:

EdgeNGram (ngrams, but only the ones starting with the first character of the field)
WordDelimiterFilter with just the right settings
WhitespaceTokenizer with careful pre-tokenization normalization
PatternTokenizer with careful pre-tokenization normalization
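
For readers who haven't met edge n-grams, here is a tiny Ruby sketch of the idea: every prefix of the value becomes its own indexed term. The method name and keyword arguments are made up for this example (loosely echoing min/max gram size settings); it is not Solr's implementation.

```ruby
# Edge n-grams of a string: every prefix between min_gram and max_gram characters.
def edge_ngrams(text, min_gram: 1, max_gram: text.length)
  upper = [max_gram, text.length].min
  (min_gram..upper).map { |len| text[0, len] }
end

p edge_ngrams('A1 B2')
# => ["A", "A1", "A1 ", "A1 B", "A1 B2"]

callnum = 'N7433.4 G73 H53 2007'
puts "edge n-grams:      #{edge_ngrams(callnum).length} terms"
puts "whitespace tokens: #{callnum.split.length} terms"
```

Left anchoring comes for free with this approach, since a query only has to equal one of the stored prefixes, but the term count grows with the length of every call number.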

(As an aside, Git was a great tool for keeping all my approaches separate: I had a branch for each approach, and could switch at will, and also had a way to go back to older tries ... all right on my laptop. )

So, what is the best approach? (drum roll ...)

I wasn't even sure it would be possible to fulfill all of my acceptance tests, so I was thrilled that I got all tests to pass for two different methods: the EdgeNGram and the WhitespaceTokenizer.

I put the field type definitions for both of them here: http://pastie.org/1242164 and here: http://paste.pocoo.org/show/279886/

I will be working with the WhitespaceTokenizer, though, because it will require far less disk space.

A few notes:

Solr local params (LocalParams) rule
a local param prefix query string does NOT go through analysis
queries with spaces in them can be a little tricky to test via cucumber
left anchoring tokenized searches requires a little cleverness
getting appropriate results for A111 vs. A1.11 vs. A11.1 can require some cleverness
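
Here is one way the first three notes might fit together, sketched in Ruby. Because a prefix query given via local params skips analysis, the client has to do whatever normalization the index-time analysis would have done; and because the query contains spaces and braces, it needs URL encoding. The field name, base URL and the very simple normalization (trim, upcase, collapse spaces) are all assumptions for illustration, not the SearchWorks configuration.

```ruby
require 'uri'

# Build request parameters for a prefix query expressed with Solr local params.
# The field name is hypothetical; the normalization is a placeholder for
# whatever the index-time analysis actually does.
def prefix_query_params(field, user_query)
  normalized = user_query.strip.upcase.squeeze(' ')
  { 'q' => "{!prefix f=#{field}}#{normalized}" }
end

params = prefix_query_params('callnum_search', ' a1  b ')
puts params['q']

# encode_www_form takes care of the braces, the "!" and the spaces.
query_string = URI.encode_www_form(params)
puts "http://localhost:8983/solr/select?#{query_string}" # hypothetical base URL
```

Building the query string with a form encoder rather than by hand avoids most of the trouble with spaces when the same request is issued from a test step.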

Of course, now they're going to ask me to get this working for our SUDOC and Dewey call numbers (we have about a half million of each of those flavors) and for our special local call numbers. Guess what? I'll ask for test cases and add 'em to the ones I have and I'll feel very secure that I won't be breaking anything I already test for as I accommodate new criteria.

* for Cucumber documentation, I find the wiki pages very helpful: http://github.com/aslakhellesoy/cucumber/wiki/_pages

** Cucumber does not support a "Before/After All Scenarios - do this once per feature file", so I fake it in a feature file by having its first scenario do steps 1-3 and its last scenario do step 5. The actual tests are all the scenarios in between. (Cucumber does support Before/After each scenario, in a couple of different ways: http://github.com/aslakhellesoy/cucumber/wiki/Background and there is also a way to run something before any scenario using global hooks: http://github.com/aslakhellesoy/cucumber/wiki/hooks ).

Friday, October 22, 2010
