I was recently charged with fixing call number searching in SearchWorks, and it was a great opportunity to set up a proper testing framework for Search Acceptance tests.
Search Acceptance tests need to assert the following:
- Given a set of Solr documents with known values
- and Given a known query string
- Then queries against the Solr index should return the expected results.
My Solr acceptance test framework uses Cucumber* (http://cukes.info/), Ruby, Rake (http://rake.rubyforge.org/) and Jetty. Cucumber is tidily run from a rake task; I have Cucumber do the following**:
- spin up a test Solr instance
- delete all documents in Solr index
- add the test Solr documents from an XML file to Solr
- run cucumber features/scenarios
- shut down test Solr instance
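As a rough illustration of steps 2 and 3, here is a minimal Ruby sketch that builds the Solr XML update messages for "delete everything" and "add these test documents". The helper names (xml_escape, solr_add_xml) and the field names "id" and "callnum" are made up for this example, and the trailing commit is my assumption about what the test instance needs to make the changes visible. Posting the messages to the test Solr's update handler is left out.

```ruby
# Escape the characters that are special in XML text and attribute values.
def xml_escape(text)
  text.to_s.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;').gsub('"', '&quot;')
end

# Build a Solr <add> message from an array of hashes (field name => value).
def solr_add_xml(docs)
  body = docs.map do |doc|
    fields = doc.map do |name, value|
      %(<field name="#{xml_escape(name)}">#{xml_escape(value)}</field>)
    end
    "<doc>#{fields.join}</doc>"
  end
  "<add>#{body.join}</add>"
end

# Standard Solr XML update messages for wiping the index and committing.
DELETE_ALL_XML = '<delete><query>*:*</query></delete>'
COMMIT_XML     = '<commit/>'

# Hypothetical test documents; "id" and "callnum" are assumed field names.
test_docs = [
  { 'id' => '1', 'callnum' => 'A1 .B2' },
  { 'id' => '2', 'callnum' => 'N7433.4 .G73 H53 2007' }
]

# The order mirrors the steps above: empty the index, load the fixtures, commit.
messages = [DELETE_ALL_XML, solr_add_xml(test_docs), COMMIT_XML]
puts messages
```

Keeping the fixtures as plain hashes makes it easy to add one document per acceptance criterion and to see at a glance which value each test exercises.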
For (LC) call number searching, my test cases covered:
- searches are always left anchored
- class doesn't match against cutter (query "A1" doesn't match "B2 .A1")
- class doesn't match against volume designations (query "V1" doesn't match "A1 .C2 V.1")
- query that includes first cutter should not match wrong or second cutter
- query that includes first cutter should not match class value repeated for second cutter (query "B2 A1" should not match "A1 .B2 A1")
- classifications match only like classifications (A1 does not match A1.1 or A11; A1.11 does not match A11.1 or A111)
- searches that include only the first letter of the first cutter must match (query "A1 B" should match "A1 B2" and "A1 B77")
- full call number queries must match (an example of moderate length: "N7433.4 .G73 H53 2007")
- punctuation and spacing before first cutter doesn't matter (all these are equivalent: "A1 .B2", "A1.B2", "A1 B2", "A1B2")
- space between class letters and digits doesn't matter ("PQ 3563" same as "PQ3563")
- a very small number of real test cases
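To make the criteria above concrete, here is a small plain-Ruby model of them. It is not the Solr analysis I ended up with, just a sketch of the expected behavior: the names LC_PATTERN, parse_lc and lc_match? are invented for this example, and it assumes a simplified LC shape of one to three class letters, class digits with an optional decimal, then everything else. The class must match exactly and the remainder is compared as a left-anchored prefix. One consequence of that simplification is that a query ending in "B2" would also match a cutter "B22"; the criteria above don't say whether that is wanted.

```ruby
# Simplified LC shape: class letters, class number (optional decimal), the rest.
LC_PATTERN = /\A\s*([A-Z]{1,3})\s*(\d+(?:\.\d+)?)\s*(.*)\z/i

# Split a call number into [classification, normalized remainder],
# or return nil if it does not look like an LC call number.
def parse_lc(call_number)
  m = LC_PATTERN.match(call_number)
  return nil unless m

  classification = m[1].upcase + m[2]
  # Drop periods and collapse whitespace so "A1 .B2" and "A1B2" agree.
  remainder = m[3].upcase.delete('.').split.join(' ')
  [classification, remainder]
end

# True when the query is a left-anchored match for the call number:
# identical classification, and the query remainder is a prefix.
def lc_match?(query, call_number)
  q = parse_lc(query)
  d = parse_lc(call_number)
  return false unless q && d

  q[0] == d[0] && d[1].start_with?(q[1])
end

puts lc_match?('A1 B', 'A1 B77')        # true
puts lc_match?('A1', 'B2 .A1')          # false
puts lc_match?('A1.11', 'A11.1')        # false
puts lc_match?('B2 A1', 'A1 .B2 A1')    # false
```

A model like this is handy for sanity-checking the test data itself before blaming the Solr configuration for a failing scenario.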
For each test, we need:
- Solr documents with appropriate fields and values to exercise correct and incorrect possibilities,
- cucumber scenarios capturing the acceptance criteria as applied to the test data.
I am now ready to determine the best Solr text analysis to fulfill the desired call number searching behaviors.
I wasn't sure which of a variety of Solr analysis possibilities would work best:
- EdgeNGram (ngrams, but only the ones starting with the first character of the field)
- WordDelimiterFilter with just the right settings
- WhitespaceTokenizer with careful pre-tokenized normalization
- PatternTokenizer with careful pre-tokenized normalization
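For anyone who hasn't met edge n-grams, this little Ruby sketch shows the idea: every indexed term is expanded into all of its leading substrings, so a left-anchored query becomes an exact match on one of them. The function name edge_ngrams and its min_gram/max_gram arguments are mine, loosely echoing the minGramSize and maxGramSize settings of Solr's edge n-gram factories; it only illustrates the concept and is not the Solr implementation.

```ruby
# Return the leading substrings of term whose lengths fall between
# min_gram and max_gram (capped at the term's own length).
def edge_ngrams(term, min_gram = 1, max_gram = term.length)
  upper = [max_gram, term.length].min
  (min_gram..upper).map { |len| term[0, len] }
end

p edge_ngrams('A1B2')   # ["A", "A1", "A1B", "A1B2"]
```

A term of n characters turns into n indexed tokens this way, which is why this approach costs noticeably more index space than indexing each term once.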
So, what is the best approach? (drum roll ...)
I wasn't even sure it would be possible to fulfill all of my acceptance tests, so I was thrilled that I got all tests to pass for two different methods: the EdgeNGram and the WhitespaceTokenizer.
I put the field type definitions for both of them here: http://pastie.org/1242164 as well as here: http://paste.pocoo.org/show/279886/ but I will be working with the WhitespaceTokenizer, since it will require far less disk space.
A few notes:
- Solr local params (LocalParams) rule
- a local param prefix query string does NOT go through analysis
- queries with spaces in them can be a little tricky to test via cucumber
- left anchoring tokenized searches requires a little cleverness
- getting appropriate results for A111 vs. A1.11 vs. A11.1 can require some cleverness
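To illustrate the first two notes, here is a hedged Ruby sketch of building a request parameter that uses Solr's prefix query parser via local params ({!prefix f=...}). The field name callnum_search and the helper name prefix_query_params are placeholders, not what SearchWorks uses. Because the prefix value skips analysis, this sketch assumes the caller has already normalized it the same way the indexed values were normalized.

```ruby
require 'uri'

# Build the URL-encoded "q" parameter for a Solr prefix query.
# normalized_value must already match the indexed form of the field,
# since Solr will not run it through the field's analysis chain.
def prefix_query_params(field, normalized_value)
  URI.encode_www_form('q' => "{!prefix f=#{field}}#{normalized_value}")
end

# Hypothetical field name and an already-normalized query value.
params = prefix_query_params('callnum_search', 'A1 B')
puts params
```

The space in the value is exactly the kind of thing that gets fiddly in a Cucumber step, so it helps to have the encoding done in one place.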
Of course, now they're going to ask me to get this working for our SUDOC and Dewey call numbers (we have about a half million of each of those flavors) and for our special local call numbers. Guess what? I'll ask for test cases and add 'em to the ones I have and I'll feel very secure that I won't be breaking anything I already test for as I accommodate new criteria.
* for Cucumber documentation, I find the wiki pages very helpful: http://github.com/aslakhellesoy/cucumber/wiki/_pages
** Cucumber does not support a "Before/After All Scenarios" hook that runs once per feature file, so I fake it in a feature file by having its first scenario do steps 1-3 and its last scenario do step 5. The actual tests are all the scenarios in between. (Cucumber does support Before/After each scenario, in a couple of different ways: http://github.com/aslakhellesoy/cucumber/wiki/Background and there is also a way to run something before any scenario using global hooks: http://github.com/aslakhellesoy/cucumber/wiki/hooks ).