Thursday, November 4, 2010

Solr, Hyphenated Words, and Query Slop

Executive Summary:  you probably need to increase your query slop.  A lot.

Revision  (thanks to Robert Muir):  See  https://issues.apache.org/jira/browse/SOLR-1852.  There was a patch applied to Solr 1.4 that fixes this.  Also, in that Jira issue is a comment from Mark Bennett:  "Just put the stopwords filter after the Word Delimiter filter. That worked for us without impacting much else, until we can get over to the new version."



We recently had a feedback ticket that a title search with a hyphen wasn't working properly.  This is especially curious because we solved a bunch of problems with hyphen searching AND WROTE TESTS in the process, and all the existing hyphen tests pass.  Tests like "hyphens with no spaces before or after, 3 significant terms, 2 stopwords" pass.

Our metadata contains:
record A with title:   Red-rose chain.
record B with title:   Prisoner in a red-rose chain.

A title search:  prisoner in a red-rose chain  returns no results

Further exploration (the following are all title searches):
  • red-rose chain  ==>  record A only
  • "red rose" chain ==>  record A only
  • "red rose chain" ==> record A only
  • "red-rose chain" ==> record A only
  • red rose chain ==>  records A and B
  • red "rose chain" ==>  records A and B  (!!)
What is going on?  First, let's see how the Solr field is analyzed.  The field definition is:


    <field name="title_search" type="text" indexed="true" stored="false" />

The type definition is:

 <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory" />
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
          splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
          catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
     <filter class="solr.LowerCaseFilterFactory" />
     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
     <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
   </analyzer>
 </fieldtype>

So of all the stuff above, the only filter that touches hyphens is the WordDelimiterFilterFactory, which is documented at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

The relevant settings we use in WDF say:
  • split words into parts at non-alphanum characters or at case changes
  • catenate word parts into a single word
We can look at the resulting analysis using http://solr.baseurl/solr/admin/analysis.jsp.  

In record A, "Red-rose chain" becomes:
 
term position       1      2       3
term text           red    rose    chain
                           redros
term type           word   word    word
                           word
source start,end    0,3    4,8     9,14
                           0,8

This shows the token "red-rose" becomes term "red" followed by terms "rose" and "redros." "redros" is the term resulting from the catenation of word parts "red" and "rose" with stemming applied.
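To make the splitting and catenating concrete, here is a toy Java sketch of just those two WDF settings.  It is not Solr's implementation:  the class and method names are made up for illustration, and case-change splitting, numerics, term positions and stemming are all left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model of generateWordParts="1" plus catenateWords="1".
// Hypothetical illustration only, not the Solr filter.
class WordDelimiterSketch {

    static List<String> analyze(String token) {
        List<String> terms = new ArrayList<String>();
        StringBuilder catenated = new StringBuilder();
        // split at runs of non-alphanumeric characters, such as the hyphen
        for (String part : token.split("[^A-Za-z0-9]+")) {
            if (part.isEmpty()) {
                continue;
            }
            String lower = part.toLowerCase(Locale.ROOT);
            terms.add(lower);
            catenated.append(lower);
        }
        // add the catenated form only when there was more than one part to join
        if (terms.size() > 1) {
            terms.add(catenated.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Red-rose")); // [red, rose, redrose]
        System.out.println(analyze("chain"));    // [chain]
    }
}
```

In the real field type the stemmer runs afterwards, which is how the catenated "redrose" ends up as "redros" in the analysis output.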

In record B, "Prisoner in a red-rose chain" becomes:

term position       1        4        7        8
term text           prison   red      rose     chain
                                      redros
term type           word     word     word     word
                                      word
source start,end    0,8      14,17    18,22    23,28
                                      14,22

Note the term positions!  Term "red" is at position 4, and the following terms are at position 7.  So as far as Solr is concerned these terms are NOT adjacent.  Which is precisely what the results of our search variants told us.  (Why is this true?  I'll leave that as an exercise for the reader.)
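Here is a minimal Java sketch of what that position gap means for a phrase query.  It is a simplified model, not Lucene code:  it only handles phrase terms that occur in query order, and the class and method names are mine.  For each term it takes the indexed position minus the term's offset within the phrase;  the spread of those values is the slop the phrase needs in order to match.

```java
// Simplified model of sloppy phrase matching for in-order terms.
// Hypothetical illustration only, not Lucene's implementation.
class PhraseSlopSketch {

    // indexedPositions[i] is the index position of the i-th term of the phrase
    static int minSlop(int[] indexedPositions) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < indexedPositions.length; i++) {
            int shift = indexedPositions[i] - i;
            min = Math.min(min, shift);
            max = Math.max(max, shift);
        }
        return max - min;
    }

    public static void main(String[] args) {
        // record A: "red" at position 1, "rose" at position 2
        System.out.println(minSlop(new int[] {1, 2})); // 0
        // record B: "red" at position 4, "rose" at position 7
        System.out.println(minSlop(new int[] {4, 7})); // 2
    }
}
```

With the default slop of 0, the phrase "red rose" matches record A but not record B.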

Important: the query term red-rose becomes the phrase query "red rose" via Solr magic and field definitions.

How do we address the fact that terms aren't adjacent?  Increase Phrase Query Slop.   The Solr Relevancy Cookbook (http://wiki.apache.org/solr/SolrRelevancyCookbook#Term_Proximity)  suggests "one way to get term proximity effects with the current query parser is to use a phrase query with a very large slop. Phrase queries with slop will score higher when the terms are closer together."
  • For a lucene request handler, this is applicable only for explicit phrase queries.  "red-rose"~2  says look for the phrase "red rose" with a phrase query slop of 2.  (red-rose~2 gives a parsing error.)
  • For a dismax request handler, this is the qs parameter:
    • From the explanation of dismax parameters at http://wiki.apache.org/solr/DisMaxQParserPlugin,  we know that qs is the "amount of slop on phrase queries explicitly included in the user's query string" -- qs affects which Solr documents match the query.
    • Confusingly, ps is the "amount of slop on phrase queries built for "pf" fields" -- ps only affects ranking of the search results.
The Solr Relevancy Cookbook examples use a phrase query slop of 1,000,000.  I experimented with this particular query and found that such a high query slop value returned a result I wasn't pleased with:  a document with "red" and "rose" so far apart that I didn't want it included.  So I did a couple of manual searches and found that a slop of 150 retrieved my two desired results, but not the third, undesired result.
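For experimenting by hand, a dismax request with an explicit qs value can be assembled roughly like this.  This is a sketch under assumptions:  defType, qf, qs and q are standard dismax request parameters, but the base URL is a placeholder and I am pretending the title search uses a single query field (title_search);  a real request handler will probably set qf, and possibly qs, in its defaults.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Builds a Solr select URL for a dismax title search with a given query slop.
// Hypothetical helper; the host and the single-field qf are assumptions.
class DismaxSlopUrlSketch {

    static String titleSearchUrl(String solrBaseUrl, String userQuery, int querySlop) {
        try {
            return solrBaseUrl + "/select"
                    + "?defType=dismax"
                    + "&qf=title_search"
                    + "&qs=" + querySlop
                    + "&q=" + URLEncoder.encode(userQuery, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // cannot happen: every Java platform supports UTF-8
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                titleSearchUrl("http://solr.baseurl/solr", "prisoner in a red-rose chain", 150));
    }
}
```

Changing only the slop argument between requests makes it easy to compare result sets for different qs values.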

Okay, I have analyzed the problem and have a solution.  What do I do now???
  1. WRITE A TEST.  (it fails - I haven't applied the solution yet.)
  2. Run all of my search tests, including the new one, to ensure all tests are passing except the one I just wrote.
    1. If the new test passes, rewrite it - you're getting a false positive.
    2. If some other test(s) fail, you're not running your tests often enough to catch failures and fix them.  Run the tests at every check in.  If you have long running tests (like our search tests), run them at least once a day.
      1. ha ha - now you have to fix these failures before you apply the query slop change.
  3. Apply fix.
  4. Run all of my search tests.  If they don't all pass, then I don't have a solution.  Return to step 3.
    1. Very occasionally (like, almost never), a change will legitimately break tests, and it is the tests themselves that need to be revised.

Sunday, October 24, 2010

Testing Solr Indexing Software - Full Stack Tests

In a previous post, I talked about the different levels of testing for code that writes to a Solr index.   This post will go into detail about Full Stack Tests, which are acceptance tests for search results including the UI wrapping.

The testing mantra:  if you had to test it manually, then it's worth having automated tests.  How many times can you ask human testers to bang on your application?   How often do they repeat searches that worked in the past? 

Our UI for http://searchworks.stanford.edu is based on Project Blacklight (http://projectblacklight.org/), a Ruby on Rails application.  There is a great way to test the RoR application stack from the user input forms to the html that would be returned:  Cucumber (http://cukes.info/).  So our cucumber tests fake user input into the UI, run it through our RoR code, send the request to Solr, run the response through our RoR code, and then we can look for desired data in the resultant html.

Here are some example cucumber tests:

Scenario: Query for "cooking" should have exact word matches before stemmed ones
  Given a SOLR index with Stanford MARC data
  And I go to the catalog page
  When I fill in "q" with "cooking"
  And I press "search"
  Then I should get ckey 4779910 in the first 1 results
  And I should get result titles that contain "cooking" as the first 20 results

Scenario: Stopwords in author searches should be ignored
  Given I am on the home page
  When I fill in "q" with "king of scotland"
  And I select "Author" from "search_field"
  And I press "search"
  Then I should get at least 20 total results
  And I should get the same number of results as an author search for "king scotland"
  And I should get more results than an author search for "\"king of scotland\""

Scenario: Two term query with COLON, no Stopword
  Given a SOLR index with Stanford MARC data
  And I go to the home page
  When I fill in "q" with "Jazz : photographs"
  And I press "search"
  Then I should get ckey 2955977 in the results
  And I should get the same number of results as a search for "Jazz photographs"
  And I should get the same number of results as a search for "Jazz: photographs"

Yes, these are executable tests.  And they give us a huge safety net for ensuring Solr configuration changes and indexing changes don't break anything.   If we change boost values in a Request Handler.  If we change a field type in Solr.  If we tweak the UI code handling raw user queries.

Whenever we get a user-feedback message, or an email from a staff member about expected search behavior, it is fodder for these tests.  Normally, we get reports of what is broken.  Great!  The ideal testing scenario.  We write a cuke test before we fix it, assert the cuke test fails, then we work on a fix, assert the cuke test passes.  And we can run all our other cuke search tests to ensure it doesn't break anything else.

The staff are delighted to hear that we now have a way to know automatically if we break the behavior in the future.  And that we'll fix it.   They are delighted to hear that they won't be asked to repeat the tests manually when we upgrade Solr or make any other changes.

Perhaps you are starting to see how we can do relevancy testing.  But that's for another post.

Testing Solr Indexing Software - Search Acceptance and Searchable Value Tests

In a previous post, I talked about the different levels of testing for code that writes to a Solr index.   This post will go into detail about Search Acceptance and Searchable Value tests, which are essentially acceptance tests for search results.

Recall the testing mantra:  if you had to test it manually, then it's worth automating the test. Solr is a complex black box for most of us - you need to know that twiddling any knobs on that box won't affect your search results in ways you didn't expect.

Many people do these sorts of tests ad hoc, which is fine if the raw data and the desired searching behavior are simple, you're certain you have met all the search requirements, and you'll never have to touch the Solr configuration files again.  (... what reality are you living in?  I want to join you.)

For most of us, there are times when we're not sure how to achieve the appropriate search results.

Search Acceptance Tests

Some of the questions we confront when we configure Solr:
  1. how to define the Solr field(s) and field type(s) to be searched - 
    • how is the text analyzed/tokenized?
    •  should it be indexed? stored? multivalued?
  2. how to transform the raw data into Solr searched values (this part on its own could be a Mapping test, if you have non-Solr code transforming the particular raw data in question)
  3. how to set up a RequestHandler to achieve the appropriate search results
Sometimes the raw data is inconsistent in strange ways, or the desired search behavior is complex enough to beg for test code.  Here are some examples:
  1. In a call number, some periods are vital to searches (P35.8 vs. P358) and some are not (A1 .B2 vs. A1 B2).  Some spaces are vital (A1 .B2 1999 vs. A1 .B21999) and some are not (A1 .B2 vs. A1.B2).  For call numbers A1234 1999 .B2 and A1234 .B2, the desired search behavior may be for both to match.  Plus, the end user queries will be inconsistent and the data is definitely dirty.
  2. Will a query containing a stopword match raw data with the same stopword?  Will it match raw data without a stopword?  With a different stopword (dogs in literature vs. dogs of literature vs. dog literature)? 
  3. How should author names deal with stemming (Michaels vs. Michael)?  With stopwords (Earl of Sandford vs. Earl Sandford)?  Are results correct for hyphenated names?  Are results reasonable for specialized author searches as well as for unspecialized searches?
  4. "Why are (hyphens, ampersands, colons, semicolons) in query strings causing empty results?  How do I fix this?" 
  5. "I am not seeing the publisher in the search results, but I know publisher is written to the index because I wrote a mapping test for that field."
    • just checking if you're paying attention:  this one is a matter of setting stored="true" on the field definition.
I have to experiment to meet acceptance criteria like the above.  And once I figure out a solution, I want assurance that future twiddles won't break what I've already solved.  And if they do get broken, I want to know which twiddle is at fault.  And if I can't meet all the criteria, I like being able to know what I could get working, and what I had to chuck.

In a chat with Jonathan Rochkind, he said that you twiddle and you twiddle search configurations and eventually you hit a point where meeting acceptance test J will break acceptance test H.  From his perspective, this is why these sorts of tests are fickle - he feels a conflict is inevitable and shows the folly of these tests.  But from my perspective, this is EXACTLY why these tests are necessary.  They let me pinpoint exactly which behaviors conflict, so I can pursue a new solution, or choose which behavior will be addressed.  (In my world, this generally means informing Librarians what the trade off is and letting them make the decision.)


Searchable Value Tests are rarely needed, because Search Acceptance tests tend to address nearly all  the searchable value issues.

Searchable value tests answer questions like this:
  1. "Why aren't searches with the 'a' prefix on the ckey working?  I know I left the letter 'a' prefix in that field, because I wrote a mapping  test to ensure the 'a' prefix is present in the field value to be written to Solr."
Bill Dueber points out that Solr's schema.xml could transform a field's searchable value as the document is added to Solr.  The field's type definition in schema.xml might strip out whitespace, punctuation, stopwords, etc.  It might substitute characters (maybe ü becomes ue, or maybe it becomes u).  Note that the stored field value will not be changed by schema.xml -- whatever was in the Solr document written to the index will be the value retrieved from a stored field.
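A toy Java sketch of that distinction, with made-up names:  the "searchable" value goes through the kind of transformations a field type might apply, while the "stored" value comes back exactly as it was sent.  This is not how Solr is implemented (real analysis produces a stream of tokens, not one string);  it only illustrates why a mapping test on the value sent to Solr cannot tell you what is actually searchable.

```java
import java.util.Locale;

// Toy contrast between an indexed (searchable) value and a stored value.
// Hypothetical illustration only; the transformations are arbitrary examples.
class SearchableValueSketch {

    // roughly what an analysis chain might do before the value is indexed
    static String searchableValue(String valueSentToSolr) {
        return valueSentToSolr
                .replace("\u00FC", "ue")          // substitute u-umlaut with "ue"
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9\\s]", "")   // strip punctuation
                .replaceAll("\\s+", " ")          // collapse whitespace
                .trim();
    }

    // the stored value is whatever was in the document written to the index
    static String storedValue(String valueSentToSolr) {
        return valueSentToSolr;
    }

    public static void main(String[] args) {
        String sent = "M\u00FCller,  J.";
        System.out.println(searchableValue(sent)); // mueller j
        System.out.println(storedValue(sent).equals(sent)); // true
    }
}
```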

How to Test

These tests must occur AFTER the Solr document is written to the index.  We have to check search results given known data - we send a search query to a Solr populated with our test data and check if we get the desired results back from Solr.

Note that this does NOT include your web application code.  In our case, Blacklight may filter the user query or the results, and we don't want or need that layered into the testing stack.  Recall that test code should live close to the code it is testing;  these tests should be sending queries directly to Solr and examining the Solr results.

As it happens, you can test both of these cases more or less the same way.

The Right Way is to create a script to do the following:
  1. Get your latest, greatest Solr configuration files from source control.
    1. Prepare a Solr instance for testing:  copy in the latest configuration files, clear the test index, etc.
  2. Pull down the indexing software, your test data, and your test code from source control.
    1. Build the indexing software if necessary.  SolrMarc is built from an ant task.
  3. Start Solr test instance.  Generally this means starting or restarting the web server you are using for Solr testing (jetty, tomcat ...)
  4. Run your tests.  Your tests should create very small Solr indexes - a fresh index for a group of related tests, or sometimes for a single test.
    1. Send (MARC) records through the indexing software to create Solr documents, and add the Solr documents to the testing index.
    2. Submit test queries against the testing index.
    3. Programmatically test for acceptance criteria in the Solr results.
    4. Repeat as necessary
  5. Stop Solr test instance.
  6. Clean up
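Steps 4.2 and 4.3 boil down to "get a Solr response, check it against the acceptance criteria."  Here is a minimal Java sketch of the checking half, assuming Solr's XML response format and a unique key field named "id" (your key field may be named differently).  The class, the method names and the sample response are made up for illustration;  fetching the response over HTTP is left out, and the regular expressions are a shortcut where a real test would use an XML parser or a Solr client library.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Checks a Solr XML response against expected document ids.
// Hypothetical helper; assumes the unique key field is called "id".
class SolrResultCheckSketch {

    private static final Pattern NUM_FOUND = Pattern.compile("numFound=\"(\\d+)\"");
    private static final Pattern DOC_ID = Pattern.compile("<str name=\"id\">([^<]*)</str>");

    static int numFound(String solrXml) {
        Matcher m = NUM_FOUND.matcher(solrXml);
        if (!m.find()) {
            throw new IllegalArgumentException("no numFound attribute in response");
        }
        return Integer.parseInt(m.group(1));
    }

    static Set<String> docIds(String solrXml) {
        Set<String> ids = new LinkedHashSet<String>();
        Matcher m = DOC_ID.matcher(solrXml);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    // acceptance criterion: exactly the expected documents, no more, no fewer
    static boolean meetsCriteria(String solrXml, Set<String> expectedIds) {
        return numFound(solrXml) == expectedIds.size()
                && docIds(solrXml).equals(expectedIds);
    }

    // invented sample response for the demo below
    static String sampleResponse() {
        return "<response>"
                + "<result name=\"response\" numFound=\"2\" start=\"0\">"
                + "<doc><str name=\"id\">recordA</str></doc>"
                + "<doc><str name=\"id\">recordB</str></doc>"
                + "</result>"
                + "</response>";
    }

    public static void main(String[] args) {
        Set<String> expected = new HashSet<String>(Arrays.asList("recordA", "recordB"));
        System.out.println(meetsCriteria(sampleResponse(), expected)); // true
    }
}
```

Because the test indexes are tiny, comparing the full set of returned ids is practical;  against a large index you would need to check for specific ids within the first N rows instead.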
I am currently working on a ruby Rake task with Cucumber (but not with Rails) to do the above.  I would like to be able to plug in different indexing software (SolrMarc, some ruby MARC -> Solr code being developed ...).   The tests themselves will not be MARC centric - just the data and the indexing software.  And it will be easy to have continuous integration execute the script, because it's a Rake task.

I already have a chunk of this working, thanks in great part to the Ruby code that I stole from the Hydrangea and Blacklight projects.  This allowed me to figure out Solr configurations for a call number search field that satisfied the most important searching criteria as defined by my most excellent group of advising librarians.

I have a Rake task that:
  1. Spins up a Solr instance on jetty (3. above)
  2. Clears the existing index (then commits), then indexes a file of Solr docs to exercise what I'm testing (the "add the Solr documents to the testing index" part of 4.1 above), then commits again.
  3. Cucumber scenario submits queries against Solr with the test data (4.2 above)
  4. The same cucumber scenario compares the Solr results with the acceptance criteria (4.3)
  5. (keep running Cucumber scenarios - 4.4)
  6. Stop Solr test instance  (5. above)
For more details, see my post on Call Number Searching in Solr.

For Searchable Value Tests:

To some extent, you can examine searchable values of Solr fields by doing facet queries for those fields.  NOTE:  if the fields are tokenized (and most searchable fields will be) then you will see each token as a facet value.  So you may need to create a tiny test index with very few records when doing this sort of test.
    1. http://your.solr.server/solr/select?facet.field=your_indexed_field&facet=true&rows=0
To be honest, I'm not sure how this will work if the Solr field is both stored and indexed - will you see the stored value?

Another way I currently test values written to Solr:

I wrote JUnit tests for my SolrMarc instance that build a Solr index from test data and then search the index via the Solr API.  This works okay when searching against a single field, but it doesn't tackle searching via a Solr Request Handler.

Here is an example test:

    /**
     * isbn_search should be case insensitive
     */
    @Test
    public final void testISBNCaseInsensitive()
        throws IOException, ParserConfigurationException, SAXException
    {
        String fldName = "isbn_search";
        createIxInitVars("isbnTests.mrc");

        Set<String> docIds = new HashSet<String>();
        docIds.add("020suba10trailingText");
        docIds.add("020SubaAndz");
        assertSearchResults(fldName, "052185668X", docIds);
        assertSearchResults(fldName, "052185668x", docIds);
    }

Continuous Integration runs an ant target just like the one for the mapping tests to execute these tests;  here is an ant task to run all the test code in a directory and its children:

    <target name="runTests" depends="testCompile" description="Run mapping tests for local SolrMarc">
        <mkdir dir="${core.coverage.dir}"/>
   
        <path id="test.classpath">
            <pathelement location="${build.dir}/test" />
            <path refid="test.compile.classpath" />
        </path>
        
        <junit showoutput="yes" printsummary="yes" fork="yes" forkmode="perBatch">
            <classpath refid="test.classpath" />
           
<!-- use test element instead of batchtest element if desired
            <test name="${test.class}" />
-->
            <batchtest fork="yes">
                 <fileset dir="${build.dir}/test" excludes="org/solrmarc/testUtils/**"/>
            </batchtest>
           
        </junit>

    </target>   

Testing Solr Indexing Software - Mapping Tests

In a previous post, I talked about the different levels of testing for code that writes to a Solr index.   This post will go into detail about Mapping Tests -- tests of how raw data is processed by the indexing software into the Solr document value before it is written to the Solr index.
Mapping tests answer questions like these:
  1. "Am I assigning the right format value to my Solr document given a MARC record with this leader, this 008 and this 006?"
  2. "Will the Solr document have exactly the subset of OCLC numbers we want from this record with 8 OCLC numbers?"
  3. "Is the sortable version of the title correctly removing non-filing characters for Hebrew titles?"
Note that these are all questions that can be answered BEFORE the Solr document is written to the index;  we test the mapping of our raw data to what *will* be sent to Solr.

Recall these principles of good test code:
  • test code should live close to the code it is testing
  • tests should be automate-able via a script for continuous integration
  • tests are useless if they give false positives:  you should be certain that the test code fails when it is supposed to - this is the "test the test code" principle.
  • tests should assert that the code behaves well with bad data ... the data is always dirty.
  • tests should exercise as much of the code you write as is practical.

The first of these principles can be translated to:  mapping test code belongs with the code that transforms your raw data into the Solr documents you will write to the index.


Mapping tests are idiosyncratic to the code used to get from raw data to a Solr document.  In my case, the raw data is MARC (http://www.loc.gov/marc/), and the code I use to transform my MARC data into Solr documents is Bob Haschart's SolrMarc (http://code.google.com/p/solrmarc/).

Bob Haschart came up with one way to do mapping tests in SolrMarc;  I was sidetracked by the hammer in my hand making everything look like a nail.  After I saw Bob's work, I came up with an additional way to do mapping tests in SolrMarc.  Both approaches are documented at http://code.google.com/p/solrmarc/wiki/Testing, and the code, as well as test data, are available in the SolrMarc code base.

Here is an example of a junit mapping test for SolrMarc.  Obviously some of the work is done in the super class(es) -- I'll leave perusal of those as an exercise for the reader.

public class AuthorTests extends AbstractStanfordBlacklightTest {

    @Before
    public final void setup()
    {
        mappingTestInit();
    }
 

    /**
     * Personal name display field tests.
     */
    @Test
    public final void testPersonalNameDisplay()
    {
        String fldName = "author_person_display";
        String testFilePath = testDataParentPath + File.separator + "authorTests.mrc";

        // 100a
        // trailing period removed
        solrFldMapTest.assertSolrFldValue(testFilePath, "345228", fldName, "Bashkov, Vladimir");
        // 100ad
        // trailing hyphen retained
        solrFldMapTest.assertSolrFldValue(testFilePath, "919006", fldName, "Oeftering, Michael, 1872-");
        // 100ae  (e not indexed)
        // trailing comma should be removed
        solrFldMapTest.assertSolrFldValue(testFilePath, "7651581", fldName, "Coutinho, Frederico dos Reys");
        // 100aqd
        // trailing period removed
        solrFldMapTest.assertSolrFldValue(testFilePath, "690002", fldName, "Wallin, J. E. Wallace (John Edward Wallace), b. 1876");
        // 100aqd
        solrFldMapTest.assertSolrFldValue(testFilePath, "1261173", fldName, "Johnson, Samuel, 1649-1703");
        // 'nother sort of trailing period - not removed
        solrFldMapTest.assertSolrFldValue(testFilePath, "8634", fldName, "Sallust, 86-34 B.C.");
        // 100 with numeric subfield
        solrFldMapTest.assertSolrFldValue(testFilePath, "1006", fldName, "Sox on Fox");
    }
}

Is a Mapping Test the same as a Unit Test?

No.  This is not necessarily what I would call *unit* testing.  I think of unit tests as being at the "method" level -- they ensure that a method gives expected results for different input values.  If I write a method to normalize the classification portion of an LC call number, unit tests would affirm correct method results for these sorts of questions:  what if there is a space between the letters and digits of the class?  what if there are decimal digits?  what if the class letters are illegal?  what if there is a numeric class suffix?  what if there is an alphabetic class suffix?

Here is an example of a unit test I wrote (note that I tried to think of as many cases as possible):

    /**
     * unit test for Utils.getLCStringB4FirstCutter()
     */
    @Test
    public void testLCStringB4FirstCutter()
    {
        String callnum = "M1 L33";
        assertEquals("M1", getLCB4FirstCutter(callnum));
        callnum = "M211 .M93 K.240 1988"; // first cutter has period
        assertEquals("M211", getLCB4FirstCutter(callnum));
        callnum = "PQ2678.K26 P54 1992"; // no space b4 cutter with period
        assertEquals("PQ2678", getLCB4FirstCutter(callnum));
        callnum = "PR9199.4 .B3"; // class has float, first cutter has period
        assertEquals("PR9199.4", getLCB4FirstCutter(callnum));
        callnum = "PR9199.3.L33 B6"; // decimal call no space before cutter
        assertEquals("PR9199.3", getLCB4FirstCutter(callnum));
        callnum = "HC241.25F4 .D47";
        assertEquals("HC241.25", getLCB4FirstCutter(callnum));

        // suffix before first cutter
        callnum = "PR92 1990 L33";
        assertEquals("PR92 1990", getLCB4FirstCutter(callnum));
        callnum = "PR92 1844 .L33 1990"; // first cutter has period
        assertEquals("PR92 1844", getLCB4FirstCutter(callnum));
        callnum = "PR92 1844.L33 1990"; // no space before cutter w period
        assertEquals("PR92 1844", getLCB4FirstCutter(callnum));
        callnum = "PR92 1844L33 1990"; // no space before cutter w no period
        assertEquals("PR92 1844", getLCB4FirstCutter(callnum));
        // period before cutter
        callnum = "M234.8 1827 .F666";
        assertEquals("M234.8 1827", getLCB4FirstCutter(callnum));
        callnum = "PS3538 1974.L33";
        assertEquals("PS3538 1974", getLCB4FirstCutter(callnum));
        // two cutters
        callnum = "PR9199.3 1920 L33 A6 1982";
        assertEquals("PR9199.3 1920", getLCB4FirstCutter(callnum));
        callnum = "PR9199.3 1920 .L33 1475 .A6";
        assertEquals("PR9199.3 1920", getLCB4FirstCutter(callnum));
        // decimal and period before cutter
        callnum = "HD38.25.F8 R87 1989";
        assertEquals("HD38.25", getLCB4FirstCutter(callnum));
        callnum = "HF5549.5.T7 B294 1992";
        assertEquals("HF5549.5", getLCB4FirstCutter(callnum));

        // suffix with letters
        callnum = "L666 15th A8";
        assertEquals("L666 15th", getLCB4FirstCutter(callnum));

        // non-compliant cutter
        callnum = "M5 .L";
        assertEquals("M5", getLCB4FirstCutter(callnum));

        // no cutter
        callnum = "B9 2000";
        assertEquals("B9 2000", getLCB4FirstCutter(callnum));
        callnum = "B9 2000 35TH";
        assertEquals("B9 2000 35TH", getLCB4FirstCutter(callnum));

        // wacko lc class suffixes
        callnum = "G3840 SVAR .H5"; // suffix letters only
        assertEquals("G3840 SVAR", getLCB4FirstCutter(callnum));

        // first cutter starts with same chars as LC class
        callnum = "G3824 .G3 .S5 1863 W5 2002";
        assertEquals("G3824", getLCB4FirstCutter(callnum));
        callnum = "G3841.C2 S24 .U5 MD:CRAPO*DMA 1981";
        assertEquals("G3841", getLCB4FirstCutter(callnum));

        // space between LC class letters and numbers
        callnum = "PQ 8550.21.R57 V5 1992";
        // assertEquals("PQ 8550.21", getLCB4FirstCutter(callnum));
        callnum = "HD 38.25.F8 R87 1989";
        // assertEquals("HD 38.25", getLCB4FirstCutter(callnum));
    }
 
In contrast, the following would be mapping tests:
  • Given a MARC record with a call number in 999 subfield a that begins "MFILM", does the Solr document we're creating have "Microform" in the format field?
  • Given a MARC record that contains a 502 field, does the Solr document we're creating have "Thesis" in the format field? 
  • Given a MARC record with a 956 subfield u that does NOT contain "sfx", does the Solr document we're creating NOT have an sfx_url field?

What about the following situation?  Is this a mapping test or a unit test or something else?
  • Given a MARC record containing an 050 subfield a with value "A1    1997" and subfield b with value ".B2 V.17",  does the call_number field in the Solr document we're creating contain value "A1 1997 .B2 V.17"?
It depends.  If your Solr schema declarations convert consecutive white space into a single space for the call_number field type, and if your indexing configuration (e.g. blah_index.properties for SolrMarc) is the only other thing you've used to create the call_number field values, then I would say you don't need a mapping test.   That's because there should be *unit* tests asserting that generic subfield concatenation is working properly ("Does concatenation add a space between subfields?"  "What about when I don't want a space?").

However, if you have written special handling code for the call_number field that looks for values in the MARC 050 as well as 090 fields ... then you need a mapping test for the call_number field.  You'd want to test records with 050 only, with 090 only, with more than one 050, with both 050 and 090, with no 050 or 090, etc.
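For contrast, the generic pieces mentioned above (subfield concatenation, whitespace collapsing) are the kind of thing a plain unit test covers.  A minimal Java sketch, with made-up class and method names;  this is not SolrMarc's code, and in a real setup the whitespace collapsing might live in the Solr field type rather than in the indexing code.

```java
// Generic subfield concatenation and whitespace normalization.
// Hypothetical helpers for illustration; not taken from SolrMarc.
class SubfieldJoinSketch {

    // concatenate subfield values with a single space between them
    static String joinSubfields(String... subfieldValues) {
        StringBuilder joined = new StringBuilder();
        for (String value : subfieldValues) {
            if (joined.length() > 0) {
                joined.append(' ');
            }
            joined.append(value);
        }
        return joined.toString();
    }

    // convert runs of consecutive white space into a single space
    static String collapseWhitespace(String value) {
        return value.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String subfieldA = "A1    1997";
        String subfieldB = ".B2 V.17";
        System.out.println(collapseWhitespace(joinSubfields(subfieldA, subfieldB)));
        // A1 1997 .B2 V.17
    }
}
```

Unit tests for helpers like these answer the "does concatenation add a space between subfields?" questions once, so individual fields that only use the generic behavior don't each need their own mapping test.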

Where Do I Get Raw Data for Mapping Tests? 

You will need to create your own test data for mapping tests, or at least collect it from problem reports.  Yes, it can be time consuming.

But remember the payoff - you get a safety net that becomes increasingly important as your indexing code gets more complex (and it will!).  Some examples:
  • I need to be certain my site's customized code is not impacted when I pull the latest code updates from the SolrMarc project.
  • I think I can ditch some of my custom code because of new functionality available in the core SolrMarc code -- my test code assures me the change didn't break anything.  
  • I have to change my local format algorithm and know the changes shouldn't affect the assignment of the "book" format -- existing mapping tests for "book" format assignment ensure that my new changes don't step on the existing code.
Of course new problems are always presenting themselves.  But if you write tests for each new problem as you fix it, you won't have to say "dang!  I thought I fixed that!" to anyone downstream.  Each problem you fix adds to your safety net.  This is a HUGE time saver (and face saver) as your project evolves.

Automate Your Safety Net

Wouldn't it be great if every time you changed your mapping code (e.g. your localized version of SolrMarc), all your mapping tests ran automatically as soon as you put the new code into source control??  Then you would know immediately if you inadvertently broke something.  It would be as if your testing safety net magically unfurled below you.

Continuous Integration tools like Hudson have been created to fulfill this purpose. Given a Hudson server (I confess I have wonderful colleagues who set up a Hudson instance for me, so I don't know much about that.  I'm told it's not too difficult.), it is easy to set up automatic testing for your projects.  We use Hudson for java and ruby, running continuous integration with code coverage information automatically generated.

Of course, you can save yourself some grief if you run the same tests in your workspace before you commit to source control.  Hudson provides additional safety because it's usually on a different machine with a different configuration than your workspace.  And if there are multiple people working on the same code base, you probably want to have the certainty that Joe didn't break your code when he fixed his bug.  You can probably trust Joe ... but we're software engineers -- we want proof.


How Do I Set Up Continuous Integration for Mapping Tests? 

The SolrMarc mapping tests are written in java, because SolrMarc is in java.  Running JUnit test code is one thing that ant actually makes easy.  Here is an ant task to run all the test code in a directory and its children:

    <target name="runTests" depends="testCompile" description="Run mapping tests for local SolrMarc">
        <!-- make sure the output directory for coverage data exists -->
        <mkdir dir="${core.coverage.dir}"/>

        <!-- compiled test classes plus everything needed to compile them -->
        <path id="test.classpath">
            <pathelement location="${build.dir}/test" />
            <path refid="test.compile.classpath" />
        </path>

        <!-- run the tests in a forked JVM, one JVM per batch -->
        <junit showoutput="yes" printsummary="yes" fork="yes" forkmode="perBatch">
            <classpath refid="test.classpath" />

<!-- use test element instead of batchtest element if desired
            <test name="${test.class}" />
-->
            <!-- run everything under the test build directory except the shared test utilities -->
            <batchtest fork="yes">
                <fileset dir="${build.dir}/test" excludes="org/solrmarc/testUtils/**"/>
            </batchtest>

        </junit>

    </target>

Hudson configurations can be set to automatically poll a source control system's project (we mostly use Git, but have some laggard projects in Subversion) at whatever interval you like.  In my case, Hudson polls our SolrMarc code in our local Subversion repository every 15 minutes.  If Hudson finds a new commit, it will check out the source code (or a delta) and then run the ant target given in the configuration.  (Obviously the target should do a fresh build before it runs the tests.)

Testing Solr Indexing Software

Some of the sharpest MARC -> Solr coders have scanned some of my indexing tests and have scratched their heads, wondering why I have "gone overboard" and why I am "testing Solr itself."   Others have positively panted in anticipation when they heard that Bess Sadler (and I, sort of) came up with a way to do relevancy testing. And I know a few of you out there remember the 2010 Code4Lib presentation Jessie Keck, Willy Mene and I did on testing (http://www.stanford.edu/~ndushay/code4lib2010/I_am_not_your_mother.pdf).

So ... what tests should be written by those of us saddled with converting (MARC) records into a Solr index?  And how can we write these tests so they can be used by continuous integration?

Building on previous conversations with Bob Haschart, Jonathan Rochkind and Bill Dueber,  Bill and I had an illuminating conversation last Friday about testing code that writes data to Solr. We teased out four different levels of tests.
  1. Mapping Tests: from raw data to what will be written to Solr.  This is the code that takes the raw data and turns it into field values that will be written to Solr.  "Mapping Tests" is a good name for these.
    1. "Am I assigning the right format to a MARC record with this leader, this 008 and this 006?"
    2. "Am I pulling out exactly the OCLC numbers we want from this record with 8 OCLC numbers?"
    3. "Is the sortable version of the title correctly removing non-filing characters for Hebrew titles?"
  2. Searchable Value Tests: from (raw data) to (searchable) value stored in Solr.  Bill points out that Solr's schema.xml could transform a field's searchable value as the document is added to Solr.  The field's type definition in schema.xml might strip out whitespace, punctuation, stopwords, etc.  It might substitute characters (maybe ü becomes ue, or maybe it becomes u).  Note that the stored field value will not be changed by schema.xml -- whatever was in the Solr document written to the index will be the value retrieved from a stored field.
    1. "I know I left the letter 'a' prefix in that field, because I wrote a test to ensure the 'a' prefix is present in the field value to be written to Solr.  So why aren't searches with the 'a' prefix working?"
  3. Search Acceptance Tests: from (raw data) to Solr search results.  What if you're not sure of the best way to define a field type and/or to transform the raw data in the indexing code and/or to set up a RequestHandler to achieve the appropriate search results?
    1. In a call number, some periods are vital to searches (P3562.8) and some are not (A1 .B2).  Some spaces are vital (A1 .B2 1999) and some are not (A1.B2).  What is the best way to index the raw data to get the desired results?  What is the best way to configure the request handler to get the desired results?
    2. Do author search results contain stemmed versions of names (michaels vs. michael)?  Stopwords (earl of sandford vs. earl sandford)?  What about the default request handler?  Are results correct for hyphenated names?
    3. Will a query containing a stopword match raw data with the same stopword?  Will it match raw data without a stopword?  With a different stopword?  (dogs in literature vs. dogs of literature vs. dog literature).  How should the query be analyzed? 
    4. "I am not seeing that field in the search results, but I know the field is present because I wrote a test to check the field value is correct in the document to be written to Solr."  (Oops - is stored="true" on the field definition?) 
  4. Full Stack Tests: from a query in your UI to Solr search results.  There are simple things that might be done by your UI code, such as translating the displayed value of a location facet to a different value for the actual query.  There are also situations when the transformations must be more complex --  converting query strings from multiple form fields into a complex Solr query with local params, as we do with our Advanced Search form  (See http://www.stanford.edu/~ndushay/code4lib2010/advSearchSolrQueries.pdf  or, for a more complete explanation, see http://www.stanford.edu/~ndushay/code4lib2010/advanced_search.pdf.) 
Recall the testing mantra:  if you had to test it manually, then it's worth automating the test. 
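
A level 1 mapping test can be very small. The sketch below is plain Ruby, not the JUnit code used with SolrMarc, and `sortable_title` is a hypothetical stand-in for real mapping code. It relies only on the MARC convention that the second indicator of the 245 field gives the number of non-filing characters to skip; the titles and the `check_equal` helper are made up for illustration.

```ruby
# Hypothetical mapping function: build a sortable title from a MARC 245 title.
# nonfiling_count is the 245 second indicator (number of characters to skip).
def sortable_title(title, nonfiling_count)
  # Slicing past the end of the string returns nil, so fall back to ''.
  remainder = title[nonfiling_count..-1] || ''
  remainder.strip.downcase
end

# Minimal assertion helper so the sketch needs no test framework.
def check_equal(expected, actual)
  raise "expected #{expected.inspect}, got #{actual.inspect}" unless expected == actual
end

# No non-filing characters: only case is normalized.
check_equal 'prisoner in a red-rose chain.',
            sortable_title('Prisoner in a red-rose chain.', 0)

# "The " is four non-filing characters.
check_equal 'red-rose chain.', sortable_title('The red-rose chain.', 4)

# Dirty data: the indicator claims more characters than the title has.
check_equal '', sortable_title('An', 9)

puts 'mapping tests passed'
```

Nothing here touches Solr, which is the point of this level: it checks only the value that would be written to the Solr document.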


A few principles of good test code:
  • test code should live close to the code it is testing
  • tests should be automate-able via a script for continuous integration
  • tests should exercise as much of the code you write as is practical.
  • tests are useless if they give false positives:  you should be certain that the test code fails when it is supposed to - this is the "test the test code" principle.
  • tests should assert that the code behaves well with bad data ... the data is always dirty.
Yes, it takes you longer to do the initial coding ... but you get all that time back, and then some, as the system grows and needs maintenance.  It may be hard to believe, but when you start writing tests, your code improves.  Not only because you are testing for error conditions and the like, but also because thinking about tests subtly changes the way you write your code - it becomes more modular, clearer.  Plus it allows you to refactor with confidence as a better approach presents itself.
    Continuous Integration for these tests

    Automate-able scripts that run your tests will differ depending on the language and tools you have used.  I tend to have ant targets for tests written in java and rake tasks in other contexts.  I will provide more specifics in future posts as I step through examples of the different types of tests.

    If you need to sell any of this to your management, try this:

    Our staff are delighted to hear that when we fix a reported searching problem, we use the examples from their bug reports to ensure we won't unknowingly unfix things in the future.  It's a big win to tell them they won't be asked to repeat the tests manually when we upgrade Solr or make any other changes.   Also, over time, the maintenance costs of the software will be lower.

    Friday, October 22, 2010

    (LC) Call Number Searching in Solr

    (what you really want to know: http://paste.pocoo.org/show/279886/ or  http://pastie.org/1242164)

    I was recently charged with fixing call number searching in SearchWorks,  and it was a great opportunity to set up a proper testing framework for Search Acceptance tests.

    Search Acceptance tests need to assert the following:
    1. Given a set of Solr documents with known values
    2. and Given a known query string
    3. Then queries against the Solr index should return the expected results.
    The testing framework allows me to experiment with different Solr configurations (e.g. field analysis or request handler settings) so I can find out how I can best fulfill the acceptance criteria.  Of course I made heavy use of the Solr (solr baseurl)/admin/analysis.jsp page, but I didn't want to manually check all the acceptance criteria every time I tweaked a promising configuration.  I would much rather fire off automated tests to ensure I'm not breaking case B while I fix case G.   And for those of you that do ad hoc testing in this situation ...  do you really run through all the test cases every time you make a change?  Don't lie to me - you don't do it.
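
Stripped of Cucumber, the three assertions above come down to "build a query, run it, compare document ids." Here is a minimal sketch of the first and last parts, assuming Solr's JSON response writer (`wt=json`) and an `id` field; the base URL and the canned response are made up, and no request is actually sent.

```ruby
require 'json'
require 'uri'

# Build a Solr select URL that asks only for document ids, as JSON.
# The base URL is whatever your test Solr instance uses (an assumption here).
def select_url(base_url, query, rows = 20)
  params = URI.encode_www_form('q' => query, 'fl' => 'id', 'wt' => 'json', 'rows' => rows)
  "#{base_url}/select?#{params}"
end

# Pull the document ids out of a Solr JSON response body.
def result_ids(json_body)
  JSON.parse(json_body).fetch('response').fetch('docs').map { |doc| doc['id'] }
end

# The "Then" step: the result set must be exactly the expected ids.
def check_results(expected_ids, json_body)
  actual = result_ids(json_body).sort
  raise "expected #{expected_ids.sort.inspect}, got #{actual.inspect}" unless actual == expected_ids.sort
  true
end

# A canned response standing in for what the test Solr instance would return.
canned = '{"responseHeader":{"status":0},' \
         '"response":{"numFound":2,"start":0,"docs":[{"id":"A"},{"id":"B"}]}}'

puts select_url('http://localhost:8983/solr', 'red rose chain')
check_results(%w[A B], canned)
```

In a real run, the body passed to `check_results` would come from fetching the URL built by `select_url` against the test index.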

    My Solr acceptance test framework uses Cucumber* (http://cukes.info/), Ruby, Rake (http://rake.rubyforge.org/) and Jetty.  Cucumber is tidily run from a rake task;  I have cucumber do the following**:
    1. spin up a test Solr instance
    2. delete all documents in Solr index
    3. add test Solr documents in xml file to Solr
    4. run cucumber features/scenarios
    5. shut down test Solr instance
    Now I'm ready to write my acceptance tests. 
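
Steps 2 and 3 amount to posting three XML messages to Solr's update handler: delete everything, add the test docs, commit. The sketch below is not the actual rake/cucumber code; the method names and the update URL are assumptions. The sender is passed in as a callable so the sequence can be exercised without a running Solr.

```ruby
require 'net/http'
require 'uri'

# Solr XML update messages.
DELETE_ALL = '<delete><query>*:*</query></delete>'
COMMIT = '<commit/>'

# Empty the index, load the test docs, and commit.
# `post` is any callable that sends one XML body to Solr's update handler.
def reset_index(docs_xml, post)
  [DELETE_ALL, docs_xml, COMMIT].each { |body| post.call(body) }
end

# A sender that posts to a real update handler (defined but not called here).
def solr_poster(update_url)
  uri = URI.parse(update_url)
  lambda do |body|
    Net::HTTP.start(uri.host, uri.port) do |http|
      response = http.post(uri.path, body, 'Content-Type' => 'text/xml')
      raise "Solr update failed: #{response.code}" unless response.code == '200'
    end
  end
end

# Dry run: record the messages instead of sending them.
sent = []
test_docs = '<add><doc><field name="id">A</field></doc></add>'
reset_index(test_docs, ->(body) { sent << body })
puts sent
```

Against a live test instance the call would look something like `reset_index(File.read('test_docs.xml'), solr_poster('http://localhost:8983/solr/update'))`, where both the file name and the URL are placeholders.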

    For (LC) call number searching, my test cases covered:
    • searches are always left anchored
      • class doesn't match against cutter (query "A1" doesn't match "B2 .A1")
      • class doesn't match against volume designations (query "V1" doesn't match "A1 .C2 V.1")
      • query that includes first cutter should not match wrong or second cutter
      • query that includes first cutter should not match class value repeated for second cutter (query "B2 A1" should not match "A1 .B2 A1")
    • classifications match only like classifications
      • A1 does not match A1.1 or A11
      • A1.11 does not match A11.1 or A111
    • searches that include only the first letter of the first cutter must match (query "A1 B" should match "A1 B2" and "A1 B77")
    • full call number queries must match (an example of moderate length:  "N7433.4 .G73 H53 2007")
    • punctuation and spacing before first cutter doesn't matter (all these are equivalent:  "A1 .B2", "A1.B2", "A1 B2", "A1B2")
    • space between class letters and digits doesn't matter ("PQ 3563" same as "PQ3563")
    • a very small number of real test cases
    There are usually many variants of the simply stated test cases above.  For example, are searches for each of the likely punctuation and spacing variants before the first cutter working correctly when the classification has more than one letter?  When there is a decimal point in the classification?  When the raw value to go into the Solr index has a period, but the query does not?
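
To make the criteria concrete, here is a plain Ruby sketch that states several of them as executable checks. It is not the Solr analysis I ended up with (that lives in schema.xml); `lc_tokens` and `lc_match?` are hypothetical helpers that only normalize punctuation and spacing around the class and first cutter, and they treat the last query token as a prefix, which is a simplification.

```ruby
# Class letters, class number (optional decimal), optional period, the rest.
LC_CLASS = /\A\s*([A-Za-z]{1,3})\s*(\d+(?:\.\d+)?)\s*\.?\s*(.*)\z/

# Turn a raw LC call number into tokens: the class first, then cutters etc.
# Returns [] when the value doesn't look like an LC call number.
def lc_tokens(raw)
  match = LC_CLASS.match(raw.to_s)
  return [] unless match
  klass = match[1].upcase + match[2]
  rest = match[3].upcase.split.map { |token| token.sub(/\A\./, '') }
  [klass] + rest.reject(&:empty?)
end

# Left-anchored match: query tokens must line up with the call number's
# tokens from the start. The class must match exactly; the last query token
# after the class may be a prefix (so "A1 B" matches "A1 B2").
def lc_match?(query, call_number)
  q = lc_tokens(query)
  doc = lc_tokens(call_number)
  return false if q.empty? || q.length > doc.length
  q.each_with_index.all? do |token, i|
    if i > 0 && i == q.length - 1
      doc[i].start_with?(token)
    else
      doc[i] == token
    end
  end
end

def check(condition, message)
  raise "failed: #{message}" unless condition
end

check !lc_match?('A1', 'B2 .A1'), 'class does not match against cutter'
check !lc_match?('V1', 'A1 .C2 V.1'), 'class does not match volume designation'
check !lc_match?('B2 A1', 'A1 .B2 A1'), 'no match on repeated class value'
check !lc_match?('A1', 'A1.1') && !lc_match?('A1', 'A11'), 'like classes only'
check lc_match?('A1 B', 'A1 B2') && lc_match?('A1 B', 'A1 B77'), 'first cutter letter'
variants = ['A1 .B2', 'A1.B2', 'A1 B2', 'A1B2'].map { |v| lc_tokens(v) }
check variants.uniq.length == 1, 'punctuation and spacing before first cutter'
check lc_tokens('PQ 3563') == lc_tokens('PQ3563'), 'space inside class'
puts 'call number checks passed'
```

The real acceptance tests run the same kinds of assertions through Solr, so they exercise the field analysis and request handler rather than a Ruby stand-in.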

    For each test, we need
    1. Solr documents with appropriate fields and values to exercise correct and incorrect possibilities,
    2. cucumber scenarios capturing the acceptance criteria as applied to the test data.
    So, for various call number searching behaviors, I created xml for Solr docs to put each test case through its paces. Then I wrote cucumber scenarios to perform searches against the test data and examine the results.

    I am now ready to determine the best Solr text analysis to fulfill the desired call number searching behaviors.

    I wasn't sure which of a variety of Solr analysis possibilities would work best:
    • EdgeNGram (ngrams, but only the ones starting with the first character of the field)
    • WordDelimiterFilter with just the right settings
    • WhitespaceTokenizer with careful pre-tokenization normalization
    • PatternTokenizer with careful pre-tokenization normalization
    (As an aside, Git was a great tool for keeping all my approaches separate:  I had a branch for each approach, and could switch at will, and also had a way to go back to older tries ... all right on my laptop. )

    So, what is the best approach?   (drum roll ...)

    I wasn't even sure it would be possible to fulfill all of my acceptance tests, so I was thrilled that I got all tests to pass for two different methods:  the EdgeNGram and the WhitespaceTokenizer.

    I put the field type definitions for both of them here:  http://pastie.org/1242164  as well as here  http://paste.pocoo.org/show/279886/

    but I will be working with the WhitespaceTokenizer -- it will require far less disk space.
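
To see why the EdgeNGram approach is presumably the more expensive one, it helps to look at what edge n-grams are: every prefix of a token becomes its own indexed term. The function below is a hypothetical illustration in plain Ruby, not Solr's implementation; its size limits loosely mirror the min and max gram size settings of Solr's EdgeNGram filter.

```ruby
# Every prefix of the token, from min_size up to max_size characters.
def edge_ngrams(token, min_size = 1, max_size = token.length)
  upper = [max_size, token.length].min
  (min_size..upper).map { |size| token[0, size] }
end

call_number = 'N7433.4G73H532007' # made-up normalized value, no spaces
grams = edge_ngrams(call_number)

puts edge_ngrams('A1B2').inspect
puts "#{call_number.length} characters produce #{grams.length} indexed terms"
```

A value of n characters turns into n terms, which fits the disk space difference I saw; the whitespace approach indexes only a handful of tokens per call number.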

    A few notes:
    • Solr local params rule
    • a local param prefix query string does NOT go through analysis
    • queries with spaces in them can be a little tricky to test via cucumber
    • left anchoring tokenized searches requires a little cleverness
    • getting appropriate results for A111 vs. A1.11 vs. A11.1 can require some cleverness 
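
To illustrate the second note: with Solr's prefix query parser, everything after the closing brace is used as the prefix as-is, so any normalization the field's analyzer would have done must be applied to the query value before it is sent. A minimal sketch, where `callnum_search` is a hypothetical field name and the lowercased value stands in for whatever normalization the field actually needs:

```ruby
require 'uri'

# Build a local params prefix query. The value is NOT analyzed by Solr,
# so the caller must already have normalized it to match the indexed terms.
def prefix_query(field, normalized_value)
  "{!prefix f=#{field}}#{normalized_value}"
end

# URL-encode the query as the q parameter of a select request.
def prefix_query_param(field, normalized_value)
  URI.encode_www_form('q' => prefix_query(field, normalized_value))
end

puts prefix_query('callnum_search', 'a1 b')
puts prefix_query_param('callnum_search', 'a1 b')
```

Note that the space in the value is part of the prefix, which is one reason queries with spaces need extra care in test steps.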

    Of course, now they're going to ask me to get this working for our SUDOC and Dewey call numbers (we have about a half million of each of those flavors) and for our special local call numbers.  Guess what?  I'll ask for test cases and add 'em to the ones I have and I'll feel very secure that I won't be breaking anything I already test for as I accommodate new criteria.

    * for Cucumber documentation, I find the wiki pages very helpful: http://github.com/aslakhellesoy/cucumber/wiki/_pages

    ** Cucumber does not support a "Before/After All Scenarios - do this once per feature file", so I fake it in a feature file by having its first scenario doing steps 1-3  and  its last scenario doing step 5.  The actual tests are all the scenarios in between.  (Cucumber does support Before/After each scenario, in a couple of different ways: http://github.com/aslakhellesoy/cucumber/wiki/Background and there is also a way to run something before any scenario using global hooks: http://github.com/aslakhellesoy/cucumber/wiki/hooks ).