Sunday, October 24, 2010

Testing Solr Indexing Software - Mapping Tests

In a previous post, I talked about the different levels of testing for code that writes to a Solr index.  This post will go into detail about Mapping Tests -- tests of how raw data is processed by the indexing software into the Solr document values before they are written to the Solr index.
Mapping tests answer questions like these:
  1. "Am I assigning the right format value to my Solr document given a MARC record with this leader, this 008 and this 006?"
  2. "Will the Solr document have exactly the subset of OCLC numbers we want from this record with 8 OCLC numbers?"
  3. "Is the sortable version of the title correctly removing non-filing characters for Hebrew titles?"
Note that these are all questions that can be answered BEFORE the Solr document is written to the index;  we test the mapping of our raw data to what *will* be sent to Solr.
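
To give a feel for what such a test looks like in code (the full example appears later in this post), here is the shape of a single mapping assertion using SolrMarc's test helpers; the file name formatTests.mrc and record id "111" are made up for illustration:

    // given the record with id "111" in formatTests.mrc, assert that the format field of
    // the Solr document we *would* write contains "Book"   (file name and id are invented)
    solrFldMapTest.assertSolrFldValue(testDataParentPath + File.separator + "formatTests.mrc",
            "111", "format", "Book");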

Recall these principles of good test code:
  • test code should live close to the code it is testing
  • tests should be automate-able via a script for continuous integration
  • tests are useless if they give false positives:  you should be certain that the test code fails when it is supposed to - this is the "test the test code" principle.
  • tests should assert that the code behaves well with bad data ... the data is always dirty.
  • tests should exercise as much of the code you write as is practical.

The first of these principles can be translated to:  mapping test code belongs with the code that transforms your raw data into the Solr documents you will write to the index.


Mapping tests are idiosyncratic to the code used to get from raw data to a Solr document.  In my case, the raw data is MARC (http://www.loc.gov/marc/), and the code I use to transform my MARC data into Solr documents is Bob Haschart's SolrMarc (http://code.google.com/p/solrmarc/).

Bob Haschart came up with one way to do mapping tests in SolrMarc;  I was sidetracked by the hammer in my hand making everything look like a nail.  After I saw Bob's work, I came up with an additional way to do mapping tests in SolrMarc.  Both approaches are documented at http://code.google.com/p/solrmarc/wiki/Testing, and the code, as well as the test data, is available in the SolrMarc code base.

Here is an example of a JUnit mapping test for SolrMarc.  Obviously some of the work is done in the superclass(es) -- I'll leave perusal of those as an exercise for the reader.

import java.io.File;

import org.junit.Before;
import org.junit.Test;

public class AuthorTests extends AbstractStanfordBlacklightTest {

    @Before
    public final void setup()
    {
        mappingTestInit();
    }


    /**
     * Personal name display field tests.
     */
    @Test
    public final void testPersonalNameDisplay()
    {
        String fldName = "author_person_display";
        String testFilePath = testDataParentPath + File.separator + "authorTests.mrc";

        // 100a
        // trailing period removed
        solrFldMapTest.assertSolrFldValue(testFilePath, "345228", fldName, "Bashkov, Vladimir");
        // 100ad
        // trailing hyphen retained
        solrFldMapTest.assertSolrFldValue(testFilePath, "919006", fldName, "Oeftering, Michael, 1872-");
        // 100ae  (e not indexed)
        // trailing comma should be removed
        solrFldMapTest.assertSolrFldValue(testFilePath, "7651581", fldName, "Coutinho, Frederico dos Reys");
        // 100aqd
        // trailing period removed
        solrFldMapTest.assertSolrFldValue(testFilePath, "690002", fldName, "Wallin, J. E. Wallace (John Edward Wallace), b. 1876");
        // 100aqd
        solrFldMapTest.assertSolrFldValue(testFilePath, "1261173", fldName, "Johnson, Samuel, 1649-1703");
        // 'nother sort of trailing period - not removed
        solrFldMapTest.assertSolrFldValue(testFilePath, "8634", fldName, "Sallust, 86-34 B.C.");
        // 100 with numeric subfield
        solrFldMapTest.assertSolrFldValue(testFilePath, "1006", fldName, "Sox on Fox");
    }
}
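
For readers who don't want to dig through the superclasses just yet, here is a very rough sketch of the plumbing they supply.  SolrFieldMappingTest is SolrMarc's mapping-test helper, but the property names and constructor arguments below are assumptions for illustration, not the actual Stanford code:

import static org.junit.Assert.fail;

import org.solrmarc.testUtils.SolrFieldMappingTest;

// a minimal sketch, NOT the real AbstractStanfordBlacklightTest
public abstract class AbstractStanfordBlacklightTest
{
    protected static String testDataParentPath;      // directory holding the .mrc test files
    protected SolrFieldMappingTest solrFldMapTest;    // SolrMarc helper that applies the indexing mappings

    protected void mappingTestInit()
    {
        // where the MARC test records live  (property name is an assumption)
        testDataParentPath = System.getProperty("test.data.path");
        try
        {
            // point the helper at this site's indexing configuration; "id" is the unique key field
            // (exact constructor arguments are an assumption)
            solrFldMapTest = new SolrFieldMappingTest(System.getProperty("test.config.file"), "id");
        }
        catch (Exception e)
        {
            fail("could not initialize mapping test helper: " + e.getMessage());
        }
    }
}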

Is a Mapping Test the same as a Unit Test?

No.  This is not necessarily what I would call *unit* testing.  I think of unit tests as operating at the "method" level -- they ensure that a method gives expected results for different input values.   If I write a method to normalize the classification portion of an LC call number, unit tests would affirm correct method results for these sorts of questions:  what if there is a space between the letters and digits of the class?  what if there are decimal digits?   what if the class letters are illegal?  what if there is a numeric class suffix?  what if there is an alphabetic class suffix?

Here is an example of a unit test I wrote (note that I tried to think of as many cases as possible):

    /**
     * unit test for getLCB4FirstCutter()
     */
    @Test
    public void testLCStringB4FirstCutter()
    {
        String callnum = "M1 L33";
        assertEquals("M1", getLCB4FirstCutter(callnum));
        callnum = "M211 .M93 K.240 1988"; // first cutter has period
        assertEquals("M211", getLCB4FirstCutter(callnum));
        callnum = "PQ2678.K26 P54 1992"; // no space b4 cutter with period
        assertEquals("PQ2678", getLCB4FirstCutter(callnum));
        callnum = "PR9199.4 .B3"; // class has float, first cutter has period
        assertEquals("PR9199.4", getLCB4FirstCutter(callnum));
        callnum = "PR9199.3.L33 B6"; // decimal call no space before cutter
        assertEquals("PR9199.3", getLCB4FirstCutter(callnum));
        callnum = "HC241.25F4 .D47";
        assertEquals("HC241.25", getLCB4FirstCutter(callnum));

        // suffix before first cutter
        callnum = "PR92 1990 L33";
        assertEquals("PR92 1990", getLCB4FirstCutter(callnum));
        callnum = "PR92 1844 .L33 1990"; // first cutter has period
        assertEquals("PR92 1844", getLCB4FirstCutter(callnum));
        callnum = "PR92 1844.L33 1990"; // no space before cutter w period
        assertEquals("PR92 1844", getLCB4FirstCutter(callnum));
        callnum = "PR92 1844L33 1990"; // no space before cutter w no period
        assertEquals("PR92 1844", getLCB4FirstCutter(callnum));
        // period before cutter
        callnum = "M234.8 1827 .F666";
        assertEquals("M234.8 1827", getLCB4FirstCutter(callnum));
        callnum = "PS3538 1974.L33";
        assertEquals("PS3538 1974", getLCB4FirstCutter(callnum));
        // two cutters
        callnum = "PR9199.3 1920 L33 A6 1982";
        assertEquals("PR9199.3 1920", getLCB4FirstCutter(callnum));
        callnum = "PR9199.3 1920 .L33 1475 .A6";
        assertEquals("PR9199.3 1920", getLCB4FirstCutter(callnum));
        // decimal and period before cutter
        callnum = "HD38.25.F8 R87 1989";
        assertEquals("HD38.25", getLCB4FirstCutter(callnum));
        callnum = "HF5549.5.T7 B294 1992";
        assertEquals("HF5549.5", getLCB4FirstCutter(callnum));

        // suffix with letters
        callnum = "L666 15th A8";
        assertEquals("L666 15th", getLCB4FirstCutter(callnum));

        // non-compliant cutter
        callnum = "M5 .L";
        assertEquals("M5", getLCB4FirstCutter(callnum));

        // no cutter
        callnum = "B9 2000";
        assertEquals("B9 2000", getLCB4FirstCutter(callnum));
        callnum = "B9 2000 35TH";
        assertEquals("B9 2000 35TH", getLCB4FirstCutter(callnum));

        // wacko lc class suffixes
        callnum = "G3840 SVAR .H5"; // suffix letters only
        assertEquals("G3840 SVAR", getLCB4FirstCutter(callnum));

        // first cutter starts with same chars as LC class
        callnum = "G3824 .G3 .S5 1863 W5 2002";
        assertEquals("G3824", getLCB4FirstCutter(callnum));
        callnum = "G3841.C2 S24 .U5 MD:CRAPO*DMA 1981";
        assertEquals("G3841", getLCB4FirstCutter(callnum));

        // space between LC class letters and numbers
        callnum = "PQ 8550.21.R57 V5 1992";
        // assertEquals("PQ 8550.21", getLCB4FirstCutter(callnum));
        callnum = "HD 38.25.F8 R87 1989";
        // assertEquals("HD 38.25", getLCB4FirstCutter(callnum));
    }
 
In contrast, the following would be mapping tests (sketched in code just after this list):
  • Given a MARC record with a call number in 999 subfield a that begins "MFILM", does the Solr document we're creating have "Microform" in the format field?
  • Given a MARC record that contains a 502 field, does the Solr document we're creating have "Thesis" in the format field? 
  • Given a MARC record with a 956 subfield u that does NOT contain "sfx", does the Solr document we're creating NOT have an sfx_url field?
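
In SolrMarc-style JUnit, those three checks might look roughly like the following; the file name and record ids are invented, and assertNoSolrFld is an assumed name for a "this field should be absent" helper:

    @Test
    public final void testFormatAndSfxMappings()
    {
        String testFilePath = testDataParentPath + File.separator + "formatTests.mrc";  // invented file name

        // 999 subfield a beginning "MFILM"  =>  "Microform" in the format field
        solrFldMapTest.assertSolrFldValue(testFilePath, "mfilmRec", "format", "Microform");
        // record containing a 502 field  =>  "Thesis" in the format field
        solrFldMapTest.assertSolrFldValue(testFilePath, "thesisRec", "format", "Thesis");
        // 956 subfield u that does NOT contain "sfx"  =>  no sfx_url field at all
        solrFldMapTest.assertNoSolrFld(testFilePath, "noSfxRec", "sfx_url");
    }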

What about the following situation?  Is this a mapping test or a unit test or something else?
  • Given a MARC record containing an 050 subfield a with value "A1    1997" and subfield b with value ".B2 V.17",  does the call_number field in the Solr document we're creating contain value "A1 1997 .B2 V.17"?
It depends.  If your Solr schema declarations convert consecutive white space into a single space for the call_number field type, and if your indexing configuration (e.g. blah_index.properties for SolrMarc) is the only other thing you've used to create the call_number field values, then I would say you don't need a mapping test.   That's because there should be *unit* tests asserting that generic subfield concatenation is working properly ("Does concatenation add a space between subfields?"  "What about when I don't want a space?").
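
For instance, the concatenation half of that claim might be covered by a unit test along these lines.  concatSubfields is a hypothetical utility method, not an actual SolrMarc call, and the run of spaces inside "A1    1997" is left untouched here because collapsing whitespace is the Solr field type's job, not the concatenation code's:

    @Test
    public void testSubfieldConcatenation()
    {
        // concatenation adds a space between subfield values ...
        assertEquals("A1    1997 .B2 V.17", concatSubfields(" ", "A1    1997", ".B2 V.17"));
        // ... except when no separator is wanted
        assertEquals("A1    1997.B2 V.17", concatSubfields("", "A1    1997", ".B2 V.17"));
    }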

However, if you have written special handling code for the call_number field that looks for values in the MARC 050 as well as 090 fields ... then you need a mapping test for the call_number field.  You'd want to test records with 050 only, with 090 only, with more than one 050, with both 050 and 090, with no 050 or 090, etc.
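
A sketch of what those mapping tests could look like; the test file callnumTests.mrc, the record ids, and the expected values are all invented for illustration:

    @Test
    public final void testCallNumberFrom050and090()
    {
        String fldName = "call_number";
        String testFilePath = testDataParentPath + File.separator + "callnumTests.mrc";  // invented file name

        // record with an 050 only:  call_number should come from the 050
        solrFldMapTest.assertSolrFldValue(testFilePath, "050only", fldName, "A1 1997 .B2 V.17");
        // record with an 090 only:  call_number should come from the 090
        solrFldMapTest.assertSolrFldValue(testFilePath, "090only", fldName, "QA76.9 .D3 H68 2005");
        // records with both 050 and 090, multiple 050s, or neither field would each get
        // their own hand-crafted test records and assertions
    }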

Where Do I Get Raw Data for Mapping Tests? 

You will need to create your own test data for mapping tests, or at least collect it from problem reports.  Yes, it can be time consuming.

But remember the payoff - you get a safety net that becomes increasingly important as your indexing code gets more complex (and it will!).  Some examples:
  • I need to be certain my site's customized code is not impacted when I pull the latest code updates from the SolrMarc project.
  • I think I can ditch some of my custom code because of new functionality available in the core SolrMarc code -- my test code assures me the change didn't break anything.  
  • I have to change my local format algorithm and know the changes shouldn't affect the assignment of the "book" format -- existing mapping tests for "book" format assignment ensure that my new changes don't step on the existing code.
Of course new problems are always presenting themselves.  But if you write tests for each new problem as you fix it, you won't have to say "dang!  I thought I fixed that!" to anyone downstream.  Each problem you fix adds to your safety net.  This is a HUGE time saver (and face saver) as your project evolves.

Automate Your Safety Net

Wouldn't it be great if, every time you changed your mapping code (e.g. your localized version of SolrMarc), all your mapping tests ran automatically as soon as you put the new code into source control??  Then you would know immediately if you inadvertently broke something.  It would be as if your testing safety net magically unfurled below you.

Continuous Integration tools like Hudson were created to fulfill this purpose.  Given a Hudson server, it is easy to set up automatic testing for your projects.  (I confess I have wonderful colleagues who set up a Hudson instance for me, so I don't know much about that part; I'm told it's not too difficult.)  We use Hudson for Java and Ruby, running continuous integration with code coverage information automatically generated.

Of course, you can save yourself some grief if you run the same tests in your workspace before you commit to source control.  Hudson provides additional safety because it's usually on a different machine with a different configuration than your workspace.  And if there are multiple people working on the same code base, you probably want the certainty that Joe didn't break your code when he fixed his bug.  You can probably trust Joe ... but we're software engineers -- we want proof.


How Do I Set Up Continuous Integration for Mapping Tests? 

The SolrMarc mapping tests are written in Java, because SolrMarc is in Java.  Running JUnit test code is one thing that Ant actually makes easy.  Here is an Ant target to run all the test code in a directory and its children:

    <target name="runTests" depends="testCompile" description="Run mapping tests for local SolrMarc">
        <mkdir dir="${core.coverage.dir}"/>
   
        <path id="test.classpath">
            <pathelement location="${build.dir}/test" />
            <path refid="test.compile.classpath" />
        </path>
        
        <junit showoutput="yes" printsummary="yes" fork="yes" forkmode="perBatch">
            <classpath refid="test.classpath" />
           
<!-- use test element instead of batchtest element if desired
            <test name="${test.class}" />
-->
            <batchtest fork="yes">
                 <fileset dir="${build.dir}/test" excludes="org/solrmarc/testUtils/**"/>
            </batchtest>
           
        </junit>

    </target>   

Hudson configurations can be set to automatically poll a source control system's project (we mostly use Git, but have some laggard projects in Subversion) at whatever interval you like.  In my case, Hudson polls our SolrMarc code in our local Subversion repository every 15 minutes.  If Hudson finds a new commit, it checks out the source code (or a delta) and then runs the Ant target given in the configuration.  (Obviously the target should do a fresh build before it runs the new tests.)
