Tuesday, August 21, 2012

Solr Acceptance Tests: introducing rspec-solr (and sw_index_tests)

I've released the rspec-solr ruby gem, which provides custom RSpec matchers for Solr responses.  RDoc is at http://rubydoc.info/github/sul-dlss/rspec-solr  and the source code is at https://github.com/sul-dlss/rspec-solr/ .

It's pretty simple:  once you have a ruby Solr response, you wrap it in RSpecSolr:
 
 resp = RSpecSolr::SolrResponseHash.new(yer_solr_resp)

and then you can make useful assertions for acceptance testing:
  
 resp.should include({'id'=>'81234'})
 resp.should include({'title'=>'Harry Potter'}).in_first(3).results
 resp.should include('111').before('222')
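
If you don't already have a Solr response in hand, here's one way to get one -- a minimal sketch assuming the rsolr gem and a Solr instance at the usual local URL (only the wrapping on the last line is rspec-solr itself):

 require 'rsolr'
 require 'rspec-solr'

 # assumed local Solr URL -- adjust for your installation
 solr = RSolr.connect(:url => 'http://localhost:8983/solr')
 # rsolr returns the response as a plain ruby hash
 yer_solr_resp = solr.get('select', :params => {'q' => 'Harry Potter'})
 resp = RSpecSolr::SolrResponseHash.new(yer_solr_resp)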

So you might write specs like this:

it "q of 'Buddhism' should get 8,500-10,500 results" do
  resp = solr_resp_doc_ids_only({'q'=>'Buddhism'})
  resp.should have_at_least(8500).documents
  resp.should have_at_most(10500).documents
end
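
(have_at_least and have_at_most are stock RSpec matchers; rspec-solr lets them count the documents in a wrapped response.  A range like this beats asserting an exact count, since index contents drift over time.)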

 it "q of 'Two3' should have excellent results", :jira => 'VUF-386' do
   resp = solr_resp_doc_ids_only({'q'=>'Two3'})
   resp.should have_at_most(10).documents
   resp.should include("5732752").as_first_result
   resp.should include("8564713").in_first(2).results
   resp.should include("5732752").before("8564713")
   resp.should_not include("5727394")
   resp.should have_the_same_number_of_results_as(solr_resp_doc_ids_only({'q'=>'two3'}))
   resp.should have_fewer_results_than(solr_resp_doc_ids_only({'q'=>'two 3'}))
 end
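
(An aside: the :jira => 'VUF-386' above is plain RSpec metadata -- nothing rspec-solr specific -- so you should be able to run just the examples for a given ticket with something like  rspec --tag jira:VUF-386.)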

 it "Traditional Chinese chars 三國誌 should get the same results as simplified chars 三国志" do
   resp = solr_resp_doc_ids_only({'q'=>'三國誌'})  
   resp.should have_at_least(240).documents
   resp.should have_the_same_number_of_results_as(solr_resp_doc_ids_only({'q'=>'三国志'})) 
 end
  


Note that these examples use a couple of helper methods (like solr_resp_doc_ids_only).  See the README for more details.

The gem is only at release 0.1.0, but I'm finding it useful already.  You'll see some FIXME and TODO comments, and I suspect there's plenty that can be improved.   I'm happy to take your pull requests.

If it looks too much like code ...
If you can get non-coding colleagues to write your tests, then making the testing syntax easier for them is probably worthwhile.  You could certainly layer Cucumber on top of rspec-solr to write your Solr acceptance tests in more natural language.
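
For instance, a thin Cucumber layer might look something like this (just a sketch -- the step wording and the solr_resp_doc_ids_only helper are my assumptions, not something either gem provides):

 # features/step_definitions/solr_steps.rb
 When /^I search for "(.*)"$/ do |query|
   @resp = solr_resp_doc_ids_only({'q' => query})
 end

 Then /^I should get at least (\d+) results$/ do |num|
   @resp.should have_at_least(num.to_i).documents
 end

 Then /^"(.*)" should be the first result$/ do |doc_id|
   @resp.should include(doc_id).as_first_result
 end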


Tip:

For most of my tests, I realized I check nothing but the Solr document ids in the results.  It's a lot easier to read RSpec error messages when the Solr response doesn't include extraneous fields or facet counts ... and it's also a much smaller http response.  So I rigged up a method that adds {'fl'=>'id', 'facet'=>'false'} to the request params I send to Solr.
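
In case it's helpful, here's roughly what such a method could look like -- a sketch assuming the rsolr gem and a locally running Solr (the real helpers live in the rspec-solr README and sw_index_tests):

 require 'rsolr'
 require 'rspec-solr'

 # send the given params to Solr, but request only the 'id' field and
 # no facets, then wrap the response for the rspec-solr matchers
 def solr_resp_doc_ids_only(solr_params)
   solr = RSolr.connect(:url => 'http://localhost:8983/solr')  # assumed URL
   resp = solr.get('select',
            :params => solr_params.merge({'fl'=>'id', 'facet'=>'false'}))
   RSpecSolr::SolrResponseHash.new(resp)
 end

My spec errors now read like this: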

expected {"responseHeader"=>{"status"=>0, "QTime"=>10, "params"=>{"facet"=>"false", "fl"=>"id", "qt"=>"search_author", "wt"=>"ruby", "q"=>"契沖"}}, "response"=>{"numFound"=>3, "start"=>0, "docs"=>[{"id"=>"6675613"}, {"id"=>"6675393"}, {"id"=>"6274534"}]}} to include ["6675613", "6675393", "7191966", "6274534", "4783602"]
Diff:
@@ -1,2 +1,14 @@
-[["6675613", "6675393", "7191966", "6274534", "4783602"]]
+{"responseHeader"=>
+  {"status"=>0,
+   "QTime"=>10,
+   "params"=>
+    {"facet"=>"false",
+     "fl"=>"id",
+     "qt"=>"search_author",
+     "wt"=>"ruby",
+     "q"=>"契沖"}},
+ "response"=>
+  {"numFound"=>3,
+   "start"=>0,
+   "docs"=>[{"id"=>"6675613"}, {"id"=>"6675393"}, {"id"=>"6274534"}]}}
 

They could have even less output if I turned off "diffable" -- but I'm currently finding the diff helpful.




Okay, but what good is this, really?

My current project is to improve search results for CJK (Chinese, Japanese and Korean) queries in SearchWorks.   It's nearly impossible for a CJK-ignorant coder such as myself to write good tests.  It's pretty darn hard for our non-coder CJK experts to write good tests, too.  So we have to iterate to figure out a set of acceptance tests.  Doing this without coding repeatable, automatable tests is ludicrous.**

We already have search tests, but they are slow.  They use Cucumber to mimic a user interacting with the web page: do a search, send the request to Solr, then have the SearchWorks Blacklight Rails stack prepare the html the application would serve to present Solr's search results.  The assertions are made against that html.  Since search acceptance testing doesn't care about the Rails stack, this is a lot of extra processing.

So it's time to take Rails out of the picture.  My colleague Chris Beer and I conceived a way to make it really simple: write rspec-style assertions directly against Solr response objects.  That spawned rspec-solr.

I am already using rspec-solr for our CJK acceptance tests.  All I needed was the rsolr gem, a spec_helper file, and some simple configuration -- 4 very small files (see the rspec-solr README).

I've got CJK tests like this:

  it "should parse out 中国 (china)  经济 (economic)  政策 (policy)" do
    resp = solr_resp_doc_ids_only({'q'=>'中国经济政策'}) 
    resp.should have_at_least(85).documents
    resp.size.should be_within(5).of(solr_resp_doc_ids_only({'q'=>'中国  经济  政策'}).size) 
  end
 
  it "Traditional chars 三國誌 should get the same results as simplified chars 三国志" do
    resp = solr_resp_doc_ids_only({'q'=>'三國誌'}) 
    resp.should have_at_least(240).documents
    resp.should have_the_same_number_of_results_as(solr_resp_doc_ids_only({'q'=>'三国志'}))
  end

  it "hangul  광주 should get results for hancha  光州" do
    resp = solr_resp_doc_ids_only({'q'=>'광주'})
    resp.should include(["7763372", "7773313"]) # hancha  光州
    resp.should have_at_least(110).documents
  end


I'm also migrating our cucumber search regression tests to the rspec-solr approach -- obviously, I want a full suite of regression tests as I make changes for CJK searching.

A sample regression test:
  it "q of 'Two3' should have excellent results", :jira => 'VUF-386' do
    resp = solr_resp_doc_ids_only({'q'=>'Two3'})
    resp.should have_at_most(10).documents
    resp.should include("5732752").as_first_result
    resp.should include("8564713").in_first(2).results
    resp.should_not include("5727394")
    resp.should have_the_same_number_of_results_as(solr_resp_doc_ids_only({'q'=>'two3'}))
    resp.should have_fewer_results_than(solr_resp_doc_ids_only({'q'=>'two 3'}))
  end
 

Both sets of tests are very much works in progress, but I've deliberately put them up on github as the sw_index_tests repository so you can leverage them however you see fit.

I think it's pretty slick.


** In fact, they already DID do this for our ILS without repeatable, automatable tests ... and without records of their manual tests ... so we're starting from scratch.  How annoying and wasteful!


Tuesday, March 13, 2012

Upgrading from Solr 1.4 to Solr 3.5 - hiccups

Stanford SearchWorks has been due for a Solr upgrade for a loooong time -- we've been using Solr 1.4 since ... well, forever.   Bob Haschart upgraded SolrMarc to work with Solr 3.5, so I figured I would upgrade Solr as I refactored SolrMarc for the stanford-solr-marc fork.  (See also previous blog entry).
In the course of upgrading from Solr 1.4 to Solr 3.5, a number of our tests failed.  Usually the problem was a mistake in my configuration files for Solr 3.5; sometimes the tests were just too brittle.  It took a pass or two to start using the ICU library for unicode normalization rather than SolrMarc's unicodeNormalizer.  I managed to get most of the failing tests to pass, but a handful stumped me.

Here's what I learned:

1. Hyphens and WordDelimiterFilterFactory

Solr 3.2 (?) added a new field analysis setting, autoGeneratePhraseQueries, which defaults to "false".  In Solr 1.4, the behavior was as though this setting were always "true".  The difference is important for certain WordDelimiterFilterFactory configurations.  Let's say we have a query with a value of "red-rose" (no quotes).

in Solr 1.4:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j"
        composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
     <filter class="solr.WordDelimiterFilterFactory"
        splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
        splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
        catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
</fieldtype>

With debugQuery=true, we find the following query fragment being generated by dismax:
   text_field:"red (rose redros)"

in Solr 3.5:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.ICUFoldingFilterFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"         splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
        splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
        catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldtype>
debugQuery=true shows us this query fragment:
   (text_field:red text_field:rose text_field:redros) -- including the parens.

Thus, in Solr 3.5 a match on just "rose" is good enough: the tokens WordDelimiterFilterFactory produces from "red-rose" become independent OR'd clauses, whereas Solr 1.4's analysis built a phrase query requiring "red" immediately followed by "rose" (or "redros").

How to fix this?

Add the attribute autoGeneratePhraseQueries="true" to the field type declaration:

  <fieldtype name="text" class="solr.TextField" positionIncrementGap="100"
       autoGeneratePhraseQueries="true"
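
With that in place, debugQuery on Solr 3.5 should once again show the phrase-style fragment for "red-rose", as in the Solr 1.4 output above.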

2. StreamingUpdateSolrServer and Binary Updates

In the most recent release of SolrJ (3.5), StreamingUpdateSolrServer was not processing binary fields properly.  Two solutions: (1) use the SolrJ jar distributed with Bob Haschart's SolrMarc, which he has modified to address the problem; or (2) use a nightly jar, since the fix is already on the SolrJ trunk and the SolrJ 3.6 branch.

3. Phrase Slop and Queries with Repeated Terms

Ultimately, I managed to get our tests passing except for two.  I couldn't figure out the difficulty: the debugQuery results on Solr 1.4 and Solr 3.5 looked the same, and comparing field analysis via the admin interface turned up nothing different either.

Jonathan Rochkind pointed out that both queries were phrase searches containing repeated words.

It turns out there was a bug in Lucene that crept in sometime between Solr 1.4 and Solr 3.5: a phrase query with repeated terms and a non-zero slop setting returned incorrect results.

https://issues.apache.org/jira/browse/LUCENE-3821

Thanks to Doron Cohen and Robert Muir, a fix was found and a patch was applied to Lucene, which was picked up in the Solr trunk and Solr 3.6 branch as of March 10, 2012.

Wednesday, February 15, 2012

stanford-solr-marc fork of SolrMarc

In the interests of reducing my ongoing work for Stanford's SearchWorks index, I have, with Bob Haschart's blessing, forked the SolrMarc code and made my fork available via the (new) SolrMarc space on github:

http://github.com/solrmarc/stanford-solr-marc


Specifics of how my fork diverges are below.


This is an experiment:  I believe my personal efforts will be reduced by using this pared down derivative of SolrMarc.  I am NOT committing to supporting all the use cases that Bob supports with SolrMarc.  Bob is doing a great job of juggling VuFind needs, Blacklight needs, UVa needs, less savvy consumers' needs, and maintaining backward compatibility with earlier versions of Solr.  I cannot make those kinds of commitments on Stanford's dollar or on my own time.   
One goal of the fork is to simplify the code and the build scripts for development purposes.  This creates a slightly higher expectation of users:  they will be presumed to have expertise to do what they need downstream.  (e.g. edit the build.properties file, set up analogous directories for their local site code and/or their local versions of Solr, substitute their own java customizations, set their own version up for bean shell, etc).


If anyone likes what I've done or any part of it, feel free to grab it, fork it, mimic it or whatever.   I am happy to add committers if they write test code for any changes they want to push up.

I have created hudson builds for the core code and the site specific code in stanford-solr-marc on the projectblacklight hudson server.  These builds will kick off after each commit to the stanford-solr-marc github repository, and they create javadoc and test coverage reports (see the hudson pages below for links to these).


http://hudson.projectblacklight.org/hudson/job/stanford-solr-marc%20CORE%20code/
http://hudson.projectblacklight.org/hudson/job/stanford-solr-marc%20SITE%20code/

I can add emails to the hudson build notifications, and can probably figure out how to have github send emails upon commits, if folks desire.

It would be awesome if the fork converges with SolrMarc future development to the point of re-combining the code base.  Meanwhile, as Bob and I have discussed, this fork may help Bob with some of his refactoring plans, and I can forge ahead with Stanford specific needs more easily.
Significant Differences between my fork and the SolrMarc on GoogleCode:
  1. git  
  2. reorg of the directory structures for clarity and to reduce nesting.
  3. complete rewrite of the ant builds.
    • a single build.xml file
      • no macros
    • a single build.properties file -- it should be straightforward to change build.properties as desired.
    • the build process does not result in a single jar, but instead creates a dist directory with all the files and folder structure as needed to execute the code.
  4. the wonderful scripts written by Bob are not "localized" by the build process
  5. strives to use "vanilla" versions of Solr and Marc4j, with version clearly indicated
  6. the utility class has been refactored into smaller pieces
  7. the only exemplar site code is Stanford SearchWorks
  8. functionality not used by Stanford is often stripped out, such as
    • bean shell scripting capability (it could be added back in easily, if desired)
    • notion of running under windows (could be added back in)
    • unused code placeholders, such as z39.50
  9. embedded solrj update options are not exercised - this code will be stripped out soon
  10. core tests have been largely rewritten to adhere to junit common practices:  ant calls a junit class which executes the java code and asserts the correct results.
  11. current intent is to move away from using java reflection to simultaneously support multiple versions of Solr -- I will create a tag/branch for a Solr version if a Solr upgrade isn't backwards compatible, and I make no promise to keep that branch up to date.
I have not written or rewritten the type of documentation available on the googlecode SolrMarc wiki - much of that documentation is directly applicable (settings for xxx_config.properties, settings for xxx_index.properties …).

Note that the SITE code for Stanford SearchWorks will lag behind our actual production code, as the copy of record is *not* the github repository.  This:
a.  avoids a github commit message for every bit of local work
b.  allows our copy-of-record to stay behind the Stanford firewall
c.  means I will update the github repository to the current Stanford production code from time to time.

Let me repeat:  I'm not promising to keep this project backwards compatible with older versions of Solr or of xx_index.properties files, as those progress.  The main audience for this codebase is me.  Others are welcome to the code, and will probably be welcomed as committers … but consumers of this codebase will be presumed to have enough expertise to do what they need downstream.  (e.g. substitute their own java customizations, or set their own version up for bean shell, or for a different version of Solr).

There is plenty more work to do.  Just a few examples:
  • More tests of core code
  • More refactoring of core code
  • Documentation