Tuesday, March 13, 2012

Upgrading from Solr 1.4 to Solr 3.5 - hiccups

Stanford SearchWorks has been due for a Solr upgrade for a loooong time -- we've been using Solr 1.4 since ... well, forever.   Bob Haschart upgraded SolrMarc to work with Solr 3.5, so I figured I would upgrade Solr as I refactored SolrMarc for the stanford-solr-marc fork.  (See also previous blog entry).
In the course of upgrading from Solr 1.4 to Solr 3.5, a number of our tests failed. Usually the problem was a mistake in my configuration files for Solr 3.5; sometimes the tests were too brittle. It took a pass or two to start using the ICU library for Unicode normalization, rather than SolrMarc's unicodeNormalizer. I managed to get most of the failing tests to pass, but a handful stumped me.

Here's what I learned:

1. Hyphens and WordDelimiterFilterFactory

Solr 3.2 (?) added a new setting for field analysis, autoGeneratePhraseQueries, which defaults to "false". Solr 1.4 had no such setting and always behaved as if it were "true". The difference matters for certain settings of WordDelimiterFilterFactory. Let's say we have a query with a value of "red-rose" (typed without the quotes).

In Solr 1.4:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j"
        composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
     <filter class="solr.WordDelimiterFilterFactory"
        splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
        splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
        catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

With debugQuery=true, we find the following query fragment being generated by dismax:
   text_field:"red (rose redros)"
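
For anyone who wants to reproduce that debug output, here is a minimal sketch of building such a request. The host, path, and field name are assumptions for illustration (adjust them for your own install); q, defType, qf, rows, and debugQuery are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Assumed values: adjust the base URL and field name for your own Solr instance.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"  # hypothetical host and path


def debug_query_url(user_query, field="text_field"):
    """Build a dismax request URL that asks Solr to include debug output."""
    params = [
        ("q", user_query),       # unquoted user input, e.g. red-rose
        ("defType", "dismax"),   # use the dismax query parser
        ("qf", field),           # field(s) dismax should search
        ("debugQuery", "true"),  # include the parsed query in the response
        ("rows", "0"),           # only the debug section is of interest here
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(debug_query_url("red-rose"))
```

Fetching that URL against each Solr version and comparing the parsed query in the debug section of the responses is how the difference described here shows up.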

In Solr 3.5:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.ICUFoldingFilterFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
        splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
        splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
        catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
</fieldtype>

With debugQuery=true, we now see this query fragment (the parentheses are part of the output):
   (text_field:red text_field:rose text_field:redros)

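Thus, in Solr 3.5 a document that contains just "rose" is a match, because the three terms are independent optional clauses. In Solr 1.4 that same document does not match, because the generated phrase query needs "red" together with "rose" or "redros".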
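
To make the difference concrete, here is a toy model in Python. It is not Lucene's matching code, and it ignores scoring and slop; the token lists are hypothetical documents that have already been through analysis.

```python
def phrase_match(doc_tokens, first, alternatives):
    """Solr 1.4 style: `first` must be directly followed by one of `alternatives`."""
    return any(
        a == first and b in alternatives
        for a, b in zip(doc_tokens, doc_tokens[1:])
    )


def any_term_match(doc_tokens, terms):
    """Solr 3.5 default: independent optional clauses, so one matching term is enough."""
    return any(token in terms for token in doc_tokens)


rose_only = ["a", "rose", "garden"]  # hypothetical analyzed document
red_rose = ["the", "red", "rose"]    # hypothetical analyzed document

print(phrase_match(rose_only, "red", {"rose", "redros"}))    # False
print(any_term_match(rose_only, {"red", "rose", "redros"}))  # True
print(phrase_match(red_rose, "red", {"rose", "redros"}))     # True
```

The first two lines of output are the whole story: the same "rose"-only document is rejected by the phrase form and accepted by the clause-per-term form.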

How to fix this?

Add the attribute autoGeneratePhraseQueries="true" to the field type declaration:

  <fieldtype name="text" class="solr.TextField" positionIncrementGap="100"
       autoGeneratePhraseQueries="true">
    <!-- analyzer chain unchanged from the Solr 3.5 example above -->
  </fieldtype>
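
If a schema has many text field types, a quick script can list the ones that still lack the attribute. This is only a sketch: the schema snippet is made up for illustration, and the script treats a missing attribute as "false", which is the Solr 3.5 behavior described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment; in practice you would read your own schema.xml.
SCHEMA_SNIPPET = """
<schema name="example" version="1.4">
  <types>
    <fieldtype name="text" class="solr.TextField" positionIncrementGap="100"/>
    <fieldType name="text_phrase" class="solr.TextField"
               autoGeneratePhraseQueries="true"/>
    <fieldType name="string" class="solr.StrField"/>
  </types>
</schema>
"""


def text_types_without_phrase_queries(schema_xml):
    """Return names of solr.TextField types lacking autoGeneratePhraseQueries="true"."""
    root = ET.fromstring(schema_xml)
    missing = []
    for elem in root.iter():
        # Schemas in the wild use both "fieldtype" and "fieldType".
        if elem.tag.lower() != "fieldtype":
            continue
        if elem.get("class") != "solr.TextField":
            continue
        # Assumption: an absent attribute means "false".
        if elem.get("autoGeneratePhraseQueries", "false").lower() != "true":
            missing.append(elem.get("name"))
    return missing


print(text_types_without_phrase_queries(SCHEMA_SNIPPET))  # ['text']
```

Not every field type on the resulting list necessarily needs the attribute; it mostly matters where WordDelimiterFilterFactory (or another filter that splits one query word into several tokens) is in the analysis chain.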

2. StreamingUpdateServer and Binary Updates

In the most recent release of SolrJ (3.5), the streaming update server (StreamingUpdateSolrServer) was not processing binary fields properly. There are two workarounds: 1) use the SolrJ jar provided in Bob Haschart's SolrMarc, as he has modified it to address this problem; or 2) use a nightly jar, as this has been fixed in the SolrJ trunk and the SolrJ 3.6 branch.

3. Phrase Slop and Queries with Repeated Terms

Ultimately, I managed to get all of our tests passing except for two. I couldn't figure out what was wrong with those two: I compared the debugQuery results from Solr 1.4 and Solr 3.5, and I compared the output of the analysis debugger in the admin interface. Nothing looked different.

Jonathan Rochkind pointed out that both failing queries were phrase searches, and that both phrases contained repeated words.

It turns out that a bug had crept into Lucene sometime between Solr 1.4 and Solr 3.5: a phrase query with repeated terms and a non-zero slop setting returned incorrect results.

https://issues.apache.org/jira/browse/LUCENE-3821

Thanks to Doron Cohen and Robert Muir, a fix was found and a patch was applied to Lucene, which was picked up in the Solr trunk and Solr 3.6 branch as of March 10, 2012.
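
If you are stuck on an affected release and want to know which of your test queries might be exposed, a rough check is easy to sketch. This is my own triage helper, not anything from Lucene: it assumes lowercasing plus a whitespace split is a good enough stand-in for your real analysis chain, and the example phrases are made up.

```python
from collections import Counter


def repeated_terms(phrase):
    """Return the terms that occur more than once in a phrase (naive tokenization)."""
    # Assumption: lowercase + whitespace split stands in for the real analysis chain.
    counts = Counter(phrase.lower().split())
    return sorted(term for term, n in counts.items() if n > 1)


def possibly_affected(phrase, slop):
    """True when a phrase query has non-zero slop and at least one repeated term."""
    return slop > 0 and bool(repeated_terms(phrase))


print(repeated_terms("to be or not to be"))        # ['be', 'to']
print(possibly_affected("to be or not to be", 3))  # True
print(possibly_affected("to be or not to be", 0))  # False
print(possibly_affected("red rose", 3))            # False
```

A True result only means the query fits the pattern described in the issue; whether its results are actually wrong still has to be checked against a fixed build.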