Discovery Grindstone: Upgrading from Solr 1.4 to Solr 3.5

Stanford SearchWorks has been due for a Solr upgrade for a loooong time -- we've been using Solr 1.4 since ... well, forever. Bob Haschart upgraded SolrMarc to work with Solr 3.5, so I figured I would upgrade Solr as I refactored SolrMarc for the stanford-solr-marc fork. (See also the previous blog entry.)
In the course of upgrading from Solr 1.4 to Solr 3.5, a number of our tests were failing. Usually the problem was a mistake in my configuration files for Solr 3.5; sometimes the tests were too brittle. It took a pass or two to start using the ICU library for Unicode normalization, rather than SolrMarc's unicodeNormalizer. I managed to get most of the failing tests to pass, but a handful stumped me.

Here's what I learned:

1. Hyphens and WordDelimiterFilterFactory

Solr 3.2 (?) added a new field type setting, autoGeneratePhraseQueries, which defaults to "false". In Solr 1.4 there was no such setting, and the behavior was always as if it were "true". The difference matters for certain WordDelimiterFilterFactory settings. Let's say we have a query with a value of "red-rose" (no quotes).

In Solr 1.4:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j"
            composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
            splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
            catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

With debugQuery=true, we find the following query fragment being generated by dismax:
   text_field:"red (rose redros)"

In Solr 3.5:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
            splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
            catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

debugQuery=true shows us this query fragment:
   (text_field:red text_field:rose text_field:redros) -- including the parens.

Thus, a match on just "rose" is good enough in Solr 3.5, but not so in Solr 1.4's analysis.
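
To make the difference concrete, here is a toy Python model of the two query shapes. It is not Solr code: the three sample documents, their token positions, and the helper names are made up for illustration, and the matching logic is a simplification of what Lucene actually does.

```python
# Toy model: each document is a list of sets; index i holds the terms at position i.
# The sample documents and their tokens are hypothetical.
DOCS = {
    "red-rose": [{"red"}, {"rose", "redros"}],
    "red rose": [{"red"}, {"rose"}],
    "rose garden": [{"rose"}, {"garden"}],
}

def matches_phrase(doc, phrase):
    """True if every phrase position matches at consecutive document positions."""
    n = len(phrase)
    for start in range(len(doc) - n + 1):
        if all(doc[start + i] & phrase[i] for i in range(n)):
            return True
    return False

def matches_any(doc, terms):
    """True if the document contains at least one of the terms, anywhere."""
    present = set().union(*doc)
    return bool(present & terms)

# Solr 1.4 shape: text_field:"red (rose redros)"
PHRASE = [{"red"}, {"rose", "redros"}]
# Solr 3.5 shape: (text_field:red text_field:rose text_field:redros)
TERMS = {"red", "rose", "redros"}

solr14_hits = sorted(name for name, doc in DOCS.items() if matches_phrase(doc, PHRASE))
solr35_hits = sorted(name for name, doc in DOCS.items() if matches_any(doc, TERMS))

print("phrase query hits: ", solr14_hits)
print("boolean query hits:", solr35_hits)
```

In this model the "rose garden" document is only returned by the second, Solr 3.5-style query, which is the kind of extra match that made our tests fail.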

How to fix this?

Add the attribute autoGeneratePhraseQueries="true" to the field type declaration:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100"
       autoGeneratePhraseQueries="true">
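
One way to check the effect of the attribute is to re-run the query with debugQuery=true and look at the parsed query. The sketch below is only an illustration: the base URL, the /select handler path, and the use of text_field as the only qf field are assumptions, so adjust them to your own setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_debug_url(base_url, query, field="text_field"):
    """Build a dismax request URL that asks Solr for debug output as JSON."""
    params = {
        "q": query,
        "defType": "dismax",
        "qf": field,            # assumed field name, taken from the fragments above
        "debugQuery": "true",
        "wt": "json",
        "rows": "0",            # we only care about the debug section
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)

def parsed_query(url):
    """Fetch the URL and return the parsedquery string from the debug section."""
    with urlopen(url) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["debug"]["parsedquery"]

# Hypothetical local Solr instance; nothing is fetched until parsed_query() is called.
url = build_debug_url("http://localhost:8983/solr", "red-rose")
print(url)
```

With the attribute in place, calling parsed_query(url) against the Solr 3.5 instance should show the quoted phrase form again rather than the parenthesized list of terms.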

2. StreamingUpdateSolrServer and Binary Updates

In the most recent release of SolrJ (3.5), the streaming update server was not processing binary fields properly. There are two solutions: 1) use the SolrJ jar provided in Bob Haschart's SolrMarc, as he has modified it to address this problem; or 2) use a nightly jar, as this has been fixed in the SolrJ trunk and the SolrJ 3.6 branch.

3. Phrase Slop and Queries with Repeated Terms

Ultimately, I managed to get our tests passing except for two. I couldn't figure out the difficulty - I looked at debugQuery results on Solr 1.4 and Solr 3.5; I compared using the analysis debugger from the admin interface - nothing looked different.

Jonathan Rochkind pointed out that both failing queries contained repeated words, and that both were phrase searches.

It turns out that there was a bug in Lucene (that crept in sometime between Solr 1.4 and Solr 3.5). If there was a non-zero slop setting in a phrase query with repeated terms, then results were incorrect.

https://issues.apache.org/jira/browse/LUCENE-3821

Thanks to Doron Cohen and Robert Muir, a fix was found and a patch was applied to Lucene, which was picked up in the Solr trunk and Solr 3.6 branch as of March 10, 2012.
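
Since the bug needs both a non-zero slop and a phrase with repeated terms, it is easy to flag test queries that might be affected. The helper below is a rough sketch with made-up function names. It splits on whitespace and lowercases, which is only an approximation of a real analysis chain, and it looks only at explicitly quoted phrases.

```python
import re

def quoted_phrases(query):
    """Return the contents of each double-quoted phrase in the query string."""
    return re.findall(r'"([^"]+)"', query)

def has_repeated_terms(phrase):
    """True if any whitespace-separated term occurs more than once (case-insensitive)."""
    terms = phrase.lower().split()
    return len(terms) != len(set(terms))

def possibly_affected(query, slop):
    """True if the query has a quoted phrase with repeated terms and slop is non-zero."""
    if slop == 0:
        return False
    return any(has_repeated_terms(p) for p in quoted_phrases(query))

# Hypothetical example queries, not the ones from our test suite.
print(possibly_affected('"the more the merrier"', slop=1))
print(possibly_affected('"the more the merrier"', slop=0))
print(possibly_affected('"red rose"', slop=3))
```

A query flagged this way is only a candidate; comparing its results on a patched and an unpatched Solr is still the real check.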

Tuesday, March 13, 2012

