Stanford SearchWorks has been due for a Solr upgrade for a loooong time -- we've been using Solr 1.4 since ... well, forever. Bob Haschart upgraded SolrMarc to work with Solr 3.5, so I figured I would upgrade Solr as I refactored SolrMarc for the stanford-solr-marc fork. (See also previous blog entry).
In the course of upgrading from Solr 1.4 to Solr 3.5, a number of our tests were failing. Usually the problem was a mistake in my configuration files for Solr 3.5; sometimes the tests were too brittle. It took a pass or two to start using the ICU library for Unicode normalization, rather than SolrMarc's unicodeNormalizer. I managed to get most of the failing tests to pass, but a handful stumped me.
Here's what I learned:
1. Hyphens and WordDelimiterFilterFactory
Solr 3.2 (?) added a new setting for field analysis, autoGeneratePhraseQueries, which defaults to "false". In Solr 1.4 there was no such setting: the behavior was always as if it were "true". The difference matters for certain settings of WordDelimiterFilterFactory. Let's say we have a query with a value of "red-rose" (no quotes).
In Solr 1.4:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j"
            composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
            splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
            catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
With debugQuery=true, we find the following query fragment being generated by dismax:
text_field:"red (rose redros)"
In Solr 3.5:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
            splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
            catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
debugQuery=true shows us this query fragment:
(text_field:red text_field:rose text_field:redros) -- including the parens.
Thus, a match on just "rose" is good enough in Solr 3.5, but not in Solr 1.4, where the generated phrase query requires "red" immediately followed by "rose" (or "redros").
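To make the difference concrete, here is a minimal Python sketch of the two query shapes. It is not Solr code: the function names and sample documents are made up for illustration, and it ignores stemming of the document text, scoring, and dismax's mm setting. Each position holds a set of terms, mirroring how WordDelimiterFilterFactory puts "rose" and the catenated "redros" at the same position.

```python
# Analyzed form of the query red-rose, as shown in the debugQuery output:
# "red" at the first position, then "rose" or "redros" at the second.
QUERY = [{"red"}, {"rose", "redros"}]


def positions(*terms):
    """Hypothetical helper: one single-term set per document position."""
    return [{term} for term in terms]


def phrase_match(doc, query):
    """Solr 1.4 style: every query position must match, in order and adjacent."""
    n = len(query)
    return any(
        all(doc[start + k] & query[k] for k in range(n))
        for start in range(len(doc) - n + 1)
    )


def any_term_match(doc, query):
    """Solr 3.5 default style: any one of the generated terms is enough."""
    wanted = set().union(*query)
    return any(position & wanted for position in doc)


doc_rose_only = positions("a", "rose", "garden")
doc_red_rose = positions("the", "red", "rose")

print(phrase_match(doc_rose_only, QUERY))    # phrase query on the "rose"-only document
print(any_term_match(doc_rose_only, QUERY))  # boolean query on the same document
print(phrase_match(doc_red_rose, QUERY))     # phrase query on the "red rose" document
```

In this toy model the "rose"-only document is rejected by the phrase form and accepted by the boolean form, which is the behavior change our tests tripped over.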
How to fix this?
Add the attribute autoGeneratePhraseQueries="true" to the field type declaration:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100"
autoGeneratePhraseQueries="true">
2. StreamingUpdateSolrServer and Binary Updates
In the most recent release of SolrJ (3.5), StreamingUpdateSolrServer was not processing binary fields properly. Two solutions: 1) use the SolrJ jar provided in Bob Haschart's SolrMarc, as he has modified it to address this problem; 2) use a nightly jar, as this has been fixed in the SolrJ trunk and the SolrJ 3.6 branch.
3. Phrase Slop and Queries with Repeated Terms
Ultimately, I managed to get all but two of our tests passing. I couldn't figure out the difficulty: I looked at debugQuery results on Solr 1.4 and Solr 3.5, and I compared the output of the analysis debugger in the admin interface. Nothing looked different.
Jonathan Rochkind pointed out that both failing tests were phrase searches, and that both phrases contained repeated words.
It turns out that there was a bug in Lucene (that crept in sometime between Solr 1.4 and Solr 3.5). If there was a non-zero slop setting in a phrase query with repeated terms, then results were incorrect.
https://issues.apache.org/jira/browse/LUCENE-3821
Thanks to Doron Cohen and Robert Muir, a fix was found and a patch was applied to Lucene, which was picked up in the Solr trunk and Solr 3.6 branch as of March 10, 2012.
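For anyone who wants a sanity check on what a sloppy phrase query with repeated terms ought to match, here is a brute-force Python sketch. It is a simplified model under my own assumptions, not Lucene's implementation and not a reproduction of the bug: the function name is hypothetical, each query term occurrence must land on a distinct document position, and "slop" is taken as the spread between each matched position and where that term sits in the phrase.

```python
from itertools import product


def sloppy_phrase_match(doc_terms, query_terms, slop):
    """Brute-force check: can the phrase be found within the given slop?

    Assumption: a repeated query term may not reuse one document position.
    """
    if not query_terms:
        return False
    # For each query term, every document position where it occurs.
    candidates = [
        [pos for pos, term in enumerate(doc_terms) if term == wanted]
        for wanted in query_terms
    ]
    for assignment in product(*candidates):
        # Skip assignments that reuse a position for two query terms.
        if len(set(assignment)) != len(assignment):
            continue
        # Shift each matched position by the term's offset in the phrase;
        # an exact phrase match gives identical shifts (spread of 0).
        shifts = [pos - offset for offset, pos in enumerate(assignment)]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False


doc = "to be or not to be".split()
print(sloppy_phrase_match(doc, ["to", "be", "to"], slop=1))
print(sloppy_phrase_match(doc, ["to", "be", "to"], slop=2))
print(sloppy_phrase_match(["to", "be"], ["to", "be", "to"], slop=5))
```

The last call is the interesting one for repeated terms: a document with a single "to" should not satisfy a phrase that needs two of them, however generous the slop. A tiny oracle like this is handy for deciding whether a test or the search engine is the one that's wrong.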