Thursday, November 4, 2010

Solr, Hyphenated Words, and Query Slop

Executive Summary:  you probably need to increase your query slop.  A lot.

Revision  (thanks to Robert Muir):  See  https://issues.apache.org/jira/browse/SOLR-1852.  There was a patch applied to Solr 1.4 that fixes this.  Also, in that Jira issue is a comment from Mark Bennett:  "Just put the stopwords filter after the Word Delimiter filter. That worked for us without impacting much else, until we can get over to the new version."
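A sketch of that workaround, applied to the analyzer definition shown later in this post (the filter attributes are unchanged; only the order of the stopwords filter and the Word Delimiter filter is swapped):

 <analyzer>
   <tokenizer class="solr.WhitespaceTokenizerFactory" />
   <filter class="solr.WordDelimiterFilterFactory"
        splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
        splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
        catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
   <!-- remaining filters as before -->
 </analyzer>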



We recently had a feedback ticket that a title search with a hyphen wasn't working properly.  This is especially curious because we solved a bunch of problems with hyphen searching AND WROTE TESTS in the process, and all the existing hyphen tests pass.  Tests like "hyphens with no spaces before or after, 3 significant terms, 2 stopwords" pass.

Our metadata contains:
record A with title:   Red-rose chain.
record B with title:   Prisoner in a red-rose chain.

A title search:  prisoner in a red-rose chain  returns no results

Further exploration (the following are all title searches):
  • red-rose chain  ==>  record A only
  • "red rose" chain ==>  record A only
  • "red rose chain" ==> record A only
  • "red-rose chain" ==> record A only
  • red rose chain ==>  records A and B
  • red "rose chain" ==>  records A and B  (!!)
What is going on?  First, let's see how the Solr field is analyzed.  The field definition is:


    <field name="title_search" type="text" indexed="true" stored="false" />

The type definition is:

 <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory" />
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
          splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
          catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
     <filter class="solr.LowerCaseFilterFactory" />
     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
     <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
   </analyzer>
 </fieldtype>

So of all the stuff above, the only filter that touches hyphens is the WordDelimiterFilter factory, which is documented at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

The relevant settings we use in WDF say:
  • split words into parts at non-alphanum characters or at case changes
  • catenate word parts into a single word
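As an illustration of those two settings, here is a rough Python sketch of what WordDelimiterFilter does to a single hyphenated token with generateWordParts=1 and catenateWords=1. This is not Solr's implementation, just a simplified model: it splits only on non-alphanumeric characters and ignores case-change splitting, number handling, and position bookkeeping.

```python
import re

def word_delimiter(token):
    # Split at non-alphanumeric characters (simplified: the real filter
    # also splits on case changes and treats numbers specially).
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    tokens = list(parts)               # generateWordParts: emit each part
    if len(parts) > 1:
        tokens.append("".join(parts))  # catenateWords: emit the joined word
    return tokens

print(word_delimiter("red-rose"))  # ['red', 'rose', 'redrose']
```

After stemming, that catenated "redrose" is what shows up as "redros" in the analysis output below.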
We can look at the resulting analysis using http://solr.baseurl/solr/admin/analysis.jsp.  

In record A, "Red-rose chain" becomes:
 
 term position     1      2              3
 term text         red    rose, redros   chain
 term type         word   word, word     word
 source start,end  0,3    4,8 / 0,8      9,14

This shows that the token "red-rose" becomes the term "red" followed by the terms "rose" and "redros" (both at the same position). "redros" is the term resulting from the catenation of word parts "red" and "rose" with stemming applied.

In record B, "Prisoner in a red-rose chain" becomes:

 term position     1        4      7              8
 term text         prison   red    rose, redros   chain
 term type         word     word   word, word     word
 source start,end  0,8      14,17  18,22 / 14,22  23,28

Note the term positions!  Term "red" is at position 4, and the following terms are at position 7.  So as far as Solr is concerned these terms are NOT adjacent.  Which is precisely what the results of our search variants told us.  (Why is this true?  I'll leave that as an exercise for the reader.)
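To see why those positions break the phrase match, here is a simplified model of sloppy phrase matching: for terms that must appear in order, each position gap beyond "adjacent" costs one unit of slop. (This is an illustration, not Lucene's exact algorithm.)

```python
def min_slop(positions):
    # Minimum phrase-query slop for in-order terms at the given term
    # positions: consecutive terms "should" be one position apart, and
    # every extra position in a gap costs one unit of slop.
    return sum((b - a) - 1 for a, b in zip(positions, positions[1:]))

# Record A: "red" at 1, "rose" at 2 -- adjacent, matches with slop 0.
print(min_slop([1, 2]))   # 0
# Record B: "red" at 4, "rose" at 7 -- a gap of 3, needs slop of at least 2.
print(min_slop([4, 7]))   # 2
```

With the default slop of 0, record B's "red" and "rose" simply aren't a phrase, so the phrase query "red rose" misses it.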

Important: the query term red-rose becomes the phrase query "red rose" via Solr magic and field definitions.

How do we address the fact that terms aren't adjacent?  Increase Phrase Query Slop.   The Solr Relevancy Cookbook (http://wiki.apache.org/solr/SolrRelevancyCookbook#Term_Proximity)  suggests "one way to get term proximity effects with the current query parser is to use a phrase query with a very large slop. Phrase queries with slop will score higher when the terms are closer together."
  • For a lucene request handler, this is applicable only for explicit phrase queries.  "red-rose"~2  says look for the phrase "red rose" with a phrase query slop of 2.  (red-rose~2 gives a parsing error.)
  • For a dismax request handler, this is the qs parameter.  Note the distinction between qs and ps:
    • From the explanation of dismax parameters at http://wiki.apache.org/solr/DisMaxQParserPlugin,  we know that qs is the "amount of slop on phrase queries explicitly included in the user's query string" -- qs affects which Solr documents match the query.
    • Confusingly, ps is the "amount of slop on phrase queries built for 'pf' fields" -- ps only affects ranking of the search results.
The Solr Relevancy Cookbook examples use a phrase query slop of 1,000,000.  I experimented with this particular query, and found that such a high query slop value returned a result I wasn't pleased with: a document with "red" and "rose" so far apart that I didn't want it included.  So I did a couple of manual searches and found that a slop of 150 retrieved my two desired results, but not the undesired third one.
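For reference, here is what those two query forms look like with a slop of 150 (the host name is a placeholder, and the field and handler follow the definitions above):

  Lucene request handler:   title_search:"red-rose chain"~150
  Dismax request handler:   http://solr.baseurl/solr/select?defType=dismax&qf=title_search&qs=150&q=prisoner in a red-rose chain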

Okay, I have analyzed the problem and have a solution.  What do I do now???
  1. WRITE A TEST.  (it fails - I haven't applied the solution yet.)
  2. Run all of my search tests, including the new one, to ensure that every test passes except the one I just wrote.
    1. If the new test passes, rewrite it - you're getting a false positive.
    2. If some other test(s) fail, you're not running your tests often enough to catch failures and fix them.  Run the tests at every check-in.  If you have long-running tests (like our search tests), run them at least once a day.
      1. ha ha - now you have to fix these failures before you apply the query slop change.
  3. Apply fix.
  4. Run all of my search tests.  If they don't all pass, then I don't have a solution.  Return to step 3.
    1. Very occasionally (like, almost never), a change will break tests  - the tests themselves need to be revised.  

1 comment:

  1. I am not convinced your example is the same as the similar bug involving stopwords. Your example does not have stopwords in it. Also, we are seeing the bug you have described even with the patched version (and with the stopwords filter after the word delimiter filter).

    ReplyDelete