Thursday, November 4, 2010

Solr, Hyphenated Words, and Query Slop

Executive Summary:  you probably need to increase your query slop.  A lot.

Revision  (thanks to Robert Muir):  See  https://issues.apache.org/jira/browse/SOLR-1852.  There was a patch applied to Solr 1.4 that fixes this.  Also, in that Jira issue is a comment from Mark Bennett:  "Just put the stopwords filter after the Word Delimiter filter. That worked for us without impacting much else, until we can get over to the new version."
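A sketch of that workaround, applied to the analyzer definition shown later in this post (the filter attributes are unchanged; only the order of the stopwords filter and the Word Delimiter filter is swapped):

 <analyzer>
   <tokenizer class="solr.WhitespaceTokenizerFactory" />
   <filter class="solr.WordDelimiterFilterFactory"
        splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
        splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
        catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
   <!-- remaining filters as before -->
 </analyzer>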



We recently had a feedback ticket that a title search with a hyphen wasn't working properly.  This is especially curious because we solved a bunch of problems with hyphen searching AND WROTE TESTS in the process, and all the existing hyphen tests pass.  Tests like "hyphens with no spaces before or after, 3 significant terms, 2 stopwords" pass.

Our metadata contains:
record A with title:   Red-rose chain.
record B with title:   Prisoner in a red-rose chain.

A title search:  prisoner in a red-rose chain  returns no results

Further exploration (the following are all title searches):
  • red-rose chain  ==>  record A only
  • "red rose" chain ==>  record A only
  • "red rose chain" ==> record A only
  • "red-rose chain" ==> record A only
  • red rose chain ==>  records A and B
  • red "rose chain" ==>  records A and B  (!!)
What is going on?  First, let's see how the Solr field is analyzed.  The field definition is:


    <field name="title_search" type="text" indexed="true" stored="false" />

The type definition is:

 <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory" />
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
          splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
          catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
     <filter class="solr.LowerCaseFilterFactory" />
     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
     <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
   </analyzer>
 </fieldtype>

So of all the stuff above, the only filter that touches hyphens is the WordDelimiterFilter factory, which is documented at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

The relevant settings we use in WDF say:
  • split words into parts at non-alphanum characters or at case changes
  • catenate word parts into a single word
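As an illustration of those two settings, here is a rough Python sketch of what WordDelimiterFilter does to a single hyphenated token with generateWordParts=1 and catenateWords=1. This is not Solr's implementation, just a simplified model: it splits only on non-alphanumeric characters and ignores case-change splitting, number handling, and position bookkeeping.

```python
import re

def word_delimiter(token):
    # Split at non-alphanumeric characters (simplified: the real filter
    # also splits on case changes and treats numbers specially).
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    tokens = list(parts)               # generateWordParts: emit each part
    if len(parts) > 1:
        tokens.append("".join(parts))  # catenateWords: emit the joined word
    return tokens

print(word_delimiter("red-rose"))  # ['red', 'rose', 'redrose']
```

After stemming, that catenated "redrose" is what shows up as "redros" in the analysis output below.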
We can look at the resulting analysis using http://solr.baseurl/solr/admin/analysis.jsp.  

In record A, "Red-rose chain" becomes:
 
 term position     1      2              3
 term text         red    rose, redros   chain
 term type         word   word, word     word
 source start,end  0,3    4,8 / 0,8      9,14

This shows that the token "red-rose" becomes the term "red" followed by the terms "rose" and "redros" (both at the same position). "redros" is the term resulting from the catenation of word parts "red" and "rose" with stemming applied.

In record B, "Prisoner in a red-rose chain" becomes:

 term position     1        4      7              8
 term text         prison   red    rose, redros   chain
 term type         word     word   word, word     word
 source start,end  0,8      14,17  18,22 / 14,22  23,28

Note the term positions!  Term "red" is at position 4, and the following terms are at position 7.  So as far as Solr is concerned these terms are NOT adjacent.  Which is precisely what the results of our search variants told us.  (Why is this true?  I'll leave that as an exercise for the reader.)
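To see why those positions break the phrase match, here is a simplified model of sloppy phrase matching: for terms that must appear in order, each position gap beyond "adjacent" costs one unit of slop. (This is an illustration, not Lucene's exact algorithm.)

```python
def min_slop(positions):
    # Minimum phrase-query slop for in-order terms at the given term
    # positions: consecutive terms "should" be one position apart, and
    # every extra position in a gap costs one unit of slop.
    return sum((b - a) - 1 for a, b in zip(positions, positions[1:]))

# Record A: "red" at 1, "rose" at 2 -- adjacent, matches with slop 0.
print(min_slop([1, 2]))   # 0
# Record B: "red" at 4, "rose" at 7 -- a gap of 3, needs slop of at least 2.
print(min_slop([4, 7]))   # 2
```

With the default slop of 0, record B's "red" and "rose" simply aren't a phrase, so the phrase query "red rose" misses it.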

Important: the query term red-rose becomes the phrase query "red rose" via Solr magic and field definitions.

How do we address the fact that terms aren't adjacent?  Increase Phrase Query Slop.   The Solr Relevancy Cookbook (http://wiki.apache.org/solr/SolrRelevancyCookbook#Term_Proximity)  suggests "one way to get term proximity effects with the current query parser is to use a phrase query with a very large slop. Phrase queries with slop will score higher when the terms are closer together."
  • For a lucene request handler, this is applicable only for explicit phrase queries.  "red-rose"~2  says look for the phrase "red rose" with a phrase query slop of 2.  (red-rose~2 gives a parsing error.)
  • For a dismax request handler, this is the qs parameter.  Note the distinction between qs and ps:
    • From the explanation of dismax parameters at http://wiki.apache.org/solr/DisMaxQParserPlugin,  we know that qs is the "amount of slop on phrase queries explicitly included in the user's query string" -- qs affects which Solr documents match the query.
    • Confusingly, ps is the "amount of slop on phrase queries built for 'pf' fields" -- ps only affects ranking of the search results.
The Solr Relevancy Cookbook examples use a phrase query slop of 1,000,000.  I experimented with this particular query, and found that such a high query slop value returned a result I wasn't pleased with: a document with "red" and "rose" so far apart that I didn't want it included.  So I did a couple of manual searches and found that a slop of 150 retrieved my two desired results, but not the undesired third one.
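For reference, here is what those two query forms look like with a slop of 150 (the host name is a placeholder, and the field and handler follow the definitions above):

  Lucene request handler:   title_search:"red-rose chain"~150
  Dismax request handler:   http://solr.baseurl/solr/select?defType=dismax&qf=title_search&qs=150&q=prisoner in a red-rose chain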

Okay, I have analyzed the problem and have a solution.  What do I do now???
  1. WRITE A TEST.  (it fails - I haven't applied the solution yet.)
  2. Run all of my search tests, including the new one, to ensure that every test passes except the one I just wrote.
    1. If the new test passes, rewrite it - you're getting a false positive.
    2. If some other test(s) fail, you're not running your tests often enough to catch failures and fix them.  Run the tests at every check-in.  If you have long-running tests (like our search tests), run them at least once a day.
      1. ha ha - now you have to fix these failures before you apply the query slop change.
  3. Apply fix.
  4. Run all of my search tests.  If they don't all pass, then I don't have a solution.  Return to step 3.
    1. Very occasionally (like, almost never), a change will break tests  - the tests themselves need to be revised.  

1 comment:

  1. I am not convinced your example is the same as the similar bug involving stopwords. Your example does not have stopwords in it. Also, we are seeing the bug you have described even with the patched version (and with the stopwords filter after the word delimiter filter).

    ReplyDelete