Captain Codeman

NHibernate.Search using Lucene.NET Full Text Index (3)

Introduction

In Part 1 we looked at how to create a full-text index of NHibernate persisted domain objects using the Lucene.NET project. Part 2 then looked at how to query the index, complete with query parsing and hit highlighting of the results.

Now that we have a full-text index there are other things that we can use it for. The easiest and most useful is probably adding a ‘similar items’ feature, where the system automatically displays related items based on the text that they have in common. While it isn’t exact, the results are often surprisingly good. A human editor could probably pick out some links with more finesse, but that quickly becomes an impossible task as the number of items grows - and the human will typically resort to searching for similar items using the index anyway, so why not automate it?

This feature can be used to display related web pages or blog entries or, in this case, related books. It probably isn’t too far off from the system that Amazon uses. The benefit is that as new content is added, the top related items can be kept up to date - even for existing items in the system. So, for example, if a new Harry Potter book is released then the existing books can immediately start linking to it and vice versa. Likewise, if a company starts offering a new training course or product then any related pages will immediately start to link together.

While it sounds complicated, it is actually quite easy thanks to the contrib assemblies provided with Lucene.NET. In fact, it’s so simple it’s almost trivial so this won’t be a long post!

First, we need to add a new reference to the SimilarityNet.dll assembly (part of Lucene.NET contrib). This provides a SimilarityQueries class which contains a FormSimilarQuery method. Calling this with a piece of text (from an existing field), an analyzer and the field name produces a boolean query containing every unique word in the text, where all words are optional. If we repeat this for each field, boosting the relevance of the most important ones (such as title), then we end up with a query that looks for every word in each field of the original item.
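To make the shape of that query a little more concrete, here is a rough sketch that builds the equivalent clause as a Lucene query string instead of a Query object. It is only an illustration and not how FormSimilarQuery is implemented: the SimilarQuerySketch class and its methods are made-up names, the word splitting is a crude stand-in for a real analyzer (no stop word removal, for instance), and it assumes the query parser's default OR operator so that every word stays optional.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Lucene.NET.
public static class SimilarQuerySketch
{
    // Lower-cases the text, splits on anything that is not a letter or digit,
    // and keeps each word once, in first-seen order.
    public static List<string> UniqueWords(string text)
    {
        var seen = new HashSet<string>();
        var words = new List<string>();
        foreach (string word in Regex.Split(text.ToLowerInvariant(), @"[^\p{L}\p{Nd}]+"))
        {
            if (word.Length > 0 && seen.Add(word))
            {
                words.Add(word);
            }
        }
        return words;
    }

    // Builds "Field:(word1 word2 ...)" and appends "^boost" unless the boost is 1.
    // Returns an empty string when the text contains no words.
    public static string BuildFieldClause(string field, string text, float boost)
    {
        List<string> words = UniqueWords(text);
        if (words.Count == 0)
        {
            return string.Empty;
        }
        string clause = field + ":(" + string.Join(" ", words) + ")";
        return boost == 1f
            ? clause
            : clause + "^" + boost.ToString(CultureInfo.InvariantCulture);
    }
}
```

Calling SimilarQuerySketch.BuildFieldClause("Title", "Agile Software Development: Agile Principles", 10f) gives Title:(agile software development principles)^10 - the repeated word appears only once and the whole group carries the boost.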

To quote the Lucene documentation:

The philosophy behind this method is “two documents are similar if they share lots of words”. Note that behind the scenes, Lucene’s scoring algorithm will tend to give two documents a higher similarity score if they share more uncommon words.

What this means in practice is that the rarer a word is across the index, the more weight it carries when ranking the similar items. So, if our original book has ‘Agile’ in the title and words such as ‘scrum’ and ‘backlog’ in the summary, then chances are we will find other books that also contain these less common words … and it’s very likely that they will be related to our original book.
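The effect of rare words can be shown with a few lines of code. The sketch below is a deliberate simplification, not Lucene’s actual scoring formula: it just sums an inverse document frequency weight over the words two texts share. The idf expression (1 + ln(numDocs / (docFreq + 1))) is the one used by Lucene’s default similarity, but the RareWordScoringSketch class, the whitespace-only word splitting and the tiny corpus in the usage note are all made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of Lucene.NET.
public static class RareWordScoringSketch
{
    // Splits on spaces and returns the distinct lower-cased words.
    public static HashSet<string> Words(string text)
    {
        return new HashSet<string>(
            text.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
    }

    // Inverse document frequency: words found in fewer documents get a higher weight.
    public static double Idf(string word, IList<string> corpus)
    {
        int docFreq = corpus.Count(doc => Words(doc).Contains(word));
        return 1.0 + Math.Log(corpus.Count / (double)(docFreq + 1));
    }

    // Sums the idf weight of every word that source and candidate have in common.
    public static double Score(string source, string candidate, IList<string> corpus)
    {
        HashSet<string> shared = Words(source);
        shared.IntersectWith(Words(candidate));
        return shared.Sum(word => Idf(word, corpus));
    }
}
```

With a four-item corpus of "agile scrum backlog planning book", "agile scrum sprint guide book", "cooking pasta recipes book" and "gardening roses book", the word ‘book’ appears everywhere and so gets a lower weight than ‘scrum’. Scoring the source text "agile scrum backlog book" against the second item therefore comes out higher than scoring it against the cooking book, which only shares ‘book’.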

Of course, when we search our index for books containing all these words there is going to be one obvious match - the original book! In fact, it should be the first result returned. We could skip it when building the result set (checking for the same unique Id, to be safe, rather than just dropping the first result) or, as in the example below, add a clause to the boolean query that specifically excludes the Id of the source item. I haven’t measured which is quicker, but I prefer to let Lucene do the work: I trust it, and it saves me writing more code or fetching results that I am just going to discard.
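For comparison, the first option (filtering the source item out of the results afterwards) might look something like the sketch below. The BookHit class and the SimilarResultsFilter helper are hypothetical stand-ins for whatever the search returns, and the Id is assumed to be an int. Note that with this approach you would need to ask the index for one more result than you want to display, since one of them may be thrown away.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a search result; not a type from NHibernate.Search.
public class BookHit
{
    public int Id { get; set; }
    public string Title { get; set; }
}

public static class SimilarResultsFilter
{
    // Drops the source item by Id (rather than blindly skipping the first hit)
    // and keeps at most 'max' of the remaining hits in their ranked order.
    public static List<BookHit> TopSimilar(IEnumerable<BookHit> rankedHits, int sourceId, int max)
    {
        return rankedHits
            .Where(hit => hit.Id != sourceId)
            .Take(max)
            .ToList();
    }
}
```

Matching on the Id means the filter still behaves correctly if, for whatever reason, the source item is not ranked first or is not in the results at all.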

Here is the code to find the best 4 similar matches to any book passed in. Note that I include the Authors and Publisher fields when doing the comparison so it will tend to favour books by the same author or publisher - you will need to experiment to see what makes most sense for your application and usage.

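  // Namespaces used: System, System.Collections.Generic, Lucene.Net.Analysis,
  // Lucene.Net.Analysis.Standard, Lucene.Net.Index, Lucene.Net.Search,
  // NHibernate, NHibernate.Search

  /// <summary>
  /// Gets the books most similar to the given book, based on shared text.
  /// </summary>
  /// <param name="book">The book to find similar books for.</param>
  /// <returns>Up to 4 similar books, excluding the book itself.</returns>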
  public override IList<IBook> GetSimilarBooks(IBook book)
  {
    IFullTextSession session = (IFullTextSession)NHibernateHelper.GetCurrentSession();
    Analyzer analyzer = new StandardAnalyzer();
    BooleanQuery query = new BooleanQuery();

    Query title = Similarity.Net.SimilarityQueries.FormSimilarQuery(book.Title, analyzer, "Title", null);
    title.SetBoost(10);
    query.Add(title, BooleanClause.Occur.SHOULD);

    if (book.Summary != null) {
      Query summary = Similarity.Net.SimilarityQueries.FormSimilarQuery(book.Summary, analyzer, "Summary", null);
      summary.SetBoost(5);
      query.Add(summary, BooleanClause.Occur.SHOULD);
    }

    if (book.Authors != null) {
      Query authors = Similarity.Net.SimilarityQueries.FormSimilarQuery(book.Authors, analyzer, "Authors", null);
      query.Add(authors, BooleanClause.Occur.SHOULD);
    }

    if (book.Publisher != null) {
      Query publisher = Similarity.Net.SimilarityQueries.FormSimilarQuery(book.Publisher, analyzer, "Publisher", null);
      query.Add(publisher, BooleanClause.Occur.SHOULD);
    }

    // avoid the book being similar to itself!
    query.Add(new TermQuery(new Term("Id", book.Id.ToString())), BooleanClause.Occur.MUST_NOT);

    IQuery nhQuery = session.CreateFullTextQuery(query, new Type[] { typeof(Book) }).SetMaxResults(4);

    IList<IBook> books = nhQuery.List<IBook>();
    return books;
  }

That about wraps it up for using NHibernate and Lucene. I’m expecting things to change when NHibernate 2.0 is released, so I’ll probably post again with an update when it is. There are also a few other Lucene features that I may blog about, such as using synonyms for ‘did you mean …’ type suggestions.

Please let me know if there is anything that I haven’t explained particularly well or you would like to see more about.