The Unreasonable Effectiveness of Data: May 2010

Image by mikebaird via Flickr

This post is about scorers in Apache Lucene. Let me first explain what scorers are, for those not intimately familiar with Lucene. In an inverted index, a result list is compiled by merging the postings lists of the terms present in the query. A scorer is a function that combines the scores of the individual query terms into a single score for each matching document. Scoring is usually the most time-consuming step of searching an inverted index, so an efficient scorer is a prerequisite for efficient search.
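To make the merging step concrete, here is a minimal sketch in Java. It is not Lucene code: the class `PostingsMergeSketch`, the `Hit` type, and the idea of giving each term a single fixed weight are assumptions made for illustration. It merges sorted postings lists with a priority queue and sums the weights of the terms that match each document, so hits come out in increasing doc id order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Toy postings merge for an OR query; hypothetical names, not Lucene's API. */
class PostingsMergeSketch {
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /**
     * termDocs[t] is the sorted postings list (doc ids) of term t.
     * termWeights[t] is the score term t contributes to each doc it matches
     * (a simplification: real scores vary per document).
     */
    static List<Hit> merge(int[][] termDocs, float[] termWeights) {
        // Queue entries are {term index, position in that term's postings list},
        // ordered by the doc id they currently point at.
        PriorityQueue<int[]> queue = new PriorityQueue<>(
            (a, b) -> Integer.compare(termDocs[a[0]][a[1]], termDocs[b[0]][b[1]]));
        for (int t = 0; t < termDocs.length; t++) {
            if (termDocs[t].length > 0) {
                queue.add(new int[] {t, 0});
            }
        }
        List<Hit> hits = new ArrayList<>();
        while (!queue.isEmpty()) {
            int[] top = queue.peek();
            int doc = termDocs[top[0]][top[1]];
            float score = 0f;
            // Pop every term positioned on this doc, add its weight, advance it.
            while (!queue.isEmpty()
                    && termDocs[queue.peek()[0]][queue.peek()[1]] == doc) {
                int[] head = queue.poll();
                score += termWeights[head[0]];
                if (head[1] + 1 < termDocs[head[0]].length) {
                    queue.add(new int[] {head[0], head[1] + 1});
                }
            }
            hits.add(new Hit(doc, score));
        }
        return hits;
    }
}
```

Note that every posting costs one queue update here, which is the per-posting overhead that matters when the query terms are frequent.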

There are two scorers for a BooleanQuery in Apache Lucene: BooleanScorer and BooleanScorer2. BooleanScorer uses a ~16k array to score windows of docs, accumulating each term's contribution in the array slot for its document, while BooleanScorer2 merges priority queues of postings (see the description in BooleanScorer.java for more details). In principle, BooleanScorer should be much faster for boolean queries with lots of frequent terms, since it performs a constant-time operation per posting instead of updating a priority queue for each one.
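The window-based idea can be sketched roughly as follows. This is a toy illustration, not the code in BooleanScorer.java: the class `WindowedScorerSketch`, the `Hit` type, the tiny window of 16 slots, and the fixed per-term weights are all assumptions, and the sketch handles only OR semantics (no required or prohibited clauses).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy window-based scorer for an OR query; hypothetical names, not Lucene's code. */
class WindowedScorerSketch {
    static final int WINDOW = 16; // tiny window for readability; the post mentions ~16k

    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /**
     * termDocs[t] is the sorted postings list (doc ids) of term t, and
     * termWeights[t] is the score term t adds to each doc it matches.
     * Hits are returned in the order in which they are collected.
     */
    static List<Hit> score(int[][] termDocs, float[] termWeights, int maxDoc) {
        List<Hit> hits = new ArrayList<>();
        int[] cursor = new int[termDocs.length]; // next unread posting per term
        float[] table = new float[WINDOW];       // accumulated score per slot
        boolean[] touched = new boolean[WINDOW]; // slots used in this window
        int[] order = new int[WINDOW];           // used slots, in first-touch order
        for (int base = 0; base < maxDoc; base += WINDOW) {
            int count = 0;
            // Pass over every term and consume its postings that fall in this window.
            for (int t = 0; t < termDocs.length; t++) {
                while (cursor[t] < termDocs[t].length
                        && termDocs[t][cursor[t]] < base + WINDOW) {
                    int slot = termDocs[t][cursor[t]++] - base;
                    if (!touched[slot]) {
                        touched[slot] = true;
                        order[count++] = slot;
                    }
                    table[slot] += termWeights[t]; // constant-time work per posting
                }
            }
            // Collect the hits of this window, then reset the used slots.
            for (int i = 0; i < count; i++) {
                int slot = order[i];
                hits.add(new Hit(base + slot, table[slot]));
                table[slot] = 0f;
                touched[slot] = false;
            }
        }
        return hits;
    }
}
```

With postings `{5, 20}` and `{3, 5}`, this sketch collects doc 5 before doc 3, because slots are visited in the order in which they were first touched. Hits inside one window are therefore not delivered in doc id order, while the windows themselves are processed in order.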

We were curious how much faster BooleanScorer really is, so we compared the two scorers using Zemanta's related articles application. To identify related articles, Zemanta's engine matches approximately a dozen entities extracted from a user's blog post against an index of several million documents.

The results confirmed that BooleanScorer is much faster than BooleanScorer2:

the average response time decreased by one third,
the maximum response time for 99% of requests (the 99th percentile) decreased by 20%.

Please note that BooleanScorer scores documents out of order within each window and therefore cannot be used with filters or in complex queries.


Wednesday, 19 May 2010

Apache Lucene EuroCon 2010

Monday, 10 May 2010

BooleanScorer vs. BooleanScorer2
