There exist two scorers for a BooleanQuery in Apache Lucene: BooleanScorer and BooleanScorer2. BooleanScorer uses a ~16k array to score windows of docs, while BooleanScorer2 merges priority queues of postings (see the description in BooleanScorer.java for more details). In principle, BooleanScorer should be much faster for boolean queries with many frequent terms, since it does not need to update a priority queue for each posting.
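To make the difference concrete, here is a toy sketch of the two strategies for a pure disjunction, where a document's "score" is simply the number of query terms that match it. This is not Lucene's code: the class name `ScorerSketch`, the methods, and the window size of 2048 are assumptions made up for illustration, and postings are assumed to be sorted arrays of unique doc IDs. For simplicity the window version scans each window in doc order, which the real BooleanScorer does not do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class ScorerSketch {
    // Hypothetical window size, chosen only for this illustration.
    static final int WINDOW = 1 << 11;

    // Window approach: for each window of doc IDs, every postings list adds
    // its matches into a fixed-size array. There is no heap update per posting.
    static List<int[]> windowScore(int[][] postings, int maxDoc) {
        List<int[]> hits = new ArrayList<>();
        int[] pos = new int[postings.length];   // read position per postings list
        int[] counts = new int[WINDOW];         // one bucket per doc in the window
        for (int base = 0; base < maxDoc; base += WINDOW) {
            int end = Math.min(base + WINDOW, maxDoc);
            for (int t = 0; t < postings.length; t++) {
                while (pos[t] < postings[t].length && postings[t][pos[t]] < end) {
                    counts[postings[t][pos[t]] - base]++;
                    pos[t]++;
                }
            }
            // Emit every non-empty bucket as {docId, matchCount}, then reset it.
            for (int i = 0; i < end - base; i++) {
                if (counts[i] > 0) {
                    hits.add(new int[] {base + i, counts[i]});
                    counts[i] = 0;
                }
            }
        }
        return hits;
    }

    // Heap approach: merge the postings lists through a priority queue ordered
    // by doc ID. Every posting costs one poll and, usually, one add.
    static List<int[]> heapScore(int[][] postings) {
        List<int[]> hits = new ArrayList<>();
        // Heap entries are {docId, termIndex, positionInList}.
        PriorityQueue<int[]> pq =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int t = 0; t < postings.length; t++) {
            if (postings[t].length > 0) {
                pq.add(new int[] {postings[t][0], t, 0});
            }
        }
        while (!pq.isEmpty()) {
            int doc = pq.peek()[0];
            int count = 0;
            // Drain all lists currently positioned on this doc and advance them.
            while (!pq.isEmpty() && pq.peek()[0] == doc) {
                int[] e = pq.poll();
                count++;
                int next = e[2] + 1;
                if (next < postings[e[1]].length) {
                    pq.add(new int[] {postings[e[1]][next], e[1], next});
                }
            }
            hits.add(new int[] {doc, count});
        }
        return hits;
    }
}
```

Both methods return the same {docId, matchCount} pairs; the difference is the per-posting cost, which is an array increment in the first case and a logarithmic heap operation in the second.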
We were curious how much faster BooleanScorer really is, so we compared the two scorers using Zemanta's related articles application. To identify related articles, Zemanta's engine matches approximately a dozen entities extracted from a user's blog post against an index of several million documents.
The results confirmed that BooleanScorer is much faster than BooleanScorer2:
- the average response time has decreased by one third,
- the maximum response time for 99% of requests has decreased by 20%.
Please note that BooleanScorer scores documents out of order and therefore cannot be used with filters or in complex queries.
So how do I choose between the two in my code?