petek, 10. september 2010

Term pruning

Pruning tools utilized by a pruning and tree-s...Image via Wikipedia
At Zemanta we are constantly experimenting with new ideas how to improve our service. Most of our experiments are gainless, but quite often one learns more from failure than success.

One of the gainless but illuminating experiments we did lately is term pruning. Experimentally, we have observed that 52% of terms occur in only one document and that excluding terms occurring only once have had no influence on precision of our recommendations.  Our recommendation engine is computationally very demanding and make it more efficient is a never-ending process. A chance to prune 52% of terms seemed quite promising for increasing performance of our engine and reducing index size.

Our recommendation engine is based on Apache Lucene/Solr. At a recent Lucene EuroCon conference, Andrzej Bialecki presented a Lucene patch that provides an easy tool for index manipulation. Using this tool we have removed all terms occuring in only one document, and all postings and payloads belonging to such terms. It has turned out that efficiency of our engine did not change and also the index size decreased only slightly (by 1.5%).

In our opinion, this experiment has shown that Lucene is very efficient at storing terms and associated term data (postings & payloads), and that presence of rarely used terms in the index is not of a concern.
Enhanced by Zemanta