The Unreasonable Effectiveness of Data

Sunday, November 6, 2011

Comments are not code

Image by Richard Masoner / Cyclelicious via Flickr

I'm a firm believer that the best software documentation is the running code. If the code is well structured and written, it speaks for itself and it does not need any additional documentation. Comments are not code and therefore should not be used where better code organization would suffice.

A misplaced use of comments that I often see while doing code reviews is to use comments to divide a method into logical subunits. For example:

def check_specific_candidate():
    # first check if we already have X by any chance
    < 10 lines of code, return if true >

    # try out if candidate is Y
    < 30 lines of code, return if true >

    # candidate is not Y, try out if it is Z
    < another 30 lines of code, return if true >

    # construct a list of elements in the candidate
    < another 30 lines of code >

    if len(list_of_elements) > 0:
        # process list of elements for the candidate
        < another 10 lines of code >

This example is based on an actual routine in the Zemanta code base that is altogether 140 lines long. Supporting such code is not a nice experience. While the comments in this routine do help, they are actually a symptom of a larger problem, namely poor code organization. The comments would immediately become redundant if this routine were split into logical steps, with each step being a separate routine. Let's refactor the above routine as such:

def check_specific_candidate(candidate):
    if _candidate_has_X(candidate):
        return

    if _candidate_is_Y(candidate):
        return

    if _candidate_is_Z(candidate):
        return

    list_of_elements = _get_list_of_elements(candidate)
    if len(list_of_elements) > 0:
        _process_list_of_elements(list_of_elements)

So instead of using comments, this routine is now documented by its method names. When you approach such code for the first time, seeing a tidy 15-line routine is much less stressful than seeing a 140-line monster.
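To make the refactoring concrete, here is a minimal runnable sketch of the same shape. Everything inside the helpers is a hypothetical stand-in: the candidate is modeled as a plain dict, and the X/Y/Z checks are placeholder predicates, not the actual Zemanta logic. The sketch also returns the processed list so that its behavior can be checked.

```python
# Hypothetical stand-ins: a candidate is a plain dict and the X/Y/Z checks
# are placeholder predicates, not the real Zemanta logic.

def _candidate_has_x(candidate):
    """Return True if the candidate already carries X."""
    return candidate.get("has_x", False)


def _candidate_is_y(candidate):
    """Return True if the candidate is of kind Y."""
    return candidate.get("kind") == "Y"


def _candidate_is_z(candidate):
    """Return True if the candidate is of kind Z."""
    return candidate.get("kind") == "Z"


def _get_list_of_elements(candidate):
    """Collect the elements of the candidate into a new list."""
    return list(candidate.get("elements", []))


def _process_list_of_elements(list_of_elements):
    """Placeholder processing step: upper-case every element."""
    return [element.upper() for element in list_of_elements]


def check_specific_candidate(candidate):
    """Each early return replaces one commented section of the long routine."""
    if _candidate_has_x(candidate):
        return None
    if _candidate_is_y(candidate):
        return None
    if _candidate_is_z(candidate):
        return None
    list_of_elements = _get_list_of_elements(candidate)
    if list_of_elements:
        return _process_list_of_elements(list_of_elements)
    return None
```

Each helper can now be read, tested and changed on its own, which is exactly what the section comments in the long version could not offer.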

Monday, August 22, 2011

#sigir2011

It's more than a bit ironic that a premier conference on information retrieval took place behind the Great Firewall and consequently without discussions on Twitter. But the Chinese have also become great scientists and are no longer just cheap labor, so I guess they have well deserved to host this event.

The 34th SIGIR conference in Beijing was attended by more than 800 people from around the world (China 400, USA 250, Europe 100, ...). The acceptance rate for papers was only 20%, which makes this conference one of the more competitive ones. What came as a nice surprise this year is that the presentation level was substantially better than last year, with almost all speakers giving their talks in comprehensible English and with good rhetorical skills.

Bruce Croft (program chair) presenting basic facts about the conference

What makes the field of IR different from other scientific fields is the influence of industry and its research labs. Almost 50% of the papers had at least one author from Microsoft, Google, Yahoo, Facebook, Yandex, Baidu or some other company. Therefore, while SIGIR is a scientific conference, I got the feeling that it is very much oriented towards the real problems of the industry. If this assumption is correct, then we could perhaps deduce the problems of the industry by examining the share of papers in different areas.

Top 5 areas for accepted papers

The main emphasis of SIGIR2011 could be summed up as "find data that solve the problem". Here are a couple of examples of this approach in action:

The best paper award was given to Mikhail Ageev from Russia, who devised a simple game that enabled the collection of data for measuring the success of search. He and his co-authors collected search trails from approximately 150 users using Mechanical Turk, and that was sufficient to learn a model that predicts whether the user found the information he was searching for or not. This technique enables Google et al. to automatically evaluate the quality of their search.
PICASSO is a system by Aleksandar Stupar that, given an image, recommends related music. The main idea behind this system is to use movies and their soundtracks to learn the relation between images and music.
Guys from Microsoft presented a clever way to identify the geographical relevance of a web site: just track where the readers come from.

Overall (and excluding censorship) I liked SIGIR2011 in Beijing more than last year's conference in Geneva. Last year too much stress was put on rigorous evaluation, while this year the program committee allowed for bolder thinking. I got many good ideas while attending SIGIR2011 and you may expect many of them to be implemented in Zemanta soon.

Looking forward to SIGIR2012!

Wednesday, May 11, 2011

Startup Slovenia

Image by FromTheNorth via Flickr

Today I attended a talk by Robert Farazin about DoubleRecall's successful application to Y Combinator. While the talk was great, what really impressed me was the attendance of some 200 people. Startups are mushrooming in Slovenia at the moment and hopefully many more will join the ranks of Zemanta, Celtra, Outfit7, Vox.io and DoubleRecall. Everything seems to be in place for Ljubljana to become the Boulder of Europe. I am keeping my fingers crossed for some great exits that would enable the creation of a proper startup ecosystem.

Ten years ago I was trying to start a company that set out to achieve something similar to what NetSuite later managed to achieve. As I recollect those times now, the first thing that comes to my mind is how doomed to fail we really were. At that time the only VC fund at least remotely interested in funding eastern European ventures was a murky fund from Vienna called Red-stars.com, whose motto was "from communism to .com". At that time there were no people around here to tell us that a startup does not need a fifty-page business plan. At that time the only two other "start-ups" that we could share experiences with were two dubious endeavors, the first being Telemach and the second EON of Zoran Thaler. At that time the nearest event for start-ups was First Tuesday in Zagreb.

I am so glad to see how much the environment for start-ups has changed for the better in these ten years, and I am really grateful that I have an opportunity to contribute to Startup Slovenia myself.

Sunday, November 14, 2010

Man-Computer Symbiosis

Deep Blue, the computer who defeated chess wor...

Image via Wikipedia

The October issue of Communications of the ACM reports on a scientific paper that shows how the complex scientific problem of predicting protein structure can be solved by harnessing the brainpower of 57,000+ people. The integration of human problem-solving capabilities with computational algorithms has enormous potential that might fundamentally change the world. While machines excel at computation, humans shine at pattern recognition. By combining the two, many intractable problems can be solved.

There are 1.4B people living below the $1.25-a-day poverty line, and at the current pace of mobile phone penetration, even the poorest people will soon own a mobile phone. Even a basic mobile phone enables somebody to receive a problem, to post a solution and to get a payment. If we were able to map important problems to a large number of people, reduce them to small cogs in a humongous analytic machine, and harvest a solution, everyone would benefit.

Just as Zynga has taken the world by storm with social gaming, some future start-up might fundamentally change the lives of hundreds of millions of the poorest people in the world for the better through social problem solving, earning billions along the way.

--
It has been almost 15 years since Kasparov lost against Deep Blue. I think it is time for the human race to take back supremacy in chess by intelligently harnessing our brain power. I know I would be most thrilled to take part in such a match.

Friday, September 10, 2010

Term pruning

Pruning tools utilized by a pruning and tree-s...

Image via Wikipedia

At Zemanta we are constantly experimenting with new ideas for improving our service. Most of our experiments are gainless, but quite often one learns more from failure than from success.

One of the gainless but illuminating experiments we did lately is term pruning. Experimentally, we observed that 52% of terms occur in only one document and that excluding such terms had no influence on the precision of our recommendations. Our recommendation engine is computationally very demanding, and making it more efficient is a never-ending process. A chance to prune 52% of terms seemed quite promising for increasing the performance of our engine and reducing the index size.

Our recommendation engine is based on Apache Lucene/Solr. At the recent Lucene EuroCon conference, Andrzej Bialecki presented a Lucene patch that provides an easy tool for index manipulation. Using this tool we removed all terms occurring in only one document, along with all postings and payloads belonging to such terms. It turned out that the efficiency of our engine did not change, and the index size decreased only slightly (by 1.5%).

In our opinion, this experiment has shown that Lucene is very efficient at storing terms and associated term data (postings & payloads), and that the presence of rarely used terms in the index is not a concern.
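The pruning criterion itself is easy to illustrate on a toy inverted index. The sketch below is plain Python, not the Lucene patch mentioned above; the whitespace tokenizer, the three sample documents and the function names are assumptions made up for illustration.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to the sorted ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in text.lower().split():  # naive whitespace tokenizer
            postings[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}


def prune_singletons(index):
    """Drop every term, together with its postings, found in only one document."""
    return {term: doc_ids for term, doc_ids in index.items() if len(doc_ids) > 1}


# Made-up sample documents, only to exercise the two functions.
documents = [
    "lucene stores terms and postings",
    "lucene scores postings quickly",
    "rare terms inflate the dictionary",
]
index = build_index(documents)
pruned = prune_singletons(index)
pruned_share = 1 - len(pruned) / len(index)  # share of terms removed
```

A high share of pruned terms says little about the space saved: a term that occurs in a single document carries a single posting, so it accounts for only a small part of the index.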

Wednesday, May 19, 2010

Apache Lucene EuroCon 2010

Image by StrudelMonkey via Flickr

I'm in Prague this week to attend Apache Lucene EuroCon 2010. It is always great to be in Prague. It is such a great city for strolling along the banks of the Vltava, across the Charles Bridge to Zlata ulička and all the way up to Hradčani.

Monday, May 10, 2010

BooleanScorer vs. BooleanScorer2

Image by mikebaird via Flickr

This post is about scorers in Apache Lucene. Let me first explain what scorers are, for those not intimately familiar with Lucene. In an inverted index, a result list is compiled by merging the postings lists of the terms present in the query. A scorer is a function that combines the scores of individual query terms into a single score for each matching document. Scoring is usually the most time-consuming step of searching an inverted index. Therefore, an efficient scorer is a prerequisite for efficient search.

There exist two scorers for a BooleanQuery in Apache Lucene, BooleanScorer and BooleanScorer2. The BooleanScorer uses a ~16k array to score windows of docs, while the BooleanScorer2 merges priority queues of postings (see the description in BooleanScorer.java for more details). In principle, the BooleanScorer should be much faster for boolean queries with lots of frequent terms, since it does not need to update a priority queue for each posting.
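The difference between the two strategies can be sketched in a few lines of Python. This is a toy illustration only and does not reproduce Lucene's classes: the function names, the (doc_id, score) tuples and the tiny window size are assumptions, and the combined score is simply the sum of the per-term scores of a disjunctive query.

```python
import heapq

WINDOW_SIZE = 4  # toy value; the post mentions a ~16k array in Lucene


def score_with_windows(postings_lists, max_doc):
    """Accumulate scores into a fixed-size array, one window of doc ids at a time."""
    results = {}
    positions = [0] * len(postings_lists)  # read cursor into each postings list
    for window_start in range(0, max_doc, WINDOW_SIZE):
        window_end = window_start + WINDOW_SIZE
        buckets = [0.0] * WINDOW_SIZE
        touched = [False] * WINDOW_SIZE
        for i, postings in enumerate(postings_lists):
            pos = positions[i]
            # Consume all postings of this term that fall into the window.
            while pos < len(postings) and postings[pos][0] < window_end:
                doc_id, score = postings[pos]
                buckets[doc_id - window_start] += score
                touched[doc_id - window_start] = True
                pos += 1
            positions[i] = pos
        for offset in range(WINDOW_SIZE):
            if touched[offset]:
                results[window_start + offset] = buckets[offset]
    return results


def score_with_heap(postings_lists):
    """Merge postings in doc-id order through a priority queue, summing per doc."""
    results = {}
    for doc_id, score in heapq.merge(*postings_lists):
        results[doc_id] = results.get(doc_id, 0.0) + score
    return results


# Made-up postings: one list per query term, sorted by doc id.
postings_lists = [
    [(0, 1.0), (3, 0.5), (9, 2.0)],
    [(3, 1.5), (5, 1.0)],
    [(0, 0.25), (5, 0.5), (9, 0.5)],
]
window_scores = score_with_windows(postings_lists, max_doc=10)
heap_scores = score_with_heap(postings_lists)
```

Both functions compute the same scores. The windowed version pays one array write per posting, while the heap version pays a priority-queue operation per posting, which is where the expected speed difference comes from.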

We were curious how much faster the BooleanScorer really is, so we compared the two scorers using Zemanta's related articles application. In order to identify related articles, Zemanta's engine matches approximately a dozen entities extracted from the user's blog post against an index of several million documents.

The results confirmed that the BooleanScorer is much faster than the BooleanScorer2:

the average response time decreased by one third,
the maximum response time for 99% of requests decreased by 20%.
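For readers who want to compute the same two figures from their own logs, here is a minimal sketch. The function names are hypothetical, and the nearest-rank percentile definition is an assumption, since the post does not say how the 99% figure was obtained.

```python
import math
import statistics


def percentile(values, percent):
    """Nearest-rank percentile: smallest value with at least `percent`% of values <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(response_times_ms):
    """Return the average and the 99th-percentile response time."""
    return {
        "mean": statistics.mean(response_times_ms),
        "p99": percentile(response_times_ms, 99),
    }


def relative_decrease(before, after):
    """Fraction by which a metric dropped, e.g. 0.2 for a 20% decrease."""
    return (before - after) / before
```

Running summarize on the response times recorded with each scorer and comparing the two results with relative_decrease gives figures of the kind quoted above.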

Please note that BooleanScorer scores documents out of order and therefore cannot be used with filters or in complex queries.
