Sunday, November 14, 2010

Man-Computer Symbiosis

[Image: Deep Blue, via Wikipedia]
The October issue of Communications of the ACM reports on a scientific paper showing how a complex scientific problem, predicting protein structure, can be solved by harvesting the brain power of more than 57,000 people. The integration of human problem-solving capabilities with computational algorithms has enormous potential and might fundamentally change the world. While machines excel at computation, humans shine at pattern recognition; by combining the two, many intractable problems can be solved.

There are 1.4B people living below the $1.25 poverty line, and at the current pace of mobile phone penetration even the poorest people will soon own a mobile phone. Even a basic mobile phone enables somebody to receive a problem, to post a solution and to get a payment. If we were able to map important problems onto a large number of people, reduce them to small cogs in a humongous analytic machine, and harvest a solution, everyone would benefit.

Just as Zynga has stormed the world with social gaming, some future start-up might fundamentally change the lives of hundreds of millions of the world's poorest people for the better with social problem solving, earning billions along the way.

--
It is almost 15 years since Kasparov lost to Deep Blue. I think it is time for the human race to take back supremacy in chess by intelligently harnessing our brain power. I know I would be most thrilled to take part in such a match.


Friday, September 10, 2010

Term pruning

[Image: pruning tools, via Wikipedia]
At Zemanta we are constantly experimenting with new ideas on how to improve our service. Most of our experiments are gainless, but quite often one learns more from failure than from success.

One of the gainless but illuminating experiments we did lately is term pruning. We have observed experimentally that 52% of terms occur in only one document, and that excluding terms occurring only once has no influence on the precision of our recommendations. Our recommendation engine is computationally very demanding, and making it more efficient is a never-ending process. A chance to prune 52% of the terms therefore seemed quite promising for increasing the performance of our engine and reducing the index size.
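As a side note, the singleton-term ratio itself is easy to measure with the stock Lucene API. Here is a minimal sketch (the class name and index path are made up for illustration) that walks the term dictionary of a pre-4.0 Lucene index and counts the terms with a document frequency of one:

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class SingletonTermCounter {
    public static void main(String[] args) throws Exception {
        // Open the index read-only; the path is just an example.
        IndexReader reader = IndexReader.open(
                FSDirectory.open(new File("/path/to/index")), true);
        TermEnum terms = reader.terms();
        long total = 0, singletons = 0;
        try {
            // A freshly opened TermEnum is positioned before the first term.
            while (terms.next()) {
                total++;
                if (terms.docFreq() == 1) {
                    singletons++; // term occurs in exactly one document
                }
            }
        } finally {
            terms.close();
            reader.close();
        }
        System.out.printf("%d of %d terms (%.1f%%) occur in only one document%n",
                singletons, total, 100.0 * singletons / total);
    }
}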

Our recommendation engine is based on Apache Lucene/Solr. At the recent Lucene EuroCon conference, Andrzej Bialecki presented a Lucene patch that provides an easy tool for index manipulation. Using this tool we removed all terms occurring in only one document, together with all postings and payloads belonging to such terms. It turned out that the efficiency of our engine did not change, and the index size decreased only slightly (by 1.5%).

In our opinion, this experiment has shown that Lucene is very efficient at storing terms and the associated term data (postings & payloads), and that the presence of rarely used terms in the index is not a concern.

Wednesday, May 19, 2010

Apache Lucene EuroCon 2010

[Image: Prague Castle & Charles Bridge by night, by StrudelMonkey via Flickr]



I'm in Prague this week to attend Apache Lucene EuroCon 2010. It is always great to be in Prague. It is such a great city to just stroll along the banks of the Vltava, across the Charles Bridge to Zlata ulička and all the way up to Hradčany.






Monday, May 10, 2010

BooleanScorer vs. BooleanScorer2

[Image: inverted (negative) of a dandelion puff ball, by mikebaird via Flickr]

This post is about scorers in Apache Lucene. Let me first explain what scorers are, for those not intimately familiar with Lucene. In an inverted index, a result list is compiled by merging the postings lists of the terms present in the query. A scorer is a function that combines the scores of individual query terms into a single score for each document in the index. Scoring is usually the most time-consuming step of searching an inverted index; the efficiency of the scorer function is therefore a prerequisite for efficient search in an inverted index.

There exist two scorers for a BooleanQuery in Apache Lucene, BooleanScorer and BooleanScorer2. The BooleanScorer uses a ~16k array to score windows of docs, while the BooleanScorer2 merges priority queues of postings (see the description in BooleanScorer.java for more details). In principle, the BooleanScorer should be much faster for boolean queries with lots of frequent terms, since it does not need to update a priority queue for each posting.

We were curious how much faster the BooleanScorer really is, so we compared the two scorers using Zemanta's related-articles application. In order to identify related articles, Zemanta's engine matches approximately a dozen entities extracted from the user's blog post against an index of several million documents.

The results have confirmed that the BooleanScorer is much faster than the BooleanScorer2:
  • the average response time has decreased by one third,
  • the maximum response time for 99% of requests has decreased by 20%.

Please note that the BooleanScorer scores documents out of order and therefore cannot be used with filters or in complex queries.
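Whether Lucene actually uses the out-of-order BooleanScorer also depends on the Collector: it is only eligible when the collector declares that it accepts documents out of order. A minimal way to opt in with the stock Lucene 2.9/3.x API (the class and method names below are only illustrative) is to create the standard top-docs collector with in-order scoring disabled:

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;

public class OutOfOrderSearch {
    /** Collects the top hits while allowing documents to arrive out of order,
     *  which lets Lucene choose BooleanScorer instead of BooleanScorer2. */
    public static ScoreDoc[] search(IndexSearcher searcher, Query query, int numHits)
            throws IOException {
        // false = documents need not be scored in order, i.e. the collector's
        // acceptsDocsOutOfOrder() will return true.
        TopScoreDocCollector collector = TopScoreDocCollector.create(numHits, false);
        searcher.search(query, collector);
        return collector.topDocs().scoreDocs;
    }
}

Passing true instead (or using a collector that insists on in-order hits) falls back to the BooleanScorer2, which is what filters and complex queries require.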


Thursday, April 8, 2010

Authentication on the Internet - a solved problem?

[Image: Facebook social graph, by Rafiq Phillips via Flickr]

In my opinion, the greatest potential of Facebook is to become a dominant authentication method on the Internet. If Facebook succeeds, it will supplement Google as a gatekeeper of the Internet and become the mightiest company in the universe with unprecedented power. But I think Facebook will fail to seize this opportunity. Here is why.

The solution to the problem of authentication on the Internet has so far eluded us. None of the more secure methods (e.g., digital certificates, OpenID, security tokens) have gained wide traction and the dominant authentication system we have in place today (username/password combination) is so broken that it does not stand a chance against Nigerian phishermen.

Mostly by coincidence, Facebook has found out that the people who know you are also the most authoritative source for confirming your identity. By building your social graph you are building your on-line identity. Facebook, as the keeper of your social graph, can, at your request, pass your social graph to another web site so that you can confirm your identity there. Building a social graph takes time and requires constant maintenance, resulting in the substantial user lock-in that Facebook enjoys.

We have put up with the effort required to build the social graph because Facebook gave us some very sweet incentives. 400M+ users are living proof of how strong an incentive Friday party photos and juicy comments can be. But our social graphs are now mostly built. In order to keep us interested, Facebook is providing us with ever more incentives (videos, games, news, chat, e-mail, ...). But with every new incentive that seduces us, more and more about our lives can be deduced from the river of news we generate. As long as our social graph included only our like-minded friends, the total exposure of our personality on Facebook did not matter. But with moms and bosses joining Facebook, teenagers and tweens will leave (with others to follow soon) for more private quarters, or hide themselves under a false name.

I think Facebook is making a capital mistake by making us spend ever more time using it. With every piece of information about ourselves exposed on Facebook, the chances increase that our social graphs will collapse due to unwanted exposure of our personal details. The situation very much reminds me of the year 2000, when portals such as Yahoo and AOL were competing for users' eyeball time. As history testifies, the whole portal edifice collapsed with the arrival of Google, which made a fortune by making users leave its web site as fast as possible.

I think Facebook should follow Google's recipe of getting out of the user's way and transform itself into a simple Facebook button on every web site, thus becoming the dominant authentication method on the Internet and collecting hundreds of billions of dollars in transaction fees for (virtual) goods along the way.

I have a premonition that Facebook is already too big to be capable of simplifying its business model. Even though Facebook feels like the king of the world at the moment, it might very well end up the same way Yahoo and AOL did. I think now is the perfect time for the emergence of a company that does social authentication that does not suck, just as Google provided us with a web search that doesn't suck a decade ago. Just so I can make fun of myself ten years from now, I'll make a prediction that Foursquare will be the David who trounces the Facebook Goliath.

---
As a side note, Twitter is no alternative to Facebook as an authentication method, since Twitter's "follower" model does not provide a chain of trust. Well, unless Twitter solves the scalability problem of the verified-account approach.




Wednesday, March 10, 2010

Boosting more recent content in a custom Apache Solr request handler

[Image: Solr, via Wikipedia]

Users prefer recent information. Apache Solr 1.4 has excellent support for boosting more recent content. It turns out that adding the same functionality to a custom request handler requires some digging into Solr internals. Though digging into Solr is really cool, you might appreciate a ready-made solution. So here is the code that implements boosting of more recent content for a given query.

// Imports from Lucene/Solr 1.4; LongConstValueSource and TrieDateFieldSource
// are not public and have to be copied into your own project (see the notes below).
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Query;
import org.apache.solr.search.function.BoostedQuery;
import org.apache.solr.search.function.DocValues;
import org.apache.solr.search.function.DualFloatFunction;
import org.apache.solr.search.function.ReciprocalFloatFunction;
import org.apache.solr.search.function.ValueSource;

/* The present time in milliseconds; final so that it can be referenced
 * from the anonymous class below. */
final long now = System.currentTimeMillis();

/* Constants of the reciprocal function f(x) = a/(m*x + b).
 * The values below are only an example; tune them to your data. */
final float m = 3.16e-11f, a = 1.0f, b = 1.0f;

ValueSource document_date =
    new TrieDateFieldSource("document_date_field",
                            FieldCache.NUMERIC_UTILS_LONG_PARSER);

/* ValueSource that calculates the number of milliseconds
 * between the document_date (e.g. blog publication date)
 * and the present time, i.e. now. */
ValueSource vs = new DualFloatFunction(
        new LongConstValueSource(now), document_date) {

    private static final long serialVersionUID = 1L;

    protected String name() { return "ms"; }

    protected float func(int doc, DocValues aVals, DocValues bVals) {
        return now - bVals.longVal(doc);
    }
};

/* ReciprocalFloatFunction implements a reciprocal function
 * f(x) = a/(m*x + b), based on the float value of a field or
 * function as exported by ValueSource vs.
 * Values m, a, and b are float constants. */
ValueSource recip = new ReciprocalFloatFunction(vs, m, a, b);

/* Boost the original user query (as built by your request handler)
 * with the reciprocal function recip. */
Query boostedQuery = new BoostedQuery(query, recip);

Notes:
  • "date" field type should be of the class "solr.TrieDateField" (introduced in Solr 1.4) in order for the above recipe to work.
  • Unfortunately, org.apache.solr.search.LongConstValueSource and org.apache.solr.schema.TrieDateFieldSource classes are not public. I've expressed my wish to the Solr team to make this two classes public. Until they do, you should copy these two classes from Solr source code to your project.
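For comparison, if you are using the standard dismax handler rather than a custom one, roughly the same recency boost can, to my knowledge, be expressed without any Java code with a boost function such as bf=recip(ms(NOW,document_date_field),3.16e-11,1,1), where 3.16e-11 is approximately one over the number of milliseconds in a year; the recipe above is only needed inside custom request handlers.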

