Sunday, November 23, 2008

Map-Reduce: relevance to analytics?

One of the key problems we face when dealing with vast numbers of events and/or vast amounts of data is how to efficiently parallelize the processing so as to accelerate analytic development cycles. This is not just a question of efficiency (minimizing the downtime of high-cost PhD resources, ...) but also a question of feasibility (making the development of the analytics fit in business-driven timelines, such as the time to detect and react to new fraud mechanisms).

My experience, coming from the real-time embedded systems days, is that the best ways to achieve massive parallelization result from highly simplified low-level programming models. These models are designed to perform only very simple operations, but with nothing in them that prevents systematic parallelization.

Google has popularized (at least in the technosphere) the map-reduce approach derived from the functional programming world [http://en.wikipedia.org/wiki/MapReduce]. Google officially applies the approach to its core search capabilities, but it very likely uses it for much more than that.
Its formal definition is trivial, but map-reduce does provide a solid basis for massive parallelization of list processing, which is very much what a great deal of analytics development work amounts to (see the sketch below). It may very well end up being a key component that allows complex analytic processing of amounts of data well beyond what we tackle today, enabling applications that are currently out of reach.
Today, map-reduce is essentially used for search (processing text), network processing (as in social networks), etc. The Hadoop open source effort [http://hadoop.apache.org/core/] maintains a list of current applications of its map-reduce implementation [http://wiki.apache.org/hadoop/PoweredBy]: it makes for very interesting reading.
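To make the idea concrete, here is a toy, single-process sketch of the map-reduce programming model in Python (no framework assumed; the event log and function names are purely illustrative): a map function turns each record into (key, value) pairs, and a reduce function aggregates the values collected under each key. Since both phases treat records and keys independently, a framework like Hadoop can spread them across many machines.

```python
from collections import defaultdict

# Toy, single-process illustration of the map-reduce programming model.
# A real framework (e.g. Hadoop) would run the map and reduce calls on
# many machines; only the two user-supplied functions would change.
def map_reduce(records, map_fn, reduce_fn):
    # Map phase: each record independently yields (key, value) pairs,
    # so this loop can be split across any number of workers.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    # Reduce phase: each key is also processed independently.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

# Hypothetical analytics example: count events per (account, event type)
# from a flat log of (account_id, event_type) records.
def emit_event(record):
    account_id, event_type = record
    yield (account_id, event_type), 1

def count(key, values):
    return sum(values)

events = [("acct-1", "login"), ("acct-1", "wire"),
          ("acct-2", "login"), ("acct-1", "wire")]
print(map_reduce(events, emit_event, count))
# {('acct-1', 'login'): 1, ('acct-1', 'wire'): 2, ('acct-2', 'login'): 1}
```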

Its applicability to machine learning / predictive analytics building is illustrated by the Mahout effort [http://lucene.apache.org/mahout/], which seeks to leverage Hadoop for implementations of traditional machine learning algorithms [http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf]. I see immediate applications both to CEP (highly parallel processing of events to achieve fast correlation) and to predictive analytics development (highly parallel processing of data to find patterns - neural-net like; highly parallel implementations of genetic approaches, etc.).
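The key observation behind the Stanford paper referenced above is that many classic learning algorithms only need sums over the data (sufficient statistics), and sums are exactly what map-reduce parallelizes well. A rough sketch of that idea for least-squares regression, on made-up data (the partitioning, names and numbers here are mine, not Mahout's):

```python
import numpy as np

# Least-squares regression only needs the sums X'X and X'y, which can be
# accumulated chunk by chunk: a natural fit for map-reduce.
def map_partition(X_part, y_part):
    # Each worker summarizes its own chunk of the data independently.
    return X_part.T @ X_part, X_part.T @ y_part

def reduce_partials(partials):
    # The reducer simply adds the per-chunk matrices together.
    xtx = sum(p[0] for p in partials)
    xty = sum(p[1] for p in partials)
    return xtx, xty

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

# Pretend the data lives on four machines.
chunks = zip(np.array_split(X, 4), np.array_split(y, 4))
partials = [map_partition(Xp, yp) for Xp, yp in chunks]
xtx, xty = reduce_partials(partials)
weights = np.linalg.solve(xtx, xty)  # same answer as fitting on all the data at once
print(weights)
```

The same pattern covers naive Bayes counts, k-means centroid updates, and the gradient sums used in neural-net training, which is why so many of these algorithms parallelize cleanly under this model.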

I would be curious to know what the reader thinks about this.

I also believe a few developments are bound to happen, and I expect to see more activity around them in the coming few years:
- development of hybrid systems combining more traditional processing (SQL, SAS datasets, ...) with map-reduce, potentially introduced by the vendors themselves but more likely by smaller, innovative players first
- development of new algorithms in event processing / data processing / machine learning that leverage map-reduce
- introduction of other approaches, similar in spirit to map-reduce, delivering the same kind of results.

I am particularly intrigued by the last two possibilities.
