Monday, November 24, 2008

Map - Reduce - Already happening...

Following up on the previous post [http://architectguy.blogspot.com/2008/11/map-reduce-relevance-to-analytics.html], and after some more quality time with Google, it's clear that there is more activity around extending map-reduce to large-scale analytics than I originally thought or knew about.

Joe Hellerstein has published an interesting post on this exact subject in O'Reilly Radar [http://radar.oreilly.com/2008/11/the-commoditization-of-massive.html]. It is well worth reading.

I think that beyond "massive data analysis" we will see the application of map-reduce to "massive event correlation". Just as the amount of data available for (and crying out for) analytics processing is staggering and keeps growing at a staggering pace, so are the number and complexity of the events that need to be processed to find the patterns and correlations that make them relevant to business applications.
RFID, all sorts of devices connected to the cloud, exchanging information / pushing events related in many ways, all sorts of formal or informal transactions, connections and disconnections, ...

Joe points out that we will get to some convergence between traditional data management environments (your SQL) and the map-reduce approaches. He even points to offerings put forward by innovative players - so what I thought was bound to happen in my previous post is happening even faster.

To paraphrase what I wrote earlier,
- event pattern identification and correlation are ripe for this type of approach, and will come to require it
- new algorithms will emerge that leverage these approaches to provide results we do not think possible today

This is one of those typical cycles: the abundance of data and events pushes the creation of new techniques, which in turn enable new applications, which in turn produce more data and events.

Very interesting times.

Sunday, November 23, 2008

Map - Reduce - relevance to analytics?

One of the key problems we face when dealing with vast numbers of events and/or vast amounts of data is how to efficiently parallelize the processing so as to accelerate the analytic development cycles. This is not just a question of efficiency (minimizing the downtime of high-cost PhD resources, ...) but also a question of feasibility (making the development of the analytics fit in business-driven timelines, such as the time to detect and react to new fraud mechanisms).

My experience, coming from the real-time embedded systems days, is that the best ways to achieve massive parallelization result from highly simplified low-level programming models. These models are designed to perform only very simple operations, but with nothing in them that prevents systematic parallelization.

Google has made popular (at least in the technosphere) the map-reduce approach derived from the functional programming world [http://en.wikipedia.org/wiki/MapReduce]. Google officially applies the approach to its core search capabilities, but it very likely uses it for much more than that.
Its formal definition is trivial, but map-reduce does provide a solid basis for the massive parallelization of list processing. Which is very much what a lot - a lot - of analytics development work is. It may very well end up being a key component allowing complex analytics processing on amounts of data well beyond what we tackle today, enabling applications that are currently out of reach.
Today, map-reduce is essentially used for search (text processing), network processing (as in social networks), etc. The Hadoop open source effort [http://hadoop.apache.org/core/] maintains a list of current applications of its map-reduce implementation [http://wiki.apache.org/hadoop/PoweredBy]: it makes for very interesting reading.
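To make the shape of the model concrete, here is a minimal sketch in plain Java streams - emphatically not Hadoop's actual API, just an illustration of why map-reduce parallelizes so naturally: neither the map step nor the reduce step carries shared state, so the runtime is free to split the work however it wants.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A toy word count showing the map-reduce shape: "map" emits (key, value)
// records, a shuffle groups them by key, and "reduce" folds each group.
public class MapReduceSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "to do is to be");

        Map<String, Long> counts = lines.parallelStream()        // map phase, in parallel
                .flatMap(line -> Arrays.stream(line.split(" "))) // emit one record per word
                .collect(Collectors.groupingBy(                  // shuffle: group by key
                        word -> word,
                        Collectors.counting()));                 // reduce: fold each group

        System.out.println(counts); // e.g. {be=3, do=1, is=1, not=1, or=1, to=4}
    }
}
```

The same shape scales from one machine to a cluster precisely because the stages share nothing.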

Its applicability to machine learning / predictive analytics building is illustrated by the Mahout effort [http://lucene.apache.org/mahout/], which seeks to leverage Hadoop for scalable implementations of traditional machine learning algorithms [http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf]. I see immediate applications of this approach to both CEP (highly parallel processing of events to achieve fast correlation) and predictive analytics development (highly parallel processing of data to find patterns - neural-net like; highly parallel implementations of genetic approaches, etc.).
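The key trick in that Stanford paper is rewriting learning algorithms in "summation form": when all an algorithm needs are sums over the data, each shard computes its partial sums independently and the reduce phase merges them. A minimal sketch of the idea (plain Java, names illustrative), computing a global mean across shards:

```java
import java.util.Arrays;
import java.util.List;

// If an algorithm only needs sums over the data (here, sum and count for a
// mean), each mapper computes a partial sum over its shard and the reduce
// phase merges the partials.
public class SummationForm {
    record Partial(double sum, long count) {
        Partial merge(Partial other) {           // associative and commutative,
            return new Partial(sum + other.sum,  // so shards can be reduced in
                               count + other.count); // any order
        }
    }

    public static void main(String[] args) {
        List<double[]> shards = List.of(
                new double[]{1.0, 2.0, 3.0},     // shard on machine A
                new double[]{4.0, 5.0});         // shard on machine B

        Partial total = shards.parallelStream()
                .map(shard -> new Partial(Arrays.stream(shard).sum(), shard.length))
                .reduce(new Partial(0, 0), Partial::merge);

        System.out.println(total.sum() / total.count()); // 3.0, the global mean
    }
}
```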

I would be curious to know what the reader thinks about this.

I also believe a few developments are bound to happen, and I expect to see more happening around them in the coming few years:
- development of hybrid systems combining more traditional processing (SQL, SAS datasets, ...) with map-reduce, potentially introduced by the established vendors themselves but more likely by smaller innovative players first
- development of new algorithms in event processing / data processing / machine learning that leverage map-reduce
- introduction of other approaches similar in spirit to map-reduce, with corresponding results.

I am particularly intrigued by the last two possibilities.

Friday, November 21, 2008

Something different - Lapointe and Perec

Not being able to sleep much has its small rewards... such as having time to spend on seemingly futile brain teasers, mind games and other "exercises" that, maybe because of the general exhaustion, I find quite amusing.

I was listening to a few old recordings by Boby Lapointe, a 50s-60s French singer and master of clever alliterations and plays on words. For some reason the lyrics of "Ta Katie t'a quitté" reminded me that I had read somewhere that he had introduced a "bibi-binary" system that would make manipulating large numbers a pleasure.
His system was basically a hexadecimal system (base 16 = base 2 ^ 2 ^ 2 ... no wonder Lapointe would focus on base 16 given his love for alliterations...), for which he defined a complete symbol set and corresponding pronunciation.
With this system, something like "119" + "137" = "256" would become "bibi" + "koka" = "hahoho" (thanks Google).
Much better...
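For the curious, here is a tiny converter - assuming I remember the digit table correctly: consonants H, B, K, D encode the high two bits of each hexadecimal digit, and vowels O, A, E, I the low two bits (so 0x7 is "BI" and 0x8 is "KO").

```java
// A little converter for Lapointe's bibi-binary notation, assuming the
// usual digit table described above.
public class Bibi {
    private static final String[] DIGITS = {
        "HO", "HA", "HE", "HI", "BO", "BA", "BE", "BI",
        "KO", "KA", "KE", "KI", "DO", "DA", "DE", "DI"};

    static String toBibi(int n) {
        StringBuilder out = new StringBuilder();
        for (char hexDigit : Integer.toHexString(n).toCharArray()) {
            out.append(DIGITS[Character.digit(hexDigit, 16)]);
        }
        return out.toString().toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(toBibi(119) + " + " + toBibi(137) + " = " + toBibi(119 + 137));
        // prints: bibi + koka = hahoho
    }
}
```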

Following hyperlinks while reading that stuff, I stumbled upon references to other language games. In particular the famous (in the French-speaking world) book written by Perec (La Disparition) that not only fully avoids using the letter "e" in a 300+ page work, but also is largely self-referential, discussing as part of its plot the disappearance of said letter [http://www.amazon.fr/Disparition-Georges-Perec/dp/207071523X/ref=sr_1_3?ie=UTF8&s=books&qid=1227261348&sr=8-3].
This is a brilliant technical achievement: in French, around 85% of the words you would ever write have the letter "e" in them (with or without accentuation).
Furthermore, the result is a book you would actually read: an interesting plot, constructed characters, etc. I read it many years ago, after a teacher gave me a copy upon challenging me to write a formal essay under the same constraint (I could not get beyond three pages in one hour, and barely articulated a very limited argument).
Georges Perec loved these formal challenges, very similar in nature to what Boby Lapointe did in songs.

The same teacher also introduced me to French poetry of the Baroque period - a lot of the same alliterations and plays on words. I still have in my brain somewhere the first verses of Pierre de Marbeuf's "La mer et l'amour ont l'amer pour partage".

In any case, I love this stuff.

I would love to find the equivalent kind of work in English. Recommendations?

Thursday, November 20, 2008

More on CEP

Obviously, the CEP / BRMS / BPM boundaries debate remains hot and continues to provoke soul searching.

My previous post on this was a long (and convoluted in the way it was phrased) rant about the fact that CEP attempts to cover too many areas of the EDM picture, while at the same time not doing a deep enough job of solving the challenges around its core contribution - events as first-class citizens in EDM apps. I do think the points I tried to make there remain valid.

I did also contribute a post in another blog as a reaction to a thread on the same subject.
Here is the essence of what I wrote there:

In this debate, we are essentially dealing with three core notions: events, decisions, processes. There is of course much more than that, but I think the confusion we are seeing results from the lack of solid boundaries between the responsibilities around these three notions. Getting back to basics helps.

- Events are all about detecting what decision needs to be taken in a complex temporal context (essentially event correlation, converting monitored ‘ambient’ events into relevant business events that require reaction)
- Decisions (and I will use that term instead of rules) are all about deciding what needs to be done as a reaction to those business events
- Processes are all about executing the decision taken

In a very simplistic analogy,
- Dealing with events in an enterprise application is akin to the sensory system absorbing information from all the sensors and sources it is connected to, constructing a view for further decisioning and action (with or without immediate reaction), and communicating it through the nervous system.
- Dealing with decisions in an enterprise application is akin to the multi-step, highly collaborative decision making the brain engages in - event and situational analysis, inclusion of further data, inferences, deductions, etc. - leading to conclusions on further reactions or proactive actions.
- Dealing with processes in an enterprise application is akin to the body executing the plan elaborated by the brain, including the input from the nervous system.

CEP should primarily address the first, BRMS and other decision management technologies the second, and business process management the third.
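To make the proposed boundaries concrete, here is a hypothetical sketch in Java - all the names are mine, not any vendor's API - of how the three responsibilities could be kept separate:

```java
import java.util.List;

// CEP's job: correlate low-level "ambient" events into business events.
interface EventCorrelator {
    List<BusinessEvent> correlate(AmbientEvent event);
}

// The decision service's job: decide what to do about a business event.
interface DecisionService {
    Decision decide(BusinessEvent event);
}

// The process engine's job: execute the decision that was taken.
interface ProcessEngine {
    void execute(Decision decision);
}

record AmbientEvent(String source, long timestamp) {}
record BusinessEvent(String type, AmbientEvent cause) {}
record Decision(String action) {}
```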

What I just outlined is centered on operational aspects - as in the operational / improvement distinction presented by Carole-Ann in her talks about the EDM vision at ORF and BRF [http://www.edmblog.com].

The improvement aspects cover how to leverage the information gathered by the operational system - as well as expertise - in order to improve the decisions. It starts by gathering the information and understanding it. All aspects above are to be covered: Am I capturing the right events? What if I decided to capture something slightly different: does that make my business events more relevant? Why did I capture this event at that point in time? What other events are captured within a given contextual distance from this one? Ditto for the decisions (the rules). Ditto for the execution (the processes). Why not just call these things "Event Analytics", "Decision Analytics", "Process Analytics"? Yes, they are connected. But different.

This is simplistic of course, but it has the virtue of being clear. Maybe we can try to reframe the discussions on these issues in that context?

Tuesday, November 11, 2008

State, events, time - a view on the confusion around CEP

CEP

Over the recent few months, the EDM world I work in has seen a lot of noise generated by the arrival of CEP - "Complex Event Processing" - and the impact it has had in terms of provoking soul searching in the BRMS world and, to a lesser extent, the EDA, ESP and other E(x)(x) worlds.
The "E" in these E(x)(x) acronyms is "events", which are of course at the core of CEP.
But so are "complex" and "processing", both of which lead more to the area of EDM or BRMS.

This is one of these situations in which a technology addressing at its core a valid set of concerns gets dropped in the middle of a complex soup of acronyms, and confusion ensues.

Of course, the CEP specialists should bear with me for the duration of this post. I am aware that CEP has been around for a long time, etc., but it's also clear that it is undergoing a renewal through its adoption by the big platform vendors (the IBM and Oracle acquisitions) and the innovative ones (Tibco, JBoss).

Why am I writing this post? Essentially because I believe the confusion is self-inflicted: as an industry (enterprise applications), we have not been careful to focus the usage of the terminology on its key area - events - and we've spent too much energy trying to justify the "complex" and "processing" aspects.

While some of the points below touch on semantic confusion (big words), I will try to remain pragmatic. Let's take "events" and "complex processing" in turn.

"Events"

It is true that "events" and their semantics are quite important in systems that need to make decisions.

This is not new; it has always been the case. So why are we - the enterprise software world - only recently starting to give "events" the preeminence they already have in other worlds - such as the real-time systems world I started in?

Well, simply because the enterprise world has a way of coping with much of the value of events by translating events into state.
Traditionally, the occurrence of events - and in sophisticated systems, even their sequencing and timing - is encoded in the state of the system, transforming the "event" management problem into the dual "state" management problem.
Take an example: a fraud detection system is of course keenly interested in knowing what happened, where, when, and in which order. It's no surprise that most CEP vendors use fraud detection as a key example. Well, guess what? Fair Isaac's Falcon Fraud Manager, by far the most used credit card transaction fraud detection system, does not use CEP as defined by the current vendors - and neither do a host of other fraud detection systems. They translate events into profile (read "state") information, and they enrich the profiles at run time using sophisticated business rules (not even production rule systems). The profile encodes variable values targeted precisely at capturing the essence of the business semantics of the events: "number of purchases of gas using the same credit card, each charging less than a certain amount, within a window of time and a geographic region".
You could say that what they have is a sophisticated, hugely scalable (90%+ of all credit card transactions in the US go through Falcon) stateful decision management system, supported by a powerful "cache" of variables computed through complex business rules (a toy sketch of this pattern follows below).
And there are many cases like that.
The reality is that the "enterprise" world has been dealing with events forever. It has simply not needed to resort to any E(x)(x) notion / stack / etc.
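Here is the toy sketch of the event-into-state pattern promised above - emphatically not Falcon's actual implementation, with made-up thresholds and the geographic dimension omitted:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Each incoming transaction event updates a per-card profile variable and is
// then discarded: the event problem becomes a state problem.
public class CardProfile {
    private static final long WINDOW_MS = 60 * 60 * 1000; // 1 hour, illustrative
    private static final double SMALL_AMOUNT = 10.0;      // threshold, illustrative

    private final Deque<Long> recentSmallGasPurchases = new ArrayDeque<>();

    // Called once per transaction event for this card.
    public int update(String category, double amount, long timestampMs) {
        if ("GAS".equals(category) && amount < SMALL_AMOUNT) {
            recentSmallGasPurchases.addLast(timestampMs);
        }
        // Expire entries that fell out of the time window.
        while (!recentSmallGasPurchases.isEmpty()
                && recentSmallGasPurchases.peekFirst() < timestampMs - WINDOW_MS) {
            recentSmallGasPurchases.removeFirst();
        }
        return recentSmallGasPurchases.size(); // the profile variable the rules read
    }
}
```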

That being said, I believe there is an important piece in the event-centric expression of logic - and that is the separation of the event processing logic from the rest, and the resulting clarity, elegance, maintainability and, ultimately, all those qualities - scalability, auditability, robustness, ... - that result from clean concepts, clean designs, clean architectures.

Which brings me to the following first opinion on CEP:
(1) CEP should stop worrying about the origin of the events and focus on the events themselves. It does not matter how the events originate, and the issues of ESB, EDA, ESP, in-database generation, etc. are all orthogonal to / independent of how event-dependent logic is managed.
And to the second opinion on CEP:
(2) CEP should stop worrying about caching. Yes, caching is important, but it is irrelevant to the power of the approach - witness the fact that among the largest and most scalable event-driven enterprise apps, many handle the issue with no need to couple the event management piece to the management of caches. Right now, there are efforts to extend the Rete structures and to adapt the algorithm to build this cache in a more efficient way for the corresponding type of rules processing. A great usage of the technology, but that will not make the potential power of event-centric approaches any more compelling.

Maybe it's time to be a little more constructive.

(3) CEP would do wonders in enterprise systems if it focused all its attention on the "event" part: the semantics of events, the richness of the event context (time, location, referential, ...), the clean semantics of the key contextual notions (time operations including referential management, etc.), etc.

The event semantics question is not innocent, and it is absolutely not a settled question - witness the numerous exchanges on this subject between people much cleverer than me.

I had a discussion with Paul Haley once on this subject and we went into the "what is an event" question that has the virtue of quickly getting people upset. It's a valid question: the definition of "event = observed state transition" has the bad taste of defining events in terms of state, but its key issue in my eyes is that it supposes observation of a state and lacks content.
The value of events is that they carry intrinsic context that is not naturally captured in state-based systems: they occur at a point in time with respect to a given referential - or, more generally, they occur at a contextual point (which could be time + location, etc.) with respect to a given referential. Different events used within the same system may have their intrinsic context expressed with respect to different referentials - and that will be the default case in any distributed system.
Events occur; they are atomic, immutable. Their only existence is their occurrence. We may track the fact that they occurred, but an event instance only happens once and is instantaneous. An event does not have a duration. That would not be logical to me - its effects, or the state change it triggers, may last, but the event in itself is instantaneous.
Which leads to the fact that there are natural event correlations you want to express (not just discover): an event creates a state transition, and a correlated event will create another state transition that brings the state back to the original one (see the sketch below).
This is just my opinion - but if you talk to more than one real specialist, you will get more than one view. Not a good sign for the maturity of the concepts.
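For what it's worth, here is the kind of minimal event model I have in mind - immutable, instantaneous, carrying an explicit referential; all names are illustrative:

```java
import java.time.Instant;

// An event carrying its intrinsic context: immutable, instantaneous, and
// timestamped with respect to an explicit referential.
record Event(String type,
             Instant occurredAt,   // a point, not a duration
             String referential,   // the clock / frame the timestamp is relative to
             String payload) {

    // A lasting condition is modeled as state bounded by two instantaneous
    // events - e.g. a "door open" interval between DOOR_OPENED and DOOR_CLOSED.
    static boolean closesInterval(Event opening, Event candidate) {
        return opening.type().equals("DOOR_OPENED")
                && candidate.type().equals("DOOR_CLOSED")
                && candidate.occurredAt().isAfter(opening.occurredAt())
                && candidate.referential().equals(opening.referential());
    }
}
```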

Clean ontologies / semantics / etc. needed.

Once the semantics of events and their intrinsic referential-dependent context are clarified, we need to focus on what we want to express about events - and for that, we need to bring in the enterprise application experts.
There is a lot to learn from the real-time systems experts - refer to the very old but incredibly good insights from Parnas' work. There is a lot to take from the original event correlation systems - many built with tools such as Ilog's rules engine. These could even be said to be the purest predecessors of what CEP attempts to do.

What this will end up doing is giving us - the decision management world - a very powerful tool to "naturally" express logic on events, with their referential-dependent context, and to do so in a way that enables true management (including verification of the logic), powerful optimizations, etc.


I honestly do not think we are there yet, and I would really like to see the standardization world - the OMGs and others - help us get there; but I do think we need the enterprise business app drivers. We had that in real-time systems: the military and transportation apps.

I would love the specialists to prove me wrong and to show me we are there.


"Complex processing"

This will be shorter.

As stated above, I believe that:
- CEP should narrow its processing ambitions. One approach is to focus its processing on clear outcomes - such as what the correlation engines did 20 years ago. For example, we could say that CEP's processing is about processing / correlating events to generate higher-order events: transaction events generating a potentially-fraudulent-transaction event. I will call these "ambient events" and "business events": the goal of CEP processing is to translate ambient events into business events
- CEP should focus the complexity of its processing on those revised ambitions.
- CEP should leave all issues related to event streaming, event transport, event communication, etc. to other layers.

I may be a purist, but I see a simple picture:
- leave anything related to transport, communication to other layers
- use this revised CEP to express and execute event-relevant logic, the purpose of which is to translate the ambient events into relevant business events
- have these business events trigger business processes (however lightweight you want to make them)
- have these business processes invoke decision services implemented through decision management to decide what they should be doing at every step
- have the business processes invoke action services to execute the actions decided by the decision services
- all the while generating business events or ambient events
- etc.

As such, CEP will include a semantically sound event-with-intrinsic-referential-dependent-context model, a corresponding language (EPL or vocabulary) to express the logic, algorithms to execute it efficiently (a wide-open field - tons of people doing analytics, Bayesian approaches, rules, ...), techniques to verify it (wide open - and fairly empty: I only know of the real-time folks), etc.

And there, the value of CEP is clear. Of course, it is lower than what CEP vendors would like, but significant anyway.

I am hoping this is controversial enough to bring on flames...

Thursday, November 6, 2008

Enterprise Decision Management? How about State Decision Management?

This is interesting: http://www.nytimes.com/2008/03/28/world/americas/28cybersyn.html?_r=1&oref=slogin&ref=americas&pagewanted=print and http://www.guardian.co.uk/technology/2003/sep/08/sciencenews.chile

Very interesting for me, since I work in decision management and Chile is where I come from. I had seen mentions of this in the past (hence my sudden recollection of it during this sleepless night) in Andreessen's blog. Wikipedia has a summary (http://en.wikipedia.org/wiki/Project_Cybersyn) that mentions the usage of Bayesian filtering, although it does not insist on the learning and adaptive aspects.

It also refers to this: http://www.williambowles.info/sa/FanfareforEffectiveFreedom.pdf - which I really do not know what to make of. It brings me back to my early days in engineering school, when I was studying what we used to call "Automatique", which was all about adaptive control. But applied to social and economic matters at the scale of a country!?

On the other hand, this lecture on cybernetics by Beer is intriguing: http://lispmeister.com/downloads/Stafford-Beer-forty-years-of-cybernetics-lecture.wma (pretty big)

So it turns out that Chile does not just do wine and pisco (sorry Peru).

Wednesday, November 5, 2008

The new programming languages discussions

I am spending time getting a hold of the new approaches to programming languages.

Many years ago, I discovered Bentley's Programming Pearls (http://www.amazon.com/Programming-Pearls-2nd-ACM-Press/dp/0201657880/ref=sr_11_1/176-0790589-8960217?ie=UTF8&qid=1225956297&sr=11-1). I have true love for well-defined "little languages" that have a clear purpose and deliver it with elegance and concision.

I was lucky to work on a language for a business rules engine, and I remember with fondness the long discussions around language design and pretty important things that are now obvious but back then, in my youth and the earlier days of OO, seemed subjects for debate: value versus object, object identity versus object state, etc. How the Java JDK got it wrong in a number of classes in java.util.
And the sempiternal debate around strong versus other typing systems.

Since then I discovered (through Paul Ford's excellent ftrain.com site) Processing (http://ftrain.com/ProcessingProcessing.html), a very nice language. And then many others.
However, my work diverted my energy into less fun stuff: repositories, server-side infrastructure, etc. And management (see other posts...).

But now I am back to more fundamental thinking about all this, and it's starting to make sense again. DSLs (graphical, syntactic) constructed using tooling provided by dynamic languages, leveraging DLR support, resting on language-neutral VMs for which the low-level core code is written in strongly typed Java-like languages. "Mais bon sang, mais c'est bien sûr" ("but of course, it's obvious").

The low-level core code needs scalability, performance, and compile-time checking for mission-critical code. Java gives you that - and yes, Java does give you that; HotSpot compilation is doing wonders these days.
The dynamic, sometimes reflective, languages in the middle (and I will call dynamic even those that do 'duck typing' - 'if it looks like a duck, sounds like a duck, and smells like a duck, then it's most likely a duck') give you productivity - and Python and Ruby have proven they can deliver and be safe. The 'Da Vinci' VM, even more so than the DLR effort on .NET, promises to combine dynamic language interoperability (making it easier to free DSLs from the tyranny of the underlying supporting programming language) and performance (enabling hotspot compilation) (read John Rose's very instructive blog: http://blogs.sun.com/jrose/).
These dynamic and reflective languages enable the easy creation of maintainable DSLs.

This is powerful.
Makes me rethink a lot of things.

Then there is the question of 'model-driven' versus 'DSL-driven'. For me, at a high level, DSLs are there to express the models; and, in the other direction, models define the boundaries for the DSLs. The key is that DSLs should not express more than the corresponding models (lest they become just more horizontal programming languages), but they should express them fully, with all the required flexibility, with elegance and in a concise fashion.
Back to the beginning of this post and Bentley's pearls.
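As a small illustration of "express the model fully, but no more", here is a toy internal DSL in Java - all names made up - where the fluent interface only lets you say what the underlying rule model allows:

```java
// The fluent chain can only build a discount rule; there is no way to wander
// off into general-purpose code, because each step returns the next allowed
// step and nothing else.
public class DiscountDsl {
    public static void main(String[] args) {
        Rule rule = Rule.when("customer.age").isAbove(65)
                        .then("discount").is(0.10);
        System.out.println(rule); // when customer.age > 65.0 then discount = 0.1
    }

    record Rule(String attribute, double threshold, String target, double value) {
        static Condition when(String attribute) { return new Condition(attribute); }
        @Override public String toString() {
            return "when " + attribute + " > " + threshold + " then " + target + " = " + value;
        }
    }
    record Condition(String attribute) {
        Consequence isAbove(double threshold) { return new Consequence(attribute, threshold); }
    }
    record Consequence(String attribute, double threshold) {
        Assignment then(String target) { return new Assignment(attribute, threshold, target); }
    }
    record Assignment(String attribute, double threshold, String target) {
        Rule is(double value) { return new Rule(attribute, threshold, target, value); }
    }
}
```

The model (a rule with one condition and one assignment) defines exactly what the little language can say.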

But this is a big exciting subject. It will keep me awake at night. Complicated, but in a good way ;)

Tuesday, November 4, 2008

Back...

After a long absence from the blog, I am back to it.

I have spent a couple of months heads down focusing on management issues, most of them consequences of the way my current employer has chosen to face the current crisis. Typical 'protect the EPS' reaction.

I posted earlier on the counter-cyclists, and the reasons why it makes a lot of sense to manage good times better so that we can manage bad times in a strategic way and benefit from the exit from downturns rather than play catch-up. It looks simple on paper - I am certain it's more difficult than that - but I have to say my growing experience with upper management does not give me the feeling this requires anything other than thinking strategically rather than tactically. Where this fails is when upper management is mercenary: on board for 2 to 5 years and not interested in long-term growth.

This is important for us software architects - not just because of the obvious impact on our continued employment, but also because architecture is strategic, and it gets sacrificed on the altar of short-term tactical moves. This is obvious to say, again, and obvious to understand, but it still amazes me how selective corporate perception is, and how little care is given to preserving corporate memory.

I will - hopefully - stop blogging about this kind of thing for a while, and go back to more technical stuff - however difficult that is to do in isolation from the overall situation...