ArticleNovember 21, 2019 · 8 min read time
Every healthy business is a data-driven software business these days. Innovative companies increasingly turn to real-time processing of their data as event streams to gain a competitive advantage in the digital service economy.
A pioneer in the field of data streaming is S-Group, Finland's largest retail and service sector operation:
"We're working with Nitor to establish an event streaming platform as the central nervous system of our digital business," says Jari Korpinen, Technical Architect at SOK IT.
"It's a new capability for us and we're learning as we go, so Amazon AWS and Confluent Cloud have been instrumental in enabling us to experiment without capital expenditures."
As more and more organizations seem to be experimenting with data streaming at the moment, I decided to put together a comprehensive recipe for success based on our work with S-Group. Enjoy!
From data to events
Accurate and timely data is imperative to success in the digital service economy. Data is the lifeline of a customer-centric digital business, and it comes in many flavors. There's the traditional relatively static master data in various enterprise systems of record. Behavioral signals and customer transactions from apps add timeliness as does sensor data from end-user devices or industrial sensors.
Everyone has data, but not everyone is able to react to changes in the data as they happen. Seeing changes in data as a stream of actionable events is like placing a finger on the pulse of your business processes and customer activity. This is key to gaining a competitive advantage.
Events are facts
Events represent something that happened. They are facts. Facts are actionable: They can be triggers for actions such as communicating with customers or operational business processes. Actions triggered by events change the world around us and lead to more events. This creates a loop that feeds a successful digital business.
Events contain varying amounts of data about the facts that they represent. This dual role of events is essential: They are data, but more importantly, they are dynamic facts about real world phenomena successful digital businesses strive to react to.
Facts can be related to each other and considered together as aggregate facts. Consider, for example, a car with hundreds of individual sensors where the aggregate of all of them is the overall state of the car.
Gathering factual data from different sources, making sense of it and ultimately making business decisions based on those facts in real-time is what event streaming is all about. Doing this effectively is a crucial source of business acumen for digital businesses and a new exciting field in enterprise software engineering.
Here's the event streaming loop as a diagram:
Let's have a look at the engineering part in more detail.
Curating streams of events
For event streams to be actionable, the data has to be reliable. Properties such as “at-least-once” or the mythical “exactly-once” message delivery and reliable ordering of events matter. Scalability in terms of event volume is a key feature as well.
Current practical solutions to these challenges are based on the idea of a distributed commit log for events. Various data streaming services offered by cloud providers like AWS (Kinesis), Azure (Event Hubs) and Google Cloud (Google pub/sub) and the open source Apache Kafka implement a distributed commit log. This forms the basis of an event streaming platform, the event store.
Events at rest in a commit log are not worth much unless we get to run computations on them, so we need a framework for streaming, transforming, and taking action on the events. There's a lot of activity in this space, and tools should be chosen based on the type of computations that need to be run.
Event streams come in many shapes and sizes. The main two ways to categorize them are whether they are bounded or unbounded and whether processing consists of simple stateless logic or more complex stateful computations possibly across multiple streams.
Simple stateless event streams
You could be happy with simply replicating data through a stream or triggering straightforward actions based on a stream of events of a single type. To achieve this, look for producer/consumer frameworks for your chosen streaming service. Consider serverless “Functions As A Service” platforms for deployment to minimize runtime and operational expenses.
Stateless streams are quite a stable technology comparable to processing messages off of message queues, but more scalable. To unlock real-time capabilities with aggregate facts, they might not cut it, however. You'll end up with a scattered codebase deployed as multiple FaaS functions that are hard to make sense of afterward. Latency might also be an issue depending on your needs as FaaS invocations, database queries for additional facts, and the inherent delays in many cloud streaming services can lead to delays of multiple seconds.
Stateful Streams for aggregate facts
Stateful streaming frameworks like Kafka Streams, Flink or Storm (all Open Source Apache projects) take multiple streams and build an in-memory or local disk low latency state. The facts arriving at different times can thus be joined together efficiently to form aggregate facts.
Stateful stream processing can be done at different abstraction levels. One is to write the logic out as code in a general purpose programming language. Another option is to specify queries and transformations in a SQL-like query language. Confluent KSQL, Apache Flink and Storm support this kind of usage, for example.
Stateful streams are more complex and require more effort to develop and test as they maintain internal state in multiple instances for scalability. Along with the increased effort, the benefits can also be significant here. You'll be able to move from batch based data warehouse type computations to real-time stream processing over multiple event streams. This is topped with actionable enriched aggregate events that trigger actions at the time a user is interacting with your services.
Unbounded event streams carry facts like continuously updating sensor data or signals about user behavior in an app or a website. There will always be more of these facts available which makes the data and the stream more or less infinite. Data like this is often temporal, so it's good to be able to react to it continuously and without delay when it still matters. This is why real-time stream processing is vital with this kind of data.
Unbounded streams are the most important real world triggers for event processing as they clearly represent facts that are external input to your software and services. They canserve as the glue that connects you to your customers' behavior.
Bounded event streams carry a finite set of data. There are only so many users of a service or products in a catalog, for example. The set does change, but at a slower pace compared to unbounded streams. Master data in various enterprise systems of record typically flows through streams which are considered bounded.
The relatively slow pace of change means the size of the data set is manageable for storing the entire set in a stream (e.g., a "compacted topic" in Apache Kafka). This kind of streams combined with appropriate processing technology approach the concept of a database and offer the possibility to create aggregate facts that combine master data with real world fact events.
Streams with aggregate event data joined from source streams of behavioral and master data can trigger real-time business processes which combine the state of your business with the state of the real world or behavior of your customers.
Once you have the data streams out there, how do development teams discover them and get authorized to use them? In large organizations, the stream discovery capability is an important accelerator for getting things done. An analogous capability is well established in the API scene where API Portals exist for these purposes. The main capabilities needed are developer self service subscriptions to streams supported by an authorization workflow for data governance.
I'm not aware of a similar offering for streams at this time, but a clear need for it exists. The beginnings of this can be seen in the form of schema registries, which carry information on the content and format of streams but do not at this time provide functionality for discovery or subscription.
Notably the big 3 cloud platform offerings do not seem active in this space, but the cross cloud Apache Kafka ecosystem is moving in this direction.
Towards an event streaming platform
Event streaming looks to be a good way to accelerate the feedback loop of strategic business initiatives. The components of an event streaming platform are starting to come together, but the event portal is clearly missing:
Storage service: Apache Kafka, AWS Kinesis, Azure Event Hubs, Google PubSub
Streaming frameworks: Kafka Streams, Apache Flink, Apache Storm
Discoverability: Event Portal?
Governance: Event Portal?
An event streaming platform with capabilities and technologies chosen to fulfill the specific needs of an organization can be a competitive advantage. It will shorten time to market for new stream processing use cases. To get on this train early, some in-house development effort is clearly needed. Here's an opportunity to become the next Netflix or Zalando, to contribute to open source tools in this area and build a software engineering culture inside the company. It's going to be a lifeline for future business agility needs!
Data streaming in the enterprise is taking small steps forward. It has all the potential to become a huge differentiator. Early adopters are hedging their bets on gaining business agility by bringing the data out from legacy vaults in a developer enabled but controlled way.
It's early days for large scale enterprise event data streaming. The development tools and practices for event streams are still rudimentary but improving at a rapid pace at the moment.
We'll get there, eventually!