So, you want to process data that flows and you want to show information based on this data in real-time, right?
What are the available options you have?
First, let’s cover the components that you will need:
Figure 1. Flow of Data
As seen in figure 1, the Data Streaming & Processing is only one part in the architecture.
In this post I will focus only on this part, but will suggest what are the available tools that may fill in for the rest of the components:
Data Source: This is any source of data that may exist. Known tools to send the data into the Processing component are Apache Kafka or Amazon Kinesis.
Analytical Database: Any database that applies you need for aggregating the information for the UI. Some suggestions may be: Cassandra, HBase, Elasticsearch, Couchbase, etc.
Front End: This will generally be a web server that serves a web application for your use.
Data Streaming and Processing engine
Now we are at the heart of this post.
What are the available Data Streaming and Processing engines that exist and what is the difference between them?
Those are the expected abilities of the Stream processor:
- Distributed system with ability to scale linearly.
- Fault Tolerance and Durability (Consistency)
- Resource Management and rebalance resources
- Mature enough
Note: For item #1 and #2, according to the CAP theorem, it may be difficult to gain both of them. Therefore, a system that will allow to choose between the two based on the expected SLA will be a good catch!
We can find today two different approaches that can be adapted
- Real-time streaming infrastructures. Data streams between processing units via network (Apache Storm, Apache Flink)
- Micro-batch infrastructure data is persisted between the hops (Spark Streaming)
There is also a third approach that combines some features from each approach. Apache Samza streams data in real-time, but keeps the intermediates between the hops persisted.
In this post, I will cover the following tools:
Storm was the first real-time event processing framework that was release ans that addressed most items in our requirements list. It was developed by Nathan Marz and BackType team and then acquired and open sourced in Twitter.
It is also the most adopted system as can be seen here.
Let’s check if Storm addresses all our needs:
Distributed System: Yes. Storm is certainly a distributed system that can scale. I will not go into the architecture of Storm here. I will only mention that Storm is built on small runnable components that are called Spouts and Bolts. Tuples are passed between the Spouts and Bolts via ZeroMQ/Netty. An application can grow in scale by setting more bolts or by setting more parallelism. For more information please refer to Storm home page.
Fault Tolerance Vs. Fast streaming: Here, there was a lot of work done to allow using a more durable application or a faster application, with a penalty of latency.
So on one side, Storm provided a very good configurable way to address your SLA.
On the other side, when working with Storm you may reach points of not durable enough and trying to be more durable will have a latency cost.
In General, it needs a lot of tuning to adjust the best configuration for your application.
Resource Management: Here again, Storm does not make it easy for you. As real-time data may have daily and seasonal peaks, you may need to adjust your resources to be sure you do not lag behind.
There is however a good initiative of Mesos for Storm that make your like much easier in this regard!
Spark was initiated by Matei Zaharia at UC Berkeley. Originally this project meant to address some issues in Hadoop like latency and to improve machine learning capabilities. Later on it was taken by Databricks and added to the Apache project.
As Spark idea grew with Hadoop in mind, the batch processing was the backbone of it. The core of Spark is Resilient Distributed Dataset (RDD), which is is a logical collection of data partitioned across machines.
Spark streaming was added after the project started to grow significantly.
This went very well with the Lambda Architecture. A concept that was spoken a lot but was not implemented very much. With Spark, you can gain both offline batch processing and real-time streaming in one platform.
So with spark, that is based on batch processing, streaming is in fact still batch processing and not real time streaming. The idea is that if you shorten your batches very much until they become micro-batches, you get a latency close to what real-time streaming suggest.
As latency may be a problem, the durability of batch is superior over real-time processing. Fault tolerance is easier to maintain and resource utilization is more efficient.
Samza was developed in LinkedIn and was aimed to address the pitfalls of Storm.
There is some similarity between Samza and Storm, but some important differences.
In general, the architecture is similar: A distributed system that can scale horizontally; Streams and Tasks replace Spouts and Bolts.
Fault Tolerance Vs. Fast streaming: In this aspect, Samza took the Consistency approach over fast streaming. the messages that are sent are always persisted to disk. Therefore the issue of bottlenecks that may occur due to temporarily high throughput is solved. But the whole throughput may take a bit longer, and it is dependent on the number of hops in the way of the message.
A good comparison between Storm and Samza can be found here.
Samza comes with Yarn integration, so the Resource Management should be well addressed.
Choosing Streaming framework
When coming to decide which streaming framework to choose for a new project, there may be some guidelines.
First: define your expected SLA (Service Level Agreement).
A distributed system is not easy to manage and to monitor, and there are trade-offs that you need to decide about
If you need a very low latency system that can get you responses in milliseconds or few seconds, then storm is preferred. There is also a good pro for Storm, that it lets you manage your data flow in the system in a visual topology map.
Choose Spark Streaming if you can compromise the low latency to few seconds or even minutes, but to know that you can sustain high throughput and unexpected spikes. Another benefit is that you can use the same infrastructure for both batch processing and streaming.
We saw that there are several good available Cluster base, Open Source, Streaming frameworks today to choose from.
Each one took a different technology path but they all address the same goal: providing an infrastructure that is fault tolerant and can run on clusters, on which you can deploy small processing units that process streaming data.
Storm is quite mature and proved very good performance results.
On the other hand, Spark Streaming has an emerging community and lets you develop durable data frameworks
Samza took a kind of chimeric approach between the two and did not gain lot of popularity yet.